🥇 First-place solution of the Samsung AI Challenge 2021 🥇

Overview

MoT - Molecular Transformer

Large-scale Pretraining for Molecular Property Prediction

Samsung AI Challenge for Scientific Discovery

This repository is the official implementation of the model that won first place in the Samsung AI Challenge for Scientific Discovery competition and was introduced at SAIF 2021. The results of the challenge were announced in this video.

Introduction

MoT is a transformer-based model for predicting molecular properties from the 3D structure of a molecule. It was first introduced to calculate the excitation energy gap between the S1 and T1 states from the molecular structure.

Requirements

Before running this project, install the following libraries:

  • numpy
  • pandas
  • torch==1.9.0+cu111
  • tqdm
  • wandb
  • dataclasses
  • requests
  • omegaconf
  • pytorch_lightning==1.4.8
  • rdkit-pypi
  • scikit_learn

This project supports NVIDIA Apex. When Apex is installed, it is detected automatically and used to accelerate training. Apex reduces the training time by up to 50%.

setup.sh installs the required libraries and Apex in one step. Run the script as follows:

$ bash setup.sh

About Molecular Transformer

There are many approaches to predicting molecular properties. However, calculating excitation energy gaps (e.g. between the S1 and T1 states) requires considering the entire 3D structure and the charges of the atoms in the compound, whereas many transformer-based molecular models use text formats such as SMILES or InChI. We also tried text-based methods in the competition, but the graph-based models performed better.

The important thing is to consider all connections between the atoms in the compound. However, the atoms are placed in a 3D coordinate system, and it is almost impossible to feed 3D positional information directly to the model (adding 3D positional embeddings performed worse than the baseline). So we designed a new attention method, inspired by the disentangled attention in DeBERTa.

First, the atom types and their charges are embedded into vectors and summed. Note that positional embeddings are not added to the input, because the attention layers compute attention scores from relative information. Thanks to the absence of positional embeddings, there is no limit on the number of atoms.
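
The input embedding step can be illustrated with a minimal sketch in plain Python. The table sizes, the charge range, and the names `atom_table`, `charge_table`, and `embed_atoms` are assumptions made for illustration and are not taken from the repository.

```python
import random

HIDDEN_DIM = 8  # tiny size for illustration; the example config below uses 768


def make_table(num_rows, dim, rng):
    """Create a randomly initialized embedding table (a list of float rows)."""
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(num_rows)]


rng = random.Random(0)
# Hypothetical vocabularies: row index = atomic number; charges -4..+4 map to rows 0..8.
atom_table = make_table(120, HIDDEN_DIM, rng)
charge_table = make_table(9, HIDDEN_DIM, rng)


def embed_atoms(atomic_numbers, charges):
    """Sum atom-type and charge embeddings. No positional term is added."""
    embedded = []
    for z, q in zip(atomic_numbers, charges):
        atom_vec = atom_table[z]
        charge_vec = charge_table[q + 4]
        embedded.append([a + c for a, c in zip(atom_vec, charge_vec)])
    return embedded


# Water: O, H, H with neutral charges.
hidden = embed_atoms([8, 1, 1], [0, 0, 0])
```

Because nothing depends on the position of an atom in the sequence, the two hydrogen atoms receive identical input vectors, and the same function works for any number of atoms.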

The hidden representations are then processed by the attention layers. Similar to the disentangled attention introduced in DeBERTa, our relative attention is computed not only between contents, but also between the relative information and the contents. The relative information includes the relative distances and the types of bonds between the atoms.
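
The exact score formula is not written out here, so the following is only a sketch that assumes the three-term form of DeBERTa's disentangled attention (content-to-content, content-to-relative, relative-to-content) with its 1/sqrt(3d) scaling. The function and argument names are hypothetical, and the relative information is assumed to be symmetric in (i, j), as distances and bond types are.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def relative_attention_score(q_i, k_j, rel_q_ij, rel_k_ij):
    """Unnormalized attention score for the atom pair (i, j).

    q_i, k_j : content query of atom i and content key of atom j
    rel_q_ij : query projection of the relative information for (i, j)
    rel_k_ij : key projection of the relative information for (i, j)
    """
    content_to_content = dot(q_i, k_j)
    content_to_relative = dot(q_i, rel_k_ij)
    relative_to_content = dot(rel_q_ij, k_j)
    scale = math.sqrt(3 * len(q_i))  # DeBERTa-style scaling for three summed terms
    return (content_to_content + content_to_relative + relative_to_content) / scale


def softmax(scores):
    """Numerically stable softmax over one row of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# One query attending to two keys; relative terms set to zero for simplicity.
q = [1.0, 2.0]
keys = [[3.0, 4.0], [0.5, -1.0]]
zero = [0.0, 0.0]
scores = [relative_attention_score(q, k, zero, zero) for k in keys]
weights = softmax(scores)
```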

The relative information R is built from these two components. The Euclidean distances are encoded with a sinusoidal encoding whose period is changed from 10000 to 100. For the bond-type embeddings, two details matter.

The important point is that disconnections (i.e. pairs of atoms with no bond between them) should be embedded as index 0 rather than excluded from attention. Also, [CLS] tokens get their own embeddings, separate from the normal bond-type embeddings, in the relative attention.
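
The two ingredients of the relative information can be sketched as follows. This is a simplified illustration, not the repository code: the interleaved sin/cos layout, the bond-type ids, the index chosen for [CLS], and placing [CLS] at position 0 are all assumptions. The `scale` argument presumably plays the role of `position_scale: 100.0` in the configuration shown later.

```python
import math

NO_BOND = 0  # disconnected pairs still get an embedding index
BOND_TYPES = {"single": 1, "double": 2, "triple": 3, "aromatic": 4}  # hypothetical ids
CLS_INDEX = len(BOND_TYPES) + 1  # separate index for pairs that involve [CLS]


def encode_distance(distance, dim=8, scale=100.0):
    """Sinusoidal encoding of a scalar distance (dim is assumed to be even).

    Same form as the Transformer positional encoding, with the base 10000
    replaced by `scale`.
    """
    encoding = []
    for i in range(0, dim, 2):
        freq = 1.0 / (scale ** (i / dim))
        encoding.append(math.sin(distance * freq))
        encoding.append(math.cos(distance * freq))
    return encoding


def euclidean(a, b):
    """Euclidean distance between two 3D points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_bond_matrix(num_atoms, bonds):
    """Return an (N+1) x (N+1) matrix of bond-type indices with [CLS] at position 0."""
    size = num_atoms + 1
    matrix = [[NO_BOND] * size for _ in range(size)]
    for i in range(size):
        matrix[0][i] = CLS_INDEX
        matrix[i][0] = CLS_INDEX
    for a, b, kind in bonds:
        matrix[a + 1][b + 1] = BOND_TYPES[kind]
        matrix[b + 1][a + 1] = BOND_TYPES[kind]
    return matrix


# Water: O(0)-H(1) and O(0)-H(2) are single bonds; H(1) and H(2) are not bonded.
coords = [(0.0, 0.0, 0.0), (0.96, 0.0, 0.0), (-0.24, 0.93, 0.0)]
distance_codes = [[encode_distance(euclidean(a, b)) for b in coords] for a in coords]
bond_matrix = build_bond_matrix(3, [(0, 1, "single"), (0, 2, "single")])
```

In this sketch the H-H pair keeps index 0 in `bond_matrix`, so it would still take part in attention through its own embedding instead of being masked out.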

With this architecture, the model successfully focuses on the relations between atoms. As with other transformer-based models, pretraining on a large-scale dataset yields better performance, even with few finetuning samples. We pretrained our model on PubChem3D (50M) and PubChemQC (3M). For PubChem3D, the model was trained to predict conformer RMSD, MMFF94 energy, shape self-overlap, and feature self-overlap. For PubChemQC, the model was trained to predict the singlet excitation energies of the S1 to S10 states.

Reproduction

To reproduce our competition results or pretrain a new model, follow the steps below. A large disk and high-performance GPUs (e.g. A100s) are required.

Download PubChem3D and PubChemQC

First, download the PubChem3D and PubChemQC datasets. The following commands download the datasets and convert them to the dataset structure used by this project.

$ python utilities/download_pubchem.py
$ python utilities/download_pubchemqc.py

Although we used 50M PubChem3D compounds, you can use the full 100M samples if your network connection and client machine can sustain the download.

After downloading the datasets, create index files that record the seek position of each sample. Because the datasets are very large, they cannot be loaded entirely into memory, so the dataset classes use these index files to access samples randomly.

$ python utilities/create_dataset_index.py pubchem-compound-50m.csv
$ python utilities/create_dataset_index.py pubchemqc-excitations-3m.csv

Check that pubchem-compound-50m.index and pubchemqc-excitations-3m.index have been created.
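
To make the idea of seek positions concrete, here is a minimal sketch of offset-based random access. It is not the code of utilities/create_dataset_index.py. The one-offset-per-line text format and the names `create_index` and `read_sample` are assumptions, and the real index file may be laid out differently.

```python
import os
import tempfile


def create_index(data_path, index_path):
    """Write the byte offset of every line in data_path, one offset per line."""
    with open(data_path, "rb") as src, open(index_path, "w") as dst:
        offset = src.tell()
        line = src.readline()
        while line:
            dst.write(f"{offset}\n")
            offset = src.tell()
            line = src.readline()


def read_sample(data_path, offsets, i):
    """Seek directly to line i instead of loading the whole file."""
    with open(data_path, "rb") as src:
        src.seek(offsets[i])
        return src.readline().decode("utf-8").rstrip("\n")


# Tiny demonstration: a temporary file stands in for the large CSV.
workdir = tempfile.mkdtemp()
data_path = os.path.join(workdir, "toy.csv")
index_path = os.path.join(workdir, "toy.index")
with open(data_path, "w", newline="\n") as f:
    f.write("cid,structure\n1,AAA\n2,BBBB\n3,CC\n")

create_index(data_path, index_path)
with open(index_path) as f:
    offsets = [int(line) for line in f]
third_row = read_sample(data_path, offsets, 3)
```

Only the list of offsets has to stay in memory, which is why this approach scales to files that are far larger than RAM.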

Training and Finetuning

Now we are ready to train MoT. First, pretrain a new model on the datasets. Move the datasets to the pretrain directory and change the working directory to pretrain. Then run the commands below to pretrain on PubChem3D and PubChemQC in turn. Note that PubChemQC pretraining uses the PubChem3D-pretrained model weights.

$ python src/train.py config/mot-base-pubchem.yaml
$ python src/train.py config/mot-base-pubchemqc.yaml

Check that mot-base-pubchem.pth and mot-base-pubchemqc.pth have been created. Next, move the final weights file (mot-base-pubchemqc.pth) to the finetune directory. Place the competition dataset samsung-ai-challenge-for-scientific-discovery in the same directory and start finetuning with the command below:

$ python src/train.py config/train/mot-base-pubchemqc.yaml  \
        data.fold_index=[fold index]                        \
        model.random_seed=[random seed]

We recommend training the model on 5 folds with various random seeds. It is well known that the random seed is critical to transformer finetuning, so you can tune the random seed to achieve better results.

After finetuning the models, use the following command to predict the energy gaps for the test dataset.
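
Running every fold with several seeds by hand is tedious, so a small helper script can generate the command lines. The sketch below only builds and prints the commands. It assumes that fold indices start at 0, and the seed values and the helper name `build_finetune_commands` are illustrative.

```python
import itertools


def build_finetune_commands(num_folds, seeds,
                            config="config/train/mot-base-pubchemqc.yaml"):
    """Return one argument list per (fold, seed) pair."""
    commands = []
    for fold, seed in itertools.product(range(num_folds), seeds):
        commands.append([
            "python", "src/train.py", config,
            f"data.fold_index={fold}",      # assumption: folds are numbered from 0
            f"model.random_seed={seed}",
        ])
    return commands


commands = build_finetune_commands(num_folds=5, seeds=[0, 1, 2])
for cmd in commands:
    print(" ".join(cmd))
```

Each argument list can be passed to `subprocess.run(cmd, check=True)` from the finetune directory to launch the runs one after another.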

$ python src/predict.py config/predict/mot-base-pubchemqc.yaml \
        model.pretrained_model_path=[finetuned model path]

The prediction file is written with the same name as the model. You can submit a single prediction file or average several to get an ensembled result.

$ python utilities/simple_ensemble.py finetune/*.csv [output file name]
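
Averaging predictions amounts to taking the per-sample mean over all prediction files. The sketch below shows the idea on in-memory CSV text. It is not the code of utilities/simple_ensemble.py, and the column names `uid` and `prediction` are placeholders for whatever the prediction files actually contain.

```python
import csv
import io
from collections import defaultdict


def average_predictions(csv_texts, id_column="uid", value_column="prediction"):
    """Average the predicted value of each sample id over several CSV files."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for text in csv_texts:
        for row in csv.DictReader(io.StringIO(text)):
            sums[row[id_column]] += float(row[value_column])
            counts[row[id_column]] += 1
    return {key: sums[key] / counts[key] for key in sums}


# Two toy prediction files, e.g. from two folds or two seeds.
fold0 = "uid,prediction\ntest_0,1.0\ntest_1,2.0\n"
fold1 = "uid,prediction\ntest_0,3.0\ntest_1,4.0\n"
ensembled = average_predictions([fold0, fold1])
```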

Finetune with custom dataset

If you want to finetune on a custom dataset, all you need to do is rewrite the configuration file. Note that the finetune directory is designed only for the competition dataset, so its training code is tied to the competition data structure. Instead, you can finetune the model on your custom dataset in the pretrain directory. Let's look at the configuration file for the PubChemQC dataset, located at pretrain/config/mot-base-pubchemqc.yaml.

data:
  dataset_file:
    label: pubchemqc-excitations-3m.csv
    index: pubchemqc-excitations-3m.index
  input_column: structure
  label_columns: [s1_energy, s2_energy, s3_energy, s4_energy, s5_energy, s6_energy, s7_energy, s8_energy, s9_energy, s10_energy]
  labels_mean_std:
    s1_energy: [4.56093558, 0.8947327]
    s2_energy: [4.94014921, 0.8289951]
    s3_energy: [5.19785427, 0.78805644]
    s4_energy: [5.39875606, 0.75659831]
    s5_energy: [5.5709758, 0.73529373]
    s6_energy: [5.71340364, 0.71889017]
    s7_energy: [5.83764871, 0.70644563]
    s8_energy: [5.94665475, 0.6976438]
    s9_energy: [6.04571037, 0.69118142]
    s10_energy: [6.13691953, 0.68664366]
  max_length: 128
  bond_drop_prob: 0.1
  validation_ratio: 0.05
  dataloader_workers: -1

model:
  pretrained_model_path: mot-base-pubchem.pth
  config: ...

The data.dataset_file field can be changed to the desired finetuning dataset and its index file. Do not forget to create the index file with utilities/create_dataset_index.py. data.input_column specifies the column that contains the encoded 3D structures, and data.label_columns lists the columns to predict. The label values are normalized with data.labels_mean_std. Simply copy this file, rename it for your own dataset, and change the name and statistics of each label. Here is an example for predicting toxicity values:

data:
  dataset_file:
    label: toxicity.csv
    index: toxicity.index
  input_column: structure
  label_columns: [toxicity]
  labels_mean_std:
    toxicity: [0.92, 1.85]
  max_length: 128
  bond_drop_prob: 0.0
  validation_ratio: 0.1
  dataloader_workers: -1

model:
  pretrained_model_path: mot-base-pubchemqc.pth
  config:
    num_layers: 12
    hidden_dim: 768
    intermediate_dim: 3072
    num_attention_heads: 12
    hidden_dropout_prob: 0.1
    attention_dropout_prob: 0.1
    position_scale: 100.0
    initialize_range: 0.02

train:
  name: mot-base-toxicity
  optimizer:
    lr: 1e-4
    betas: [0.9, 0.999]
    eps: 1e-6
    weight_decay: 0.01
  training_steps: 100000
  warmup_steps: 10000
  batch_size: 256
  accumulate_grads: 1
  max_grad_norm: 1.0
  validation_interval: 1.0
  precision: 16
  gpus: 1
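
The [mean, std] pairs under data.labels_mean_std have to be computed from your own dataset. A minimal sketch for doing so is shown below. It uses the population standard deviation, which is an assumption (the sample standard deviation may have been used for the values above), and the helper names are illustrative.

```python
import csv
import io
import statistics


def label_statistics(csv_text, label_columns):
    """Return {label: [mean, std]} in the shape used by data.labels_mean_std."""
    columns = {name: [] for name in label_columns}
    for row in csv.DictReader(io.StringIO(csv_text)):
        for name in label_columns:
            columns[name].append(float(row[name]))
    return {
        name: [statistics.fmean(values), statistics.pstdev(values)]
        for name, values in columns.items()
    }


def normalize(value, mean, std):
    """Map a raw label to the normalized training target."""
    return (value - mean) / std


def denormalize(value, mean, std):
    """Map a normalized prediction back to the original label scale."""
    return value * std + mean


# Toy dataset with a single label column.
toy_csv = "structure,toxicity\nA,1.0\nB,2.0\nC,3.0\nD,4.0\n"
stats = label_statistics(toy_csv, ["toxicity"])
mean, std = stats["toxicity"]
```

For a real dataset, read the CSV from disk instead of a string and copy the resulting pairs into the configuration file.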

Results on Competition Dataset

Model                        | PubChem | PubChemQC | Competition LB (Public/Private)
ELECTRA                      | 0.0493  |           | 0.1508/−
BERT Regression              | 0.0074  | 0.0497    | 0.1227/−
MoT-Base (w/o PubChem)       |         | 0.0188    | 0.0877/−
MoT-Base (PubChemQC 150k)    | 0.0086  | 0.0151    | 0.0666/−
    + PubChemQC 300k         | "       | 0.0917    | 0.0526/−
    + 5Fold CV               | "       | "         | 0.0507/−
    + Ensemble               | "       | "         | 0.0503/−
    + Increase Maximum Atoms | "       | "         | 0.0497/0.04931

Description: Comparison of various models. ELECTRA and BERT Regression are SMILES-based models trained on PubChem-100M (and, for BERT Regression only, PubChemQC-3M). ELECTRA is trained to distinguish fake SMILES tokens (i.e., the ELECTRA approach), and BERT Regression is trained to predict the labels directly, without unsupervised learning. PubChemQC 150k and 300k denote that the model is trained for 150k and 300k steps in the PubChemQC stage.

Utilities

This repository provides some useful utility scripts.

  • create_dataset_index.py: As mentioned above, creates the seek positions of the samples in a dataset for random access.
  • download_pubchem.py and download_pubchemqc.py: Download the PubChem3D and PubChemQC datasets.
  • find_test_compound_cids.py: Finds the CIDs of the compounds in the test dataset so that they can be excluded from training. Training on them would cause data leakage.
  • simple_ensemble.py: Performs a simple ensemble by averaging the predictions from several models.

License

This repository is released under the Apache License 2.0. The license can be found in the LICENSE file.
