Molecular Sets (MOSES): A Benchmarking Platform for Molecular Generation Models

Overview

Molecular Sets (MOSES): A benchmarking platform for molecular generation models

Build Status PyPI version

Deep generative models are rapidly becoming popular for the discovery of new molecules and materials. Such models learn on a large collection of molecular structures and produce novel compounds. In this work, we introduce Molecular Sets (MOSES), a benchmarking platform to support research on machine learning for drug discovery. MOSES implements several popular molecular generation models and provides a set of metrics to evaluate the quality and diversity of generated molecules. With MOSES, we aim to standardize the research on molecular generation and facilitate the sharing and comparison of new models.

For more details, please refer to the paper.

If you are using MOSES in your research paper, please cite us as

@article{10.3389/fphar.2020.565644,
  title={{M}olecular {S}ets ({MOSES}): {A} {B}enchmarking {P}latform for {M}olecular {G}eneration {M}odels},
  author={Polykovskiy, Daniil and Zhebrak, Alexander and Sanchez-Lengeling, Benjamin and Golovanov, Sergey and Tatanov, Oktai and Belyaev, Stanislav and Kurbanov, Rauf and Artamonov, Aleksey and Aladinskiy, Vladimir and Veselov, Mark and Kadurin, Artur and Johansson, Simon and  Chen, Hongming and Nikolenko, Sergey and Aspuru-Guzik, Alan and Zhavoronkov, Alex},
  journal={Frontiers in Pharmacology},
  year={2020}
}

pipeline

Dataset

We propose a benchmarking dataset refined from the ZINC database.

The set is based on the ZINC Clean Leads collection. It contains 4,591,276 molecules in total, filtered by molecular weight in the range from 250 to 350 Daltons, a number of rotatable bonds not greater than 7, and XlogP less than or equal to 3.5. We removed molecules containing charged atoms or atoms besides C, N, S, O, F, Cl, Br, H or cycles longer than 8 atoms. The molecules were filtered via medicinal chemistry filters (MCFs) and PAINS filters.

The dataset contains 1,936,962 molecular structures. For experiments, we split the dataset into a training, test and scaffold test sets containing around 1.6M, 176k, and 176k molecules respectively. The scaffold test set contains unique Bemis-Murcko scaffolds that were not present in the training and test sets. We use this set to assess how well the model can generate previously unobserved scaffolds.

Models

Metrics

Besides standard uniqueness and validity metrics, MOSES provides other metrics to access the overall quality of generated molecules. Fragment similarity (Frag) and Scaffold similarity (Scaff) are cosine distances between vectors of fragment or scaffold frequencies correspondingly of the generated and test sets. Nearest neighbor similarity (SNN) is the average similarity of generated molecules to the nearest molecule from the test set. Internal diversity (IntDiv) is an average pairwise similarity of generated molecules. Fréchet ChemNet Distance (FCD) measures the difference in distributions of last layer activations of ChemNet. Novelty is a fraction of unique valid generated molecules not present in the training set.

Model Valid (↑) [email protected] (↑) [email protected] (↑) FCD (↓) SNN (↑) Frag (↑) Scaf (↑) IntDiv (↑) IntDiv2 (↑) Filters (↑) Novelty (↑)
Test TestSF Test TestSF Test TestSF Test TestSF
Train 1.0 1.0 1.0 0.008 0.4755 0.6419 0.5859 1.0 0.9986 0.9907 0.0 0.8567 0.8508 1.0 1.0
HMM 0.076±0.0322 0.623±0.1224 0.5671±0.1424 24.4661±2.5251 25.4312±2.5599 0.3876±0.0107 0.3795±0.0107 0.5754±0.1224 0.5681±0.1218 0.2065±0.0481 0.049±0.018 0.8466±0.0403 0.8104±0.0507 0.9024±0.0489 0.9994±0.001
NGram 0.2376±0.0025 0.974±0.0108 0.9217±0.0019 5.5069±0.1027 6.2306±0.0966 0.5209±0.001 0.4997±0.0005 0.9846±0.0012 0.9815±0.0012 0.5302±0.0163 0.0977±0.0142 0.8738±0.0002 0.8644±0.0002 0.9582±0.001 0.9694±0.001
Combinatorial 1.0±0.0 0.9983±0.0015 0.9909±0.0009 4.2375±0.037 4.5113±0.0274 0.4514±0.0003 0.4388±0.0002 0.9912±0.0004 0.9904±0.0003 0.4445±0.0056 0.0865±0.0027 0.8732±0.0002 0.8666±0.0002 0.9557±0.0018 0.9878±0.0008
CharRNN 0.9748±0.0264 1.0±0.0 0.9994±0.0003 0.0732±0.0247 0.5204±0.0379 0.6015±0.0206 0.5649±0.0142 0.9998±0.0002 0.9983±0.0003 0.9242±0.0058 0.1101±0.0081 0.8562±0.0005 0.8503±0.0005 0.9943±0.0034 0.8419±0.0509
AAE 0.9368±0.0341 1.0±0.0 0.9973±0.002 0.5555±0.2033 1.0572±0.2375 0.6081±0.0043 0.5677±0.0045 0.991±0.0051 0.9905±0.0039 0.9022±0.0375 0.0789±0.009 0.8557±0.0031 0.8499±0.003 0.996±0.0006 0.7931±0.0285
VAE 0.9767±0.0012 1.0±0.0 0.9984±0.0005 0.099±0.0125 0.567±0.0338 0.6257±0.0005 0.5783±0.0008 0.9994±0.0001 0.9984±0.0003 0.9386±0.0021 0.0588±0.0095 0.8558±0.0004 0.8498±0.0004 0.997±0.0002 0.6949±0.0069
JTN-VAE 1.0±0.0 1.0±0.0 0.9996±0.0003 0.3954±0.0234 0.9382±0.0531 0.5477±0.0076 0.5194±0.007 0.9965±0.0003 0.9947±0.0002 0.8964±0.0039 0.1009±0.0105 0.8551±0.0034 0.8493±0.0035 0.976±0.0016 0.9143±0.0058
LatentGAN 0.8966±0.0029 1.0±0.0 0.9968±0.0002 0.2968±0.0087 0.8281±0.0117 0.5371±0.0004 0.5132±0.0002 0.9986±0.0004 0.9972±0.0007 0.8867±0.0009 0.1072±0.0098 0.8565±0.0007 0.8505±0.0006 0.9735±0.0006 0.9498±0.0006

For comparison of molecular properties, we computed the Wasserstein-1 distance between distributions of molecules in the generated and test sets. Below, we provide plots for lipophilicity (logP), Synthetic Accessibility (SA), Quantitative Estimation of Drug-likeness (QED) and molecular weight.

logP SA
logP SA
weight QED
weight QED

Installation

PyPi

The simplest way to install MOSES (models and metrics) is to install RDKit: conda install -yq -c rdkit rdkit and then install MOSES (molsets) from pip (pip install molsets). If you want to use LatentGAN, you should also install additional dependencies using bash install_latentgan_dependencies.sh.

If you are using Ubuntu, you should also install sudo apt-get install libxrender1 libxext6 for RDKit.

Docker

  1. Install docker and nvidia-docker.

  2. Pull an existing image (4.1Gb to download) from DockerHub:

docker pull molecularsets/moses

or clone the repository and build it manually:

git clone https://github.com/molecularsets/moses.git
nvidia-docker image build --tag molecularsets/moses moses/
  1. Create a container:
nvidia-docker run -it --name moses --network="host" --shm-size 10G molecularsets/moses
  1. The dataset and source code are available inside the docker container at /moses:
docker exec -it molecularsets/moses bash

Manually

Alternatively, install dependencies and MOSES manually.

  1. Clone the repository:
git lfs install
git clone https://github.com/molecularsets/moses.git
  1. Install RDKit for metrics calculation.

  2. Install MOSES:

python setup.py install
  1. (Optional) Install dependencies for LatentGAN:
bash install_latentgan_dependencies.sh

Benchmarking your models

  • Install MOSES as described in the previous section.

  • Get train, test and test_scaffolds datasets using the following code:

import moses

train = moses.get_dataset('train')
test = moses.get_dataset('test')
test_scaffolds = moses.get_dataset('test_scaffolds')
  • You can use a standard torch DataLoader in your models. We provide a simple StringDataset class for convenience:
from torch.utils.data import DataLoader
from moses import CharVocab, StringDataset

train = moses.get_dataset('train')
vocab = CharVocab.from_data(train)
train_dataset = StringDataset(vocab, train)
train_dataloader = DataLoader(
    train_dataset, batch_size=512,
    shuffle=True, collate_fn=train_dataset.default_collate
)

for with_bos, with_eos, lengths in train_dataloader:
    ...
  • Calculate metrics from your model's samples. We recomend sampling at least 30,000 molecules:
import moses
metrics = moses.get_all_metrics(list_of_generated_smiles)
  • Add generated samples and metrics to your repository. Run the experiment multiple times to estimate the variance of the metrics.

Reproducing the baselines

End-to-End launch

You can run pretty much everything with:

python scripts/run.py

This will split the dataset, train the models, generate new molecules, and calculate the metrics. Evaluation results will be saved in metrics.csv.

You can specify the GPU device index as cuda:n (or cpu for CPU) and/or model by running:

python scripts/run.py --device cuda:1 --model aae

For more details run python scripts/run.py --help.

You can reproduce evaluation of all models with several seeds by running:

sh scripts/run_all_models.sh

Training

python scripts/train.py <model name> \
       --train_load <train dataset> \
       --model_save <path to model> \
       --config_save <path to config> \
       --vocab_save <path to vocabulary>

To get a list of supported models run python scripts/train.py --help.

For more details of certain model run python scripts/train.py <model name> --help.

Generation

python scripts/sample.py <model name> \
       --model_load <path to model> \
       --vocab_load <path to vocabulary> \
       --config_load <path to config> \
       --n_samples <number of samples> \
       --gen_save <path to generated dataset>

To get a list of supported models run python scripts/sample.py --help.

For more details of certain model run python scripts/sample.py <model name> --help.

Evaluation

python scripts/eval.py \
       --ref_path <reference dataset> \
       --gen_path <generated dataset>

For more details run python scripts/eval.py --help.

Owner
MOSES
A Benchmarking Platform for Molecular Generation Models
MOSES
StrongSORT: Make DeepSORT Great Again

StrongSORT StrongSORT: Make DeepSORT Great Again StrongSORT: Make DeepSORT Great Again Yunhao Du, Yang Song, Bo Yang, Yanyun Zhao arxiv 2202.13514 Abs

369 Jan 04, 2023
A PyTorch implementation of the paper "Semantic Image Synthesis via Adversarial Learning" in ICCV 2017

Semantic Image Synthesis via Adversarial Learning This is a PyTorch implementation of the paper Semantic Image Synthesis via Adversarial Learning. Req

Seonghyeon Nam 146 Nov 25, 2022
EEGEyeNet is benchmark to evaluate ET prediction based on EEG measurements with an increasing level of difficulty

Introduction EEGEyeNet EEGEyeNet is a benchmark to evaluate ET prediction based on EEG measurements with an increasing level of difficulty. Overview T

Ard Kastrati 23 Dec 22, 2022
Code from PropMix, accepted at BMVC'21

PropMix: Hard Sample Filtering and Proportional MixUp for Learning with Noisy Labels This repository is the official implementation of Hard Sample Fil

6 Dec 21, 2022
Implementation of Nalbach et al. 2017 paper.

Deep Shading Convolutional Neural Networks for Screen-Space Shading Our project is based on Nalbach et al. 2017 paper. In this project, a set of buffe

Marcel Santana 17 Sep 08, 2022
PyTorch implementation of SimCLR: A Simple Framework for Contrastive Learning of Visual Representations

PyTorch implementation of SimCLR: A Simple Framework for Contrastive Learning of Visual Representations

Thalles Silva 1.7k Dec 28, 2022
Optical Character Recognition + Instance Segmentation for russian and english languages

Распознавание рукописного текста в школьных тетрадях Соревнование, проводимое в рамках олимпиады НТО, разработанное Сбером. Платформа ODS. Результаты

Gerasimov Maxim 21 Dec 19, 2022
PyTorch code for our ECCV 2018 paper "Image Super-Resolution Using Very Deep Residual Channel Attention Networks"

PyTorch code for our ECCV 2018 paper "Image Super-Resolution Using Very Deep Residual Channel Attention Networks"

Yulun Zhang 1.2k Dec 26, 2022
PyTorch 1.0 inference in C++ on Windows10 platforms

Serving PyTorch Models in C++ on Windows10 platforms How to use Prepare Data examples/data/train/ - 0 - 1 . . . - n examples/data/test/

Henson 88 Oct 15, 2022
RL Algorithms with examples in Python / Pytorch / Unity ML agents

Reinforcement Learning Project This project was created to make it easier to get started with Reinforcement Learning. It now contains: An implementati

Rogier Wachters 3 Aug 19, 2022
Official implementation of the PICASO: Permutation-Invariant Cascaded Attentional Set Operator

PICASO Official PyTorch implemetation for the paper PICASO:Permutation-Invariant Cascaded Attentive Set Operator. Requirements Python 3 torch = 1.0 n

Samira Zare 0 Dec 23, 2021
Hcpy - Interface with Home Connect appliances in Python

Interface with Home Connect appliances in Python This is a very, very beta inter

Trammell Hudson 116 Dec 27, 2022
Compositional Sketch Search

Compositional Sketch Search Official repository for ICIP 2021 Paper: Compositional Sketch Search Requirements Install and activate conda environment c

Alexander Black 8 Sep 06, 2021
Optimized code based on M2 for faster image captioning training

Transformer Captioning This repository contains the code for Transformer-based image captioning. Based on meshed-memory-transformer, we further optimi

lyricpoem 16 Dec 16, 2022
Secure Distributed Training at Scale

Secure Distributed Training at Scale This repository contains the implementation of experiments from the paper "Secure Distributed Training at Scale"

Yandex Research 9 Jul 11, 2022
BossNAS: Exploring Hybrid CNN-transformers with Block-wisely Self-supervised Neural Architecture Search

BossNAS This repository contains PyTorch evaluation code, retraining code and pretrained models of our paper: BossNAS: Exploring Hybrid CNN-transforme

Changlin Li 127 Dec 26, 2022
Given a 2D triangle mesh, we could randomly generate cloud points that fill in the triangle mesh

generate_cloud_points Given a 2D triangle mesh, we could randomly generate cloud points that fill in the triangle mesh. Run python disp_mesh.py Or you

Peng Yu 2 Dec 24, 2021
Answering Open-Domain Questions of Varying Reasoning Steps from Text

This repository contains the authors' implementation of the Iterative Retriever, Reader, and Reranker (IRRR) model in the EMNLP 2021 paper "Answering Open-Domain Questions of Varying Reasoning Steps

26 Dec 22, 2022
A toolset of Python programs for signal modeling and indentification via sparse semilinear autoregressors.

SPAAR Description A toolset of Python programs for signal modeling via sparse semilinear autoregressors. References Vides, F. (2021). Computing Semili

Fredy Vides 0 Oct 30, 2021
Reinforcement Learning for the Blackjack

Reinforcement Learning for Blackjack Author: ZHA Mengyue Math Department of HKUST Problem Statement We study playing Blackjack by reinforcement learni

Dolores 3 Jan 24, 2022