ULMFiT for Genomic Sequence Data

Overview

Genomic ULMFiT

This is an implementation of ULMFiT for genomics classification using PyTorch and fastai. The model architecture is based on the AWD-LSTM model, consisting of an embedding layer, three LSTM layers, and a final set of linear layers.

The ULMFiT approach uses three training phases to produce a classification model (a minimal code sketch follows the list):

  1. Train a language model on a large, unlabeled corpus
  2. Fine-tune the language model on the classification corpus
  3. Use the fine-tuned language model to initialize a classification model
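
Below is a minimal sketch of the three phases using fastai-v1-style calls. The data objects (`genome_lm_data`, `clas_lm_data`, `clas_data`), epoch counts, and learning rates are hypothetical placeholders, not this repo's exact API; the repo builds its own genomic data objects.

```python
# Minimal sketch of the three ULMFiT phases (fastai v1 style).
# The DataBunch objects below are hypothetical placeholders built from
# k-mer tokenized sequences; schedules and hyperparameters are illustrative.
from fastai.text import language_model_learner, text_classifier_learner, AWD_LSTM

# Phase 1: train a language model on a large unlabeled genomic corpus
lm = language_model_learner(genome_lm_data, AWD_LSTM, drop_mult=0.3, pretrained=False)
lm.fit_one_cycle(10, 1e-2)
lm.save_encoder('genome_enc')

# Phase 2: fine-tune the language model on the classification corpus
lm_ft = language_model_learner(clas_lm_data, AWD_LSTM, drop_mult=0.3, pretrained=False)
lm_ft.load_encoder('genome_enc')
lm_ft.fit_one_cycle(5, 1e-3)
lm_ft.save_encoder('clas_enc')

# Phase 3: initialize a classifier with the fine-tuned encoder
clf = text_classifier_learner(clas_data, AWD_LSTM, drop_mult=0.3, pretrained=False)
clf.load_encoder('clas_enc')
clf.freeze_to(-2)            # train the head and top layer first
clf.fit_one_cycle(2, 1e-2)
clf.unfreeze()               # then unfreeze and train the full model
clf.fit_one_cycle(5, 1e-3)
```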

This method is particularly advantageous for genomic data, where unlabeled data is abundant and labeled data is scarce. The ULMFiT approach allows us to train a model on a large, unlabeled genomic corpus in an unsupervised fashion. The pre-trained language model then serves as a feature extractor for downstream classification tasks.

Typical deep learning approaches to genomics classification are highly restricted by whatever labeled data is available. Models are usually trained from scratch on small datasets, leading to problems with overfitting. When unsupervised pre-training is used, it is typically done only on the classification dataset or on synthetically generated data. The Genomic-ULMFiT approach uses genome-scale corpora for pre-training, producing better feature extractors than training on the classification corpus alone.

For a deep dive into the ULMFiT approach, model architectures, regularization and training strategies, see the Methods Long Form document in the Methods section.

Results

Performance of Genomic-ULMFiT relative to other methods

Promoter Classification

E. coli promoters

The Genomic-ULMFiT method performs well at the task of distinguishing promoter sequences from random sections of the genome. The process of unsupervised pre-training and fine-tuning has a clear impact on the performance of the classification model.

| Model | Accuracy | Precision | Recall | Correlation Coefficient |
|---|---|---|---|---|
| Naive | 0.834 | 0.847 | 0.816 | 0.670 |
| E. coli Genome Pre-Training | 0.919 | 0.941 | 0.893 | 0.839 |
| Genomic Ensemble Pre-Training | 0.973 | 0.980 | 0.966 | 0.947 |
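
The correlation coefficient reported in these tables appears to be the Matthews correlation coefficient (the mRNA/lncRNA table below abbreviates it as MCC). A quick sketch of computing the reported metrics with scikit-learn, using hypothetical predictions:

```python
# Sketch: computing the reported metrics with scikit-learn.
# y_true and y_pred are hypothetical labels (1 = promoter, 0 = background).
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, matthews_corrcoef)

y_true = [1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 0, 1, 1]

print(f"Accuracy:  {accuracy_score(y_true, y_pred):.3f}")
print(f"Precision: {precision_score(y_true, y_pred):.3f}")
print(f"Recall:    {recall_score(y_true, y_pred):.3f}")
print(f"MCC:       {matthews_corrcoef(y_true, y_pred):.3f}")
```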

Data generation described in notebook

Notebook Directory

Classification performance on human promoters is competitive with published results.

Human Promoters (short)

For the short promoter sequences, using data from Recognition of Prokaryotic and Eukaryotic Promoters using Convolutional Deep Learning Neural Networks:

| Model | DNA Size | kmer/stride | Accuracy | Precision | Recall | Correlation Coefficient | Specificity |
|---|---|---|---|---|---|---|---|
| Kh et al. | -200/50 | - | - | - | 0.9 | 0.89 | 0.98 |
| Naive Model | -200/50 | 5/2 | 0.80 | 0.74 | 0.80 | 0.59 | 0.80 |
| With Pre-Training | -200/50 | 5/2 | 0.922 | 0.963 | 0.849 | 0.844 | 0.976 |
| With Pre-Training and Fine Tuning | -200/50 | 5/2 | 0.977 | 0.959 | 0.989 | 0.955 | 0.969 |
| With Pre-Training and Fine Tuning | -200/50 | 5/1 | 0.990 | 0.983 | 0.995 | 0.981 | 0.987 |
| With Pre-Training and Fine Tuning | -200/50 | 3/1 | 0.995 | 0.992 | 0.996 | 0.991 | 0.994 |
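
The kmer/stride column describes how sequences are tokenized before being fed to the model: a window of k bases slides along the sequence in steps of the stride, so 5/2 produces overlapping 5-mer tokens every 2 bases. A minimal tokenizer sketch (the function name is illustrative, not the repo's API):

```python
def kmer_tokenize(seq, k=5, stride=2):
    """Split a DNA sequence into overlapping k-mer tokens.

    k=5, stride=2 corresponds to the '5/2' setting in the table above.
    """
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]

# Example: tokenize a short sequence with 3-mers and stride 1 ('3/1')
print(kmer_tokenize("ATGCGTAC", k=3, stride=1))
# ['ATG', 'TGC', 'GCG', 'CGT', 'GTA', 'TAC']
```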

Data Source

Notebook Directory

Human Promoters (long)

For the long promoter sequences, using data from PromID: Human Promoter Prediction by Deep Learning:

| Model | DNA Size | Models | Accuracy | Precision | Recall | Correlation Coefficient |
|---|---|---|---|---|---|---|
| Umarov et al. | -1000/500 | 2 Model Ensemble | - | 0.636 | 0.802 | 0.714 |
| Umarov et al. | -200/400 | 2 Model Ensemble | - | 0.769 | 0.755 | 0.762 |
| Naive Model | -500/500 | Single Model | 0.858 | 0.877 | 0.772 | 0.708 |
| With Pre-Training | -500/500 | Single Model | 0.888 | 0.90 | 0.824 | 0.770 |
| With Pre-Training and Fine Tuning | -500/500 | Single Model | 0.892 | 0.877 | 0.865 | 0.778 |

Data generation described in notebook

Notebook Directory

Other Bacterial Promoters

This table shows results on data from Recognition of prokaryotic and eukaryotic promoters using convolutional deep learning neural networks. These results show how CNN-based methods can sometimes perform better when training on small datasets.

| Method | Organism | Training Examples | Accuracy | Precision | Recall | Correlation Coefficient | Specificity |
|---|---|---|---|---|---|---|---|
| Kh et al. | E. coli | 2936 | - | - | 0.90 | 0.84 | 0.96 |
| Genomic-ULMFiT | E. coli | 2936 | 0.956 | 0.917 | 0.880 | 0.871 | 0.977 |
| Kh et al. | B. subtilis | 1050 | - | - | 0.91 | 0.86 | 0.95 |
| Genomic-ULMFiT | B. subtilis | 1050 | 0.905 | 0.857 | 0.789 | 0.759 | 0.95 |

Data Source

Notebook Directory

Metagenomics Classification

Genomic-ULMFiT shows improved performance on the metagenomics taxonomic dataset from Deep learning models for bacteria taxonomic classification of metagenomic data.

| Method | Data Source | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| Fiannaca et al. | Amplicon | 0.9137 | 0.9162 | 0.9137 | 0.9126 |
| Genomic-ULMFiT | Amplicon | 0.9239 | 0.9402 | 0.9332 | 0.9306 |
| Fiannaca et al. | Shotgun | 0.8550 | 0.8570 | 0.8520 | 0.8511 |
| Genomic-ULMFiT | Shotgun | 0.8797 | 0.8824 | 0.8769 | 0.8758 |

Data Source

Notebook Directory

Enhancer Classification

When trained on a dataset of mammalian enhancer sequences from Enhancer Identification using Transfer and Adversarial Deep Learning of DNA Sequences, Genomic-ULMFiT improves on results from Cohn et al.

| Model (ROC-AUC) | Human | Mouse | Dog | Opossum |
|---|---|---|---|---|
| Cohn et al. | 0.80 | 0.78 | 0.77 | 0.72 |
| Genomic-ULMFiT 5-mer Stride 2 | 0.812 | 0.871 | 0.773 | 0.787 |
| Genomic-ULMFiT 4-mer Stride 2 | 0.804 | 0.876 | 0.771 | 0.786 |
| Genomic-ULMFiT 3-mer Stride 1 | 0.819 | 0.875 | 0.788 | 0.798 |
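
Unlike the threshold-based metrics in the earlier tables, ROC-AUC is computed from the model's predicted probabilities rather than hard class labels. A sketch with scikit-learn and hypothetical scores:

```python
# Sketch: ROC-AUC from predicted positive-class probabilities.
# y_true and y_score are hypothetical values, not results from this repo.
from sklearn.metrics import roc_auc_score

y_true  = [1, 0, 1, 1, 0, 0]
y_score = [0.9, 0.2, 0.7, 0.4, 0.3, 0.6]

print(f"ROC-AUC: {roc_auc_score(y_true, y_score):.3f}")
```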

Data Source

Notebook Directory

mRNA/lncRNA Classification

This table shows results for training a classification model on a dataset of coding mRNA sequences and long noncoding RNA (lncRNA) sequences. The dataset comes from A deep recurrent neural network discovers complex biological rules to decipher RNA protein-coding potential by Hill et al. and contains two test sets: a standard test set and a challenge test set.

| Model | Test Set | Accuracy | Specificity | Sensitivity | Precision | MCC |
|---|---|---|---|---|---|---|
| GRU Ensemble (Hill et al.)* | Standard Test Set | 0.96 | 0.97 | 0.95 | 0.97 | 0.92 |
| Genomic ULMFiT (3mer stride 1) | Standard Test Set | 0.963 | 0.952 | 0.974 | 0.953 | 0.926 |
| GRU Ensemble (Hill et al.)* | Challenge Test Set | 0.875 | 0.95 | 0.80 | 0.95 | 0.75 |
| Genomic ULMFiT (3mer stride 1) | Challenge Test Set | 0.90 | 0.944 | 0.871 | 0.939 | 0.817 |

(*) Hill et al. presented their results as a plot rather than as a data table. Values in the above table are estimated by reading off the plot.

Data Source

Notebook Directory

Interpreting Results

One way to gain insight into how the classification model makes decisions is to perturb regions of a given input sequence and observe how changes to different regions impact the classification result. This allows us to create plots like the one below, highlighting important sequence regions for classification. In the plot below, the red line corresponds to a true transcription start site, and the plot shows how prediction results are sensitive to changes around that location. More detail on interpretations can be found in the Model Interpretations directory.
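
A minimal sketch of this perturbation idea: scramble a sliding window of the input and record how much the model's predicted probability moves. The `predict_proba` callable is a hypothetical stand-in for the trained classifier, not a function from this repo:

```python
import random

def perturbation_importance(seq, predict_proba, window=20, stride=5):
    """Score each region by how much scrambling it changes the prediction.

    predict_proba is a hypothetical callable returning the model's
    positive-class probability for a raw sequence string.
    """
    base = predict_proba(seq)
    scores = []
    for i in range(0, len(seq) - window + 1, stride):
        chars = list(seq[i:i + window])
        random.shuffle(chars)                      # scramble the window
        perturbed = seq[:i] + "".join(chars) + seq[i + window:]
        scores.append((i, base - predict_proba(perturbed)))
    return scores  # large positive deltas mark regions the model relies on
```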

Long Sequence Inference

Inference on long, unlabeled sequences can be done by breaking the input sequence into chunks and plotting prediction results as a function of genomic position. The image below shows a sample prediction of promoter locations on a 40,000 bp region of the E. coli genome, with true promoter locations shown in red. More detail can be found in this notebook.
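
A minimal sketch of chunked inference, reusing the hypothetical `predict_proba` callable from the interpretation sketch above; the window and stride sizes are illustrative, not the notebook's settings:

```python
def sliding_window_predictions(genome, predict_proba, window=500, stride=100):
    """Chunk a long sequence and score each window with the classifier.

    predict_proba is the same hypothetical scoring callable as above.
    Returns window start positions and their predicted probabilities;
    plotting probs against positions localizes putative promoters.
    """
    positions, probs = [], []
    for i in range(0, len(genome) - window + 1, stride):
        positions.append(i)
        probs.append(predict_proba(genome[i:i + window]))
    return positions, probs
```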

Relevant Literature

For a comparison to other published methods, see Section 6 of the Methods notebook. Here are some relevant papers in the deep genomics classification space.

DeepCRISPR: optimized CRISPR guide RNA design by deep learning

Recognition of prokaryotic and eukaryotic promoters using convolutional deep learning neural networks

PromID: human promoter prediction by deep learning

Deep Learning for Genomics: A Concise Overview

Prediction of deleterious mutations in coding regions of mammals with transfer learning

Enhancer Identification using Transfer and Adversarial Deep Learning of DNA Sequences

PEDLA: predicting enhancers with a deep learning-based algorithmic framework

Predicting enhancers with deep convolutional neural networks

BiRen: predicting enhancers with a deep-learning-based model using the DNA sequence alone

Deep learning models for bacteria taxonomic classification of metagenomic data

Prediction of enhancer-promoter interactions via natural language processing

A deep recurrent neural network discovers complex biological rules to decipher RNA protein-coding potential

Recurrent Neural Network for Predicting Transcription Factor Binding Sites

Learning the Language of the Genome using RNNs
