Code and models for the paper "Learning Unknown from Correlations: Graph Neural Network for Inter-novel-protein Interaction Prediction".

Overview

GNN_PPI

Learning Unknown from Correlations: Graph Neural Network for Inter-novel-protein Interaction Prediction
Authors: Guofeng Lv, Zhiqiang Hu, Yanguang Bi, Shaoting Zhang
arXiv extended version (arXiv: https://arxiv.org/abs/2105.06709)

Contact: [email protected]. Questions and discussion are welcome!

Abstract

The study of multi-type Protein-Protein Interaction (PPI) is fundamental for understanding biological processes from a systematic perspective and revealing disease mechanisms. Existing methods suffer from significant performance degradation when tested on unseen datasets. In this paper, we investigate the problem and find that it is mainly attributed to the poor performance for inter-novel-protein interaction prediction. However, current evaluations overlook the inter-novel-protein interactions, and thus fail to give an instructive assessment. As a result, we propose to address the problem from both the evaluation and the methodology. Firstly, we design a new evaluation framework that fully respects the inter-novel-protein interactions and gives consistent assessment across datasets. Secondly, we argue that correlations between proteins must provide useful information for analysis of novel proteins, and based on this, we propose a graph neural network based method (GNN-PPI) for better inter-novel-protein interaction prediction. Experimental results on real-world datasets of different scales demonstrate that GNN-PPI significantly outperforms state-of-the-art PPI prediction methods, especially for the inter-novel-protein interaction prediction.

Contribution

  1. We design a new evaluation framework that fully respects the inter-novel-protein interactions and gives consistent assessment across datasets.

    An example of the testset construction strategies under the new evaluation framework. Random is the current scheme, while Breadth-First Search (BFS) and Depth-First Search (DFS) are the proposed schemes.
  2. We propose to incorporate correlation between proteins into the PPI prediction problem. A graph neural network based method is presented to model the correlations.

    Development and evaluation of the GNN-PPI framework. Pairwise interaction data are first assembled to build the graph, where proteins serve as the nodes and interactions as the edges. The testset is constructed by first selecting the root node and then performing the proposed BFS or DFS strategy. The model first embeds each protein to obtain predefined features, which are processed by Convolution, Pooling, BiGRU and FC modules to extract protein-independent encoding (PIE) features; these are then aggregated by graph convolutions to yield protein-graph encoding (PGE) features. Features of the two proteins in an interaction are multiplied and classified, supervised by the trainset labels.
  3. The proposed GNN-PPI model achieves state-of-the-art performance in real datasets of different scales, especially for the inter-novel-protein interaction prediction.

    For further investigation, we divide the testset into BS, ES and NS subsets, where BS denotes that Both proteins of an interacting pair were Seen during training, ES denotes that Either (but not both) of them was Seen, and NS denotes that Neither protein was Seen during training. We regard ES and NS as inter-novel-protein interactions. Existing methods suffer significant performance degradation when tested on unseen protein-protein interactions, especially inter-novel-protein interactions. In contrast, GNN-PPI handles this situation well: its performance does not drop greatly on any of the BS, ES or NS subsets.
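
The BS/ES/NS division described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in gnn_test.py: the function name classify_test_edges, the edge representation as pairs of protein IDs, and the toy protein IDs are all assumptions made for the example.

```python
def classify_test_edges(train_edges, test_edges):
    """Split test edges into BS, ES and NS by how many endpoints were seen in training."""
    # Every protein that appears in at least one training interaction counts as "seen".
    seen = {protein for edge in train_edges for protein in edge}
    subsets = {"BS": [], "ES": [], "NS": []}
    for a, b in test_edges:
        n_seen = (a in seen) + (b in seen)
        key = {2: "BS", 1: "ES", 0: "NS"}[n_seen]
        subsets[key].append((a, b))
    return subsets


# Toy data (hypothetical protein IDs).
train_edges = [("P1", "P2"), ("P2", "P3")]
test_edges = [("P1", "P3"), ("P3", "P4"), ("P4", "P5")]
subsets = classify_test_edges(train_edges, test_edges)
print(subsets)
```

Here ("P1", "P3") falls into BS, ("P3", "P4") into ES, and ("P4", "P5") into NS; the last two are the inter-novel-protein interactions.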

Experimental Results

We evaluate multi-label PPI prediction performance using micro-F1. Micro-averaging pools the true positives, false positives and false negatives over all samples and labels, so every sample-label decision carries the same weight and the common labels in the dataset contribute the most.
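
As a small worked example of the metric, the sketch below computes micro-F1 for multi-label predictions in plain Python. It assumes labels are given as 0/1 indicator rows (one row per protein pair, one column per interaction type); the function name micro_f1 and the toy data are illustrative and not taken from the repository.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 over binary indicator rows (one row per sample)."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 1 and p == 0:
                fn += 1
    # F1 = 2TP / (2TP + FP + FN), with counts pooled over all samples and labels.
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Three protein pairs, three hypothetical interaction types.
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 1], [1, 1, 0]]
score = micro_f1(y_true, y_pred)
print(score)  # 4 TP, 1 FP, 1 FN -> 8 / 10 = 0.8
```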

Benchmark

  • Performance of GNN-PPI against comparative methods over different datasets and data partition schemes.

In-depth Analysis

  • In-depth analysis between PIPR and GNN-PPI over BS, ES and NS subsets.

Model Generalization

  • Testing on trainset-homologous testset vs. unseen testset, under different evaluations.

PPI Network Graph Construction

  • The impact of the PPI network graph construction method.

Using GNN_PPI

This repository contains:

  • Environment Setup
  • Data Processing
  • Training
  • Testing
  • Inference

Environment Setup

Base environment: Python 3.7, CUDA 10.2, PyTorch 1.6, torchvision 0.7.0, tensorboardX 2.1
PyTorch Geometric (pytorch-geometric):
pip install torch-scatter -f https://pytorch-geometric.com/whl/torch-1.6.0+cu102.html
pip install torch-sparse -f https://pytorch-geometric.com/whl/torch-1.6.0+cu102.html
pip install torch-cluster -f https://pytorch-geometric.com/whl/torch-1.6.0+cu102.html
pip install torch-spline-conv -f https://pytorch-geometric.com/whl/torch-1.6.0+cu102.html
pip install torch-geometric

Data Processing

The data processing code is in gnn_data.py (class GNN_DATA) and includes:

  • Data reading (def __init__)
  • Protein vectorization (def get_feature_origin)
  • PyG data generation (def generate_data)
  • Data partition (def split_dataset)
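
To make the data partition step more concrete, here is a rough sketch of how a BFS- or DFS-style testset could be grown from a root protein. It is an assumption about the general idea, not the implementation of split_dataset: the function name select_test_edges, its parameters, and the stopping rule (stop as soon as test_size edges are collected) are hypothetical.

```python
import random
from collections import deque


def select_test_edges(edges, test_size, root=None, depth_first=False, seed=0):
    """Return indices of about test_size edges reached by traversing from root."""
    # Adjacency list: protein -> list of (neighbour, edge index).
    adjacency = {}
    for idx, (a, b) in enumerate(edges):
        adjacency.setdefault(a, []).append((b, idx))
        adjacency.setdefault(b, []).append((a, idx))
    if root is None:
        root = random.Random(seed).choice(sorted(adjacency))

    visited = {root}
    frontier = deque([root])
    selected, selected_set = [], set()
    while frontier and len(selected) < test_size:
        # A queue (popleft) gives breadth-first order, a stack (pop) a depth-first style order.
        node = frontier.pop() if depth_first else frontier.popleft()
        for neighbour, idx in adjacency[node]:
            if idx not in selected_set:
                selected_set.add(idx)
                selected.append(idx)
            if neighbour not in visited:
                visited.add(neighbour)
                frontier.append(neighbour)
            if len(selected) >= test_size:
                break
    return selected


# Toy interaction graph (hypothetical protein IDs).
edges = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D"), ("D", "E")]
bfs_test = select_test_edges(edges, test_size=3, root="A")
dfs_test = select_test_edges(edges, test_size=3, root="A", depth_first=True)
train_index = [i for i in range(len(edges)) if i not in set(bfs_test)]
print(bfs_test, dfs_test, train_index)
```

Because the selected edges cluster around the root, many test proteins end up with no training edges at all, which is what produces ES and NS interactions in the testset.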

Training

The training code is in gnn_train.py, and the run script is run.py.

# Build the training command; each flag's meaning is noted beside its value.
args = [
    ("description", description),                        # description of the current training task
    ("ppi_path", ppi_path),                              # PPI dataset
    ("pseq_path", pseq_path),                            # protein sequences
    ("vec_path", vec_path),                              # pretrained protein embedding
    ("split_new", split_new),                            # whether to generate a new data partition or reuse the previous one
    ("split_mode", split_mode),                          # data split mode
    ("train_valid_index_path", train_valid_index_path),  # data partition JSON file path
    ("use_lr_scheduler", use_lr_scheduler),              # whether to use a learning rate scheduler during training
    ("save_path", save_path),                            # directory for the saved model, config and results
    ("graph_only_train", graph_only_train),              # PPI network graph construction method, True: GCT, False: GCA
    ("batch_size", batch_size),                          # batch size
    ("epochs", epochs),                                  # number of training epochs
]
command = "python -u gnn_train.py " + " ".join("--{}={}".format(k, v) for k, v in args)
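
The --graph_only_train flag above switches between two ways of building the PPI network graph. As a hedged illustration, assuming GCT means the graph is built from trainset edges only and GCA means it is built from all edges, the choice could look like the sketch below. The function name build_graph_edges and the toy data are hypothetical and not part of gnn_train.py.

```python
def build_graph_edges(all_edges, train_index, graph_only_train):
    """Pick the edges used for message passing in the PPI graph."""
    if graph_only_train:
        # Assumed GCT behaviour: only training interactions form the graph.
        return [all_edges[i] for i in train_index]
    # Assumed GCA behaviour: every known interaction forms the graph.
    return list(all_edges)


# Toy data (hypothetical protein IDs); edges 0 and 1 are the trainset.
all_edges = [("A", "B"), ("B", "C"), ("C", "D")]
train_index = [0, 1]
gct_edges = build_graph_edges(all_edges, train_index, graph_only_train=True)
gca_edges = build_graph_edges(all_edges, train_index, graph_only_train=False)
print(gct_edges)
print(gca_edges)
```

Under this reading, the first setting keeps test interactions out of the graph structure, while the second lets the graph contain test edges even though their labels are not used for supervision.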

Dataset Download:

STRING (we use the Homo sapiens subset):

SHS27k and SHS148k:

Download path for the processed dataset used by this repository:

Testing

The testing code is in gnn_test.py and gnn_test_bigger.py, and the run scripts are run_test.py and run_test_bigger.py.

gnn_test.py: tests the overall performance, and can also run the in-depth analysis that reports performance on the different test subsets separately.
gnn_test_bigger.py: compares performance on the trainset-homologous testset with performance on the unseen testset.
Run it with the script run_test_bigger.py, in the same way as above.

Inference

If you have your own dataset or want to save the prediction results, you can use inference.py. After execution, a PPI CSV file is generated that records the predicted PPI type of each pair of interacting proteins.

Run it with the script run_inference.py, in the same way as above.

Citation

If you find this project useful for your research, please use the following BibTeX entry.

@misc{lv2021learning,
    title={Learning Unknown from Correlations: Graph Neural Network for Inter-novel-protein Interaction Prediction}, 
    author={Guofeng Lv and Zhiqiang Hu and Yanguang Bi and Shaoting Zhang},
    year={2021},
    eprint={2105.06709},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}