Precision Medicine Knowledge Graph (PrimeKG)

Overview

Precision Medicine Knowledge Graph (PrimeKG) presents a holistic view of diseases. PrimeKG integrates 20 high-quality biomedical resources to describe 17,080 diseases with 4,050,249 relationships representing ten major biological scales, considerably expanding previous efforts in disease-rooted knowledge graphs. We accompany PrimeKG’s graph structure with text descriptions of clinical guidelines for drugs and diseases to enable multimodal analyses.

Unique Features of PrimeKG

  • Diverse coverage of diseases: PrimeKG contains over 17,000 diseases, including rare diseases. Disease nodes in PrimeKG are densely connected to other nodes in the graph and have been optimized for clinical relevance in downstream precision medicine tasks.
  • Heterogeneous knowledge graph: PrimeKG contains over 100,000 nodes distributed over various biological scales as depicted below. PrimeKG also contains over 4 million relationships between these nodes distributed over 29 types of edges.
  • Multimodal integration of clinical knowledge: Disease and drug nodes in PrimeKG are augmented with clinical descriptors from medical authorities such as Mayo Clinic, Orphanet, and DrugBank.
  • Ready-to-use datasets: PrimeKG is minimally dependent on external packages. Our knowledge graph can be retrieved in a ready-to-use format from Harvard Dataverse.
  • Data functions: PrimeKG provides extensive data functions, including processors for primary resources and scripts to build an updated knowledge graph.
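As a rough illustration of what "ready-to-use" means for an edge-list file, the sketch below counts edges per relation type and distinct nodes per node type. The column names (`relation`, `x_type`, `x_name`, `y_type`, `y_name`) and the sample rows are assumptions made for this example and are not stated in this README; check the header of the file you download before relying on them.

```python
import csv
import io
from collections import Counter

# Tiny stand-in for kg.csv. Column names and rows are hypothetical,
# chosen only to illustrate the idea of an edge list with typed nodes.
SAMPLE_KG = """relation,x_type,x_name,y_type,y_name
drug_protein,drug,DrugA,gene/protein,GeneA
disease_protein,disease,DiseaseA,gene/protein,GeneA
indication,drug,DrugA,disease,DiseaseA
"""


def summarize_edges(handle):
    """Count edges per relation type and distinct nodes per node type."""
    relation_counts = Counter()
    nodes = set()
    for row in csv.DictReader(handle):
        relation_counts[row["relation"]] += 1
        # A node is identified here by its (type, name) pair.
        nodes.add((row["x_type"], row["x_name"]))
        nodes.add((row["y_type"], row["y_name"]))
    node_counts = Counter(node_type for node_type, _ in nodes)
    return relation_counts, node_counts


relation_counts, node_counts = summarize_edges(io.StringIO(SAMPLE_KG))
```

To run the same summary on a downloaded file, pass an open file handle instead of the in-memory sample, for example `summarize_edges(open("kg.csv", newline=""))`.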

[Figure: PrimeKG overview]

[Figure: PrimeKG example]

Environment setup

Using pip

To install the dependencies required to run the PrimeKG code, use pip:

pip install -r requirements.txt

Using conda

conda env create --name PrimeKG --file=environments.yml

Building an updated PrimeKG

Downloading primary data resources

All persistent identifiers and web links for downloading the 20 primary data resources used to build PrimeKG are provided in the Data Records section of our article. The article also lists the exact filenames downloaded from each resource so that they can be checked easily.

Curating primary data resources

We provide the scripts used to process all primary data resources and the names of the resulting output files generated by those scripts. We would be happy to share the intermediate processing datasets that were used to create PrimeKG on request.

| Database | Processing scripts | Expected script output |
|---|---|---|
| Bgee | bgee.py | anatomy_gene.csv |
| Comparative Toxicogenomics Database | ctd.py | exposure_data.csv |
| DisGeNET | - | curated_gene_disease_associations.tsv |
| DrugBank | drugbank_drug_drug.py | drug_drug.csv |
| DrugBank | parsexml_drugbank.ipynb, Parsed_feature.ipynb | 12 drug feature files |
| DrugBank | drugbank_drug_protein.py | drug_protein.csv |
| Drug Central | drugcentral_queries.txt | drug_disease.csv |
| Drug Central | drugcentral_feature.Rmd | dc_features.csv |
| Entrez Gene | ncbigene.py | protein_go_associations.csv |
| Gene Ontology | go.py | go_terms_info.csv, go_terms_relations.csv |
| Human Phenotype Ontology | hpo.py, hpo_obo_parser.py | hp_terms.csv, hp_parents.csv, hp_references.csv |
| Human Phenotype Ontology | hpoa.py | disease_phenotype_pos.csv, disease_phenotype_neg.csv |
| MONDO | mondo.py, mondo_obo_parser.py | mondo_terms.csv, mondo_parents.csv, mondo_references.csv, mondo_subsets.csv, mondo_definitions.csv |
| Reactome | reactome.py | reactome_ncbi.csv, reactome_terms.csv, reactome_relations.csv |
| SIDER | sider.py | sider.csv |
| UBERON | uberon.py | uberon_terms.csv, uberon_rels.csv, uberon_is_a.csv |
| UMLS | umls.py, map_umls_mondo.py | umls_mondo.csv |
| UMLS | umls.ipynb | umls_def_disorder_2021.csv, umls_def_disease_2021.csv |
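Before moving on to graph construction, it can help to confirm that each processing script produced the files listed in the table above. The following is a minimal sketch of such a check, not part of the PrimeKG codebase. It assumes that all outputs are written to a single directory (the repository may organize them differently), and the helper name `missing_outputs` is hypothetical. Only a few rows of the table are included; the rest can be added in the same way.

```python
from pathlib import Path

# Subset of the script/output table above; extend with the remaining rows as needed.
EXPECTED_OUTPUTS = {
    "bgee.py": ["anatomy_gene.csv"],
    "ctd.py": ["exposure_data.csv"],
    "go.py": ["go_terms_info.csv", "go_terms_relations.csv"],
    "sider.py": ["sider.csv"],
}


def missing_outputs(output_dir, expected=EXPECTED_OUTPUTS):
    """Return {script: [missing files]} for scripts whose outputs are incomplete."""
    output_dir = Path(output_dir)
    report = {}
    for script, filenames in expected.items():
        # Assumption: every output file sits directly inside output_dir.
        missing = [name for name in filenames if not (output_dir / name).is_file()]
        if missing:
            report[script] = missing
    return report
```

An empty dictionary means every listed file was found; otherwise the keys name the scripts that likely need to be rerun.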

Harmonizing datasets into PrimeKG

The code to harmonize datasets and construct PrimeKG is available at build_graph.ipynb. Run this Jupyter notebook to construct the knowledge graph from the outputs of the processing scripts listed above. The notebook produces all three versions of PrimeKG: kg_raw.csv, kg_giant.csv, and the complete version kg.csv.
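This README does not define the three versions. The filename kg_giant.csv suggests a graph restricted to its giant (largest) connected component; that reading is an assumption, and build_graph.ipynb remains the reference for what the notebook actually does. Under that assumption, the general technique can be sketched as follows, with the hypothetical helper `largest_component_edges` operating on a list of `(source, target)` pairs treated as undirected edges.

```python
from collections import defaultdict, deque


def largest_component_edges(edges):
    """Keep only the edges whose endpoints lie in the largest connected component."""
    # Build an undirected adjacency map from the edge list.
    adjacency = defaultdict(set)
    for source, target in edges:
        adjacency[source].add(target)
        adjacency[target].add(source)

    seen = set()
    largest = set()
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first search collects every node reachable from start.
        component = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in component:
                    component.add(neighbor)
                    queue.append(neighbor)
        seen |= component
        if len(component) > len(largest):
            largest = component

    return [(s, t) for s, t in edges if s in largest and t in largest]
```

For example, given the edges `a-b`, `b-c`, and `x-y`, the function keeps `a-b` and `b-c` and drops the isolated pair `x-y`.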

Feature extraction

The code required to engineer features can be found at engineer_features.ipynb and mapping_mayo.ipynb.

Cite Us

If you find PrimeKG useful, cite our work:

@article{chandak2022building,
  title={Building a knowledge graph to enable precision medicine},
  author={Chandak, Payal and Huang, Kexin and Zitnik, Marinka},
  journal={bioRxiv},
  doi={10.1101/2022.05.01.489928},
  URL={https://www.biorxiv.org/content/early/2022/05/01/2022.05.01.489928},
  year={2022}
}

Data Server

PrimeKG is hosted on Harvard Dataverse under the persistent identifier https://doi.org/10.7910/DVN/IXA7BM. PrimeKG datasets cannot be retrieved while Dataverse is under maintenance. This happens rarely; if a download fails, check the status on the Dataverse website.
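Because the server can be temporarily unavailable, a download script may want to retry a few times before giving up. The sketch below shows one generic way to do that with the Python standard library. The helper name `fetch_with_retry` and its parameters are hypothetical, and no specific file URL is assumed here: pass whichever file link the Dataverse page for the identifier above provides.

```python
import time
import urllib.error
import urllib.request


def fetch_with_retry(url, attempts=3, wait_seconds=5.0, opener=urllib.request.urlopen):
    """Return the response body as bytes, retrying when the server is unreachable."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            with opener(url) as response:
                return response.read()
        except urllib.error.URLError as error:  # HTTPError is a subclass of URLError
            last_error = error
            if attempt < attempts:
                # Pause before the next attempt; no pause after the final failure.
                time.sleep(wait_seconds)
    raise RuntimeError(f"giving up after {attempts} attempts") from last_error
```

The `opener` argument exists only so the function can be exercised without network access; in normal use the default `urllib.request.urlopen` is sufficient. Note that this reads the whole response into memory, so a streaming approach may be preferable for large files.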

License

The PrimeKG codebase is released under the MIT license. For individual dataset usage, please refer to the dataset licenses listed on the website.

Owner
Machine Learning for Medicine and Science @ Harvard