SNV calling pipeline developed explicitly to process individual or trio vcf files obtained from Illumina based pipeline (grch37/grch38).

Overview

SNV Pipeline

SNV calling pipeline developed explicitly to process individual or trio vcf files obtained from Illumina based pipeline (grch37/grch38). The pipeline requires user defined datasets & annotation sources, available tools and input set of vcf files. It generates analysis scripts that can be incorporated into high performance cluster (HPC) computing to process the samples. This results in list of filtered variants per family that can be used for interpreation, reporting and further downstream analysis.

For demonstration purpose below example is presented for GRCh37. However, the same can be replicated for GRCh38.

Installation

git clone https://github.com/ajaarma/snv.git

Required Installation packages

Install anaconda v2.0
Follow this link for installation: https://docs.anaconda.com/anaconda/install/linux/
Conda environment commands
$ conda create --name snv
$ source activate snv
$ conda install python=2.7.16
$ pip install xmltodict
$ pip install dicttoxml

$ conda install -c bioconda gvcfgenotyper
$ conda install -c anaconda gawk	
$ conda install samtools=1.3
$ conda install vcftools=0.1.14
$ conda install bcftools=1.9
$ conda install gcc #(OSX)
$ conda install gcc_linux-64 #(Linux)
$ conda install parallel
$ conda install -c mvdbeek ucsc_tools
** conda-develop -n 
    
    
     /demo/softwares/vep/Plugins/

$ conda install -c r r-optparse
$ conda install -c r r-dplyr
$ conda install -c r r-plyr
$ conda install -c r r-data.table
$ conda install -c aakumar r-readbulk
$ conda install -c bioconda ensembl-vep=100.4
$ vep_install -a cf -s homo_sapiens -y GRCh37 -c 
     
      /demo/softwares/vep/grch37 --CONVERT
$ vep_install -a cf -s homo_sapiens -y GRCh38 -c 
      
       /demo/softwares/vep/grch38 --CONVERT

      
     
    
   

Data directory and datasets

Default datasets provided
1. exac_pli: demo/resources/gnomad/grch37/gnomad.v2.1.1.lof_metrics.by_transcript_forVEP.txt
2. ensembl: demo/resources/ensembl/grch37/ensBioMart_grch37_v98_ENST_lengths_191208.txt
3. region-exons: demo/resources/regions/grch37/hg19_refseq_ensembl_exons_50bp_allMT_hgmd_clinvar_20200519.txt
4. region-pseudo-autosomal: demo/resources/regions/grch37/hg19_non_pseudoautosomal_regions_X.txt
5. HPO: demo/resources/hpo/phenotype_to_genes.tar.gz
Other datasets that require no entry to user-configuration file
6. Curated: 
	6.1. Genelist: demo/resources/curated/NGC_genelist_allNamesOnly-20200519.txt
	6.2. Somatic mosaicism genes: demo/resources/curated/haem_somatic_mosaicism_genes_20191015.txt
	6.3. Imprinted gene list: demo/resources/curated/imprinted_genes_20200424.txt
	6.4. Polymorphic gene list: demo/resources/curated/polymorphic_genes_20200509.txt
7. OMIM: demo/resources/omim/omim_20200421_geneInfoBase.txt

Download link for following dataset and place them in corresponding directories as shown

' | awk -v OFS="\t" '{ if(/^#/){ print }else{ print $1,$2,$3,$4,$5,$6,$7,"ID="$3";"$8 } }' | bgzip -c > hgmd_pro_2019.4_hg19_wID.vcf.gz $ bcftools index -t hgmd_pro_2019.4_hg19_wID.vcf.gz $ bcftools index hgmd_pro_2019.4_hg19_wID.vcf.gz Put it in this directory: demo/resources/hgmd/grch37/hgmd_pro_2019.4_hg19_wID.vcf.gz Edit the user config flat file CONFIG/UserConfig.txt : hgmd=hgmd/grch37/hgmd_pro_2019.4_hg19_wID.vcf.gz 8. CLINVAR: Download link: https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37/weekly/clinvar_20200506.vcf.gz https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37/weekly/clinvar_20200506.vcf.gz.tbi Put it in this directory: demo/resources/clinvar/grch37/clinvar_20200506.vcf.gz Edit the user config flat file CONFIG/UserConfig.txt : clinvar=clinvar/grch37/clinvar_20200506.vcf.gz ">
1. HPO: Extract HPO phenotypes mapping:
	$ cd 
   
    /demo/resources/hpo/
	$ tar -zxvf phenotypes_to_genes.tar.gz 

2. REFERENCE SEQUENCE GENOME (FASTA file alongwith Index)
	Download link: https://drive.google.com/drive/folders/1Ro3pEYhVdYkMmteSr8YRPFeTvb_K0lVf?usp=sharing
	Download file: Homo_sapiens.GRCh37.74.dna.fasta
		Get the corresponding index and dict files: *.fai and *.dict
	Put this in folder: demo/resources/genomes/grch37/Homo_sapiens.GRCh37.74.dna.fasta

3. GNOMAD
	Download link (use wget): 
	Genomes: https://storage.googleapis.com/gnomad-public/release/2.1.1/vcf/genomes/gnomad.genomes.r2.1.1.sites.vcf.bgz
	Exomes: https://storage.googleapis.com/gnomad-public/release/2.1.1/vcf/exomes/gnomad.exomes.r2.1.1.sites.vcf.bgz
	Put it in this folder: 
		demo/resources/gnomad/grch37/gnomad.genomes.r2.1.1.sites.vcf.bgz
		demo/resources/gnomad/grch37/gnomad.exomes.r2.1.1.sites.vcf.bgz
	Edit User config flat file CONFIG/UserConfig.txt : 
		gnomad_g=gnomad/grch37/gnomad.genomes.r2.1.1.sites.vcf.bgz
		gnomad_e=gnomad/grch37/gnomad.exomes.r2.1.1.sites.vcf.bgz

4. ExAC:
	Download Link: https://drive.google.com/drive/folders/11Ya8XfAxOYmlKZ9mN8A16IDTLHdHba_0?usp=sharing
	Download file: ExAC.r0.3.1.sites.vep.decompose.norm.prefixed_PASS-only.vcf.gz
		also the index files (*.csi and *.tbi)
	Put it in this folder as: 
		demo/resources/exac/grch37/ExAC.r0.3.1.sites.vep.decompose.norm.prefixed_PASS-only.vcf.gz
	Edit User config flat file CONFIG/UserConfig.txt : 
		exac=exac/grch37/ExAC.r0.3.1.sites.vep.decompose.norm.prefixed_PASS-only.vcf.gz
		exac_t=exac/grch37/ExAC.r0.3.1.sites.vep.decompose.norm.prefixed_PASS-only.vcf.gz

5. CADD:
	Download link (use wget):
		https://krishna.gs.washington.edu/download/CADD/v1.6/GRCh37/whole_genome_SNVs.tsv.gz
		https://krishna.gs.washington.edu/download/CADD/v1.6/GRCh37/InDels.tsv.gz
		(Also download the corresponding tabix index files as well)
	Put it in this directory: 
		demo/resources/cadd/grch37/whole_genome_SNVs.tsv.gz
		demo/resource/cadd/grch37/InDels.tsv.gz
	Edit the user config flat file CONFIG/UserConfig.txt :
		cadd_snv=cadd/grch37/whole_genome_SNVs.tsv.gz
		cadd_indel=cadd/grch37/InDels.tsv.gz

6. REVEL:
	Download link: https://drive.google.com/drive/folders/12Tl1YU5bI-By_VawTPVWHef7AXzn4LuP?usp=sharing
	Download file: new_tabbed_revel.tsv.gz
	         Also the index file: *.tbi
	Put it in this directory: demo/resources/revel/grch37/new_tabbed_revel.tsv.gz
	Edit the user config flat file CONFIG/UserConfig.txt : 
		revel=revel/grch37/new_tabbed_revel.tsv.gz

7. HGMD:
	Download link: http://www.hgmd.cf.ac.uk/ac/index.php (Require personal access login)
	Put it in this directory: demo/resources/hgmd/grch37/hgmd_pro_2019.4_hg19.vcf

	Use this command to process HGMD file inside this directory:
		$ cat hgmd_pro_2019.4_hg19.vcf | sed '/##comment=.*\"/a  ##INFO=
    
     ' | awk -v OFS="\t" '{ if(/^#/){ print }else{ print $1,$2,$3,$4,$5,$6,$7,"ID="$3";"$8 } }' | bgzip -c  > hgmd_pro_2019.4_hg19_wID.vcf.gz
		$ bcftools index -t hgmd_pro_2019.4_hg19_wID.vcf.gz
		$ bcftools index hgmd_pro_2019.4_hg19_wID.vcf.gz	

	Put it in this directory: demo/resources/hgmd/grch37/hgmd_pro_2019.4_hg19_wID.vcf.gz
	Edit the user config flat file CONFIG/UserConfig.txt :
		hgmd=hgmd/grch37/hgmd_pro_2019.4_hg19_wID.vcf.gz

8. CLINVAR:
	Download link: 
		https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37/weekly/clinvar_20200506.vcf.gz
		https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37/weekly/clinvar_20200506.vcf.gz.tbi
	Put it in this directory: demo/resources/clinvar/grch37/clinvar_20200506.vcf.gz
	Edit the user config flat file CONFIG/UserConfig.txt :
		clinvar=clinvar/grch37/clinvar_20200506.vcf.gz

    
   
Customized Curated Annotation sets
Default present with this distribution. Can be found in XML file with these tags:
	(1) GeneList: 
   
    
	(2) Somatic mosaicism genes: 
    
     
	(3) Imprinted genes: 
     
      
	(4) Polymorphic genes: 
      
       
	(5) HPO terms: 
       
         (6) OMIM: 
         
        
       
      
     
    
   

Activate the conda environment

$ source activate snv

Step - 1:

1. Edit CONFIG/UserConfig.txt: 
	(a) Add the absolute path prefix for the resources directory with tag: resourceDir. 
	    An example can be seen in CONFIG/Example-UserConfig.txt file.
	(b) Manually check if datasets corresponding to other field tags are correctly downloaded and 
	    put in respective folders.
2. Create user defined XML file from input User Configuration flat file and Base-XML file
Command:
$ python createAnalysisXML.py -u 
   
     
		    	      -b 
    
      
		              -o 
     

     
    
   
Example:
$ python createAnalysisXML.py -u CONFIG/UserConfig.txt 
		       	      -b CONFIG/Analysis_base_grch37.xml 
		              -o CONFIG/Analysis_user_grch37.xml
Outputs:
CONFIG/Analysis_user_grch37.xml

Step-2:

1. Put the respective vcf files in the directory. For example: demo/example/vcf/ 
2. Create manifest file in same format as shown in demo/example/example_manifest.txt
3. Assign gender to each family members (illumina or sample id). For example: demo/example/example_genders.txt
4. List of all the family ids that needed to be analyzed.
         For e.g: demo/example/manifest/example_family_analysis.txt

Step -3:

Generate all the shell scripts that can be incorporated into user specific HPC cluster network. For e.g: Slurm/PBS/LSF network.

Command:
$ python processSNV.py 	-a 
   
    
	    		-p 
    
     
	      		-m 
     
      
	     		-e 
      
       
			-w 
       
         -g 
        
          -d 
         
           -f 
          
            -s 
           
             -r 
             
            
           
          
         
        
       
      
     
    
   
Example:
$ python processSNV.py 	-a CONFIG/Analysis_user_grch37.xml \ 
			-p 20210326 \
			-m 
   
    /demo/example/example_manifest.txt \
			-e gvcfGT \
			-w 
    
     /demo/example/ \
			-g 
     
      /demo/example/example_genders.txt \
			-d 
      
       /demo/example/exeter_samples_norm.fof \
			-f 
       
        /demo/example/manifest/example_family_analysis.txt \ -s 
        
         /demo/example/manifest/example_family.fof \ -r 
         
          /demo/example/manifest/example_family_header.txt (optional) 
         
        
       
      
     
    
   
Outputs:
Two scripts in the directory: 
   
    /demo/example/20210326/tmp_binaries/
Launch the scripts in these 2 stages sequentially after each of them gets finished.

   (1) genotypeAndAnnotate_chr%.sh where %=1..22,X,Y and MT
	scatter the annotation and frequency filtering per chromosome for all families.
   (2) mergeAndFilter.sh:
	Merge all the chromosome and apply inheritance filtering.

   

Step-4

Final output of list of filtered variant is present in:

   
    /demo/example/20210326/fam_filter/
    
     /
     
      .filt_
      
       .txt

      
     
    
   
For any questions/issues/bugs please mail us at:
Owner
East Genomics
Bringing together genomic medicine across the East Midlands and East of England
East Genomics
Port of dplyr and other related R packages in python, using pipda.

Unlike other similar packages in python that just mimic the piping syntax, datar follows the API designs from the original packages as much as possible, and is tested thoroughly with the cases from t

179 Dec 21, 2022
ped-crash-techvol: Texas Ped Crash Tech Volume Pack

ped-crash-techvol: Texas Ped Crash Tech Volume Pack In conjunction with the Final Report "Identifying Risk Factors that Lead to Increase in Fatal Pede

Network Modeling Center; Center for Transportation Research; The University of Texas at Austin 2 Sep 28, 2022
Example Of Splunk Search Query With Python And Splunk Python SDK

SSQAuto (Splunk Search Query Automation) Example Of Splunk Search Query With Python And Splunk Python SDK installation: ➜ ~ git clone https://github.c

AmirHoseinTangsiriNET 1 Nov 14, 2021
DataPrep — The easiest way to prepare data in Python

DataPrep — The easiest way to prepare data in Python

SFU Database Group 1.5k Dec 27, 2022
An implementation of the largeVis algorithm for visualizing large, high-dimensional datasets, for R

largeVis This is an implementation of the largeVis algorithm described in (https://arxiv.org/abs/1602.00370). It also incorporates: A very fast algori

336 May 25, 2022
Bigdata Simulation Library Of Dream By Sandman Books

BIGDATA SIMULATION LIBRARY OF DREAM BY SANDMAN BOOKS ================= Solution Architecture Description In the realm of Dreaming, its ruler SANDMAN,

Maycon Cypriano 3 Jun 30, 2022
PySpark bindings for H3, a hierarchical hexagonal geospatial indexing system

h3-pyspark: Uber's H3 Hexagonal Hierarchical Geospatial Indexing System in PySpark PySpark bindings for the H3 core library. For available functions,

Kevin Schaich 12 Dec 24, 2022
Accurately separate the TLD from the registered domain and subdomains of a URL, using the Public Suffix List.

tldextract Python Module tldextract accurately separates the gTLD or ccTLD (generic or country code top-level domain) from the registered domain and s

John Kurkowski 1.6k Jan 03, 2023
Statistical Analysis 📈 focused on statistical analysis and exploration used on various data sets for personal and professional projects.

Statistical Analysis 📈 This repository focuses on statistical analysis and the exploration used on various data sets for personal and professional pr

Andy Pham 1 Sep 03, 2022
Python script for transferring data between three drives in two separate stages

Waterlock Waterlock is a Python script meant for incrementally transferring data between three folder locations in two separate stages. It performs ha

David Swanlund 13 Nov 10, 2021
We're Team Arson and we're using the power of predictive modeling to combat wildfires.

We're Team Arson and we're using the power of predictive modeling to combat wildfires. Arson Map Inspiration There’s been a lot of wildfires in Califo

Jerry Lee 3 Oct 17, 2021
Supply a wrapper ``StockDataFrame`` based on the ``pandas.DataFrame`` with inline stock statistics/indicators support.

Stock Statistics/Indicators Calculation Helper VERSION: 0.3.2 Introduction Supply a wrapper StockDataFrame based on the pandas.DataFrame with inline s

Cedric Zhuang 1.1k Dec 28, 2022
Probabilistic reasoning and statistical analysis in TensorFlow

TensorFlow Probability TensorFlow Probability is a library for probabilistic reasoning and statistical analysis in TensorFlow. As part of the TensorFl

3.8k Jan 05, 2023
WaveFake: A Data Set to Facilitate Audio DeepFake Detection

WaveFake: A Data Set to Facilitate Audio DeepFake Detection This is the code repository for our NeurIPS 2021 (Track on Datasets and Benchmarks) paper

Chair for Sys­tems Se­cu­ri­ty 27 Dec 22, 2022
Making the DAEN information accessible.

The purpose of this repository is to make the information on Australian COVID-19 adverse events accessible. The Therapeutics Goods Administration (TGA) keeps a database of adverse reactions to medica

10 May 10, 2022
Scraping and analysis of leetcode-compensations page.

Leetcode compensations report Scraping and analysis of leetcode-compensations page.

utsav 96 Jan 01, 2023
track your GitHub statistics

GitHub-Stalker track your github statistics 👀 features find new followers or unfollowers find who got a star on your project or remove stars find who

Bahadır Araz 34 Nov 18, 2022
Pandas and Dask test helper methods with beautiful error messages.

beavis Pandas and Dask test helper methods with beautiful error messages. test helpers These test helper methods are meant to be used in test suites.

Matthew Powers 18 Nov 28, 2022
A data analysis using python and pandas to showcase trends in school performance.

A data analysis using python and pandas to showcase trends in school performance. A data analysis to showcase trends in school performance using Panda

Jimmy Faccioli 0 Sep 07, 2021
AWS Glue ETL Code Samples

AWS Glue ETL Code Samples This repository has samples that demonstrate various aspects of the new AWS Glue service, as well as various AWS Glue utilit

AWS Samples 1.2k Jan 03, 2023