HG002 QC Snakemake

To Run

Resources and data specified within snakefile (hg002QC.smk) for simplicity. Tested with snakemake v6.15.3.

Warning: Several steps of this workflow require minimum coverage. It's recommended that this workflow not be run when yield in base pairs is insufficient to produceat least 15X coverage (i.e. yield/3099922541 >= 15x).

# clone repo
git clone --recursive https://github.com/PacificBiosciences/pb-human-wgs-workflow-snakemake.git workflow

# make necessary directories
mkdir cluster_logs

# create conda environment
conda env create --file workflow/environment.yaml

# activate conda environment
conda activate pb-human-wgs-workflow

# submit job
sbatch workflow/run_hg002QC.sh

Plots

A list of important stats from target files that would be good for plotting.

targets = [f"conditions/{condition}/{filename}"
                    for condition in ubam_dict.keys()
                    for filename in ["smrtcell_stats/all_movies.read_length_and_quality.tsv",
                                    "hifiasm/asm.p_ctg.fasta.stats.txt",
                                    "hifiasm/asm.a_ctg.fasta.stats.txt",
                                    "hifiasm/asm.p_ctg.qv.txt",
                                    "hifiasm/asm.a_ctg.qv.txt",
                                    "truvari/summary.txt",
                                    "pbsv/all_chroms.pbsv.vcf.gz",
                                    "deepvariant/deepvariant.vcf.stats.txt",
                                    "whatshap/deepvariant.phased.tsv",
                                    "happy/all.summary.csv",
                                    "happy/all.extended.csv",
                                    "happy/cmrg.summary.csv",
                                    "happy/cmrg.extended.csv",
                                    "mosdepth/coverage.mosdepth.summary.txt",
                                    "mosdepth/mosdepth.M2_ratio.txt",
                                    "mosdepth/gc_coverage.summary.txt",
                                    "mosdepth/coverage.thresholds.summary.txt"]]

smrtcell_stats/all_movies.read_length_and_quality.tsv
- outputs 3 columns (read name, read length, read quality)
- boxplots of read length and quality
hifiasm/asm.p_ctg.fasta.stats.txt (primary) + hifiasm/asm.a_ctg.fasta.stats.txt (alternate)
- all stats below should be collected for both primary (p_ctg) and alternate (p_atg) assemblies
- assembly size awk '$1=="SZ" {print $2}' <filename>
- auN (area under the curve) awk '$1=="AU" {print $2}' <filename>
- NGx - line plot of NG10 through NG90 awk '$1=="NL" {print $2 $3}' <filename> ($2 is x-axis, $3 y-axis) like this: example plot
hifiasm/asm.p_ctg.qv.txt + hifiasm/asm.a_ctg.qv.txt
- adjusted assembly quality awk '$1=="QV" {print $3}' <filename> for primary and alternate assemblies
truvari/truvari.summary.txt
- structural variant recall jq .recall <filename>
- structural variant precision jq .precision <filename>
- structural variant f1 jq .f1 <filename>
- number of calls jq '."call cnt"' <filename>
- FP jq .FP <filename>
- TP-call jq .TP-call <filename>
- FN jq .FN <filename>
- TP-base jq .TP-base <filename>
pbsv/all_chroms.pbsv.vcf.gz
- counts of each type of variant bcftools query -i 'FILTER=="PASS"' -f '%INFO/SVTYPE\n' <filename> | awk '{A[$1]++}END{for(i in A)print i,A[i]}'
- can also do size distributions of indels bcftools query -i 'FILTER=="PASS" && (INFO/SVTYPE=="INS" | INFO/SVTYPE=="DEL")' -f '%INFO/SVTYPE\t%INFO/SVLEN\n' <filename>
deepvariant/deepvariant.vcf.stats.txt
- several values in lines starting with 'SN' awk '$1=="SN"' <filename>
  - number of SNPS
  - number INDELs
  - number of multi-allelic sites
  - number of multi-allelic SNP sites
- ratio of transitions to transversions awk '$1=="TSTV" {print$5}' <filename>
- can monitor substitution types awk '$1=="ST"' <filename>
- SNP heterozygous : non-ref homozygous ratio awk '$1=="PSC" {print $6/$5}' <filename>
- SNP transitions : transversions awk '$1=="PSC" {print $7/$8}' <filename>
- Number of heterozygous insertions : number of homozgyous alt insertions awk '$1=="PSI" {print $8/$10}' <filename>
- Number of heterozygous deletions : number of homozgyous alt deletions awk '$1=="PSI" {print $9/$11}' <filename>
- Total INDEL heterozygous:homozygous ratio awk '$1=="PSI" {print ($8+$9)/($10+$11)}' <filename>8+9:10+11 indel het:hom)
whatshap/deepvariant.phased.tsv
- phase block N50 awk '$2=="ALL" {print $22}' <filename>
- bp_per_block_sum (total number of phased bases) awk '$2=="ALL" {print $18}' <filename>
whatshap/deepvariant.phased.blocklist
- calculate phase block size (to - from) and reverse order them (awk 'NR>1 {print $5-$4}' <filename> |sort -nr), then plot as cumulative line graph like for assembly, N_0 to N90 example plot
happy/all.summary.csv + happy/cmrg.summary.csv
- stats should be collected for all variants and cmrg challenging medically relevant genes
  - SNP recall awk -F, '$1=="SNP" && $2=="PASS" {print $10}' <filename>
  - SNP precision awk -F, '$1=="SNP" && $2=="PASS" {print $11}' <filename>
  - SNP F1 awk -F, '$1=="SNP" && $2=="PASS" {print $13}' <filename>
  - INDEL recall awk -F, '$1=="INDEL" && $2=="PASS" {print $10}' <filename>
  - INDEL precision awk -F, '$1=="INDEL" && $2=="PASS" {print $11}' <filename>
  - INDEL F1 awk -F, '$1=="INDEL" && $2=="PASS" {print $13}' <filename>
happy/all.extended.csv + happy/cmrg.extended.csv
- there are many stratifications that can be examined, and Aaron Wenger might have opinionso n which are most important. The below commands are just for one stratification "GRCh38_lowmappabilityall.bed.gz".
- SNP GRCh38_lowmappabilityall recall awk -F, '$1=="SNP" && $2=="*" && $3=="GRCh38_lowmappabilityall.bed.gz" && $4=="PASS" {print $8}' <filename>
- SNP GRCh38_lowmappabilityall precision awk -F, '$1=="SNP" && $2=="*" && $3=="GRCh38_lowmappabilityall.bed.gz" && $4=="PASS" {print $9}' <filename>
- SNP GRCh38_lowmappabilityall F1 awk -F, '$1=="SNP" && $2=="*" && $3=="GRCh38_lowmappabilityall.bed.gz" && $4=="PASS" {print $11}' <filename>
- INDEL GRCh38_lowmappabilityall recall awk -F, '$1=="INDEL" && $2=="*" && $3=="GRCh38_lowmappabilityall.bed.gz" && $4=="PASS" {print $8}' <filename>
- INDEL GRCh38_lowmappabilityall precision awk -F, '$1=="INDEL" && $2=="*" && $3=="GRCh38_lowmappabilityall.bed.gz" && $4=="PASS" {print $9}' <filename>
- INDEL GRCh38_lowmappabilityall F1 awk -F, '$1=="INDEL" && $2=="*" && $3=="GRCh38_lowmappabilityall.bed.gz" && $4=="PASS" {print $11}' <filename>
mosdepth/coverage.mosdepth.summary.txt
- mean aligned coverage in "coverage.mosdepth.summary.txt" - 4th column of final row, can grep 'total_region'
mosdepth/mosdepth.M2_ratio.txt
- outputs single value: ratio of chr2 coverage to chrM coverage
- bar chart of m2 ratio
mosdepth/gc_coverage.summary.txt
- outputs 5 columns: gc percentage bin, q1 , median , q3 , count
- q1, median, q3 columns are statistics for coverage at different gc percentages (e.g. median cover at 30% GC)
- "count" refers to # of 500 bp windows that fall in that bin
- can pick a couple of key GC coverage bins and make box plots out of them
mosdepth/coverage.thresholds.summary.txt
- outputs 10 columns corresponding to % of genome sequenced to minimum coverage depths (1X - 10X)
- maybe a line chart comparing the different coverage thresholds among conditions

Hg002-qc-snakemake - HG002 QC Snakemake

Related tags

Overview

HG002 QC Snakemake

To Run

Plots

Owner

Juniper A. Lake

Statsmodels: statistical modeling and econometrics in Python

PCAfold is an open-source Python library for generating, analyzing and improving low-dimensional manifolds obtained via Principal Component Analysis (PCA).

Python Implementation of Scalable In-Memory Updatable Bitmap Indexing

Helper tools to construct probability distributions built from expert elicited data for use in monte carlo simulations.

:truck: Agile Data Preparation Workflows made easy with dask, cudf, dask_cudf and pyspark

Convert tables stored as images to an usable .csv file

A variant of LinUCB bandit algorithm with local differential privacy guarantee

Recommendations from Cramer: On the show Mad-Money (CNBC) Jim Cramer picks stocks which he recommends to buy. We will use this data to build a portfolio

Methylation/modified base calling separated from basecalling.

Python beta calculator that retrieves stock and market data and provides linear regressions.

Analyzing Covid-19 Outbreaks in Ontario

University Challenge 2021 With Python

A library to create multi-page Streamlit applications with ease.

PostQF is a user-friendly Postfix queue data filter which operates on data produced by postqueue -j.

Flood modeling by 2D shallow water equation

BErt-like Neurophysiological Data Representation

A python package which can be pip installed to perform statistics and visualize binomial and gaussian distributions of the dataset

Churn prediction with PySpark

ETL flow framework based on Yaml configs in Python

A project consists in a set of assignements corresponding to a BI process: data integration, construction of an OLAP cube, qurying of a OPLAP cube and reporting.