HiFi DeepVariant + WhatsHap workflowHiFi DeepVariant + WhatsHap workflow

Overview

HiFi DeepVariant + WhatsHap workflow

Workflow steps

  1. align HiFi reads to reference with pbmm2
  2. call small variants with DeepVariant, using two-pass method (DeepVariant ➡️ WhatsHap phase ➡️ WhatsHap haplotag ➡️ DeepVariant)
  3. phase small variants with WhatsHap
  4. haplotag aligned BAMs with WhatsHap and merge

Directory structure within basedir

.
├── cluster_logs  # slurm stderr/stdout logs
├── reference
│   ├── reference.chr_lengths.txt  # cut -f1,2 reference.fasta > reference.chr_lengths.txt
│   ├── reference.fasta
│   └── reference.fasta.fai
├── samples
│   └── 
   
      # sample_id regex: r'[A-Za-z0-9_-]+'
│       ├── whatshap/  # phased small variants; merged haplotagged alignments
│       ├── logs/  # per-rule stdout/stderr logs
│       ├── aligned/  # intermediate
│       ├── deepvariant/  # intermediate
│       ├── deepvariant_intermediate/  # intermediate
│       └── whatshap_intermediate/  # intermediate
├── smrtcells
│   ├── done  # move folders from smrtcells/ready to smrtcells/done to prevent re-processing
│   └── ready
│       └── 
    
       # uBAMs or FASTQs per sample
│                        # filename regex: r'm\d{5}[Ue]?_\d{6}_\d{6}).(ccs|hifi_reads).bam' or r'm\d{5}[Ue]?_\d{6}_\d{6}).fastq.gz'
└── workflow  # clone of this repo

    
   

To run the pipeline

$ conda create \
    --channel bioconda \
    --channel conda-forge \
    --prefix ./conda_env \
    python=3 snakemake mamba lockfile

$ conda activate ./conda_env

$ sbatch workflow/run_snakemake.sh <sample_id>
Owner
William Rowell
Geneticist & 1x programmer on a good day. BFX applications @PacificBiosciences.
William Rowell
Spert NLP Relation Extraction API deployed with torchserve for inference

URLMask Python program for Linux users to change a URL to ANY domain. A program than can take any url and mask it to any domain name you like. E.g. ne

Zichu Chen 1 Nov 24, 2021
GCRC: A Gaokao Chinese Reading Comprehension dataset for interpretable Evaluation

GCRC GCRC: A New Challenging MRC Dataset from Gaokao Chinese for Explainable Eva

Yunxiao Zhao 5 Nov 04, 2022
English loanwords in the world's languages

Wiktionary as CLDF Content cldf1 and cldf2 contain cldf-conform data sets with a total of 2 377 756 entries about the vocabulary of all 1403 languages

Viktor Martinović 3 Jan 14, 2022
NLP-SentimentAnalysis - Coursera Course ( Duration : 5 weeks ) offered by DeepLearning.AI

Coursera Natural Language Processing Specialization This repository contains material related to Coursera Natural Language Processing Specialization.

Nishant Sharma 1 Jun 05, 2022
Unsupervised intent recognition

INTENT author: steeve LAQUITAINE description: deployment pattern: currently batch only Setup & run git clone https://github.com/slq0/intent.git bash

sl 1 Apr 08, 2022
Python library for Serbian Natural language processing (NLP)

SrbAI - Python biblioteka za procesiranje srpskog jezika SrbAI je projekat prikupljanja algoritama i modela za procesiranje srpskog jezika u jedinstve

Serbian AI Society 3 Nov 22, 2022
Almost State-of-the-art Text Generation library

Ps: we are adding transformer model soon Text Gen 🐐 Almost State-of-the-art Text Generation library Text gen is a python library that allow you build

Emeka boris ama 63 Jun 24, 2022
A music comments dataset, containing 39,051 comments for 27,384 songs.

Music Comments Dataset A music comments dataset, containing 39,051 comments for 27,384 songs. For academic research use only. Introduction This datase

Zhang Yixiao 2 Jan 10, 2022
Amazon Multilingual Counterfactual Dataset (AMCD)

Amazon Multilingual Counterfactual Dataset (AMCD)

35 Sep 20, 2022
Smart discord chatbot integrated with Dialogflow to manage different classrooms and assist in teaching!

smart-school-chatbot Smart discord chatbot integrated with Dialogflow to interact with students naturally and manage different classes in a school. De

Tom Huynh 5 Oct 24, 2022
Sequence-to-Sequence learning using PyTorch

Seq2Seq in PyTorch This is a complete suite for training sequence-to-sequence models in PyTorch. It consists of several models and code to both train

Elad Hoffer 514 Nov 17, 2022
Unofficial Parallel WaveGAN (+ MelGAN & Multi-band MelGAN & HiFi-GAN & StyleMelGAN) with Pytorch

Parallel WaveGAN implementation with Pytorch This repository provides UNOFFICIAL pytorch implementations of the following models: Parallel WaveGAN Mel

Tomoki Hayashi 1.2k Dec 23, 2022
A sample project that exists for PyPUG's "Tutorial on Packaging and Distributing Projects"

A sample Python project A sample project that exists as an aid to the Python Packaging User Guide's Tutorial on Packaging and Distributing Projects. T

Python Packaging Authority 4.5k Dec 30, 2022
Pretrain CPM - 大规模预训练语言模型的预训练代码

CPM-Pretrain 版本更新记录 为了促进中文自然语言处理研究的发展,本项目提供了大规模预训练语言模型的预训练代码。项目主要基于DeepSpeed、Megatron实现,可以支持数据并行、模型加速、流水并行的代码。 安装 1、首先安装pytorch等基础依赖,再安装APEX以支持fp16。 p

Tsinghua AI 37 Dec 06, 2022
KLUE-baseline contains the baseline code for the Korean Language Understanding Evaluation (KLUE) benchmark.

KLUE Baseline Korean(한국어) KLUE-baseline contains the baseline code for the Korean Language Understanding Evaluation (KLUE) benchmark. See our paper fo

74 Dec 13, 2022
A Streamlit web app that generates Rick and Morty stories using GPT2.

Rick and Morty Story Generator This project uses a pre-trained GPT2 model, which was fine-tuned on Rick and Morty transcripts, to generate new stories

₸ornike 33 Oct 13, 2022
Transformer related optimization, including BERT, GPT

This repository provides a script and recipe to run the highly optimized transformer-based encoder and decoder component, and it is tested and maintained by NVIDIA.

NVIDIA Corporation 1.7k Jan 04, 2023
मराठी भाषा वाचविण्याचा एक प्रयास. इंग्रजी ते मराठीचा शब्दकोश. An attempt to preserve the Marathi language. A lightweight and ad free English to Marathi thesaurus.

For English, scroll down मराठी शब्द मराठी भाषा वाचवण्यासाठी मी हा ओपन सोर्स प्रोजेक्ट सुरू केला आहे. माझ्या मते, आपली भाषा हळूहळू आणि कोणाचाही लक्षात

मुक्त स्त्रोत 20 Oct 11, 2022
In this Notebook I've build some machine-learning and deep-learning to classify corona virus tweets, in both multi class classification and binary classification.

Hello, This Notebook Contains Example of Corona Virus Tweets Multi Class Classification. - Classes is: Extremely Positive, Positive, Extremely Negativ

Khaled Tofailieh 3 Dec 06, 2022