Find transposable element (TE) insertions in long reads (Nanopore) by direct alignment with minimap2

Overview

find_te_ins

find_te_ins finds transposable element (TE) insertions in long reads (Nanopore) by aligning the reads directly with minimap2.

Install

$ git clone https://github.com/bakerwm/find_te_ins.git
$ cd find_te_ins

Set the variables genome_fa and te_fa (lines 10 and 11 of run_pipe.sh) to match your data, then run:

$ bash run_pipe.sh
run_pipe.sh

Prerequisite

  • minimap2 - 2.17-r974-dirty, align long reads to reference genome
  • featureCounts - v2.0.0, quantification
  • samtools - v1.12, working with BAM files
  • python 3.8+
  • pysam 0.16.0.1, python module, working with BAM files

Getting Started

1 Prepare input files

  • genome_fa - reference genome in FASTA format; set in run_pipe.sh, line 10
  • te_fa - TE consensus sequences in FASTA format; set in run_pipe.sh, line 11
  • long reads - long reads from Nanopore or PacBio, in FASTA or FASTQ format

2 Run pipe

$ cd ~/work/te_ins
# specify the path of long reads data: <path-to-long-reads>/
$ git clone https://github.com/bakerwm/find_te_ins.git 
$ bash find_te_ins/run_pipe.sh <path-to-long-reads>/ results

[1/9] align to reference genome
[2/9] extract raw insertions from BAM, by CIGAR
[3/9] convert raw insertions to fasta format
[4/9] align raw_insertion to transposon
[5/9] extract transposon name for insertions
[6/9] merge raw_insertions by window=100
[7/9] count reads for each insertion
[8/9] save final insertions to file
[9/9] Done!
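
Steps 2 and 6 of the log above can be illustrated with a small pure-Python sketch. This is not the code used by run_pipe.sh: the function names, the `min_len=50` threshold, and the chaining rule used for merging are assumptions made for illustration; only the idea of reading insertions from CIGAR strings and the window of 100 come from the step descriptions.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
# Operations that advance the position on the reference.
REF_CONSUMING = set("MDN=X")


def find_insertions(ref_start, cigar, min_len=50):
    """Return (ref_pos, ins_len) for every 'I' operation of at least min_len bases.

    ref_start is the 0-based leftmost reference position of the alignment.
    min_len is a hypothetical threshold, not a value taken from the pipeline.
    """
    pos = ref_start
    hits = []
    for length, op in CIGAR_RE.findall(cigar):
        length = int(length)
        if op == "I" and length >= min_len:
            hits.append((pos, length))
        if op in REF_CONSUMING:
            pos += length
    return hits


def merge_by_window(positions, window=100):
    """Cluster positions from one chromosome.

    A position joins the current cluster when it lies within `window` bases
    of the previous position (an assumed chaining rule).
    """
    clusters = []
    for p in sorted(positions):
        if clusters and p - clusters[-1][-1] <= window:
            clusters[-1].append(p)
        else:
            clusters.append([p])
    return clusters


if __name__ == "__main__":
    print(find_insertions(1000, "100M300I50M20D10I40M"))
    print(merge_by_window([1100, 1150, 1400, 1460, 2000]))
```

In the real pipeline the start position and CIGAR string would come from the BAM records (for example through pysam); plain strings are used here so the sketch runs without extra dependencies.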

3 Output

The pipeline produces the files listed below. The final TE insertions are saved in the file *.te_ins.final.bed.

$ tree -L 2 results/ONT_sample-1
.
├── ONT_sample-1
│   ├── ONT_sample-1.bam
│   ├── ONT_sample-1.bam.bai
│   ├── ONT_sample-1.raw_ins.bed
│   ├── ONT_sample-1.raw_ins.fa
│   ├── ONT_sample-1.raw_ins.fa.bam
│   ├── ONT_sample-1.raw_ins.fa.bam.bai
│   ├── ONT_sample-1.te_ins.bed
│   ├── ONT_sample-1.te_ins.final.bed
│   ├── ONT_sample-1.te_ins.final.bed6
│   ├── ONT_sample-1.te_ins.gtf
│   ├── ONT_sample-1.te_ins.quant.stderr
│   ├── ONT_sample-1.te_ins.quant.stdout
│   ├── ONT_sample-1.te_ins.quant.txt
│   ├── ONT_sample-1.te_ins.quant.txt.summary
│   ├── ONT_sample-1.te_ins.raw.txt
│   ├── run_minimap2.dm6.stderr
│   └── run_minimap2.dm6_transposon.stderr
...

{sample_name}.te_ins.final.bed

column 1. chromosome name in the reference
column 2. start position of the insertion
column 3. end position of the insertion
column 4. insertion name
column 5. a fixed integer [255]
column 6. strand (the current version does not consider the direction of TE insertions)
column 7. name of the TE consensus
column 8. length of the TE consensus
column 9. proportion of the TE consensus identified
column 10. number of reads supporting the insertion
column 11. number of all reads covering the insertion
column 12. proportion of TE-supporting reads
column 13. type of the TE insertion [full, p3, p5]
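
As a reading aid, the 13 columns can be loaded into a typed record. The sketch below assumes the file is tab-separated and that the numeric columns parse as plain integers and floats; the class name, field names, and the example line are invented for illustration and are not real pipeline output.

```python
from dataclasses import dataclass


@dataclass
class TeInsertion:
    chrom: str               # column 1
    start: int               # column 2
    end: int                 # column 3
    name: str                # column 4
    score: int               # column 5, fixed at 255
    strand: str              # column 6, not meaningful in the current version
    te_name: str             # column 7
    te_length: int           # column 8
    te_fraction: float       # column 9
    support_reads: int       # column 10
    total_reads: int         # column 11
    support_fraction: float  # column 12
    ins_type: str            # column 13: full, p3 or p5


def parse_final_bed_line(line):
    """Parse one line of a *.te_ins.final.bed file (assumed tab-separated)."""
    f = line.rstrip("\n").split("\t")
    if len(f) != 13:
        raise ValueError(f"expected 13 columns, got {len(f)}")
    return TeInsertion(
        f[0], int(f[1]), int(f[2]), f[3], int(f[4]), f[5],
        f[6], int(f[7]), float(f[8]),
        int(f[9]), int(f[10]), float(f[11]), f[12],
    )


if __name__ == "__main__":
    # Invented example line, for illustration only.
    example = "chr2L\t1000\t1001\tins_1\t255\t+\tTE_example\t5000\t0.95\t8\t10\t0.8\tfull"
    print(parse_final_bed_line(example))
```

A record like this makes downstream filtering explicit, for example keeping only insertions whose `support_fraction` exceeds a threshold of your choice.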

{sample_name}.te_ins.raw.txt

Column 16 (the last column) gives the type of the TE insertion: [full, p3, p5]

  • full, more than the cutoff [60%] of the TE consensus was detected
  • p3, only the 3' end of the TE consensus was detected
  • p5, only the 5' end of the TE consensus was detected

In the .final.bed file, ONLY full TE insertions are saved for further analysis.
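
The three types can be pictured with the following sketch. It is not the logic of anno_te.py, which is not shown here: the function name, the coordinate convention (0-based, end-exclusive positions on the TE consensus), and the rule of assigning a partial hit to whichever consensus end it lies nearer to are all assumptions. Only the 60% cutoff and the three labels come from the description above.

```python
def classify_te_hit(te_start, te_end, te_length, cutoff=0.6):
    """Classify an alignment against a TE consensus as 'full', 'p5' or 'p3'.

    te_start, te_end: aligned interval on the consensus (0-based, end-exclusive).
    te_length: length of the consensus sequence.
    cutoff: fraction of the consensus that must be covered to call 'full'.
    """
    covered = (te_end - te_start) / te_length
    if covered > cutoff:
        return "full"
    gap5 = te_start              # unaligned bases at the 5' end
    gap3 = te_length - te_end    # unaligned bases at the 3' end
    # Assumed rule: a partial hit is labelled by the end it sits closer to.
    return "p5" if gap5 <= gap3 else "p3"


if __name__ == "__main__":
    print(classify_te_hit(0, 4000, 5000))      # covers 80% of the consensus
    print(classify_te_hit(0, 1000, 5000))      # only the 5' part
    print(classify_te_hit(4200, 5000, 5000))   # only the 3' part
```

Raising the cutoff makes the `full` class stricter: a hit covering 65% of the consensus is `full` at a cutoff of 0.6 but becomes partial at 0.7.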

Change criteria

TE types are assigned by anno_te.py, which is called from run_pipe.sh. The cutoff, -c 0.6, can be changed to any float between 0 and 1 to suit your data; see line 100 of run_pipe.sh.

# line-100 of run_pipe.sh
[[ ! -f ${te_ins_txt} ]] && python ${src_dir}/anno_te.py -x ${te_fa_fai} ${te_bam} | sort -k4,4 -k5,5n > ${te_ins_txt}

# change criteria to 0.7
[[ ! -f ${te_ins_txt} ]] && python ${src_dir}/anno_te.py -x ${te_fa_fai} -c 0.7 ${te_bam} | sort -k4,4 -k5,5n > ${te_ins_txt}

# remove te_ins files, and run the command again
$ rm results/ONT_sample-1.te_ins*
$ bash find_te_ins/run_pipe.sh <path-to-long-reads>/ results

How does it work?

  1. extract INSERTIONS
Owner
Ming Wang