Find transposable element (TE) insertions from long reads (Nanopore) by direct alignment with minimap2

Last update: Feb 09, 2022

find_te_ins

find_te_ins finds transposable element (TE) insertions from long reads (Nanopore) by aligning the reads directly with minimap2.

Install

$ git clone https://github.com/bakerwm/find_te_ins.git
$ cd find_te_ins

Edit the variables genome_fa and te_fa (lines 10 and 11 of run_pipe.sh) to match your setup, then run:

$ bash run_pipe.sh

Prerequisite

minimap2 - 2.17-r974-dirty, align long reads to reference genome
featureCounts - v2.0.0, quantification
samtools - v1.12, working with BAM files
python - 3.8+
pysam - 0.16.0.1, Python module for working with BAM files

Getting Started

1 Prepare input files

genome_fa - reference genome in FASTA format, set on line 10 of run_pipe.sh
te_fa - TE consensus sequences in FASTA format, set on line 11 of run_pipe.sh
long reads - long reads from Nanopore or PacBio, in FASTA or FASTQ format

2 Run pipe

$ cd ~/work/te_ins
# specify the path to the long-read data: <path-to-long-reads>/
$ git clone https://github.com/bakerwm/find_te_ins.git 
$ bash find_te_ins/run_pipe.sh <path-to-long-reads>/ results

[1/9] align to reference genome
[2/9] extract raw insertions from BAM, by CIGAR
[3/9] convert raw insertions to fasta format
[4/9] align raw_insertion to transposon
[5/9] extract transposon name for insertions
[6/9] merge raw_insertions by window=100
[7/9] count reads for each insertion
[8/9] save final insertions to file
[9/9] Done!
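
Step 6 merges nearby raw insertions into one site. The following is a minimal sketch of one way such a merge could work, not the pipeline's own code. It assumes insertions are given as (chrom, pos) pairs and that an insertion joins the current cluster when it lies on the same chromosome and at most `window` bp past the cluster's last position. The names `Cluster` and `merge_by_window` are hypothetical.

```python
from collections import namedtuple

# Hypothetical record for a merged insertion site (not the pipeline's own type).
Cluster = namedtuple("Cluster", ["chrom", "start", "end", "n_reads"])


def merge_by_window(insertions, window=100):
    """Group (chrom, pos) insertions that lie within `window` bp of each other.

    Assumed rule: an insertion extends the current cluster when it is on the
    same chromosome and at most `window` bp past the cluster's last position.
    """
    clusters = []
    for chrom, pos in sorted(insertions):
        last = clusters[-1] if clusters else None
        if last is not None and last.chrom == chrom and pos - last.end <= window:
            # Extend the current cluster and count one more supporting read.
            clusters[-1] = last._replace(end=pos, n_reads=last.n_reads + 1)
        else:
            # Start a new cluster at this position.
            clusters.append(Cluster(chrom, pos, pos, 1))
    return clusters


# Made-up positions, for illustration only.
raw = [("chr2L", 1000), ("chr2L", 1040), ("chr2L", 5000), ("chr3R", 1010)]
for cluster in merge_by_window(raw):
    print(cluster)
```

With these made-up positions, the two chr2L insertions at 1000 and 1040 collapse into one site supported by two reads, while the other two stay separate.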

3 Output

The pipeline writes the files listed below. The final TE insertions are saved in *.te_ins.final.bed.

$ tree -L 2 results/ONT_sample-1
.
├── ONT_sample-1
│   ├── ONT_sample-1.bam
│   ├── ONT_sample-1.bam.bai
│   ├── ONT_sample-1.raw_ins.bed
│   ├── ONT_sample-1.raw_ins.fa
│   ├── ONT_sample-1.raw_ins.fa.bam
│   ├── ONT_sample-1.raw_ins.fa.bam.bai
│   ├── ONT_sample-1.te_ins.bed
│   ├── ONT_sample-1.te_ins.final.bed
│   ├── ONT_sample-1.te_ins.final.bed6
│   ├── ONT_sample-1.te_ins.gtf
│   ├── ONT_sample-1.te_ins.quant.stderr
│   ├── ONT_sample-1.te_ins.quant.stdout
│   ├── ONT_sample-1.te_ins.quant.txt
│   ├── ONT_sample-1.te_ins.quant.txt.summary
│   ├── ONT_sample-1.te_ins.raw.txt
│   ├── run_minimap2.dm6.stderr
│   └── run_minimap2.dm6_transposon.stderr
...

{sample_name}.te_ins.final.bed

column 1. chromosome name in the reference
column 2. start position of the insertion
column 3. end position of the insertion
column 4. insertion name
column 5. a fixed integer [255]
column 6. strand (the current version does not consider the direction of TE insertions)
column 7. name of the TE consensus
column 8. length of the TE consensus
column 9. proportion of the TE consensus identified
column 10. number of reads supporting the insertion
column 11. number of all reads covering the insertion
column 12. proportion of TE-supporting reads
column 13. type of the TE insertion [full, p3, p5]
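
The 13 columns above can be loaded into a typed record for downstream filtering. This is a minimal sketch that assumes the file is tab-separated and that the proportion columns are plain decimals. The class `TeInsertion`, its field names, and the function `parse_final_bed_line` are hypothetical, and the example line uses made-up values.

```python
from typing import NamedTuple


class TeInsertion(NamedTuple):
    """One row of *.te_ins.final.bed (field names are illustrative)."""
    chrom: str
    start: int
    end: int
    name: str
    score: int
    strand: str
    te_name: str
    te_length: int
    te_fraction: float
    n_support: int
    n_total: int
    support_fraction: float
    te_type: str


# One converter per column, in file order.
CASTS = (str, int, int, str, int, str, str, int, float, int, int, float, str)


def parse_final_bed_line(line):
    """Convert one tab-separated line into a TeInsertion record."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(CASTS):
        raise ValueError(f"expected {len(CASTS)} columns, got {len(fields)}")
    return TeInsertion(*(cast(value) for cast, value in zip(CASTS, fields)))


# Made-up example line, for illustration only.
example = "chr2L\t1500\t1501\tins_1\t255\t+\tTE1\t5000\t0.98\t12\t20\t0.6\tfull\n"
record = parse_final_bed_line(example)
print(record.te_name, record.n_support, record.support_fraction)
```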

{sample_name}.te_ins.raw.txt

column 16 (the last column) gives the type of the TE insertion: [full, p3, p5]

full, more than the cutoff [60%] of the TE consensus was detected
p3, only the 3' end of the TE consensus was detected
p5, only the 5' end of the TE consensus was detected

In the .final.bed file, ONLY full TE insertions are saved for further analysis.
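
The three types can be illustrated with a short sketch. It is not the logic of anno_te.py, whose exact rules are not described here. It assumes the aligned interval on the TE consensus is known, that the cutoff comparison is inclusive, and that a partial hit is labeled by the consensus end it lies closer to. The function name `classify_te_hit` and its parameters are hypothetical.

```python
def classify_te_hit(te_start, te_end, te_length, cutoff=0.6):
    """Return 'full', 'p5' or 'p3' for an alignment to a TE consensus.

    te_start, te_end: 0-based, half-open aligned interval on the consensus.
    Assumed rule for partial hits: the label follows whichever end of the
    consensus the aligned interval lies closer to.
    """
    if not 0 <= te_start < te_end <= te_length:
        raise ValueError("aligned interval must lie within the consensus")
    covered = (te_end - te_start) / te_length
    # Whether the boundary value itself counts as 'full' is an assumption.
    if covered >= cutoff:
        return "full"
    gap_5p = te_start             # unaligned bases before the hit
    gap_3p = te_length - te_end   # unaligned bases after the hit
    return "p5" if gap_5p <= gap_3p else "p3"


# Made-up intervals on a 5000 bp consensus.
print(classify_te_hit(0, 5000, 5000))
print(classify_te_hit(0, 1000, 5000))
print(classify_te_hit(4200, 5000, 5000))
```

Under these assumptions, raising the cutoff from 0.6 to 0.7 turns a hit covering 65% of the consensus from full into a partial type.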

Change criteria

TE types are assigned by anno_te.py, which is called from run_pipe.sh. The cutoff (-c, 0.6 by default) can be set to any float between 0 and 1 to suit your data; see line 100 of run_pipe.sh.

# line-100 of run_pipe.sh
[[ ! -f ${te_ins_txt} ]] && python ${src_dir}/anno_te.py -x ${te_fa_fai} ${te_bam} | sort -k4,4 -k5,5n > ${te_ins_txt}

# change criteria to 0.7
[[ ! -f ${te_ins_txt} ]] && python ${src_dir}/anno_te.py -x ${te_fa_fai} -c 0.7 ${te_bam} | sort -k4,4 -k5,5n > ${te_ins_txt}

# remove te_ins files, and run the command again
$ rm results/ONT_sample-1/ONT_sample-1.te_ins*
$ bash find_te_ins/run_pipe.sh <path-to-long-reads>/ results

How it works?

extract INSERTIONS
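
Raw insertions are read from the CIGAR strings of the genome alignments (step 2 of the pipeline). The sketch below shows the general idea in plain Python and is not the pipeline's own code. It walks a CIGAR string and reports each `I` operation with its reference position and its offset in the read. The function name `insertions_from_cigar` and the `min_len` size filter are hypothetical. With pysam, the inputs would typically come from the `reference_start` and `cigarstring` attributes of an aligned read.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
# Operations that advance the reference position (SAM specification).
REF_CONSUMING = set("MDN=X")
# Operations that advance the position in the read sequence.
READ_CONSUMING = set("MIS=X")


def insertions_from_cigar(ref_start, cigar, min_len=50):
    """Yield (ref_pos, read_offset, length) for each insertion >= min_len.

    ref_start is the 0-based leftmost reference position of the alignment.
    min_len is a hypothetical size filter, not a value taken from the pipeline.
    """
    ref_pos = ref_start
    read_pos = 0
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "I" and length >= min_len:
            # The inserted bases sit between ref_pos - 1 and ref_pos.
            yield ref_pos, read_pos, length
        if op in REF_CONSUMING:
            ref_pos += length
        if op in READ_CONSUMING:
            read_pos += length


# Made-up alignment: 20 soft-clipped bases, then a 3000 bp insertion
# after 500 matched bases.
for hit in insertions_from_cigar(1000, "20S500M3000I400M5D100M"):
    print(hit)
```

The read offset and length make it possible to cut the inserted sequence out of the read, which is what allows it to be aligned to the TE consensus in a later step.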

Owner

Ming Wang
