AptaMat is a simple script which aims to measure differences between DNA or RNA secondary structures.

Last update: Nov 03, 2022

Overview

Purpose

AptaMat is a simple script that measures differences between DNA or RNA secondary structures. The method compares the matrices representing the two secondary structures under analysis, which are comparable to dot plots. The dot-bracket notation of each structure is converted into a half binary matrix whose width equals the structure's length. Each matrix cell (i,j) is set to '1' if the nucleotide at position i is paired with the nucleotide at position j, and to '0' otherwise.
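To illustrate the conversion described above, here is a minimal sketch in pure Python. The function name dotbracket_to_matrix is hypothetical and not taken from AptaMat.py. It assumes plain '(', ')' and '.' characters only and uses 0-based indices; the actual script may build its matrix differently (for instance with NumPy).

```python
def dotbracket_to_matrix(structure):
    """Return an upper-triangular 0/1 matrix of base pairs (0-based indices).

    Hypothetical helper for illustration; not the function used by AptaMat.py.
    """
    n = len(structure)
    matrix = [[0] * n for _ in range(n)]
    stack = []  # positions of '(' still waiting for their ')'
    for j, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(j)
        elif symbol == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {j}")
            i = stack.pop()
            matrix[i][j] = 1  # i < j, so only the upper half is filled
        elif symbol != ".":
            raise ValueError(f"unsupported character {symbol!r} at position {j}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return matrix


if __name__ == "__main__":
    # Print the matrix of a small hairpin, one row per line.
    for row in dotbracket_to_matrix("((..))"):
        print("".join(str(cell) for cell in row))
```

Because a pair is always stored with the opening position as the row and the closing position as the column, the lower half of the matrix stays empty, which matches the "half matrix" description.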

The difference between the matrices is calculated by applying the Manhattan distance to each point in the template matrix against all the points in the compared matrix. The calculation is then repeated from the compared matrix to the template matrix so that all differences are accounted for. Both results are summed and divided by the sum of all the points in both matrices.
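The Manhattan distance between two matrix points (i1, j1) and (i2, j2) is |i1 - i2| + |j1 - j2|. The sketch below only tabulates these distances between every template point and every compared point. The helper names (base_pairs, manhattan, distance_table) are hypothetical, and the way AptaMat.py aggregates and normalises the distances into its final score is deliberately not reproduced here, so this is not expected to return the scores shown in the examples.

```python
def base_pairs(structure):
    """List the (i, j) base pairs of a plain dot-bracket string (0-based)."""
    stack, pairs = [], []
    for j, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(j)
        elif symbol == ")":
            pairs.append((stack.pop(), j))
    return sorted(pairs)


def manhattan(p, q):
    """Manhattan (city-block) distance between two matrix points."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def distance_table(template, compared):
    """One row per template point, one column per compared point."""
    compared_points = base_pairs(compared)
    return [[manhattan(p, q) for q in compared_points]
            for p in base_pairs(template)]


if __name__ == "__main__":
    for row in distance_table("(((...)))", "((.....))"):
        print(row)
```

Swapping the two arguments gives the table for the reverse direction, from the compared matrix to the template matrix.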

Dependencies

AptaMat is written for Python 3.8+.

Two Python modules are required:

NumPy
SciPy

These can be installed from the command prompt with either:

./setup

or

pip install numpy
pip install scipy

Use of Anaconda is highly recommended.

Usage

AptaMat is a flexible Python script that accepts the following arguments:

-structures followed by secondary structures written in dot-bracket format
-files followed by the path to one or more formatted files, each containing one or several secondary structures in dot-bracket format

-structures and -files are independent functions of the script and cannot be used at the same time.

usage: AptaMat.py [-h] [-structures STRUCTURES [STRUCTURES ...]] [-files FILES [FILES ...]]

The -structures argument takes secondary structures given as strings. The first input structure is the template structure for the comparison. The following inputs are the compared structures. There is no limit on the number of inputs. Quotes are required.

usage: AptaMat.py -structures [-h] "struct_1" "struct_2" ["struct_n" ...]

The files argument must be a formatted file. Multiple files can be parsed. The first structure encountered during the parsing is used as the template structure. The others are the compared structures.

usage: AptaMat.py -files [-h] struct_file_1 [struct_file_n ...]

The input must be a text file containing at least the secondary structures. It may also contain additional information such as a title, a sequence or a structure index. If several files are provided, the function parses them one by one and always takes the first structure encountered as the template structure. Files must be formatted as follows:

>5HRU
TCGATTGGATTGTGCCGGAAGTGCTGGCTCGA
--Template--
((((.........(((((.....)))))))))
--Compared--
.........(((.(((((.....))))).)))
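As a rough illustration of how such a file could be read, the sketch below keeps the title from the line starting with '>' and collects every line made only of dot-bracket characters, the first one being the template. The function parse_structures and the STRUCTURE_CHARS set are assumptions made for this example; they are not the parser used by AptaMat.py, and extended notation using uppercase letters is not recognised here because it could be confused with the sequence line.

```python
# Characters accepted as structure symbols in this sketch (assumption).
STRUCTURE_CHARS = set("().[]{}<>")

EXAMPLE = """\
>5HRU
TCGATTGGATTGTGCCGGAAGTGCTGGCTCGA
--Template--
((((.........(((((.....)))))))))
--Compared--
.........(((.(((((.....))))).)))
"""


def parse_structures(text):
    """Return (title, structures); structures[0] is the template."""
    title = None
    structures = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            title = line[1:].strip()
        elif set(line) <= STRUCTURE_CHARS:
            # Only dots and brackets: treat the line as a structure.
            structures.append(line)
        # Any other line (sequence, --Template--, ...) is ignored.
    return title, structures


if __name__ == "__main__":
    title, structures = parse_structures(EXAMPLE)
    print(title)
    print("template:", structures[0])
    print("compared:", structures[1:])
```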

Examples

structures function

First, a simple example with two structures:

$ AptaMat.py -structures "(((...)))" "((.....))"
 (((...)))
 ((.....))
> AptaMat : 0.08

Then, it is possible to input several structures:

$ AptaMat.py -structures "(((...)))" "((.....))" ".(.....)." "(.......)"
 (((...)))
 ((.....))
> AptaMat : 0.08

 (((...)))
 .(.....).
> AptaMat : 0.2

 (((...)))
 (.......)
> AptaMat : 0.3

files function

Taking the above file example:

$ AptaMat.py -files example.fa
5HRU
Template - Compared
 ((((.........(((((.....)))))))))
 .........(((.(((((.....))))).)))
> AptaMat : 0.1134453781512605

Note

Compared structures need to have the same length as the Template structure.
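Since the length requirement is easy to overlook when assembling inputs, a small pre-check before calling the script can help. The helper below is a hypothetical sketch, not part of AptaMat.py.

```python
def check_lengths(template, compared_structures):
    """Raise ValueError if any compared structure differs in length from the template."""
    for index, structure in enumerate(compared_structures, start=1):
        if len(structure) != len(template):
            raise ValueError(
                f"compared structure {index} has length {len(structure)}, "
                f"expected {len(template)}"
            )


if __name__ == "__main__":
    # Same lengths: passes silently.
    check_lengths("(((...)))", ["((.....))", ".(.....)."])
    print("all structures have the template length")
```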

For the moment, no feature checks whether a given base pair can actually exist according to the literature. Be careful with the input sequence and its associated base pairing.

The script accepts the extended dot-bracket notation, which is useful for comparing pseudoknots or tetrads. However, the resulting distance might not be accurate.

Comments

Allow comparison with unfolded secondary structures
Users may want to run a quantitative analysis and assign a distance between unfolded and folded oligonucleotides, for example in a pipeline. Several solutions can be considered:

Give a default distance value to unfolded vs folded structures (worst solution)

The distance must equal the maximum number of observable base pairs: len(structure)//2. Several issues could arise from this:

How should this be handled together with enhancement #7? Take the largest? The shortest?

It would give an abnormally high distance value that stays constant even when different foldings are compared with the same unfolded structure. Since ranking is our main advantage over other algorithms, failing to rank in this case is not acceptable.

For each point in the matrix of the folded structure, assign a Manhattan distance equal to the farthest theoretical distance in the structure, plus 1. This gives a large distance between the two structures regardless of their size, and the + 1 prevents a tie with an actually folded structure that has a point at the same coordinates as the farthest theoretical point. Moreover, different foldings compared with the same unfolded structure can receive different scores.

enhancement
opened by GitHuBinet

Different length support and optimal alignment

Allow alignment of structures of different lengths. This would certainly need an optimal structure alignment that makes the AptaMat distance the lowest for a shared motif. Maybe the missing bases should be taken into account in the score calculation.

enhancement
opened by GitHuBinet

Is the algorithm time-consuming?
Given the expected structure size (under 100 nt), the calculation runs quite fast. In theory, however, it can take time for larger structures, with a complexity around log(n^2). An improvement is possible, since this time complexity is linked to the double traversal of the dot-bracket input.

[ ] Think about ways to improve this bracket search.

[ ] Study the .ct notation for ssNA secondary structures (see the ".ct notation" enhancement)

[x] #6

[ ] Test the algorithm with this new feature

question
opened by GEC-git

G-quadruplex/pseudoknot comprehension
Add support for G-quadruplexes and pseudoknots. These kinds of secondary structure require the extended dot-bracket notation: https://www.tbi.univie.ac.at/RNA/ViennaRNA/doc/html/rna_structure_notations.html

The '([{<' characters and string.ascii_uppercase are already handled, but some doubt remains about the comparison accuracy because no tests have been run on this kind of secondary structure.

[ ] Run some tests on G-quadruplexes and pseudoknots and draw conclusions about the comparison reliability. /!\ The complexity comes from the G-quadruplex structures. The tetrads can form base pairs in many different ways, and several secondary structure notations can be similar. Here is an example with the same interacting guanines:

GGTTGGTGTGGTTGG
([..[)...(]..])
((..)(...)(..))

[x] #5

enhancement invalid
opened by GEC-git

Releases (v0.9-pre-release)

v0.9-pre-release (Oct 28, 2022)
Pre-release content

https://github.com/GEC-git/AptaMat

Create LICENSE by @GEC-git in https://github.com/GEC-git/AptaMat/pull/2

main script AptaMat.py

README.MD edited and published

Beta AptaMat logo edited and published

Contributors

@GEC-git contributed in https://github.com/GEC-git/AptaMat

@GitHuBinet contributed in https://github.com/GEC-git/AptaMat

Full Changelog: https://github.com/GEC-git/AptaMat/commits/v0.9-pre-release

Owner

GEC UTC

We are the "Genie Enzymatique et Cellulaire" CNRS UMR 7025 research unit.
