Collapse a set of redundant kmers to use IUPAC degenerate bases

Overview

kmer-collapse

Collapse a set of redundant kmers to use IUPAC degenerate bases

Overview

Given an input set of kmers, find the smallest set of kmers that encapsulates all diversity in the input set using IUPAC degenerate bases. This aims to solve the problem described here: https://www.biostars.org/p/9498272/

Usage

Install the marisa-trie library, if necessary.

Modify the script's input variable to specify desired sequences, and then run the script:

$ python kmer-collapse.py
{
    "input": [
        "AAAAAAAAAA",
        "TAAAAAAAAA",
        "ACAAAAAAAA",
        "AGAAAAAAAA"
    ],
    "encoded_output": [
        "WAAAAAAAAA",
        "ASAAAAAAAA"
    ]
}

Notes

This has not been tested with any kmer sets but those examples provided. However, it aims to be scalable by pruning combinations of sub-kmers along the way, which would otherwise yield incorrect encodings. This also uses a trie for faster prefix testing. If futher performance is needed, some easy wins would be to cache sub-kmer tests, since most of these test outcomes would be redundant.

Additionally, no error checking is done on the input kmer alphabet or on the consistency of kmer lengths. It may be useful to validate input before using this script.

Examples

These examples are available from the script by uncommenting the relevant input.

A

{
    "input": [
        "AAAAAAAAAA",
        "TAAAAAAAAA"
    ],
    "encoded_output": [
        "WAAAAAAAAA"
    ]
}

B

{
    "input": [
        "AAAAAAAAAA",
        "TAAAAAAAAA",
        "GCGAAAAAAA"
    ],
    "encoded_output": [
        "GCGAAAAAAA",
        "WAAAAAAAAA"
    ]
}

C

{
    "input": [
        "AAAAAAAAAA"
    ],
    "encoded_output": [
        "AAAAAAAAAA"
    ]
}

D

{
    "input": [
        "AAAAAAAAAA",
        "TAAAAAAAAA",
        "CAAAAAAAAA",
        "GAAAAAAAAA"
    ],
    "encoded_output": [
        "NAAAAAAAAA"
    ]
}

E

{
    "input": [
        "AAAAAAAAAA",
        "TAAAAAAAAA",
        "TTAAAAAAAA",
        "ATAAAAAAAA"
    ],
    "encoded_output": [
        "WWAAAAAAAA"
    ]
}

F

{
    "input": [
        "AAAAAAAAAA",
        "TAAAAAAAAA",
        "CAAAAAAAAA",
        "GAAAAAAAAA",
        "TACAGATACA",
        "AACAGAAAAA"
    ],
    "encoded_output": [
        "NAAAAAAAAA",
        "TACAGATACA",
        "AACAGAAAAA"
    ]
}

G

{
    "input": [
        "AAAAAAAAAA",
        "TAAAAAAAAA",
        "ACAAAAAAAA",
        "AGAAAAAAAA"
    ],
    "encoded_output": [
        "ASAAAAAAAA",
        "WAAAAAAAAA"
    ]
}
Owner
Alex Reynolds
Pug caregiver, curler, cyclist, gardener, beginning French scholar
Alex Reynolds
Gba-free-fonts - Free font resources for GBA game development

gba-free-fonts Free font resources for GBA game development This repo contains m

28 Dec 30, 2022
Add any Program in any language you like or add a hello world Program ❣️ if you like give us :star:

Welcome to the Hacktoberfest 2018 Hello-world 📋 This Project aims to help you to get started with using Github. You can find a tutorial here What is

Aniket Sharma 1.5k Nov 16, 2022
Fully coded Apps by Codex.

OpenAI-Codex-Code-Generation Fully coded Apps by Codex. How I use Codex in VSCode to generate multiple completions with autosorting by highest "mean p

nanowell 47 Jan 01, 2023
Web interface for browsing, search and filtering recent arxiv submissions

Web interface for browsing, search and filtering recent arxiv submissions

Andrej 4.8k Jan 08, 2023
A calculator to test numbers against the collatz conjecture

The Collatz Calculator This is an algorithm custom built by Kyle Dickey, used to test numbers against the simple rules of the Collatz Conjecture.

Kyle Dickey 2 Jun 14, 2022
Mute your mic while you're typing. An app for Ubuntu.

Hushboard Mute your microphone while typing, for Ubuntu. Install from kryogenix.org/code/hushboard/. Installation We recommend you install Hushboard t

Stuart Langridge 142 Jan 05, 2023
The functions we created are included in a script. The necessary parts for pre-processing were taken. Analysis complete.

Feature-Engineering The functions we created are included in a script. The necessary parts for pre-processing were taken. Analysis complete. Business

Ayşe Nur Türkaslan 4 Oct 17, 2021
aaencode for python,把python代码转换为颜文字

py-aaencode aaencode for python,把python代码转换为颜文字 compile.py: 将python编译成颜文字,编译结果有随机性,可以选择BPE词表压缩代码 compile_min.py: 最小化的编译器 compiled_min.txt: 编译得到的最小的com

11 Dec 30, 2021
A collection of examples of using cocotb for functional verification of VHDL designs with GHDL.

At the moment, this repo is in an early state and serves as a learning tool for me. So it contains a a lot of quirks and code which can be done much better by cocotb-professionals.

T. Meissner 7 Mar 10, 2022
Simple plug-and-play installer for users who want to LineageOS from stock firmware, or from another custom ROM.

LineageOS for the Teracube 2e Simple plug-and-play installer for users who want to LineageOS from stock firmware, or from another custom ROM. Dependen

Gagan Malvi 5 Mar 31, 2022
Better firefox bookmarks script for rofi

rofi-bookmarks Small python script to open firefox bookmarks with rofi. Features Icons! Only show bookmarks in a specified bookmark folder Show entire

32 Nov 10, 2022
Basit bir cc generator'ü.

Basit bir cc generator'ü. Setup What To Do; Python Installation We install python from CLICK Generator Board After installing the file and python, we

Lâving 7 Jan 09, 2022
One-stop-shop for docs and test coverage of dbt projects.

dbt-coverage One-stop-shop for docs and test coverage of dbt projects. Why do I need something like this? dbt-coverage is to dbt what coverage.py and

Slido 106 Dec 27, 2022
Программа для практической работы №12 по дисциплине

Информатика: программа для практической работы №12 Код и блок-схема программы для практической работы №12 по дисциплине "Информатика" (I семестр). Сут

Vladislav 1 Dec 07, 2021
8 Nov 04, 2022
MobaXterm-GenKey

MobaXterm-GenKey 你懂的!! 本地启动 需要安装Python3!!!

malaohu 328 Dec 29, 2022
TickerRain is an open-source web app that stores and analysis Reddit posts in a transparent and semi-interactive manner.

TickerRain is an open-source web app that stores and analysis Reddit posts in a transparent and semi-interactive manner

GonVas 180 Oct 08, 2022
Prophet is a tool to discover resources detailed for cloud migration, cloud backup and disaster recovery

Prophet is a tool to discover resources detailed for cloud migration, cloud backup and disaster recovery

22 May 31, 2022
ClamNotif: A tool to send you ClamAV notifications

A tool to forward notifications to different recipients categorised by two severity levels of the regular health reports produced by `clamscan` bundled with the ClamAV antivirus engine.

PiSoft Company Ltd. 1 Nov 15, 2021
UniPD exam dates finder

UniPD exam dates finder Find dates for exams at UniPD Usage ./finder.py courses.csv It's suggested to save output to a file: ./finder.py courses.csv

Davide Peressoni 1 Jan 25, 2022