ProtFeat is protein feature extraction tool that utilizes POSSUM and iFeature.

Overview

Linux version made-with-python Python GitHub license Open Source Love svg1

Description:

ProtFeat is designed to extract the protein features by employing POSSUM and iFeature python-based tools. ProtFeat includes a total of 39 distinct protein feature extraction methods (protein descriptors) using 21 PSSM-based protein descriptors from POSSUM and 18 protein descriptors from iFeature.

POSSUM (Position-Specific Scoring matrix-based feature generator for machine learning), a versatile toolkit with an online web server that can generate 21 types of PSSM-based feature descriptors, thereby addressing a crucial need for bioinformaticians and computational biologists.

iFeature, a versatile Python-based toolkit for generating various numerical feature representation schemes for both protein and peptide sequences. iFeature is capable of calculating and extracting a comprehensive spectrum of 18 major sequence encoding schemes that encompass 53 different types of feature descriptors.

Installation

ProtFeat is a python package for feature extracting from protein sequences written in Python 3.9. ProtFeat was developed and tested in Ubuntu 20.04 LTS. Please make sure that you have Anaconda installed on your computer and run the below commands to install requirements. Dependencies are available in requirements.txt file.

conda create -n protFeat_env python=3.9
conda activate protFeat_env

How to run ProtFeat to extract the protein features

Run the following commands in the given order:

To use ProtFeat as a python package:

pip install protFeat

Then, you may use protFeat as the following in python:

import protFeat
from protFeat.feature_extracter import extract_protein_feature, usage
usage()
extract_protein_feature(protein_feature, place_protein_id, input_folder, fasta_file_name)

For example,

extract_protein_feature("AAC", 1, "input_folder", "sample")

To use ProtFeat from terminal:

Clone the Git Repository.

git clone https://github.com/gozsari/ProtFeat

In terminal or command line navigate into protFeat folder.

cd ProtFeat

Install the requirements by the running the following command.

pip install -r requirements.txt

Altenatively you may run ProtFeat from the terminal as the following:

cd src
python protFeat_command_line.py --pf protein_feature --ppid place_protein_id --inpf input_folder --fname fasta_file_name

For example,

python protFeat_command_line.py --pf AAC --ppid 1 --inpf input_folder --fname sample

Explanation of Parameters

protein_feature: {string}, (default = 'aac_pssm'): one of the protein descriptors in POSSUM and iFeature.

POSSUM descriptors:

aac_pssm, d_fpssm, smoothed_pssm, ab_pssm, pssm_composition, rpm_pssm,
s_fpssm, dpc_pssm, k_separated_bigrams_pssm, eedp, tpc, edp, rpssm,
pse_pssm, dp_pssm, pssm_ac, pssm_cc, aadp_pssm, aatp, medp , or all_POSSUM

Note: all_POSSUM extracts the features of all (21) POSSUM protein descriptors.

iFeature descriptors:

AAC, PAAC, APAAC, DPC, GAAC, CKSAAP, CKSAAGP, GDPC, Moran, Geary,
NMBroto, CTDC, CTDD, CTDT, CTriad, KSCTriad, SOCNumber, QSOrder, or all_iFeature

Note: all_iFeature extracts the features of all (18) iFeature protein descriptors.

place_protein_id: {int}, (default = 1): It indicates the place of protein id in fasta header. e.g. fasta header: >sp|O27002|....|....|...., seperate the header wrt. '|' then >sp is in the zeroth position, protein id in the first(1) position.

input_folder: {string}, (default = 'input_folder'}: it is the path to the folder that contains the fasta file.

fasta_file_name: {string}, (default ='sample'): it is the name of the fasta file exclude the '.fasta' extension.

Input file

It must be in fasta format.

Output file

The extracted feature files will be located under feature_extraction_output folder with the name: fasta_file_name_protein_feature.txt (e.g. sample_AAC.txt).

The content of the output files:

  • The output file is tab-seperated.
  • Each row corresponds to the extracted features of the protein sequence.
  • The first column of each row is UniProtKB id of the proteins, the rest is extracted features of the protein sequence.

Tables of the available protein descriptors

Table 1: Protein descriptors obtained from the POSSUM tool.

Descriptor group Protein descriptor Number of dimensions
Row Transformations AAC-PSSM
D-FPSSM
smoothed-PSMM
AB-PSSM
PSSM-composition
RPM-PSSM
S-FPSSM
20
20
1000
400
400
400
400
Column Transformation DPC-PSSM
k-seperated-bigrams-PSSM                    
tri-gram-PSSM
EEDP
TPC
400
400
8000
4000
400
Mixture of row and column transformation EDP
RPSSM
Pre-PSSM
DP-PSSM
PSSM-AC
PSSM-CC
20
110
40
240
200
3800
Combination of above descriptors AADP-PSSSM
AATP
MEDP
420
420
420

Table 2: Protein descriptors obtained from the iFeature tool.
Descriptor group Protein descriptor Number of dimensions
Amino acid composition Amino acid composition (AAC)
Composition of k-spaced amino acid pairs (CKSAAP)
Dipeptide composition (DPC)
20
2400
400
Grouped amino acid composition Grouped amino acid composition (GAAC)
Composition of k-spaced amino acid group pairs (CKSAAGP)
Grouped dipeptide composition (GDPC)
5
150
25
Autocorrelation Moran (Moran)
Geary (Geary)
Normalized Moreau-Broto (NMBroto)
240
240
240
C/T/D Composition (CTDC)
Transition (CTDT)
Distribution (CTDD)
39
39
195
Conjoint triad Conjoint triad (CTriad)
Conjoint k-spaced triad (KSCTriad)
343
343*(k+1)
Quasi-sequence-order Sequence-order-coupling number (SOCNumber)
Quasi-sequence-order descriptors (QSOrder)
60
100
Pseudo-amino acid composition Pseudo-amino acid composition (PAAC)
Amphiphilic PAAC (APAAC)
50
80

License

MIT License

ProtFeat Copyright (C) 2020 CanSyL

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

You might also like...
An Open-Source Package for Neural Relation Extraction (NRE)

OpenNRE We have a DEMO website (http://opennre.thunlp.ai/). Try it out! OpenNRE is an open-source and extensible toolkit that provides a unified frame

SpikeX - SpaCy Pipes for Knowledge Extraction

SpikeX is a collection of pipes ready to be plugged in a spaCy pipeline. It aims to help in building knowledge extraction tools with almost-zero effort.

Datasets of Automatic Keyphrase Extraction

This repository contains 20 annotated datasets of Automatic Keyphrase Extraction made available by the research community. Following are the datasets and the original papers that proposed them. If you know more datasets, and want to contribute, please, notify me.

open-information-extraction-system, build open-knowledge-graph(SPO, subject-predicate-object) by pyltp(version==3.4.0)

中文开放信息抽取系统, open-information-extraction-system, build open-knowledge-graph(SPO, subject-predicate-object) by pyltp(version==3.4.0)

Code for "Generating Disentangled Arguments with Prompts: a Simple Event Extraction Framework that Works"

GDAP The code of paper "Code for "Generating Disentangled Arguments with Prompts: a Simple Event Extraction Framework that Works"" Event Datasets Prep

Utilizing RBERT model for KLUE Relation Extraction task

RBERT for Relation Extraction task for KLUE Project Description Relation Extraction task is one of the task of Korean Language Understanding Evaluatio

Spert NLP Relation Extraction API deployed with torchserve for inference

SpERT torchserve Spert_torchserve is the Relation Extraction model (SpERT)Span-based Entity and Relation Transformer API deployed with pytorch/serve.

Code to reproduce the results of the paper 'Towards Realistic Few-Shot Relation Extraction' (EMNLP 2021)

Realistic Few-Shot Relation Extraction This repository contains code to reproduce the results in the paper "Towards Realistic Few-Shot Relation Extrac

Contact Extraction with Question Answering.

contactsQA Extraction of contact entities from address blocks and imprints with Extractive Question Answering. Goal Input: Dr. Max Mustermann Hauptstr

Releases(protein-feature-extraction)
  • protein-feature-extraction(Apr 12, 2022)

    ProtFeat is designed to extract the protein features by employing POSSUM and iFeature python-based tools. ProtFeat includes 39 distinct protein feature extraction methods using 21 PSSM-based protein descriptors from POSSUM and 18 protein descriptors from iFeature.

    Source code(tar.gz)
    Source code(zip)
Owner
GOKHAN OZSARI
Research and Teaching Assistant, at CEng, METU
GOKHAN OZSARI
Official implementations for various pre-training models of ERNIE-family, covering topics of Language Understanding & Generation, Multimodal Understanding & Generation, and beyond.

English|简体中文 ERNIE是百度开创性提出的基于知识增强的持续学习语义理解框架,该框架将大数据预训练与多源丰富知识相结合,通过持续学习技术,不断吸收海量文本数据中词汇、结构、语义等方面的知识,实现模型效果不断进化。ERNIE在累积 40 余个典型 NLP 任务取得 SOTA 效果,并在 G

5.4k Jan 03, 2023
Contains links to publicly available datasets for modeling health outcomes using speech and language.

speech-nlp-datasets Contains links to publicly available datasets for modeling various health outcomes using speech and language. Speech-based Corpora

Tuka Alhanai 77 Dec 07, 2022
Sequence-to-Sequence learning using PyTorch

Seq2Seq in PyTorch This is a complete suite for training sequence-to-sequence models in PyTorch. It consists of several models and code to both train

Elad Hoffer 514 Nov 17, 2022
小布助手对话短文本语义匹配的一个baseline

oppo-text-match 小布助手对话短文本语义匹配的一个baseline 模型 参考:https://kexue.fm/archives/8213 base版本线下大概0.952,线上0.866(单模型,没做K-flod融合)。 训练 测试环境:tensorflow 1.15 + keras

苏剑林(Jianlin Su) 132 Dec 14, 2022
Topic Inference with Zeroshot models

zeroshot_topics Table of Contents Installation Usage License Installation zeroshot_topics is distributed on PyPI as a universal wheel and is available

Rita Anjana 55 Nov 28, 2022
Finds snippets in iambic pentameter in English-language text and tries to combine them to a rhyming sonnet.

Sonnet finder Finds snippets in iambic pentameter in English-language text and tries to combine them to a rhyming sonnet. Usage This is a Python scrip

Marcel Bollmann 11 Sep 25, 2022
The projects lets you extract glossary words and their definitions from a given piece of text automatically using NLP techniques

Unsupervised technique to Glossary and Definition Extraction Code Files GPT2-DefinitionModel.ipynb - GPT-2 model for definition generation. Data_Gener

Prakhar Mishra 28 May 25, 2021
NLP library designed for reproducible experimentation management

Welcome to the Transfer NLP library, a framework built on top of PyTorch to promote reproducible experimentation and Transfer Learning in NLP You can

Feedly 290 Dec 20, 2022
Technique for Order of Preference by Similarity to Ideal Solution (TOPSIS)

TOPSIS implementation in Python Technique for Order of Preference by Similarity to Ideal Solution (TOPSIS) CHING-LAI Hwang and Yoon introduced TOPSIS

Hamed Baziyad 8 Dec 10, 2022
RIDE automatically creates the package and boilerplate OOP Python node scripts as per your needs

RIDE: ROS IDE RIDE automatically creates the package and boilerplate OOP Python code for nodes as per your needs (RIDE is not an IDE, but even ROS isn

Jash Mota 20 Jul 14, 2022
NLP - Machine learning

Flipkart-product-reviews NLP - Machine learning About Product reviews is an essential part of an online store like Flipkart’s branding and marketing.

Harshith VH 1 Oct 29, 2021
📜 GPT-2 Rhyming Limerick and Haiku models using data augmentation

Well-formed Limericks and Haikus with GPT2 📜 GPT-2 Rhyming Limerick and Haiku models using data augmentation In collaboration with Matthew Korahais &

Bardia Shahrestani 2 May 26, 2022
State of the Art Natural Language Processing

Spark NLP: State of the Art Natural Language Processing Spark NLP is a Natural Language Processing library built on top of Apache Spark ML. It provide

John Snow Labs 3k Jan 05, 2023
Transformers4Rec is a flexible and efficient library for sequential and session-based recommendation, available for both PyTorch and Tensorflow.

Transformers4Rec is a flexible and efficient library for sequential and session-based recommendation, available for both PyTorch and Tensorflow.

730 Jan 09, 2023
本插件是pcrjjc插件的重置版,可以独立于后端api运行

pcrjjc2 本插件是pcrjjc重置版,不需要使用其他后端api,但是需要自行配置客户端 本项目基于AGPL v3协议开源,由于项目特殊性,禁止基于本项目的任何商业行为 配置方法 环境需求:.net framework 4.5及以上 jre8 别忘了装jre8 别忘了装jre8 别忘了装jre8

132 Dec 26, 2022
Recognition of 38 speech commands in russian. Based on Yandex Cup 2021 ML Challenge: ASR

Speech_38_ru_commands Recognition of 38 speech commands in russian. Based on Yandex Cup 2021 ML Challenge: ASR Программа умеет распознавать 38 ключевы

Andrey 9 May 05, 2022
Binaural Speech Synthesis

Binaural Speech Synthesis This repository contains code to train a mono-to-binaural neural sound renderer. If you use this code or the provided datase

Facebook Research 135 Dec 18, 2022
Python SDK for working with Voicegain Speech-to-Text

Voicegain Speech-to-Text Python SDK Python SDK for the Voicegain Speech-to-Text API. This API allows for large vocabulary speech-to-text transcription

Voicegain 3 Dec 14, 2022
NeuralQA: A Usable Library for Question Answering on Large Datasets with BERT

NeuralQA: A Usable Library for (Extractive) Question Answering on Large Datasets with BERT Still in alpha, lots of changes anticipated. View demo on n

Victor Dibia 220 Dec 11, 2022
🤗 Transformers: State-of-the-art Natural Language Processing for Pytorch, TensorFlow, and JAX.

English | 简体中文 | 繁體中文 State-of-the-art Natural Language Processing for Jax, PyTorch and TensorFlow 🤗 Transformers provides thousands of pretrained mo

Hugging Face 77.2k Jan 03, 2023