Structural basis for solubility in protein expression systems

Overview

Structural basis for solubility in protein expression systems

Twitter Follow GitHub repo size

Large-scale protein production for biotechnology and biopharmaceutical applications rely on high protein solubility in expression systems. Solubility has been measured for a significant fraction of E. coli and S. cerevisiae proteomes and these datasets are routinely used to train predictors of protein solubility in different organisms. Thanks to continued advances in experimental structure-determination and modelling, many of these solubility measurements can now be paired with accurate structural models.

The challenge is mentored by Christopher Ing and Mark Fingerhuth.

Aim of the challenge

It is the objective of this project to use our provided dataset of protein structure and solubility value pairs in order to produce a solubility predictor with comparable accuracy to sequence-based predictors reported in the literature. The provided dataset to be used in this project is created by following the dataset curation procedure described in the SOLart paper, and this hackathon project has a similar aim to this manuscript.

The dataset

The process of generating the dataset is described in the SOLArt manuscript. At a high level, all experimentally tested E. coli and S. cerevisiae proteins were matched through Uniprot IDs to known crystallographic structures or high sequence similarity homology models. After balancing the fold types using CATH, a dataset containing a balanced spread of solubility values was produced. The resulting proteins for the training and testing of these models were prepared and disclosed in the supplemental material of this paper as a list of (Uniprot,PDB,Chain,Solubility) pairs. The PDB files were not included in this work so we had to re-extract them from SWISS-MODEL. Whenever a crystallographic structure was present, it was used, assuming high coverage over the Uniprot sequence. In some cases, the original PDB templates used within the original SOLArt paper had been superceded by improved templates, and we opted to take the highest resolution, highest sequence identity, models in our updated dataset. We stripped away all irrelevant chains and heteroatoms.

If issues are identified with individual structures, please refer to the Uniprot ID and manually investigate the best template. In some cases, we needed to improve structure correctness by modelling missing atoms/residues inside the Chemical Computing Group software MOE on a case-by-case basis.

The dataset can be found in the data/ subdirectory - it is already divided into training/ and test/ data. The training/ data comes with solubility_values.csv and solublity_values.yaml (same content just different format) which both contain the solubility target values for all the PDB files provided in that directory. Note that each PDB file is named after the Uniprot identifier of the respective protein and the protein column in the solubility_values.csv also contains the Uniprot identifiers.

The test/ dataset consists of three different subdirectories (protein structures derived from different organisms and with different approaches) and you should NOT use them for any training. Only the yeast_crystal_structs/ directory contains solubility_values.csv and solublity_values.yaml (same content just different format) files which you can use for some local testing & validation. In order to find out your performance on the entire test dataset you need to use the automated benchmarking system (see below).

Example output

Your code should output a file called predictions.csv in the following format:

protein,solubility
P69829,83
P31133,62

whereby the protein column contains the Uniprot ID (corresponds to the filename of the PDB files) and the solubility column contains the predicted solubility value (can be int or float).

Note, that there are three (!) test subsets but you are expected to submit all the predictions in one file (not three) for the benchmarking system to work.

Automated benchmarking system

The continuous integration script in .github/workflows/ci.yml will automatically build the Dockerfile on every commit to the main branch. This docker image will be published as your hackathon submission to https://biolib.com//. For this to work, make sure you set the BIOLIB_TOKEN and BIOLIB_PROJECT_URI accordingly as repository secrets.

To read more about the benchmarking system click here.

Say thanks

Give this repo a star: GitHub Repo stars

Star the ProteinQure org on Github: GitHub Org's stars

Owner
ProteinQure
ProteinQure
Ferramenta de monitoramento do risco de colapso no sistema de saúde em municípios brasileiros com a Covid-19.

FarolCovid 🚦 Ferramenta de monitoramento do risco de colapso no sistema de saúde em municípios brasileiros com a Covid-19. Monitoring tool & simulati

Impulso 49 Jul 10, 2022
Aero is an open source airplane intelligence tool. Aero supports more than 13,000 airlines and 250 countries. Any flight worldwide at your fingertips.

Aero Aero supports more than 13,000 airlines and 250 countries. Any flight worldwide at your fingertips. Features Main : Flight lookup Aircraft lookup

Vickey 비키 4 Oct 27, 2021
A tool to help the Poly copy-reading process! :D

PolyBot A tool to help the Poly copy-reading process! :D Let's face it-computers are better are repeatitive tasks. And, in spite of what one may want

1 Jan 10, 2022
A framework to create reusable Dash layout.

dash_component_template A framework to create reusable Dash layout.

The TolTEC Project 4 Aug 04, 2022
This program generates automatically new folders containing old version of program

Automated Folder Versions Generator by Sergiy Grimoldi - V.0.0.2 This program generates automatically new folders containing old version of something

Sergiy Grimoldi 1 Dec 23, 2021
This library is an ongoing effort towards bringing the data exchanging ability between Java/Scala and Python

PyJava This library is an ongoing effort towards bringing the data exchanging ability between Java/Scala and Python

Byzer 6 Oct 17, 2022
Exam assignment for Laboratory of Bioinformatics 2

Exam assignment for Laboratory of Bioinformatics 2 (Alma Mater University of Bologna, Master in Bioinformatics)

2 Oct 22, 2022
jmespath.rs Python binding

rjmespath-py jmespath.rs Python binding.

messense 3 Dec 14, 2022
Reactjs web app written entirely in python, using transcrypt compiler.

Reactjs web app written entirely in python, using transcrypt compiler.

Dan Shai 22 Nov 27, 2022
Huggingface package for the discrete VAE used for DALL-E.

DALL-E-Tokenizer Huggingface package for the discrete VAE used for DALL-E.

MyungHoon Jin 5 Sep 01, 2021
In the works, creating a new Chess Board and way to Play...

sWJz4Chess date started on github.com 11-13-2021 In the works, creating a new Chess Board and way to Play... starting to write this in Pygame, any ind

Shawn 2 Nov 18, 2021
Blender addon that enables exporting of xmodels from blender. Great for custom asset creation for cod games

Birdman's XModel Tools For Blender Greetings everyone in the custom cod community. This blender addon should finally enable exporting of custom assets

wast 2 Jul 02, 2022
Fix Eitaa Messenger's Font Problem on Linux

Fix Eitaa Messenger's Font Problem on Linux

6 Oct 15, 2022
CALPHAD tools for designing thermodynamic models, calculating phase diagrams and investigating phase equilibria.

CALPHAD tools for designing thermodynamic models, calculating phase diagrams and investigating phase equilibria.

pycalphad 189 Dec 13, 2022
All exercises done during the Python 3 course in the Video Course (World 1, 2 and 3)

Python3-cursoemvideo-exercises - All exercises done during the Python 3 course in the Video Course (World 1, 2 and 3)

Renan Barbosa 3 Jan 17, 2022
Some out-of-the-box hooks for pre-commit

pre-commit-hooks Some out-of-the-box hooks for pre-commit. See also: https://github.com/pre-commit/pre-commit Using pre-commit-hooks with pre-commit A

pre-commit 3.6k Dec 29, 2022
Fixes your Microphone Level to one specific value.

MicLeveler Fixes your Microphone Level to one specific value. Intention A friend of mine has the problem that some programs are setting his microphone

Moritz Timpe 2 Oct 14, 2021
Customizable-menu-python - User customizable menu in Python

Menu personalizável pelo usuário em Python A minha ideia com esse projeto pessoa

Renan Barbosa 4 Oct 28, 2022
firefox session recovery

firefox session recovery

Ahmad Sadraei 5 Nov 29, 2022
This is a small compiler to demonstrate how compilers work.

This is a small compiler to demonstrate how compilers work. It compiles our own dialect to C, while being written in Python.

Md. Tonoy Akando 2 Jul 19, 2022