Python scripts aim to use a Random Forest machine learning algorithm to predict the water affinity of Metal-Organic Frameworks

Overview

MOF-Water-Affinity-Prediction-

The following Python scripts aim to use a Random Forest machine learning algorithm to predict the water affinity of Metal-Organic Frameworks (MOFs). The training set is extracted from the Cambridge Structural Database and the CoRE_MOF 2019 dataset.

Prediction Model

The prediction model is used to determine whether a given MOF is hydrophobic or hydrophilic. It uses a Random Forest model from the XGBoost library through a scikit-learn interface. The model reads in a .csv file of training data and then predicts the water affinity of a user inputted MOF. The user can specify what input parameters are to be used in the model.

Overfitting/Underfitting

This script was created to investigate how the prediction model’s accuracy and precision vary with the number and combination of inputs. This script allows a user to compare how the different combinations of inputs affect the score and the standard deviation of the model’s results.

It operates by reading in a .csv file of training data containing 13 input parameters. It then generates a list of all the possible combinations of input parameters according to the lengths specified by the user. For example, if the user wants all the combinations of length 3, 4, and 10 possible, the program will generate a list of all combinations of those lengths, and then use each combination as input for the model. Basically, each combination will undergo the same process as in the prediction model above, and then its results will be added into a .csv file for later analysis. Finally, a plot is created with filters for visualization.

.cif to .csv Converter

In order to create a training set for the prediction model, a csv must be created with all the available datapoints. This includes the MOFs and their crystallographic data. The data needed is collected from three different sources: WebCSD, CoRE_MOF 2019 dataset, and the MOF’s .cif files. Furthermore, additional calculations need to be performed from the information collected from the .cif files.

The code works by reading a .txt file, folder, or both, containing the refcodes and .cif files given to the MOF by the Cambrdige Structural Database. It then searches for these refcodes in the CoRE_MOF 2019 dataset, and retrieves the crystallographic data attached to them. Additionally, it uses the .cif files of the MOFs to calculate the atomic mass percentage of the metals contained in the MOF. These calculations are stored in columns 14-17, but are treated as one input parameter in the models in an attempt to relate them to each other. It also states the MOFs in the training set as hydrophobic and hydrophilic based on previously collected information from the literature. Finally, it produces a .csv file ready for use in the prediction model.

.cif folders

Three different folders are used to store .cif files.

  1. cif: these are hydrophobic MOFs received from Dr. Z. Qiao.
  2. manual hydrophobic: these are hydrophobic MOFs collected from the literature
  3. manual hydrophilic: these are hydrophilic MOFs collected from the literature

To add additional .cif files:

Add additional .cif files into either the manual hydrophobic folder or the manual hydrophilic folder. Make sure the file names represent the CCDC refcodes (including or excluding the CoRE_MOF 2019 name extensions). Finally, add these refcodes into the .txt file available in each folder so that the .cif files can be read by the cif Reader program.

This project is licensed under the terms of the GNU General Public License v3.0

INF42 - Topological Data Analysis

TDA INF421(Conception et analyse d'algorithmes) Projet : Topological Data Analysis SphereMin Etant donné un nuage des points, ce programme contient de

2 Jan 07, 2022
Vaex library for Big Data Analytics of an Airline dataset

Vaex-Big-Data-Analytics-for-Airline-data A Python notebook (ipynb) created in Jupyter Notebook, which utilizes the Vaex library for Big Data Analytics

Nikolas Petrou 1 Feb 13, 2022
An implementation of the largeVis algorithm for visualizing large, high-dimensional datasets, for R

largeVis This is an implementation of the largeVis algorithm described in (https://arxiv.org/abs/1602.00370). It also incorporates: A very fast algori

336 May 25, 2022
Desafio 1 ~ Bantotal

Challenge 01 | Bantotal Please read the instructions for the challenge by selecting your preferred language below: Español Português License Copyright

Maratona Behind the Code 44 Sep 28, 2022
Stitch together Nanopore tiled amplicon data without polishing a reference

Stitch together Nanopore tiled amplicon data using a reference guided approach Tiled amplicon data, like those produced from primers designed with pri

Amanda Warr 14 Aug 30, 2022
Renato 214 Jan 02, 2023
Analysiscsv.py for extracting analysis and exporting as CSV

wcc_analysis Lichess page documentation: https://lichess.org/page/world-championships Each WCC has a study, studies are fetched using: https://lichess

32 Apr 25, 2022
Spectacular AI SDK fuses data from cameras and IMU sensors and outputs an accurate 6-degree-of-freedom pose of a device.

Spectacular AI SDK examples Spectacular AI SDK fuses data from cameras and IMU sensors (accelerometer and gyroscope) and outputs an accurate 6-degree-

Spectacular AI 94 Jan 04, 2023
Pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).

AWS Data Wrangler Pandas on AWS Easy integration with Athena, Glue, Redshift, Timestream, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretMana

Amazon Web Services - Labs 3.3k Jan 04, 2023
Display the behaviour of a realtime program with a scope or logic analyser.

1. A monitor for realtime MicroPython code This library provides a means of examining the behaviour of a running system. It was initially designed to

Peter Hinch 17 Dec 05, 2022
Time ranges with python

timeranges Time ranges. Read the Docs Installation pip timeranges is available on pip: pip install timeranges GitHub You can also install the latest v

Micael Jarniac 2 Sep 01, 2022
An Integrated Experimental Platform for time series data anomaly detection.

Curve Sorry to tell contributors and users. We decided to archive the project temporarily due to the employee work plan of collaborators. There are no

Baidu 486 Dec 21, 2022
Making the DAEN information accessible.

The purpose of this repository is to make the information on Australian COVID-19 adverse events accessible. The Therapeutics Goods Administration (TGA) keeps a database of adverse reactions to medica

10 May 10, 2022
BasstatPL is a package for performing different tabulations and calculations for descriptive statistics.

BasstatPL is a package for performing different tabulations and calculations for descriptive statistics. It provides: Frequency table constr

Angel Chavez 1 Oct 31, 2021
This project is the implementation template for HW 0 and HW 1 for both the programming and non-programming tracks

This project is the implementation template for HW 0 and HW 1 for both the programming and non-programming tracks

Donald F. Ferguson 4 Mar 06, 2022
CRISP: Critical Path Analysis of Microservice Traces

CRISP: Critical Path Analysis of Microservice Traces This repo contains code to compute and present critical path summary from Jaeger microservice tra

Uber Research 110 Jan 06, 2023
Amundsen is a metadata driven application for improving the productivity of data analysts, data scientists and engineers when interacting with data.

Amundsen is a metadata driven application for improving the productivity of data analysts, data scientists and engineers when interacting with data.

Amundsen 3.7k Jan 03, 2023
Generates a simple report about the current Covid-19 cases and deaths in Malaysia

Generates a simple report about the current Covid-19 cases and deaths in Malaysia. Results are delay one day, data provided by the Ministry of Health Malaysia Covid-19 public data.

Yap Khai Chuen 7 Dec 15, 2022
Python scripts aim to use a Random Forest machine learning algorithm to predict the water affinity of Metal-Organic Frameworks

The following Python scripts aim to use a Random Forest machine learning algorithm to predict the water affinity of Metal-Organic Frameworks (MOFs). The training set is extracted from the Cambridge S

1 Jan 09, 2022
Data imputations library to preprocess datasets with missing data

Impyute is a library of missing data imputation algorithms. This library was designed to be super lightweight, here's a sneak peak at what impyute can do.

Elton Law 329 Dec 05, 2022