BMS-Molecular-Translation

Introduction

This is a pipeline for the Bristol-Myers Squibb – Molecular Translation competition by Vadim Timakin and Maksim Zhdanov. We earned bronze medals in this competition. A significant part of the code originated from Y.Nakama's notebook.

The competition was an image-to-text translation task: converting images of molecular skeletal structures into InChI chemical identifiers, for example:

InChI=1S/C16H13Cl2NO3/c1-10-2-4-11(5-3-10)16(21)22-9-15(20)19-14-8-12(17)6-7-13(14)18/h2-8H,9H2,1H3,(H,19,20)

Solution

General Encoder-Decoder concept

Most participants used a CNN encoder to extract image features and a decoder (LSTM/GRU/Transformer) to generate the text sequence. This is the usual approach to an image captioning problem.
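To make the decoding side concrete, here is a minimal sketch of the greedy loop such a decoder typically runs at inference time. It is not the code from this repository: the model is abstracted as a hypothetical callable `next_token_scores`, the `<sos>`/`<eos>` tokens and `max_len` are assumptions, and `toy_decoder` is a stand-in that only exists to make the example runnable.

```python
from typing import Callable, Dict, List, Sequence

SOS, EOS = "<sos>", "<eos>"  # hypothetical special tokens


def greedy_decode(
    image_features: Sequence[float],
    next_token_scores: Callable[[Sequence[float], List[str]], Dict[str, float]],
    max_len: int = 300,  # assumed cap on output length
) -> str:
    """Generate text token by token, always taking the best-scoring token."""
    tokens: List[str] = [SOS]
    for _ in range(max_len):
        scores = next_token_scores(image_features, tokens)
        best = max(scores, key=scores.get)
        if best == EOS:
            break
        tokens.append(best)
    return "".join(tokens[1:])  # drop the start token


def toy_decoder(features: Sequence[float], prefix: List[str]) -> Dict[str, float]:
    """Stand-in for a trained decoder: always emits a fixed token sequence."""
    target = ["InChI=1S/", "C", "H", "4", EOS]
    step = len(prefix) - 1
    return {tok: (1.0 if tok == target[step] else 0.0) for tok in set(target)}


decoded = greedy_decode([], toy_decoder)
```

In a real pipeline the callable would wrap the CNN features and the LSTM/GRU/Transformer decoder step; beam search is a common alternative to the greedy choice shown here.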

Pseudo-labelling with InChI validation using RDKit

RDKit is an open-source cheminformatics toolkit, and it proved quite useful for this problem. Once our first model scored around 7-8 on the public leaderboard, we decided to pseudo-label the test data. Ordinarily, pseudo-labelling adds a significant number of wrong predictions to the extended training set. With RDKit we validated all of our predicted formulas and selected around 800k samples that passed validation. Removing invalid labels from the pseudo-labels improved the score.
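The filtering step can be sketched roughly as follows. This is an illustration under assumptions, not the authors' code: `select_pseudo_labels` and `rdkit_is_valid` are hypothetical names, predictions are assumed to be a mapping from image id to predicted InChI, and "valid" is taken to mean that RDKit's `Chem.MolFromInchi` returns a molecule rather than `None`.

```python
from typing import Callable, Dict


def rdkit_is_valid(inchi: str) -> bool:
    """Return True if RDKit can parse the InChI string into a molecule."""
    # Imported lazily so the rest of the sketch runs without RDKit installed.
    from rdkit import Chem, RDLogger

    RDLogger.DisableLog("rdApp.*")  # silence parse warnings for bad strings
    return Chem.MolFromInchi(inchi) is not None


def select_pseudo_labels(
    predictions: Dict[str, str],
    is_valid: Callable[[str], bool],
) -> Dict[str, str]:
    """Keep only the test predictions whose InChI passes validation."""
    return {
        image_id: inchi
        for image_id, inchi in predictions.items()
        if is_valid(inchi)
    }
```

Calling `select_pseudo_labels(predictions, rdkit_is_valid)` would yield the subset to append to the training set. Note that passing validation only means the string is a parseable InChI, not that it matches the image.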

Predictions normalization

This notebook describes InChI normalization.

Blending

Finally, we blended ~20 predictions from 2 models (mostly from different epochs), using RDKit validation to keep only formulas that correspond to a possible InChI structure.
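One way such a blend could work for a single image is sketched below. The exact blending rule is not stated here, so the majority vote among valid candidates and the fallback to all candidates when none is valid are both assumptions; `blend` is a hypothetical helper, and `is_valid` would be an RDKit-based check in practice.

```python
from collections import Counter
from typing import Callable, List


def blend(candidates: List[str], is_valid: Callable[[str], bool]) -> str:
    """Pick one InChI for an image from several model predictions."""
    if not candidates:
        raise ValueError("need at least one candidate prediction")
    # Prefer predictions that pass validation (assumed rule).
    valid = [c for c in candidates if is_valid(c)]
    # Assumed fallback: if nothing is valid, vote over all candidates.
    pool = valid if valid else candidates
    # Majority vote; ties go to the candidate seen first.
    return Counter(pool).most_common(1)[0][0]
```

Ordering the candidate list from the strongest checkpoint to the weakest makes the tie-break favour the better model.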


Final private LB score 1.79

