Resources related to our paper "CLIN-X: pre-trained language models and a study on cross-task transfer for concept extraction in the clinical domain"

CLIN-X

(CLIN-X-ES) & (CLIN-X-EN)

This repository holds the companion code for the system reported in the paper:

"CLIN-X: pre-trained language models and a study on cross-task transfer for concept extraction in the clinical domain" by Lukas Lange, Heike Adel, Jannik Strötgen and Dietrich Klakow.

The paper can be found here. The code allows users to reproduce and extend the results reported in the paper. Please cite the paper when reporting, reproducing, or extending the results.

@inproceedings{lange-etal-2021-clin-x,
      author    = {Lukas Lange and
                   Heike Adel and
                   Jannik Str{\"{o}}tgen and
                   Dietrich Klakow},
      title     = {CLIN-X: pre-trained language models and a study on cross-task transfer for concept extraction in the clinical domain},
      year={2021},
      url={https://arxiv.org/abs/2112.08754}
}

In case of questions, please contact the authors as listed on the paper.

Purpose of the project

This software is a research prototype, solely developed for and published as part of the publication cited above. It will neither be maintained nor monitored in any way.

The CLIN-X language models

As part of this work, two XLM-R models were adapted to the clinical domain. The models can be found here:

  • CLIN-X ES: Spanish clinical XLM-R (link)
  • CLIN-X EN: English clinical XLM-R (link)

The CLIN-X models are open-sourced under the CC-BY 4.0 license. See the LICENSE_models file for details.

Prepare the conda environment

The code requires the following Python libraries. Activate the new environment before installing them so that pip installs into it:

conda create -n clin-x python==3.8.5
conda activate clin-x
pip install flair==0.8 transformers==4.6.1 torch==1.8.1 scikit-learn==0.23.1 scipy==1.6.3 numpy==1.20.3 nltk tqdm seaborn matplotlib

Masked-Language-Modeling training

The models were trained using the Hugging Face MLM script (run_mlm.py), which can be found here. The script was called as follows:

python -m torch.distributed.launch --nproc_per_node 8 run_mlm.py  \
--model_name_or_path xlm-roberta-large  \
--train_file data/spanisch_clinical_train.txt  \
--validation_file data/spanisch_clinical_valid.txt  \
--do_train   --do_eval  \
--output_dir models/xlm-roberta-large-spanisch-clinical-domain/  \
--fp16  \
--per_device_train_batch_size 4 --per_device_eval_batch_size 4  \
--save_strategy steps --save_steps 10000
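The batch size per optimizer step implied by this command can be worked out as below. This is only a sketch under two assumptions that the command does not state: training runs on a single node, and gradient accumulation is left at the Trainer default of 1 because --gradient_accumulation_steps is not set.

```python
# Values taken from the launch command above.
nproc_per_node = 8                # --nproc_per_node 8 (one process per GPU)
per_device_train_batch_size = 4   # --per_device_train_batch_size 4

# Assumption: not set in the command, so the Trainer default of 1 applies.
gradient_accumulation_steps = 1

# Sequences contributing to each optimizer step across all processes.
effective_batch_size = (
    nproc_per_node * per_device_train_batch_size * gradient_accumulation_steps
)
print(effective_batch_size)  # 32
```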

Using the CLIN-X model with our proposed model architecture (as reported in Table 7)

The following sections describe the scripts used to reproduce the results. See each script file for details on its input arguments.

Tokenize and split the data

python tokenize_files.py --input_path path/to/input/files/ --output_path /path/to/bio_files/
python create_data_splits.py --train_files /path/to/bio_files/ --method random --output_dir /path/to/split_files/
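To illustrate what a random split of BIO-tagged data can look like, here is a minimal sketch. It is not the logic of create_data_splits.py, which may split differently (for example at document level). The functions read_sentences and random_splits, the seed, and the B-SYMPTOM/I-SYMPTOM labels are hypothetical and used for illustration only. It assumes a CoNLL-style layout with one token and tag per line and blank lines between sentences.

```python
import random


def read_sentences(text):
    """Split CoNLL-style text into sentences (lists of non-empty lines)."""
    sentences, current = [], []
    for line in text.splitlines():
        if line.strip():
            current.append(line)
        elif current:
            # A blank line closes the current sentence.
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


def random_splits(sentences, num_splits=5, seed=42):
    """Shuffle sentences and deal them round-robin into num_splits parts."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::num_splits] for i in range(num_splits)]


# Illustrative input; the label names are made up for this example.
example = (
    "El O\npaciente O\n\n"
    "fiebre B-SYMPTOM\nalta I-SYMPTOM\n\n"
    "sin O\ntos B-SYMPTOM\n"
)
sentences = read_sentences(example)
splits = random_splits(sentences, num_splits=3)
for i, part in enumerate(splits, start=1):
    print(f"random_split_{i}.txt: {len(part)} sentence(s)")
```

Fixing the seed keeps the splits reproducible across runs, which matters when several models are trained on different combinations of the same splits.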

Train the model (using random data splits)

The following command trains one model on four splits (1, 2, 3, 4) and uses the remaining split (5) for validation. For different split combinations, adjust the --train_files and --dev_file arguments accordingly.

python train_our_model_architecture.py   \
--data_path /path/to/split_files/  \
--train_files random_split_1.txt,random_split_2.txt,random_split_3.txt,random_split_4.txt  \
--dev_file random_split_5.txt  \
--model xlm-roberta-large-spanish-clinical  \
--name model_name --storage_path models
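The remaining split combinations can be enumerated programmatically. The sketch below assumes five splits named random_split_1.txt to random_split_5.txt, as in the command above, and holds each one out once as the dev file. The helper split_combinations is hypothetical and not part of the repository.

```python
def split_combinations(num_splits=5, prefix="random_split_"):
    """Yield (train_files, dev_file) pairs, holding each split out once."""
    names = [f"{prefix}{i}.txt" for i in range(1, num_splits + 1)]
    for dev in names:
        train = [name for name in names if name != dev]
        yield ",".join(train), dev


combos = list(split_combinations())
for train_files, dev_file in combos:
    # Arguments to pass to train_our_model_architecture.py for this run.
    print(f"--train_files {train_files} --dev_file {dev_file}")
```

Each printed line gives the two arguments for one training run; a distinct --name per run keeps the resulting models apart.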

Get ensemble predictions

For each model, get the predictions on the test set as follows:

python get_test_predictions.py --name models/model_name --conll_path /path/to/bio_files/ --out_path predictions/model_name/

Then, combine the predictions of different models into one ensemble. The arguments are the output path followed by the list of model prediction paths:

python create_ensemble_data.py predictions/ensemble1 predictions/model_name/ predictions/model_name_2/ ...
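How the predictions are combined is defined in create_ensemble_data.py. As an illustration of the general idea only, the sketch below assumes a simple per-token majority vote, which may differ from what the script does. The function majority_vote, the tie-breaking rule, and the B-DRUG/I-DRUG labels are hypothetical.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-token label sequences from several models.

    predictions: one label sequence per model, all of the same length.
    Ties go to the label predicted by the earliest model in the list.
    """
    if len({len(p) for p in predictions}) != 1:
        raise ValueError("all models must label the same tokens")
    ensemble = []
    for labels in zip(*predictions):
        counts = Counter(labels)
        best = max(counts.values())
        # First label, in model order, that reaches the top count.
        ensemble.append(next(label for label in labels if counts[label] == best))
    return ensemble


# Illustrative predictions of three models for the same four tokens.
model_a = ["O", "B-DRUG", "I-DRUG", "O"]
model_b = ["O", "B-DRUG", "O", "O"]
model_c = ["O", "O", "I-DRUG", "O"]
combined = majority_vote([model_a, model_b, model_c])
print(combined)  # ['O', 'B-DRUG', 'I-DRUG', 'O']
```

A token-level vote can in principle produce label sequences that are not valid BIO (for example an I- tag without a preceding B- tag), so a real pipeline may need a repair step afterwards.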

Using the CLIN-X model (as reported in Table 3)

While we recommend using our model architecture, the CLIN-X models can be used in many other architectures. In the paper, we compare against the standard transformer sequence labeling models proposed by Devlin et al. For this, we provide the train_standard_model_architecture.py script:

python train_standard_model_architecture.py  \
--data_path /path/to/bio_files/  \
--model xlm-roberta-large-spanish-clinical  \
--name model_name --storage_path models

License

The CLIN-X code is open-sourced under the AGPL-3.0 license. See the LICENSE file for details.

For a list of other open source components included in CLIN-X, see the file 3rd-party-licenses.txt.

Owner
Bosch Research