Implementation of BI-RADS-BERT & The Advantages of Section Tokenization.

Overview

BI-RADS BERT

Implementation of BI-RADS-BERT & The Advantages of Section Tokenization.

This implementation could be used on other radiology in house corpus as well. Labelling your own data should take the same form as reports and dataframes in './mockdata'.

Conda Environment setup

This project was developed using conda environments. To build the conda environment use the line of code below from the command line

conda create --name NLPenv --file requirements.txt --channel default --channel conda-forge --channel huggingface --channel pytorch

Dataset Organization

Two datasets are needed to build BERT embeddings and fine tuned Field Extractors. 1. dataframe of SQL data, 2. labeled data for field extraction.

Dataframe of SQL data: example file './mock_data/sql_dataframe.csv'. This file was efficiently made by producing a spreadsheet of all entries in the sql table and saving them as a csv file. It will require that each line of the report be split and coordinated with a SequenceNumber column to combine all the reports. Then continue to the 'How to Run BERT Pretraining' Section.

Labeled data for Field Extraction: example of files in './mock_data/labaled_data'. Exach txt file is a save dict object with fields:

example = {
    'original_report': original text report unprocessed from the exam_dataframe.csv, 
    'sectionized': dict example of the report in sections, ex. {'Title': '...', 'Hx': '...', ...}
    'PID': patient identification number,
    'date': date of the exam,
    'field_name1': name of a field you wish to classify, vlaue is the label, 
    'field_name2': more labeled fields are an option, 
    ...
}

How to Run BERT Pretraining

Step 1: SQLtoDataFrame.py

This script can be ran to convert SQL data from a hospital records system to a dataframe for all exams. Hospital records keep each individual report line as a separate SQL entry, so by using 'SequenceNumber' we can assemble them in order.

python ./examples/SQLtoDataFrame.py 
--input_sql ./mock_data/sql_dataframe.csv 
--save_name /folder/to/save/exam_dataframe/save_file.csv

This will output an 'exam_dataframe.csv' file that can be used in the next step.

Step 2: TextPreProcessingBERTModel.py

This script is ran to convert the exam_dataframe.csv file into a pre_training text file for training and validation, with a vocabulary size. An example of the output can be found in './mock_data/pre_training_data'.

python ./examples/TextPreProcessingBERTModel.py 
--dfolder /folder/that/contains/exam_dataframe 
--ft_folder ./mock_data/labeled_data

Step 3: MLM_Training_transformers.py

This script will now run the BERT pre training with masked language modeling. The Output directory (--output_dir) used is required to be empty; eitherwise the parser parameter --overwrite_output_dir is required to overwrite the files in the output directory.

python ./examples/MLM_Training_transformers.py 
--train_data_file ./mock_data/pre_training_data/VocabOf39_PreTraining_training.txt 
--output_dir /folder/to/save/bert/model
--do_eval 
--eval_data_file ./mock_data/pre_training_data/PreTraining_validation.txt 

How to Run BERT Fine Tuning

--pre_trained_model parsed arugment that can be used for all the follwing scripts to load a pre trained embedding. The default is bert-base-uncased. To get BioClinical BERT use --pre_trained_model emilyalsentzer/Bio_ClinicalBERT.

Step 4: BERTFineTuningSectionTokenization.py

This script will run fine tuning to train a section tokenizer with the option of using auxiliary data.

python ./examples/BERTFineTuningSectionTokenization.py 
--dfolder ./mock_data/labeled_data
--sfolder /folder/to/save/section_tokenizer

Optional parser arguements:

--aux_data If used then the Section Tokenizer will be trained with the auxilliary data.

--k_fold If used then the experiment is run with a 5 fold cross validation.

Step 5: BERTFineTuningFieldExtractionWoutSectionization.py

This script will run fine tuning training of field extraction without section tokenization.

python ./examples/BERTFineTuningFieldExtractionWoutSectionization.py 
--dfolder ./mock_data/labeled_data
--sfolder /folder/to/save/field_extractor_WoutST
--field_name Modality

field_name is a required parsed arguement.

Optional parser arguements:

--k_fold If used then the experiment is run with a 5 fold cross validation.

Step 6: BERTFineTuningFieldExtraction.py

This script will run fine tuning training of field extraction with section tokenization.

python ./examples/BERTFineTuningFieldExtraction.py 
--dfolder ./mock_data/labeled_data
--sfolder /folder/to/save/field_extractor
--field_name Modality
--report_section Title

field_name and report_section is a required parsed arguement.

Optional parser arguements:

--k_fold If used then the experiment is run with a 5 fold cross validation.

Additional Codes

post_ExperimentSummary.py

This code can be used to run statistical analysis of test results that are produced from BERTFineTuning codes.

To determine the best final model, we performed statistical significance testing with a 95% confidence. We used the Mann-Whitney U test to compare the medians of different section tokenizers as the distribution of accuracy and G.F1 performance is skewed to the left (medians closer to 100%). For the field extraction classifiers, we used the McNemar test to compare the agreement between two classifiers. The McNemar test was chosen because it has been robustly proven to have an acceptable probability of Type I errors (not detecting a difference between two classifiers when there is a difference). After evaluating both configurations of field extraction explored in this paper, we performed another McNemar test to assist in choosing the best technique. All statistical tests were performed with p-value adjustments for multiple comparisons testing with Bonferonni correction.

Note: input folder must contain 2 or more .xlsx files of experiemtnal results to perform a statistical test.

python ./examples/post_ExperimentSummary.py --folder /folder/where/xlsx/files/are/located --stat_test MannWhitney

--stat_test options: 'MannWhitney' and 'McNemar'.

'MannWhitney': MannWhitney U-Test. This test was used for the Section Tokenizer experimental results comparing the results from different models. https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test

'McNemar' : McNemar's test. This test was used for the Field Extraction experimental results comparing the results from different models. https://en.wikipedia.org/wiki/McNemar%27s_test

Contact

Please post a Github issue if you have any questions.

Graph Convolutional Networks for Temporal Action Localization (ICCV2019)

Graph Convolutional Networks for Temporal Action Localization This repo holds the codes and models for the PGCN framework presented on ICCV 2019 Graph

Runhao Zeng 318 Dec 06, 2022
data/code repository of "C2F-FWN: Coarse-to-Fine Flow Warping Network for Spatial-Temporal Consistent Motion Transfer"

C2F-FWN data/code repository of "C2F-FWN: Coarse-to-Fine Flow Warping Network for Spatial-Temporal Consistent Motion Transfer" (https://arxiv.org/abs/

EKILI 46 Dec 14, 2022
OpenMMLab Semantic Segmentation Toolbox and Benchmark.

Documentation: https://mmsegmentation.readthedocs.io/ English | 简体中文 Introduction MMSegmentation is an open source semantic segmentation toolbox based

OpenMMLab 5k Dec 31, 2022
The code is an implementation of Feedback Convolutional Neural Network for Visual Localization and Segmentation.

Feedback Convolutional Neural Network for Visual Localization and Segmentation The code is an implementation of Feedback Convolutional Neural Network

19 Dec 04, 2022
Posterior predictive distributions quantify uncertainties ignored by point estimates.

Posterior predictive distributions quantify uncertainties ignored by point estimates.

DeepMind 177 Dec 06, 2022
FluxTraining.jl gives you an endlessly extensible training loop for deep learning

A flexible neural net training library inspired by fast.ai

86 Dec 31, 2022
Code for Reciprocal Adversarial Learning for Brain Tumor Segmentation: A Solution to BraTS Challenge 2021 Segmentation Task

BRATS 2021 Solution For Segmentation Task This repo contains the supported pytorch code and configuration files to reproduce 3D medical image segmenta

Himashi Amanda Peiris 6 Sep 15, 2022
Meshed-Memory Transformer for Image Captioning. CVPR 2020

M²: Meshed-Memory Transformer This repository contains the reference code for the paper Meshed-Memory Transformer for Image Captioning (CVPR 2020). Pl

AImageLab 422 Dec 28, 2022
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

107 Dec 02, 2022
Official PyTorch code for Hierarchical Conditional Flow: A Unified Framework for Image Super-Resolution and Image Rescaling (HCFlow, ICCV2021)

Hierarchical Conditional Flow: A Unified Framework for Image Super-Resolution and Image Rescaling (HCFlow, ICCV2021) This repository is the official P

Jingyun Liang 159 Dec 30, 2022
Experiments and code to generate the GINC small-scale in-context learning dataset from "An Explanation for In-context Learning as Implicit Bayesian Inference"

GINC small-scale in-context learning dataset GINC (Generative In-Context learning Dataset) is a small-scale synthetic dataset for studying in-context

P-Lambda 29 Dec 19, 2022
Rotation Robust Descriptors

RoRD Rotation-Robust Descriptors and Orthographic Views for Local Feature Matching Project Page | Paper link Evaluation and Datasets MMA : Training on

Udit Singh Parihar 25 Nov 15, 2022
Transformer part of 12th place solution in Riiid! Answer Correctness Prediction

kaggle_riiid Transformer part of 12th place solution in Riiid! Answer Correctness Prediction. Please see here for more information. Execution You need

Sakami Kosuke 2 Apr 23, 2022
Deep Convolutional Generative Adversarial Networks

Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks Alec Radford, Luke Metz, Soumith Chintala All images in t

Alec Radford 3.4k Dec 29, 2022
🤗 Transformers: State-of-the-art Natural Language Processing for Pytorch, TensorFlow, and JAX.

English | 简体中文 | 繁體中文 State-of-the-art Natural Language Processing for Jax, PyTorch and TensorFlow 🤗 Transformers provides thousands of pretrained mo

Hugging Face 77.2k Jan 02, 2023
Code for BMVC2021 paper "Boundary Guided Context Aggregation for Semantic Segmentation"

Boundary-Guided-Context-Aggregation Boundary Guided Context Aggregation for Semantic Segmentation Haoxiang Ma, Hongyu Yang, Di Huang In BMVC'2021 Pape

Haoxiang Ma 31 Jan 08, 2023
YouRefIt: Embodied Reference Understanding with Language and Gesture

YouRefIt: Embodied Reference Understanding with Language and Gesture YouRefIt: Embodied Reference Understanding with Language and Gesture by Yixin Che

16 Jul 11, 2022
[NeurIPS 2020] Semi-Supervision (Unlabeled Data) & Self-Supervision Improve Class-Imbalanced / Long-Tailed Learning

Rethinking the Value of Labels for Improving Class-Imbalanced Learning This repository contains the implementation code for paper: Rethinking the Valu

Yuzhe Yang 656 Dec 28, 2022
MSG-Transformer: Exchanging Local Spatial Information by Manipulating Messenger Tokens

MSG-Transformer Official implementation of the paper MSG-Transformer: Exchanging Local Spatial Information by Manipulating Messenger Tokens, by Jiemin

Hust Visual Learning Team 68 Nov 16, 2022
A Fast and Accurate One-Stage Approach to Visual Grounding, ICCV 2019 (Oral)

One-Stage Visual Grounding ***** New: Our recent work on One-stage VG is available at ReSC.***** A Fast and Accurate One-Stage Approach to Visual Grou

Zhengyuan Yang 118 Dec 05, 2022