SciFive: a text-to-text transformer model for biomedical literature

Last update: Dec 24, 2022

Overview

SciFive

SciFive provides a text-to-text framework for biomedical language and natural language processing (NLP). Built on the T5 framework and described in the paper SciFive: a text-to-text transformer model for biomedical literature, SciFive achieves state-of-the-art and competitive results on multiple biomedical natural language tasks.

Google Cloud Storage

Our base Google Cloud Storage URI is gs://scifive

As described in our paper, we release six versions of SciFive, each benchmarked to achieve state-of-the-art results on different biomedical tasks. All are available in our Google Cloud bucket, and we are also working on releasing the models on HuggingFace.

Instructions for accessing Cloud Storage from the command line with gsutil, a Python application, are described in the gsutil documentation.

gsutil URIs for the six SciFive models:

SciFive Pubmed+PMC Base: gs://scifive/models/pubmed_pmc/base
SciFive Pubmed+PMC Large: gs://scifive/models/pubmed_pmc/large
SciFive Pubmed Base: gs://scifive/models/pubmed/base
SciFive Pubmed Large: gs://scifive/models/pubmed/large
SciFive PMC Base: gs://scifive/models/pmc/base
SciFive PMC Large: gs://scifive/models/pmc/large
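
The six model URIs above follow one pattern: the bucket, then models, then the corpus, then the size. The sketch below builds a URI from those two choices and assembles a gsutil copy command without running it. The helper names scifive_model_uri and gsutil_copy_command and the ./checkpoints destination are hypothetical, chosen only for illustration; they are not part of the SciFive code.

```python
import shlex

BUCKET = "gs://scifive"
CORPORA = ("pubmed_pmc", "pubmed", "pmc")  # directory names taken from the URIs listed above
SIZES = ("base", "large")


def scifive_model_uri(corpus: str, size: str) -> str:
    """Return the Cloud Storage URI of one SciFive checkpoint directory (hypothetical helper)."""
    if corpus not in CORPORA:
        raise ValueError(f"unknown corpus {corpus!r}, expected one of {CORPORA}")
    if size not in SIZES:
        raise ValueError(f"unknown size {size!r}, expected one of {SIZES}")
    return f"{BUCKET}/models/{corpus}/{size}"


def gsutil_copy_command(corpus: str, size: str, dest: str = ".") -> list:
    """Build, but do not run, a recursive gsutil copy of one checkpoint directory."""
    return ["gsutil", "-m", "cp", "-r", scifive_model_uri(corpus, size), dest]


if __name__ == "__main__":
    # Print one copy command per released model; the destination is an assumption.
    for corpus in CORPORA:
        for size in SIZES:
            print(shlex.join(gsutil_copy_command(corpus, size, "./checkpoints")))
```

Returning the command as an argument list means it can be handed to subprocess.run as-is once gsutil is installed and configured, with no shell quoting to worry about.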

gsutil URIs for the pretraining data:

Pubmed: gs://scifive/pretrain/pubmed
PMC: gs://scifive/pretrain/pmc

Example

Below is an example of how to use SciFive from HuggingFace to generate MedNLI outputs. We also publish a SciFive model fine-tuned on MedNLI so that the experiments can be reproduced.

import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the SciFive checkpoint fine-tuned on MedNLI.
model_name = "razent/SciFive-large-Pubmed_PMC-MedNLI"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Use the GPU when one is available; otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

sent_1 = "In the ED, initial VS revealed T 98.9, HR 73, BP 121/90, RR 15, O2 sat 98% on RA."
sent_2 = "The patient is hemodynamically stable"
# MedNLI inputs use the "mednli:" task prefix followed by the sentence pair.
text = f"mednli: sentence1: {sent_1} sentence2: {sent_2}"

encoding = tokenizer(text, padding="max_length", max_length=256, return_tensors="pt")
input_ids = encoding["input_ids"].to(device)
attention_mask = encoding["attention_mask"].to(device)

outputs = model.generate(
    input_ids=input_ids, attention_mask=attention_mask,
    max_length=8,
    early_stopping=True
)

# Decode each generated sequence into its label text.
for output in outputs:
    line = tokenizer.decode(output, skip_special_tokens=True, clean_up_tokenization_spaces=True)
    print(line)
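
To run the example on more than one sentence pair, the input strings can be built in bulk first. The sketch below only reproduces the prompt format used above ("mednli: sentence1: ... sentence2: ..."). The function names build_mednli_input and build_mednli_batch and the second sentence pair are hypothetical, added for illustration.

```python
from typing import List, Tuple


def build_mednli_input(premise: str, hypothesis: str) -> str:
    """Format one sentence pair with the "mednli:" task prefix used in the example above."""
    return f"mednli: sentence1: {premise} sentence2: {hypothesis}"


def build_mednli_batch(pairs: List[Tuple[str, str]]) -> List[str]:
    """Format a list of (premise, hypothesis) pairs, preserving their order."""
    return [build_mednli_input(premise, hypothesis) for premise, hypothesis in pairs]


if __name__ == "__main__":
    pairs = [
        ("In the ED, initial VS revealed T 98.9, HR 73, BP 121/90, RR 15, O2 sat 98% on RA.",
         "The patient is hemodynamically stable"),
        # Made-up pair, for illustration only.
        ("The patient denies chest pain.", "The patient has chest pain"),
    ]
    for text in build_mednli_batch(pairs):
        print(text)
```

The resulting list can be passed to the tokenizer in a single call, for example tokenizer(texts, padding=True, truncation=True, max_length=256, return_tensors="pt"), and the batch of input_ids and attention_mask tensors given to model.generate as in the single-pair example.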

HuggingFace

SciFive Pubmed+PMC: Base | Large
SciFive Pubmed: Base | Large
SciFive PMC: Base | Large

Datasets

All of the fine-tuning datasets, already pre-processed into text-to-text format, are also available at this link.

📊 Expected Results

Citations

If you use the SciFive models or our code in a publication, please cite:

@misc{phan2021scifive,
      title={SciFive: a text-to-text transformer model for biomedical literature}, 
      author={Long N. Phan and James T. Anibal and Hieu Tran and Shaurya Chanana and Erol Bahadroglu and Alec Peltekian and Grégoire Altan-Bonnet},
      year={2021},
      eprint={2106.03598},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}


Owner

Long Phan
