A large-scale (194k), Multiple-Choice Question Answering (MCQA) dataset designed to address realworld medical entrance exam questions.

Overview

MedMCQA

MedMCQA : A Large-scale Multi-Subject Multi-Choice Dataset for Medical domain Question Answering

A large-scale, Multiple-Choice Question Answering (MCQA) dataset designed to address realworld medical entrance exam questions.

The MedMCQA task can be formulated as X = {Q, O} where Q represents the questions in the text, O represents the candidate options, multiple candidate answers are given for each question O = {O1, O2, ..., On}. The goal is to select the single or multiple answers from the option set.

If you would like to use the data or code, please cite the paper:

@InProceedings{pmlr-v174-pal22a,
  title = 	 {MedMCQA: A Large-scale Multi-Subject Multi-Choice Dataset for Medical domain Question Answering},
  author =       {Pal, Ankit and Umapathi, Logesh Kumar and Sankarasubbu, Malaikannan},
  booktitle = 	 {Proceedings of the Conference on Health, Inference, and Learning},
  pages = 	 {248--260},
  year = 	 {2022},
  editor = 	 {Flores, Gerardo and Chen, George H and Pollard, Tom and Ho, Joyce C and Naumann, Tristan},
  volume = 	 {174},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {07--08 Apr},
  publisher =    {PMLR},
  pdf = 	 {https://proceedings.mlr.press/v174/pal22a/pal22a.pdf},
  url = 	 {https://proceedings.mlr.press/v174/pal22a.html},
  abstract = 	 {This paper introduces MedMCQA, a new large-scale, Multiple-Choice Question Answering (MCQA) dataset designed to address real-world medical entrance exam questions. More than 194k high-quality AIIMS & NEET PG entrance exam MCQs covering 2.4k healthcare topics and 21 medical subjects are collected with an average token length of 12.77 and high topical diversity. Each sample contains a question, correct answer(s), and other options which requires a deeper language understanding as it tests the 10+ reasoning abilities of a model across a wide range of medical subjects & topics. A detailed explanation of the solution, along with the above information, is provided in this study.}
}

GitHub license GitHub commit PRs Welcome

Dataset Description

Links
Homepage: https://medmcqa.github.io
Repository: https://github.com/medmcqa/medmcqa
Paper: https://arxiv.org/abs/2203.14371
Leaderboard: https://paperswithcode.com/dataset/medmcqa
Point of Contact: Aaditya Ura, Logesh

Dataset Summary

MedMCQA is a large-scale, Multiple-Choice Question Answering (MCQA) dataset designed to address real-world medical entrance exam questions.

MedMCQA has more than 194k high-quality AIIMS & NEET PG entrance exam MCQs covering 2.4k healthcare topics and 21 medical subjects are collected with an average token length of 12.77 and high topical diversity.

Each sample contains a question, correct answer(s), and other options which require a deeper language understanding as it tests the 10+ reasoning abilities of a model across a wide range of medical subjects & topics. A detailed explanation of the solution, along with the above information, is provided in this study.

MedMCQA provides an open-source dataset for the Natural Language Processing community. It is expected that this dataset would facilitate future research toward achieving better QA systems. The dataset contains questions about the following topics:

  • Anesthesia
  • Anatomy
  • Biochemistry
  • Dental
  • ENT
  • Forensic Medicine (FM)
  • Obstetrics and Gynecology (O&G)
  • Medicine
  • Microbiology
  • Ophthalmology
  • Orthopedics
  • Pathology
  • Pediatrics
  • Pharmacology
  • Physiology
  • Psychiatry
  • Radiology
  • Skin
  • Preventive & Social Medicine (PSM)
  • Surgery

Requirements

pip3 install -r requirements.txt

Data Download and Preprocessing

download the data from below link

data : https://drive.google.com/uc?export=download&id=15VkJdq5eyWIkfb_aoD3oS8i4tScbHYky

Experiments code

To run the experiments mentioned in the paper, follow the below steps

  • Clone the repo
  • Install the dependencies

pip3 install -r requirements.txt

  • Download the data from google drive link
  • Unzip the data
  • run below command with the data path

python3 train.py --model bert-base-uncased --dataset_folder_name "/content/medmcqa_data/"

Supported Tasks and Leaderboards

multiple-choice-QA, open-domain-QA: The dataset can be used to train a model for multi-choice questions answering, open domain questions answering. Questions in these exams are challenging and generally require deeper domain and language understanding as it tests the 10+ reasoning abilities across a wide range of medical subjects & topics.

Languages

The questions and answers are available in English.

Dataset Structure

Data Instances

{
    "question":"A 40-year-old man presents with 5 days of productive cough and fever. Pseudomonas aeruginosa is isolated from a pulmonary abscess. CBC shows an acute effect characterized by marked leukocytosis (50,000 mL) and the differential count reveals a shift to left in granulocytes. Which of the following terms best describes these hematologic findings?",
    "exp": "Circulating levels of leukocytes and their precursors may occasionally reach very high levels (>50,000 WBC mL). These extreme elevations are sometimes called leukemoid reactions because they are similar to the white cell counts observed in leukemia, from which they must be distinguished. The leukocytosis occurs initially because of the accelerated release of granulocytes from the bone marrow (caused by cytokines, including TNF and IL-1) There is a rise in the number of both mature and immature neutrophils in the blood, referred to as a shift to the left. In contrast to bacterial infections, viral infections (including infectious mononucleosis) are characterized by lymphocytosis Parasitic infestations and certain allergic reactions cause eosinophilia, an increase in the number of circulating eosinophils. Leukopenia is defined as an absolute decrease in the circulating WBC count.",
    "cop":1,
    "opa":"Leukemoid reaction",
    "opb":"Leukopenia",
    "opc":"Myeloid metaplasia",
    "opd":"Neutrophilia",
    "subject_name":"Pathology",
    "topic_name":"Basic Concepts and Vascular changes of Acute Inflammation",
    "id":"4e1715fe-0bc3-494e-b6eb-2d4617245aef",
    "choice_type":"single"
}

Data Fields

  • id : a string question identifier for each example
  • question : question text (a string)
  • opa : Option A
  • opb : Option B
  • opc : Option C
  • opd : Option D
  • cop : Correct option (Answer of the question)
  • choice_type : Question is single-choice or multi-choice
  • exp : Expert's explanation of the answer
  • subject_name : Medical Subject name of the particular question
  • topic_name : Medical topic name from the particular subject

Data Splits

The goal of MedMCQA is to emulate the rigor of real word medical exams. To enable that, a predefined split of the dataset is provided. The split is by exams instead of the given questions. This also ensures the reusability and generalization ability of the models.

  • The training set of MedMCQA consists of all the collected mock & online test series.
  • The test set consists of all AIIMS PG exam MCQs (years 1991-present).
  • The development set consists of NEET PG exam MCQs (years 2001-present) to approximate real exam evaluation.

Similar questions from train , test and dev set were removed based on similarity. The final split sizes are as follow:

Train Valid Test
Question # 182,822 6,150 4,183
Vocab 94,231 11,218 10,800
Max Ques tokens 220 135 88
Max Ans tokens 38 21 25

Model Submission and Test Set Evaluation

To preserve the integrity of test results, we do not release the test set's ground-truth to the public. Instead, we require you to use the test-set to evaluate the model and send the predictions along with unique question id id in csv format to below email addresses

Example :

id Prediction (correct option)
84f328d3-fca4-422d-8fb2-19d55eb31503 2
bb85e248-b2e9-48e8-a887-67c1aff15b6d 3

aadityaura [at] gmail.com, logesh.umapathi [at] saama.com

Owner
MedMCQA
MedMCQA : A Large-scale Multi-Subject Multi-Choice Dataset for Medical domain Question Answering
MedMCQA
Various capabilities for static malware analysis.

Malchive The malchive serves as a compendium for a variety of capabilities mainly pertaining to malware analysis, such as scripts supporting day to da

MITRE Cybersecurity 64 Nov 22, 2022
Full Spectrum Bioinformatics - a free online text designed to introduce key topics in Bioinformatics using the Python

Full Spectrum Bioinformatics is a free online text designed to introduce key topics in Bioinformatics using the Python programming language. The text is written in interactive Jupyter Notebooks, whic

Jesse Zaneveld 33 Dec 28, 2022
Contract Understanding Atticus Dataset

Contract Understanding Atticus Dataset This repository contains code for the Contract Understanding Atticus Dataset (CUAD), a dataset for legal contra

The Atticus Project 273 Dec 17, 2022
Ecco is a python library for exploring and explaining Natural Language Processing models using interactive visualizations.

Visualize, analyze, and explore NLP language models. Ecco creates interactive visualizations directly in Jupyter notebooks explaining the behavior of Transformer-based language models (like GPT2, BER

Jay Alammar 1.6k Dec 25, 2022
Prithivida 690 Jan 04, 2023
An ultra fast tiny model for lane detection, using onnx_parser, TensorRTAPI, torch2trt to accelerate. our model support for int8, dynamic input and profiling. (Nvidia-Alibaba-TensoRT-hackathon2021)

Ultra_Fast_Lane_Detection_TensorRT An ultra fast tiny model for lane detection, using onnx_parser, TensorRTAPI to accelerate. our model support for in

steven.yan 121 Dec 27, 2022
In this repository we have tested 3 VQA models on the ImageCLEF-2019 dataset.

Med-VQA In this repository we have tested 3 VQA models on the ImageCLEF-2019 dataset. Two of these are made on top of Facebook AI Reasearch's Multi-Mo

Kshitij Ambilduke 8 Apr 14, 2022
A Python wrapper for simple offline real-time dictation (speech-to-text) and speaker-recognition using Vosk.

Simple-Vosk A Python wrapper for simple offline real-time dictation (speech-to-text) and speaker-recognition using Vosk. Check out the official Vosk G

2 Jun 19, 2022
A framework for evaluating Knowledge Graph Embedding Models in a fine-grained manner.

A framework for evaluating Knowledge Graph Embedding Models in a fine-grained manner.

NEC Laboratories Europe 13 Sep 08, 2022
ChainKnowledgeGraph, 产业链知识图谱包括A股上市公司、行业和产品共3类实体

ChainKnowledgeGraph, 产业链知识图谱包括A股上市公司、行业和产品共3类实体,包括上市公司所属行业关系、行业上级关系、产品上游原材料关系、产品下游产品关系、公司主营产品、产品小类共6大类。 上市公司4,654家,行业511个,产品95,559条、上游材料56,824条,上级行业480条,下游产品390条,产品小类52,937条,所属行业3,946条。

liuhuanyong 415 Jan 06, 2023
Yes it's true :broken_heart:

Information WARNING: No longer hosted If you would like to be on this repo's readme simply fork or star it! Forks 1 - Flowzii 2 - Errorcrafter 3 - vk-

Dropout 66 Dec 31, 2022
Code for "Finetuning Pretrained Transformers into Variational Autoencoders"

transformers-into-vaes Code for Finetuning Pretrained Transformers into Variational Autoencoders (our submission to NLP Insights Workshop 2021). Gathe

Seongmin Park 22 Nov 26, 2022
Paradigm Shift in NLP - "Paradigm Shift in Natural Language Processing".

Paradigm Shift in NLP Welcome to the webpage for "Paradigm Shift in Natural Language Processing". Some resources of the paper are constantly maintaine

Tianxiang Sun 41 Dec 30, 2022
A Telegram bot to add notes to Flomo.

flomo bot 使用 Telegram 机器人发送笔记到你的 Flomo. 你需要有一台可访问 Telegram 的服务器。 Steps @BotFather 新建机器人,获取 token Flomo 官网获取 API,链接 https://flomoapp.com/mine?source=in

Zhen 44 Dec 30, 2022
Awesome Treasure of Transformers Models Collection

💁 Awesome Treasure of Transformers Models for Natural Language processing contains papers, videos, blogs, official repo along with colab Notebooks. 🛫☑️

Ashish Patel 577 Jan 07, 2023
Pre-training with Extracted Gap-sentences for Abstractive SUmmarization Sequence-to-sequence models

PEGASUS library Pre-training with Extracted Gap-sentences for Abstractive SUmmarization Sequence-to-sequence models, or PEGASUS, uses self-supervised

Google Research 1.4k Dec 22, 2022
Code for using and evaluating SpanBERT.

SpanBERT This repository contains code and models for the paper: SpanBERT: Improving Pre-training by Representing and Predicting Spans. If you prefer

Meta Research 798 Dec 30, 2022
code for modular summarization work published in ACL2021 by Krishna et al

This repository contains the code for running modular summarization pipelines as described in the publication Krishna K, Khosla K, Bigham J, Lipton ZC

Approximately Correct Machine Intelligence (ACMI) Lab 21 Nov 24, 2022
Graph Coloring - Weighted Vertex Coloring Problem

Graph Coloring - Weighted Vertex Coloring Problem This project proposes several local searches and an MCTS algorithm for the weighted vertex coloring

Cyril 1 Jul 08, 2022
LightSpeech: Lightweight and Fast Text to Speech with Neural Architecture Search

LightSpeech UnOfficial PyTorch implementation of LightSpeech: Lightweight and Fast Text to Speech with Neural Architecture Search.

Rishikesh (ऋषिकेश) 54 Dec 03, 2022