Leaf: Multiple-Choice Question Generation

Last update: Dec 20, 2022

Overview

An easy-to-use and easy-to-understand multiple-choice question generation algorithm built on T5 Transformers. The application accepts a short passage of text and uses two fine-tuned T5 Transformer models: the first generates multiple question-answer pairs for the given text, and the second uses those pairs to generate distractors, the additional incorrect options used to confuse the test taker.

Originally inspired by a Bachelor's machine learning course (github link) and then continued as a topic for my Master's thesis at Sofia University, Bulgaria.

ECIR 2022 Demonstration paper

This work has been accepted as a demo paper for the ECIR 2022 conference.

Video demonstration: here

Live demo: coming soon

Paper: will be uploaded before the conference - 14th April 2022

Abstract: Testing with quiz questions has proven to be an effective strategy for better educational processes. However, manually creating quizzes is a tedious and time-consuming task. To address this challenge, we present Leaf, a system for generating multiple-choice questions from factual text. In addition to being very well suited for classroom settings, Leaf could be also used in an industrial setup, e.g., to facilitate onboarding and knowledge sharing, or as a component of chatbots, question answering systems, or Massive Open Online Courses (MOOCs).

Generating question and answer pairs

To generate the question-answer pairs, we fine-tuned a T5 transformer model from Hugging Face on the SQuAD 1.1 dataset, a reading comprehension dataset consisting of questions posed by crowdworkers on a set of Wikipedia articles.

The model accepts the target answer and context as input:

'answer' + '<sep>' + 'context'

and outputs a question to which the given answer is the correct response for the corresponding text:

'answer' + '<sep>' + 'question'

To generate question-answer pairs without providing a target answer, we have also trained the model to produce both the answer and the question when the '[MASK]' token is passed in place of the target answer.

'[MASK]' + '<sep>' + 'context'
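The input and output formats above can be sketched in plain Python. This is only an illustration of the string layout, not the project's actual code: the helper names `build_qg_input` and `parse_qg_output` are hypothetical, the separator is assumed to be spelled `<sep>`, and the single spaces around it are an assumption as well.

```python
SEP = "<sep>"    # assumed spelling of the separator token
MASK = "[MASK]"  # placeholder passed when no target answer is given


def build_qg_input(context, answer=None):
    """Build the model input: answer (or [MASK]), separator, then context."""
    target = MASK if answer is None else answer
    # Spacing around the separator is an assumption of this sketch.
    return f"{target} {SEP} {context}"


def parse_qg_output(decoded):
    """Split a decoded 'answer <sep> question' string into (answer, question)."""
    answer, _, question = decoded.partition(SEP)
    return answer.strip(), question.strip()


context = "Oxygen is the chemical element with the symbol O and atomic number 8."
print(build_qg_input(context, answer="8"))  # answer-aware input
print(build_qg_input(context))              # answer-free input using [MASK]
print(parse_qg_output("8 <sep> What is the atomic number of oxygen?"))
```

Keeping the formatting in two small functions makes it easy to reuse the same layout for both the answer-aware and the '[MASK]' case.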

The full training script can be found in the training directory or accessed directly in Google Colab.

Generating incorrect options (distractors)

To generate the distractors, another T5 transformer model has been fine-tuned, this time on the RACE dataset, which consists of more than 28,000 passages and nearly 100,000 questions. The dataset is collected from English examinations in China that are designed for middle school and high school students.

The model accepts the target answer, question and context as input:

'answer' + '<sep>' + 'question' + 'context'

and outputs 3 distractors separated by the '<sep>' token.

'distractor1' + '<sep>' + 'distractor2' + '<sep>' + 'distractor3'
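As a rough sketch of the distractor format, the snippet below builds the input string and splits a decoded output back into individual options. The function names are hypothetical, the separator is assumed to be spelled `<sep>`, and the whitespace handling and de-duplication are assumptions of this example rather than documented behaviour.

```python
SEP = "<sep>"  # assumed spelling of the separator token


def build_distractor_input(answer, question, context):
    """Build the model input: answer, separator, then question and context."""
    # Spacing between the parts is an assumption of this sketch.
    return f"{answer} {SEP} {question} {context}"


def parse_distractors(decoded, expected=3):
    """Split decoded model output on the separator into unique distractors."""
    distractors = []
    for part in decoded.split(SEP):
        part = part.strip()
        # Skip empty fragments and exact repeats.
        if part and part not in distractors:
            distractors.append(part)
    return distractors[:expected]


print(build_distractor_input("8", "What is the atomic number of oxygen?", "Oxygen is ..."))
print(parse_distractors("6 <sep> 16 <sep> 18"))
```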

The full training script can be found in the training directory or accessed directly in Google Colab.

To extend the variety of distractors with simple words that are not so closely related to the context, we have also used sense2vec word embeddings in cases where the T5 model does not produce good enough distractors.
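One way to combine the two sources is to top up the T5 output with fallback candidates. The sketch below is only an assumption about how that might look: the text does not say how "good enough" is judged, so this example simply treats "fewer than the wanted number of unique options" as the trigger. The function `merge_distractors` is hypothetical, and `fallback_candidates` stands in for words that would come from sense2vec.

```python
def merge_distractors(answer, model_distractors, fallback_candidates, wanted=3):
    """Keep model distractors first, then fill the gaps from the fallback list."""
    chosen = []
    used = {answer.strip().lower()}  # never offer the correct answer as a distractor
    for candidate in list(model_distractors) + list(fallback_candidates):
        key = candidate.strip().lower()
        if key and key not in used:
            used.add(key)
            chosen.append(candidate.strip())
        if len(chosen) == wanted:
            break
    return chosen


# The model produced one usable distractor; the rest come from the fallback list.
print(merge_distractors("Paris", ["paris", "Rome"], ["Berlin", "Rome", "Madrid", "Oslo"]))
```

Comparing in lowercase avoids offering the same option twice with different capitalisation.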

Web application

To demonstrate the algorithm, a simple Angular web application has been created. It accepts a paragraph along with the desired number of questions and outputs each generated question, which can then be edited. The algorithm is exposed through a simple REST API built with Flask, which the web app consumes.

The code for the web application is located in a separate repository here.

Installation guide

Creating a virtual environment (optional)

To avoid conflicts with Python packages from other projects, it is good practice to create a virtual environment in which the packages will be installed. If you do not want to do this, you can skip the next commands and install the packages from the requirements.txt file directly.

Create a virtual environment:

python -m venv venv

Enter the virtual environment:

Windows:

. .\venv\Scripts\activate

Linux or macOS:

source venv/bin/activate

Installing packages

pip install -r requirements.txt

Downloading data

Question-answer model

Download the multitask-qg-ag model checkpoint and place it in the app/ml_models/question_generation/models/ directory.

Distractor generation

Download the race-distractors model checkpoint and place it in the app/ml_models/distractor_generation/models/ directory.

Download sense2vec, extract it, and place the s2v_old folder in the app/ml_models/sense2vec_distractor_generation/models/ directory.
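After downloading, a quick check that the files landed in the right places can save a confusing startup error. The script below is a minimal sketch, not part of the project: `missing_model_dirs` is a hypothetical helper, and it only verifies that the three directories named above exist and are not empty, not that the checkpoints themselves are valid.

```python
from pathlib import Path

# Directories named in the download instructions above.
MODEL_DIRS = [
    "app/ml_models/question_generation/models",
    "app/ml_models/distractor_generation/models",
    "app/ml_models/sense2vec_distractor_generation/models",
]


def missing_model_dirs(root="."):
    """Return the expected model directories that are absent or empty."""
    missing = []
    for rel in MODEL_DIRS:
        path = Path(root) / rel
        if not path.is_dir() or not any(path.iterdir()):
            missing.append(rel)
    return missing


if __name__ == "__main__":
    # Run from the repository root.
    for rel in missing_model_dirs():
        print(f"Missing or empty: {rel}")
```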

Training on your own

The training scripts are available in the training directory. You can download the notebooks directly from there or open the Question-Answer Generation and Distractor Generation notebooks in Google Colab.

Owner

Kristiyan Vachev
