Code for the ACL 2021 paper "Structural Guidance for Transformer Language Models"

Overview


This repository accompanies the paper, Structural Guidance for Transformer Language Models, published in ACL 2021. It includes implementations of parsing-as-language-modelling and structural scaffolding for Transformer language models.

Environment

The code is written in Python 3. Install the required modules with

bash scripts/download_and_patch_transformers.sh
pip install -r requirements.txt
python -c "import nltk;nltk.download('punkt')"

The Huggingface transformers library is modified indirectly through a patch. If you modify the patched transformers code, regenerate the patch by running

bash scripts/generate_patch.sh

and then commit the updated patch.

Data preparation

Prepare parsing oracle files

PLM and ScLM require syntactic parses to derive the action sequence oracle. The following commands demonstrate how to prepare oracle files for these models.

python src/get_oracle.py --gen --fpath train.txt > train_gen.oracle
python src/get_oracle.py --gen --fpath dev.txt > dev_gen.oracle
python src/get_oracle.py --gen --fpath test.txt > test_gen.oracle
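
For intuition, a generative oracle represents each parsed sentence as a top-down action sequence of nonterminal openings, word generations, and reductions. The snippet below is a minimal sketch of that conversion for an already-bracketed tree; the exact action inventory and file layout produced by src/get_oracle.py may differ.

# Minimal sketch: turn a bracketed parse into a top-down generative action
# sequence (NT(X), GEN(word), REDUCE). The exact action names and output
# format of src/get_oracle.py may differ.
from nltk import Tree

def tree_to_actions(tree):
    # Preterminal (POS tag over a single word): emit the word directly.
    if len(tree) == 1 and not isinstance(tree[0], Tree):
        return [f"GEN({tree[0]})"]
    actions = [f"NT({tree.label()})"]
    for child in tree:
        actions.extend(tree_to_actions(child))
    actions.append("REDUCE")
    return actions

example = Tree.fromstring("(S (NP (DT the) (NN cat)) (VP (VBZ sleeps)))")
print(" ".join(tree_to_actions(example)))
# NT(S) NT(NP) GEN(the) GEN(cat) REDUCE NT(VP) GEN(sleeps) REDUCE REDUCE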

Prepare action ngram list

The following command generates the action ngram list for ScLM models. The training code of ScLM assumes that the action ngram list is stored in the root folder of the repository.

python src/get_action_ngram_list.py -f path/to/bllip-lg_train_gen.oracle path/to/bllip-lg_dev_gen.oracle -o bllip-lg_action_ngram_list.txt
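
The action ngram list defines the targets that the structural scaffold predicts. Below is a minimal sketch of how such a list could be assembled from oracle action sequences; the n-gram sizes, frequency cutoffs, and output format used by src/get_action_ngram_list.py may differ.

# Minimal sketch: count action n-grams over oracle action sequences and keep
# the most frequent ones. src/get_action_ngram_list.py may use different
# n-gram sizes, thresholds, and output formatting.
from collections import Counter

def count_action_ngrams(action_seqs, n=2):
    counts = Counter()
    for actions in action_seqs:
        for i in range(len(actions) - n + 1):
            counts[tuple(actions[i:i + n])] += 1
    return counts

seqs = [["NT(S)", "NT(NP)", "REDUCE", "NT(VP)", "REDUCE", "REDUCE"]]
for ngram, freq in count_action_ngrams(seqs, n=2).most_common(5):
    print(" ".join(ngram), freq)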

Vanilla Language Models (LM)

The script src/lm.py implements a vanilla Transformer language model. Below are the commands for model training and evaluation, as well as commands to compute word-level surprisals from a trained model.

# Model training
python src/lm.py --train_data train.txt --dev_data dev.txt --lr 1e-5 --epochs ${EPOCHS} --seed ${SEED} --do_train --random_init --batch_size ${BATCH_SIZE} --report ${REPORT} --sample_every ${SAMPLE_EVERY} --model_path ${MODEL_PATH}

# Compute word-level perplexity
python src/lm.py --restore_from ${MODEL_PATH} --test_data test.txt --do_test

# Estimate word surprisals
python src/lm.py --restore_from ${MODEL_PATH} --do_eval --fpath ${TEST_SUITE_PATH} --pretokenized > ${OUTPUT_PATH}
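
For reference, word-level surprisal is typically computed by summing the negative log-probabilities of a word's subword tokens given their left context. The snippet below is a minimal sketch of that computation, using the pretrained gpt2 checkpoint from Huggingface purely for illustration; src/lm.py uses its own trained checkpoints and may differ in tokenization, context handling, and log base.

# Minimal sketch: word surprisal as the sum of subword negative
# log-probabilities under a GPT-2-style LM.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def word_surprisals(words):
    # Tokenize each word separately (with a leading space after the first
    # word, as GPT-2's BPE expects) so subword pieces map back to words.
    pieces = [tokenizer.encode(w if i == 0 else " " + w) for i, w in enumerate(words)]
    ids = torch.tensor([[tokenizer.bos_token_id] + [t for p in pieces for t in p]])
    with torch.no_grad():
        log_probs = torch.log_softmax(model(ids).logits, dim=-1)
    surprisals, pos = [], 1
    for p in pieces:
        s = 0.0
        for tok in p:
            s += -log_probs[0, pos - 1, tok].item()  # -log P(token | left context), in nats
            pos += 1
        surprisals.append(s)
    return surprisals

words = "The cat sleeps .".split()
print(list(zip(words, word_surprisals(words))))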

Scaffolded Language Models (ScLM)

The script src/lm-sc.py implements a Transformer language model with structural prediction as an auxiliary task, referred to as ScLM. The command-line variable ${SCAFFOLD_TYPE} can be set to past or next, corresponding to ScLM-past or ScLM-next in the paper, respectively.

# Model training  
python src/lm-sc.py --train_data train_gen.oracle --dev_data dev_gen.oracle --lr 1e-5 --epochs ${EPOCHS} --seed ${SEED} --do_train --random_init --batch_size ${BATCH_SIZE} --report ${REPORT} --sample_every ${SAMPLE_EVERY} --alpha 0.5 --scaffold_type ${SCAFFOLD_TYPE} --model_path ${MODEL_PATH}

# Compute word-level perplexity
python src/lm-sc.py --restore_from ${MODEL_PATH} --test_data test_gen.oracle --do_test

# Estimate word surprisals
python src/lm-sc.py --restore_from ${MODEL_PATH} --do_eval --fpath ${TEST_SUITE_PATH} --pretokenized > ${OUTPUT_PATH}
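
For reference, --alpha controls how the word-prediction loss and the auxiliary structural-prediction loss are combined during training. The sketch below assumes a simple convex combination of two cross-entropy terms; the exact formulation in src/lm-sc.py, and how scaffold targets are built from the action ngram list, may differ.

# Minimal sketch of a scaffolded training objective: interpolate the
# word-level LM loss with an auxiliary structural-prediction loss.
import torch.nn.functional as F

def scaffolded_loss(word_logits, word_targets, action_logits, action_targets, alpha=0.5):
    # Word-level language modelling loss.
    lm_loss = F.cross_entropy(word_logits.view(-1, word_logits.size(-1)),
                              word_targets.view(-1))
    # Auxiliary loss for predicting the structural scaffold (action n-grams).
    aux_loss = F.cross_entropy(action_logits.view(-1, action_logits.size(-1)),
                               action_targets.view(-1))
    return alpha * lm_loss + (1.0 - alpha) * aux_loss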

Parsing as Language Modelling (PLM/PLM-mask)

The script src/plm-gen.py implements the idea of generative parsing as language modelling, a probabilistic model over top-down parsing action sequences. There are two variants, PLM and PLM-mask.

For PLM:

# Model training for PLM
python src/plm-gen.py --train_data train_gen.oracle --dev_data dev_gen.oracle --lr 1e-5 --epochs ${EPOCHS} --seed ${SEED} --do_train --batch_size ${BATCH_SIZE} --random_init --report ${REPORT} --sample_every ${SAMPLE_EVERY} --model_path ${MODEL_PATH}

# Estimate word-level perplexity with PLM
python src/plm-gen.py --restore_from ${MODEL_PATH} --test_data test_gen.oracle --do_test

# Estimate word surprisals with PLM
python src/plm-gen.py --restore_from ${MODEL_PATH} --do_eval --beam_size 100 --word_beam_size 10 --fast_track_size 5 --pretokenized --fpath ${TEST_SUITE_PATH} > ${OUTPUT_PATH} 2>${EVAL_LOG_PATH}
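
Word surprisals under PLM require approximately marginalizing over parses, which the script does with word-synchronous beam search controlled by --beam_size, --word_beam_size, and --fast_track_size. The sketch below only illustrates the final marginalization step over joint beam scores; the actual search procedure in src/plm-gen.py is considerably more involved.

# Minimal sketch: approximate word surprisal under a generative parser by
# marginalizing over the parses kept in the beam.
import torch

def approx_word_surprisal(beam_scores_before, beam_scores_after):
    # Each tensor holds log P(words so far, partial parse) for the items kept
    # in the beam before and after consuming the next word.
    log_p_before = torch.logsumexp(beam_scores_before, dim=0)
    log_p_after = torch.logsumexp(beam_scores_after, dim=0)
    return -(log_p_after - log_p_before)  # surprisal of the next word, in nats

print(approx_word_surprisal(torch.tensor([-3.2, -4.1, -5.0]),
                            torch.tensor([-7.8, -8.3, -9.1])))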

For PLM-mask:

# Model training for PLM-mask
python src/plm-gen.py --train_data train_gen.oracle --dev_data dev_gen.oracle --lr 1e-5 --epochs ${EPOCHS} --seed ${SEED} --do_train --batch_size ${BATCH_SIZE} --random_init --add_structured_mask --buffer_head 0 --stack_head 1 --report ${REPORT} --sample_every ${SAMPLE_EVERY} --model_path ${MODEL_PATH}

# Estimate word-level perplexity with PLM-mask
python src/plm-gen.py --restore_from ${MODEL_PATH} --add_structured_mask --buffer_head 0 --stack_head 1 --test_data test_gen.oracle --do_test

# Estimate word surprisals with PLM-mask
python src/plm-gen.py --restore_from ${MODEL_PATH} --add_structured_mask --buffer_head 0 --stack_head 1 --do_eval --beam_size 100 --word_beam_size 10 --fast_track_size 5 --pretokenized --fpath ${TEST_SUITE_PATH} > ${OUTPUT_PATH} 2>>${EVAL_LOG_PATH}
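
The --add_structured_mask option with --buffer_head 0 --stack_head 1 dedicates specific attention heads to different parts of the parser state. The sketch below only illustrates the general idea of head-specific attention masks layered on a causal mask; how stack and buffer membership are actually derived from the oracle in src/plm-gen.py may differ.

# Minimal sketch: head-specific attention masks, where one head is restricted
# to "stack" positions and another to "buffer" positions, on top of an
# ordinary causal mask.
import torch

def head_specific_masks(seq_len, num_heads, stack_positions, buffer_positions,
                        buffer_head=0, stack_head=1):
    # Start every head from a causal (lower-triangular) mask.
    causal = torch.ones(seq_len, seq_len).tril().bool()
    masks = causal.unsqueeze(0).repeat(num_heads, 1, 1)
    stack = torch.zeros(seq_len, dtype=torch.bool)
    stack[stack_positions] = True
    buffer = torch.zeros(seq_len, dtype=torch.bool)
    buffer[buffer_positions] = True
    # Restrict the designated heads to attend only within their region.
    masks[stack_head] &= stack.unsqueeze(0)
    masks[buffer_head] &= buffer.unsqueeze(0)
    return masks  # (num_heads, seq_len, seq_len), True = position may be attended to

masks = head_specific_masks(seq_len=6, num_heads=4,
                            stack_positions=[0, 1, 4], buffer_positions=[2, 3, 5])
print(masks.shape)  # torch.Size([4, 6, 6])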

Plot figures

The analysis folder contains the code and model evaluation results for generating the figures in the paper. The following commands run the plotting scripts and save the figures to the figs folder. The Python packages matplotlib and pandas are required to run the plotting scripts. RNNG results are taken from Hu et al. (2020).

cd analysis
mkdir -p figs

# Plot results on SG Test Suites and BLiMP-10%.
python analysis_sg.py
python analysis_blimp.py

Acknowledgements

We thank Ramon Astudillo and Tahira Naseem for their contributions to the repository.
