Source code and dataset for ACL 2019 paper "ERNIE: Enhanced Language Representation with Informative Entities"

Related tags

Text Data & NLPERNIE
Overview

ERNIE

Source code and dataset for "ERNIE: Enhanced Language Representation with Informative Entities"

Reqirements:

  • Pytorch>=0.4.1
  • Python3
  • tqdm
  • boto3
  • requests
  • apex (If you want to use fp16, you should make sure the commit is 79ad5a88e91434312b43b4a89d66226be5f2cc98.)

Prepare Pre-train Data

Run the following command to create training instances.

  # Download Wikidump
  wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
  # Download anchor2id
  wget -c https://cloud.tsinghua.edu.cn/f/1c956ed796cb4d788646/?dl=1 -O anchor2id.txt
  # WikiExtractor
  python3 pretrain_data/WikiExtractor.py enwiki-latest-pages-articles.xml.bz2 -o pretrain_data/output -l --min_text_length 100 --filter_disambig_pages -it abbr,b,big --processes 4
  # Modify anchors with 4 processes
  python3 pretrain_data/extract.py 4
  # Preprocess with 4 processes
  python3 pretrain_data/create_ids.py 4
  # create instances
  python3 pretrain_data/create_insts.py 4
  # merge
  python3 code/merge.py

If you want to get anchor2id by yourself, run the following code(this will take about half a day) after python3 pretrain_data/extract.py 4

  # extract anchors
  python3 pretrain_data/utils.py get_anchors
  # query Mediawiki api using anchor link to get wikibase item id. For more details, see https://en.wikipedia.org/w/api.php?action=help.
  python3 pretrain_data/create_anchors.py 256 
  # aggregate anchors 
  python3 pretrain_data/utils.py agg_anchors

Run the following command to pretrain:

  python3 code/run_pretrain.py --do_train --data_dir pretrain_data/merge --bert_model ernie_base --output_dir pretrain_out/ --task_name pretrain --fp16 --max_seq_length 256

We use 8 NVIDIA-2080Ti to pre-train our model and there are 32 instances in each GPU. It takes nearly one day to finish the training (1 epoch is enough).

Pre-trained Model

Download pre-trained knowledge embedding from Google Drive/Tsinghua Cloud and extract it.

tar -xvzf kg_embed.tar.gz

Download pre-trained ERNIE from Google Drive/Tsinghua Cloud and extract it.

tar -xvzf ernie_base.tar.gz

Note that the extraction may be not completed in Windows.

Fine-tune

As most datasets except FewRel don't have entity annotations, we use TAGME to extract the entity mentions in the sentences and link them to their corresponding entitoes in KGs. We provide the annotated datasets Google Drive/Tsinghua Cloud.

tar -xvzf data.tar.gz

In the root directory of the project, run the following codes to fine-tune ERNIE on different datasets.

FewRel:

python3 code/run_fewrel.py   --do_train   --do_lower_case   --data_dir data/fewrel/   --ernie_model ernie_base   --max_seq_length 256   --train_batch_size 32   --learning_rate 2e-5   --num_train_epochs 10   --output_dir output_fewrel   --fp16   --loss_scale 128
# evaluate
python3 code/eval_fewrel.py   --do_eval   --do_lower_case   --data_dir data/fewrel/   --ernie_model ernie_base   --max_seq_length 256   --train_batch_size 32   --learning_rate 2e-5   --num_train_epochs 10   --output_dir output_fewrel   --fp16   --loss_scale 128

TACRED:

python3 code/run_tacred.py   --do_train   --do_lower_case   --data_dir data/tacred   --ernie_model ernie_base   --max_seq_length 256   --train_batch_size 32   --learning_rate 2e-5   --num_train_epochs 4.0   --output_dir output_tacred   --fp16   --loss_scale 128 --threshold 0.4
# evaluate
python3 code/eval_tacred.py   --do_eval   --do_lower_case   --data_dir data/tacred   --ernie_model ernie_base   --max_seq_length 256   --train_batch_size 32   --learning_rate 2e-5   --num_train_epochs 4.0   --output_dir output_tacred   --fp16   --loss_scale 128 --threshold 0.4

FIGER:

python3 code/run_typing.py    --do_train   --do_lower_case   --data_dir data/FIGER   --ernie_model ernie_base   --max_seq_length 256   --train_batch_size 2048   --learning_rate 2e-5   --num_train_epochs 3.0   --output_dir output_figer  --gradient_accumulation_steps 32 --threshold 0.3 --fp16 --loss_scale 128 --warmup_proportion 0.2
# evaluate
python3 code/eval_figer.py    --do_eval   --do_lower_case   --data_dir data/FIGER   --ernie_model ernie_base   --max_seq_length 256   --train_batch_size 2048   --learning_rate 2e-5   --num_train_epochs 3.0   --output_dir output_figer  --gradient_accumulation_steps 32 --threshold 0.3 --fp16 --loss_scale 128 --warmup_proportion 0.2

OpenEntity:

python3 code/run_typing.py    --do_train   --do_lower_case   --data_dir data/OpenEntity   --ernie_model ernie_base   --max_seq_length 128   --train_batch_size 16   --learning_rate 2e-5   --num_train_epochs 10.0   --output_dir output_open --threshold 0.3 --fp16 --loss_scale 128
# evaluate
python3 code/eval_typing.py   --do_eval   --do_lower_case   --data_dir data/OpenEntity   --ernie_model ernie_base   --max_seq_length 128   --train_batch_size 16   --learning_rate 2e-5   --num_train_epochs 10.0   --output_dir output_open --threshold 0.3 --fp16 --loss_scale 128

Some code is modified from the pytorch-pretrained-BERT. You can find the explanation of most parameters in pytorch-pretrained-BERT.

As the annotations given by TAGME have confidence score, we use --threshlod to set the lowest confidence score and choose the annotations whose scores are higher than --threshold. In this experiment, the value is usually 0.3 or 0.4.

The script for the evaluation of relation classification just gives the accuracy score. For the macro/micro metrics, you should use code/score.py which is from tacred repo.

python3 code/score.py gold_file pred_file

You can find gold_file and pred_file on each checkpoint in the output folder (--output_dir).

New Tasks:

If you want to use ERNIE in new tasks, you should follow these steps:

  • Use an entity-linking tool like TAGME to extract the entities in the text
  • Look for the Wikidata ID of the extracted entities
  • Take the text and entities sequence as input data

Here is a quick-start example (code/example.py) using ERNIE for Masked Language Model. We show how to annotate the given sentence with TAGME and build the input data for ERNIE. Note that it will take some time (around 5 mins) to load the model.

# If you haven't installed tagme
pip install tagme
# Run example
python3 code/example.py

Cite

If you use the code, please cite this paper:

@inproceedings{zhang2019ernie,
  title={{ERNIE}: Enhanced Language Representation with Informative Entities},
  author={Zhang, Zhengyan and Han, Xu and Liu, Zhiyuan and Jiang, Xin and Sun, Maosong and Liu, Qun},
  booktitle={Proceedings of ACL 2019},
  year={2019}
}
Owner
THUNLP
Natural Language Processing Lab at Tsinghua University
THUNLP
This code extends the neural style transfer image processing technique to video by generating smooth transitions between several reference style images

Neural Style Transfer Transition Video Processing By Brycen Westgarth and Tristan Jogminas Description This code extends the neural style transfer ima

Brycen Westgarth 110 Jan 07, 2023
PyJPBoatRace: Python-based Japanese boatrace tools ๐Ÿšค

pyjpboatrace :speedboat: provides you with useful tools for data analysis and auto-betting for boatrace.

5 Oct 29, 2022
Yes it's true :broken_heart:

Information WARNING: No longer hosted If you would like to be on this repo's readme simply fork or star it! Forks 1 - Flowzii 2 - Errorcrafter 3 - vk-

Dropout 66 Dec 31, 2022
Share constant definitions between programming languages and make your constants constant again

Introduction Reconstant lets you share constant and enum definitions between programming languages. Constants are defined in a yaml file and converted

Natan Yellin 47 Sep 10, 2022
Uncomplete archive of files from the European Nopsled Team

European Nopsled CTF Archive This is an archive of collected material from various Capture the Flag competitions that the European Nopsled team played

European Nopsled 4 Nov 24, 2021
๋‰ด์Šค ๋„๋ฉ”์ธ ์งˆ์˜์‘๋‹ต ์‹œ์Šคํ…œ (21-1ํ•™๊ธฐ ์กธ์—… ํ”„๋กœ์ ํŠธ)

๋‰ด์Šค ๋„๋ฉ”์ธ ์งˆ์˜์‘๋‹ต ์‹œ์Šคํ…œ ๋ณธ ํ”„๋กœ์ ํŠธ๋Š” ๋‰ด์Šค๊ธฐ์‚ฌ์— ๋Œ€ํ•œ ์งˆ์˜์‘๋‹ต ์„œ๋น„์Šค ๋ฅผ ์ œ๊ณตํ•˜๊ธฐ ์œ„ํ•ด์„œ ์ง„ํ–‰ํ•œ ํ”„๋กœ์ ํŠธ์ž…๋‹ˆ๋‹ค. ์•ฝ 3๊ฐœ์›”๊ฐ„ ( 21. 03 ~ 21. 05 ) ์ง„ํ–‰ํ•˜์˜€์œผ๋ฉฐ Transformer ์•„ํ‚คํ…์ณ ๊ธฐ๋ฐ˜์˜ Encoder๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ํ•œ๊ตญ์–ด ์งˆ์˜์‘๋‹ต ๋ฐ์ดํ„ฐ์…‹์œผ๋กœ

TaegyeongEo 4 Jul 08, 2022
My Implementation for the paper EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks using Tensorflow

Easy Data Augmentation Implementation This repository contains my Implementation for the paper EDA: Easy Data Augmentation Techniques for Boosting Per

Aflah 9 Oct 31, 2022
CLIPfa: Connecting Farsi Text and Images

CLIPfa: Connecting Farsi Text and Images OpenAI released the paper Learning Transferable Visual Models From Natural Language Supervision in which they

Sajjad Ayoubi 66 Dec 14, 2022
Transformers and related deep network architectures are summarized and implemented here.

Transformers: from NLP to CV This is a practical introduction to Transformers from Natural Language Processing (NLP) to Computer Vision (CV) Introduct

Ibrahim Sobh 138 Dec 27, 2022
Precision Medicine Knowledge Graph (PrimeKG)

PrimeKG Website | bioRxiv Paper | Harvard Dataverse Precision Medicine Knowledge Graph (PrimeKG) presents a holistic view of diseases. PrimeKG integra

Machine Learning for Medicine and Science @ Harvard 103 Dec 10, 2022
SimBERTๅ‡็บง็‰ˆ๏ผˆSimBERTv2๏ผ‰๏ผ

RoFormer-Sim RoFormer-Sim๏ผŒๅˆ็งฐSimBERTv2๏ผŒๆ˜ฏๆˆ‘ไปฌไน‹ๅ‰ๅ‘ๅธƒ็š„SimBERTๆจกๅž‹็š„ๅ‡็บง็‰ˆใ€‚ ไป‹็ป https://kexue.fm/archives/8454 ่ฎญ็ปƒ tensorflow 1.14 + keras 2.3.1 + bert4keras 0.10.6 ไธ‹่ฝฝ

317 Dec 23, 2022
CCQA A New Web-Scale Question Answering Dataset for Model Pre-Training

CCQA: A New Web-Scale Question Answering Dataset for Model Pre-Training This is the official repository for the code and models of the paper CCQA: A N

Meta Research 29 Nov 30, 2022
Simple bots or Simbots is a library designed to create simple bots using the power of python. This library utilises Intent, Entity, Relation and Context model to create bots .

Simple bots or Simbots is a library designed to create simple chat bots using the power of python. This library utilises Intent, Entity, Relation and

14 Dec 15, 2021
ByT5: Towards a token-free future with pre-trained byte-to-byte models

ByT5: Towards a token-free future with pre-trained byte-to-byte models ByT5 is a tokenizer-free extension of the mT5 model. Instead of using a subword

Google Research 409 Jan 06, 2023
๐Ÿ“”๏ธ Generate a text-based journal from a template file.

JGen ๐Ÿ“”๏ธ Generate a text-based journal from a template file. Contents Getting Started Example Overview Usage Details Reserved Keywords Gotchas Getting

Harrison Broadbent 21 Sep 25, 2022
Big Bird: Transformers for Longer Sequences

BigBird, is a sparse-attention based transformer which extends Transformer based models, such as BERT to much longer sequences. Moreover, BigBird comes along with a theoretical understanding of the c

Google Research 457 Dec 23, 2022
Python code for ICLR 2022 spotlight paper EViT: Expediting Vision Transformers via Token Reorganizations

Expediting Vision Transformers via Token Reorganizations This repository contain

Youwei Liang 101 Dec 26, 2022
Faster, modernized fork of the language identification tool langid.py

py3langid py3langid is a fork of the standalone language identification tool langid.py by Marco Lui. Original license: BSD-2-Clause. Fork license: BSD

Adrien Barbaresi 12 Nov 05, 2022
Dรฉ op-de-vlucht Pieton vertaler. Wereldwijd gebruikt door meer dan 1.000+ succesvolle bedrijven!

Dรฉ op-de-vlucht Pieton vertaler. Wereldwijd gebruikt door meer dan 1.000+ succesvolle bedrijven!

Lau 1 Dec 17, 2021
TensorFlow code and pre-trained models for BERT

BERT ***** New March 11th, 2020: Smaller BERT Models ***** This is a release of 24 smaller BERT models (English only, uncased, trained with WordPiece

Google Research 32.9k Jan 08, 2023