Code for Two-stage Identifier: "Locate and Label: A Two-stage Identifier for Nested Named Entity Recognition"

Overview

README

Code for Two-stage Identifier: "Locate and Label: A Two-stage Identifier for Nested Named Entity Recognition", accepted at ACL 2021. For details of the model and experiments, please see our paper.

Setup

Requirements

conda create --name acl python=3.8
conda activate acl
pip install -r requirements.txt

Datasets

The datasets used in our experiments:

Data format:

 {
       "tokens": ["2004-12-20T15:37:00", "Microscopic", "microcap", "Everlast", ",", "mainly", "a", "maker", "of", "boxing", "equipment", ",", "has", "soared", "over", "the", "last", "several", "days", "thanks", "to", "a", "licensing", "deal", "with", "Jacques", "Moret", "allowing", "Moret", "to", "buy", "out", "their", "women", "'s", "apparel", "license", "for", "$", "30", "million", ",", "on", "top", "of", "a", "$", "12.5", "million", "payment", "now", "."], 
       "pos": ["JJ", "JJ", "NN", "NNP", ",", "RB", "DT", "NN", "IN", "NN", "NN", ",", "VBZ", "VBN", "IN", "DT", "JJ", "JJ", "NNS", "NNS", "TO", "DT", "NN", "NN", "IN", "NNP", "NNP", "VBG", "NNP", "TO", "VB", "RP", "PRP$", "NNS", "POS", "NN", "NN", "IN", "$", "CD", "CD", ",", "IN", "NN", "IN", "DT", "$", "CD", "CD", "NN", "RB", "."], 
       "entities": [{"type": "ORG", "start": 1, "end": 4}, {"type": "ORG", "start": 5, "end": 11}, {"type": "ORG", "start": 25, "end": 27}, {"type": "ORG", "start": 28, "end": 29}, {"type": "ORG", "start": 32, "end": 33}, {"type": "PER", "start": 33, "end": 34}], 
       "ltokens": ["Everlast", "'s", "Rally", "Just", "Might", "Live", "up", "to", "the", "Name", "."], 
       "rtokens": ["In", "other", "words", ",", "a", "competitor", "has", "decided", "that", "one", "segment", "of", "the", "company", "'s", "business", "is", "potentially", "worth", "$", "42.5", "million", "."],
       "org_id": "MARKETVIEW_20041220.1537"
}

The ltokens contains the tokens from the previous sentence. And The rtokens contains the tokens from the next sentence.

Due to the license of LDC, we cannot directly release our preprocessed datasets of ACE04, ACE05 and KBP17. We only release the preprocessed GENIA dataset and the corresponding word vectors and dictionary. Download them from here.

If you need other datasets, please contact me ([email protected]) by email. Note that you need to state your identity and prove that you have obtained the LDC license.

Pretrained Wordvecs

The word vectors used in our experiments:

Download and extract the wordvecs from above links, save GloVe in ../glove and BioWord2Vec in ../biovec.

mkdir ../glove
mkdir ../biovec
mv glove.6B.100d.txt ../glove
mv PubMed-shuffle-win-30.txt ../biovec

Note: the BioWord2Vec downloaded from the above link is word2vec binary format, and needs to be converted to GloVe format. Refer to this.

Example

Train

python identifier.py train --config configs/example.conf

Note: You should edit this line in config_reader.py according to the actual number of GPUs.

Evaluation

You can download our checkpoints, or train your own model and then evaluate the model.

cd data/
# download checkpoints from https://drive.google.com/drive/folders/1NaoL42N-g1t9jiif427HZ6B8MjbyGTaZ?usp=sharing
unzip checkpoints.zip
cd ../
python identifier.py eval --config configs/eval.conf

If you use the checkpoints we provided, you will get the following results:

  • ACE05:
-- Entities (named entity recognition (NER)) ---
An entity is considered correct if the entity type and span is predicted correctly

                type    precision       recall     f1-score      support
                 WEA        84.62        88.00        86.27           50
                 ORG        83.23        78.78        80.94          523
                 PER        88.02        92.05        89.99         1724
                 FAC        80.65        73.53        76.92          136
                 GPE        85.13        87.65        86.37          405
                 VEH        86.36        75.25        80.42          101
                 LOC        66.04        66.04        66.04           53

               micro        86.05        87.20        86.62         2992
               macro        82.01        80.19        81.00         2992
  • GENIA:
--- Entities (named entity recognition (NER)) ---
An entity is considered correct if the entity type and span is predicted correctly

                type    precision       recall     f1-score      support
                 RNA        89.91        89.91        89.91          109
                 DNA        76.79        79.16        77.96         1262
           cell_line        82.35        72.36        77.03          445
             protein        81.11        85.18        83.09         3084
           cell_type        72.90        75.91        74.37          606

               micro        79.46        81.84        80.63         5506
               macro        80.61        80.50        80.47         5506
  • ACE04:
--- Entities (named entity recognition (NER)) ---
An entity is considered correct if the entity type and span is predicted correctly

                type    precision       recall     f1-score      support
                 FAC        72.16        62.50        66.99          112
                 PER        91.62        91.26        91.44         1498
                 LOC        74.36        82.86        78.38          105
                 VEH        94.44       100.00        97.14           17
                 GPE        89.45        86.09        87.74          719
                 WEA        79.17        59.38        67.86           32
                 ORG        83.49        82.43        82.95          552

               micro        88.24        86.79        87.51         3035
               macro        83.53        80.64        81.78         3035
  • KBP17:
--- Entities (named entity recognition (NER)) ---
An entity is considered correct if the entity type and span is predicted correctly

                type    precision       recall     f1-score      support
                 LOC        66.75        64.41        65.56          399
                 FAC        72.62        64.06        68.08          679
                 PER        87.86        88.30        88.08         7083
                 ORG        80.06        72.29        75.98         2461
                 GPE        89.58        87.36        88.46         1978

               micro        85.31        82.96        84.12        12600
               macro        79.38        75.28        77.23        12600

Quick Start

The preprocessed GENIA dataset is available, so we use it as an example to demonstrate the training and evaluation of the model.

cd identifier

mkdir -p data/datasets
cd data/datasets
# download genia.zip (the preprocessed GENIA dataset, wordvec and vocabulary) from https://drive.google.com/file/d/13Lf_pQ1-QNI94EHlvtcFhUcQeQeUDq8l/view?usp=sharing.
unzip genia.zip
python identifier.py train --config configs/example.conf

output:

--- Entities (named entity recognition (NER)) ---
An entity is considered correct if the entity type and span is predicted correctly

                type    precision       recall     f1-score      support
             protein        81.19        85.08        83.09         3084
                 RNA        90.74        89.91        90.32          109
           cell_line        82.35        72.36        77.03          445
                 DNA        76.83        79.08        77.94         1262
           cell_type        72.90        75.91        74.37          606

               micro        79.53        81.77        80.63         5506
               macro        80.80        80.47        80.55         5506

Best F1 score: 80.63560463237275, achieved at Epoch: 34
2021-01-02 15:07:39,565 [MainThread  ] [INFO ]  Logged in: data/genia/main/genia_train/2021-01-02_05:32:27.317850
2021-01-02 15:07:39,565 [MainThread  ] [INFO ]  Saved in: data/genia/main/genia_train/2021-01-02_05:32:27.317850
vim configs/eval.conf
# change model_path to the path of the trained model.
# eg: model_path = data/genia/main/genia_train/2021-01-02_05:32:27.317850/final_model
python identifier.py eval --config configs/eval.conf

output:

--------------------------------------------------
Config:
data/checkpoint/genia_train/2021-01-02_05:32:27.317850/final_model
Namespace(bert_before_lstm=True, cache_path=None, char_lstm_drop=0.2, char_lstm_layers=1, char_size=50, config='configs/eval.conf', cpu=False, dataset_path='data/datasets/genia/genia_test_context.json', debug=False, device_id='0', eval_batch_size=4, example_count=None, freeze_transformer=False, label='2021-01-02_eval', log_path='data/genia/main/', lowercase=False, lstm_drop=0.2, lstm_layers=1, model_path='data/checkpoint/genia_train/2021-01-02_05:32:27.317850/final_model', model_type='identifier', neg_entity_count=5, nms=0.65, no_filter='sigmoid', no_overlapping=False, no_regressor=False, no_times_count=False, norm='sigmoid', pool_type='max', pos_size=25, prop_drop=0.5, reduce_dim=True, sampling_processes=4, seed=47, size_embedding=25, spn_filter=5, store_examples=True, store_predictions=True, tokenizer_path='data/checkpoint/genia_train/2021-01-02_05:32:27.317850/final_model', types_path='data/datasets/genia/genia_types.json', use_char_lstm=True, use_entity_ctx=True, use_glove=True, use_pos=True, use_size_embedding=False, weight_decay=0.01, window_size=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10], wordvec_path='../biovec/PubMed-shuffle-win-30.txt')
Repeat 1 times
--------------------------------------------------
Iteration 0
--------------------------------------------------
Avaliable devices:  [3]
Using Random Seed 47
2021-05-22 17:52:44,101 [MainThread  ] [INFO ]  Dataset: data/datasets/genia/genia_test_context.json
2021-05-22 17:52:44,101 [MainThread  ] [INFO ]  Model: identifier
Reused vocab!
Parse dataset 'test': 100%|███████████████████████████████████████| 1854/1854 [00:09<00:00, 202.86it/s]
2021-05-22 17:52:53,507 [MainThread  ] [INFO ]  Relation type count: 1
2021-05-22 17:52:53,507 [MainThread  ] [INFO ]  Entity type count: 6
2021-05-22 17:52:53,507 [MainThread  ] [INFO ]  Entities:
2021-05-22 17:52:53,507 [MainThread  ] [INFO ]  No Entity=0
2021-05-22 17:52:53,508 [MainThread  ] [INFO ]  DNA=1
2021-05-22 17:52:53,508 [MainThread  ] [INFO ]  RNA=2
2021-05-22 17:52:53,508 [MainThread  ] [INFO ]  cell_type=3
2021-05-22 17:52:53,508 [MainThread  ] [INFO ]  protein=4
2021-05-22 17:52:53,508 [MainThread  ] [INFO ]  cell_line=5
2021-05-22 17:52:53,508 [MainThread  ] [INFO ]  Relations:
2021-05-22 17:52:53,508 [MainThread  ] [INFO ]  No Relation=0
2021-05-22 17:52:53,508 [MainThread  ] [INFO ]  Dataset: test
2021-05-22 17:52:53,508 [MainThread  ] [INFO ]  Document count: 1854
2021-05-22 17:52:53,508 [MainThread  ] [INFO ]  Relation count: 0
2021-05-22 17:52:53,508 [MainThread  ] [INFO ]  Entity count: 5506
2021-05-22 17:52:53,508 [MainThread  ] [INFO ]  Context size: 242
2021-05-22 17:53:10,348 [MainThread  ] [INFO ]  Evaluate: test
Evaluate epoch 0: 100%|██████████████████████████████████████████| 464/464 [01:14<00:00,  6.26it/s]
Enmuberated Spans: 0
Evaluation

--- Entities (named entity recognition (NER)) ---
An entity is considered correct if the entity type and span is predicted correctly

                type    precision       recall     f1-score      support
           cell_line        82.60        71.46        76.63          445
                 DNA        77.67        78.29        77.98         1262
                 RNA        92.45        89.91        91.16          109
           cell_type        73.79        75.25        74.51          606
             protein        81.99        84.57        83.26         3084

               micro        80.33        81.15        80.74         5506
               macro        81.70        79.89        80.71         5506
2021-05-22 17:54:28,943 [MainThread  ] [INFO ]  Logged in: data/genia/main/genia_eval/2021-05-22_17:52:43.991876

Citation

If you have any questions related to the code or the paper, feel free to email [email protected].

@inproceedings{shen2021locateandlabel,
    author = {Shen, Yongliang and Ma, Xinyin and Tan, Zeqi and Zhang, Shuai and Wang, Wen and Lu, Weiming},
    title = {Locate and Label: A Two-stage Identifier for Nested Named Entity Recognition},
    url = {https://arxiv.org/abs/2105.06804},
    booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics",
    year = {2021},
}
Owner
tricktreat
Knowledge is power.
tricktreat
Hypernetwork-Ensemble Learning of Segmentation Probability for Medical Image Segmentation with Ambiguous Labels

Hypernet-Ensemble Learning of Segmentation Probability for Medical Image Segmentation with Ambiguous Labels The implementation of Hypernet-Ensemble Le

Sungmin Hong 6 Jul 18, 2022
Reimplement of SimSwap training code

SimSwap-train Reimplement of SimSwap training code Instructions 1.Environment Preparation (1)Refer to the README document of SIMSWAP to configure the

seeprettyface.com 111 Dec 31, 2022
RATE: Overcoming Noise and Sparsity of Textual Features in Real-Time Location Estimation (CIKM'17)

RATE: Overcoming Noise and Sparsity of Textual Features in Real-Time Location Estimation This is the implementation of RATE: Overcoming Noise and Spar

Yu Zhang 5 Feb 10, 2022
An onlinel learning to rank python codebase.

OLTR Online learning to rank python codebase. The code related to Pairwise Differentiable Gradient Descent (ranker/PDGDLinearRanker.py) is copied from

ielab 5 Jul 18, 2022
Fermi Problems: A New Reasoning Challenge for AI

Fermi Problems: A New Reasoning Challenge for AI Fermi Problems are questions whose answer is a number that can only be reasonably estimated as a prec

AI2 15 May 28, 2022
Developing your First ML Workflow of the AWS Machine Learning Engineer Nanodegree Program

Exercises and project documentation for the 3. Developing your First ML Workflow of the AWS Machine Learning Engineer Nanodegree Program

Simona Mircheva 1 Jan 13, 2022
Hand gesture recognition model that can be used as a remote control for a smart tv.

Gesture_recognition The training data consists of a few hundred videos categorised into one of the five classes. Each video (typically 2-3 seconds lon

Pratyush Negi 1 Aug 11, 2022
Stream images from a connected camera over MQTT, view using Streamlit, record to file and sqlite

mqtt-camera-streamer Summary: Publish frames from a connected camera or MJPEG/RTSP stream to an MQTT topic, and view the feed in a browser on another

Robin Cole 183 Dec 16, 2022
Plotting points that lie on the intersection of the given curves using gradient descent.

Plotting intersection of curves using gradient descent Webapp Link --- What's the app about Why this app Plotting functions and their intersection. A

Divakar Verma 2 Jan 09, 2022
Official repo for our 3DV 2021 paper "Monocular 3D Reconstruction of Interacting Hands via Collision-Aware Factorized Refinements".

Monocular 3D Reconstruction of Interacting Hands via Collision-Aware Factorized Refinements Yu Rong, Jingbo Wang, Ziwei Liu, Chen Change Loy Paper. Pr

Yu Rong 41 Dec 13, 2022
Official PyTorch implementation of "ArtFlow: Unbiased Image Style Transfer via Reversible Neural Flows"

ArtFlow Official PyTorch implementation of the paper: ArtFlow: Unbiased Image Style Transfer via Reversible Neural Flows Jie An*, Siyu Huang*, Yibing

123 Dec 27, 2022
Implementation of a Transformer that Ponders, using the scheme from the PonderNet paper

Ponder(ing) Transformer Implementation of a Transformer that learns to adapt the number of computational steps it takes depending on the difficulty of

Phil Wang 65 Oct 04, 2022
LOFO (Leave One Feature Out) Importance calculates the importances of a set of features based on a metric of choice,

LOFO (Leave One Feature Out) Importance calculates the importances of a set of features based on a metric of choice, for a model of choice, by iteratively removing each feature from the set, and eval

Ahmet Erdem 691 Dec 23, 2022
A containerized REST API around OpenAI's CLIP model.

OpenAI's CLIP — REST API This is a container wrapping OpenAI's CLIP model in a RESTful interface. Running the container locally First, build the conta

Santiago Valdarrama 48 Nov 06, 2022
A PyTorch Implementation of Gated Graph Sequence Neural Networks (GGNN)

A PyTorch Implementation of GGNN This is a PyTorch implementation of the Gated Graph Sequence Neural Networks (GGNN) as described in the paper Gated G

Ching-Yao Chuang 427 Dec 13, 2022
Implementation of ConvMixer-Patches Are All You Need? in TensorFlow and Keras

Patches Are All You Need? - ConvMixer ConvMixer, an extremely simple model that is similar in spirit to the ViT and the even-more-basic MLP-Mixer in t

Sayan Nath 8 Oct 03, 2022
NP DRAW paper released code

NP-DRAW: A Non-Parametric Structured Latent Variable Model for Image Generation This repo contains the official implementation for the NP-DRAW paper.

ZENG Xiaohui 22 Mar 13, 2022
N-RPG - Novel role playing game da turfu

N-RPG Ce README sera la page de garde du projet. Contenu Il contiendra la présen

4 Mar 15, 2022
Skipgram Negative Sampling in PyTorch

PyTorch SGNS Word2Vec's SkipGramNegativeSampling in Python. Yet another but quite general negative sampling loss implemented in PyTorch. It can be use

Jamie J. Seol 287 Dec 14, 2022
Temporally Coherent GAN SIGGRAPH project.

TecoGAN This repository contains source code and materials for the TecoGAN project, i.e. code for a TEmporally COherent GAN for video super-resolution

Duc Linh Nguyen 2 Jan 18, 2022