Data and evaluation code for the paper WikiNEuRal: Combined Neural and Knowledge-based Silver Data Creation for Multilingual NER (EMNLP 2021).

Overview

logo

Data and evaluation code for the paper WikiNEuRal: Combined Neural and Knowledge-based Silver Data Creation for Multilingual NER.

@inproceedings{tedeschi-etal-2021-wikineural-combined,
    title = "{W}iki{NE}u{R}al: {C}ombined Neural and Knowledge-based Silver Data Creation for Multilingual {NER}",
    author = "Tedeschi, Simone  and
      Maiorca, Valentino  and
      Campolungo, Niccol{\`o}  and
      Cecconi, Francesco  and
      Navigli, Roberto",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2021",
    month = nov,
    year = "2021",
    address = "Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.findings-emnlp.215",
    pages = "2521--2533",
    abstract = "Multilingual Named Entity Recognition (NER) is a key intermediate task which is needed in many areas of NLP. In this paper, we address the well-known issue of data scarcity in NER, especially relevant when moving to a multilingual scenario, and go beyond current approaches to the creation of multilingual silver data for the task. We exploit the texts of Wikipedia and introduce a new methodology based on the effective combination of knowledge-based approaches and neural models, together with a novel domain adaptation technique, to produce high-quality training corpora for NER. We evaluate our datasets extensively on standard benchmarks for NER, yielding substantial improvements up to 6 span-based F1-score points over previous state-of-the-art systems for data creation.",
}

Please consider citing our work if you use data and/or code from this repository.

In a nutshell, WikiNEuRal consists in a novel technique which builds upon a multilingual lexical knowledge base (i.e., BabelNet) and transformer-based architectures (i.e., BERT) to produce high-quality annotations for multilingual NER. It shows consistent improvements of up to 6 span-based F1-score points against state-of-the-art alternative data production methods on common benchmarks for NER. Moreover, in our paper we also present a new approach for creating interpretable word embeddings together with a Domain Adaptation algorithm, which enable WikiNEuRal to create domain-specific training corpora.

Data

Dataset Version Sentences Tokens PER ORG LOC MISC OTHER
WikiNEuRal EN 116k 2.73M 51k 31k 67k 45k 2.40M
WikiNEuRal ES 95k 2.33M 43k 17k 68k 25k 2.04M
WikiNEuRal NL 107k 1.91M 46k 22k 61k 24k 1.64M
WikiNEuRal DE 124k 2.19M 60k 32k 59k 25k 1.87M
WikiNEuRal RU 123k 2.39M 40k 26k 89k 25k 2.13M
WikiNEuRal IT 111k 2.99M 67k 22k 97k 26k 2.62M
WikiNEuRal FR 127k 3.24M 76k 25k 101k 29k 2.83M
WikiNEuRal PL 141k 2.29M 59k 34k 118k 22k 1.91M
WikiNEuRal PT 106k 2.53M 44k 17k 112k 25k 2.20M
WikiNEuRal EN DA (CoNLL) 29k 759k 12k 23k 6k 3k 0.54M
WikiNEuRal NL DA (CoNLL) 34k 598k 17k 8k 18k 6k 0.51M
WikiNEuRal DE DA (CoNLL) 41k 706k 17k 12k 23k 3k 0.61M
WikiNEuRal EN DA (OntoNotes) 48k 1.18M 20k 13k 38k 12k 1.02M

Further datasets, such as the combination of WikiNEuRal with gold-standard training data (i.e., CoNLL) or the gold-standard datasets themselves, can be obtained by simply concatenating the two train.conllu files together (e.g., data/conll/en/train.conllu and data/wikineural/en/train.conllu give CoNLL+WikiNEuRal).

How to use

  1. To train 10 models on CoNLL English, run:

    python run.py -m +train.seed_idx=0,1,2,3,4,5,6,7,8,9 data.datamodule.source=conll data.datamodule.language=en
    

    note: for the EN, ES, NL and DE versions of WikiNEuRal, you can use the CoNLL splits as validation and testing material (e.g., copy the data/conll/en/val.conllu into data/wikineural/en/). Similarly, for RU and PL you can use the BSNLP splits. For the other languages instead, you can use the scripts/create_splits.py script to split a given train.conllu file into train, dev and test sets.

  2. To produce results for the 10 trained models, run:

    bash test.sh
    

    test.sh also contains more complex bash for loops that can produce results on multiple datasets / models at once.

License

WikiNEuRal is licensed under the CC BY-SA-NC 4.0 license. The text of the license can be found here.

We underline that the source from which the raw sentences have been extracted is Wikipedia (wikipedia.org) and the NER annotations have been produced by Babelscape.

Acknowledgments

We gratefully acknowledge the support of the ERC Consolidator Grant MOUSSE No. 726487 under the European Union’s Horizon2020 research and innovation programme (http://mousse-project.org/).

This work was also supported by the PerLIR project (Personal Linguistic resources in Information Retrieval) funded by the MIUR Progetti di ricerca di Rilevante Interesse Nazionale programme (PRIN2017).

The code in this repository is built on top of .

Owner
Babelscape
Babelscape is a deep tech company founded in 2016 focused on multilingual Natural Language Processing.
Babelscape
Neural text generators like the GPT models promise a general-purpose means of manipulating texts.

Boolean Prompting for Neural Text Generators Neural text generators like the GPT models promise a general-purpose means of manipulating texts. These m

Jeffrey M. Binder 20 Jan 09, 2023
Header-only C++ HNSW implementation with python bindings

Hnswlib - fast approximate nearest neighbor search Header-only C++ HNSW implementation with python bindings. NEWS: version 0.6 Thanks to (@dyashuni) h

2.3k Jan 05, 2023
Rhyme with AI

Local development Create a conda virtual environment and activate it: conda env create --file environment.yml conda activate rhyme-with-ai Install the

GoDataDriven 28 Nov 21, 2022
This is a project of data parallel that running on NLP tasks.

This is a project of data parallel that running on NLP tasks.

2 Dec 12, 2021
Fine-tuning scripts for evaluating transformer-based models on KLEJ benchmark.

The KLEJ Benchmark Baselines The KLEJ benchmark (Kompleksowa Lista Ewaluacji Językowych) is a set of nine evaluation tasks for the Polish language und

Allegro Tech 17 Oct 18, 2022
Almost State-of-the-art Text Generation library

Ps: we are adding transformer model soon Text Gen 🐐 Almost State-of-the-art Text Generation library Text gen is a python library that allow you build

Emeka boris ama 63 Jun 24, 2022
端到端的长本文摘要模型(法研杯2020司法摘要赛道)

端到端的长文本摘要模型(法研杯2020司法摘要赛道)

苏剑林(Jianlin Su) 334 Jan 08, 2023
Python library for parsing resumes using natural language processing and machine learning

CVParser Python library for parsing resumes using natural language processing and machine learning. Setup Installation on Linux and Mac OS Follow the

nafiu 0 Jul 29, 2021
Labelling platform for text using distant supervision

With DataQA, you can label unstructured text documents using rule-based distant supervision.

245 Aug 05, 2022
CDLA: A Chinese document layout analysis (CDLA) dataset

CDLA: A Chinese document layout analysis (CDLA) dataset 介绍 CDLA是一个中文文档版面分析数据集,面向中文文献类(论文)场景。包含以下10个label: 正文 标题 图片 图片标题 表格 表格标题 页眉 页脚 注释 公式 Text Title

buptlihang 84 Dec 28, 2022
CCKS-Title-based-large-scale-commodity-entity-retrieval-top1

- 基于标题的大规模商品实体检索top1 一、任务介绍 CCKS 2020:基于标题的大规模商品实体检索,任务为对于给定的一个商品标题,参赛系统需要匹配到该标题在给定商品库中的对应商品实体。 输入:输入文件包括若干行商品标题。 输出:输出文本每一行包括此标题对应的商品实体,即给定知识库中商品 ID,

43 Nov 11, 2022
An ultra fast tiny model for lane detection, using onnx_parser, TensorRTAPI, torch2trt to accelerate. our model support for int8, dynamic input and profiling. (Nvidia-Alibaba-TensoRT-hackathon2021)

Ultra_Fast_Lane_Detection_TensorRT An ultra fast tiny model for lane detection, using onnx_parser, TensorRTAPI to accelerate. our model support for in

steven.yan 121 Dec 27, 2022
Unet-TTS: Improving Unseen Speaker and Style Transfer in One-shot Voice Cloning

Unet-TTS: Improving Unseen Speaker and Style Transfer in One-shot Voice Cloning English | 中文 ❗ Now we provide inferencing code and pre-training models

164 Jan 02, 2023
A Python script that compares files in directories

compare-files A Python script that compares files in different directories, this is similar to the command filecmp.cmp(f1, f2). I made this script in

Colvin 1 Oct 15, 2021
An easy-to-use framework for BERT models, with trainers, various NLP tasks and detailed annonations

FantasyBert English | 中文 Introduction An easy-to-use framework for BERT models, with trainers, various NLP tasks and detailed annonations. You can imp

Fan 137 Oct 26, 2022
BROS: A Pre-trained Language Model Focusing on Text and Layout for Better Key Information Extraction from Documents

BROS (BERT Relying On Spatiality) is a pre-trained language model focusing on text and layout for better key information extraction from documents. Given the OCR results of the document image, which

Clova AI Research 94 Dec 30, 2022
GSoC'2021 | TensorFlow implementation of Wav2Vec2

GSoC'2021 | TensorFlow implementation of Wav2Vec2

Vasudev Gupta 73 Nov 28, 2022
Official Stanford NLP Python Library for Many Human Languages

Official Stanford NLP Python Library for Many Human Languages

Stanford NLP 6.4k Jan 02, 2023
Using BERT-based models for toxic span detection

SemEval 2021 Task 5: Toxic Spans Detection: Task: Link to SemEval-2021: Task 5 Toxic Span Detection is https://competitions.codalab.org/competitions/2

Ravika Nagpal 1 Jan 04, 2022
DziriBERT: a Pre-trained Language Model for the Algerian Dialect

DziriBERT is the first Transformer-based Language Model that has been pre-trained specifically for the Algerian Dialect.

117 Jan 07, 2023