ANEA: Distant Supervision for Low-Resource Named Entity Recognition

Last update: Mar 30, 2022

ANEA is a tool to automatically annotate named entities in unlabeled text based on entity lists, for use as distant supervision.

Distant supervision makes it possible to obtain labeled training corpora in low-resource settings where only limited hand-annotated data exists. To be used effectively, however, the distant supervision must be easy to gather. ANEA annotates named entities in text automatically based on entity lists, and it spans the whole pipeline from obtaining the lists to analyzing the errors of the distant supervision. A tuning step allows users to improve the automatic annotation with their linguistic insights without labeling or checking all tokens manually.
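To make the idea of list-based annotation concrete, here is a minimal sketch of labeling tokens by longest-match lookup in entity lists. It is not ANEA's implementation; the function name annotate, the label names and the lowercased exact matching are assumptions chosen for illustration.

```python
from typing import Dict, List, Sequence, Tuple


def annotate(tokens: Sequence[str],
             entity_lists: Dict[str, List[str]]) -> List[str]:
    """Label each token with an entity type, or "O" if no list entry matches.

    Hypothetical helper for illustration only (not part of ANEA).
    """
    # Index every entity name as a tuple of lowercased tokens.
    index: Dict[Tuple[str, ...], str] = {}
    for label, names in entity_lists.items():
        for name in names:
            index[tuple(name.lower().split())] = label
    max_len = max((len(key) for key in index), default=0)

    labels = ["O"] * len(tokens)
    lowered = [token.lower() for token in tokens]
    i = 0
    while i < len(tokens):
        # Try the longest span first so multi-token names win over shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            label = index.get(tuple(lowered[i:i + n]))
            if label is not None:
                labels[i:i + n] = [label] * n
                i += n
                break
        else:
            # No entry starts at this token; move on.
            i += 1
    return labels


tokens = "Angela Merkel visited Saarbrücken in March".split()
entity_lists = {"PER": ["Angela Merkel"], "LOC": ["Saarbrücken", "Berlin"]}
print(list(zip(tokens, annotate(tokens, entity_lists))))
```

Exact matching like this misses inflected or misspelled mentions, which is the kind of error the tuning and error-analysis steps are meant to surface.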

An example of the workflow can be seen in this video. For more details, take a look at our paper (accepted at PML4DC @ ICLR'21). For the additional material of the paper, please check the subdirectory additional of this repository.

Installation

ANEA should run on all major operating systems. We recommend the installation via conda or miniconda:

git clone https://github.com/uds-lsv/anea

conda create -n anea python=3.7
conda activate anea
pip install spacy==2.2.4 Flask==1.1.1 fuzzywuzzy==0.18.0

For tokenization and lemmatization, a spaCy language pack needs to be installed. Run the following command with the corresponding language code, e.g. en for English. See https://spacy.io/usage for supported languages.

python -m spacy download en

Download the Wikidata JSON dump from https://dumps.wikimedia.org/wikidatawiki/entities/ and extract it to the instance directory (this may take a while).
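For orientation, the Wikidata JSON dump is one large JSON array with one entity per line. The sketch below shows how entity names of a given class could be read from such a file. It illustrates the dump format only and is not ANEA's own extraction code; the function names and the tiny in-memory sample are assumptions.

```python
import json
from typing import Iterable, Iterator, List, Set


def iter_entities(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one entity dict per line of a Wikidata JSON dump."""
    for line in lines:
        # Each entity line ends with a comma; "[" and "]" wrap the array.
        line = line.strip().rstrip(",")
        if line in ("", "[", "]"):
            continue
        yield json.loads(line)


def instance_of(entity: dict) -> Set[str]:
    """Return the IDs of all classes the entity is an instance of (P31)."""
    qids = set()
    for claim in entity.get("claims", {}).get("P31", []):
        snak = claim.get("mainsnak", {})
        if snak.get("snaktype") == "value":
            qids.add(snak["datavalue"]["value"]["id"])
    return qids


def collect_names(lines: Iterable[str], class_qid: str,
                  language: str = "en") -> List[str]:
    """Collect labels of all entities that are instances of class_qid."""
    names = []
    for entity in iter_entities(lines):
        if class_qid in instance_of(entity):
            label = entity.get("labels", {}).get(language)
            if label:
                names.append(label["value"])
    return names


def make_entity(qid: str, name: str, class_qid: str) -> str:
    """Build a minimal dump line for the demo below (sample data only)."""
    return json.dumps({
        "id": qid,
        "labels": {"en": {"language": "en", "value": name}},
        "claims": {"P31": [{"mainsnak": {
            "snaktype": "value",
            "datavalue": {"value": {"id": class_qid}}}}]},
    })


sample = ["[",
          make_entity("Q42", "Douglas Adams", "Q5") + ",",
          make_entity("Q64", "Berlin", "Q515"),
          "]"]
print(collect_names(sample, "Q5"))  # Q5 is the Wikidata class "human"
```

With the extracted dump, the same functions could be fed an open file handle, e.g. open(path, encoding="utf-8"), so that the file is streamed line by line instead of being loaded into memory.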

Running

After the installation, you can run ANEA with the following commands on the command line:

conda activate anea
./run.sh

Then open a browser and go to http://localhost:5000/. If you are running ANEA for the first time, configure it in the Settings tab.

The ANEA server can run on a different machine than the user's browser. The user's computer only needs access to port 5000 on the machine running the ANEA server (e.g. via SSH port forwarding or by opening the corresponding port in the firewall).
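If the page does not load from a remote machine, a quick check whether the port is reachable can help narrow down the problem. The following is a minimal sketch using only the Python standard library; the function name is_reachable and the two-second timeout are assumptions. With SSH port forwarding (for example ssh -L 5000:localhost:5000 followed by your own login and host), the host to check would be localhost.

```python
import socket


def is_reachable(host: str, port: int = 5000, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolvable host names.
        return False


if __name__ == "__main__":
    print(is_reachable("localhost", 5000))
```

A result of False only says that nothing accepted the connection; it does not distinguish between a firewall, a missing tunnel and a server that is not running.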

Support for Other Languages

ANEA uses spaCy for language preprocessing (tokenization and lemmatization). It currently supports English, German, French, Spanish, Portuguese, Italian, Dutch, Greek, Norwegian Bokmål and Lithuanian. For Estonian, ANEA supports EstNLTK version 1.6. In that case, ANEA needs to be installed with Python 3.6.

Text can also be preprocessed with external tools and then uploaded either as whitespace-tokenized text or in the CoNLL format (one token per line).
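As an illustration of the two upload formats, the sketch below converts whitespace-tokenized sentences into a one-token-per-line layout. Separating sentences with a blank line follows the common CoNLL convention and is an assumption here, as is the function name to_conll; check a sample upload against ANEA before converting a full corpus.

```python
from typing import Iterable, List


def to_conll(sentences: Iterable[str]) -> str:
    """Convert whitespace-tokenized sentences to one token per line."""
    blocks: List[str] = []
    for sentence in sentences:
        tokens = sentence.split()
        if tokens:  # skip empty lines
            blocks.append("\n".join(tokens))
    # Blank line between sentences (assumed convention), newline at the end.
    return "\n\n".join(blocks) + "\n" if blocks else ""


text = ["Angela Merkel visited Saarbrücken .", "She stayed two days ."]
print(to_conll(text))
```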

Other external preprocessing libraries can be added directly to ANEA by implementing a new Tokenizer class in autom_labeling_library/preprocessing.py (you can take a look at EstnltkTokenizer as an example) and adding it to the Preprocessing class. If you encounter any issues, just contact us.

Citation

If you use this tool, please cite us:

@article{hedderich21ANEA,
  author    = {Michael A. Hedderich and
               Lukas Lange and
               Dietrich Klakow},
  title     = {{ANEA:} Distant Supervision for Low-Resource Named Entity Recognition},
  journal   = {CoRR},
  volume    = {abs/2102.13129},
  year      = {2021},
  url       = {https://arxiv.org/abs/2102.13129},
  archivePrefix = {arXiv},
  eprint    = {2102.13129},
}

Development, Support & License

If you encounter any issues or problems when using ANEA, feel free to raise an issue on GitHub or contact us directly (mhedderich [at] lsv.uni-saarland [dot] de). We welcome contributions from other developers.

ANEA is licensed under the Apache License 2.0.

Owner

Saarland University Spoken Language Systems Group
