ANEA: Distant Supervision for Low-Resource Named Entity Recognition

Related tags

Deep Learninganea
Overview

ANEA: Distant Supervision for Low-Resource Named Entity Recognition

ANEA is a tool to automatically annotate named entities in unlabeled text based on entity lists for the use as distant supervision.

Distant supervision allows obtaining labeled training corpora for low-resource settings where only limited hand-annotated data exists. However, to be used effectively, the distant supervision must be easy to gather. ANEA is a tool to automatically annotate named entities in texts based on entity lists. It spans the whole pipeline from obtaining the lists to analyzing the errors of the distant supervision. A tuning step allows the user to improve the automatic annotation with their linguistic insights without labelling or checking all tokens manually.

An example of the workflow can be seen in this video. For more details, take a look at our paper (accepted at PML4DC @ ICLR'21). For the additional material of the paper, please check the subdirectory additional of this repository.

Installation

ANEA should run on all major operating systems. We recommend the installation via conda or miniconda:

git clone https://github.com/uds-lsv/anea

conda create -n anea python=3.7
conda activate anea
pip install spacy==2.2.4 Flask==1.1.1 fuzzywuzzy==0.18.0

For tokenizationa and lemmatization, a spacy language pack needs to be installed. Run the following command with the corresponding language code, e.g. en for English. Check https://spacy.io/usage for supported languages

python -m spacy download en

Download the Wikidata JSON dump from https://dumps.wikimedia.org/wikidatawiki/entities/ and extract it to the instance directory (this may take a while).

Running

After the installation, you can run ANEA using the following commands on the command line

conda activate anea
./run.sh

Then open the browser and go to the address http://localhost:5000/ If you run it for the first time, you should configure ANEA at the Settings tab.

The ANEA (server) tool can run on a different machine than the browser of the user. It is just necessary that the user's computer can access the port 5000 on the machine that the ANEA server is running on (e.g. via ssh port forwarding or opening the correspoding port on the firewall).

Support for Other Languages

ANEA uses Spacy for language preprocessing (tokenization and lemmatization). It currently supports English, German, French, Spanish, Portuguese, Italian, Dutch, Greek, Norwegian Bokmål and Lithuanian. For Estonian, EstNLTK, version 1.6, is supported by ANEA. In that case, ANEA needs to be installed with Python 3.6.

Text can also be preprocessed using external tools and then uploaded as whitespace tokenized text or in the CoNLL format (one token per line).

Other external preprocessing libraries can be added directly to ANEA by implementing a new Tokenizer class in autom_labeling_library/preprocessing.py (you can take a look at EstnltkTokenizer as an example) and adding it to the Preprocessing class. If you encounter any issues, just contact us.

Citation

If you use this tool, please cite us:

@article{hedderich21ANEA,
  author    = {Michael A. Hedderich and
               Lukas Lange and
               Dietrich Klakow},
  title     = {{ANEA:} Distant Supervision for Low-Resource Named Entity Recognition},
  journal   = {CoRR},
  volume    = {abs/2102.13129},
  year      = {2021},
  url       = {https://arxiv.org/abs/2102.13129},
  archivePrefix = {arXiv},
  eprint    = {2102.13129},
}

Development, Support & License

If you encounter any issues or problems when using ANEA, feel free to raise an issue on Github or contact us directly (mhedderich [at] lsv.uni-saarland [dot] de). We welcome contributes from other developers.

ANEA is licensed under the Apache License 2.0.

Owner
Saarland University Spoken Language Systems Group
Saarland University Spoken Language Systems Group
Image Segmentation using U-Net, U-Net with skip connections and M-Net architectures

Brain-Image-Segmentation Segmentation of brain tissues in MRI image has a number of applications in diagnosis, surgical planning, and treatment of bra

Angad Bajwa 8 Oct 27, 2022
Introducing neural networks to predict stock prices

IntroNeuralNetworks in Python: A Template Project IntroNeuralNetworks is a project that introduces neural networks and illustrates an example of how o

Vivek Palaniappan 637 Jan 04, 2023
Train SN-GAN with AdaBelief

SNGAN-AdaBelief Train a state-of-the-art spectral normalization GAN with AdaBelief https://github.com/juntang-zhuang/Adabelief-Optimizer Acknowledgeme

Juntang Zhuang 10 Jun 11, 2022
improvement of CLIP features over the traditional resnet features on the visual question answering, image captioning, navigation and visual entailment tasks.

CLIP-ViL In our paper "How Much Can CLIP Benefit Vision-and-Language Tasks?", we show the improvement of CLIP features over the traditional resnet fea

310 Dec 28, 2022
RM Operation can equivalently convert ResNet to VGG, which is better for pruning; and can help RepVGG perform better when the depth is large.

RM Operation can equivalently convert ResNet to VGG, which is better for pruning; and can help RepVGG perform better when the depth is large.

184 Jan 04, 2023
Download files from DSpace systems (because for some reason DSpace won't let you)

DSpaceDL A tool for downloading files from DSpace items. For some reason, DSpace systems have a dogshit UI, and Universities absolutely LOOOVE to use

Soumitra Shewale 5 Dec 01, 2022
A set of tools for creating and testing machine learning features, with a scikit-learn compatible API

Feature Forge This library provides a set of tools that can be useful in many machine learning applications (classification, clustering, regression, e

Machinalis 380 Nov 05, 2022
PyTorch implementation of the ACL, 2021 paper Parameter-efficient Multi-task Fine-tuning for Transformers via Shared Hypernetworks.

Parameter-efficient Multi-task Fine-tuning for Transformers via Shared Hypernetworks This repo contains the PyTorch implementation of the ACL, 2021 pa

Rabeeh Karimi Mahabadi 98 Dec 28, 2022
Code accompanying our NeurIPS 2021 traffic4cast challenge

Traffic forecasting on traffic movie snippets This repo contains all code to reproduce our approach to the IARAI Traffic4cast 2021 challenge. In the c

Nina Wiedemann 2 Aug 09, 2022
A python toolbox for predictive uncertainty quantification, calibration, metrics, and visualization

Website, Tutorials, and Docs    Uncertainty Toolbox A python toolbox for predictive uncertainty quantification, calibration, metrics, and visualizatio

Uncertainty Toolbox 1.4k Dec 28, 2022
DAT4 - General Assembly's Data Science course in Washington, DC

DAT4 Course Repository Course materials for General Assembly's Data Science course in Washington, DC (12/15/14 - 3/16/15). Instructors: Sinan Ozdemir

Kevin Markham 779 Dec 25, 2022
Implementation of the Paper: "Parameterized Hypercomplex Graph Neural Networks for Graph Classification" by Tuan Le, Marco Bertolini, Frank Noé and Djork-Arné Clevert

Parameterized Hypercomplex Graph Neural Networks (PHC-GNNs) PHC-GNNs (Le et al., 2021): https://arxiv.org/abs/2103.16584 PHM Linear Layer Illustration

Bayer AG 26 Aug 11, 2022
Official PyTorch implementation of Less is More: Pay Less Attention in Vision Transformers.

Less is More: Pay Less Attention in Vision Transformers Official PyTorch implementation of Less is More: Pay Less Attention in Vision Transformers. By

73 Jan 01, 2023
A model that attempts to learn and benefit from data collected on card counting.

A model that attempts to learn and benefit from data collected on card counting. A decision tree like model is built to win more often than loose and increase the bet of the player appropriately to c

1 Dec 17, 2021
[CVPR 2022] Pytorch implementation of "Templates for 3D Object Pose Estimation Revisited: Generalization to New objects and Robustness to Occlusions" paper

template-pose Pytorch implementation of "Templates for 3D Object Pose Estimation Revisited: Generalization to New objects and Robustness to Occlusions

Van Nguyen Nguyen 92 Dec 28, 2022
LAnguage Model Analysis

LAMA: LAnguage Model Analysis LAMA is a probe for analyzing the factual and commonsense knowledge contained in pretrained language models. The dataset

Meta Research 960 Jan 08, 2023
🚀 PyTorch Implementation of "Progressive Distillation for Fast Sampling of Diffusion Models(v-diffusion)"

PyTorch Implementation of "Progressive Distillation for Fast Sampling of Diffusion Models(v-diffusion)" Unofficial PyTorch Implementation of Progressi

Vitaliy Hramchenko 58 Dec 19, 2022
This is the implementation of GGHL (A General Gaussian Heatmap Labeling for Arbitrary-Oriented Object Detection)

GGHL: A General Gaussian Heatmap Labeling for Arbitrary-Oriented Object Detection This is the implementation of GGHL 👋 👋 👋 [Arxiv] [Google Drive][B

551 Dec 31, 2022
Integrated physics-based and ligand-based modeling.

ComBind ComBind integrates data-driven modeling and physics-based docking for improved binding pose prediction and binding affinity prediction. Given

Dror Lab 44 Oct 26, 2022
Jiminy Cricket Environment (NeurIPS 2021)

Jiminy Cricket This is the repository for "What Would Jiminy Cricket Do? Towards Agents That Behave Morally" by Dan Hendrycks*, Mantas Mazeika*, Andy

Dan Hendrycks 15 Aug 29, 2022