KIND: an Italian Multi-Domain Dataset for Named Entity Recognition

Related tags

Deep LearningKIND
Overview

KIND (Kessler Italian Named-entities Dataset)

KIND is an Italian dataset for Named-Entity Recognition.

It contains more than one million tokens with the annotation covering three classes: persons, locations, and organizations. Most of the dataset (around 600K tokens) contains manual gold annotations in three different domains: news, literature, and political discourses.

For the construction of the dataset, we decide to use texts available for free, under a license that permits both research and commercial use.

In particular we release four chapters with texts taken from: (i) Wikinews (WN) as a source of news texts belonging to the last decades; (ii) some Italian fiction books (FIC) whose authors died more than 70 years ago; (iii) writings and speeches from Italian politicians Aldo Moro (AM) and (iv) Alcide De Gasperi (ADG).

Wikinews

Wikinews is a multi-language free project of collaborative journalism. The Italian chapter contains more than 11,000 news articles, released under the Creative Commons Attribution 2.5 License.

In building KIND, we randomly choose 1,000 articles evenly distributed in the last 20 years, for a total of 308,622 tokens.

Literature

Regarding fiction literature, we annotate 86 book chapters taken from 10 books written by Italian authors, who all died more than 70 years ago, for a total of 192,448 tokens. The plain texts are taken from the Liber Liber website.

In particular, we choose: Il giorno delle Mésules (Ettore Castiglioni, 12,853 tokens), L'amante di Cesare (Augusto De Angelis, 13,464 tokens), Canne al vento (Grazia Deledda, 13,945 tokens), 1861-1911 - Cinquant’anni di vita nazionale ricordati ai fanciulli (Guido Fabiani, 10,801 tokens), Lettere dal carcere (Antonio Gramsci, 10,655), Anarchismo e democrazia (Errico Malatesta, 11,557 tokens), L'amore negato (Maria Messina, 31,115 tokens), La luna e i falò (Cesare Pavese, 10,705 tokens), La coscienza di Zeno (Italo Svevo, 56,364 tokens), Le cose piu grandi di lui (Luciano Zuccoli, 20,989 tokens).

In selecting works without copyright, we favored texts as recent as possible, so that the model trained on this data can be used efficiently on novels written in the last years, since the language used in these novels is more likely to be similar to the language used in the novels of our days.

Aldo Moro's Works

Writings belonging to Aldo Moro have recently been collected by the University of Bologna and published on a platform called Edizione Nazionale delle Opere di Aldo Moro.

The project is still ongoing and, by now, it contains 806 documents for a total of about one million tokens.

In the first release of KIND, we include 392,604 tokens from the Aldo Moro's works dataset, with silver annotations (see the reference below).

Alcide De Gasperi's Writings

Finally, we annotate 158 document (150,632 tokens) from Alcide Digitale, spanning 50 years of European history.

The complete corpus contains a comprehensive collection of Alcide De Gasperi’s public documents, 2,762 in total, written or transcribed between 1901 and 1954.

License

The NER annotations in (i), (ii), and (iii) are released under the Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license. Annotation from Alcide De Gasperi's writings are released under the Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.

Owner
Digital Humanities
Digital Humanities Unit at Fondazione Bruno Kessler
Digital Humanities
StyleGAN-Human: A Data-Centric Odyssey of Human Generation

StyleGAN-Human: A Data-Centric Odyssey of Human Generation Abstract: Unconditional human image generation is an important task in vision and graphics,

stylegan-human 762 Jan 08, 2023
Video Frame Interpolation without Temporal Priors (a general method for blurry video interpolation)

Video Frame Interpolation without Temporal Priors (NeurIPS2020) [Paper] [video] How to run Prerequisites NVIDIA GPU + CUDA 9.0 + CuDNN 7.6.5 Pytorch 1

YoujianZhang 31 Sep 04, 2022
Fast SHAP value computation for interpreting tree-based models

FastTreeSHAP FastTreeSHAP package is built based on the paper Fast TreeSHAP: Accelerating SHAP Value Computation for Trees published in NeurIPS 2021 X

LinkedIn 369 Jan 04, 2023
Distributing Deep Learning Hyperparameter Tuning for 3D Medical Image Segmentation

DistMIS Distributing Deep Learning Hyperparameter Tuning for 3D Medical Image Segmentation. DistriMIS Distributing Deep Learning Hyperparameter Tuning

HiEST 2 Sep 09, 2022
LBK 35 Dec 26, 2022
Audio Domain Adaptation for Acoustic Scene Classification using Disentanglement Learning

Audio Domain Adaptation for Acoustic Scene Classification using Disentanglement Learning Reference Abeßer, J. & Müller, M. Towards Audio Domain Adapt

Jakob Abeßer 2 Jul 06, 2022
Awesome Long-Tailed Learning

Awesome Long-Tailed Learning This repo pays specially attention to the long-tailed distribution, where labels follow a long-tailed or power-law distri

Stomach_ache 284 Jan 06, 2023
PyTorch implementation of EGVSR: Efficcient & Generic Video Super-Resolution (VSR)

This is a PyTorch implementation of EGVSR: Efficcient & Generic Video Super-Resolution (VSR), using subpixel convolution to optimize the inference speed of TecoGAN VSR model. Please refer to the offi

789 Jan 04, 2023
PyGRANSO: A PyTorch-enabled port of GRANSO with auto-differentiation

PyGRANSO PyGRANSO: A PyTorch-enabled port of GRANSO with auto-differentiation Please check https://ncvx.org/PyGRANSO for detailed instructions (introd

SUN Group @ UMN 26 Nov 16, 2022
A general-purpose programming language, focused on simplicity, safety and stability.

The Rivet programming language A general-purpose programming language, focused on simplicity, safety and stability. Rivet's goal is to be a very power

The Rivet programming language 17 Dec 29, 2022
SeqFormer: a Frustratingly Simple Model for Video Instance Segmentation

SeqFormer: a Frustratingly Simple Model for Video Instance Segmentation SeqFormer SeqFormer: a Frustratingly Simple Model for Video Instance Segmentat

Junfeng Wu 298 Dec 22, 2022
LoveDA: A Remote Sensing Land-Cover Dataset for Domain Adaptive Semantic Segmentation

LoveDA: A Remote Sensing Land-Cover Dataset for Domain Adaptive Semantic Segmentation by Junjue Wang, Zhuo Zheng, Ailong Ma, Xiaoyan Lu, and Yanfei Zh

Payphone 8 Nov 21, 2022
Combinatorial model of ligand-receptor binding

Combinatorial model of ligand-receptor binding The binding of ligands to receptors is the starting point for many import signal pathways within a cell

Mobolaji Williams 0 Jan 09, 2022
Probabilistic Tensor Decomposition of Neural Population Spiking Activity

Probabilistic Tensor Decomposition of Neural Population Spiking Activity Matlab (recommended) and Python (in developement) implementations of Soulat e

Hugo Soulat 6 Nov 30, 2022
Veri Setinizi Yolov5 Formatına Dönüştürün

Veri Setinizi Yolov5 Formatına Dönüştürün! Bu Repo da Neler Var? Xml Formatındaki Veri Setini .Txt Formatına Çevirme Xml Formatındaki Dosyaları Silme

Kadir Nar 4 Aug 22, 2022
This is the official implementation for "Do Transformers Really Perform Bad for Graph Representation?".

Graphormer By Chengxuan Ying, Tianle Cai, Shengjie Luo, Shuxin Zheng*, Guolin Ke, Di He*, Yanming Shen and Tie-Yan Liu. This repo is the official impl

Microsoft 1.3k Dec 29, 2022
3D Human Pose Machines with Self-supervised Learning

3D Human Pose Machines with Self-supervised Learning Keze Wang, Liang Lin, Chenhan Jiang, Chen Qian, and Pengxu Wei, “3D Human Pose Machines with Self

Chenhan Jiang 398 Dec 20, 2022
A particular navigation route using satellite feed and can help in toll operations & traffic managemen

How about adding some info that can quanitfy the stress on a particular navigation route using satellite feed and can help in toll operations & traffic management The current analysis is on the satel

Ashish Pandey 1 Feb 14, 2022
Source code of AAAI 2022 paper "Towards End-to-End Image Compression and Analysis with Transformers".

Towards End-to-End Image Compression and Analysis with Transformers Source code of our AAAI 2022 paper "Towards End-to-End Image Compression and Analy

37 Dec 21, 2022
Multistream CNN for Robust Acoustic Modeling

Multistream Convolutional Neural Network (CNN) A multistream CNN is a novel neural network architecture for robust acoustic modeling in speech recogni

ASAPP Research 37 Sep 21, 2022