Klexikon: A German Dataset for Joint Summarization and Simplification

Dennis Aumiller and Michael Gertz
Heidelberg University

Under submission at LREC 2022
A preprint version of the paper can be found on arXiv!
For easy access, we have also made the dataset available on Huggingface Datasets!


Data Availability

To use the data in your experiments, we suggest the existing training/validation/test split, available in ./data/splits/. This split was generated with a stratified sampling strategy (based on document lengths) and an 80/10/10 ratio, which ensures that samples of different lengths are distributed somewhat evenly across the three subsets.
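The idea of a length-stratified 80/10/10 split can be sketched as follows. This is a minimal illustration and not the script used to produce ./data/splits/: the function name `stratified_split`, the number of bins, and the seed are assumptions made for this example.

```python
import random
from collections import defaultdict


def stratified_split(doc_lengths, ratios=(0.8, 0.1, 0.1), num_bins=10, seed=42):
    """Split document ids into train/validation/test, stratified by length.

    doc_lengths maps a document id to its length (e.g., number of tokens).
    All names and defaults here are hypothetical.
    """
    rng = random.Random(seed)
    # Rank documents by length and assign them to equally sized bins (strata).
    ranked = sorted(doc_lengths, key=doc_lengths.get)
    bins = defaultdict(list)
    for rank, doc_id in enumerate(ranked):
        bins[rank * num_bins // len(ranked)].append(doc_id)

    train, validation, test = [], [], []
    # Sample from every stratum separately so each subset sees all length ranges.
    for members in bins.values():
        rng.shuffle(members)
        n_train = round(ratios[0] * len(members))
        n_val = round(ratios[1] * len(members))
        train.extend(members[:n_train])
        validation.extend(members[n_train:n_train + n_val])
        test.extend(members[n_train + n_val:])
    return train, validation, test
```

Binning by rank rather than by absolute length keeps the strata equally sized even when a few articles are much longer than the rest.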

Alternatively, please refer to our Huggingface Datasets version for easy access to the preprocessed data.

Installation

This repository contains the code to crawl the Klexikon data set presented in our paper, as well as all associated baselines and splits. You can work on the existing code base by simply cloning this repository.

Install all required dependencies with the following command:

python3 -m pip install -r requirements.txt

The experiments were run on Python 3.8.4, but should run fine with any version >3.7. Because the code uses relative imports, files have to be run as modules, e.g.,

python3 -m klexikon.analysis.compare_offline_stats

instead of

python3 klexikon/analysis/compare_offline_stats.py

Furthermore, the working directory has to be the repository root so that relative data paths resolve correctly. For example, if you cloned this repository into /home/dennis/projects/klexikon, make sure to run scripts directly from this path.

Extended Explanation

Manually Replaced Articles in articles.json

Aside from the manual matches, which can be produced by create_matching_url_list.py, some Klexikon articles simply link to an incorrect article in Wikipedia.
We detect these cases approximately via the number of paragraphs in the Wikipedia article: it is generally much longer than the Klexikon article and should therefore have at least 15 paragraphs. Note that most of the affected pages are disambiguation pages, which unfortunately do not necessarily correspond neatly to a single Wikipedia page. We remove an article if no single Wikipedia article covers more than 66% of the paragraphs in the Klexikon article. Some examples of manual changes:

  • "Aal" to "Aale"
  • "Abendmahl" to "Abendmahl Jesu"
  • "Achse" to "Längsachse"
  • "Ader" to "Blutgefäß"
  • "Albino" to "Albinismus"
  • "Alkohol" to "Ethanol"
  • "Android" to "Android (Betriebssystem)"
  • "Anschrift" to "Postanschrift"
  • "Apfel" to "Kulturapfel"
  • "App" to "Mobile App"
  • "Appenzell" to "Appenzellerland"
  • "Arabien" to "Arabische Halbinsel"
  • "Atlas" to "Atlas (Kartografie)"
  • "Atmosphäre" to "Erdatmosphäre"

Merging sentences that end in a semicolon (;)

This applies at any position in the document and rectifies some unwanted sentence splits produced by spaCy.
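A minimal sketch of such a merge, assuming the document is held as a list of lines with one sentence per line. The function name `merge_semicolon_lines` is hypothetical, and skipping empty follow-up lines is an assumption of this example rather than documented behavior.

```python
def merge_semicolon_lines(lines):
    """Merge every line that ends in ';' with the non-empty line following it."""
    merged = []
    for line in lines:
        previous_ends_in_semicolon = bool(merged) and merged[-1].rstrip().endswith(";")
        if previous_ends_in_semicolon and line.strip():
            # Re-attach the fragment that was split off after the semicolon.
            merged[-1] = merged[-1].rstrip() + " " + line.lstrip()
        else:
            merged.append(line)
    return merged
```

Because the check always looks at the last merged line, chains of several semicolon-terminated fragments collapse into a single sentence.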

Merge of short lines in lead 3 baseline

We also check for lines with fewer than 10 characters among the first three sentences and merge them. This helps fix the lead-3 baseline, since most of these issues arise from incorrect sentence splits to begin with.
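One possible way to implement this check is sketched below. It assumes that a short line is merged into the line that follows it, which the description above does not state explicitly; the function name and parameters are hypothetical.

```python
def merge_short_lead_lines(lines, lead_size=3, min_chars=10):
    """Merge lead lines shorter than min_chars into the following line.

    Assumption: a too-short line is a fragment of the next sentence.
    """
    lines = list(lines)  # work on a copy
    i = 0
    while i < min(lead_size, len(lines) - 1):
        current = lines[i].strip()
        following = lines[i + 1].strip()
        if 0 < len(current) < min_chars and following:
            lines[i] = current + " " + following
            del lines[i + 1]
            # Stay at index i: the merged line is re-checked on the next pass.
        else:
            i += 1
    return lines
```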

Removal of coordinates

Sometimes the text starts with coordinate information, which seems to be embedded in some Wikipedia articles. We remove any such coordinates with a simple regex.
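The exact regex used in the repository is not shown here. As an illustration only, the sketch below assumes coordinates in the degree/minute notation common on German Wikipedia (e.g., "49° 25′ N, 8° 43′ O"), optionally with seconds and a "Koordinaten:" prefix. The pattern and the function name `remove_coordinates` are assumptions.

```python
import re

# Assumed coordinate format: "49° 25′ N, 8° 43′ O", optionally with seconds (″).
COORDINATE_PATTERN = re.compile(
    r"(?:Koordinaten:\s*)?"
    r"\d{1,3}°\s*\d{1,2}′\s*(?:\d{1,2}(?:[.,]\d+)?″\s*)?[NS]"
    r",?\s*"
    r"\d{1,3}°\s*\d{1,2}′\s*(?:\d{1,2}(?:[.,]\d+)?″\s*)?[OWE]"
)


def remove_coordinates(text):
    """Strip coordinate strings that match the assumed pattern."""
    return COORDINATE_PATTERN.sub("", text).strip()
```

Text without a matching coordinate string passes through unchanged, apart from stripped surrounding whitespace.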

Sentences that do not end in a period

Sentences in the lead 3 that do not end in a period required correction. This has been fixed automatically by merging content, similarly to the semicolon case. Specifically, we only merge if the subsequent line is not an empty line.

Using your own data

Currently, the systems expect input data to be processed in a line-by-line fashion, where every line represents a sentence, and each file represents an input document. Note that we currently do not support multi-document summarization.
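Reading data in this format could look like the following sketch. The helper names `load_document` and `load_corpus` and the ".txt" file pattern are assumptions for illustration and are not part of this repository.

```python
from pathlib import Path


def load_document(path):
    """Read one document: every line of the file is treated as one sentence."""
    return Path(path).read_text(encoding="utf-8").splitlines()


def load_corpus(directory, pattern="*.txt"):
    """Map each file name (without extension) to its list of sentences."""
    return {
        file.stem: load_document(file)
        for file in sorted(Path(directory).glob(pattern))
    }
```

Each file stays a separate document, which matches the single-document setting described above.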

Criteria for discarding articles

Articles for which the Wikipedia article has fewer than 15 paragraphs are discarded. Beyond that, we manually discard articles that have no matching article in Wikipedia (see above). Examples of the latter case are "Kiwi" and "Washington".
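The automatic part of this criterion, the 15-paragraph threshold, can be expressed as a simple filter. The sketch assumes each Wikipedia article is available as a list of paragraphs keyed by title; the function name `filter_pairs` is hypothetical, and the manual discarding step is not covered.

```python
def filter_pairs(wikipedia_articles, min_paragraphs=15):
    """Separate article titles by whether the Wikipedia text is long enough.

    wikipedia_articles maps a title to the list of its paragraphs.
    """
    kept, discarded = {}, []
    for title, paragraphs in wikipedia_articles.items():
        if len(paragraphs) >= min_paragraphs:
            kept[title] = paragraphs
        else:
            # Too short: likely a disambiguation page or an incorrect match.
            discarded.append(title)
    return kept, discarded
```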

Reasons for not using lists

As described in the paper, we discard any element that is not a <p> tag in the HTML code. This helps to get rid of actually unwanted information (images, image captions, meta-descriptors, etc.), but also removes list items. After reviewing some examples, we decided to discard list elements altogether. As a side effect, some articles (especially disambiguation pages) become easier to detect.
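Keeping only paragraph elements can be sketched with Python's standard-library `html.parser`. This is an illustration under the assumption that the elements in question are `<p>` tags; the crawler in this repository may use a different HTML library, and the class and function names below are hypothetical.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text inside <p> elements and ignore everything else."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_paragraph = False
        self._buffer = []

    def _flush(self):
        # Normalize whitespace and store the paragraph if it has any text.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.paragraphs.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            if self._in_paragraph:  # previous <p> was never closed
                self._flush()
            self._in_paragraph = True

    def handle_endtag(self, tag):
        if tag == "p" and self._in_paragraph:
            self._flush()
            self._in_paragraph = False

    def handle_data(self, data):
        if self._in_paragraph:
            self._buffer.append(data)


def extract_paragraphs(html):
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return parser.paragraphs
```

Text nested in inline tags such as links or bold markup inside a paragraph is kept, while headings and list items are dropped entirely.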

Final number of valid article pairs: 2898

This means we had to discard around 250 articles from the original list at the time of crawling (April 2021). In the meantime, there have been new articles added to Klexikon, which leaves room for future improvements.

Execution Order of Scripts

TK: I'll include a better reference to the particular scripts in the near future, as well as a script that actually executes everything relevant in order.

  • Generate JSON file with article URLs
  • Crawl texts
  • Fix lead sentences
  • Remove unused articles (optional)
  • Generate stratified split

License Information

Both Wikipedia and Klexikon make their textual contents available under the CC BY-SA license. Per recommendation of the Creative Commons, we apply a separate license to the software component of this repository. Data will be re-distributed under the CC BY-SA license.

Contributions

Contributions are very welcome. Please either open an issue or pull request if you have any suggestion on how this data can be improved. Open TODOs:

  • So far, the data does not have more than a few simplistic baselines, and lacks an actually trained system on top of the data.
  • The dataset is "out-of-date", since it does not include any of the more recently added articles (~100 since the inception of my version). Potentially, we can increase the availability to almost 3000 articles.
  • Adding a top-level script that runs the different scripts in the correct order to generate baselines/results/etc.
  • Adding a proper data managing script for the Huggingface Datasets version of this dataset.

How to Cite?

If you use our dataset, or code from this repository, please cite

@article{aumiller-gertz-2022-klexikon,  
  title   = {{Klexikon: A German Dataset for Joint Summarization and Simplification}},  
  author  = {Aumiller, Dennis and Gertz, Michael},  
  year    = {2022},  
  journal = {arXiv preprint arXiv:2201.07198},  
  url     = {https://arxiv.org/abs/2201.07198},  
}