Klexikon: A German Dataset for Joint Summarization and Simplification

Overview

Klexikon: A German Dataset for Joint Summarization and Simplification

Dennis Aumiller and Michael Gertz
Heidelberg University

Under submission at LREC 2022
A preprint version of the paper can be found on arXiv!
For easy access, we have also made the dataset available on Huggingface Datasets!


Data Availability

To use data in your experiments, we suggest the existing training/validation/test split, available in ./data/splits/. This split has been generated with a stratified sampling strategy (based on document lengths) and a 80/10/10 split, which ensure that the samples are somewhat evenly distributed.

Alternatively, please refer to our Huggingface Datasets version for easy access of the preprocessed data.

Installation

This repository contains the code to crawl the Klexikon data set presented in our paper, as well as all associated baselines and splits. You can work on the existing code base by simply cloning this repository.

Install all required dependencies with the following command:

python3 -m pip install -r requirements.txt

The experiments were run on Python 3.8.4, but should run fine with any version >3.7. To run files, relative imports are required, which forces you to run them as modules, e.g.,

python3 -m klexikon.analysis.compare_offline_stats

instead of

python3 klexikon/analysis/compare_offline_stats.py

Furthermore, this requires the working directory to be the root folder as well, to ensure correct referencing of relative data paths. I.e., if you cloned this repository into /home/dennis/projects/klexikon, make sure to run scripts directly from this path.

Extended Explanation

Manually Replaced Articles in articles.json

Aside from all the manual matches, which can be produced by create_matching_url_list.py, there are some articles which simply link to an incorrect article in Wikipedia.
We approximate this by the number of paragraphs in the Wikipedia article, which is generally much longer than the Klexikon article, and therefore should have at least 15 paragraphs. Note that most of the pages are disambiguations, which unfortunately don't necessarily correspond neatly to a singular Wikipedia page. We remove the article if it is not possible to find a singular Wikipedia article that covers more than 66% of the paragraphs in the Klexikon article. Some examples for manual changes were:

  • "Aal" to "Aale"
  • "Abendmahl" to "Abendmahl Jesu"
  • "Achse" to "Längsachse"
  • "Ader" to "Blutgefäß"
  • "Albino" to "Albinismus"
  • "Alkohol" to "Ethanol"
  • "Android" to "Android (Betriebssystem)"
  • "Anschrift" to "Postanschrift"
  • "Apfel" to "Kulturapfel"
  • "App" to "Mobile App"
  • "Appenzell" to "Appenzellerland"
  • "Arabien" to "Arabische Halbinsel"
  • "Atlas" to "Atlas (Kartografie)"
  • "Atmosphäre" to "Erdatmospähre"

Merging sentences that end in a semicolon (;)

This applies to any position in the document. The reason is rectifying some unwanted splits by spaCy.

Merge of short lines in lead 3 baseline

Also checking for lines that have less than 10 characters in the first three sentences. This helps with fixing the lead-3 baseline, and most issues arise from some incorrect splits to begin with.

Removal of coordinates

Sometimes, coordinate information is leading in the data, which seems to be embedded in some Wikipedia articles. We remove any coordinate with a simple regex.

Sentences that do not end in a period

Manual correction of sentences (in the lead 3) that do not end in periods. This has been automatically fixed by merging content similarly to the semicolon case. Specifically, we only merge if the subsequent line is not just an empty line.

Using your own data

Currently, the systems expect input data to be processed in a line-by-line fashion, where every line represents a sentence, and each file represents an input document. Note that we currently do not support multi-document summarization.

Criteria for discarding articles

Articles where Wikipedia has less than 15 paragraphs. Otherwise, manually discarding when there are no matching articles in Wikipedia (see above). Examples of the latter case are for example "Kiwi" or "Washington"

Reasons for not using lists

As described in the paper, we discard any element that is not a

tag in the HTLM code. This helps getting rid of actual unwanted information (images, image captions, meta-descriptors, etc.), but also removes list items. After reviewing some examples, we have decided to discard list elements altogether. This means that some articles (especially disambiguation pages) are also easier to detect.

Final number of valid article pairs: 2898

This means we had to discard around 250 articles from the original list at the time of crawling (April 2021). In the meantime, there have been new articles added to Klexikon, which leaves room for future improvements.

Execution Order of Scripts

TK: I'll include a better reference to the particular scripts in the near future, as well as a script that actually executes everything relevant in order.

  • Generate JSON file with article URLs
  • Crawl texts
  • Fix lead sentences
  • Remove unused articles (optional)
  • Generate stratified split

License Information

Both Wikipedia and Klexikon make their textual contents available under the CC BY-SA license. Per recommendation of the Creative Commons, we apply a separate license to the software component of this repository. Data will be re-distributed under the CC BY-SA license.

Contributions

Contributions are very welcome. Please either open an issue or pull request if you have any suggestion on how this data can be improved. Open TODOs:

  • So far, the data does not have more than a few simplistic baselines, and lacks an actually trained system on top of the data.
  • The dataset is "out-of-date", since it does not include any of the more recently articles (~100 since the inception of my version). Potentially, we can increase the availability to almost 3000 articles.
  • Adding a top-level script that adds correct execution order of different scripts to generate baselines/results/etc.
  • Adding a proper data managing script for the Huggingface Datasets version of this dataset.

How to Cite?

If you use our dataset, or code from this repository, please cite

@article{aumiller-gertz-2022-klexikon,  
  title   = {{Klexikon: A German Dataset for Joint Summarization and Simplification}},  
  author  = {Aumiller, Dennis and Gertz, Michael},  
  year    = {2022},  
  journal = {arXiv preprint arXiv:2201.07198},  
  url     = {https://arxiv.org/abs/2201.07198},  
}
Owner
Dennis Aumiller
PhD student in Information Retrieval & NLP at Heidelberg University. Python is awesome, and so is Huggingface
Dennis Aumiller
KakaoBrain KoGPT (Korean Generative Pre-trained Transformer)

KoGPT KoGPT (Korean Generative Pre-trained Transformer) https://github.com/kakaobrain/kogpt https://huggingface.co/kakaobrain/kogpt Model Descriptions

Kakao Brain 797 Dec 26, 2022
LegalNLP - Natural Language Processing Methods for the Brazilian Legal Language

LegalNLP - Natural Language Processing Methods for the Brazilian Legal Language ⚖️ The library of Natural Language Processing for Brazilian legal lang

Felipe Maia Polo 125 Dec 20, 2022
Code for EMNLP 2021 main conference paper "Text AutoAugment: Learning Compositional Augmentation Policy for Text Classification"

Code for EMNLP 2021 main conference paper "Text AutoAugment: Learning Compositional Augmentation Policy for Text Classification"

LancoPKU 105 Jan 03, 2023
Unet-TTS: Improving Unseen Speaker and Style Transfer in One-shot Voice Cloning

Unet-TTS: Improving Unseen Speaker and Style Transfer in One-shot Voice Cloning English | 中文 ❗ Now we provide inferencing code and pre-training models

164 Jan 02, 2023
Py65 65816 - Add support for the 65C816 to py65

Add support for the 65C816 to py65 Py65 (https://github.com/mnaberez/py65) is a

4 Jan 04, 2023
State-of-the-art NLP through transformer models in a modular design and consistent APIs.

Trapper (Transformers wRAPPER) Trapper is an NLP library that aims to make it easier to train transformer based models on downstream tasks. It wraps h

Open Business Software Solutions 42 Sep 21, 2022
RuCLIP tiny (Russian Contrastive Language–Image Pretraining) is a neural network trained to work with different pairs (images, texts).

RuCLIPtiny Zero-shot image classification model for Russian language RuCLIP tiny (Russian Contrastive Language–Image Pretraining) is a neural network

Shahmatov Arseniy 26 Sep 20, 2022
a CTF web challenge about making screenshots

screenshotter (web) A CTF web challenge about making screenshots. It is inspired by a bug found in real life. The challenge was created by @LiveOverfl

219 Jan 02, 2023
A Python/Pytorch app for easily synthesising human voices

Voice Cloning App A Python/Pytorch app for easily synthesising human voices Documentation Discord Server Video guide Voice Sharing Hub FAQ's System Re

Ben Andrew 840 Jan 04, 2023
Fake news detector filters - Smart filter project allow to classify the quality of information and web pages

fake-news-detector-1.0 Lists, lists and more lists... Spam filter list, quality keyword list, stoplist list, top-domains urls list, news agencies webs

Memo Sim 1 Jan 04, 2022
A demo of chinese asr

chinese_asr_demo 一个端到端的中文语音识别模型训练、测试框架 具备数据预处理、模型训练、解码、计算wer等等功能 训练数据 训练数据采用thchs_30,

4 Dec 09, 2021
The Internet Archive Research Assistant - Daily search Internet Archive for new items matching your keywords

The Internet Archive Research Assistant - Daily search Internet Archive for new items matching your keywords

Kay Savetz 60 Dec 25, 2022
LCG T-TEST USING EUCLIDEAN METHOD

This project has been created for statistical usage, purposing for determining ATL takers and nontakers using LCG ttest and Euclidean Method, especially for internal business case in Telkomsel.

2 Jan 21, 2022
This is a simple item2vec implementation using gensim for recbole

recbole-item2vec-model This is a simple item2vec implementation using gensim for recbole( https://recbole.io ) Usage When you want to run experiment f

Yusuke Fukasawa 2 Oct 06, 2022
code for modular summarization work published in ACL2021 by Krishna et al

This repository contains the code for running modular summarization pipelines as described in the publication Krishna K, Khosla K, Bigham J, Lipton ZC

Approximately Correct Machine Intelligence (ACMI) Lab 21 Nov 24, 2022
End-2-end speech synthesis with recurrent neural networks

Introduction New: Interactive demo using Google Colaboratory can be found here TTS-Cube is an end-2-end speech synthesis system that provides a full p

Tiberiu Boros 214 Dec 07, 2022
Materials (slides, code, assignments) for the NYU class I teach on NLP and ML Systems (Master of Engineering).

FREE_7773 Repo containing material for the NYU class (Master of Engineering) I teach on NLP, ML Sys etc. For context on what the class is trying to ac

Jacopo Tagliabue 90 Dec 19, 2022
Semi-automated vocabulary generation from semantic vector models

vec2word Semi-automated vocabulary generation from semantic vector models This script generates a list of potential conlang word forms along with asso

9 Nov 25, 2022
Finally decent dictionaries based on Wiktionary for your beloved eBook reader.

eBook Reader Dictionaries Finally, decent dictionaries based on Wiktionary for your beloved eBook reader. Dictionaries Catalan 🚧 Ελληνικά (help welco

Mickaël Schoentgen 163 Dec 31, 2022
基于pytorch+bert的中文事件抽取

pytorch_bert_event_extraction 基于pytorch+bert的中文事件抽取,主要思想是QA(问答)。 要预先下载好chinese-roberta-wwm-ext模型,并在运行时指定模型的位置。

西西嘛呦 31 Nov 30, 2022