Th2En & Th2Zh: The large-scale datasets for Thai text cross-lingual summarization

📥 Download Datasets
📥 Download Trained Models

INTRODUCTION

TH2ZH (Thai-to-Simplified Chinese) and TH2EN (Thai-to-English) are cross-lingual summarization (CLS) datasets. The source articles come from TR-TPBS, a monolingual Thai text summarization dataset. To create CLS datasets from TR-TPBS, we used a neural machine translation service to translate the articles into the target languages. For certain reasons, we were strongly advised not to name the service we used 🥺. We refer to it as the ‘main translation service’.

Cross-lingual summarization (cross-sum) is the task of summarizing a document written in one language into a short summary in another language.

Traditional cross-sum approaches are based on two techniques: early translation and late translation. Early translation is a translate-then-summarize method. Late translation, conversely, is a summarize-then-translate method.

However, classical cross-sum methods tend to carry errors from the monolingual summarization step or the translation step into the final cross-lingual summary. Several end-to-end approaches have been proposed to address the problems of the traditional ones. A couple of trained end-to-end models are also available for download.
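The two traditional pipelines described above differ only in the order of the two steps. The following is a minimal sketch of that idea, not the code used in the experiments: `Translator`, `Summarizer`, and the toy stand-in functions are hypothetical names introduced here for illustration.

```python
from typing import Callable

# Hypothetical type aliases: any function that maps text to text.
Translator = Callable[[str], str]
Summarizer = Callable[[str], str]


def early_translation(doc: str, translate: Translator, summarize: Summarizer) -> str:
    """Translate-then-summarize: the summarizer works in the target language."""
    return summarize(translate(doc))


def late_translation(doc: str, summarize: Summarizer, translate: Translator) -> str:
    """Summarize-then-translate: the summarizer works in the source language."""
    return translate(summarize(doc))


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs without any model or translation service.
    def toy_translate(text: str) -> str:
        return f"EN({text})"

    def toy_summarize(text: str) -> str:
        return f"SUM({text})"

    print(early_translation("thai article", toy_translate, toy_summarize))
    print(late_translation("thai article", toy_summarize, toy_translate))
```

In both pipelines the second step only ever sees the output of the first, which is why an error made early on cannot be corrected later.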

DATASET CONSTRUCTION

💡 Important Note: In contrast to Zhu et al., in our experiment we found that filtering out articles with the RTT technique significantly worsens the overall performance of the end-to-end models. Therefore, the full datasets are highly recommended.

We used TR-TPBS as the source documents for creating the cross-lingual summarization datasets. Following Zhu et al., we constructed Th2En and Th2Zh by translating the summary references into the target languages with a translation service and filtering out poorly translated summaries using the round-trip translation (RTT) technique. An overview of the dataset construction is presented in the figure below. Please refer to the corresponding paper for more details on RTT.

In our experiment, we set 𝑇1 and 𝑇2 to 0.45 and 0.2 respectively. With these thresholds, the back-translation filter removed 27.98% of the documents from Th2En and 56.79% of the documents from Th2Zh.

python3 src/tools/cls_dataset_construction.py \
--dataset th2en \
--input_csv path/to/full_dataset.csv \
--output_csv path/to/save/filtered_csv \
--r1 0.45 \
--r2 0.2

--dataset can be {th2en, th2zh}.
--r1 and --r2 set the ROUGE score thresholds (ROUGE-1 and ROUGE-2 respectively) used to filter out (presumably) poorly translated articles.
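To make the thresholding concrete, here is a minimal sketch of a round-trip filter. It is not the code in `cls_dataset_construction.py`: it assumes the inputs are already tokenized, that the scores are ROUGE-N F1 values, and that a pair is kept only when both thresholds are met. The names `rouge_n_f1` and `keep_pair` are hypothetical.

```python
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_f1(reference: List[str], candidate: List[str], n: int) -> float:
    """ROUGE-N F1 computed from clipped n-gram overlap."""
    ref, cand = ngrams(reference, n), ngrams(candidate, n)
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def keep_pair(original: List[str], back_translated: List[str],
              r1: float = 0.45, r2: float = 0.2) -> bool:
    """Assumed rule: keep a pair only if both round-trip scores reach their thresholds."""
    return (rouge_n_f1(original, back_translated, 1) >= r1
            and rouge_n_f1(original, back_translated, 2) >= r2)


if __name__ == "__main__":
    original = "the cabinet approved the new budget".split()
    faithful = "the cabinet approved the new budget".split()
    drifted = "the weather was nice today".split()
    print(keep_pair(original, faithful))
    print(keep_pair(original, drifted))
```

Here `original` stands for the tokenized Thai summary and `back_translated` for its round-trip translation; English tokens are used in the demo only to keep it readable.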

Dataset Statistics

Click the file name to download.

File	Number of Articles	Size
th2en_full.csv	310,926	2.96 GB
th2zh_full.csv	310,926	2.81 GB
testset.csv	3,000	44 MB
validation.csv	3,000	43 MB

Data Fields

Please refer to th2enzh_data_exploration.ipynb for more details.

Column	Description
`th_body`	Original Thai body text
`th_sum`	Original Thai summary
`th_title`	Original Thai article headline
`{en/zh}_body`	Translated body text
`{en/zh}_sum`	Translated summary
`{en/zh}_title`	Translated article headline
`{en/zh}2th`	Back translation of `{en/zh}_body`
`{en/zh}_gg_sum`	Translated summary (by Google Translate)
`url`	URL to original article’s webpage

{th/en/zh}_title are available only in the test set.
{en/zh}_gg_sum are also available only in the test set. At the time of the experiment, we assumed that Google Translate was better than the main translation service we were using. We intended to use these Google-translated summaries as alternative summary references, but in the end they were never used. We decided to make them available in the test set anyway, in case others find them useful.
{en/zh}_body were not used when training the end-to-end models. They were used only in the early translation methods.
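As a minimal sketch of how these columns might be consumed, the snippet below yields (Thai body, target summary) pairs, the two fields used for end-to-end training. It assumes the CSV files are comma-separated with a header row carrying the column names above; `training_pairs` is a hypothetical helper, and the sample row is made up.

```python
import csv
import io
from typing import Iterator, TextIO, Tuple


def training_pairs(csv_file: TextIO, target_lang: str = "en") -> Iterator[Tuple[str, str]]:
    """Yield (Thai body, target-language summary) pairs from a dataset CSV."""
    sum_col = f"{target_lang}_sum"  # "en_sum" or "zh_sum"
    for row in csv.DictReader(csv_file):
        yield row["th_body"], row[sum_col]


if __name__ == "__main__":
    # Tiny in-memory stand-in for a dataset file; the contents are made up.
    sample = io.StringIO(
        "th_body,th_sum,en_body,en_sum\n"
        "บทความ,สรุป,article,summary\n"
    )
    for body, summary in training_pairs(sample):
        print(body, "->", summary)
```

For a real file, pass a handle opened with `open(path, newline="", encoding="utf-8")` instead of the in-memory sample.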

AVAILABLE TRAINED MODELS

Model	Corresponding Paper	Thai -> English		Thai -> Simplified Chinese
Model	Corresponding Paper	Full	Filtered	Full	Filtered
TNCLS	Zhu et al., 2019	-	Available	-	-
CLS+MS	Zhu et al., 2019	Available	-	-	-
CLS+MT	Zhu et al., 2019	Available	-	Available	-
XLS – RL-ROUGE	Dou et al., 2020	Available	-	Available	-

To evaluate these trained models, please refer to xls_model_evaluation.ipynb and ncls_model_evaluation.ipynb.

If you wish to evaluate the models with our test sets, you can use the script below to create test files for the XLS and NCLS models.

python3 src/tools/create_cls_test_manifest.py \
--test_csv_path path/to/testset.csv \
--output_dir path/to/save/testset_files \
--use_google_sum {true/false} \
--max_tokens 500 \
--create_ms_ref {true/false}

output_dir is the path to the directory where you want to save the test set files.
use_google_sum can be {true/false}. If true, the summary references are taken from the {en/zh}_gg_sum columns. Default is false.
max_tokens is the maximum number of words in input articles. Default is 500 words. Articles that are too short or too long can significantly worsen the performance of the models.
create_ms_ref sets whether to create a Thai summary reference file for evaluating the MS task in the NCLS:CLS+MS model.

This script produces three files, namely test.CLS.source.thai.txt and test.CLS.target.{en/zh}.txt. test.CLS.source.thai.txt is used as the test file for the CLS task. test.CLS.target.{en/zh}.txt are the cross-lingual summary references for English and Chinese; they are used to evaluate ROUGE and BertScore. Each line corresponds to the article body on the same line of test.CLS.source.thai.txt.
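Because the source and reference files are aligned line by line, a quick consistency check before evaluation can catch a truncated or mismatched file. The following is a minimal sketch under that assumption; `read_lines` and `load_aligned` are hypothetical helpers, not part of the repository.

```python
from typing import List, Tuple


def read_lines(path: str) -> List[str]:
    """Read a UTF-8 text file into a list of lines without trailing newlines."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def load_aligned(source_path: str, target_path: str) -> List[Tuple[str, str]]:
    """Pair line i of the source file with line i of the reference file."""
    sources = read_lines(source_path)
    targets = read_lines(target_path)
    if len(sources) != len(targets):
        raise ValueError(
            f"line count mismatch: {len(sources)} sources vs {len(targets)} references"
        )
    return list(zip(sources, targets))
```

A call such as `load_aligned("test.CLS.source.thai.txt", "test.CLS.target.en.txt")` would then return one (article, reference summary) tuple per test example.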

🥳 We also evaluated the MT tasks in the XLS and NCLS:CLS+MT models. Please refer to xls_model_evaluation.ipynb and ncls_model_evaluation.ipynb for the BLEU score results. For the test sets that we used to evaluate the MT task, please refer to data/README.md.

EXPERIMENT RESULTS

🔆 Note that all end-to-end models reported in this section were trained on the filtered datasets, NOT the full datasets. For all end-to-end models, only `th_body` and `{en/zh}_sum` were present during training. We trained the end-to-end models for 1,000,000 steps and report the model checkpoints that yielded the highest overall ROUGE scores.

In this experiment, we used two automatic evaluation metrics, namely ROUGE and BertScore, to assess the performance of the CLS models. We evaluated ROUGE on Chinese text at the word level, NOT the character level.
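The choice of word-level evaluation matters because character-level matching can credit partial overlaps between different words. The toy example below illustrates this with a simplified ROUGE-1 F1. It is only a sketch: the word segmentation is written by hand rather than produced by a segmenter, and `rouge1_f1` is a hypothetical helper, not the evaluation code used for the reported scores.

```python
from collections import Counter
from typing import List


def rouge1_f1(reference: List[str], candidate: List[str]) -> float:
    """Simplified ROUGE-1 F1 from clipped unigram overlap."""
    overlap = sum((Counter(reference) & Counter(candidate)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


# Hand-written (assumed) word segmentation of two short Chinese summaries.
reference_words = ["北京", "大学", "开学"]
candidate_words = ["北京", "学生", "开学"]

# Word level: "大学" and "学生" are different tokens and do not match.
word_score = rouge1_f1(reference_words, candidate_words)

# Character level: the shared character "学" is counted as a match.
char_score = rouge1_f1(list("".join(reference_words)), list("".join(candidate_words)))

if __name__ == "__main__":
    print(f"word-level ROUGE-1 F1: {word_score:.3f}")
    print(f"char-level ROUGE-1 F1: {char_score:.3f}")
```

In this example the character-level score is higher than the word-level score for the same pair of summaries, so scores computed at the two levels are not directly comparable.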

We report BertScore only for abstractive summarization models. To compute BertScore, we used weights from the ‘roberta-large’ and ‘bert-base-chinese’ pretrained models for Th2En and Th2Zh respectively.

Model	Thai to English				Thai to Chinese
	ROUGE			BertScore	ROUGE			BertScore
	R1	R2	RL	F1	R1	R2	RL	F1
Traditional Approaches
Translated Headline	23.44	6.99	21.49	-	21.55	4.66	18.58	-
ETrans → LEAD2	51.96	42.15	50.01	-	44.18	18.83	43.84	-
ETrans → BertSumExt	51.85	38.09	49.50	-	34.58	14.98	34.84	-
ETrans → BertSumExtAbs	52.63	32.19	48.14	88.18	35.63	16.02	35.36	70.42
BertSumExt → LTrans	42.33	27.33	34.85	-	28.11	18.85	27.46	-
End-to-End Training Approaches
TNCLS	26.48	6.65	21.66	85.03	27.09	6.69	21.99	63.72
CLS+MS	32.28	15.21	34.68	87.22	34.34	12.23	28.80	67.39
CLS+MT	42.85	19.47	39.48	88.06	42.48	19.10	37.73	71.01
XLS – RL-ROUGE	42.82	19.62	39.53	88.03	43.20	19.19	38.52	72.19

LICENSE

The Thai cross-lingual summarization datasets, including TH2EN, TH2ZH, and the test and validation sets, are licensed under the MIT License.

ACKNOWLEDGEMENT

These cross-lingual datasets and the experiments are part of Nakhun Chumpolsathien’s master’s thesis at the School of Computer Science, Beijing Institute of Technology. A great appreciation therefore also goes to his supervisor, Assoc. Prof. Gao Yang.
Shout out to Tanachat Arayachutinan for the initial data processing and for introducing me to 麻辣烫 and 黄焖鸡.
We would like to thank the Beijing Engineering Research Center of High Volume Language Information Processing and Cloud Computing Applications for providing the computing resources to conduct the experiment.
In this experiment, we used PyThaiNLP v. 2.2.4 to tokenize Thai texts (at both the word and sentence levels). For Chinese and English segmentation, we used Stanza.
