CVSS: A Massively Multilingual Speech-to-Speech Translation Corpus

Last update: Jan 06, 2023

Overview

CVSS is a massively multilingual-to-English speech-to-speech translation corpus, covering sentence-level parallel speech-to-speech translation pairs from 21 languages into English. CVSS is derived from the Common Voice speech corpus and the CoVoST 2 speech-to-text translation corpus. The translation speech in CVSS is synthesized with two state-of-the-art TTS models trained on the LibriTTS corpus.

CVSS includes two versions of spoken translation for all 21 x-en language pairs from CoVoST 2, each offering distinct advantages:

CVSS-C: All translation speech is in a single canonical speaker's voice. Despite being synthetic, the speech is highly natural and clean, with a consistent speaking style. These properties ease the modelling of the target speech and enable models to produce high-quality translation speech suitable for user-facing applications.
CVSS-T: The translation speech is in voices transferred from the corresponding source speech. Each translation pair has similar voices on both sides despite being in different languages, making this dataset suitable for building models that preserve speakers' voices when translating speech into different languages.

Together with the source speech from Common Voice, they form two multilingual speech-to-speech translation datasets, each with about 1,900 hours of speech.

In addition to translation speech, CVSS also provides normalized translation text matching the pronunciation in the translation speech (e.g. for numbers, currencies, acronyms, etc.), which can be used both for model training and for standardizing evaluation.

Please check out our paper for the detailed description of this corpus, as well as the baseline models we trained on both datasets.

Getting the data

The translation speech and the normalized translation text in CVSS can be downloaded from the links in the following table:

Source language	Code	CVSS-C	CVSS-T
Arabic	ar	link	link
Catalan	ca	link	link
Welsh	cy	link	link
German	de	link	link
Estonian	et	link	link
Spanish	es	link	link
Persian	fa	link	link
French	fr	link	link
Indonesian	id	link	link
Italian	it	link	link
Japanese	ja	link	link
Latvian	lv	link	link
Mongolian	mn	link	link
Dutch	nl	link	link
Portuguese	pt	link	link
Russian	ru	link	link
Slovenian	sl	link	link
Swedish	sv	link	link
Tamil	ta	link	link
Turkish	tr	link	link
Chinese	zh	link	link

Each tar.gz file in the links above includes train, dev and test directories containing audio clips as the translation speech, as well as train.tsv, dev.tsv and test.tsv files containing the normalized translation text. The normalized translation text files included in CVSS-C and CVSS-T are identical.
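
As a quick sanity check after downloading, the layout described above can be inspected without unpacking the archive. The following is a minimal sketch using only the Python standard library; the function name count_clips_per_split is hypothetical, and it assumes only that audio clips sit somewhere under directories named train, dev and test (whether or not the archive has an extra top-level folder).

```python
import tarfile
from collections import Counter
from pathlib import PurePosixPath

SPLITS = ("train", "dev", "test")


def count_clips_per_split(archive) -> Counter:
    """Count regular files stored under train/, dev/ and test/ directories.

    `archive` is a binary file object for a .tar.gz file, for example
    the result of open(path, "rb").
    """
    counts = Counter()
    with tarfile.open(fileobj=archive, mode="r:gz") as tar:
        for member in tar:
            if not member.isfile():
                continue  # skip directory entries and links
            parents = PurePosixPath(member.name).parts[:-1]
            # A file belongs to a split if one of its parent directories
            # carries the split's name; top-level *.tsv files are ignored.
            for split in SPLITS:
                if split in parents:
                    counts[split] += 1
                    break
    return counts
```

Calling the function on an opened archive returns a Counter keyed by split name, which can be compared against the number of rows in the matching .tsv file.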

These translation audio clips and translation texts are to be paired with the Common Voice release version 4 (required) based on the audio file names. If you need the original translation text without normalization, it is provided by CoVoST 2.
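
Pairing by audio file name can be sketched as follows. This is an illustration under stated assumptions, not the official loading code: it assumes each .tsv row holds a file name and the normalized text separated by a tab, and that a source clip and its translation clip share the same base name and differ only in directory and extension. The helper names (clip_key, read_normalized_text, pair_clips) and the example file names are hypothetical; check them against the files you actually downloaded.

```python
import csv
import io
from pathlib import Path


def clip_key(file_name: str) -> str:
    """Reduce a path to the part of the file name before the first dot.

    Assumption: 'clips/x_001.mp3' and 'train/x_001.mp3.wav' denote the
    same utterance and should therefore map to the same key.
    """
    return Path(file_name).name.split(".", 1)[0]


def read_normalized_text(tsv_text: str) -> dict:
    """Parse TSV rows of (file name, normalized translation text)."""
    reader = csv.reader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    return {clip_key(row[0]): row[1] for row in reader if len(row) >= 2}


def pair_clips(source_files, translation_files, normalized):
    """Return (source clip, translation clip, text) for every shared key."""
    translations = {clip_key(f): f for f in translation_files}
    pairs = []
    for src in source_files:
        key = clip_key(src)
        # Keep only utterances present in all three places.
        if key in translations and key in normalized:
            pairs.append((src, translations[key], normalized[key]))
    return pairs
```

Source clips without a matching translation clip or text row are silently dropped here; a stricter loader might instead log or raise on mismatches.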

License

CVSS is released under the very permissive Creative Commons Attribution 4.0 International (CC BY 4.0) license.

Citation

Please cite this paper when referencing the CVSS corpus:

@misc{jia2022cvss,
    title={{CVSS} Corpus and Massively Multilingual Speech-to-Speech Translation},
    author={Jia, Ye and Tadmor Ramanovich, Michelle and Wang, Quan and Zen, Heiga},
    eprint={2201.03713},
    archivePrefix={arXiv},
    year={2022}
}

Owner

Google Research Datasets
