STEMM: Self-learning with Speech-Text Manifold Mixup for Speech Translation

This is a PyTorch implementation for the ACL 2022 main conference paper STEMM: Self-learning with Speech-text Manifold Mixup for Speech Translation.

Training a Model on MuST-C

Let's first take a look at training an En-De model as an example.

Environment Configuration

  1. Clone this repository:
git clone [email protected]:ictnlp/STEMM.git
cd STEMM/
  2. Install Montreal Forced Aligner following the official guidance. Please also download the pretrained models and dictionary for MFA (see the sketch after this list).

  3. Please make sure you have installed PyTorch, and then install fairseq and other packages as follows:

pip install --editable ./
python3 setup.py install --user
python3 setup.py build_ext --inplace
pip install inflect sentencepiece soundfile textgrid pandas
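
If you use MFA 2.x, the acoustic model and dictionary can be fetched from the command line. This is a minimal sketch assuming MFA 2.x; the model name english_us_arpa is only an example, so check the MFA documentation for the one matching your setup:

# Hedged sketch, assuming MFA 2.x; model names may differ in your version
mfa model download acoustic english_us_arpa
mfa model download dictionary english_us_arpa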

Data Preparation

  1. First make a directory to store the dataset:
TGT_LANG=de
MUSTC_ROOT=data/mustc/
mkdir -p $MUSTC_ROOT
  2. Download the MuST-C v1.0 archive MUSTC_v1.0_en-de.tar.gz to the $MUSTC_ROOT path, and uncompress it:
cd $MUSTC_ROOT
tar -xzvf MUSTC_v1.0_en-de.tar.gz
  3. Return to the root directory and run the preprocessing script preprocess.sh, which performs forced alignment and organizes the raw data and alignment information into .tsv format:
sh preprocess.sh $TGT_LANG
  4. Finally, the directory $MUSTC_ROOT should look like this (a quick sanity check follows the listing):
.
├── en-de
│   ├── config_raw.yaml
│   ├── data
│   ├── dev_raw_seg_plus.tsv
│   ├── docs
│   ├── segment
│   ├── spm_unigram10000_raw.model
│   ├── spm_unigram10000_raw.txt
│   ├── spm_unigram10000_raw.vocab
│   ├── train_raw_seg_plus.tsv
│   ├── tst-COMMON_raw_seg_plus.tsv
│   └── tst-HE_raw_seg_plus.tsv
└── MUSTC_v1.0_en-de.tar.gz
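
As a quick sanity check that preprocessing succeeded, peek at the first rows of the generated training manifest; each row should describe one aligned speech segment with its transcript and translation:

head -n 2 ${MUSTC_ROOT}/en-${TGT_LANG}/train_raw_seg_plus.tsv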

Pretrain the MT Module

[OPTIONAL] Use External MT Corpus

If you want to use an external MT corpus, please first pretrain an MT model on it following these steps:

  1. Perform BPE on the external corpus with the sentencepiece model learned on MuST-C. As mentioned in our paper, we use WMT for En-De, En-Fr, En-Ru, En-Es, En-Ro, and OPUS100 for En-Pt, En-It, En-Nl as external corpora. You can download them from the internet and put them in the data/ext_en${TGT_LANG}/ directory. Run the following command, replacing $input_file with the path of the raw text, to perform BPE. You should apply BPE to texts in both the source and target language of all subsets (train/valid/test); see the loop sketch after this list.
python3 data/scripts/apply_spm.py --input-file $input_file --output-file $output_file --model data/mustc/en-${TGT_LANG}/spm_unigram10000_raw.model
  2. Use the fairseq-preprocess command to convert the BPE texts into fairseq formats. Make sure to use the sentencepiece dictionary learned on MuST-C.
spm_dict=data/mustc/en-${TGT_LANG}/spm_unigram10000_raw.txt
fairseq-preprocess --source-lang en --target-lang $TGT_LANG --trainpref data/ext_en${TGT_LANG}/train --validpref data/ext_en${TGT_LANG}/valid --testpref data/ext_en${TGT_LANG}/test --destdir data/ext_en${TGT_LANG}/binary --joined-dictionary --srcdict $spm_dict --tgtdict $spm_dict --workers=20 --nwordssrc 10000 --nwordstgt 10000
  3. Train the model using the following command:
sh pretrain_mt_ext.sh $TGT_LANG
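
A minimal sketch of the loop for step 1, assuming (hypothetically) that the raw external corpus is stored as plain-text files named ${subset}.raw.${lang} under data/ext_en${TGT_LANG}/; the output names are chosen to match the --trainpref/--validpref/--testpref prefixes used in step 2:

# Apply the MuST-C sentencepiece model to every subset in both languages
for subset in train valid test; do
    for lang in en ${TGT_LANG}; do
        python3 data/scripts/apply_spm.py \
            --input-file data/ext_en${TGT_LANG}/${subset}.raw.${lang} \
            --output-file data/ext_en${TGT_LANG}/${subset}.${lang} \
            --model data/mustc/en-${TGT_LANG}/spm_unigram10000_raw.model
    done
done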

Pretrain the MT module on MuST-C

  1. Run the following script to pretrain the MT module. The argument --load-pretrained-mt-encoder-decoder-from indicates the path of the MT model pretrained on the external corpus obtained in the last step.
sh pretrain_mt.sh $TGT_LANG
  2. To ensure consistent performance, we have released checkpoints of our pretrained MT modules. You can download them and use them directly to initialize the MT module in our model for the following experiments (a download example follows the table).
Direction | Link
--------- | ----
En-De | https://lf3-nlp-opensource.bytetos.com/obj/nlp-opensource/acl2022/stmm/mustc_ende_mt.pt
En-Fr | https://lf3-nlp-opensource.bytetos.com/obj/nlp-opensource/acl2022/stmm/mustc_enfr_mt.pt
En-Es | https://lf3-nlp-opensource.bytetos.com/obj/nlp-opensource/acl2022/stmm/mustc_enes_mt.pt
En-Ro | https://lf3-nlp-opensource.bytetos.com/obj/nlp-opensource/acl2022/stmm/mustc_enro_mt.pt
En-Ru | https://lf3-nlp-opensource.bytetos.com/obj/nlp-opensource/acl2022/stmm/mustc_enru_mt.pt
En-Nl | https://lf3-nlp-opensource.bytetos.com/obj/nlp-opensource/acl2022/stmm/mustc_ennl_mt.pt
En-It | https://lf3-nlp-opensource.bytetos.com/obj/nlp-opensource/acl2022/stmm/mustc_enit_mt.pt
En-Pt | https://lf3-nlp-opensource.bytetos.com/obj/nlp-opensource/acl2022/stmm/mustc_enpt_mt.pt
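
For example, to fetch the En-De checkpoint (storing it under checkpoints/ is only a convention here; point --load-pretrained-mt-encoder-decoder-from at wherever you save it):

mkdir -p checkpoints
wget https://lf3-nlp-opensource.bytetos.com/obj/nlp-opensource/acl2022/stmm/mustc_ende_mt.pt -P checkpoints/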

Training

  1. Download the pretrained wav2vec 2.0 model from the official link, and put it in the checkpoints/ directory (see the example after this list).
  2. Just run the training scripts:
sh train.sh $TGT_LANG
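
A hedged example for step 1, assuming the wav2vec 2.0 Base checkpoint from the fairseq repository is the one required; verify against the official link above:

# wav2vec 2.0 Base (no finetuning) from fairseq
mkdir -p checkpoints
wget https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec_small.pt -P checkpoints/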

Evaluate

  1. Run the following script to average the last 10 checkpoints and evaluate on the tst-COMMON set:
sh test.sh mustc_en${TGT_LANG}_stmm_self_learning $TGT_LANG
  2. We have also released our checkpoints as follows. You can download and evaluate them directly (an example follows the table).
Direction | Link
--------- | ----
En-De | https://lf3-nlp-opensource.bytetos.com/obj/nlp-opensource/acl2022/stmm/mustc_ende_stmm_self_learning.pt
En-Fr | https://lf3-nlp-opensource.bytetos.com/obj/nlp-opensource/acl2022/stmm/mustc_enfr_stmm_self_learning.pt
En-Es | https://lf3-nlp-opensource.bytetos.com/obj/nlp-opensource/acl2022/stmm/mustc_enes_stmm_self_learning.pt
En-Ro | https://lf3-nlp-opensource.bytetos.com/obj/nlp-opensource/acl2022/stmm/mustc_enro_stmm_self_learning.pt
En-Ru | https://lf3-nlp-opensource.bytetos.com/obj/nlp-opensource/acl2022/stmm/mustc_enru_stmm_self_learning.pt
En-Nl | https://lf3-nlp-opensource.bytetos.com/obj/nlp-opensource/acl2022/stmm/mustc_ennl_stmm_self_learning.pt
En-It | https://lf3-nlp-opensource.bytetos.com/obj/nlp-opensource/acl2022/stmm/mustc_enit_stmm_self_learning.pt
En-Pt | https://lf3-nlp-opensource.bytetos.com/obj/nlp-opensource/acl2022/stmm/mustc_enpt_stmm_self_learning.pt
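
For example, to evaluate the released En-De checkpoint (a hypothetical invocation; check test.sh for the exact path where it expects the checkpoint named by the first argument):

wget https://lf3-nlp-opensource.bytetos.com/obj/nlp-opensource/acl2022/stmm/mustc_ende_stmm_self_learning.pt -P checkpoints/
sh test.sh mustc_ende_stmm_self_learning de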

Citation

If this repository is useful for you, please cite it as:

@inproceedings{fang-etal-2022-STEMM,
	title = {STEMM: Self-learning with Speech-text Manifold Mixup for Speech Translation},
	author = {Fang, Qingkai and Ye, Rong and Li, Lei and Feng, Yang and Wang, Mingxuan},
	booktitle = {Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics},
	year = {2022},
}

Contact

If you have any questions, feel free to contact me at [email protected].
