Code for paper "Vocabulary Learning via Optimal Transport for Neural Machine Translation"

Overview

**The codebase and data are still being uploaded; this is a work in progress.**

VOLT(-py) is a vocabulary learning codebase that allows researchers and developers to automatically generate a vocabulary with suitable granularity for machine translation.
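
At a high level, VOLT scores candidate vocabularies by the marginal utility of vocabularization (MUV), the reduction in corpus entropy per added token, and uses optimal transport to search the candidate space efficiently. As a rough illustration of the size-search idea only, here is a minimal, hypothetical Python sketch; the names `corpus_entropy`, `best_size_by_muv`, and the input format `token_freqs_by_size` are illustrative assumptions, not part of this codebase:

    import math

    def corpus_entropy(token_freqs):
        """Entropy per character of a corpus tokenized with one vocabulary.

        token_freqs: dict mapping token -> occurrence count (assumed format).
        """
        total = sum(token_freqs.values())
        avg_len = sum(len(tok) * cnt for tok, cnt in token_freqs.items()) / total
        entropy = -sum((cnt / total) * math.log(cnt / total)
                       for cnt in token_freqs.values())
        return entropy / avg_len  # normalize by average token length

    def best_size_by_muv(token_freqs_by_size):
        """Pick the vocabulary size with the largest entropy drop per added token."""
        sizes = sorted(token_freqs_by_size)
        entropies = {s: corpus_entropy(token_freqs_by_size[s]) for s in sizes}
        muv = {s2: (entropies[s1] - entropies[s2]) / (s2 - s1)
               for s1, s2 in zip(sizes, sizes[1:])}
        return max(muv, key=muv.get)

In the actual codebase, this trade-off is solved jointly with token selection via optimal transport (see `ot_run.py` and its `--interval` parameter below) rather than by brute-force enumeration.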

What's New:

  • July 2021: Support En-De translation, TED bilingual translation, and multilingual translation.
  • July 2021: Support subword-nmt tokenization.
  • July 2021: Support sentencepiece tokenization.

What's On-going:

  • Add translation training/evaluation codes.
  • Support classification tasks.
  • Support pip usage.

Features:

  • Efficient: CPU learning on one machine.
  • Simple: The core code is no more than 200 lines.
  • Easy-to-use: Supports the widely used tokenization toolkits subword-nmt and sentencepiece.
  • Flexible: Users can customize their own tokenization rules.

Requirements and Installation

The required environment:

  • Python 3
  • tqdm
  • mosesdecoder
  • subword-nmt

To use VOLT and develop locally:

git clone https://github.com/Jingjing-NLP/VOLT/
cd VOLT
git clone https://github.com/moses-smt/mosesdecoder
git clone https://github.com/rsennrich/subword-nmt
pip3 install sentencepiece
pip3 install tqdm 

Usage

  • The first step is to get vocabulary candidates and tokenized texts. The sub-word vocabulary can be generated by subword-nmt and sentencepiece. Here are two examples:

    
    #Assume source_data is the file storing data in the source language
    #Assume target_data is the file storing data in the target language
    BPEROOT=subword-nmt
    size=30000 # the size of BPE
    cat source_data > training_data
    cat target_data >> training_data
    
    #subword-nmt style:
    mkdir bpeoutput
    BPE_CODE=code # the path to save vocabulary
    python3 $BPEROOT/learn_bpe.py -s $size  < training_data > $BPE_CODE
    python3 $BPEROOT/apply_bpe.py -c $BPE_CODE < source_file > bpeoutput/source.file
    python3 $BPEROOT/apply_bpe.py -c $BPE_CODE < target_file > bpeoutput/target.file
    
    #sentencepiece style:
    mkdir spmout
    python3 spm/spm_train.py --input=training_data --model_prefix=spm --vocab_size=$size --character_coverage=1.0 --model_type=bpe
    #After this step, you will see spm.vocab and spm.model
    python3 spm/spm_encoder.py --model spm.model --inputs source_data --outputs spmout/source_data --output_format piece
    python3 spm/spm_encoder.py --model spm.model --inputs target_data --outputs spmout/target_data --output_format piece
    
  • The second step is to run VOLT scripts. It accepts the following parameters:

    • --source_file: the file storing data in the source language.
    • --target_file: the file storing data in the target language.
    • --token_candidate_file: the file storing token candidates.
    • --max_number: the maximum size of the vocabulary generated by VOLT.
    • --interval: the search granularity in VOLT.
    • --loop_in_ot: the maximum number of Sinkhorn iterations in the optimal transport solver (a generic Sinkhorn sketch follows these steps).
    • --tokenizer: the toolkit used to generate the vocabulary candidates. Only subword-nmt and sentencepiece are supported.
    • --vocab_file: the file to store the vocabulary generated by VOLT.
    • --size_file: the file to store the vocabulary size generated by VOLT.
    • --threshold: the threshold deciding which tokens from the optimal transport matrix are added to the final vocabulary. A lower threshold means that fewer token candidates are dropped.
    #subword-nmt style
    python3 ../ot_run.py --source_file bpeoutput/source.file --target_file bpeoutput/target.file \
              --token_candidate_file $BPE_CODE \
              --vocab_file bpeoutput/vocab --max_number 10000 --interval 1000  --loop_in_ot 500 --tokenizer subword-nmt --size_file bpeoutput/size 
    #sentencepiece style
    python3 ../ot_run.py --source_file spmout/source_data --target_file spmout/target_data \
              --token_candidate_file spm.vocab \
              --vocab_file spmout/vocab --max_number 10000 --interval 1000  --loop_in_ot 500 --tokenizer sentencepiece --size_file spmout/size 
    
  • The third step is to use the generated vocabulary to tokenize your texts:

      #for subword-nmt toolkit
      python3 $BPEROOT/apply_bpe.py -c bpeoutput/vocab < source_file > bpeoutput/source.file
      python3 $BPEROOT/apply_bpe.py -c bpeoutput/vocab < target_file > bpeoutput/target.file
    
      #for sentencepiece toolkit, here we only keep the optimal size
      best_size=$(cat spmoutput/size)
      python3 spm/spm_train.py --input=training_data --model_prefix=spm --vocab_size=$best_size --character_coverage=1.0 --model_type=bpe
    
      #After this step, you will see spm.vocab and spm.model
      python3 spm/spm_encoder.py --model spm.model --inputs source_data --outputs spmout/source_data --output_format piece
      python3 spm/spm_encoder.py --model spm.model --inputs target_data --outputs spmout/target_data --output_format piece
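
For reference, the `--loop_in_ot` budget above bounds the Sinkhorn iterations used to approximate the optimal transport plan. Below is a minimal, generic Sinkhorn sketch in NumPy; it is not the VOLT implementation, and the names (`cost`, `row_marginal`, `col_marginal`, `reg`) are illustrative assumptions:

    import numpy as np

    def sinkhorn(cost, row_marginal, col_marginal, reg=0.1, num_iters=500):
        """Entropic-regularized optimal transport via Sinkhorn iterations.

        num_iters plays the role of VOLT's --loop_in_ot budget.
        Returns an (n, m) transport plan whose marginals approach the inputs.
        """
        K = np.exp(-cost / reg)              # Gibbs kernel of the cost matrix
        u = np.ones(cost.shape[0])
        v = np.ones(cost.shape[1])
        for _ in range(num_iters):           # alternately rescale rows and columns
            u = row_marginal / (K @ v)
            v = col_marginal / (K.T @ u)
        return u[:, None] * K * v[None, :]   # plan P = diag(u) K diag(v)

Roughly speaking, VOLT sets up such a transport between token candidates and characters, and `--threshold` then decides which candidates carry enough mass in the resulting plan to enter the final vocabulary.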
    

Examples

We provide several examples under "examples/".

Datasets

The WMT-14 En-De translation data can be downloaded via the provided scripts.

For TED, you can download the data at TED.

Citation

Please cite as:

@inproceedings{volt,
  title     = {Vocabulary Learning via Optimal Transport for Neural Machine Translation},
  author    = {Jingjing Xu and Hao Zhou and Chun Gan and Zaixiang Zheng and Lei Li},
  booktitle = {Proceedings of ACL 2021},
  year      = {2021},
}