Styled Augmented Translation

Last update: Dec 29, 2022

Related tags

Overview

SAT

Style Augmented Translation

Introduction

By collecting high-quality data, we were able to train a model that outperforms Google Translate on 6 different domains of English-Vietnamese Translation.

English to Vietnamese Translation (BLEU score)

Vietnamese to English Translation (BLEU score)

Get data and model at Google Cloud Storage

Check out our demo web app

Visit our blog post for more details.

Using the code

This code is build on top of vietai/dab:

To prepare for training, generate tfrecords from raw text files:

python t2t_datagen.py \
--data_dir=$path_to_folder_contains_vocab_file \
--tmp_dir=$path_to_folder_that_contains_training_data \
--problem=$problem

To train a Transformer model on the generated tfrecords

python t2t_trainer.py \
--data_dir=$path_to_folder_contains_vocab_file_and_tf_records \
--problem=$problem \
--hparams_set=$hparams_set \
--model=transformer \
--output_dir=$path_to_folder_to_save_checkpoints

To run inference on the trained model:

python t2t_decoder.py \
--data_dir=$path_to_folde_contains_vocab_file_and_tf_records \
--problem=$problem \
--hparams_set=$hparams_set \
--model=transformer \
--output_dir=$path_to_folder_contains_checkpoints

In this colab, we demonstrated how to run these three phases in the context of hosting data/model on Google Cloud Storage.

Dataset

Our data contains roughly 3.3 million pairs of texts. After augmentation, the data is of size 26.7 million pairs of texts. A more detail breakdown of our data is shown in the table below.

	Pure	Augmented
Fictional Books	333,189	2,516,787
Legal Document	1,150,266	3,450,801
Medical Publication	5,861	27,588
Movies Subtitles	250,000	3,698,046
Software	79,912	239,745
TED Talk	352,652	4,983,294
Wikipedia	645,326	1,935,981
News	18,449	139,341
Religious texts	124,389	1,182,726
Educational content	397,008	8,475,342
No tag	5,517	66,299
Total	3,362,569	26,715,950

Data sources is described in more details here.

Comments

Data leakage issue in evaluation?

Hi team @lmthang @thtrieu @heraclex12 @hqphat @KienHuynh

The obtained results of a Transformer-based model on the PhoMT test set surprised me. My first thought was that as VietAI and PhoMT datasets have several overlapping domains (e.g. Wikihow, TED talks, Opensubtitles, news..): whether there might be a potential data leakage issue in your evaluation (e.g. PhoMT English-Vietnamese test pairs appear in the VietAI training set)?

In particular, we find that 6294/19151 PhoMT English-Vietnamese test pairs appear in the VietAI training set (v2). When evaluating your model on the PhoMT test set, did you guys retrain the model on a VietAI training set variant that does not contain PhoMT English-Vietnamese test pairs?

Cheers, Dat.

opened by datquocnguyen 3

Demo website is not working

Hi, seems like the easiest to reach out here but https://demo.vietai.org/ is down, looks like the page tried to serve a 404 error page.

Connection failed with status 404, and response "<!DOCTYPE html> <html lang=en> <meta charset=utf-8> <meta name=viewport content="initial-scale=1, minimum-scale=1, width=device-width"> <title>Error 404 (Not Found)!!1</title> <style> *{margin:0;padding:0}html,code{font:15px/22px arial,sans-serif}html{background:#fff;color:#222;padding:15px}body{margin:7% auto 0;max-width:390px;min-height:180px;padding:30px 0 15px}* > body{background:url(//www.google.com/images/errors/robot.png) 100% 5px no-repeat;padding-right:205px}p{margin:11px 0 22px;overflow:hidden}ins{color:#777;text-decoration:none}a img{border:0}@media screen and (max-width:772px){body{background:none;margin-top:0;max-width:none;padding-right:0}}#logo{background:url(//www.google.com/images/branding/googlelogo/1x/googlelogo_color_150x54dp.png) no-repeat;margin-left:-5px}@media only screen and (min-resolution:192dpi){#logo{background:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) no-repeat 0% 0%/100% 100%;-moz-border-image:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) 0}}@media only screen and (-webkit-min-device-pixel-ratio:2){#logo{background:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) no-repeat;-webkit-background-size:100% 100%}}#logo{display:inline-block;height:54px;width:150px} </style> <a href=//www.google.com/><span id=logo aria-label=Google></span></a> <p><b>404.</b> <ins>That’s an error.</ins> <p>The requested URL <code>/healthz</code> was not found on this server. <ins>That’s all we know.</ins> ".

opened by VietThan 1

Got RuntimeError when run on Google Colab

I ran the Readme.md samples on Google Colab with GPU and got this Error "RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument index in method wrapper__index_select)".

Error code: outputs = model.generate(tokenizer(inputs, return_tensors="pt", padding=True).input_ids.to('cuda'), max_length=512)

opened by kietbg0079 0
Got error 'AssertionError: Torch not compiled with CUDA enabled' on Macbook M1 pro

I have tried the example on my Macbook M1 pro but got this error: =>outputs = model.generate(tokenizer(inputs, return_tensors="pt", padding=True).input_ids.to('cuda'), max_length=512) raise AssertionError("Torch not compiled with CUDA enabled") AssertionError: Torch not compiled with CUDA enabled

Please help!

opened by htnha 4
Question about loading model

I have a question about loading model. I have trained a Russian-to-Vietnamese model base on your code and tensor2tensor. Every time I want to predict a new sentence, it always load the model again, even before that I have already predicted another sentence. I want to ask that if there is a way not to have reload the model when predict a new sentence. Thank you very much

opened by hieunguyenquoc 1
I have a issue about running decoder

Data loss: Unable to open table file /content/drive/MyDrive/SAT/checkpoint: Failed precondition: /content/drive/MyDrive/SAT/checkpoint; Is a directory: perhaps your file is in a different file format and you need to use a different restore operator?

I used a pretrain model : model.augmented.envi.ckpt-1415000.data-00000-of-00001, model.augmented.envi.ckpt-1415000.index, model.augmented.envi.ckpt-1415000.meta. All 3 file are put in checkpoint

Could somebody help me with this issue ?

opened by hieunguyenquoc 6

Releases(v1.0)

v1.0(Oct 2, 2021)

First version.

Trained on 3.3M training data points. Transformer with 9-layer encoder and 9-layer decoder. Tested on a multi-domain dataset, outperforming Google Translate. Experiments with style-tagging and data appending.
Source code(tar.gz)
Source code(zip)

Owner

GitHub Repository

A new test set for ImageNet

ImageNetV2 The ImageNetV2 dataset contains new test data for the ImageNet benchmark. This repository provides associated code for assembling and worki

186 Dec 18, 2022

Text to image synthesis using thought vectors

Text To Image Synthesis Using Thought Vectors This is an experimental tensorflow implementation of synthesizing images from captions using Skip Though

2.1k Jan 05, 2023

A PyTorch re-implementation of Neural Radiance Fields

nerf-pytorch A PyTorch re-implementation Project | Video | Paper NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis Ben Mildenhall

709 Jan 09, 2023

Official code of ICCV2021 paper "Residual Attention: A Simple but Effective Method for Multi-Label Recognition"

CSRA This is the official code of ICCV 2021 paper: Residual Attention: A Simple But Effective Method for Multi-Label Recoginition Demo, Train and Vali

163 Dec 22, 2022

Constructing Neural Network-Based Models for Simulating Dynamical Systems

Constructing Neural Network-Based Models for Simulating Dynamical Systems Note this repo is work in progress prior to reviewing This is a companion re

21 Nov 25, 2022

An Api for Emotion recognition.

PLAYEMO Playemo was built from the ground-up with Flask, a python tool that makes it easy for developers to build APIs. Use Cases Is Python your langu

2 Jul 16, 2022

NeurIPS 2021, "Fine Samples for Learning with Noisy Labels"

[Official] FINE Samples for Learning with Noisy Labels This repository is the official implementation of "FINE Samples for Learning with Noisy Labels"

27 Dec 23, 2022

Code Release for ICCV 2021 (oral), "AdaFit: Rethinking Learning-based Normal Estimation on Point Clouds"

AdaFit: Rethinking Learning-based Normal Estimation on Point Clouds (ICCV 2021 oral) **Project Page | Arxiv ** Runsong Zhu¹, Yuan Liu², Zhen Dong¹, Te

40 Dec 30, 2022

Implementation of the final project of the course DDA6309 Probabilistic Graphical Model

Task-aware Joint CWS and POS (TCwsPos) This is the implementation of the final project of the course DDA6309 Probabilistic Graphical Models, The Chine

1 Dec 26, 2021

naked is a Python tool which allows you to strip a model and only keep what matters for making predictions.

naked is a Python tool which allows you to strip a model and only keep what matters for making predictions. The result is a pure Python function with no third-party dependencies that you can simply c

24 Dec 20, 2022

A stock generator that assess a list of stocks and returns the best stocks for investing and money allocations based on users choices of volatility, duration and number of stocks

Stock-Generator Please visit "Stock Generator.ipynb" for a clearer view and "Stock Generator.py" for scripts. The stock generator is designed to allow

1 Aug 02, 2022

Styled Augmented Translation

Related tags

Overview

SAT

Style Augmented Translation

Introduction

Using the code

Dataset

Comments

Data leakage issue in evaluation?

Demo website is not working

Got RuntimeError when run on Google Colab

Got error 'AssertionError: Torch not compiled with CUDA enabled' on Macbook M1 pro

Question about loading model

I have a issue about running decoder

Releases(v1.0)

v1.0(Oct 2, 2021)

Owner

A new test set for ImageNet

Text to image synthesis using thought vectors

A PyTorch re-implementation of Neural Radiance Fields

Official code of ICCV2021 paper "Residual Attention: A Simple but Effective Method for Multi-Label Recognition"

Constructing Neural Network-Based Models for Simulating Dynamical Systems

An Api for Emotion recognition.

NeurIPS 2021, "Fine Samples for Learning with Noisy Labels"

Code Release for ICCV 2021 (oral), "AdaFit: Rethinking Learning-based Normal Estimation on Point Clouds"

Implementation of the final project of the course DDA6309 Probabilistic Graphical Model

naked is a Python tool which allows you to strip a model and only keep what matters for making predictions.

A stock generator that assess a list of stocks and returns the best stocks for investing and money allocations based on users choices of volatility, duration and number of stocks

Xview3 solution - XView3 challenge, 2nd place solution

tsflex - feature-extraction benchmarking

C3D is a modified version of BVLC caffe to support 3D ConvNets.

Recurrent Scale Approximation (RSA) for Object Detection

An official TensorFlow implementation of “CLCC: Contrastive Learning for Color Constancy” accepted at CVPR 2021.

Pytorch implementation of "Training a 85.4% Top-1 Accuracy Vision Transformer with 56M Parameters on ImageNet"

Unofficial keras(tensorflow) implementation of MAE model from Masked Autoencoders Are Scalable Vision Learners

This is an official implementation for "Video Swin Transformers".

Code for IntraQ, PyTorch implementation of our paper under review