Official code of our work, AVATAR: A Parallel Corpus for Java-Python Program Translation.

Last update: Dec 03, 2022

AVATAR

AVATAR stands for jAVA-pyThon progrAm tRanslation.
AVATAR is a corpus of 8,475 programming problems and their solutions written in Java and Python.
AVATAR supports supervised fine-tuning and evaluation in terms of Computational Accuracy.

Dataset

We have collected the programming problems and their solutions from competitive programming sites, online platforms, and open source repositories. We list the sources below.

CodeForces
AtCoder
CodeJam
GeeksforGeeks
LeetCode
ProjectEuler

The collected data can be downloaded by running:

cd data
bash download.sh

To prepare the data, we perform the following steps.

Remove docstrings, comments, etc.
Tokenize the code with the baseline models' tokenizers.
Filter examples based on a length threshold (~512).
Perform de-duplication (remove examples that are duplicates).
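
The steps above can be sketched in Python as follows. This is a minimal illustration, not the logic of prepare.sh: the function names are hypothetical, Python's standard tokenize module and a whitespace split stand in for the baseline models' tokenizers, only Python comments are stripped (docstring removal and Java comment removal are omitted), and the threshold is taken to be 512 tokens per program.

```python
import io
import tokenize

MAX_TOKENS = 512  # approximate threshold from the list above (assumed per program)

# Token types that carry no code content for this sketch.
_SKIPPED = {
    tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
    tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER,
}


def python_code_tokens(source):
    """Tokenize Python source, dropping comments and layout tokens."""
    reader = io.StringIO(source).readline
    return [
        tok.string
        for tok in tokenize.generate_tokens(reader)
        if tok.type not in _SKIPPED
    ]


def whitespace_tokens(source):
    """Stand-in tokenizer for the Java side: split on whitespace."""
    return source.split()


def prepare(pairs, max_tokens=MAX_TOKENS):
    """Tokenize (java, python) pairs, drop long ones, then drop duplicates."""
    seen = set()
    kept = []
    for java_src, python_src in pairs:
        java_toks = whitespace_tokens(java_src)
        python_toks = python_code_tokens(python_src)
        # Length filter: skip the pair if either side is too long.
        if len(java_toks) > max_tokens or len(python_toks) > max_tokens:
            continue
        # De-duplication: identical token sequences count as one example.
        key = (tuple(java_toks), tuple(python_toks))
        if key in seen:
            continue
        seen.add(key)
        kept.append((java_toks, python_toks))
    return kept
```

Comparing token sequences rather than raw strings means two programs that differ only in comments or spacing are treated as duplicates, which is one plausible reading of the de-duplication step.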

To perform the preparation, run:

cd data
bash prepare.sh

Models

We studied 8 models for program translation.

Models trained from scratch

Seq2Seq+Attn. [1Lx512H]
Transformer [6Lx512H]

Pre-trained models

Training & Evaluation

To train and evaluate a model, go to the corresponding model directory and execute its script (run.sh unless a different script is shown below).

# Seq2Seq+Attn.
cd seq2seq
bash rnn.sh GPU_ID LANG1 LANG2

# Transformer
cd seq2seq
bash transformer.sh GPU_ID LANG1 LANG2

# CodeGPT
cd codegpt
bash run.sh GPU_ID LANG1 LANG2 CodeGPT

# CodeGPT-adapted
cd codegpt
bash run.sh GPU_ID LANG1 LANG2

# CodeBERT
cd codebert
bash run.sh GPU_ID LANG1 LANG2

# GraphCodeBERT
cd graphcodebert
bash run.sh GPU_ID LANG1 LANG2

# PLBART
cd plbart
# fine-tuning for either Java->Python or Python->Java
bash run.sh GPU_ID LANG1 LANG2
# multilingual fine-tuning
bash multilingual.sh GPU_ID

# Naive Copy
cd naivecopy
bash run.sh

Here, LANG1 LANG2 is either Java Python or Python Java.
Download the pre-trained PLBART, GraphCodeBERT, and TransCoder model files by running the download.sh script.
We trained the models on GeForce RTX 2080 Ti GPUs (11019 MiB).
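
The commands above follow one pattern, which the sketch below captures in Python. It is only an illustration: the dictionary keys and the build_command helper are hypothetical names, and the directory, script, and argument values are copied from the commands listed above. The multilingual PLBART script and the Naive Copy baseline take different arguments and are left out.

```python
import shlex

# The two translation directions described above.
VALID_PAIRS = {("Java", "Python"), ("Python", "Java")}

# Hypothetical label -> (directory, script, extra trailing arguments).
MODELS = {
    "seq2seq+attn": ("seq2seq", "rnn.sh", []),
    "transformer": ("seq2seq", "transformer.sh", []),
    "codegpt": ("codegpt", "run.sh", ["CodeGPT"]),
    "codegpt-adapted": ("codegpt", "run.sh", []),
    "codebert": ("codebert", "run.sh", []),
    "graphcodebert": ("graphcodebert", "run.sh", []),
    "plbart": ("plbart", "run.sh", []),
}


def build_command(model, gpu_id, lang1, lang2):
    """Return (directory, command) for one training and evaluation run."""
    if (lang1, lang2) not in VALID_PAIRS:
        raise ValueError(f"unsupported language pair: {lang1} {lang2}")
    directory, script, extra = MODELS[model]
    args = ["bash", script, str(gpu_id), lang1, lang2, *extra]
    return directory, shlex.join(args)


if __name__ == "__main__":
    directory, command = build_command("codegpt", 0, "Java", "Python")
    print(f"cd {directory} && {command}")
```

Validating the language pair before launching avoids starting a long GPU job with the arguments in an unsupported combination.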

Benchmarks

We evaluate model performance on the test set in terms of Compilation Accuracy (CA), BLEU, Syntax Match (SM), Dataflow Match (DM), CodeBLEU (CB), and Exact Match (EM). The results are reported below.

Training	Models	Java to Python						Python to Java
Training	Models	CA	BLEU	SM	DM	CB	EM	CA	BLEU	SM	DM	CB	EM
None	Naive Copy	-	23.4	-	-	-	0.0	-	26.9	-	-	-	0.0
None	TransCoder	76.9	36.8	31.0	17.1	29.1	0.1	100	49.4	37.6	18.5	31.9	0.0
None	TC-DOBF	77.7	43.4	29.7	33.9	34.8	0.0	100	46.1	36.0	12.6	28.8	0.0
From Scratch	Seq2Seq+Attn.	66.5	56.3	39.1	18.4	37.9	1.0	71.8	62.7	46.6	28.5	43.0	0.8
From Scratch	Transformer	61.5	38.9	34.2	16.5	29.1	0.0	67.4	45.6	45.7	26.4	37.4	0.1
Pre-trained	CodeGPT	47.3	38.2	32.5	11.5	26.1	1.1	71.2	44.0	38.8	26.7	33.8	0.1
Pre-trained	CodeGPT-adapted	48.1	38.2	32.5	12.1	26.2	1.2	68.6	42.4	37.2	27.2	33.1	0.5
Pre-trained	CodeBERT	62.3	59.3	37.7	16.2	36.7	0.5	74.7	55.3	38.4	22.5	36.1	0.6
Pre-trained	GraphCodeBERT	65.7	59.7	38.9	16.4	37.1	0.7	57.2	60.6	48.4	20.6	40.1	0.4
Pre-trained	PLBART_mono	76.4	67.1	42.6	19.3	43.3	2.4	34.4	69.1	57.1	34.0	51.4	1.2
Pre-trained	PLBART_multi	70.4	67.1	42.0	17.6	42.4	2.4	30.8	69.4	56.6	34.5	51.8	1.0
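
Two of the metrics in the table, Compilation Accuracy and Exact Match, are simple enough to sketch in Python. The code below is not the evaluation script used for these results; it is a minimal illustration under two assumptions: Exact Match compares strings after trimming surrounding whitespace, and Compilation Accuracy is checked only for Python outputs with the built-in compile function (Java outputs would need a Java compiler). Both function names are hypothetical.

```python
def exact_match(predictions, references):
    """Percentage of predictions equal to their reference (edges trimmed)."""
    hits = sum(
        pred.strip() == ref.strip()
        for pred, ref in zip(predictions, references)
    )
    return 100.0 * hits / len(references)


def python_compilation_accuracy(programs):
    """Percentage of Python programs that compile without a syntax error."""
    compiled = 0
    for source in programs:
        try:
            # compile() only parses and builds bytecode; nothing is executed.
            compile(source, "<candidate>", "exec")
            compiled += 1
        except (SyntaxError, ValueError):
            # SyntaxError also covers IndentationError and TabError.
            pass
    return 100.0 * compiled / len(programs)
```

A program can compile and still be wrong, which is why the table reports Compilation Accuracy alongside the match-based metrics rather than in place of them.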

License

This dataset is licensed under a Creative Commons Attribution-ShareAlike 4.0 International license; see the LICENSE file for details.

Citation

@article{ahmad-etal-2021-avatar,
  title={AVATAR: A Parallel Corpus for Java-Python Program Translation},
  author={Ahmad, Wasi Uddin and Tushar, Md Golam Rahman and Chakraborty, Saikat and Chang, Kai-Wei},
  journal={arXiv preprint arXiv:2108.11590},
  year={2021}
}

Owner

Wasi Ahmad
