[ICML 2021] Break-It-Fix-It: Learning to Repair Programs from Unlabeled Data

Overview

Break-It-Fix-It: Learning to Repair Programs from Unlabeled Data

This repo provides the source code & data of our paper: Break-It-Fix-It: Unsupervised Learning for Program Repair (ICML 2021).

@InProceedings{yasunaga2021break,
  author =  {Michihiro Yasunaga and Percy Liang},
  title =   {Break-It-Fix-It: Unsupervised Learning for Program Repair},
  year =    {2021},  
  booktitle = {International Conference on Machine Learning (ICML)},  
}

Problem: Repair Task

Our approach: BIFI

0. Dependencies

Specifically, run the following commands to create a conda environment (assuming CUDA10.1):

conda create -n BIFI python=3.7.7
conda activate BIFI
pip install tqdm
pip install torch==1.4.0 torchvision==0.5.0
cd utils/fairseq
pip install -e .
pip numpy==1.20.1 editdistance

1. Download Data

Download all the data from here (data.zip) and unzip it (note: 67GB when compressed, 400GB when decompressed). This includes the GitHub-Python dataset, and all the processed training data and trained models associated with BIFI. If you only want the original GitHub-Python dataset, you can download it from here (data_minimal.zip; 1GB). After unzipping the data.zip, the resulting file structure will look like:

.
├── README.md
└── data/
    ├── orig_bad_code/       (GitHub-Python dataset's bad code)
    ├── orig_good_code/      (GitHub-Python dataset's good code)
    └── round0/
        ├── data_paired      (paired data used to train fixer in round0)
        └── model-fixer      (fixer trained in round0)
    ├── round1-BIFI-part1/
        ├── data_paired      (paired data used to train breaker in BIFI round1)
        └── model-breaker    (breaker trained in BIFI round1)
    ├── round1-BIFI-part2/
        ├── data_paired      (paired data used to train fixer in BIFI round1)
        └── model-fixer      (fixer trained in BIFI round1)
    ├── ...

About the GitHub-Python dataset

We collected 3 million Python3 snippets from GitHub. Using the critic (Python AST parser), the code snippets are split into a set of bad code (with AST parse errors) and a set of good code (with no errors). The set of bad code is located at data/orig_bad_code/orig.bad.json and good code at data/orig_good_code/orig.good.json. Each entry of orig.bad.json or orig.good.json is a dictionary consisting of

  • "code_string": raw code in the string format
  • "code_toks_joined": the raw code is split into tokens by Python tokenizer, anonymized (string/number is replaced with special tokens <STRING>/<NUMBER>), and then joined by whitespace. The tokenization was done by utils/code_utils.py: tokenize_python_code()
  • "anonymize_dict": mapping betweens raw string/number and <STRING>/<NUMBER> so that "code_string" can be recovered from "code_toks_joined". This recovery can be done by utils/code_utils.py: code_toks_to_code_string()
  • "err_obj": type of the error caught by the critic (e.g. unbalanced parentheses, indentation error). This is only applicable to orig.bad.json.

The bad code snippets in orig.bad.json are split into 5 chunks (orig.0.bad to orig.4.bad in data/orig_bad_code/), where 3,4 is heldout as the test set and 0,1,2 is made available for BIFI training. This splitting was done by scripts/split_orig_bad_and_good.py

2. Training and Evaluation

First, train the initial fixer by running commands in src/run-round0.py one by one. We then consider three training algorithms on top of it: BIFI (our proposed method), FixerOnly (BIFI without breaker), and BackTranslation (BT; our baseline). For each algorithm,

  • BIFI: run commands in src/run-BIFI.py one by one
  • FixerOnly: run commands in src/run-FixerOnly.py one by one
  • BT: run commands in src/run-BT.py one by one

Below is an illustration for the case of BIFI.

run-round0.sh

export PYTHONPATH=.

#Train initial fixer on synthetic paired data
python src/c001__train_fixer.py --round_name round0 --gpu_id 0 --max_epoch 2

#Run the trained fixer on the bad code (chunk 0-4) and check the outputs by critic
python src/c003__run_fixer.py   --round_name round0 --gpu_ids '0,1,2,3,4'

#Evaluate the fixer outputs on the test set (chunk 3,4)
python src/c005__eval_fixer.py  --round_name round0

run-BIFI.sh (round 1)

#Use the fixer outputs on the bad code (chunk 0,1,2) to get new paired data (Equation 6 in the paper)
python src/c006__generate_paired_data_from_fixer.py --round_name round0 --out_round_name round1-BIFI-part1

#Train breaker on the new paired data (Equation 7 in the paper)
python src/c002__train_breaker.py --round_name round1-BIFI-part1 --gpu_id 0 --max_epoch 3

#Run the trained breaker on the good code and get new paired data (Equation 8 in the paper)
python src/c004__run_breaker.py   --round_name round1-BIFI-part1 --gpu_ids '0,1,2,3,4'
python src/c007__generate_paired_data_from_breaker.py --round_name round1-BIFI-part1 --out_round_name round1-BIFI-part2

#Train fixer on the new paired data (Equation 9 in the paper)
python src/c001__train_fixer.py --round_name round1-BIFI-part2 --gpu_id 0 --max_epoch 2 --continue_from 'data/round0/model-fixer/checkpoint.pt'

#Run the trained fixer on the bad code (chunk 0-4) and check the outputs by critic
python src/c003__run_fixer.py   --round_name round1-BIFI-part2 --gpu_ids '0,1,2,3,4'

#Evaluate the fixer outputs on the test set (chunk 3,4)
python src/c005__eval_fixer.py  --round_name round1-BIFI-part2

This is repeated similarly for round 2.

Owner
Michihiro Yasunaga
PhD Student in Computer Science
Michihiro Yasunaga
The fastest way to visualize GradCAM with your Keras models.

VizGradCAM VizGradCam is the fastest way to visualize GradCAM in Keras models. GradCAM helps with providing visual explainability of trained models an

58 Nov 19, 2022
Official codebase for Pretrained Transformers as Universal Computation Engines.

universal-computation Overview Official codebase for Pretrained Transformers as Universal Computation Engines. Contains demo notebook and scripts to r

Kevin Lu 210 Dec 28, 2022
AI pipelines for Nvidia Jetson Platform

Jetson Multicamera Pipelines Easy-to-use realtime CV/AI pipelines for Nvidia Jetson Platform. This project: Builds a typical multi-camera pipeline, i.

NVIDIA AI IOT 96 Dec 23, 2022
3D2Unet: 3D Deformable Unet for Low-Light Video Enhancement (PRCV2021)

3DDUNET This is the code for 3D2Unet: 3D Deformable Unet for Low-Light Video Enhancement (PRCV2021) Conference Paper Link Dataset We use SMOID dataset

1 Jan 07, 2022
Distributed Asynchronous Hyperparameter Optimization in Python

Hyperopt: Distributed Hyperparameter Optimization Hyperopt is a Python library for serial and parallel optimization over awkward search spaces, which

6.5k Jan 01, 2023
Implementation of SwinTransformerV2 in TensorFlow.

SwinTransformerV2-TensorFlow A TensorFlow implementation of SwinTransformerV2 by Microsoft Research Asia, based on their official implementation of Sw

Phan Nguyen 2 May 30, 2022
Sound Event Detection with FilterAugment

Sound Event Detection with FilterAugment Official implementation of Heavily Augmented Sound Event Detection utilizing Weak Predictions (DCASE2021 Chal

43 Aug 28, 2022
[CVPR 2022 Oral] EPro-PnP: Generalized End-to-End Probabilistic Perspective-n-Points for Monocular Object Pose Estimation

EPro-PnP EPro-PnP: Generalized End-to-End Probabilistic Perspective-n-Points for Monocular Object Pose Estimation In CVPR 2022 (Oral). [paper] Hanshen

同济大学智能汽车研究所综合感知研究组 ( Comprehensive Perception Research Group under Institute of Intelligent Vehicles, School of Automotive Studies, Tongji University) 842 Jan 04, 2023
Unofficial Pytorch Lightning implementation of Contrastive Syn-to-Real Generalization (ICLR, 2021)

Unofficial Pytorch Lightning implementation of Contrastive Syn-to-Real Generalization (ICLR, 2021)

Gyeongjae Choi 17 Sep 23, 2021
PolyphonicFormer: Unified Query Learning for Depth-aware Video Panoptic Segmentation

PolyphonicFormer: Unified Query Learning for Depth-aware Video Panoptic Segmentation Winner method of the ICCV-2021 SemKITTI-DVPS Challenge. [arxiv] [

Yuan Haobo 38 Jan 03, 2023
Code for "Single-view robot pose and joint angle estimation via render & compare", CVPR 2021 (Oral).

Single-view robot pose and joint angle estimation via render & compare Yann Labbé, Justin Carpentier, Mathieu Aubry, Josef Sivic CVPR: Conference on C

Yann Labbé 51 Oct 14, 2022
Official pytorch implementation of DeformSyncNet: Deformation Transfer via Synchronized Shape Deformation Spaces

DeformSyncNet: Deformation Transfer via Synchronized Shape Deformation Spaces Minhyuk Sung*, Zhenyu Jiang*, Panos Achlioptas, Niloy J. Mitra, Leonidas

Zhenyu Jiang 21 Aug 30, 2022
NumQMBasic - A mini-course offered to Undergrad physics students

The best way to use this material is by forking it by click the Fork button at the top, right corner. Then you will get your own copy to play with! Th

Raghu 35 Dec 05, 2022
Breast Cancer Classification Model is applied on a different dataset

Breast Cancer Classification Model is applied on a different dataset

1 Feb 04, 2022
Predicting path with preference based on user demonstration using Maximum Entropy Deep Inverse Reinforcement Learning in a continuous environment

Preference-Planning-Deep-IRL Introduction Check my portfolio post Dependencies Gym stable-baselines3 PyTorch Usage Take Demonstration python3 record.

Tianyu Li 9 Oct 26, 2022
Implementation of OmniNet, Omnidirectional Representations from Transformers, in Pytorch

Omninet - Pytorch Implementation of OmniNet, Omnidirectional Representations from Transformers, in Pytorch. The authors propose that we should be atte

Phil Wang 48 Nov 21, 2022
Code for the ICME 2021 paper "Exploring Driving-Aware Salient Object Detection via Knowledge Transfer"

TSOD Code for the ICME 2021 paper "Exploring Driving-Aware Salient Object Detection via Knowledge Transfer" Usage For training, open train_test, run p

Jinming Su 2 Dec 23, 2021
A curated list of awesome Model-Based RL resources

Awesome Model-Based Reinforcement Learning This is a collection of research papers for model-based reinforcement learning (mbrl). And the repository w

OpenDILab 427 Jan 03, 2023
Improving Transferability of Representations via Augmentation-Aware Self-Supervision

Improving Transferability of Representations via Augmentation-Aware Self-Supervision Accepted to NeurIPS 2021 TL;DR: Learning augmentation-aware infor

hankook 38 Sep 16, 2022
A Python package to create, run, and post-process MODFLOW-based models.

Version 3.3.5 — release candidate Introduction FloPy includes support for MODFLOW 6, MODFLOW-2005, MODFLOW-NWT, MODFLOW-USG, and MODFLOW-2000. Other s

388 Nov 29, 2022