[ICML 2020] DrRepair: Learning to Repair Programs from Error Messages

Overview

DrRepair: Learning to Repair Programs from Error Messages

This repo provides the source code & data of our paper: Graph-based, Self-Supervised Program Repair from Diagnostic Feedback (ICML 2020).

@InProceedings{Yasunaga20DrRepair,
  author =  {Michihiro Yasunaga and Percy Liang},
  title =   {Graph-based, Self-Supervised Program Repair from Diagnostic Feedback},
  year =    {2020},  
  booktitle =   {International Conference on Machine Learning (ICML)},  
}

Dependencies

  • GCC: Follow the SPoC requirement (https://github.com/Sumith1896/spoc)
  • Python 3.6.8 (e.g. conda create -n DrRepair python=3.6.8)
  • Python libraries
    • torch==1.0.1, numpy, tqdm, regex, joblib, pyyaml, bottle, cheroot, tensorboardX
    • clang==8.0.1 (do the following)
      conda config --add channels conda-forge
      conda install python-clang==8.0.1
      

Data

Download all the raw data -- DeepFix, SPoC, codeforce (for pretraining) -- by

./download_raw_data.sh

You can preprocess the raw data to get the program repair data by running the commands in

data/1.run-gen-err-dataset--orig-spoc.sh
data/2.run-gen-err-dataset--auto-corrupt--spoc.sh
data/3.run-gen-err-dataset--auto-corrupt--deepfix.sh

However, this takes a significant time, so for your convenience, you can download all the preprocessed data by

./download_preprocessed_data.sh

The repo structure looks like the following:

.
└─ raw_data/
   ├── codeforce_data/                  (raw programs from codeforce)
   ├── deepfix_data/                    (raw programs from deepfix)
   └── spoc_data/
       ├── spoc                              (SPoC data release)
       └── translation_preds                 (line-level code predictions from Kulal+19)

└─ data/                             
   ├── *.sh, *.py                       (preprocessing scripts)
   ├── err-data-compiler--orig-spoc/    (preprocessed, program repair data for spoc)
   ├── err-dev-compiler--for-SPoC/      (└─ dev data for spoc)
   ├── err-vocab-compiler--for-SPoC/    (└─ vocab for spoc)
   ...
   ... [similarly for deepfix and pre-training]

└─ utils/                      (utilities for code processing)

└─ model/                      (DrRepair model)

└─ evaluation/                 (to evaluate Repair model on deepfix/spoc test)
   ├── deepfix
   └── spoc
       ├── translation_preds_test/           (line-level code predictions from Kulal+19 for TestP/TestW)
       ...

Train models

Let's train program repair models. First, go to model directory. Then, run commands listed in run_deepfix.sh or run_spoc.sh. For example, if we train DrRepair ("base + graph" in the paper) on the DeepFix data, run:

name="code-compiler--2l-graph"
mkdir -p out_deepfix/${name}
python3 -u main_deepfix.py -o ${name} train \
    configs/base.yml  configs/data-deepfix/err-data-orig.yml \
    configs/model-code-compiler/2l-graph--dec-attn-all.yml

Evaluate models

We run the trained program repair model as a server. We then call this model on application tasks (DeepFix and SPoC) to evaluate the usefulness of the model.

DeepFix

1. Start server

First, go to model directory. We run a trained model (e.g. code-compiler--2l-graph) as a server by

name="SERVER--code-compiler--2l-graph"
mkdir out_deepfix/${name}
python3 -u main_deepfix.py -o ${name} server -p <port> \
    -l out_deepfix/code-compiler--2l-graph/<checkpoint> \
    configs/base.yml  configs/data-deepfix/err-data-orig.yml \
    configs/model-code-compiler/2l-graph--dec-attn-all.yml

For <port>, pick a port number (e.g. 8080) for the server. For <checkpoint>, pick a checkpoint (e.g. 150000) of the trained model. Then run ifconfig to get the IP address (e.g. 172.24.67.161) of the machine hosting this model. Concrete examples are provided in the second half of model/run_deepfix.sh.

2. Run model on DeepFix test

Go to evaluation/deepfix directory. First prepare:

repo_root="../../../.."
program_data_root=${repo_root}"/raw_data/deepfix_data"
test_split_root=${repo_root}"/data/err-data-compiler--auto-corrupt--orig-deepfix/bin4"

To run the trained model on the DeepFix test examples, do

name="code-compiler--2l-graph"
mkdir -p out/${name}/log
cd out/${name}

for entry in ${test_split_root}/*
do
  probid=`basename $entry`
  python3 -u ../../test_deepfix.py \
  --input-code-dir ${program_data_root}/${probid}/erroneous \
  --repairer-server  http://<IP>:<port>/pred
done

where you plug the IP address and port number into <IP> and <port>. After this completes, you can get the test accuracy by

python3 -u ../../collate_deepfix.py

Concrete examples are provided in evaluation/run_test_deepfix.sh.

SPoC

1. Start server

First, go to model directory. We run a trained model (e.g. code-compiler--2l-graph--finetune) as a server by

name="SERVER--code-compiler--2l-graph--finetune"
mkdir out_spoc/${name}
python3 -u main_spoc.py -o ${name} server -p <port> \
    -l out_spoc/code-compiler--2l-graph--finetune/<checkpoint> \
    configs/base.yml  configs/data-spoc/err-data-orig.yml \
    configs/model-code-compiler/2l-graph--dec-attn-all.yml

Similar to DeepFix, pick a port number and a checkpoint, and get the IP address. Concrete examples are provided in the second half of model/run_spoc.sh.

2. Run model on SPoC test

Go to evaluation/spoc directory. First prepare:

repo_root="../../../.."

To run the trained model on all the programs in SPoC TestW, do

name="code-compiler--2l-graph--finetune"

INPUT=translation_preds_test/testw    #change to testp if you want to evaluate on testp
N=$(tail -n+2 ${INPUT}.tsv | cut -f 3-6 | uniq | wc -l)  # Count the number of programs
interval=10

mkdir -p out_testw/${name}/log        #change to testp if you want to evaluate on testp
cd out_testw/${name}                  #change to testp if you want to evaluate on testp

i=1
while [[ $i -le $N ]]; do
  python -u ../../test_spoc.py -p 100 \
  --compile-budget 100 --n-parallel ${interval} \
  --repairer-server  http://<IP>:<port>/pred \
  ../../${INPUT} $i
  i=$(($i + ${interval}))
done

where you plug the IP address and port number into <IP> and <port>. After this completes, you can get the test accuracy by

python3 -u ../../collate_spoc.py

Concrete examples are provided in evaluation/run_test_spoc.sh.

Acknowledgment

The original DeepFix and SPoC data used in this work come from the following papers:

DeepFix: Fixing common C language errors by deep learning. Rahul Gupta, Soham Pal, Aditya Kanade, Shirish Shevade. AAAI 2017.
SPoC: Search-based Pseudocode to Code. Sumith Kulal, Panupong Pasupat, Kartik Chandra, Mina Lee, Oded Padon, Alex Aiken and Percy Liang. NeurIPS 2019.
Owner
Michihiro Yasunaga
PhD Student in Computer Science
Michihiro Yasunaga
A data annotation pipeline to generate high-quality, large-scale speech datasets with machine pre-labeling and fully manual auditing.

About This repository provides data and code for the paper: Scalable Data Annotation Pipeline for High-Quality Large Speech Datasets Development (subm

Appen Repos 86 Dec 07, 2022
People log into different sites every day to get information and browse through these sites one by one

HyperLink People log into different sites every day to get information and browse through these sites one by one. And they are exposed to advertisemen

0 Feb 17, 2022
Mortgage-loan-prediction - Show how to perform advanced Analytics and Machine Learning in Python using a full complement of PyData utilities

Mortgage-loan-prediction - Show how to perform advanced Analytics and Machine Learning in Python using a full complement of PyData utilities

Deepak Nandwani 1 Dec 31, 2021
Amazing-Python-Scripts - 🚀 Curated collection of Amazing Python scripts from Basics to Advance with automation task scripts.

📑 Introduction A curated collection of Amazing Python scripts from Basics to Advance with automation task scripts. This is your Personal space to fin

Avinash Ranjan 1.1k Dec 29, 2022
Platform-agnostic AI Framework 🔥

🇬🇧 TensorLayerX is a multi-backend AI framework, which can run on almost all operation systems and AI hardwares, and support hybrid-framework progra

TensorLayer Community 171 Jan 06, 2023
Advantage Actor Critic (A2C): jax + flax implementation

Advantage Actor Critic (A2C): jax + flax implementation Current version supports only environments with continious action spaces and was tested on muj

Andrey 3 Jan 23, 2022
Project page for our ICCV 2021 paper "The Way to my Heart is through Contrastive Learning"

The Way to my Heart is through Contrastive Learning: Remote Photoplethysmography from Unlabelled Video This is the official project page of our ICCV 2

36 Jan 06, 2023
Generalized Proximal Policy Optimization with Sample Reuse (GePPO)

Generalized Proximal Policy Optimization with Sample Reuse This repository is the official implementation of the reinforcement learning algorithm Gene

Jimmy Queeney 9 Nov 28, 2022
Scene-Text-Detection-and-Recognition (Pytorch)

Scene-Text-Detection-and-Recognition (Pytorch) Competition URL: https://tbrain.t

Gi-Luen Huang 9 Jan 02, 2023
Space robot - (Course Project) Using the space robot to capture the target satellite that is disabled and spinning, then stabilize and fix it up

Space robot - (Course Project) Using the space robot to capture the target satellite that is disabled and spinning, then stabilize and fix it up

Mingrui Yu 3 Jan 07, 2022
OCR-D wrapper for detectron2 based segmentation models

ocrd_detectron2 OCR-D wrapper for detectron2 based segmentation models Introduction Installation Usage OCR-D processor interface ocrd-detectron2-segm

Robert Sachunsky 13 Dec 06, 2022
Categorizing comments on YouTube into different categories.

Youtube Comments Categorization This repo is for categorizing comments on a youtube video into different categories. negative (grievances, complaints,

Rhitik 5 Nov 26, 2022
Learning with Subset Stacking

Learning with Subset Stacking (LESS) LESS is a new supervised learning algorithm that is based on training many local estimators on subsets of a given

S. Ilker Birbil 19 Oct 04, 2022
[ICCV 2021] Focal Frequency Loss for Image Reconstruction and Synthesis

Focal Frequency Loss - Official PyTorch Implementation This repository provides the official PyTorch implementation for the following paper: Focal Fre

Liming Jiang 460 Jan 04, 2023
PuppetGAN - Cross-Domain Feature Disentanglement and Manipulation just got way better! 🚀

Better Cross-Domain Feature Disentanglement and Manipulation with Improved PuppetGAN Quite cool... Right? Introduction This repo contains a TensorFlow

Giorgos Karantonis 5 Aug 25, 2022
Does MAML Only Work via Feature Re-use? A Data Set Centric Perspective

Does-MAML-Only-Work-via-Feature-Re-use-A-Data-Set-Centric-Perspective Does MAML Only Work via Feature Re-use? A Data Set Centric Perspective Installin

2 Nov 07, 2022
An Abstract Cyber Security Simulation and Markov Game for OpenAI Gym

gym-idsgame An Abstract Cyber Security Simulation and Markov Game for OpenAI Gym gym-idsgame is a reinforcement learning environment for simulating at

Kim Hammar 29 Dec 03, 2022
SPCL: A New Framework for Domain Adaptive Semantic Segmentation via Semantic Prototype-based Contrastive Learning

SPCL SPCL: A New Framework for Domain Adaptive Semantic Segmentation via Semantic Prototype-based Contrastive Learning Update on 2021/11/25: ArXiv Ver

Binhui Xie (谢斌辉) 11 Oct 29, 2022
PyMove is a Python library to simplify queries and visualization of trajectories and other spatial-temporal data

Use PyMove and go much further Information Package Status License Python Version Platforms Build Status PyPi version PyPi Downloads Conda version Cond

Insight Data Science Lab 64 Nov 15, 2022
Convolutional 2D Knowledge Graph Embeddings resources

ConvE Convolutional 2D Knowledge Graph Embeddings resources. Paper: Convolutional 2D Knowledge Graph Embeddings Used in the paper, but do not use thes

Tim Dettmers 586 Dec 24, 2022