Empirical Study of Transformers for Source Code & A Simple Approach for Handling Out-of-Vocabulary Identifiers in Deep Learning for Source Code

Last update: Nov 15, 2022

Overview

Transformers for variable misuse, function naming and code completion tasks

The official PyTorch implementation of:

Empirical Study of Transformers for Source Code [arxiv] (accepted to ESEC/FSE'21)
A Simple Approach for Handling Out-of-Vocabulary Identifiers in Deep Learning for Source Code [arxiv] (accepted to NAACL'21)
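
To illustrate the out-of-vocabulary (OOV) identifier problem named in the second paper's title, here is a minimal sketch. It is not taken from this repository. It shows one way to handle identifiers that are missing from the vocabulary: replace each with a numbered placeholder, and reuse the same placeholder for repeated occurrences. The function name `anonymize_oov`, the `VAR` prefix and the toy vocabulary are all assumptions made for this example.

```python
def anonymize_oov(identifiers, vocab, prefix="VAR"):
    """Replace identifiers missing from `vocab` with numbered placeholders.

    Hypothetical helper for illustration only: `identifiers` is a list of
    identifier strings from one code snippet, `vocab` is a set of known names.
    """
    mapping = {}  # OOV identifier -> placeholder, local to this snippet
    result = []
    for name in identifiers:
        if name in vocab:
            result.append(name)
            continue
        if name not in mapping:
            # Number placeholders in order of first appearance.
            mapping[name] = f"{prefix}{len(mapping) + 1}"
        result.append(mapping[name])
    return result, mapping


anonymized, mapping = anonymize_oov(
    ["my_counter", "i", "my_counter", "tmp_buf"], vocab={"i"}
)
print(anonymized)
```

Because the mapping is rebuilt for every snippet, the placeholders carry no meaning across snippets; they only record which positions within one snippet refer to the same identifier.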

The repository also contains code for resplitting the Python150k and JavaScript150k datasets: the new splits are made by repository, duplicates are removed, and the redistributable version of Py150k is used.
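
The idea behind splitting by repository with duplicate removal can be sketched as follows. This is an illustrative example only, not the code in data_utils. The input layout (tuples of repository name, path and source text), the function name `dedupe_and_split_by_repo`, the split fractions and the use of exact content hashes to detect duplicates are all assumptions.

```python
import hashlib
import random
from collections import defaultdict


def dedupe_and_split_by_repo(files, val_frac=0.1, test_frac=0.1, seed=0):
    """Split files into train / val / test so that no repository spans two splits.

    `files` is an iterable of (repo_name, path, source_text) tuples
    (a hypothetical layout chosen for this sketch).
    """
    # Drop exact duplicates: keep the first file seen for each content hash.
    seen = set()
    by_repo = defaultdict(list)
    for repo, path, text in files:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        by_repo[repo].append(path)

    # Shuffle repositories, not files, so each repository lands in one split.
    repos = sorted(by_repo)
    random.Random(seed).shuffle(repos)
    n_test = int(len(repos) * test_frac)
    n_val = int(len(repos) * val_frac)
    groups = {
        "test": repos[:n_test],
        "val": repos[n_test:n_test + n_val],
        "train": repos[n_test + n_val:],
    }
    return {
        name: [path for repo in group for path in by_repo[repo]]
        for name, group in groups.items()
    }
```

Assigning whole repositories to a split keeps code from the same project out of both training and evaluation data, which a per-file random split would not guarantee.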

Repository structure

data_utils: scripts for downloading the Python150k and JavaScript150k datasets and obtaining new train / val / test splits (split by repository, with duplicates removed, using the redistributable version of Py150k)
vm_fn: code for the Variable Misuse (VM) and Function Naming (FN) tasks (additional preprocessing, models, training, etc.)
cc: code for the Code Completion (CC) task (additional preprocessing, models, training, etc.)

See README in each directory for details.

Run

The code was tested on Linux (kernel 3.10.0), and experiments were run on a Tesla V100 GPU. Required libraries are listed in requirments.txt in the vm_fn and cc directories. The implementation is based on PyTorch>=1.5.

Running experiments:

Download and resplit data, see data_utils for details;
Preprocess data for a task you are interested in (VM, FN or CC), see vm_fn or cc for details;
Run the experiment you are interested in, see vm_fn or cc for details.

Attribution

Parts of this code are based on the following repositories:

Citation

If you find this code useful, please cite our papers:

@misc{chirkova2020empirical,
      title={Empirical Study of Transformers for Source Code}, 
      author={Nadezhda Chirkova and Sergey Troshin},
      year={2020},
      eprint={2010.07987},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}

@inproceedings{chirkova2020simple,
      title={A Simple Approach for Handling Out-of-Vocabulary Identifiers in Deep Learning for Source Code}, 
      author={Nadezhda Chirkova and Sergey Troshin},
      booktitle={North American Chapter of the Association for Computational Linguistics},
      year={2021}
}

Owner

Bayesian Methods Research Group
