This repository contains the code and data for our #ICSE2022 paper, "CodeFill: Multi-token Code Completion by Jointly Learning from Structure and Naming Sequences".

Last update: Oct 31, 2022

Overview

CodeFill

This repository contains the code for our paper "CodeFill: Multi-token Code Completion by Jointly Learning from Structure and Naming Sequences" (DOI: 10.1145/3510003.3510172). The paper is authored by Maliheh Izadi, Roberta Gismondi, and Georgios Gousios, and has been accepted for publication at #ICSE2022.

Abstract

Code completion is an essential feature of IDEs, yet current autocompleters are restricted to either grammar-based or NLP-based single-token completions. Both approaches have significant drawbacks: grammar-based autocompletion is restricted in dynamically-typed language environments, whereas NLP-based autocompleters struggle to understand the semantics of the programming language and the developer's code context.

In this work, we present CodeFill, a language model for autocompletion that combines learned structure and naming information. Using a parallel Transformer architecture and multi-task learning, CodeFill consumes sequences of source code token names and their equivalent AST token types. Uniquely, CodeFill is trained both for single-token and multi-token (statement) prediction, which enables it to learn long-range dependencies among grammatical and naming elements. We train CodeFill on two datasets, consisting of 29M and 425M lines of code, respectively. To make the evaluation more realistic, we develop a method to automatically infer points in the source code at which completion matters. We compare CodeFill against four baselines and two state-of-the-art models, GPT-C and TravTrans+. CodeFill surpasses all baselines in single-token prediction (MRR: 70.9% vs. 66.2% and 67.8%) and outperforms the state of the art for multi-token prediction (ROUGE-L: 63.7% vs. 52.4% and 59.2%, for n=4 tokens). We publicly release our source code and datasets.
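
To give a feel for the "token names plus token types" input described above, here is a minimal sketch that produces two aligned sequences for a Python snippet. It is an illustration only, not the preprocessing used in the paper: it assumes the standard library tokenize module as a stand-in for the AST token types, and the helper name names_and_types is hypothetical.

```python
import io
import tokenize

# Layout-only tokens that carry no naming information in this sketch.
_SKIPPED = {
    tokenize.NEWLINE,
    tokenize.NL,
    tokenize.INDENT,
    tokenize.DEDENT,
    tokenize.COMMENT,
    tokenize.ENDMARKER,
}


def names_and_types(source: str):
    """Return two aligned lists: token strings and their token-type names.

    Hypothetical helper: the lexer's token types stand in for the
    AST token types mentioned in the abstract (an assumption).
    """
    names, types = [], []
    reader = io.StringIO(source).readline
    for tok in tokenize.generate_tokens(reader):
        if tok.type in _SKIPPED:
            continue
        names.append(tok.string)
        # exact_type distinguishes operators, e.g. EQUAL vs. LPAR.
        types.append(tokenize.tok_name[tok.exact_type])
    return names, types


if __name__ == "__main__":
    names, types = names_and_types("x = foo(1)\n")
    print(names)
    print(types)
```

Both lists have the same length, so position i in the type sequence describes position i in the name sequence. A model that learns from both could, under this assumption, be fed the two sequences in parallel.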

Data

Our datasets are available on the Hugging Face Hub.

Owner

Software Analytics Lab
