VarCLR: Variable Semantic Representation Pre-training via Contrastive Learning

Overview

New: Paper accepted by ICSE 2022. Preprint available on arXiv!

This repository contains code and pre-trained models for VarCLR, a contrastive-learning-based approach for learning semantic representations of variable names that effectively captures variable similarity, with state-of-the-art results on the IdBench benchmarks.

Step 0: Install

# run from the root of a local clone of this repository
pip install -e .

Step 1: Load a Pre-trained VarCLR Model

from varclr.models import Encoder
model = Encoder.from_pretrained("varclr-codebert")
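The results tables below also list VarCLR-Avg and VarCLR-LSTM variants. If checkpoints for these are published, they would presumably load through the same interface; the identifiers below are assumptions, not confirmed names:

# hypothetical checkpoint names for the other variants reported below
avg_model = Encoder.from_pretrained("varclr-avg")    # assumed identifier
lstm_model = Encoder.from_pretrained("varclr-lstm")  # assumed identifier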

Step 2: VarCLR Variable Embeddings

Get the embedding of a single variable

emb = model.encode("squareslab")
print(emb.shape)
# torch.Size([1, 768])

Get embeddings of a list of variables (batching supported)

emb = model.encode(["squareslab", "strudel"])
print(emb.shape)
# torch.Size([2, 768])
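The embeddings can also be compared directly. A minimal sketch, assuming the scores in the next step are cosine similarities between encoded vectors (an assumption consistent with the self-similarity scores of ~1.0 shown below):

import torch.nn.functional as F

emb = model.encode(["squareslab", "strudel"])
# cosine similarity between the two 768-dimensional embeddings;
# expected to be close to model.score("squareslab", "strudel")
print(F.cosine_similarity(emb[0:1], emb[1:2]).item())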

Step 3: Get VarCLR Similarity Scores

Get similarity scores for N variable pairs

print(model.score("squareslab", "strudel"))
# [0.42812108993530273]
print(model.score(["squareslab", "average", "max", "max"], ["strudel", "mean", "min", "maximum"]))
# [0.42812108993530273, 0.8849745988845825, 0.8035818338394165, 0.889922022819519]

Get pairwise (N * M) similarity scores from two lists of variables

variable_list = ["squareslab", "strudel", "neulab"]
print(model.cross_score("squareslab", variable_list))
# [[1.0000007152557373, 0.4281214475631714, 0.7207341194152832]]
print(model.cross_score(variable_list, variable_list))
# [[1.0000007152557373, 0.4281214475631714, 0.7207341194152832],
#  [0.4281214475631714, 1.0000004768371582, 0.549992561340332],
#  [0.7207341194152832, 0.549992561340332, 1.000000238418579]]
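Because cross_score compares every query against every candidate, it can be used to rank candidate names for a query. A short sketch using the API above (the candidate list is illustrative):

query = "average"
candidates = ["strudel", "mean", "minimum", "total"]
scores = model.cross_score([query], candidates)[0]
# rank candidates by similarity to the query, highest first
for cand, s in sorted(zip(candidates, scores), key=lambda p: -p[1]):
    print(f"{cand}: {s:.3f}")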

Step 4: Reproduce IdBench Benchmark Results

Load the IdBench benchmark

from varclr.benchmarks import Benchmark

# Similarity on IdBench-Medium
b1 = Benchmark.build("idbench", variant="medium", metric="similarity")
# Relatedness on IdBench-Large
b2 = Benchmark.build("idbench", variant="large", metric="relatedness")

Compute VarCLR scores and evaluate

id1_list, id2_list = b1.get_inputs()
predicted = model.score(id1_list, id2_list)
print(b1.evaluate(predicted))
# {'spearmanr': 0.5248567181503295, 'pearsonr': 0.5249843473193132}

print(b2.evaluate(model.score(*b2.get_inputs())))
# {'spearmanr': 0.8012168379981921, 'pearsonr': 0.8021791703187449}

Let's compare with the original CodeBERT

codebert = Encoder.from_pretrained("codebert")
print(b1.evaluate(codebert.score(*b1.get_inputs())))
# {'spearmanr': 0.2056582946575104, 'pearsonr': 0.1995058696927054}
print(b2.evaluate(codebert.score(*b2.get_inputs())))
# {'spearmanr': 0.3909218857993804, 'pearsonr': 0.3378219622284688}
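To sweep every configuration reported in the tables below, one can loop over variants and metrics; this sketch assumes "small" is also an accepted variant name, matching the Small column:

for variant in ["small", "medium", "large"]:  # "small" assumed by analogy
    for metric in ["similarity", "relatedness"]:
        b = Benchmark.build("idbench", variant=variant, metric=metric)
        res = b.evaluate(model.score(*b.get_inputs()))
        print(f"{variant} {metric}: spearmanr={res['spearmanr']:.3f}")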

Results on IdBench benchmarks

Similarity

| Method | Small | Medium | Large |
| --- | --- | --- | --- |
| FT-SG | 0.30 | 0.29 | 0.28 |
| LV | 0.32 | 0.30 | 0.30 |
| FT-cbow | 0.35 | 0.38 | 0.38 |
| VarCLR-Avg | 0.47 | 0.45 | 0.44 |
| VarCLR-LSTM | 0.50 | 0.49 | 0.49 |
| VarCLR-CodeBERT | 0.53 | 0.53 | 0.51 |
| Combined-IdBench | 0.48 | 0.59 | 0.57 |
| Combined-VarCLR | 0.66 | 0.65 | 0.62 |

Relatedness

| Method | Small | Medium | Large |
| --- | --- | --- | --- |
| LV | 0.48 | 0.47 | 0.48 |
| FT-SG | 0.70 | 0.71 | 0.68 |
| FT-cbow | 0.72 | 0.74 | 0.73 |
| VarCLR-Avg | 0.67 | 0.66 | 0.66 |
| VarCLR-LSTM | 0.71 | 0.70 | 0.69 |
| VarCLR-CodeBERT | 0.79 | 0.79 | 0.80 |
| Combined-IdBench | 0.71 | 0.78 | 0.79 |
| Combined-VarCLR | 0.79 | 0.81 | 0.85 |

Pre-train your own VarCLR models

Coming soon.

Cite

If you find VarCLR useful in your research, please cite our paper:

@misc{chen2021varclr,
      title={VarCLR: Variable Semantic Representation Pre-training via Contrastive Learning},
      author={Qibin Chen and Jeremy Lacomis and Edward J. Schwartz and Graham Neubig and Bogdan Vasilescu and Claire Le Goues},
      year={2021},
      eprint={2112.02650},
      archivePrefix={arXiv},
      primaryClass={cs.SE}
}