CodeContests is a competitive programming dataset for machine-learning

Last update: Jan 08, 2023

CodeContests

CodeContests is a competitive programming dataset for machine learning. It was used to train AlphaCode.

It consists of programming problems, from a variety of sources:

Site	URL	Source
Aizu	https://judge.u-aizu.ac.jp	CodeNet
AtCoder	https://atcoder.jp	CodeNet
CodeChef	https://www.codechef.com	description2code
Codeforces	https://codeforces.com	description2code and Codeforces
HackerEarth	https://www.hackerearth.com	description2code

Problems include test cases in the form of paired inputs and outputs, as well as both correct and incorrect human solutions in a variety of languages.

Usage

Install the Cloud SDK, which provides the gsutil utility. You can then download the full dataset (~3 GiB) with, e.g.:

gsutil -m cp -r gs://dm-code_contests /tmp

The data consists of ContestProblem protocol buffers in Riegeli format. See contest_problem.proto for the protocol buffer definition and documentation of its fields.

The dataset contains three splits:

Split	Filename
Training	`code_contests_train.riegeli-*-of-00128`
Validation	`code_contests_valid.riegeli`
Test	`code_contests_test.riegeli`
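As a minimal sketch of how the filename patterns in the table can be used, the Python snippet below groups a list of downloaded filenames by split. The patterns are copied from the table; the helper names `split_of` and `group_by_split` are hypothetical and not part of this repository, and the zero-padded shard index in the usage example (`-00000-of-00128`) is an assumption about how the `*` is filled in.

```python
from fnmatch import fnmatchcase

# Filename patterns copied from the table of splits above.
SPLIT_PATTERNS = {
    "train": "code_contests_train.riegeli-*-of-00128",
    "valid": "code_contests_valid.riegeli",
    "test": "code_contests_test.riegeli",
}


def split_of(filename):
    """Return the split a filename belongs to, or None if it matches no split."""
    for split, pattern in SPLIT_PATTERNS.items():
        if fnmatchcase(filename, pattern):
            return split
    return None


def group_by_split(filenames):
    """Group filenames by split, ignoring files that match no pattern."""
    groups = {split: [] for split in SPLIT_PATTERNS}
    for name in sorted(filenames):
        split = split_of(name)
        if split is not None:
            groups[split].append(name)
    return groups
```

For example, `group_by_split(os.listdir("/tmp/dm-code_contests"))` would list the training shards under `"train"` and the single validation and test files under `"valid"` and `"test"`, assuming the download command above was used with `/tmp` as the destination.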

There is example code for iterating over the dataset in C++ (in print_names.cc) and Python (in print_names_and_sources.py). For example, you can print the source and name of each problem in the validation data by installing bazel and then running:

bazel run -c opt \
  :print_names_and_sources /tmp/dm-code_contests/code_contests_valid.riegeli

Or do the same for the training data with the following command (which will print around 13,000 lines of output):

bazel run -c opt \
  :print_names_and_sources /tmp/dm-code_contests/code_contests_train.riegeli*
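For readers who want to see roughly what such an iteration involves, here is a hedged Python sketch; it is not the repository's print_names_and_sources.py. It assumes the `riegeli` Python package is installed and that a module named `contest_problem_pb2` has been generated from contest_problem.proto (the module name is an assumption). The helpers `resolve_shards` and `iter_problems` are hypothetical names introduced for illustration.

```python
import glob


def resolve_shards(pattern):
    """Expand a shell-style pattern into a sorted list of matching paths."""
    return sorted(glob.glob(pattern))


def iter_problems(filenames):
    """Yield ContestProblem messages from each Riegeli file in turn."""
    # Imported lazily so that resolve_shards() works without these packages.
    import riegeli
    import contest_problem_pb2  # assumed: generated from contest_problem.proto

    for filename in filenames:
        reader = riegeli.RecordReader(open(filename, "rb"))
        try:
            # Parse each record in the file as a ContestProblem message.
            yield from reader.read_messages(contest_problem_pb2.ContestProblem)
        finally:
            reader.close()
```

A caller could then loop over `iter_problems(resolve_shards("/tmp/dm-code_contests/code_contests_train.riegeli*"))` and print each problem's source and name; see contest_problem.proto for the exact field definitions.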

Planned updates

We plan to update this repository with code for executing and evaluating potential solutions.

Citing this work

If you use this dataset or code, please cite this paper:

@misc{alphacode,
    title={Competition-Level Code Generation with AlphaCode},
    author={Li, Yujia and Choi, David and Chung, Junyoung and Kushman, Nate and
    Schrittwieser, Julian and Leblond, Rémi and Eccles, Tom and
    Keeling, James and Gimeno, Felix and Dal Lago, Agustin and
    Hubert, Thomas and Choy, Peter and de Masson d'Autume, Cyprien and
    Babuschkin, Igor and Chen, Xinyun and Huang, Po-Sen and Welbl, Johannes and
    Gowal, Sven and Cherepanov, Alexey and Molloy, James and
    Mankowitz, Daniel and Sutherland Robson, Esme and Kohli, Pushmeet and
    de Freitas, Nando and Kavukcuoglu, Koray and Vinyals, Oriol},
    year={2022},
    month={Feb}}

License

The code is licensed under the Apache 2.0 License.

All non-code materials provided are made available under the terms of the CC BY 4.0 license (Creative Commons Attribution 4.0 International license).

We gratefully acknowledge the contributions of the following:

Codeforces materials are sourced from http://codeforces.com.
Description2Code materials are sourced from: Description2Code Dataset, licensed under the MIT open source license, copyright not specified.
CodeNet materials are sourced from: Project_CodeNet, licensed under Apache 2.0, copyright not specified.

Use of the third-party software, libraries, code or data may be governed by separate terms and conditions or license provisions. Your use of the third-party software, libraries or code may be subject to any such terms. We make no representations here with respect to rights or abilities to use any such materials.

Disclaimer

This is not an official Google product.

Owner

DeepMind
