Implementation of "Selection via Proxy: Efficient Data Selection for Deep Learning" from ICLR 2020.

Selection via Proxy: Efficient Data Selection for Deep Learning

This repository contains a refactored implementation of "Selection via Proxy: Efficient Data Selection for Deep Learning" from ICLR 2020.

If you use this code in your research, please use the following BibTeX entry.

@inproceedings{
    coleman2020selection,
    title={Selection via Proxy: Efficient Data Selection for Deep Learning},
    author={Cody Coleman and Christopher Yeh and Stephen Mussmann and Baharan Mirzasoleiman and Peter Bailis and Percy Liang and Jure Leskovec and Matei Zaharia},
    booktitle={International Conference on Learning Representations},
    year={2020},
    url={https://openreview.net/forum?id=HJg2b0VYDr}
}

The original code is also available as a zip file, but lacks documentation, uses outdated packages, and won't be maintained. Please use this repository instead and report issues here.

Setup

Prerequisites

Installation

git clone https://github.com/stanford-futuredata/selection-via-proxy.git
cd selection-via-proxy
pip install -e .

or simply

pip install git+https://github.com/stanford-futuredata/selection-via-proxy.git

Quickstart

Perform active learning on CIFAR10 from the command line:

python -m svp.cifar active

Or from the Python interpreter:

from svp.cifar.active import active
active()

"Selection via proxy" happens when --proxy-arch doesn't match --arch:

# ResNet20 selecting data for a ResNet164
python -m svp.cifar active --proxy-arch preact20 --arch preact164

For help, see python -m svp.cifar active --help or the docstring of active().

Example Usage

Below are more examples of the command line interface that cover different datasets (e.g., CIFAR100, ImageNet, Amazon Review Polarity) and commands (e.g., train, coreset).

Basic Training

CIFAR10 and CIFAR100

Preliminaries

None. The CIFAR10 and CIFAR100 datasets will download if they don't exist in ./data/cifar10 and ./data/cifar100 respectively.

Examples
# Train ResNet164 with pre-activation (https://arxiv.org/abs/1603.05027) on CIFAR10.
python -m svp.cifar train --dataset cifar10 --arch preact164

Replace --dataset cifar10 with --dataset cifar100 to run on CIFAR100 rather than CIFAR10.

# Train ResNet164 with pre-activation (https://arxiv.org/abs/1603.05027) on CIFAR100.
python -m svp.cifar train --dataset cifar100 --arch preact164

The same is true for all the python -m svp.cifar commands below.

ImageNet

Preliminaries
  • Download the ImageNet dataset into a directory called imagenet.
  • Extract the images.
# Extract train data.
mkdir train && mv ILSVRC2012_img_train.tar train/ && cd train
tar -xvf ILSVRC2012_img_train.tar && rm -f ILSVRC2012_img_train.tar
find . -name "*.tar" | while read NAME ; do mkdir -p "${NAME%.tar}"; tar -xvf "${NAME}" -C "${NAME%.tar}"; rm -f "${NAME}"; done
# Extract validation data.
cd ../ && mkdir val && mv ILSVRC2012_img_val.tar val/ && cd val && tar -xvf ILSVRC2012_img_val.tar
wget -qO- https://raw.githubusercontent.com/soumith/imagenetloader.torch/master/valprep.sh | bash
  • Replace /path/to/data in all the python -m svp.imagenet commands below with the path to the directory that contains the imagenet directory you created. Do not include imagenet itself in the path; the script appends it automatically.
Examples
# Train ResNet50 (https://arxiv.org/abs/1512.03385).
python -m svp.imagenet train --dataset-dir '/path/to/data' --arch resnet50 --num-workers 20

For convenience, you can use larger batch sizes and scale learning rates according to "Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour" with --scale-learning-rates:

# Train ResNet50 with a batch size of 1048 and scale learning rates accordingly.
python -m svp.imagenet train --dataset-dir '/path/to/data' --arch resnet50 --num-workers 20 \
    --batch-size 1048 --scale-learning-rates
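
As a rough illustration of what scaling learning rates means, the linear scaling rule from "Accurate, Large Minibatch SGD" multiplies the learning rate by the ratio of the new batch size to a reference batch size. The sketch below assumes the paper's reference values (learning rate 0.1 at batch size 256); scaled_learning_rate is a hypothetical helper, and the exact reference values used by --scale-learning-rates are not stated here.

```python
def scaled_learning_rate(base_lr: float, batch_size: int, base_batch_size: int = 256) -> float:
    """Linear scaling rule: grow the learning rate in proportion to the batch size."""
    # Assumption: base_lr was tuned for base_batch_size (0.1 at 256 in the paper).
    return base_lr * batch_size / base_batch_size


# Learning rate for the batch size of 1048 used in the command above.
lr = scaled_learning_rate(0.1, 1048)
print(lr)  # approximately 0.409
```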

Mixed precision training is also supported using apex. Apex isn't installed during the pip install instructions above, so please follow the installation instructions in the apex repository before running the command below.

# Use mixed precision training to train ResNet50 with a batch size of 1048 and scale learning rates accordingly.
python -m svp.imagenet train --dataset-dir '/path/to/data' --arch resnet50 --num-workers 20 \
    --batch-size 1048 --scale-learning-rates --fp16

Amazon Review Polarity and Full

Preliminaries
tar -xvzf amazon_review_full_csv.tar.gz
tar -xvzf amazon_review_polarity_csv.tar.gz
  • Replace /path/to/data in all the python -m svp.amazon commands below with the path to the root directory you created. Do not include amazon_review_full_csv or amazon_review_polarity_csv in the path; the script appends them automatically.
Examples
# Train VDCNN29 (https://arxiv.org/abs/1606.01781) on Amazon Review Polarity.
python -m svp.amazon train --datasets-dir '/path/to/data' --dataset amazon_review_polarity --arch vdcnn29-conv \
    --num-workers 4 --eval-num-workers 8

Replace --dataset amazon_review_polarity with --dataset amazon_review_full to run on Amazon Review Full rather than Amazon Review Polarity.

# Train VDCNN29 (https://arxiv.org/abs/1606.01781) on Amazon Review Full.
python -m svp.amazon train --datasets-dir '/path/to/data' --dataset amazon_review_full --arch vdcnn29-maxpool \
    --num-workers 4 --eval-num-workers 8

The same is true for all the python -m svp.amazon commands below.

Active learning

Active learning selects points to label from a large pool of unlabeled data by repeatedly training a model on a small pool of labeled data and selecting additional examples to label based on the model’s uncertainty (e.g., the entropy of predicted class probabilities) or other heuristics. The commands below demonstrate how to perform active learning on CIFAR10, CIFAR100, ImageNet, Amazon Review Polarity and Amazon Review Full with a variety of models and selection methods.
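
To make the uncertainty-based selection step concrete, here is a minimal sketch in plain Python. It assumes the model's predicted class probabilities for the unlabeled pool are already available as lists; the function names are hypothetical and are not part of the svp package.

```python
import math


def least_confidence(probs):
    """Uncertainty as one minus the highest predicted class probability."""
    return 1.0 - max(probs)


def entropy(probs):
    """Shannon entropy (in nats) of the predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_most_uncertain(pool_probs, budget, score=entropy):
    """Return indices of the `budget` pool examples with the highest uncertainty."""
    ranked = sorted(range(len(pool_probs)), key=lambda i: score(pool_probs[i]), reverse=True)
    return ranked[:budget]


# Toy predictions for four unlabeled examples over three classes.
pool_probs = [
    [0.98, 0.01, 0.01],  # confident prediction
    [0.40, 0.35, 0.25],  # uncertain
    [0.70, 0.20, 0.10],
    [0.34, 0.33, 0.33],  # close to uniform, most uncertain
]
chosen = select_most_uncertain(pool_probs, budget=2)
print(chosen)  # [3, 1]
```

In an active learning loop, the chosen examples would be labeled, added to the labeled pool, and the model retrained before the next round.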

CIFAR10 and CIFAR100

Baseline Approach
# Perform active learning with ResNet164 for both selection and the final predictions.
python -m svp.cifar active --dataset cifar10 --arch preact164 --num-workers 4 \
	--selection-method least_confidence \
	--initial-subset 1000 \
	--round 4000 \
	--round 5000 \
	--round 5000 \
	--round 5000 \
	--round 5000
Selection via Proxy

If the model architectures (arch vs proxy_arch) or the learning rate schedules don't match, "selection via proxy" (SVP) is performed and two separate models are trained. The proxy selects which examples to label, while the target is used only to evaluate the quality of the selection. By default, the target model (arch) is trained and evaluated after each selection round. To change this behavior, set eval_target_at to evaluate only at specific labeling budgets, or set train_target to False to skip evaluating the target model.

# Perform active learning with ResNet20 for selection and ResNet164 for the final predictions.
python -m svp.cifar active --dataset cifar10 --arch preact164 --num-workers 4 \
	--selection-method least_confidence --proxy-arch preact20 \
	--initial-subset 1000 \
	--round 4000 \
	--round 5000 \
	--round 5000 \
	--round 5000 \
	--round 5000 \
	--eval-target-at 25000
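
The value passed to --eval-target-at is a labeling budget. Assuming each --round adds that many newly labeled examples on top of --initial-subset (an interpretation consistent with the numbers in the command above), the cumulative budgets can be checked with a few lines of Python:

```python
from itertools import accumulate

initial_subset = 1000
rounds = [4000, 5000, 5000, 5000, 5000]

# Labeled-pool size after the initial subset and after each selection round.
budgets = list(accumulate([initial_subset] + rounds))
print(budgets)  # [1000, 5000, 10000, 15000, 20000, 25000]
```

With these settings, 25000 is the budget reached after the final round, so the target would be evaluated only once, at the end.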

To train the proxy for fewer epochs, use the --proxy-* options as shown below:

# Perform active learning with ResNet20 after only 50 epochs for selection.
python -m svp.cifar active --dataset cifar10 --arch preact164 --num-workers 4 \
	--selection-method least_confidence --proxy-arch preact20 \
	--proxy-learning-rate 0.01 --proxy-epochs 1 \
	--proxy-learning-rate 0.1 --proxy-epochs 45 \
	--proxy-learning-rate 0.01 --proxy-epochs 4 \
	--initial-subset 1000 \
	--round 4000 \
	--round 5000 \
	--round 5000 \
	--round 5000 \
	--round 5000 \
	--eval-target-at 25000

ImageNet

Baseline Approach
# Perform active learning with ResNet50 for both selection and the final predictions.
python -m svp.imagenet active --datasets-dir '/path/to/data' --arch resnet50 --num-workers 20
Selection via Proxy

If the model architectures (arch vs proxy_arch) or the learning rate schedules don't match, "selection via proxy" (SVP) is performed and two separate models are trained. The proxy selects which examples to label, while the target is used only to evaluate the quality of the selection. By default, the target model (arch) is trained and evaluated after each selection round. To change this behavior, set eval_target_at to evaluate only at specific labeling budgets, or set train_target to False to skip evaluating the target model.

# Perform active learning with ResNet18 for selection and ResNet50 for the final predictions.
python -m svp.imagenet active --datasets-dir '/path/to/data' --arch resnet50 --num-workers 20 \
    --proxy-arch resnet18 --proxy-batch-size 1028 --proxy-scale-learning-rates \
    --eval-target-at 512467

To train the proxy for fewer epochs, use the --proxy-* options as shown below:

# Perform active learning with ResNet18 after only 45 epochs for selection.
python -m svp.imagenet active --datasets-dir '/path/to/data' --arch resnet50 --num-workers 20 \
    --proxy-arch resnet18 --proxy-batch-size 1028 --proxy-scale-learning-rates \
    --eval-target-at 512467 \
    --proxy-learning-rate 0.0167 --proxy-epochs 1 \
    --proxy-learning-rate 0.0333 --proxy-epochs 1 \
    --proxy-learning-rate 0.05 --proxy-epochs 1 \
    --proxy-learning-rate 0.0667 --proxy-epochs 1 \
    --proxy-learning-rate 0.0833 --proxy-epochs 1 \
    --proxy-learning-rate 0.1 --proxy-epochs 25 \
    --proxy-learning-rate 0.01 --proxy-epochs 15
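
The repeated --proxy-learning-rate and --proxy-epochs options read as ordered pairs, each saying "train at this learning rate for this many epochs". Under that assumption, the sketch below expands the pairs from the command above into a per-epoch schedule; expand_schedule is a hypothetical helper, not a function from the svp package.

```python
def expand_schedule(learning_rates, epochs):
    """Expand paired (learning rate, epoch count) options into one rate per epoch."""
    if len(learning_rates) != len(epochs):
        raise ValueError("each learning rate needs a matching epoch count")
    per_epoch = []
    for lr, n in zip(learning_rates, epochs):
        per_epoch.extend([lr] * n)
    return per_epoch


# Values copied from the ImageNet proxy command above.
proxy_lrs = [0.0167, 0.0333, 0.05, 0.0667, 0.0833, 0.1, 0.01]
proxy_epochs = [1, 1, 1, 1, 1, 25, 15]

schedule = expand_schedule(proxy_lrs, proxy_epochs)
print(len(schedule))  # 45
print(schedule[:6])   # [0.0167, 0.0333, 0.05, 0.0667, 0.0833, 0.1]
```

The first five single-epoch stages appear to form a linear warm-up toward 0.1, followed by 25 epochs at 0.1 and 15 epochs at 0.01, for 45 epochs in total.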

Amazon Review Polarity and Full

Baseline Approach
# Perform active learning with VDCNN29 for both selection and the final predictions.
python -m svp.amazon active --datasets-dir '/path/to/data' --dataset amazon_review_polarity  --num-workers 8 \
    --arch vdcnn29-conv --selection-method least_confidence
Selection via Proxy

If the model architectures (arch vs proxy_arch) or the learning rate schedules don't match, "selection via proxy" (SVP) is performed and two separate models are trained. The proxy selects which examples to label, while the target is used only to evaluate the quality of the selection. By default, the target model (arch) is trained and evaluated after each selection round. To change this behavior, set eval_target_at to evaluate only at specific labeling budgets, or set train_target to False to skip evaluating the target model. You can evaluate a series of selections later using the precomputed_selection option.

# Perform active learning with VDCNN9 for selection and VDCNN29 for the final predictions.
python -m svp.amazon active --datasets-dir '/path/to/data' --dataset amazon_review_polarity --num-workers 8 \
    --arch vdcnn29-conv --selection-method least_confidence \
    --proxy-arch vdcnn9-maxpool --eval-target-at 1440000

To use fastText as a proxy, install fastText 0.1.0 and replace /path/to/fastText/fasttext in the python -m svp.amazon fasttext commands below with the path to the fastText binary you created.

# For convenience, save fastText results in a separate directory
mkdir fasttext
# Perform active learning with fastText.
python -m svp.amazon fasttext '/path/to/fastText/fasttext' --run-dir fasttext \
    --datasets-dir '/path/to/data' --dataset amazon_review_polarity --selection-method least_confidence \
    --size 72000 --size 360000 --size 720000 --size 1080000 --size 1440000
# Get the most recent timestamp from the fasttext directory.
fasttext_path="fasttext/$(ls fasttext | sort -nr | head -n 1)"
# Use selected labeled data from fastText to train VDCNN29
python -m svp.amazon active --datasets-dir '/path/to/data' --dataset amazon_review_polarity --num-workers 8 \
    --arch vdcnn29-conv --selection-method least_confidence \
    --precomputed-selection $fasttext_path --eval-target-at 1440000

Core-set Selection

Core-set selection techniques start with a large labeled or unlabeled dataset and aim to find a small subset that accurately approximates the full dataset by selecting representative examples. The commands below demonstrate how to perform core-set selection on CIFAR10, CIFAR100, ImageNet, Amazon Review Polarity and Amazon Review Full with a variety of models and selection methods.
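
As one example of a selection method, the forgetting_events option used below refers to counting how often an example is "forgotten" during training, that is, how often it goes from correctly classified in one epoch to misclassified in the next. The sketch below illustrates that counting on toy data. The function name and data layout are hypothetical, and details such as how never-learned examples are ranked may differ in the actual implementation.

```python
def count_forgetting_events(correct_history):
    """Count transitions from correct (True) to incorrect (False) between consecutive epochs."""
    return sum(1 for prev, cur in zip(correct_history, correct_history[1:]) if prev and not cur)


# Per-example record of whether the prediction was correct at each epoch (toy data).
histories = {
    "a": [False, True, True, True],   # learned once, never forgotten
    "b": [True, False, True, False],  # forgotten twice
    "c": [False, True, False, True],  # forgotten once
}

counts = {name: count_forgetting_events(h) for name, h in histories.items()}
print(counts)  # {'a': 0, 'b': 2, 'c': 1}

# Keep the examples that were forgotten most often.
subset = sorted(counts, key=counts.get, reverse=True)[:2]
print(subset)  # ['b', 'c']
```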

CIFAR10 and CIFAR100

Baseline Approach
# Perform core-set selection with an oracle that uses ResNet164 for both selection and the final predictions.
python -m svp.cifar coreset --dataset cifar10 --arch preact164 --num-workers 4 \
    --subset 25000 --selection-method forgetting_events
Selection via Proxy
# Perform core-set selection with ResNet20 selecting for ResNet164.
python -m svp.cifar coreset --dataset cifar10 --arch preact164 --num-workers 4 \
    --subset 25000 --selection-method forgetting_events \
    --proxy-arch preact20

To train the proxy for fewer epochs, use the --proxy-* options as shown below:

# Perform core-set selection with ResNet20 after only 50 epochs.
python -m svp.cifar coreset --dataset cifar10 --arch preact164 --num-workers 4 \
    --subset 25000 --selection-method forgetting_events \
    --proxy-arch preact20 \
    --proxy-learning-rate 0.01 --proxy-epochs 1 \
    --proxy-learning-rate 0.1 --proxy-epochs 45 \
    --proxy-learning-rate 0.01 --proxy-epochs 4

ImageNet

Baseline Approach
# Perform core-set selection with an oracle that uses ResNet50 for both selection and the final predictions.
python -m svp.imagenet coreset --datasets-dir '/path/to/data' --arch resnet50 --num-workers 20 \
    --subset 768700 --selection-method forgetting_events
Selection via Proxy
# Perform core-set selection with ResNet18 selecting for ResNet50.
python -m svp.imagenet coreset --datasets-dir '/path/to/data' --arch resnet50 --num-workers 20 \
    --subset 768700 --selection-method forgetting_events \
    --proxy-arch resnet18 --proxy-batch-size 1028 --proxy-scale-learning-rates

Amazon Review Polarity and Full

Baseline Approach
# Perform core-set selection with an oracle that uses VDCNN29 for both selection and the final predictions.
python -m svp.amazon coreset --datasets-dir '/path/to/data' --dataset amazon_review_polarity --num-workers 8 \
    --arch vdcnn29-conv --subset 2160000  --selection-method entropy
Selection via Proxy
# Perform core-set selection with VDCNN9 selecting for VDCNN29.
python -m svp.amazon coreset --datasets-dir '/path/to/data' --dataset amazon_review_polarity --num-workers 8 \
    --arch vdcnn29-conv --subset 2160000 --selection-method entropy \
    --proxy-arch vdcnn9-maxpool

To use fastText as a proxy, install fastText 0.1.0 and replace /path/to/fastText/fasttext in the python -m svp.amazon fasttext commands below with the path to the fastText binary you created.

# For convenience, save fastText results in a separate directory
mkdir fasttext
# Perform core-set selection with fastText.
python -m svp.amazon fasttext '/path/to/fastText/fasttext' --run-dir fasttext \
    --datasets-dir '/path/to/data' --dataset amazon_review_polarity \
    --selection-method entropy --size 3600000 --size 2160000
# Get the most recent timestamp from the fasttext directory.
fasttext_path="fasttext/$(ls fasttext | sort -nr | head -n 1)"
# Use selected labeled data from fastText to train VDCNN29
python -m svp.amazon coreset --datasets-dir '/path/to/data' --dataset amazon_review_polarity --num-workers 8 \
    --arch vdcnn29-conv --precomputed-selection $fasttext_path