Data from "Datamodels: Predicting Predictions with Training Data"

Overview

Data from "Datamodels: Predicting Predictions with Training Data"

Here we provide the data used in the paper "Datamodels: Predicting Predictions with Training Data" (arXiv, Blog).

Note that all of the data below is stored on Amazon S3 using the “requester pays” option to avoid a blowup in our data transfer costs (we put estimated AWS costs below)---if you are on a budget and do not mind waiting a bit longer, please contact us at [email protected] and we can try to arrange a free (but slower) transfer.

Citation

To cite this data, please use the following BibTeX entry:

@inproceedings{ilyas2022datamodels,
  title = {Datamodels: Predicting Predictions from Training Data},
  author = {Andrew Ilyas and Sung Min Park and Logan Engstrom and Guillaume Leclerc and Aleksander Madry},
  booktitle = {ArXiv preprint arXiv:2202.00622},
  year = {2022}
}

Overview

We provide the data used in our paper to analyze two image classification datasets: CIFAR-10 and (a modified version of) FMoW.

For each dataset, the data consists of two parts:

  1. Training data for datamodeling, which consists of:
    • Training subsets or "training masks", which are the independent variables of the regression tasks; and
    • Model outputs (correct-class margins and logits), which are the dependent variables of the regression tasks.
  2. Datamodels estimated from this data using LASSO.

For each dataset, there are multiple versions of the data depending on the choice of the hyperparameter α, the subsampling fraction (this is the random fraction of training examples on which each model is trained; see Section 2 of our paper for more information).

Following table shows the number of models we trained and used for estimating datamodels (also see Table 1 in paper):

Subsampling α (%) CIFAR-10 FMoW
10 1,500,000 N/A
20 750,000 375,000
50 300,000 150,000
75 600,000 300,000

Training data

For each dataset and $\alpha$, we provide the following data:

# M is the number of models trained
/{DATASET}/data/train_masks_{PCT}pct.npy  # [M x N_train] boolean
/{DATASET}/data/test_margins_{PCT}pct.npy # [M x N_test] np.float16
/{DATASET}/data/test_margins_{PCT}pct.npy # [M x N_train] np.float16

(The files live in the Amazon S3 bucket madrylab-datamodels; we provide instructions for acces in the next section.)

Each row of the above matrices corresponds to one instance of model trained; each column corresponds to a training or test example. CIFAR-10 examples are organized in the default order; for FMoW, see here. For example, a train mask for CIFAR-10 has the shape [M x 50,000].

For CIFAR-10, we also provide the full logits for all ten classes:

/cifar/data/train_logits_{PCT}pct.npy  # [M x N_test x 10] np.float16
/cifar/data/test_logits_{PCT}pct.npy   # [M x N_test x 10] np.float16

Note that you can also compute the margins from these logits.

We include an addtional 10,000 models for each setting that we used for evaluation; the total number of models in each matrix is M as indicated in the above table plus 10,000.

Datamodels

All estimated datamodels for each split (train or test) are provided as a dictionary in a .pt file (load with torch.load):

/{DATASET}/datamodels/train_{PCT}pct.pt
/{DATASET}/datamodels/test_{PCT}pct.pt

Each dictionary contains:

  • weight: matrix of shape N_train x N, where N is either N_train or N_test depending on the group of target examples
  • bias: vector of length N, corresponding to biases for each datamodel
  • lam: vector of length N, regularization λ chosen by CV for each datamodel

Downloading

We make all of our data available via Amazon S3. Total sizes of the training data files are as follows:

Dataset, α (%) masks, margins (GB) logits (GB)
CIFAR-10, 10 245 1688
CIFAR-10, 20 123 849
CIFAR-10, 50 49 346
CIFAR-10, 75 98 682
FMoW, 20 25.4 -
FMoW, 50 10.6 -
FMoW, 75 21.2 -

Total sizes of datamodels data (the model weights) are 16.9 GB for CIFAR-10 and 0.75 GB for FMoW.

API

You can download them using the Amazon S3 CLI interface with the requester pays option as follows (replacing the fields {...} as appropriate):

aws s3api get-object --bucket madrylab-datamodels \
                     --key {DATASET}/data/{SPLIT}_{DATA_TYPE}_{PCT}.npy \
                     --request-payer requester \
                     [OUT_FILE]

For example, to retrieve the test set margins for CIFAR-10 models trained on 50% subsets, use:

aws s3api get-object --bucket madrylab-datamodels \
                     --key cifar/data/test_margins_50pct.npy \
                     --request-payer requester \
                     test_margins_50pct.npy

Pricing

The total data transfer fee (from AWS to internet) for all of the data is around $374 (= 4155 GB x 0.09 USD per GB).

If you only download everything except for the logits (which is sufficient to reproduce all of our analysis), the fee is around $53.

Loading data

The data matrices are in numpy array format (.npy). As some of these are quite large, you can read small segments without reading the entire file into memory by additionally specifying the mmap_mode argument in np.load:

X = np.load('train_masks_10pct.npy', mmap_mode='r')
Y = np.load('test_margins_10pct.npy', mmap_mode='r')
...
# Use segments, e.g, X[:100], as appropriate
# Run regress(X, Y[:]) using choice of estimation algorithm.

FMoW data

We use a customized version of the FMoW dataset from WILDS (derived from this original dataset) that restricts the year of the training set to 2012. Our code is adapted from here.

To use the dataset, first download WILDS using:

pip install wilds

(see here for more detailed instructions).

In our paper, we only use the in-distribution training and test splits in our analysis (the original version from WILDS also has out-of-distribution as well as validation splits). Our dataset splits can be constructed as follows and used like a PyTorch dataset:

from fmow import FMoWDataset

ds = FMoWDataset(root_dir='/mnt/nfs/datasets/wilds/',
                     split_scheme='time_after_2016')

transform_steps = [
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ]
transform = transforms.Compose(transform_steps)

ds_train = ds.get_subset('train', transform=transform)
ds_test = ds.get_subset('id_test', transform=transform)

The columns of matrix data described above is ordered according to the default ordering of examples given by the above constructors.

Owner
Madry Lab
Towards a Principled Science of Deep Learning
Madry Lab
Winning solution for the Galaxy Challenge on Kaggle

Winning solution for the Galaxy Challenge on Kaggle

Sander Dieleman 483 Jan 02, 2023
Transform ML models into a native code with zero dependencies

m2cgen (Model 2 Code Generator) - is a lightweight library which provides an easy way to transpile trained statistical models into a native code

Bayes' Witnesses 2.3k Jan 03, 2023
A Microsoft Azure Web App project named Covid 19 Predictor using Machine learning Model

A Microsoft Azure Web App project named Covid 19 Predictor using Machine learning Model (Random Forest Classifier Model ) that helps the user to identify whether someone is showing positive Covid sym

Priyansh Sharma 2 Oct 06, 2022
LibTraffic is a unified, flexible and comprehensive traffic prediction library based on PyTorch

LibTraffic is a unified, flexible and comprehensive traffic prediction library, which provides researchers with a credibly experimental tool and a convenient development framework. Our library is imp

432 Jan 05, 2023
PennyLane is a cross-platform Python library for differentiable programming of quantum computers

PennyLane is a cross-platform Python library for differentiable programming of quantum computers. Train a quantum computer the same way as a neural ne

PennyLaneAI 1.6k Jan 01, 2023
BASTA: The BAyesian STellar Algorithm

BASTA: BAyesian STellar Algorithm Current stable version: v1.0 Important note: BASTA is developed for Python 3.8, but Python 3.7 should work as well.

BASTA team 16 Nov 15, 2022
scikit-multimodallearn is a Python package implementing algorithms multimodal data.

scikit-multimodallearn is a Python package implementing algorithms multimodal data. It is compatible with scikit-learn, a popul

12 Jun 29, 2022
Kaggle Tweet Sentiment Extraction Competition: 1st place solution (Dark of the Moon team)

Kaggle Tweet Sentiment Extraction Competition: 1st place solution (Dark of the Moon team)

Artsem Zhyvalkouski 64 Nov 30, 2022
Made in collaboration with Chris George for Art + ML Spring 2019.

Deepdream Eyes Made in collaboration with Chris George for Art + ML Spring 2019.

Francisco Cabrera 1 Jan 12, 2022
Tutorials, examples, collections, and everything else that falls into the categories: pattern classification, machine learning, and data mining

**Tutorials, examples, collections, and everything else that falls into the categories: pattern classification, machine learning, and data mining.** S

Sebastian Raschka 4k Dec 30, 2022
flexible time-series processing & feature extraction

A corona statistics and information telegram bot.

PreDiCT.IDLab 206 Dec 28, 2022
PyHarmonize: Adding harmony lines to recorded melodies in Python

PyHarmonize: Adding harmony lines to recorded melodies in Python About To use this module, the user provides a wav file containing a melody, the key i

Julian Kappler 2 May 20, 2022
Deploy AutoML as a service using Flask

AutoML Service Deploy automated machine learning (AutoML) as a service using Flask, for both pipeline training and pipeline serving. The framework imp

Chris Rawles 221 Nov 04, 2022
healthy and lesion models for learning based on the joint estimation of stochasticity and volatility

health-lesion-stovol healthy and lesion models for learning based on the joint estimation of stochasticity and volatility Reference please cite this p

5 Nov 01, 2022
Automated Machine Learning Pipeline with Feature Engineering and Hyper-Parameters Tuning

The mljar-supervised is an Automated Machine Learning Python package that works with tabular data. I

MLJAR 2.4k Jan 02, 2023
A simple python program which predicts the success of a movie based on it's type, actor, actress and director

Movie-Success-Prediction A simple python program which predicts the success of a movie based on it's type, actor, actress and director. The program us

Mahalinga Prasad R N 1 Dec 17, 2021
ArviZ is a Python package for exploratory analysis of Bayesian models

ArviZ (pronounced "AR-vees") is a Python package for exploratory analysis of Bayesian models. Includes functions for posterior analysis, data storage, model checking, comparison and diagnostics

ArviZ 1.3k Jan 05, 2023
STUMPY is a powerful and scalable Python library for computing a Matrix Profile, which can be used for a variety of time series data mining tasks

STUMPY STUMPY is a powerful and scalable library that efficiently computes something called the matrix profile, which can be used for a variety of tim

TD Ameritrade 2.5k Jan 06, 2023
A comprehensive repository containing 30+ notebooks on learning machine learning!

A comprehensive repository containing 30+ notebooks on learning machine learning!

Jean de Dieu Nyandwi 3.8k Jan 09, 2023
Stock Price Prediction Bank Jago Using Facebook Prophet Machine Learning & Python

Stock Price Prediction Bank Jago Using Facebook Prophet Machine Learning & Python Overview Bank Jago has attracted investors' attention since the end

Najibulloh Asror 3 Feb 10, 2022