The official implementation of the paper, "SubTab: Subsetting Features of Tabular Data for Self-Supervised Representation Learning"

Overview

SubTab:

Author: Talip Ucar ([email protected])

The official implementation of the paper,

SubTab: Subsetting Features of Tabular Data for Self-Supervised Representation Learning

PWC

Table of Contents:

  1. Model
  2. Environment
  3. Data
  4. Configuration
  5. Training and Evaluation
  6. Adding New Datasets
  7. Results
  8. Experiment tracking
  9. Citing the paper
  10. Citing this repo

Model

SubTab

Click for a slower version of the animation

SubTab

Environment

We used Python 3.7 for our experiments. The environment can be set up by following three steps:

pip install pipenv             # To install pipenv if you don't have it already
pipenv install --skip-lock     # To install required packages. 
pipenv shell                   # To activate virtual env

If the second step results in issues, you can install packages in Pipfile individually by using pip i.e. "pip install package_name".

Data

MNIST dataset is already provided to demo the framework. For your own dataset, follow the instructions in Adding New Datasets.

Configuration

There are two types of configuration files:

1. runtime.yaml
2. mnist.yaml
  1. runtime.yaml is a high-level configuration file used by all datasets to:

    • define the random seed
    • turn on/off mlflow (Default: False)
    • turn on/off python profiler (Default: False)
    • set data directory
    • set results directory
  2. Second configuration file is dataset-specific and is used to configure the architecture of the model, loss functions, and so on.

    • For example, we set up a configuration file for MNIST dataset with the same name. Please note that the name of the configuration file should be same as name of the dataset with all letters in lowercase.
    • We can have configuration files for other datasets such as tcga.yaml and income.yaml for tcga and income datasets respectively.

Training and Evaluation

You can train and evaluate the model by using:

python train.py # For training
python eval.py  # For evaluation
  • train.py will also run evaluation at the end of the training.
  • You can also run evaluation separately by using eval.py.

Adding New Datasets

For each new dataset, you can use the following steps:

  1. Provide a _load_dataset_name() function, similar to MNIST load function

    • For example, you can add _load_tcga() for tcga dataset, or _load_income() for income dataset.
    • The function should return (x_train, y_train, x_test, y_test)
  2. Add a separate elif condition in this section within _load_data() method of TabularDataset() class in utils/load_data.py

  3. Create a new config file with the same name as dataset name.

    • For example, tcga.yaml for tcga dataset, or income.yaml for income dataset.

    • You can also duplicate one of the existing configuration files (e.g. mnist.yaml), and re-name it.

    • Make sure that the new config file is under config/ directory.

  4. Provide data folder with pre-processed training and test set, and place it under ./data/ directory. You can also do train-test split and pre-processing within your custom _load_dataset_name() function.

  5. (Optional) If you want to place the new dataset under a different directory than the local "./data/", then:

    • Place the dataset folder anywhere, and define the root directory to it in this line of /config/runtime.yaml.

    • For example, if the path to tcga dataset is /home/.../data/tcga/, you only need to include /home/.../data/ in runtime.yaml. The code will fill in tcga folder name from the name given in the command line argument (e.g. -d dataset_name. In this case, dataset_name would be tcga).

Structure of the repo

- train.py
- eval.py

- src
    |-model.py
    
- config
    |-runtime.yaml
    |-mnist.yaml
    
- utils
    |-load_data.py
    |-arguments.py
    |-model_utils.py
    |-loss_functions.py
    ...
    
- data
    |-mnist
    ...
    
- results
    |
    ...

Results

Results at the end of training is saved under ./results directory. Results directory structure is as following:

- results
    |-dataset name
            |-evaluation
                |-clusters (for plotting t-SNE and PCA plots of embeddings)
                |-reconstructions (not used)
            |-training
                |-model_mode (e.g. ae for autoencoder)   
                     |-model
                     |-plots
                     |-loss

You can save results of evaluations under "evaluation" folder.

Experiment tracking

MLFlow is used to track experiments. It is turned off by default, but can be turned on by changing option on this line in runtime config file in ./config/runtime.yaml

Citing the paper

@article{ucar2021subtab,
  title={SubTab: Subsetting Features of Tabular Data for Self-Supervised Representation Learning},
  author={Ucar, Talip and Hajiramezanali, Ehsan and Edwards, Lindsay},
  journal={arXiv preprint arXiv:2110.04361},
  year={2021}
}

Citing this repo

If you use SubTab framework in your own studies, and work, please cite it by using the following:

@Misc{talip_ucar_2021_SubTab,
  author =   {Talip Ucar},
  title =    {{SubTab: Subsetting Features of Tabular Data for Self-Supervised Representation Learning}},
  howpublished = {\url{https://github.com/AstraZeneca/SubTab}},
  month        = June,
  year = {since 2021}
}
Owner
AstraZeneca
Data and AI: Unlocking new science insights
AstraZeneca
This project deals with the detection of skin lesions within the ISICs dataset using YOLOv3 Object Detection with Darknet.

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. Skin Lesion detection using YOLO This project deal

Lalith Veerabhadrappa Badiger 1 Nov 22, 2021
A computational optimization project towards the goal of gerrymandering the results of a hypothetical election in the UK.

A computational optimization project towards the goal of gerrymandering the results of a hypothetical election in the UK.

Emma 1 Jan 18, 2022
OBBDetection: an oriented object detection toolbox modified from MMdetection

OBBDetection note: If you have questions or good suggestions, feel free to propose issues and contact me. introduction OBBDetection is an oriented obj

MIXIAOXIN_HO 3 Nov 11, 2022
an Evolutionary Algorithm assisted GAN

EvoGAN an Evolutionary Algorithm assisted GAN ckpts

3 Oct 09, 2022
Official tensorflow implementation for CVPR2020 paper “Learning to Cartoonize Using White-box Cartoon Representations”

Tensorflow implementation for CVPR2020 paper “Learning to Cartoonize Using White-box Cartoon Representations”.

3.7k Dec 31, 2022
QT Py Media Knob using rotary encoder & neopixel ring

QTPy-Knob QT Py USB Media Knob using rotary encoder & neopixel ring The QTPy-Knob features: Media knob for volume up/down/mute with "qtpy-knob.py" Cir

Tod E. Kurt 56 Dec 30, 2022
:fire: 2D and 3D Face alignment library build using pytorch

Face Recognition Detect facial landmarks from Python using the world's most accurate face alignment network, capable of detecting points in both 2D an

Adrian Bulat 6k Dec 31, 2022
Clockwork Convnets for Video Semantic Segmentation

Clockwork Convnets for Video Semantic Segmentation This is the reference implementation of arxiv:1608.03609: Clockwork Convnets for Video Semantic Seg

Evan Shelhamer 141 Nov 21, 2022
Deep-Learning-Image-Captioning - Implementing convolutional and recurrent neural networks in Keras to generate sentence descriptions of images

Deep Learning - Image Captioning with Convolutional and Recurrent Neural Nets ========================================================================

23 Apr 06, 2022
Full body anonymization - Realistic Full-Body Anonymization with Surface-Guided GANs

Realistic Full-Body Anonymization with Surface-Guided GANs This is the official

Håkon Hukkelås 30 Nov 18, 2022
MQBench: Towards Reproducible and Deployable Model Quantization Benchmark

MQBench: Towards Reproducible and Deployable Model Quantization Benchmark We propose a benchmark to evaluate different quantization algorithms on vari

494 Dec 29, 2022
Distributed DataLoader For Pytorch Based On Ray

Dpex——用户无感知分布式数据预处理组件 一、前言 随着GPU与CPU的算力差距越来越大以及模型训练时的预处理Pipeline变得越来越复杂,CPU部分的数据预处理已经逐渐成为了模型训练的瓶颈所在,这导致单机的GPU配置的提升并不能带来期望的线性加速。预处理性能瓶颈的本质在于每个GPU能够使用的C

Dalong 23 Nov 02, 2022
[ICLR'21] Counterfactual Generative Networks

This repository contains the code for the ICLR 2021 paper "Counterfactual Generative Networks" by Axel Sauer and Andreas Geiger. If you want to take the CGN for a spin and generate counterfactual ima

88 Jan 02, 2023
Implementation of "Large Steps in Inverse Rendering of Geometry"

Large Steps in Inverse Rendering of Geometry ACM Transactions on Graphics (Proceedings of SIGGRAPH Asia), December 2021. Baptiste Nicolet · Alec Jacob

RGL: Realistic Graphics Lab 274 Jan 06, 2023
Official codebase for ICLR oral paper Unsupervised Vision-Language Grammar Induction with Shared Structure Modeling

CLIORA This is the official codebase for ICLR oral paper: Unsupervised Vision-Language Grammar Induction with Shared Structure Modeling. We introduce

Bo Wan 32 Dec 23, 2022
Pre-trained models for a Cascaded-FCN in caffe and tensorflow that segments

Cascaded-FCN This repository contains the pre-trained models for a Cascaded-FCN in caffe and tensorflow that segments the liver and its lesions out of

300 Nov 22, 2022
RetinaFace: Deep Face Detection Library in TensorFlow for Python

RetinaFace is a deep learning based cutting-edge facial detector for Python coming with facial landmarks.

Sefik Ilkin Serengil 512 Dec 29, 2022
Traductor de lengua de señas al español basado en Python con Opencv y MedaiPipe

Traductor de señas Traductor de lengua de señas al español basado en Python con Opencv y MedaiPipe Requerimientos 🔧 Python 3.8 o inferior para evitar

Jahaziel Hernandez Hoyos 3 Nov 12, 2022
A Pytorch Implementation of a continuously rate adjustable learned image compression framework.

GainedVAE A Pytorch Implementation of a continuously rate adjustable learned image compression framework, Gained Variational Autoencoder(GainedVAE). N

39 Dec 24, 2022
Study of human inductive biases in CNNs and Transformers.

Are Convolutional Neural Networks or Transformers more like human vision? This repository contains the code and fine-tuned models of popular Convoluti

Shikhar Tuli 39 Dec 08, 2022