Benchmark for Tuning Accuracy and Efficiency

Overview

The benchmark includes our efforts in using Colossal-AI to train different tasks to achieve SOTA results. We are interested in both validataion accuracy and training speed, and prefer larger batch size to take advantage of more GPU devices. For example, we trained vision transformer with batch size 512 on CIFAR10 and 4096 on ImageNet1k, which are basically not used in existing works. Some of the results in the benchmark trained with 8x A100 are shown below.

Task	Model	Training Time	Top-1 Accuracy
CIFAR10	ViT-Lite-7/4	~ 16 min	~ 90.5%
ImageNet1k	ViT-S/16	~ 16.5 h	~ 74.5%

The train.py script in each task runs training with the specific configuration script in configs/ for different parallelisms. Supported parallelisms include data parallel only (ends with vanilla), 1D (ends with 1d), 2D (ends with 2d), 2.5D (ends with 2p5d), 3D (ends with 3d).

Each configuration scripts basically includes the following elements, taking ImageNet1k task as example:

TOTAL_BATCH_SIZE = 4096
LEARNING_RATE = 3e-3
WEIGHT_DECAY = 0.3

NUM_EPOCHS = 300
WARMUP_EPOCHS = 32

# data parallel only
TENSOR_PARALLEL_SIZE = 1    
TENSOR_PARALLEL_MODE = None

# parallelism setting
parallel = dict(
    pipeline=1,
    tensor=dict(mode=TENSOR_PARALLEL_MODE, size=TENSOR_PARALLEL_SIZE),
)

fp16 = dict(mode=AMP_TYPE.TORCH, ) # amp setting

gradient_accumulation = 2 # accumulate 2 steps for gradient update

BATCH_SIZE = TOTAL_BATCH_SIZE // gradient_accumulation # actual batch size for dataloader

clip_grad_norm = 1.0 # clip gradient with norm 1.0

Upper case elements are basically what train.py needs, and lower case elements are what Colossal-AI needs to initialize the training.

Usage

To start training, use the following command to run each worker:

$ DATA=/path/to/dataset python train.py --world_size=WORLD_SIZE \
                                        --rank=RANK \
                                        --local_rank=LOCAL_RANK \
                                        --host=MASTER_IP_ADDRESS \
                                        --port=MASTER_PORT \
                                        --config=CONFIG_FILE

It is also recommended to start training with torchrun as:

$ DATA=/path/to/dataset torchrun --nproc_per_node=NUM_GPUS_PER_NODE \
                                 --nnodes=NUM_NODES \
                                 --node_rank=NODE_RANK \
                                 --master_addr=MASTER_IP_ADDRESS \
                                 --master_port=MASTER_PORT \
                                 train.py --config=CONFIG_FILE

ColossalAI-Benchmark - Performance benchmarking with ColossalAI

Related tags

Overview

Benchmark for Tuning Accuracy and Efficiency

Overview

Usage

Owner

HPC-AI Tech

Quantile Regression DQN a Minimal Working Example, Distributional Reinforcement Learning with Quantile Regression

Reproduction process of AlexNet

RAMA: Rapid algorithm for multicut problem

PyAF is an Open Source Python library for Automatic Time Series Forecasting built on top of popular pydata modules.

Demonstrates how to divide a DL model into multiple IR model files (division) and introduce a simplest way to implement a custom layer works with OpenVINO IR models.

Simulation of moving particles under microscopic imaging

Pytorch Implementation of "Diagonal Attention and Style-based GAN for Content-Style disentanglement in image generation and translation" (ICCV 2021)

Bayesian Neural Networks in PyTorch

BEAMetrics: Benchmark to Evaluate Automatic Metrics in Natural Language Generation

Double pendulum simulator using a symplectic Euler's method and Hamiltonian mechanics

Robust Partial Matching for Person Search in the Wild

Inference pipeline for our participation in the FeTA challenge 2021.

OBBDetection is a oriented object detection library, which is based on MMdetection.

lightweight python wrapper for vowpal wabbit

GNEE - GAT Neural Event Embeddings

A clear, concise, simple yet powerful and efficient API for deep learning.

Official implementation of "Can You Spot the Chameleon? Adversarially Camouflaging Images from Co-Salient Object Detection" in CVPR 2022.

The project page of paper: Architecture disentanglement for deep neural networks [ICCV 2021, oral]

sequitur is a library that lets you create and train an autoencoder for sequential data in just two lines of code

Improving Generalization Bounds for VC Classes Using the Hypergeometric Tail Inversion