BERT model training impelmentation using 1024 A100 GPUs for MLPerf Training v1.1

Overview

Pre-trained checkpoint and bert config json file

  1. Location of checkpoint and bert config json file

    This MLCommons members Google Drive location contains these files.

    • TensorFlow checkpoint (tf1_ckpt) containing the pre-trained weights.
    • Config file (bert_config.json) which specifies the hyperparameters of the model.
  2. Checkpoint conversion

python convert_tf_checkpoint.py --tf_checkpoint <path/to/checkpointdir_phase1/model.ckpt-28252.index> --bert_config_path <path/to/checkpointdir_phase1/bert_config.json> --output_checkpoint model.ckpt-28252.pt

Download and preprocess datasets

  1. Download dataset and generate the TFRecords for training data and eval data

    BERT Wikipedia dataset preparation

  2. Convert training data and eval data from TFRecords to HDF5

    TF_INPUT_DIR=<path/to/tfrecord_input_dir> HDF5_OUTPUT_DIR=<path/to/hdf5_output_dir> ./run_trans_tfrecord_to_hdf5.sh
  3. 4bins training data

    We split dataset to enable data-load balacning and it can reduce communication overhead.

    Based on the sequence length distribution, split HDF5 training data into 4 part:

    part 1: 0 < sequence length <= 128

    part 2: 128 < sequence length <= 256

    part 3: 256 < sequence length <= 384

    part 4: 384 < sequence length <= 512

    The output_dir contains 4 sub-directories 128, 256, 384 and 512.

cd cleanup_scripts
python run_split_and_chop_hdf5_files.py --input_dir=<path/to/hdf5_datadir> --output_dir=<path/to/4bins_training_datadir>

Prepare the environment

  • Create a virtualenv and install the required packages:
virtualenv venv -p python3.8.7
source venv/bin/activate
pip install -r requirements.txt

# Install mlperf-logging Python package
git clone https://github.com/mlperf/logging.git mlperf-logging
pip install -e mlperf-logging

# Install apex
git clone https://github.com/NVIDIA/apex.git
cd apex
git reset --hard d06404fecab73f152c6cbb89ac2c2e9b7fc24124
git submodule update --init --recursive
git apply ../patch_for_mlperf_trining_v1.1_by_samsung.patch
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" --global-option="--distributed_adam" --global-option="--distributed_lamb" --global-option="--bnp" --global-option="--xentropy" --global-option="--fast_layer_norm" --global-option="--deprecated_fused_adam"  --global-option="--fmha"  --global-option="--fast_multihead_attn" ./

# Compile mhalib
cd mhalib
python setup.py build
cp build/lib*/mhalib* ../
  • Other software requirements
Softeware Version
python 3.8.7
pytorch 1.9.1
NCCL 2.9.9
CUDA 11.3.0
cudnn 8.2.1.32
cublas 11.4.2
nvidia driver 470.57.02
mofed version 5.4-1.0.3

Run the model

  1. Set hosts address in run_multinode.sh
export hosts=('192.168.16.1' '192.168.16.2')
  1. Launch the training

    Use the following command to run the config_Samsung_Supercomputer21_DGXA100_128x8x16x1.sh in python virtual environment.

PYTHON=<path/to/python> DGXSYSTEM=Samsung_Supercomputer21_DGXA100_128x8x16x1 INPUT_DIR=<path/to/4bins_training_datadir> EVAL_DIR=<path/to/eval_datadir> CHECKPOINTDIR_PHASE1=<path/to/checkpointdir_phase1> NEXP=10 ./run_multinode.sh

Appendix

Our source code is based on MLPerf BERT v0.7, and all the files newly added and modified are as follows.

File Name Status Description
config_Samsung_Supercomputer21_DGXA100_128x8x16x1.sh Newly added The file contains configurations used for 1024 GPUs experiment.
run_split_and_chop_hdf5_files.py Newly added The file is used for generating 4-bin training data.
mhalib/setup.py Modified The file is modified since CUDA upgraded.
optim/init.py Newly added The file is used as the entrance of "optim" module.
optim/acclip.py Newly added The file implements ACClip optimizer for trial.
optim/madgrad.py Newly added The file implements MADGRAD optimizer for trial.
bind_launch.py Newly added The file is added for BERT training on python environment.
bind_pyt.py Modified The file is modified for the following items.
(1) Log compliance;
(2) Add new NUMA binding.
fmha.py Newly added The file is used for adding FMHA operator (refer to MLPerf v1.0).
mlperf_logger.py Modified The file is modified for log compliance.
modeling.py Modified The file is modified for adding FMHA (refer to MLPerf v1.0).
padding.py Modified The file is modified for adding FMHA (refer to MLPerf v1.0).
README.md Modified It is modified to run Samsung optimized implematation.
requirements.txt Modified The file shows required software version.
run_multinode.sh Newly added The file is startup script about how to run BERT training on python environment
run_pretraining.py Modified The file is modified for the following items.
(1) Load splitting training data;
(2) Add exchange padding feature (refer to MLPerf v1.0);
(3) Add NCCL warmup (refer to MLPerf v1.0);
(4) Add SAIT local/group exchange padding;
(5) Add NCCL warmup for group exchange padding;
(6) Add per-device local gradient clipping before all-reduce;
(7) Add pytorch DDP.
schedulers.py Modified The file is modified for optimizing learning rate scheduler
utils.py Modified The file is modified for the following items.
(1) Add get_optimzer() interface;
(2) Add a batch sampler (SplitRandomSampler) for 4-bin splitting training data.
Owner
SAIT (Samsung Advanced Institute of Technology)
SAIT (Samsung Advanced Institute of Technology)
Finding all things on-prem Microsoft for password spraying and enumeration.

msprobe About Installing Usage Examples Coming Soon Acknowledgements About Finding all things on-prem Microsoft for password spraying and enumeration.

205 Jan 09, 2023
Implementation of the paper ''Implicit Feature Refinement for Instance Segmentation''.

Implicit Feature Refinement for Instance Segmentation This repository is an official implementation of the ACM Multimedia 2021 paper Implicit Feature

Lufan Ma 17 Dec 28, 2022
SWA Object Detection

SWA Object Detection This project hosts the scripts for training SWA object detectors, as presented in our paper: @article{zhang2020swa, title={SWA

237 Nov 28, 2022
Official repository for "Intriguing Properties of Vision Transformers" (2021)

Intriguing Properties of Vision Transformers Muzammal Naseer, Kanchana Ranasinghe, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, & Ming-Hsuan Yang P

Muzammal Naseer 155 Dec 27, 2022
Python package for downloading ECMWF reanalysis data and converting it into a time series format.

ecmwf_models Readers and converters for data from the ECMWF reanalysis models. Written in Python. Works great in combination with pytesmo. Citation If

TU Wien - Department of Geodesy and Geoinformation 31 Dec 26, 2022
Portfolio asset allocation strategies: from Markowitz to RNNs

Portfolio asset allocation strategies: from Markowitz to RNNs Research project to explore different approaches for optimal portfolio allocation starti

Luigi Filippo Chiara 1 Feb 05, 2022
The Medical Detection Toolkit contains 2D + 3D implementations of prevalent object detectors such as Mask R-CNN, Retina Net, Retina U-Net, as well as a training and inference framework focused on dealing with medical images.

The Medical Detection Toolkit contains 2D + 3D implementations of prevalent object detectors such as Mask R-CNN, Retina Net, Retina U-Net, as well as a training and inference framework focused on dea

MIC-DKFZ 1.2k Jan 04, 2023
Official repository of "DeepMIH: Deep Invertible Network for Multiple Image Hiding", TPAMI 2022.

DeepMIH: Deep Invertible Network for Multiple Image Hiding (TPAMI 2022) This repo is the official code for DeepMIH: Deep Invertible Network for Multip

Junpeng Jing 67 Nov 22, 2022
《Lerning n Intrinsic Grment Spce for Interctive Authoring of Grment Animtion》

Learning an Intrinsic Garment Space for Interactive Authoring of Garment Animation Overview This is the demo code for training a motion invariant enco

YuanBo 213 Dec 14, 2022
Official code for the CVPR 2021 paper "How Well Do Self-Supervised Models Transfer?"

How Well Do Self-Supervised Models Transfer? This repository hosts the code for the experiments in the CVPR 2021 paper How Well Do Self-Supervised Mod

Linus Ericsson 157 Dec 16, 2022
Multi-view 3D reconstruction using neural rendering. Unofficial implementation of UNISURF, VolSDF, NeuS and more.

Volume rendering + 3D implicit surface Showcase What? previous: surface rendering; now: volume rendering previous: NeRF's volume density; now: implici

Jianfei Guo 682 Jan 04, 2023
A tool for making map images from OpenTTD save games

OpenTTD Surveyor A tool for making map images from OpenTTD save games. This is not part of the main OpenTTD codebase, nor is it ever intended to be pa

Aidan Randle-Conde 9 Feb 15, 2022
A convolutional recurrent neural network for classifying A/B phases in EEG signals recorded for sleep analysis.

CAP-Classification-CRNN A deep learning model based on Inception modules paired with gated recurrent units (GRU) for the classification of CAP phases

Apurva R. Umredkar 2 Nov 25, 2022
IJON is an annotation mechanism that analysts can use to guide fuzzers such as AFL.

IJON SPACE EXPLORER IJON is an annotation mechanism that analysts can use to guide fuzzers such as AFL. Using only a small (usually one line) annotati

Chair for Sys­tems Se­cu­ri­ty 146 Dec 16, 2022
Seasonal Contrast: Unsupervised Pre-Training from Uncurated Remote Sensing Data

Seasonal Contrast: Unsupervised Pre-Training from Uncurated Remote Sensing Data This is the official PyTorch implementation of the SeCo paper: @articl

ElementAI 101 Dec 12, 2022
(3DV 2021 Oral) Filtering by Cluster Consistency for Large-Scale Multi-Image Matching

Scalable Cluster-Consistency Statistics for Robust Multi-Object Matching (3DV 2021 Oral Presentation) Filtering by Cluster Consistency (FCC) is a very

Yunpeng Shi 11 Sep 28, 2022
Minimal implementation of PAWS (https://arxiv.org/abs/2104.13963) in TensorFlow.

PAWS-TF 🐾 Implementation of Semi-Supervised Learning of Visual Features by Non-Parametrically Predicting View Assignments with Support Samples (PAWS)

Sayak Paul 43 Jan 08, 2023
DeepMReye: magnetic resonance-based eye tracking using deep neural networks

DeepMReye: magnetic resonance-based eye tracking using deep neural networks

73 Dec 21, 2022
A simple interface for editing natural photos with generative neural networks.

Neural Photo Editor A simple interface for editing natural photos with generative neural networks. This repository contains code for the paper "Neural

Andy Brock 2.1k Dec 29, 2022