Simple tutorials on PyTorch DDP training

Overview

pytorch-distributed-training

Distributed Data Parallel (DDP) Training on PyTorch

Features

Good Notes

A collection of high-quality notes found online.

TODO

  • Finish the notes on the DP and DDP source code (currently 50% done)
  • Revise code details and reproduce the experimental results

Quick start

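To run the examples directly and see the results, execute the commands below. Be sure to pass --ip and --port to specify the host IP address and a free port; otherwise the scripts may fail to run.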

$ python dataparallel.py --gpu 0,1,2,3
$ CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 distributed.py
$ CUDA_VISIBLE_DEVICES=0,1,2,3 python distributed_mp.py
$ CUDA_VISIBLE_DEVICES=0,1,2,3 python distributed_apex.py
  • --ip=str, e.g. --ip='10.24.82.10', sets the IP address of the main process
  • --port=int, e.g. --port=23456, sets the port used for startup
  • --batch_size=int, e.g. --batch_size=128, sets the training batch size

  • distributed_gradient_accumulation.py

$ CUDA_VISIBLE_DEVICES=0,1,2,3 python distributed_gradient_accumulation.py
  • --ip=str, e.g. --ip='10.24.82.10', sets the IP address of the main process
  • --port=int, e.g. --port=23456, sets the port used for startup
  • --grad_accu_steps=int, e.g. --grad_accu_steps=4, sets the number of gradient accumulation steps
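
The idea behind --grad_accu_steps can be illustrated without any GPU. The sketch below is a pure-Python toy, not the code in distributed_gradient_accumulation.py: it assumes a one-parameter model y = w * x with a mean squared error loss, and the helper names (mse_grad, accumulated_grad) and the data are hypothetical. It shows that summing micro-batch gradients, each scaled by 1 / grad_accu_steps, reproduces the full-batch gradient when the micro-batches have equal size.

```python
import math


def mse_grad(w, xs, ys):
    """Gradient of mean((w * x - y)^2) with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w, xs, ys, grad_accu_steps):
    """Accumulate gradients over equally sized micro-batches."""
    assert len(xs) % grad_accu_steps == 0, "toy example needs equal micro-batches"
    micro = len(xs) // grad_accu_steps
    total = 0.0
    for step in range(grad_accu_steps):
        lo, hi = step * micro, (step + 1) * micro
        # Scale each micro-batch gradient so the sum matches the full-batch mean.
        total += mse_grad(w, xs[lo:hi], ys[lo:hi]) / grad_accu_steps
    return total


# Made-up data for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1, 14.0, 16.2]
full = mse_grad(0.5, xs, ys)
accu = accumulated_grad(0.5, xs, ys, grad_accu_steps=4)
print(math.isclose(full, accu))
```

In a real training loop the same effect is usually obtained by dividing the loss by the number of accumulation steps and calling optimizer.step() and optimizer.zero_grad() only once every grad_accu_steps iterations.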

Comparison

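These results are not precise; they can vary considerably depending on the state of the GPUs.

SyncBatchNorm is used by default in all cases. This slows execution somewhat, because computing the BatchNorm statistics requires extra communication between processes, but it helps preserve accuracy.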
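
To see where the extra communication comes from, consider what has to be exchanged. The following is a simplified pure-Python illustration, not PyTorch's actual SyncBatchNorm implementation: each simulated process contributes its local count, sum, and sum of squares (the function names and data are hypothetical), and combining them yields the mean and variance of the whole global batch rather than of each per-GPU shard.

```python
import math


def local_stats(values):
    """Per-process partial statistics: count, sum, sum of squares."""
    return len(values), sum(values), sum(v * v for v in values)


def global_mean_var(partials):
    """Combine partial statistics, as a sum all-reduce would."""
    n = sum(p[0] for p in partials)
    s = sum(p[1] for p in partials)
    sq = sum(p[2] for p in partials)
    mean = s / n
    var = sq / n - mean * mean  # biased variance over the global batch
    return mean, var


# One list per simulated process (made-up activations).
shards = [[1.0, 2.0], [3.0, 4.0], [10.0, 12.0], [0.0, 0.5]]
mean, var = global_mean_var([local_stats(s) for s in shards])
print(mean, var)
```

Without synchronization each process would normalize with the statistics of its own shard only, which in this toy data differ widely from one shard to the next.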

Concepts

  • apex
  • DP: DataParallel
  • DDP: DistributedDataParallel

Environments

  • 4 × 2080Ti

model     dataset   training method  time (seconds/epoch)  Top-1 accuracy
resnet18  cifar100  DP               20                    -
resnet18  cifar100  DP+apex          18                    -
resnet18  cifar100  DDP              16                    -
resnet18  cifar100  DDP+apex         14.5                  -
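
For a rough sense of scale, the epoch times can be turned into speedups relative to plain DP. This is only arithmetic on the numbers reported above, which the author describes as imprecise, so the ratios should be read as indicative at best.

```python
# Seconds per epoch, copied from the table above.
epoch_seconds = {"DP": 20.0, "DP+apex": 18.0, "DDP": 16.0, "DDP+apex": 14.5}

baseline = epoch_seconds["DP"]
speedup = {name: baseline / t for name, t in epoch_seconds.items()}

for name, factor in speedup.items():
    print(f"{name:>8}: {factor:.2f}x")
```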

Basic Concept

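  • group: a process group. By default there is only one.
  • world size: the total number of processes across all machines
    • e.g. 16 GPUs with one process per GPU: world size = 16
    • 8 GPUs driven by a single process: world size = 1
    • the program starts running only once the number of connected processes equals world size
  • rank: the index of a process, used for inter-process communication; it also indicates priority, and rank=0 is the main process
  • local_rank: the GPU index within a process. It is not an explicit argument; torch.distributed.launch sets it internally. For example, rank=3, local_rank=0 refers to the first GPU of process 3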
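
A small sketch may help tie these terms together. It assumes the common layout of one process per GPU with the same number of processes on every machine, so that local_rank is the index of a process (and its GPU) on its own machine. The function name and its parameters are hypothetical and only mirror the terms above.

```python
def global_rank(node_rank, nproc_per_node, local_rank):
    """Global rank of a process, assuming equal process counts per machine."""
    return node_rank * nproc_per_node + local_rank


nnodes, nproc_per_node = 2, 8
world_size = nnodes * nproc_per_node  # 16 GPUs, one process each

# Enumerate every (machine, local index) pair and compute its global rank.
ranks = [
    global_rank(node, nproc_per_node, local)
    for node in range(nnodes)
    for local in range(nproc_per_node)
]
print(world_size, ranks)
```

On a single machine node_rank is 0, so rank and local_rank coincide, which is why the single-machine code in this tutorial passes local_rank as rank.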

Usage: single machine, multiple GPUs

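1. Get the index of the current process

PyTorch can run a .py file in a distributed fashion from the command line via the torch.distributed.launch launcher. During execution, the launcher passes the index of the current process to the Python script as a command-line argument.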

import argparse
parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', default=-1, type=int,
                    help='node rank for distributed training')
args = parser.parse_args()
print(args.local_rank)
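
The launcher's behaviour can be imitated by passing the argument list by hand. In this sketch the value 2 is made up for illustration, and parse_local_rank is a hypothetical helper. The fallback to the LOCAL_RANK environment variable is an addition that is not in the original script; newer launchers such as torchrun pass the index that way instead of as a command-line flag.

```python
import argparse
import os


def parse_local_rank(argv):
    """Read --local_rank from argv, falling back to the LOCAL_RANK env var."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--local_rank', default=-1, type=int,
                        help='index of this process on the current machine')
    args = parser.parse_args(argv)
    if args.local_rank == -1:
        # Flag not given: use the environment variable if it is set.
        args.local_rank = int(os.environ.get('LOCAL_RANK', -1))
    return args.local_rank


# What the script sees when the launcher appends --local_rank=2.
print(parse_local_rank(['--local_rank=2']))
```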

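2. Define the main_worker function

The main training logic lives in the main_worker function, which takes three parameters (the last one is optional):

def main_worker(local_rank, nprocs, args):
    ...  # training logic goes here
  • local_rank: receives the rank of the current process; on a single machine with multiple GPUs it corresponds to the GPU index to use
  • nprocs: the number of processes
  • args: any extra user-defined arguments

main_worker is the function that each process runs. Every process executes the same function body; only the local_rank passed in differs.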

3. Overall flow inside main_worker

The complete training flow inside main_worker:

import torch
import torch.distributed as dist
import torch.backends.cudnn as cudnn

def main_worker(local_rank, nprocs, args):
    args.local_rank = local_rank
    # Distributed initialization: every process must call this.
    # Replace ip:port with the main process address and a free port (see --ip and --port).
    cudnn.benchmark = True
    dist.init_process_group(backend='nccl', init_method='tcp://ip:port', world_size=nprocs, rank=local_rank)
    # Define the model, loss function, and optimizer
    model = ...
    criterion = ...
    optimizer = ...
    # Bind this process to its GPU
    torch.cuda.set_device(local_rank)
    model.cuda(local_rank)
    # Wrap the model for distributed training
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])

    # Define the datasets, using DistributedSampler
    mini_batch_size = args.batch_size // nprocs  # manually split batch_size into per-process mini-batches
    train_dataset = ...
    train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset)
    trainloader = torch.utils.data.DataLoader(train_dataset, batch_size=mini_batch_size, num_workers=..., pin_memory=...,
                                              sampler=train_sampler)

    test_dataset = ...
    test_sampler = torch.utils.data.distributed.DistributedSampler(test_dataset)
    testloader = torch.utils.data.DataLoader(test_dataset, batch_size=mini_batch_size, num_workers=..., pin_memory=...,
                                             sampler=test_sampler)

    # The usual training loop
    for epoch in range(300):
        # Reseed the sampler so each epoch uses a different shuffle
        train_sampler.set_epoch(epoch)
        model.train()
        for batch_idx, (images, target) in enumerate(trainloader):
            images = images.cuda(non_blocking=True)
            target = target.cuda(non_blocking=True)
            ...
            pred = model(images)
            loss = criterion(pred, target)
            ...
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
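
The role of DistributedSampler and of the manual batch-size split can be pictured with a pure-Python emulation. This is a simplified sketch, not the library code: it skips shuffling, it assumes the dataset has at least as many samples as there are processes, and the function name shard_indices is hypothetical. It pads the index list so that it divides evenly, then gives each rank every world_size-th index, which mirrors the default partitioning scheme of DistributedSampler.

```python
import math


def shard_indices(dataset_len, world_size, rank):
    """Indices one rank would see for an unshuffled dataset."""
    per_rank = math.ceil(dataset_len / world_size)
    total = per_rank * world_size
    indices = list(range(dataset_len))
    # Pad by repeating from the start so every rank gets the same count.
    indices += indices[: total - dataset_len]
    return indices[rank:total:world_size]


batch_size, nprocs = 256, 4
mini_batch_size = batch_size // nprocs  # 64 samples per process per step

# A toy dataset of 10 samples split across 4 processes.
shards = [shard_indices(10, nprocs, r) for r in range(nprocs)]
for rank, shard in enumerate(shards):
    print(rank, shard)
```

Each process therefore loads a disjoint slice of the data (apart from the padding), and the four mini-batches of 64 together make up the global batch of 256.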

4. Define the main function

import argparse
import torch
parser = argparse.ArgumentParser(description='PyTorch ImageNet Training')
parser.add_argument('--local_rank', default=-1, type=int, help='node rank for distributed training')
parser.add_argument('--batch_size','--batch-size', default=256, type=int)
parser.add_argument('--lr', default=0.1, type=float)

def main_worker(local_rank, nprocs, args):
    ...

def main():
    args = parser.parse_args()
    args.nprocs = torch.cuda.device_count()
    # Run main_worker
    main_worker(args.local_rank, args.nprocs, args)

if __name__ == '__main__':
    main()

5. Launch from the command line

$ CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 distributed.py
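  • --ip=str, e.g. --ip='10.24.82.10', sets the IP address of the main process
  • --port=int, e.g. --port=23456, sets the port used for startup

Arguments:

  • --nnodes: the number of machines
  • --node_rank: the index of the current machine
  • --nproc_per_node: the number of processes on each machine

See distributed.py.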
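
Roughly speaking, the launcher starts --nproc_per_node copies of the script on each machine and tells each copy which one it is. The sketch below only builds the command lines and environment variables that such a launcher would use, without starting anything. It is a simplified illustration rather than the actual torch.distributed.launch source; the function name is hypothetical, and the default address and port are assumptions chosen for the example.

```python
import sys


def build_worker_commands(script, nnodes, node_rank, nproc_per_node,
                          master_addr='127.0.0.1', master_port=29500):
    """Return (env, argv) pairs, one per worker process on this machine."""
    world_size = nnodes * nproc_per_node
    workers = []
    for local_rank in range(nproc_per_node):
        env = {
            'MASTER_ADDR': master_addr,
            'MASTER_PORT': str(master_port),
            'WORLD_SIZE': str(world_size),
            'RANK': str(node_rank * nproc_per_node + local_rank),
            'LOCAL_RANK': str(local_rank),
        }
        # Each copy of the script receives its own --local_rank value.
        argv = [sys.executable, '-u', script, f'--local_rank={local_rank}']
        workers.append((env, argv))
    return workers


workers = build_worker_commands('distributed.py', nnodes=1, node_rank=0,
                                nproc_per_node=4)
for env, argv in workers:
    print(env['RANK'], ' '.join(argv[1:]))
```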

6. torch.multiprocessing

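Using torch.multiprocessing to spawn the worker processes from within the script avoids problems that can arise when process startup is controlled externally. This approach is more stable and is recommended.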

import argparse
import torch
import torch.multiprocessing as mp

parser = argparse.ArgumentParser(description='PyTorch ImageNet Training')
parser.add_argument('--local_rank', default=-1, type=int, help='node rank for distributed training')
parser.add_argument('--batch_size','--batch-size', default=256, type=int)
parser.add_argument('--lr', default=0.1, type=float)

def main_worker(local_rank, nprocs, args):
    ...

def main():
    args = parser.parse_args()
    args.nprocs = torch.cuda.device_count()
    # Hand main_worker to mp.spawn
    mp.spawn(main_worker, nprocs=args.nprocs, args=(args.nprocs, args))

if __name__ == '__main__':
    main()
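
One detail that is easy to miss: mp.spawn calls the target as fn(i, *args), where i is the process index. That is why args contains only (args.nprocs, args) even though main_worker takes three parameters. The stand-in below imitates that calling convention sequentially in a single process; fake_spawn is a hypothetical helper for illustration and is not part of torch.multiprocessing.

```python
from types import SimpleNamespace


def fake_spawn(fn, args=(), nprocs=1):
    """Call fn(i, *args) for each index, like mp.spawn but sequentially."""
    return [fn(i, *args) for i in range(nprocs)]


def main_worker(local_rank, nprocs, args):
    # A real worker would initialize the process group and train here.
    return f'worker {local_rank} of {nprocs}, batch_size={args.batch_size}'


args = SimpleNamespace(batch_size=256, nprocs=4)
messages = fake_spawn(main_worker, nprocs=args.nprocs, args=(args.nprocs, args))
for message in messages:
    print(message)
```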

See distributed_mp.py. Launch it as follows:

$ CUDA_VISIBLE_DEVICES=0,1,2,3 python distributed_mp.py
  • --ip=str, e.g. --ip='10.24.82.10', sets the IP address of the main process
  • --port=int, e.g. --port=23456, sets the port used for startup

Implemented Work

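The articles consulted are listed below (if an article with similar content is not cited here, please open an issue and I will add it; my apologies):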

Owner
Ren Tianhe