FedTorch is an open-source Python package for distributed and federated training of machine learning models using PyTorch distributed API

Overview

FedTorch Logo

FedTorch is an open-source Python package for distributed and federated training of machine learning models using PyTorch distributed API. Various algorithms for federated learning and local SGD are implemented for benchmarking and research, including our own proposed methods:

And other common algorithms such as:

We are actively trying to expand the library to include more training algorithms as well.

NEWS

Recent updates to the package:

Installation

First you need to clone the repo into your computer:

git clone https://github.com/MLOPTPSU/FedTorch.git

The PyPi package will be added soon.

This package is built based on PyTorch Distributed API. Hence, it could be run with any supported distributed backend of GLOO, MPI, and NCCL. Among these three, MPI backend since it can be used for both CPU and CUDA runnings, is the main backend we use for developement. Unfortunately installing the built version of PyTorch does not support MPI backend for distributed training and needed to be built from source with a version of MPI installed that supports CUDA as well. However, do not worry since we got you covered. We provide a docker file that can create an image with all dependencies needed for FedTorch. The Dockerfile can be found here, where you can edit based on your need to create your customized image. In addition, since building this docker image might take a lot of time, we provide different versions that we built before along with this repository in the packages section.

For instance, you can pull one of the images that is built with CUDA 10.2 and OpenMPI 4.0.1 with CUDA support and PyTorch 1.6.0, using the following command:

docker pull docker.pkg.github.com/mloptpsu/fedtorch/fedtorch:cuda10.2-mpi

The docker images can be used for cloud services such as Azure Machine Learning API for running FedTorch. The instructions for running on cloud services will be added in near future.

Get Started

Running different trainings with different settings is easy in FedTorch. The only thing you need to take care of is to set the correct parameters for the experiment. The list of all parameters used in this package is in parameters.py file. For different algorithms we will provide examples, so the relevant parameters can be set correctly. YAML support will be added in future for each distinct training. The parameters can be parsed from the input of the commandline using the following method:

from fedtorch.parameters import get_args

args = get_args()

When the parameters are set, we need to setup the nodes using those parameters and start the training. First, we need to setup the distributed backend. For instance, we can use MPI backend as:

import torch.distributed as dist

dist.init_process_group('mpi')

This will initialize the backend for distributed training. Next, we need to setup each node based on the parameters and create the graph of nodes. Then we need to initialize nodes and load their data. For this, we can use the node object in the FedTorch:

from fedtorch.nodes import Client

client = Client(args, dist.get_rank())
# Initialize the node
client.initialize()
# Initialize the dataset if not downloaded
client.initialize_dataset()
# Load the dataset
client.load_local_dataset()
# Generate auxiliary models and params for training
client.gen_aux_models()

Then, we need to call the appropriate training method and run the training. For instance, if the parameters are set for a FedAvg training, we can run:

from fedtorch.comms.trainings.federated import train_and_validate_federated

train_and_validate_federated(client)

Different distributed and federated algorithms in this package can be run using this procedure, and for simplicity, we provide main.py file, where can be used for running those algorithms following the same procedure. To run this file, we should run it using mpi and define number of clients (processes) that will run the same file for training using:

mpirun -np {NUM_CLIENTS} python main.py {args:values}

where {args:values} should be filled with appropriate parameters needed for the training. Next we provide a file to automatically build this command for mpi running for various situations.

Running Examples

To make the process easier, we have provided a run_mpi.py file that covers most of parameters needed for running different training algrithms. We first get into the details of different parameters and then provide some examples for running.

Dataset Parameters

For setting up the dataset there are some parameters involved. The main parameters are:

  • --data : Defines the dataset name for training.
  • --batch_size : Defines the size of the batch in the training.
  • --partition_data : Defines whether the data should be partitioned or each client access to the whole dataset.
  • --reshuffle_per_epoch : This can be set True for distributed training to have iid data accross clients and faster convergence. This is not inline with Federated Learning settings.
  • --iid_data : If set True, the data is randomly distributed accross clients. If is set to False, either the dataset itself is non-iid by default (like the EMNIST dataset) or it can be manullay distributed to be non-iid (like the MNIST dataset using parameter num_class_per_client). The default is True in the package, but in the run_mpi.py file the default is False.
  • --num_class_per_client: If the parameter iid is set to False, we can distribute the data heterogeneously by attributing certain number of classes to each client. For instance if setting --num_class_per_client 2, then each client will only has access to two randomly selected classes' data in the entire training process.

Federated Learning Parameters

To run the training using federated learning setups some main parameters are:

  • --federated : If set to True the training will be in one of federated learning setups. If not, it will be in a distributed mode using local SGD and with periodic averaging (that could be set using --local_step) and possibly reshuffling after each epoch. The default is False.
  • --federated_type : defines the type of fderated learning algorithm we want to use. The default is fedavg.
  • --federated_sync_type : It could be either epoch or local_step and it will be used to determine when to synchronize the models. If set to epoch, then the parameter --num_epochs_per_comm should be set as well. If set to local_step, then the parameter --local_steps should be set. The default is epoch.
  • --num_comms : Defines the number of communication rounds needed for trainings. This is only for federated learning, while in normal distirbuted mode the number of total iterations should be set either by --num_epochs or --num_iterations, and hence the --stop_criteria should be either epoch or iteration.
  • --online_client_rate : Defines the ratio of clients that are online and active during each round of communication. This is only for federated learning. The default value is 1.0, which means all clients will be active.

Learning Rate Schedule

Different learning rate schedules can be set using their corresponding parameters. The main parameter is --lr_schedule_scheme, which defines the scheme for learning rate scheduling. For more information about different learning rate schedulers, please see learning.py file.

Examples

Now we provide some simple examples for running some of the training algorithms on a single node with multiple processes using mpi. To do so, we first need to run the docker container with installed dependencies.

docker run --rm -it --mount type=bind,source="{path/to/FedTorch}",target=/FedTorch docker.pkg.github.com/mloptpsu/fedtorch/fedtorch:cuda10.2-mpi
cd /FedTorch

This will run the container and will mount the FedTorch repo to it. The {path/to/FedTorch} should be replaced with your local path to the FedTorch repo directory. Now we can run the training on it.

FedAvg and FedGATE

Now, we can run the FedAvg algorithm for training an MLP model using MNIST data by the following command.

python run_mpi.py -f -ft fedavg -n 10 -d mnist -lg 0.1 -b 50 -c 20 -k 1.0 -fs local_step -l 10 -r 2

This will run the training on 10 nodes with initial learning rate of 0.1, the batch size of 50, for 20 communication rounds each with 10 local steps of SGD. The dataset is distributed hetergeneously with each client has access to only 2 classes of data from the MNIST dataset.

Changing -ft fedavg to -ft fedgate will run the same training using the FedGATE algorithm. To run the FedCOMGATE algorithm we need to add -q to the parameter to enable quantization as well. Hence the command will be:

python run_mpi.py -f -ft fedgate -n 10 -d mnist -lg 0.1 -b 50 -c 20 -k 1.0 -fs local_step -l 10 -r 2 -q

APFL

To run APFL algorithm a simple command will be:

python run_mpi.py -f -ft apfl -n 10 -d mnist -lg 0.1 -b 50 -c 20 -k 1.0 -fs local_step -l 10 -r 2 -pa 0.5 -fp

where -pa 0.5 sets the alpha parameter of the APFL algorithm. The last parameter -fp will turn on the fed_personal parameter, which evaluate the personalized or the localized model using a local validation dataset. This will be mostly used for personalization algorithms such as APFL.

DRFA

To run a DRFA training we can use the following command:

python run_mpi.py -f -fd -ft fedavg -n 10 -d mnist -lg 0.1 -b 50 -c 20 -k 1.0 -fs local_step -l 10 -r 2 -dg 0.1 

where -dg 0.1 sets the gamma parameter in the DRFA algorithm. Note that DRFA is a framework that can be run using any federated learning aggergator such as FedAvg or FedGATE. Hence the parameter -fd will enable DRFA training and -ft will define the federated type to be used for aggregation.

References

For this repository there are several different references used for each training procedure. If you use this repository in your research, please cite the following paper:

@article{haddadpour2020federated,
  title={Federated learning with compression: Unified analysis and sharp guarantees},
  author={Haddadpour, Farzin and Kamani, Mohammad Mahdi and Mokhtari, Aryan and Mahdavi, Mehrdad},
  journal={arXiv preprint arXiv:2007.01154},
  year={2020}
}

Our other papers developed using this repository should be cited using the following bibitems:

@inproceedings{haddadpour2019local,
  title={Local sgd with periodic averaging: Tighter analysis and adaptive synchronization},
  author={Haddadpour, Farzin and Kamani, Mohammad Mahdi and Mahdavi, Mehrdad and Cadambe, Viveck},
  booktitle={Advances in Neural Information Processing Systems},
  pages={11082--11094},
  year={2019}
}
@article{deng2020distributionally,
  title={Distributionally Robust Federated Averaging},
  author={Deng, Yuyang and Kamani, Mohammad Mahdi and Mahdavi, Mehrdad},
  journal={Advances in Neural Information Processing Systems},
  volume={33},
  year={2020}
}
@article{deng2020adaptive,
  title={Adaptive Personalized Federated Learning},
  author={Deng, Yuyang and Kamani, Mohammad Mahdi and Mahdavi, Mehrdad},
  journal={arXiv preprint arXiv:2003.13461},
  year={2020}
}

Acknowledgement and Disclaimer

This repository is developed, mainly by MM. Kamani, based on our group's research on distributed and federated learning algorithms. We also developed the works of other groups' proposed methods using FedTorch for a better comparison. However, this repo is not the official code for those methods other than our group's. Some parts of the initial stages of this repository were based on a forked repo of Local SGD code from Tao Lin, which is not public now.

Comments
  • Errors when running the code using Docker

    Errors when running the code using Docker

    Hi, I followed your README instructions to run your algorithm but I encountered multiple errors along the way:

    • I tried to pull the image you provided on This issue #3 and run the following command: python run_mpi.py -f -ft apfl -n 10 -d cifar10 -lg 0.1 -b 50 -c 20 -k 1.0 -fs local_step -l 10 -r 2 -pa 0.5 -fp -oc I keep getting warnings about my RTX 2080ti card which I think is not been recognized.
    warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
    Traceback (most recent call last):
      File "main.py", line 49, in <module>
        main(args)
      File "main.py", line 21, in main
        client.initialize()
      File "/workspace/fedtorch/fedtorch/nodes/nodes.py", line 44, in initialize
        init_config(self.args)
      File "/workspace/fedtorch/fedtorch/utils/init_config.py", line 36, in init_config
        torch.cuda.set_device(args.graph.device)
      File "/usr/local/lib/python3.8/dist-packages/torch/cuda/__init__.py", line 281, in set_device
        torch._C._cuda_setDevice(device)
    RuntimeError: cuda runtime error (101) : invalid device ordinal at /workspace/pytorch/torch/csrc/cuda/Module.cpp:59
    THCudaCheck FAIL file=/workspace/pytorch/torch/csrc/cuda/Module.cpp line=59 error=101 : invalid device ordinal
    /usr/local/lib/python3.8/dist-packages/torch/cuda/__init__.py:125: UserWarning:
    GeForce RTX 2080 Ti with CUDA capability sm_75 is not compatible with the current PyTorch installation.
    The current PyTorch install supports CUDA capabilities sm_35 sm_37 sm_52 sm_60 sm_61 sm_70 compute_70.
    If you want to use the GeForce RTX 2080 Ti GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/
    
      warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
    Traceback (most recent call last):
      File "main.py", line 49, in <module>
        main(args)
      File "main.py", line 21, in main
        client.initialize()
      File "/workspace/fedtorch/fedtorch/nodes/nodes.py", line 44, in initialize
        init_config(self.args)
      File "/workspace/fedtorch/fedtorch/utils/init_config.py", line 36, in init_config
        torch.cuda.set_device(args.graph.device)
      File "/usr/local/lib/python3.8/dist-packages/torch/cuda/__init__.py", line 281, in set_device
        torch._C._cuda_setDevice(device)
    RuntimeError: cuda runtime error (101) : invalid device ordinal at /workspace/pytorch/torch/csrc/cuda/Module.cpp:59
    

    Regarding these warnings the code does not collapse but not even one log regarding the training procedure is printed.

    image

    In addition when I run nvidia-smi I see the following gpu utilization:

    image The utilization percentage across all 3 GPUs remains zero.

    • As second option I tried to build the image given the Dockerfile located in the docker folder. Yet again I encountered the following run error:
    Traceback (most recent call last):
      File "main.py", line 4, in <module>
        import torch.distributed as dist
      File "/usr/local/lib/python3.8/dist-packages/torch/__init__.py", line 189, in <module>
        from torch._C import *
    RuntimeError: module compiled against API version 0xe but this version of numpy is 0xd
    
    --------------------------------------------------------------------------
    Primary job  terminated normally, but 1 process returned
    a non-zero exit code. Per user-direction, the job has been aborted.
    --------------------------------------------------------------------------
    --------------------------------------------------------------------------
    mpirun detected that one or more processes exited with non-zero status, thus causing
    the job to be terminated. The first process to do so was:
    
      Process name: [[51835,1],1]
      Exit code:    1
    --------------------------------------------------------------------------
    

    It looks like there is a problem with PyTorch installation so I tried something simple:

    image

    Upgrading Numpy's version didn't help me either. Can you please help and explain how to run your code?

    Thank you!

    opened by AvivSham 5
  • A question about APFL algorithm

    A question about APFL algorithm

    Hi, I read your paper and am interested in the algorithm of APFL. I have a question about the code. It seems that the code of APFL is a little different from the algorithm written on the paper. In the paper, each client maintains 3 models and first update the local version of global model, then update the local model and finally mix them to get the new personalized model. However, in this code, it just maintains 2 models and first update the local version of global model, then get the grad of personalized model by mixing the output of local version of global model and personalized model.

    The code: `

    inference and get current performance.

                        client.optimizer.zero_grad()
                        loss, performance = inference(client.model, client.criterion, client.metrics, _input, _target)
    
                        # compute gradient and do local SGD step.
                        loss.backward()
                        client.optimizer.step(
                            apply_lr=True,
                            apply_in_momentum=client.args.in_momentum, apply_out_momentum=False
                        )
                        
                        client.optimizer.zero_grad()
                        client.optimizer_personal.zero_grad()
                        loss_personal, performance_personal = inference_personal(client.model_personal, client.model, 
                                                                                 client.args.fed_personal_alpha, client.criterion, 
                                                                                 client.metrics, _input, _target)
    
                        # compute gradient and do local SGD step.
                        loss_personal.backward()
                        client.optimizer_personal.step(
                            apply_lr=True,
                            apply_in_momentum=client.args.in_momentum, apply_out_momentum=False
                        )
    

    ` Are they the same? Looking forward to your reply. Thank you!

    opened by BrightHaozi 3
  • a question about fedprox

    a question about fedprox

    in this page "https://github.com/MLOPTPSU/FedTorch/blob/main/fedtorch/comms/trainings/federated/main.py". line 123 -> 129, code is below.

    elif client.args.federated_type == 'fedprox':
        # Adding proximal gradients and loss for fedprox
        for client_param, server_param in zip(client.model.parameters(), client.model_server.parameters()):
            if client.args.graph.rank == 0:
                print("distance norm for prox is:{}".format(torch.norm(client_param.data - server_param.data )))
            loss += client.args.fedprox_mu /2 * torch.norm(client_param.data - server_param.data)
            client_param.grad.data += client.args.fedprox_mu * (client_param.data - server_param.data)
    

    client.args.fedprox_mu /2 * torch.norm(client_param.data - server_param.data) may mean $\frac{\mu}{2} \left\|w-w_t\right\|$. But, $\frac{\mu}{2} \left\|w-w_t\right\|^2$ is used in fedprox.

    I'm not sure what I said above is true. Thank you very much for your kind consideration.

    opened by bird-two 2
  • Could you release the code in

    Could you release the code in "Federated Learning with Compression: Unified Analysis and Sharp Guarantees"?

    Hi,

    I just read the paper of "Federated Learning with Compression: Unified Analysis and Sharp Guarantees", it looks very interesting While searching for the code I was directed to this repo, however, I didn't found the implementation of FedCOM, so I would like to ask if you would release it?

    Thanks!

    opened by Dzhange 1
  • Does BrokenPipeError Matters?

    Does BrokenPipeError Matters?

    Hi, I am very interested in your proposed method. It is really nice of you to share such a helpful and clear repo. I follow your repo and use your docker images to run the code, but I keep getting some errors as follows after some random starting round of the traning.

    Traceback (most recent call last): File "/usr/lib/python3.8/multiprocessing/queues.py", line 245, in _feed send_bytes(obj) File "/usr/lib/python3.8/multiprocessing/connection.py", line 200, in send_bytes self._send_bytes(m[offset:offset + size]) File "/usr/lib/python3.8/multiprocessing/connection.py", line 411, in _send_bytes self._send(header + buf) File "/usr/lib/python3.8/multiprocessing/connection.py", line 368, in _send n = write(self._handle, buf) BrokenPipeError: [Errno 32] Broken pipe

    After all, I still can get a result, but I wonder whether the brokenpipeerror matters, am I still getting the correct results? Hope to hear your response soon.

    opened by RongDai430 1
  • No basic auth credentials

    No basic auth credentials

    I read your paper and am interested in your proposed method. It is really nice of you to share such a helpful and clear repo. I pulled one of the images built with CUDA 10.2 and OpenMPI 4.0.1 with CUDA support and PyTorch 1.6.0, using the following command: $ docker pull docker.pkg.github.com/mloptpsu/fedtorch/fedtorch:cuda10.2-mpi

    However it reported an error "Error response from daemon: Head https://docker.pkg.github.com/v2/mloptpsu/fedtorch/fedtorch/manifests/cuda10.2-mpi: no basic auth credentials".
    How can I solve this error?

    opened by Sybil-Huang 1
  • Adding EMNIST full dataset

    Adding EMNIST full dataset

    Adding EMNIST full dataset in addition to its digits_only version, which was implemented before. Now, we have emnist and emnist_full datasets. The emnist dataset has 3383 clients with 10 classes, but the emnist_full has 3400 clients with 62 classes. Resolve the Batch Norm issue for training with 1 sample in the batch.

    opened by mmkamani7 1
  • setup problem in docker

    setup problem in docker

    Hello,when I ran docker pull docker.pkg.github.com/mloptpsu/fedtorch/fedtorch:cuda10.2-mpi in my docker,there's a problem err do you have any ideas to solve it? Thanks.

    opened by b856741 0
  • A question about drawing the plots

    A question about drawing the plots

    Hi, Recently I've read your paper "Distributionally Robust Federated Averaging" carefully and I'm trying to reproduce the experiments in it. However, I don't know how to get the plots from the codes, I think modifying main.py and run_mpi.py would help but it seems really complicated. Could you please give me some instructions? Thanks!

    opened by Zhg9300 2
  • Models not saved for Federated Centered

    Models not saved for Federated Centered

    Hi While running main_centered.py the model checkpoints not saved. It is not Implemented?

    Is there anything else missing for main_centered.py?

    Regards, Ahmed

    enhancement 
    opened by aldahdooh 1
Releases(v0.1.1)
Owner
Machine Learning and Optimization Lab @PennState
This is the GitHub repository of the Machine Learning and Optimization lab at Penn State University.
Machine Learning and Optimization Lab @PennState
Semantic Bottleneck Scene Generation

SB-GAN Semantic Bottleneck Scene Generation Coupling the high-fidelity generation capabilities of label-conditional image synthesis methods with the f

Samaneh Azadi 41 Nov 28, 2022
Official implementation of the paper "Light Field Networks: Neural Scene Representations with Single-Evaluation Rendering"

Light Field Networks Project Page | Paper | Data | Pretrained Models Vincent Sitzmann*, Semon Rezchikov*, William Freeman, Joshua Tenenbaum, Frédo Dur

Vincent Sitzmann 130 Dec 29, 2022
Data, notebooks, and articles associated with the RSNA AI Deep Learning Lab at RSNA 2021

RSNA AI Deep Learning Lab 2021 Intro Welcome Deep Learners! This document provides all the information you need to participate in the RSNA AI Deep Lea

RSNA 65 Dec 16, 2022
CR-FIQA: Face Image Quality Assessment by Learning Sample Relative Classifiability

This is the official repository of the paper: CR-FIQA: Face Image Quality Assessment by Learning Sample Relative Classifiability A private copy of the

Fadi Boutros 33 Dec 31, 2022
CSAC - Collaborative Semantic Aggregation and Calibration for Separated Domain Generalization

CSAC Introduction This repository contains the implementation code for paper: Co

ScottYuan 5 Jul 22, 2022
bespoke tooling for offensive security's Windows Usermode Exploit Dev course (OSED)

osed-scripts bespoke tooling for offensive security's Windows Usermode Exploit Dev course (OSED) Table of Contents Standalone Scripts egghunter.py fin

epi 268 Jan 05, 2023
Official PyTorch implementation of BlobGAN: Spatially Disentangled Scene Representations

BlobGAN: Spatially Disentangled Scene Representations Official PyTorch Implementation Paper | Project Page | Video | Interactive Demo BlobGAN.mp4 This

148 Dec 29, 2022
3D-printable hand-strapped keyboard

Note: This repo has not been cleaned up and prepared for general consumption at all. This is just a dump of the project files. If there is any interes

Wojciech Baranowski 41 Dec 31, 2022
PyTorch Code of "Memory In Memory: A Predictive Neural Network for Learning Higher-Order Non-Stationarity from Spatiotemporal Dynamics"

Memory In Memory Networks It is based on the paper Memory In Memory: A Predictive Neural Network for Learning Higher-Order Non-Stationarity from Spati

Yang Li 12 May 30, 2022
LSTM Neural Networks for Spectroscopic Studies of Type Ia Supernovae

Package Description The difficulties in acquiring spectroscopic data have been a major challenge for supernova surveys. snlstm is developed to provide

7 Oct 11, 2022
Code for Recurrent Mask Refinement for Few-Shot Medical Image Segmentation (ICCV 2021).

Recurrent Mask Refinement for Few-Shot Medical Image Segmentation Steps Install any missing packages using pip or conda Preprocess each dataset using

XIE LAB @ UCI 39 Dec 08, 2022
Lightweight plotting to the terminal. 4x resolution via Unicode.

Uniplot Lightweight plotting to the terminal. 4x resolution via Unicode. When working with production data science code it can be handy to have plotti

Olav Stetter 203 Dec 29, 2022
Approximate Nearest Neighbors in C++/Python optimized for memory usage and loading/saving to disk

Annoy Annoy (Approximate Nearest Neighbors Oh Yeah) is a C++ library with Python bindings to search for points in space that are close to a given quer

Spotify 10.6k Jan 04, 2023
Machine Learning Model deployment for Container (TensorFlow Serving)

try_tf_serving ├───dataset │ ├───testing │ │ ├───paper │ │ ├───rock │ │ └───scissors │ └───training │ ├───paper │ ├───rock

Azhar Rizki Zulma 5 Jan 07, 2022
End-to-end machine learning project for rices detection

Basmatinet Welcome to this project folks ! Whether you like it or not this project is all about riiiiice or riz in french. It is also about Deep Learn

Béranger 47 Jun 18, 2022
一个多语言支持、易使用的 OCR 项目。An easy-to-use OCR project with multilingual support.

AgentOCR 简介 AgentOCR 是一个基于 PaddleOCR 和 ONNXRuntime 项目开发的一个使用简单、调用方便的 OCR 项目 本项目目前包含 Python Package 【AgentOCR】 和 OCR 标注软件 【AgentOCRLabeling】 使用指南 Pytho

AgentMaker 98 Nov 10, 2022
LEAP: Learning Articulated Occupancy of People

LEAP: Learning Articulated Occupancy of People Paper | Video | Project Page This is the official implementation of the CVPR 2021 submission LEAP: Lear

Neural Bodies 60 Nov 18, 2022
This repository contains project created during the Data Challenge module at London School of Hygiene & Tropical Medicine

LSHTM_RCS This repository contains project created during the Data Challenge module at London School of Hygiene & Tropical Medicine (LSHTM) in collabo

Lukas Kopecky 3 Jan 30, 2022
Cross-platform CLI tool to generate your Github profile's stats and summary.

ghs Cross-platform CLI tool to generate your Github profile's stats and summary. Preview Hop on to examples for other usecases. Jump to: Installation

HackerRank 134 Dec 20, 2022
Data loaders and abstractions for text and NLP

torchtext This repository consists of: torchtext.datasets: The raw text iterators for common NLP datasets torchtext.data: Some basic NLP building bloc

3.2k Jan 08, 2023