This codebase facilitates fast experimentation of differentially private training of Hugging Face transformers.

Last update: Dec 28, 2022

Overview

private-transformers

This codebase facilitates fast experimentation of differentially private training of Hugging Face transformers.

What is this? Why an extra codebase?

This codebase provides a privacy engine that builds off Opacus, but works way more smoothly with Hugging Face's transformers library.
Additionally, we support the ghost clipping technique (see Section 4 of this preprint on how it works) which allows privately training large transformers with considerably reduced memory cost -- in many cases, almost as light as non-private training -- at a modest run-time overhead.
With this codebase, we have fine-tuned very large pretrained models, yielding some of the best performing differentially private NLP models to date. Some of these models have performance matching strong non-private baseline approaches. We see strong empirical evidence that highly performant DP NLP models could be built on modest datasets.

Installation

Make sure you have python>=3.8; run the following command:

pip install git+https://github.com/lxuechen/private-transformers.git

To check the package is installed properly, be sure to run the test suite (requires pytest and a GPU) via the following command:

pytest -s tests

Usage

Basic usage

Privately training Hugging Face transformers with our codebase simply consists of 4 steps:

Create your favourite transformer model and optimizer; attach this optimizer to a PrivacyEngine
Compute a per-example loss (1-D tensor) for a mini-batch of data
Pass the loss to optimizer.step or optimizer.virtual_step as a keyword argument
Repeat from step 2

Below is a quick example:

import transformers, torch
from private_transformers import PrivacyEngine
import torch.nn.functional as F

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = transformers.GPT2LMHeadModel.from_pretrained('distilgpt2').to(device)
optimizer = torch.optim.Adam(params=model.parameters(), lr=1e-4)
privacy_engine = PrivacyEngine(
    model,
    batch_size=10,
    sample_size=50000,
    epochs=3,
    max_grad_norm=0.1,
    target_epsilon=3,
)
privacy_engine.attach(optimizer)

batch_size, seq_len = 10, 20
# Inputs are batch-first format, i.e., the first dimension of tensors must be batch dimension.
input_ids = torch.randint(size=[batch_size, seq_len], low=0, high=100, device=device)
# Calling `.train()` is very important; otherwise underlying forward and backward hooks don't run.
model.train()
outputs = model(input_ids=input_ids, return_dict=True)
labels = input_ids[:, 1:, ]
logits = outputs.logits[:, :-1, :].permute(0, 2, 1)
# `loss` is a 1-D tensor of shape (batch_size,).
loss = F.cross_entropy(logits, labels, reduction="none").mean(dim=1)
# This step is different from existing workflows: 
#   Don't call `loss.backward`; leave it to `optimizer.step` to handle backward.
optimizer.step(loss=loss)

The biggest differences compared to Opacus are:

We require the per-example loss (a 1-D tensor) be passed into optimizer.step (or optimizer.virtual_step)
The per-example loss must be passed in as a keyword argument.
loss.backward() shouldn't be called on the user end; it's called internally in optimizer.step ( or optimizer.virtual_step).
Inputs should be in batch-first format; there isn't a toggle to switch between different formats in the engine.

Ghost clipping: memory saving differentially private learning

Turning on ghost clipping requires changing only 1 line. You should notice a drastic reduction in peak GPU memory usage once this is turned on, at a potential cost of slower training speed. One might find this especially useful when constrained to only use older GPUs with small VRAMs or fitting super large models.

import transformers, torch
from private_transformers import PrivacyEngine

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = transformers.GPT2LMHeadModel.from_pretrained('distilgpt2').to(device)
optimizer = torch.optim.Adam(params=model.parameters(), lr=1e-4)
privacy_engine = PrivacyEngine(
    model,
    batch_size=10,
    sample_size=50000,
    epochs=3,
    max_grad_norm=0.1,
    target_epsilon=3,
    ghost_clipping=True,  # The only change you need to make!
)
privacy_engine.attach(optimizer)

We ran stringent numerical tests to ensure the double-backward implementation is correct. Check out files in the tests folder for more on this.

Examples

Code in the examples folder roughly reproduces our results for the table-to-text and classification tasks. There may be some minor discrepancies, since hyperparameters there aren't exactly what's used in the paper. Nevertheless, it should be sufficient to get things started. Detailed instructions are in the readme file of each subfolder.

Currently supported Hugging Face models

Not all models in the Hugging Face library are supported. The main additional work here is to

support per-example gradients for bespoke modules (e.g., T5LayerNorm), and
ensure position_ids are repeated.

We plan to support more models in the future if there's such a need. Feel free to open an issue if you may want to try out specific models that aren't in the current list.

FAQ

I wrote some answers to potential questions here.

Acknowledgements

It would have been impossible to develop this codebase without cool past works and existing codebases. We roughly follow the PrivacyEngine design in Opacus==0.13.0. We directly use an off-the-shelf package for tightly tracking tradeoff functions while composing multiple private mechanisms.

Disclaimer

This codebase is not yet production-grade, e.g., cryptographically secure PRNGs are required for sampling noise -- our codebase currently does not use these strong PRNGs.
This codebase is born out of the need to experiment with various things for differentially private NLP in rapidly succession. I've tried my best to write clean code, though parts of this codebase may be less tidy than I had hoped given the extremely tight timeline.

Citation

If you found this codebase useful in your research, please consider citing:

@misc{li2021large,
      title={Large Language Models Can Be Strong Differentially Private Learners}, 
      author={Xuechen Li and Florian Tramèr and Percy Liang and Tatsunori Hashimoto},
      year={2021},
      eprint={2110.05679},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}

Comments

Support BART model

Hi, I'm trying to apply your code in BART model. But I got the error like the below:

ValueError: Ghost clipping does not support parameter sharing. Parameter sharing may be due to default parameter sharing between lm_head and embedding.Please use a model without parameter sharing for ghost clipping.

Does it not support BART model yet??

opened by SeolhwaLee 7

Set another seed won't change the result

Hi Xuechen,

I have another issue with the training seed. I would like to relax the random seed so that I can get some statistical results. Tried many different ways but even comment out the set_seed() function, the eva acc is the same until the last digit. May I ask how to relax the random seed? I'm doing experiments on examples/classification.

Thanks!

opened by JunyiZhu-AI 6
Customize loss function / adding regularizer under privacy setting?

Hi, thanks again for the great work and the codebase!

I have a question -- how I'd want to customize loss function in the codebase? I've been trying to do that, e.g. adding a per-example L1 regularization term to vector_loss in trainer, but I didn't manage to get it running after several attempts.

There's a related discussion/PR in Opacus codebase https://github.com/pytorch/opacus/issues/249.

However, there're a few tricky things I can see: -- In private-transformers, backward() behavior is not managed on the user end. -- also, 1-D vector_loss is required for private gradient update - optimizer.stepor optimizer.virtual_step

My intuition is that I can add to vector_loss (per-example loss) at this line before the loss gets passed to the privacy engine.

However I am afraid privacy concern is also an issue. I am aware of that Private-Transformers overrides compute_loss() in HF trainer, to exclude regularization terms that might mess up with privacy accounting.

Sorry my question is not super detailed but I hope this makes sense and really appreciate for any comments.

Thank you!

opened by shi-kejian 4
Using dataloader with fixed batch size

Hi, thanks for providing this codebase!

So for a while I've been using Opacus to experiment with DP-SGD and RoBERTa, but I wanted to check out your PrivacyEngine, mainly because of the training speed and memory optimizations. With Opacus, I always trained with their UniformWithReplacementSampler for accurate RDP accounting and as far as I can tell, you're training with fixed size batches in your examples. I'm wondering if there's a reason the UniformWithReplacementSampler isn't needed in your codebase anymore, and if the uniform sampler is compatible with your modified PrivacyEngine because the optimizer needs to be able to deal with variations in batch size?

opened by xplip 4

How to set max_compositions

Hi Chen, do you know how to set the max_compositions/steps param? The default at https://github.com/lxuechen/private-transformers/blob/684e27fcd9978539fbabc357c7ea506c0353c771/private_transformers/privacy_utils/privacy_engine.py#L148 is 0 but would raise an error

private-transformers/private_transformers/privacy_utils/accounting/gdp_accounting.py:33: RuntimeWarning: invalid value encountered in double_scalars return norm.cdf(-eps / mu + mu / 2) - np.exp(eps) * norm.cdf(-eps / mu - mu / 2)
rv_accountant/accountant.py:55: RuntimeWarning: divide by zero encountered in double_scalars mesh_size = 2*eps_error / np.sqrt(2*max_compositions*np.log(2/eta0))

opened by hlzhang109 3

Setting small target epsilon like 0.1 fails training

Hi, @lxuechen I tried to set epsilon as 0.1 on SST-2, but it results in a large noise_multiplier: 20853.95 and fails the training where the accuracy is near 0.5 However, setting epsilon as 1 works well. Any idea about this?

opened by LinkToPast1900 3
Private gradient seemingly has been overwritten by non-private gradient.
Hi Xuechen, thanks for providing this codebase!

I tried tweaking the code in examples/classification but the network does not perform as expected. In particular, I tried zeroing out all gradient by this command in _step() and _ghost_step() functions in privacy_engine.py:

param.grad /= self.batch_size param.grad.mul_(0)

After adding this multiplication the network has been trained as normally. And because with the same seed, the network has been even trained to give out the same eval acc. Could you reproduce this result at your private repo? If it behaves like this, then I suppose that the private gradient has been overwritten by the non-private one.
opened by JunyiZhu-AI 3
Questions about sigma search and epsilon from composed tradeoff functions
(Making a new issue for this because you probably weren't notified of my comment in the closed original issue)

Sorry for having to reopen this, but I do have two more (perhaps related) questions after all and would really appreciate if you could help clarify them.

When using the automated sigma search (based on a specified target epsilon and N epochs), the final epsilon computed by the PrivacyEngine after training for N epochs is always much higher than the target epsilon, so it seems that the sigma chosen by get_sigma_from_rdp is too high. This also happens when I run the sentence classification and the table2test examples in the repo. E.g., instead of my target epsilon 8, I will end up with something like epsilon 10-11. How did you get your final epsilon to match the target epsilon in the experiments in your paper?

How do you compute the converted epsilon from composed tradeoff functions when let's say training SST-2 with the default hyperparameters from the examples? Do you reduce the num_compositions=1000 in _eps_from_glw to something way lower than 1000 because the script only runs for ~400 optimization steps and would otherwise always throw the Numerical composition of tradeoff functions failed! Double check privacy parameters. error?

Originally posted by @xplip in https://github.com/lxuechen/private-transformers/issues/7#issuecomment-987020758
opened by xplip 3
What is the best way to handle large models?

Hi all, I was trying to fine-tune GPT-J 6B but I encounter Out Of Memory errors if I use a single-gpu, for non-private training I managed to solve them by using deepspeed but it seems that I cannot use that with opacus or with this codebase. Do you know how I could solve this problem? Thank you in advance:)

opened by Pier297 2
No such file or directory

I want to finetune qqp and here comes an error:

File "/private-transformers-main/examples/classification/run_classification.py", line 545, in main train_dataset = FewShotDataset(data_args, tokenizer=tokenizer, mode="train", use_demo=use_demo) File "/private-transformers-main/examples/classification/src/dataset.py", line 377, in init with FileLock(lock_path): File "/home/anaconda3/envs/fuck/lib/python3.8/site-packages/filelock/_api.py", line 214, in enter self.acquire() File "/home/anaconda3/envs/fuck/lib/python3.8/site-packages/filelock/_api.py", line 170, in acquire self._acquire() File "/home/anaconda3/envs/fuck/lib/python3.8/site-packages/filelock/_unix.py", line 35, in _acquire fd = os.open(self._lock_file, open_mode) FileNotFoundError: [Errno 2] No such file or directory: 'classification/data/original/QQP/cached_train_RobertaTokenizer_256_qqp_few_shot.lock'

how can I get this file? thanks.

opened by trestad 2
[DistilBERT] RuntimeError: stack expects each tensor to be equal size

Hi, @lxuechen, thanks for your repo.

I met a problem as follows when I tied to finetune DistilBERT. Both BERT and Roberta work well. Any idea about this? Thanks!

Traceback (most recent call last): ... File "/opt/conda/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context return func(*args, **kwargs) File "/opt/conda/lib/python3.8/site-packages/private_transformers/privacy_utils/privacy_engine.py", line 360, in step self._ghost_step(loss=kwargs.pop("loss")) File "/opt/conda/lib/python3.8/site-packages/private_transformers/privacy_utils/privacy_engine.py", line 261, in _ghost_step self._ghost_helper(loss) File "/opt/conda/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context return func(*args, **kwargs) File "/opt/conda/lib/python3.8/site-packages/private_transformers/privacy_utils/privacy_engine.py", line 334, in _ghost_helper coef_sample = self.get_coef_sample() File "/opt/conda/lib/python3.8/site-packages/private_transformers/privacy_utils/privacy_engine.py", line 348, in get_coef_sample norm_sample = self.get_norm_sample() File "/opt/conda/lib/python3.8/site-packages/private_transformers/privacy_utils/privacy_engine.py", line 343, in get_norm_sample norm_sample = torch.stack([param.norm_sample for name, param in self.named_params], dim=0).norm(2, dim=0) RuntimeError: stack expects each tensor to be equal size, but got [50] at entry 0 and [1] at entry 1

(50 is my batch size)

opened by LinkToPast1900 1
v0.3.0 fixes
Non-structural fixes.

[ ] Convert to make_private style to avoid bad syntax highlighting during static analysis

[ ] Improve the cleanliness of examples

[ ] Refactor test file and use functorch to simplify ground truth gradients' logic

[ ] Don't compute per-sample gradients for weights which don't require gradients

[ ] Use the new smart resizer for tokenizer and model

[ ] Refactor decoding to use new left padding based construction
opened by lxuechen 0

Releases(v0.2.3)

v0.2.3(Aug 15, 2022)

Merge in rebuttal stuff on spectral analysis.
Source code(tar.gz)
Source code(zip)
v0.2.2(Jul 21, 2022)

Address minor issues in #25.
Source code(tar.gz)
Source code(zip)
v0.2.1(Jul 18, 2022)
The codebase now supports the following additional models

BartForConditionalGeneration (when positional embedding layers are frozen)

T5ForConditionalGeneration

OPTForCausalLM

ViTForImageClassification (when isolated parameters are frozen; see this example)

DeiTForImageClassification (when isolated parameters are frozen)

BeitForImageClassification (when isolated parameters are frozen)

Minor rewrite of the privacy engine, sharing code between ghost and default clipping, thus making future maintaining potentially easier.

Add example file for fine-tuning of vision Transformers on CIFAR-10.

Use callback style implementation for playing with params and gradients after clipping.

Allow forward and backward hooks to use additional args and kwargs (not necessarily torch.Tensors).

Rewrite privacy accounting and remove GDP + CLT, since it under accounts.

Source code(tar.gz)
Source code(zip)
v0.1.0(Oct 28, 2021)

Source code(tar.gz)
Source code(zip)

Owner

Xuechen Li

learning to learn

GitHub Repository

A PyTorch Implementation of End-to-End Models for Speech-to-Text

speech Speech is an open-source package to build end-to-end models for automatic speech recognition. Sequence-to-sequence models with attention, Conne

647 Dec 25, 2022

Material for GW4SHM workshop, 16/03/2022.

GW4SHM Workshop Wednesday, 16th March 2022 (13:00 – 15:15 GMT): Presented by: Dr. Rhodri Nelson, Imperial College London Project website: https://www.

1 Mar 16, 2022

Python interface for converting Penn Treebank trees to Stanford Dependencies and Universal Depenencies

PyStanfordDependencies Python interface for converting Penn Treebank trees to Universal Dependencies and Stanford Dependencies. Example usage Start by

64 May 08, 2022

LegalNLP - Natural Language Processing Methods for the Brazilian Legal Language

LegalNLP - Natural Language Processing Methods for the Brazilian Legal Language ⚖️ The library of Natural Language Processing for Brazilian legal lang

125 Dec 20, 2022

A framework for evaluating Knowledge Graph Embedding Models in a fine-grained manner.

13 Sep 08, 2022

Pipeline for fast building text classification TF-IDF + LogReg baselines.

Text Classification Baseline Pipeline for fast building text classification TF-IDF + LogReg baselines. Usage Instead of writing custom code for specif

57 Dec 07, 2022

AllenNLP integration for Shiba: Japanese CANINE model

Allennlp Integration for Shiba allennlp-shiab-model is a Python library that provides AllenNLP integration for shiba-model. SHIBA is an approximate re

12 Feb 16, 2022

A simple tool to update bib entries with their official information (e.g., DBLP or the ACL anthology).

Rebiber: A tool for normalizing bibtex with official info. We often cite papers using their arXiv versions without noting that they are already PUBLIS

2k Jan 01, 2023

A CRM department in a local bank works on classify their lost customers with their past datas. So they want predict with these method that average loss balance and passive duration for future.

Rule-Based-Classification-in-a-Banking-Case. A CRM department in a local bank works on classify their lost customers with their past datas. So they wa

4 Mar 20, 2022

The code for the Subformer, from the EMNLP 2021 Findings paper: "Subformer: Exploring Weight Sharing for Parameter Efficiency in Generative Transformers", by Machel Reid, Edison Marrese-Taylor, and Yutaka Matsuo

Subformer This repository contains the code for the Subformer. To help overcome this we propose the Subformer, allowing us to retain performance while

10 Dec 27, 2022

📜 GPT-2 Rhyming Limerick and Haiku models using data augmentation

Well-formed Limericks and Haikus with GPT2 📜 GPT-2 Rhyming Limerick and Haiku models using data augmentation In collaboration with Matthew Korahais &

2 May 26, 2022

PyTorch Implementation of Meta-StyleSpeech : Multi-Speaker Adaptive Text-to-Speech Generation

StyleSpeech - PyTorch Implementation PyTorch Implementation of Meta-StyleSpeech : Multi-Speaker Adaptive Text-to-Speech Generation. Status (2021.06.09

142 Jan 06, 2023

To classify the News into Real/Fake using Features from the Text Content of the article

Hoax-Detector Authenticity of news has now become a major problem. The Idea is to classify the News into Real/Fake using Features from the Text Conten

1 Feb 09, 2022

Lumped-element impedance calculator and frequency-domain plotter.

fastZ: Lumped-Element Impedance Calculator fastZ is a small tool for calculating and visualizing electrical impedance in Python. Features include: Sup

47 Nov 18, 2022

Conversational-AI-ChatBot - Intelligent ChatBot built with Microsoft's DialoGPT transformer to make conversations with human users!

Conversational AI ChatBot Intelligent ChatBot built with Microsoft's DialoGPT transformer to make conversations with human users! In this project? Thi

6 Nov 30, 2022

This codebase facilitates fast experimentation of differentially private training of Hugging Face transformers.

Related tags

Overview

private-transformers

What is this? Why an extra codebase?

Installation

Usage

Basic usage

Ghost clipping: memory saving differentially private learning

Examples

Currently supported Hugging Face models

FAQ

Acknowledgements

Disclaimer

Citation

Comments

Releases(v0.2.3)

v0.2.3(Aug 15, 2022)

v0.2.2(Jul 21, 2022)

v0.2.1(Jul 18, 2022)

v0.1.0(Oct 28, 2021)

Owner

Xuechen Li

A PyTorch Implementation of End-to-End Models for Speech-to-Text

Material for GW4SHM workshop, 16/03/2022.

Python interface for converting Penn Treebank trees to Stanford Dependencies and Universal Depenencies

LegalNLP - Natural Language Processing Methods for the Brazilian Legal Language

A framework for evaluating Knowledge Graph Embedding Models in a fine-grained manner.

Pipeline for fast building text classification TF-IDF + LogReg baselines.

AllenNLP integration for Shiba: Japanese CANINE model

A simple tool to update bib entries with their official information (e.g., DBLP or the ACL anthology).

A CRM department in a local bank works on classify their lost customers with their past datas. So they want predict with these method that average loss balance and passive duration for future.

The code for the Subformer, from the EMNLP 2021 Findings paper: "Subformer: Exploring Weight Sharing for Parameter Efficiency in Generative Transformers", by Machel Reid, Edison Marrese-Taylor, and Yutaka Matsuo

📜 GPT-2 Rhyming Limerick and Haiku models using data augmentation

PyTorch Implementation of Meta-StyleSpeech : Multi-Speaker Adaptive Text-to-Speech Generation

To classify the News into Real/Fake using Features from the Text Content of the article

Lumped-element impedance calculator and frequency-domain plotter.

Conversational-AI-ChatBot - Intelligent ChatBot built with Microsoft's DialoGPT transformer to make conversations with human users!

A framework for implementing federated learning

String Gen + Word Checker

DeepAmandine is an artificial intelligence that allows you to talk to it for hours, you won't know the difference.

Diaformer: Automatic Diagnosis via Symptoms Sequence Generation

Conditional probing: measuring usable information beyond a baseline