This PyTorch package implements MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided Adaptation (NAACL 2022).

Last update: Dec 24, 2022

MoEBERT

Installation

Create and activate the conda environment.

conda env create -f environment.yml

Install Transformers locally.

pip install -e .

Note: The code is adapted from this codebase. Arguments related to LoRA and adapters can be safely ignored.

Instructions

MoEBERT targets task-specific distillation. Before running any distillation code, fine-tune a pre-trained BERT model on the target task. The path to the fine-tuned model should be passed to --model_name_or_path.

Importance Score Computation

To compute the importance scores, use bert_base_mnli_example.sh with a --preprocess_importance argument added and the --do_train argument removed.
If multiple GPUs are used to compute the importance scores, an importance_[rank].pkl file will be saved for each GPU. Use merge_importance.py to merge these files.
To use the pre-computed importance scores, pass the file name to --moebert_load_importance.
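
The merge step can be pictured with a minimal sketch. This is not the code in merge_importance.py: the layout of the importance_[rank].pkl files is not described here, so the sketch assumes each file holds a pickled list of per-layer lists of floats and that per-rank scores are combined by element-wise summation. The function name and the glob pattern are hypothetical.

```python
import pickle
from pathlib import Path


def merge_importance(directory, pattern="importance_*.pkl"):
    """Sum per-rank importance scores element-wise.

    Assumption (not confirmed by the repository): every file stores a
    list of per-layer lists of floats, with the same shape on each rank.
    """
    merged = None
    for path in sorted(Path(directory).glob(pattern)):
        with open(path, "rb") as f:
            scores = pickle.load(f)
        if merged is None:
            # First file: copy so the loaded object is not mutated.
            merged = [list(layer) for layer in scores]
        else:
            # Later files: accumulate into the running totals.
            for layer_total, layer in zip(merged, scores):
                for i, value in enumerate(layer):
                    layer_total[i] += value
    if merged is None:
        raise FileNotFoundError(f"no files matching {pattern} in {directory}")
    return merged
```

The merged result would then be saved to a single file whose name is passed to --moebert_load_importance.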

Knowledge Distillation

For GLUE tasks, see examples/text-classification/run_glue.py.
For question answering tasks, see examples/question-answering/run_qa.py.
Run bash bert_base_mnli_example.sh as an example.
The codebase supports different routing strategies: gate-token, gate-sentence, hash-random and hash-balance. Choices should be passed to --moebert_route_method.
- To use hash-balance, a balanced hash list needs to be pre-computed using hash_balance.py. The path to the saved hash list should be passed to --moebert_route_hash_list.
- Add a load balancing loss by setting --moebert_load_balance when using trainable gating mechanisms.
- The sentence-based gating mechanism (gate-sentence) is advantageous for inference because it induces significantly less communication overhead than token-level routing methods.
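
To illustrate the difference between the two hash-based strategies, here is a minimal sketch. It is not the algorithm in hash_balance.py, which is not described here. It assumes that a hash list maps each token id to an expert index, that hash-random picks the expert uniformly at random, and that hash-balance spreads total token frequency evenly across experts, approximated below with a greedy heuristic. All function names are hypothetical.

```python
import heapq
import random


def hash_random(vocab_size, num_experts, seed=0):
    """Assign every token id to a uniformly random expert."""
    rng = random.Random(seed)
    return [rng.randrange(num_experts) for _ in range(vocab_size)]


def hash_balance(token_counts, num_experts):
    """Greedy balanced assignment (an assumed heuristic).

    Tokens are visited from most to least frequent, and each one is
    given to the expert with the smallest accumulated frequency so far.
    """
    heap = [(0, expert) for expert in range(num_experts)]
    heapq.heapify(heap)
    assignment = [0] * len(token_counts)
    order = sorted(range(len(token_counts)), key=lambda t: -token_counts[t])
    for token in order:
        load, expert = heapq.heappop(heap)
        assignment[token] = expert
        heapq.heappush(heap, (load + token_counts[token], expert))
    return assignment


def expert_loads(assignment, token_counts, num_experts):
    """Total token frequency routed to each expert."""
    loads = [0] * num_experts
    for token, expert in enumerate(assignment):
        loads[expert] += token_counts[token]
    return loads
```

Comparing expert_loads for the two assignments on real token counts shows how uneven a random hash can be when a few tokens dominate the corpus, which is the motivation for pre-computing a balanced list.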

Owner

Simiao Zuo
