SmallInitEmb - LayerNorm(SmallInit(Embedding)) in a Transformer to improve convergence

Last update: Dec 25, 2022

SmallInitEmb

LayerNorm(SmallInit(Embedding)) in a Transformer

When training a Transformer, I find that the embedding matrix moves slowly, so it is difficult for the model to escape its noisy initial embedding.

(initial embedding)
[[-0.0073  0.0062 -0.0261 ...  0.0086  0.0107 -0.008 ] ... ]
(after 1 step: the directions of the embedding vectors have barely moved, because each number changes by only ~LR = ~4e-4)
[[-0.0069  0.0066 -0.0265 ...  0.009   0.0111 -0.0084] ... ]

So I propose initializing the embedding matrix to tiny values and putting another LayerNorm after it (before all the self-attention (SA) and FFN layers):

if isinstance(module, (nn.Embedding)):
    nn.init.uniform_(module.weight, a=-1e-4, b=1e-4) # SmallInit(Emb)
...
if self.config.USE_SMALL_EMB and self.layer_id == 0:
    x = self.lnPre(x) # LN(SmallInit(Emb))
x = x + self.att(self.ln1(x))
x = x + self.ffn(self.ln2(x))
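
For readers who want to run the fragments above end to end, here is a minimal sketch of how they might fit together. The names `Block`, `TinyPreLNModel`, `apply_ln_pre`, `ln_out` and `head`, the layer sizes, and the use of `nn.MultiheadAttention` without a causal mask are all assumptions made for illustration; they are not taken from the original training code.

```python
import torch
import torch.nn as nn


class Block(nn.Module):
    """One PreLN block. Only the first block (layer_id == 0) owns lnPre."""

    def __init__(self, d_model, n_head, layer_id, use_small_emb=True):
        super().__init__()
        # Hypothetical flag mirroring `config.USE_SMALL_EMB and layer_id == 0`.
        self.apply_ln_pre = use_small_emb and layer_id == 0
        if self.apply_ln_pre:
            self.lnPre = nn.LayerNorm(d_model)
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        # Stand-ins for the real attention and FFN (assumption, no causal mask).
        self.att = nn.MultiheadAttention(d_model, n_head, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        if self.apply_ln_pre:
            x = self.lnPre(x)  # LN(SmallInit(Emb))
        h = self.ln1(x)
        x = x + self.att(h, h, h, need_weights=False)[0]
        x = x + self.ffn(self.ln2(x))
        return x


class TinyPreLNModel(nn.Module):
    """Illustrative toy model, not the original architecture."""

    def __init__(self, vocab_size=100, d_model=32, n_head=4, n_layer=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d_model)
        self.blocks = nn.ModuleList(
            [Block(d_model, n_head, layer_id=i) for i in range(n_layer)]
        )
        self.ln_out = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab_size, bias=False)
        self.apply(self._init_weights)

    @staticmethod
    def _init_weights(module):
        if isinstance(module, nn.Embedding):
            nn.init.uniform_(module.weight, a=-1e-4, b=1e-4)  # SmallInit(Emb)

    def forward(self, idx):
        x = self.emb(idx)
        for block in self.blocks:
            x = block(x)
        return self.head(self.ln_out(x))


torch.manual_seed(0)
model = TinyPreLNModel()
tokens = torch.randint(0, 100, (2, 8))  # (batch, sequence) of token ids
logits = model(tokens)
print(logits.shape)
```

Only the embedding initialization and the extra `lnPre` in the first block differ from an ordinary PreLN stack; every other layer keeps PyTorch's default initialization in this sketch.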

LayerNorm(SmallInit(Emb)) improves convergence (especially for BPE models), because the model can quickly escape the tiny initial embedding: small changes after 1 step -> significant changes in direction -> significant changes after LayerNorm.
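
The direction argument can be checked with a rough pure-Python simulation. It assumes that one optimizer step moves every coordinate by exactly ±LR with a random sign (a simplification of "the numbers change by ~LR"), that the baseline embedding is drawn from a normal distribution with standard deviation 0.02, and that the embedding width is 512; the helper names are hypothetical.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def direction_kept_after_one_step(init, lr=4e-4, dim=512, seed=0):
    """Cosine similarity of one embedding vector before and after a fake step.

    Assumption: the step shifts each coordinate by +lr or -lr at random.
    """
    rng = random.Random(seed)
    before = [init(rng) for _ in range(dim)]
    after = [x + rng.choice((-lr, lr)) for x in before]
    return cosine(before, after)


# Baseline init (assumed std 0.02): the step is tiny relative to the weights.
baseline_cos = direction_kept_after_one_step(lambda rng: rng.gauss(0.0, 0.02))
# SmallInit(Emb): the step is larger than the weights themselves.
small_cos = direction_kept_after_one_step(lambda rng: rng.uniform(-1e-4, 1e-4))

print(f"baseline init: cosine before/after = {baseline_cos:.4f}")
print(f"small init:    cosine before/after = {small_cos:.4f}")
```

Under these assumptions the baseline vector keeps almost exactly the same direction, while the tiny-init vector points somewhere substantially different after a single step. Since LayerNorm rescales its input, that change in direction shows up at full scale in what the first block sees.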

Loss curve comparison: https://wandb.ai/blinkdl/SmallEmbTest

(The gap between LayerNorm(SmallEmb) and the baseline persists after more training.)

Moreover, with SmallInit(Emb) you can train PostLN models directly, without warmup:

if isinstance(module, (nn.Embedding)):
    nn.init.uniform_(module.weight, a=-1e-4, b=1e-4) # SmallInit(Emb)
...
x = self.ln1(x) # this plays the same role as the lnPre in the above PreLN code
x = x + self.att(x)
x = self.ln2(x)
x = x + self.ffn(x)
(Note: you need another LN after the final FFN.)
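
As a rough illustration of the PostLN layout, including the extra LN after the final FFN, the following sketch stacks two such blocks on a small-init embedding. `PostLNBlock`, `TinyPostLNStack`, `ln_final`, the layer sizes, and the unmasked `nn.MultiheadAttention` are assumptions for demonstration only; the sketch shows the wiring and makes no claim about training behavior.

```python
import torch
import torch.nn as nn


class PostLNBlock(nn.Module):
    """Block in the layout shown above: normalize, then add the sublayer."""

    def __init__(self, d_model, n_head):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        # Stand-ins for the real attention and FFN (assumption, no causal mask).
        self.att = nn.MultiheadAttention(d_model, n_head, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        x = self.ln1(x)  # in the first block this normalizes the tiny embedding
        x = x + self.att(x, x, x, need_weights=False)[0]
        x = self.ln2(x)
        x = x + self.ffn(x)
        return x


class TinyPostLNStack(nn.Module):
    """Illustrative toy stack, not the original architecture."""

    def __init__(self, vocab_size=100, d_model=32, n_head=4, n_layer=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d_model)
        nn.init.uniform_(self.emb.weight, a=-1e-4, b=1e-4)  # SmallInit(Emb)
        self.blocks = nn.ModuleList(
            [PostLNBlock(d_model, n_head) for _ in range(n_layer)]
        )
        self.ln_final = nn.LayerNorm(d_model)  # the LN after the final FFN

    def forward(self, idx):
        x = self.emb(idx)
        for block in self.blocks:
            x = block(x)
        return self.ln_final(x)


torch.manual_seed(0)
stack = TinyPostLNStack()
tokens = torch.randint(0, 100, (2, 8))  # (batch, sequence) of token ids
hidden = stack(tokens)
print(hidden.shape)
```

Because each block begins with an LN, the `ln1` of block k+1 acts as the post-sublayer LN of block k, which is why only the last FFN needs a separate `ln_final`.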

Owner

PENG Bo
