The Noise Contrastive Estimation for softmax output written in Pytorch

Overview

An NCE implementation in pytorch

About NCE

Noise Contrastive Estimation (NCE) is an approximation method used to work around the huge computational cost of a large softmax layer. The basic idea is to convert the prediction problem into a classification problem at the training stage. It has been proved that the two criteria converge to the same minimum as long as the noise distribution is close enough to the real one.
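As a rough illustration of the classification view, the sketch below computes the NCE loss for one target word and k noise words in plain Python. The function names and toy probabilities are illustrative assumptions, not this repository's API, and the sketch assumes that both the model and the noise distribution provide log-probabilities.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def nce_loss(logp_model_target, logp_noise_target,
             logp_model_noises, logp_noise_noises, k):
    """Binary-classification NCE loss for one target and k noise samples.

    Every sample w gets the logit  log p_model(w) - log(k * p_noise(w)).
    The target should be classified as "data", the noise samples as "noise".
    """
    log_k = math.log(k)
    # Target term: -log P(data | target)
    logit = logp_model_target - logp_noise_target - log_k
    loss = -log_sigmoid(logit)
    # Noise terms: -log P(noise | w) = -log(1 - sigmoid(logit)) = -log_sigmoid(-logit)
    for lm, ln in zip(logp_model_noises, logp_noise_noises):
        logit = lm - ln - log_k
        loss -= log_sigmoid(-logit)
    return loss

# Toy numbers (hypothetical): the model prefers the target over the noise words.
k = 2
loss = nce_loss(math.log(0.5), math.log(0.1),
                [math.log(0.05), math.log(0.01)],
                [math.log(0.2), math.log(0.3)], k)
```

The sigmoid of the logit equals p_model / (p_model + k * p_noise), which is the posterior probability that a sample came from the data rather than from the noise distribution.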

NCE bridges the gap between generative and discriminative models, rather than simply speeding up the softmax layer. With NCE, you can turn almost anything into a posterior with less effort (I think).

Refs:

NCE:

http://www.cs.helsinki.fi/u/ahyvarin/papers/Gutmann10AISTATS.pdf

NCE on rnnlm:

https://pdfs.semanticscholar.org/144e/357b1339c27cce7a1e69f0899c21d8140c1f.pdf

Comparison with other methods

A review of softmax speedup methods:

http://ruder.io/word-embeddings-softmax/

NCE vs. IS (Importance Sampling): NCE is a binary classification problem, while IS is a sort of multi-class classification problem.

http://demo.clab.cs.cmu.edu/cdyer/nce_notes.pdf

NCE vs. GAN (Generative Adversarial Network):

https://arxiv.org/abs/1412.6515

On improving NCE

Sampling methods

In NCE, the unigram distribution is usually used as the noise distribution because it is fast to sample from. Sampling from a unigram distribution amounts to multinomial sampling, which costs $O(\log(N))$ per sample via binary search. The cost of sampling becomes significant as the noise ratio increases.
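A minimal sketch of the $O(\log(N))$ baseline, assuming the binary search runs over a precomputed cumulative distribution. The helper names and the toy unigram distribution are hypothetical and only serve the illustration.

```python
import bisect
import itertools
import random

def build_cdf(probs):
    """Cumulative sums of the noise distribution, computed once before training."""
    return list(itertools.accumulate(probs))

def sample_bisect(cdf, u):
    """Map a uniform number u in [0, 1) to a word index by binary search: O(log N)."""
    idx = bisect.bisect_right(cdf, u * cdf[-1])
    return min(idx, len(cdf) - 1)  # guard against floating-point round-off

unigram = [0.5, 0.3, 0.15, 0.05]   # toy unigram distribution (hypothetical)
cdf = build_cdf(unigram)
rng = random.Random(0)
noise = [sample_bisect(cdf, rng.random()) for _ in range(10)]
```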

Since the unigram distribution can be obtained before training and remains unchanged throughout training, several methods exploit this property to speed up sampling. The alias method is one of them.

[Figure: constructing the auxiliary data structure for the alias method]

By constructing auxiliary data structures, the alias method reduces the sampling complexity from $O(\log(N))$ to $O(1)$, and it is easy to parallelize.
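The table construction and the constant-time draw can be sketched as follows. This is a plain-Python illustration of the alias method in its common Vose formulation, not the code in nce/alias_multinomial.py; the function names and the toy distribution are assumptions.

```python
import random

def build_alias_table(probs):
    """Build the (prob, alias) tables in O(N)."""
    n = len(probs)
    total = sum(probs)
    scaled = [p * n / total for p in probs]   # average bucket height becomes 1
    prob = [1.0] * n
    alias = list(range(n))
    small = [i for i, p in enumerate(scaled) if p < 1.0]
    large = [i for i, p in enumerate(scaled) if p >= 1.0]
    while small and large:
        s = small.pop()
        l = large.pop()
        prob[s] = scaled[s]            # bucket s keeps its own mass ...
        alias[s] = l                   # ... and is topped up by outcome l
        scaled[l] -= 1.0 - scaled[s]   # l donated the missing mass
        (small if scaled[l] < 1.0 else large).append(l)
    return prob, alias

def sample_alias(prob, alias, rng=random):
    """Draw one index in O(1): pick a bucket, then pick itself or its alias."""
    i = rng.randrange(len(prob))
    return i if rng.random() < prob[i] else alias[i]

unigram = [0.5, 0.3, 0.15, 0.05]       # toy unigram distribution (hypothetical)
prob, alias = build_alias_table(unigram)
noise = [sample_alias(prob, alias, random.Random(0)) for _ in range(5)]
```

Each draw needs one random bucket and one comparison, independent of the vocabulary size, which is why the method suits large noise ratios.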

Refs:

alias method:

https://hips.seas.harvard.edu/blog/2013/03/03/the-alias-method-efficient-sampling-with-many-discrete-outcomes/

Generic NCE (full-NCE)

Conventional NCE only performs the contrasting on the linear (softmax) layer: given the input of a linear layer, the model outputs are $p(noise|input)$ and $p(target|input)$. In fact, NCE can be applied to more general situations where the model is able to output likelihood values for both real data and noise data.

In this code base, I use a variant of generic NCE named full-NCE (f-NCE) to make the distinction clear. Unlike normal NCE, f-NCE samples the noise at the input embedding.

Refs:

whole sentence language model by IBM (ICASSP2018)

Bi-LSTM language model by speechlab,SJTU (ICSLP2016?)

Batched NCE

Conventional NCE requires different noise samples for each data token. This computational pattern is not fully GPU-efficient because it needs batched matrix multiplication. A trick is to share the noise samples across the whole mini-batch, so that the sparse batched matrix multiplication is converted into a more efficient dense matrix multiplication. Batched NCE is already supported by TensorFlow.
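The effect of sharing noise samples can be sketched with nested lists standing in for tensors. All sizes, names, and the tiny matmul helper are assumptions made for the illustration; a real implementation would use tensor operations on the GPU.

```python
import random

def matmul_t(a, b):
    """Return a @ b^T for row-major lists: (B x D) and (K x D) -> (B x K)."""
    return [[sum(x * y for x, y in zip(row_a, row_b)) for row_b in b]
            for row_a in a]

rng = random.Random(0)
B, D, V, K = 3, 4, 10, 5   # batch size, hidden size, vocabulary, noise samples (toy)
hidden = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(B)]
embedding = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(V)]

# One set of K noise words shared by every token in the mini-batch.
shared_noise = [rng.randrange(V) for _ in range(K)]
noise_rows = [embedding[w] for w in shared_noise]   # gathered once: K x D
noise_scores = matmul_t(hidden, noise_rows)         # one dense product: B x K
```

With per-token noise, each of the B tokens would need its own K x D gather and its own small product, which is the batched multiplication the shared scheme avoids.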

A more aggressive approach is called self contrasting (named by myself). Instead of sampling from the noise distribution, the noise samples are simply the other training tokens within the same mini-batch.
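A minimal sketch of the self contrasting idea as described here, with a hypothetical helper name and toy word indices. It does not filter out noise entries that happen to equal the target, which a real implementation might want to handle.

```python
def self_contrast_noise(targets):
    """For each position, use the other targets in the mini-batch as noise."""
    return [targets[:i] + targets[i + 1:] for i in range(len(targets))]

batch_targets = [7, 2, 9, 2]   # toy word indices (hypothetical)
noise = self_contrast_noise(batch_targets)
# noise[0] == [2, 9, 2]
```

No sampling step is needed at all, at the price of a noise distribution that follows the batch statistics instead of a fixed unigram.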

Ref:

batched NCE:

https://arxiv.org/pdf/1708.05997.pdf

self contrasting:

https://www.isi.edu/natural-language/mt/simple-fast-noise.pdf

Run the word language model example

There's an example illustrating how to use the NCE module in the example folder. This example is forked from the pytorch/examples repo.

Requirements

Please run pip install -r requirements first to make sure you have the required Python libraries.

  • tqdm is used for the progress bar during training
  • dill is a more flexible replacement for pickle

NCE related Arguments

  • --nce: whether to use NCE as approximation
  • --noise-ratio <50>: number of noise samples per batch; the noise is shared among the tokens in a single batch for training speed.
  • --norm-term <9>: the constant normalization term ln(Z)
  • --index-module <linear>: index module to use for the NCE module (currently linear and gru are available; gru does not support PPL calculation)
  • --train: train the model, or just evaluate an existing model
  • --vocab <None>: use vocabulary file if specified, otherwise use the words in train.txt
  • --loss [full, nce, sampled, mix]: choose one of the loss types for training; the loss is converted to full automatically for PPL evaluation.
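To illustrate what a constant normalization term does, the sketch below compares the exact log-softmax with the approximation that subtracts a fixed ln(Z) from the unnormalized score. The function names and toy scores are hypothetical and not part of this repo's API.

```python
import math

def log_prob_full(scores, target):
    """Exact log-softmax: needs a sum over the whole vocabulary."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return scores[target] - log_z

def log_prob_const_norm(scores, target, norm_term):
    """Approximation with a constant ln(Z): only the target score is needed."""
    return scores[target] - norm_term

scores = [2.0, 1.0, 0.0]   # toy unnormalized scores (hypothetical)
exact = log_prob_full(scores, 0)
approx = log_prob_const_norm(scores, 0, norm_term=9.0)
```

The approximate values do not sum to one over the vocabulary unless the model has learned to self-normalize around the chosen constant, so a perplexity computed from them is not a valid PPL.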

Examples

Run NCE criterion with linear module:

python main.py --cuda --noise-ratio 10 --norm-term 9 --nce --train

Run NCE criterion with gru module:

python main.py --cuda --noise-ratio 10 --norm-term 9 --nce --train --index-module gru

Run conventional CE criterion:

python main.py --cuda --train

A small benchmark on the swbd+fisher dataset

It's a performance showcase; the dataset is not bundled in this repo, however. The model is trained on concatenated sentences, but the hidden states are not passed across batches. An <s> is inserted between sentences. The model is evaluated on <s>-padded sentences separately.

Generally, a model trained on concatenated sentences performs slightly worse than one trained on separate sentences, but we save 50% of the training time by reducing sentence padding.
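The concatenation scheme can be sketched as below. The helper names are hypothetical, and dropping the incomplete tail chunk is an assumption of this sketch rather than a statement about what the --concat option does.

```python
def concat_corpus(sentences, sep="<s>"):
    """Join sentences into one token stream with a separator between sentences."""
    stream = []
    for i, sent in enumerate(sentences):
        if i > 0:
            stream.append(sep)
        stream.extend(sent)
    return stream

def chunk(stream, length):
    """Cut the stream into fixed-length chunks; the incomplete tail is dropped."""
    return [stream[i:i + length]
            for i in range(0, len(stream) - length + 1, length)]

sentences = [["a", "b"], ["c"], ["d", "e", "f"]]
stream = concat_corpus(sentences)
chunks = chunk(stream, 3)
```

Every chunk has the same length, so no padding tokens are computed, whereas batching separate sentences pads each one to the longest sentence in the mini-batch.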

dataset statistics

  • training samples: 2200000 sentences, 22403872 words
  • built vocabulary size: ~30K

testbed

  • 1080 Ti
  • i7 7700K
  • pytorch-0.4.0
  • cuda-8.0
  • cudnn-6.0.1

how to run:

python main.py --train --batch-size 96 --cuda --loss nce --noise-ratio 500 --nhid 300 \
  --emsize 300 --log-interval 1000 --nlayers 1 --dropout 0 --weight-decay 1e-8 \
  --data data/swb --min-freq 3 --lr 2 --save nce-500-swb --concat

Running time

  • crossentropy: 6.5 mins/epoch (56K tokens/sec)
  • nce: 2 mins/epoch (187K tokens/sec)

performance

The rescore is performed on swbd 50-best, thanks to HexLee.

training loss type   evaluation type   PPL       WER
-----------------    ---------------   -------   ----
3gram                normed            ??        19.4
CE(no concat)        normed(full)      53        13.1
CE                   normed(full)      55        13.3
NCE                  unnormed(NCE)     invalid   13.4
NCE                  normed(full)      55        13.4
importance sample    normed(full)      55        13.4
importance sample    sampled(500)      invalid   19.0 (worse than w/o rescore)

File structure

  • example/log/: some log files of these scripts
  • nce/: the NCE module wrapper
    • nce/nce_loss.py: the NCE loss
    • nce/alias_multinomial.py: alias method sampling
    • nce/index_linear.py: an index module used by NCE, as a replacement for normal Linear module
    • nce/index_gru.py: an index module used by NCE, as a replacement for the whole language model module
  • sample.py: a simple script for NCE linear.
  • example: a word language model sample that uses NCE as the loss.
    • example/vocab.py: a wrapper for vocabulary object
    • example/model.py: the wrapper of all nn.Modules.
    • example/generic_model.py: the model wrapper for index_gru NCE module
    • example/main.py: entry point
    • example/utils.py: some util functions for better code structure

Modified README from Pytorch/examples

This example trains a multi-layer LSTM on a language modeling task. By default, the training script uses the PTB dataset, provided.

python main.py --train --cuda --epochs 6        # Train an LSTM on PTB with CUDA

The model will automatically use the cuDNN backend if run on CUDA with cuDNN installed.

During training, if a keyboard interrupt (Ctrl-C) is received, training is stopped and the current model is evaluated against the test dataset.
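The interrupt handling described above follows a common pattern that can be sketched like this. The `run` function and its callback parameters are hypothetical names for the illustration and do not mirror the structure of main.py.

```python
def run(train_one_epoch, evaluate, epochs):
    """Train for up to `epochs` epochs; on Ctrl-C stop early and still evaluate."""
    completed = 0
    try:
        for epoch in range(epochs):
            train_one_epoch(epoch)
            completed += 1
    except KeyboardInterrupt:
        print("Exiting from training early")
    # Evaluation runs whether training finished or was interrupted.
    return completed, evaluate()
```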

The main.py script accepts the following arguments:

optional arguments:
  -h, --help         show this help message and exit
  --data DATA        location of the data corpus
  --emsize EMSIZE    size of word embeddings
  --nhid NHID        number of hidden units per layer
  --nlayers NLAYERS  number of layers
  --lr LR            initial learning rate
  --lr-decay         learning rate decay when no progress is observed on validation set
  --weight-decay     weight decay (L2 regularization)
  --clip CLIP        gradient clipping
  --epochs EPOCHS    upper epoch limit
  --batch-size N     batch size
  --dropout DROPOUT  dropout applied to layers (0 = no dropout)
  --seed SEED        random seed
  --cuda             use CUDA
  --log-interval N   report interval
  --save SAVE        path to save the final model
  --bptt             max length of truncated bptt
  --concat           use concatenated sentence instead of individual sentence

CHANGELOG

  • 2019.09.09: Improve numeric stability by calculating directly on logits

Comments

  • truncated bptt without padding?

    Hi,

    Thanks for the great example. I noticed that you pad sentences to the max length per mini-batch, which is a bit different from the truncated bptt approach of the original word_language_model without NCE on pytorch. I wonder if you have compared the two approaches and investigated whether it makes a difference in terms of the final ppl?

    I'm also interested to know how much better this model can be on the ptb dataset. I'm also reading the torch blog post on NCE

    opened by eric-haibin-lin 2
  • Error in NCE expression?

    In line 227 of nce_loss.py: logit_true = logit_model - logit_noise - math.log(self.noise_ratio), but this is not the same as the log of line 223: # p_true = logit_model.exp() / (logit_model.exp() + self.noise_ratio * logit_noise.exp()). Where is the logit_model in the denominator? Shouldn't it be logit_true = logit_model - log(logit_model.exp() + self.noise_ratio * logit_noise.exp())?

    opened by bczhu 1
  • why the labels in sampled_softmax_loss func are all zero?

    I read the code in nce_loss.py:

        def sampled_softmax_loss(self, logit_target_in_model, logit_noise_in_model, logit_noise_in_noise, logit_target_in_noise):
            """Compute the sampled softmax loss based on the tensorflow's impl"""
            logits = torch.cat([logit_target_in_model.unsqueeze(2), logit_noise_in_model], dim=2)
            q_logits = torch.cat([logit_target_in_noise.unsqueeze(2), logit_noise_in_noise], dim=2)
            # subtract Q for correction of biased sampling
            logits = logits - q_logits
            labels = torch.zeros_like(logits.narrow(2, 0, 1)).squeeze(2).long()
            loss = self.ce(
                logits.view(-1, logits.size(-1)),
                labels.view(-1),
            ).view_as(labels)
    
            return loss
    

    The labels are created by 'torch.zeros_like' function, so they are all zeros. Is this a bug? Because the target label should be one?

    opened by universewill 1
  • Why does nce_linear output a loss in training but a prob for testing?

    I don't understand the code below in sample.py:

    # training mode
    loss = nce_linear(target, input).mean()
    print(loss.item())
    
    # evaluation mode for fast probability computation
    nce_linear.eval()
    prob = nce_linear(target, input).mean()
    print(prob.item())
    

    Besides, why is the target input parameter needed for inference?

    opened by universewill 1
  • Why need to sub math.log(self.noise_ratio)

    Hi Stonesjtu,

    Thanks for sharing this NCE implementation. I have a question about the details. I'd like to know why we need to subtract math.log(self.noise_ratio) here: https://github.com/Stonesjtu/Pytorch-NCE/blob/1fae107a92e24e39f25dd69b766806709c70d414/nce/nce_loss.py#L228

    In this tutorial https://www.tensorflow.org/extras/candidate_sampling.pdf, see the Table of Candidate Sampling Algorithms. The input to training loss of NCE is G(x, y) = F(x, y) - log(Q(y|x)).

    Thanks, Bin

    opened by gbuion 1
  • Target Sample can be included in Noise sample

    Hello. Thank you for your NCE code in PyTorch. It is very helpful. I have a question about noise sampling. In your code, the target sample can be sampled as a noise sample, and the "K" noise samples can overlap. Is that OK? I think it is not valid in theory, but practically OK. Do you have any thoughts on this?

    opened by adonisues 1
  • main.py does not run 'as is' on penn data

    Hi there,

    I'm trying out your code and couldn't run it 'as is' on penn data. I changed the import of data_sms to data in main.py. Maybe you left this from some tryouts on another dataset?

    Thanks for your implementation anyways. F

    opened by francoishernandez 1
  • why squeeze here?

    Hi, I think there is a bug here:

    https://github.com/Stonesjtu/Pytorch-NCE/blob/862afc666445dca4ce9d24a3eb1e073255edb92e/nce.py#L198

    For an RNN model in which the last layer before the softmax has shape [B * N * D] with time steps N>1, I believe the squeeze does not have any effect. Maybe for batch size B=1? If that is the case, squeeze(0) might be a better choice.

    I am using your code for predicting the last state (in other words, N=1). The squeeze here will give a model_loss.shape = (B , 1) and noise_loss.shape = (B,) and then the total loss.shape = (B, B), which should be (B,1) I think.

    opened by chaoqing 3
Releases(neat-nce)
  • neat-nce(Nov 15, 2017)

    • The main file contains the minimal details required. Many helper functions are moved into utils file.
    • Model's API is simplified a lot.

    • Speed issues remain to be solved.
Owner
Kaiyu Shi
Studying Language Model