Training RNNs as Fast as CNNs

Overview

News

SRU++, a new SRU variant, is released. [tech report] [blog]

The experimental code and SRU++ implementation are available on the dev branch which will be merged into master later.

About

SRU is a recurrent unit that can run over 10 times faster than cuDNN LSTM, without loss of accuracy tested on many tasks.


Average processing time of LSTM, conv2d and SRU, tested on GTX 1070

For example, the figure above presents the processing time of a single mini-batch of 32 samples. SRU achieves 10 to 16 times speed-up compared to LSTM, and operates as fast as (or faster than) word-level convolution using conv2d.

Reference:

Simple Recurrent Units for Highly Parallelizable Recurrence [paper]

@inproceedings{lei2018sru,
  title={Simple Recurrent Units for Highly Parallelizable Recurrence},
  author={Tao Lei and Yu Zhang and Sida I. Wang and Hui Dai and Yoav Artzi},
  booktitle={Empirical Methods in Natural Language Processing (EMNLP)},
  year={2018}
}

When Attention Meets Fast Recurrence: Training Language Models with Reduced Compute [paper]

@article{lei2021srupp,
  title={When Attention Meets Fast Recurrence: Training Language Models with Reduced Compute},
  author={Tao Lei},
  journal={arXiv preprint arXiv:2102.12459},
  year={2021}
}

Requirements

Install requirements via pip install -r requirements.txt.


Installation

From source:

SRU can be installed as a regular package via python setup.py install or pip install ..

From PyPi:

pip install sru

Directly use the source without installation:

Make sure this repo and CUDA library can be found by the system, e.g.

export PYTHONPATH=path_to_repo/sru
export LD_LIBRARY_PATH=/usr/local/cuda/lib64

Examples

The usage of SRU is similar to nn.LSTM. SRU likely requires more stacking layers than LSTM. We recommend starting by 2 layers and use more if necessary (see our report for more experimental details).

import torch
from sru import SRU, SRUCell

# input has length 20, batch size 32 and dimension 128
x = torch.FloatTensor(20, 32, 128).cuda()

input_size, hidden_size = 128, 128

rnn = SRU(input_size, hidden_size,
    num_layers = 2,          # number of stacking RNN layers
    dropout = 0.0,           # dropout applied between RNN layers
    bidirectional = False,   # bidirectional RNN
    layer_norm = False,      # apply layer normalization on the output of each layer
    highway_bias = -2,        # initial bias of highway gate (<= 0)
)
rnn.cuda()

output_states, c_states = rnn(x)      # forward pass

# output_states is (length, batch size, number of directions * hidden size)
# c_states is (layers, batch size, number of directions * hidden size)

Contributing

Please read and follow the guidelines.

Other Implementations

@musyoku had a very nice SRU implementaion in chainer.

@adrianbg implemented the first CPU version.


Comments
  • Enable both Pytorch native AMP and Nvidia APEX AMP for SRU

    Enable both Pytorch native AMP and Nvidia APEX AMP for SRU

    Hi!

    I was happily using SRUs with Pytorch native AMP, however I started experimenting with training using Microsoft DeepSpeed and bumped in to an issue.

    Basically the issues is that I observed that FP16 training using DeepSpeed doesn't work for both GRUs and SRUs. However when using Nvidia APEX AMP, DeepSpeed training using GRUs does work.

    So, based on the tips in one of your issues, I started looking in to how I could enable Pytorch native AMP and Nvidia APEX AMP for SRUs, so I could train models based on SRUs using DeepSpeed.

    That is why I created this pull request. Basically, I found that by making the code simpler, I can make SRUs work with both methods of AMP.

    Now amp_recurrence_fp16 can be used for both types of AMP. When amp_recurrence_fp16=True, the tensor's are cast to float16, otherwise nothing special happens. So, I also removed the torch.cuda.amp.autocast(enabled=False) region; I might be wrong, but it seems that we don't need it.

    I did some tests with my own code and it works in the different scenarios of interest:

    • Using PyTorch native AMP, not using DeepSpeed
    • Not using PyTorch native AMP, not using DeepSpeed
    • Using Nvidia APEX AMP, using DeepSpeed
    • Not using Nvidia APEX AMP, using DeepSpeed

    It would be beneficial if we can test this with an official SRU repo test, maybe repurposing the language_model/train_lm.py?

    opened by visionscaper 13
  • float16 handling

    float16 handling

    When I convert my model, which using this SRU unit, into float16 enabled one, it fails. Is this SRU not implemented to use in float16 environment, or is it hard to fix it?

    bug 
    opened by ywatanabe1989 11
  • support GPU inference in torchscript

    support GPU inference in torchscript

    This is on 3.0.0-dev branch for now

    A non-trivial PR to support GPU inference in torchscript

    • Load CUDA kernels as non-python modules; this is needed for torchscript compilation
    • Refactored CUDA APIs as functions that return output as tensors, instead of procedures that modify some passed-in tensors.
    • Added a workaround in case TS tries to locate and compile CUDA methods on machines that don't have CUDA / GPUs

    The refactored code has passed the forward() & backward() test. I also checked the outputs are the same for the non-torchscript and torchscript versions of the same model.

    opened by taoleicn 8
  • Error unpacking PackedSequence on latest version

    Error unpacking PackedSequence on latest version

    Hello @taolei87 , After updating to the latest version, my code broke. It works great on the previous 2.3.5 version and with nn.LSTM.

    File "C:\xxx\lib\site-packages\torch\nn\modules\module.py", line 722, in _call_impl
      result = self.forward(*input, **kwargs)
    File "C:\xxx\lib\site-packages\sru\modules.py", line 576, in forward
      mask_pad = (mask_pad >= batch_sizes.view(length, 1)).contiguous()
    RuntimeError: shape '[393, 1]' is invalid for input of size 384
    

    I can see that in the previous version the unpacking code on forward was different:

            input_packed = isinstance(input, nn.utils.rnn.PackedSequence)
            if input_packed:
                input, lengths = nn.utils.rnn.pad_packed_sequence(input)
                max_length = lengths.max().item()
                mask_pad = torch.ByteTensor([[0] * l + [1] * (max_length - l) for l in lengths.tolist()])
                mask_pad = mask_pad.to(input.device).transpose(0, 1).contiguous()
    

    Now is:

    
            orig_input = input
            if isinstance(orig_input, PackedSequence):
                input, batch_sizes, sorted_indices, unsorted_indices = input
                length = input.size(0)
                batch_size = input.size(1)
                mask_pad = torch.arange(batch_size,
                                        device=batch_sizes.device).expand(length, batch_size)
                mask_pad = (mask_pad >= batch_sizes.view(length, 1)).contiguous()
    
    bug 
    opened by bratao 8
  • Increasing GPU Usage each epoch

    Increasing GPU Usage each epoch

    I'm trying to implement a model that includes a SRUCell. This are my specs:

    Tesla M60 GPU torch.version: 0.4.1.post2 torch.cuda.version: 9.0.176

    Although its training, every epoch the memory usage in the GPU increases until it fills it. I made a toy example where this error occurs:

    import torch
    from torch.autograd import Variable
    from sru import SRUCell
    
    
    batch_size = 5
    seq_len = 60
    epochs = 1000
    cuda = torch.cuda.is_available()
    
    model = SRUCell(100, 100)
    
    if cuda:
        model.cuda(0)
    
    optimizer = torch.optim.Adam([
            {'params':model.parameters()}], lr=1e-3)
    
    loss_function = torch.nn.MSELoss()
        
    seq = Variable(torch.rand(batch_size,seq_len,100))
    y = Variable(torch.rand(batch_size,100))
    
    
    if cuda:
        seq = seq.cuda(0)
        y = y.cuda(0)
    
    
    model.train()
    
    for e in range(epochs):
        model.zero_grad()
        
        h = Variable(torch.zeros(batch_size, 100))
        c = Variable(torch.zeros(batch_size, 100))
        
        if cuda:
            h = h.cuda(0)
            c = c.cuda(0)
        
        for i in range(seq_len):
            x = seq[:,i,:]
            h, c = model(x, c)
        loss = loss_function(h, y)
        loss.backward()
        optimizer.step()
        print('Epoch: {} - Loss: {}'.format(e, loss))
    
    opened by santiag0m 8
  • Can i put hidden states in sru cell forward like in vanilla pytorch?

    Can i put hidden states in sru cell forward like in vanilla pytorch?

    In vanilla it work like this

    rnn = nn.LSTMCell(10, 20)
    input = torch.randn(6, 3, 10)
    hx = torch.randn(3, 20)
    cx = torch.randn(3, 20)
    output = []
    for i in range(6):
        hx, cx = rnn(input[i], (hx, cx))
        output.append(hx)
    

    How can i do same for sru cell?

    opened by hadaev8 7
  • AttributeError when preprocessing data for DrQA

    AttributeError when preprocessing data for DrQA

    Firstly i ran download.sh, and it succesfully downloaded glove and train/dev jsons for SQuAD. However, python prepro.py gave me this:

    Traceback (most recent call last):
      File "prepro.py", line 243, in <module>
        vocab_tag = list(nlp.tagger.tag_names)
    AttributeError: 'Tagger' object has no attribute 'tag_names'
    

    My Spacy version is 2.0.3, and it seems like something broke in update from 1.x that is written in requirements, and I didn't succeed in fixing it myself. Any suggests?

    opened by mojesty 7
  • Calculating Backwards For SRU Results in CUDA error.

    Calculating Backwards For SRU Results in CUDA error.

    I'm not sure how, but I'm seeing this error when I try to compute the backwards function. Don't know if you've come across this during your debug?

    Traceback (most recent call last):
      File "gan_language.py", line 341, in <module>
        G.backward(one)
      File "/usr/local/lib/python2.7/dist-packages/torch/autograd/variable.py", line 156, in backward
        torch.autograd.backward(self, gradient, retain_graph, create_graph, retain_variables)
      File "/usr/local/lib/python2.7/dist-packages/torch/autograd/__init__.py", line 98, in backward
        variables, grad_variables, retain_graph)
      File "/home/nick/wgan-gp/sru/cuda_functional.py", line 417, in backward
        stream=SRU_STREAM
      File "cupy/cuda/function.pyx", line 129, in cupy.cuda.function.Function.__call__ (cupy/cuda/function.cpp:4010)  File "cupy/cuda/function.pyx", line 111, in cupy.cuda.function._launch (cupy/cuda/function.cpp:3647)
      File "cupy/cuda/driver.pyx", line 127, in cupy.cuda.driver.launchKernel (cupy/cuda/driver.cpp:2541)
      File "cupy/cuda/driver.pyx", line 62, in cupy.cuda.driver.check_status (cupy/cuda/driver.cpp:1446)
    cupy.cuda.driver.CUDADriverError: CUDA_ERROR_INVALID_HANDLE: invalid resource handle
    
    opened by NickShahML 7
  • Speed up data loading / batching for ONE BILLION WORD experiment

    Speed up data loading / batching for ONE BILLION WORD experiment

    The data loading was inefficient and was found to be the bottleneck of BILLION WORD training. This PR rewrote the sharding (which data goes to a certain GPU / training process), and improved the training speed significantly.

    The figure compares a previous run and a new test run. We see 40% reduction on training time.

    This means our reported training efficiency will be much stronger from 59 GPU days to 36 GPU days, and 4x more efficient than FairSeq Transformer results.

    opened by taoleicn 6
  • Different input dimention compared to output dimension

    Different input dimention compared to output dimension

    Hi, I'm trying to implement a naive version of this paper in Keras, and was wondering how is the case that - n_in != n_out handled.

    I went through the code a few times, and couldn't understand the element wise multiplication of (1 - r_t) with x_t, if x_t is of a different shape than r_t.

    question 
    opened by titu1994 6
  • support GPU inference in torchscript model for v2.5 / v2.6

    support GPU inference in torchscript model for v2.5 / v2.6

    This PR works for master branch, v2.5 and v2.6 release

    A non-trivial PR to support GPU inference in torchscript

    • Load CUDA kernels as non-python modules; this is needed for torchscript compilation
    • Refactored CUDA APIs as functions that return output as tensors, instead of procedures that modify some passed-in tensors.
    • Added a workaround in case TS tries to locate and compile CUDA methods on machines that don't have CUDA / GPUs
    • The refactored code has passed the forward() & backward() test.
    • I also checked the outputs are the same for the non-torchscript and torchscript versions of the same model.
    opened by taoleicn 5
  • Mixed Precision Training

    Mixed Precision Training

    Hi,

    first of all I want to thank you for your great work. I'm using SRUs for speech enhancement, they do very well on a reasonable computational cost.

    I would like to know if there is a possibility to train SRUs in mixed precision mode? I tried to enable it, by setting precision=16 in the pytorch lightning trainer, but that didn't do the trick.

    Kind of regards, Zadagu

    opened by Zadagu 1
  • Any documentation on using SRU++ ?

    Any documentation on using SRU++ ?

    Hello, I've read and really appreciated your team's wonderful works on SRU++. I want to implement this architecture in other tasks, but i'm having problem finding the documentation on SRU++, as how I can use SRU++ the same way as SRU (calling directly from sru library after installing by pip install sru). I have looked into the dev-3.0.0 branch, which seems like the latest updated branch, but I still have no clues how to call and integrate sru++ modules into my custom defined pytorch modules. Could you help me ?

    opened by thangld201 1
  • FAILED: sru_cuda_kernel.cuda.o

    FAILED: sru_cuda_kernel.cuda.o

    when i run example, i meet this issue:FAILED: sru_cuda_kernel.cuda.o ,and in the end, it report ninja: build stopped: subcommand failed. what should i do to slove this problem?

    opened by xianyu-123 0
  • Avoid unintended eager cuda initialization

    Avoid unintended eager cuda initialization

    We noticed the package initialization for sru is eagerly triggering the initialization because of the following stack of module imports sru.modules -> sru.ops -> cuda_functional and this last module is executing the function load of torch.utils.cpp_extension.

    This was detected because of issues caused when running with the server framework in SUBPROCESS_MODE, that is forking a new process for it to run the model. We got an error complaining that CUDA had been already initialized in the parent process, which was not necessary because it is not meant to run the inference in the model.

    This PR changes this loading to be more lazy, more concretely we changed the code in sru.modules to avoid the eager import of sru.ops and instead postpone it to the instantiation of a first SRUCell.

    The changes in this PR have been tested doing a checkout of this branch in an AWS instance with GPU and running pytest -sv test which resulted in 141 passed, 161 warnings and no failures. So we understand this is working as expected for both CPU and GPU settings.

    opened by dkasapp 0
  • Unknown builtin op: sru_cuda::sru_bi_forward_simple

    Unknown builtin op: sru_cuda::sru_bi_forward_simple

    When using a bidirectional SRU, regular usage seems to be fine, and compilation to torchscript proceeds without error, but upon trying to infer with the compiled torchscript I get:

    Unknown builtin op: sru_cuda::sru_bi_forward_simple.

    Using pytorch 1.10, sru 2.6.0, cuda 11.3

    opened by ctlaltdefeat 2
Releases(v2.7.0-rc1)
Owner
ASAPP Research
AI for Enterprise
ASAPP Research
ICRA 2021 - Robust Place Recognition using an Imaging Lidar

Robust Place Recognition using an Imaging Lidar A place recognition package using high-resolution imaging lidar. For best performance, a lidar equippe

Tixiao Shan 293 Dec 27, 2022
Colour detection is necessary to recognize objects, it is also used as a tool in various image editing and drawing apps.

Colour Detection On Image Colour detection is the process of detecting the name of any color. Simple isn’t it? Well, for humans this is an extremely e

Astitva Veer Garg 1 Jan 13, 2022
BirdCLEF 2021 - Birdcall Identification 4th place solution

BirdCLEF 2021 - Birdcall Identification 4th place solution My solution detail kaggle discussion Inference Notebook (best submission) Environment Use K

tattaka 42 Jan 02, 2023
PyTorch implementation DRO: Deep Recurrent Optimizer for Structure-from-Motion

DRO: Deep Recurrent Optimizer for Structure-from-Motion This is the official PyTorch implementation code for DRO-sfm. For technical details, please re

Alibaba Cloud 56 Dec 12, 2022
Official implementation of "Accelerating Reinforcement Learning with Learned Skill Priors", Pertsch et al., CoRL 2020

Accelerating Reinforcement Learning with Learned Skill Priors [Project Website] [Paper] Karl Pertsch1, Youngwoon Lee1, Joseph Lim1 1CLVR Lab, Universi

Cognitive Learning for Vision and Robotics (CLVR) lab @ USC 134 Dec 06, 2022
CSAW-M: An Ordinal Classification Dataset for Benchmarking Mammographic Masking of Cancer

CSAW-M This repository contains code for CSAW-M: An Ordinal Classification Dataset for Benchmarking Mammographic Masking of Cancer. Source code for tr

Yue Liu 7 Oct 11, 2022
Implementation of the paper Scalable Intervention Target Estimation in Linear Models (NeurIPS 2021), and the code to generate simulation results.

Scalable Intervention Target Estimation in Linear Models Implementation of the paper Scalable Intervention Target Estimation in Linear Models (NeurIPS

0 Oct 25, 2021
Few-shot Learning of GPT-3

Few-shot Learning With Language Models This is a codebase to perform few-shot "in-context" learning using language models similar to the GPT-3 paper.

Tony Z. Zhao 224 Dec 28, 2022
Official code for paper "Optimization for Oriented Object Detection via Representation Invariance Loss".

Optimization for Oriented Object Detection via Representation Invariance Loss By Qi Ming, Zhiqiang Zhou, Lingjuan Miao, Xue Yang, and Yunpeng Dong. Th

ming71 56 Nov 28, 2022
Convolutional Neural Network for Text Classification in Tensorflow

This code belongs to the "Implementing a CNN for Text Classification in Tensorflow" blog post. It is slightly simplified implementation of Kim's Convo

Denny Britz 5.5k Jan 02, 2023
Deep Dual Consecutive Network for Human Pose Estimation (CVPR2021)

Beanie - is an asynchronous ODM for MongoDB, based on Motor and Pydantic. It uses an abstraction over Pydantic models and Motor collections to work wi

295 Dec 29, 2022
A 35mm camera, based on the Canonet G-III QL17 rangefinder, simulated in Python.

c is for Camera A 35mm camera, based on the Canonet G-III QL17 rangefinder, simulated in Python. The purpose of this project is to explore and underst

Daniele Procida 146 Sep 26, 2022
PyTorch implementation of a Real-ESRGAN model trained on custom dataset

Real-ESRGAN PyTorch implementation of a Real-ESRGAN model trained on custom dataset. This model shows better results on faces compared to the original

Sber AI 160 Jan 04, 2023
JAX + dataclasses

jax_dataclasses jax_dataclasses provides a wrapper around dataclasses.dataclass for use in JAX, which enables automatic support for: Pytree registrati

Brent Yi 35 Dec 21, 2022
Implementation of EMNLP 2017 Paper "Natural Language Does Not Emerge 'Naturally' in Multi-Agent Dialog" using PyTorch and ParlAI

Language Emergence in Multi Agent Dialog Code for the Paper Natural Language Does Not Emerge 'Naturally' in Multi-Agent Dialog Satwik Kottur, José M.

Karan Desai 105 Nov 25, 2022
A deep learning model for style-specific music generation.

DeepJ: A model for style-specific music generation https://arxiv.org/abs/1801.00887 Abstract Recent advances in deep neural networks have enabled algo

Henry Mao 704 Nov 23, 2022
Code + pre-trained models for the paper Keeping Your Eye on the Ball Trajectory Attention in Video Transformers

Motionformer This is an official pytorch implementation of paper Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers. In this rep

Facebook Research 192 Dec 23, 2022
Robot Servers and Server Manager software for robo-gym

robo-gym-server-modules Robot Servers and Server Manager software for robo-gym. For info on how to use this package please visit the robo-gym website

JR ROBOTICS 4 Aug 16, 2021
Doing fast searching of nearest neighbors in high dimensional spaces is an increasingly important problem

Benchmarking nearest neighbors Doing fast searching of nearest neighbors in high dimensional spaces is an increasingly important problem, but so far t

Erik Bernhardsson 3.2k Jan 03, 2023
Self-Supervised Monocular DepthEstimation with Internal Feature Fusion(arXiv), BMVC2021

DIFFNet This repo is for Self-Supervised Monocular DepthEstimation with Internal Feature Fusion(arXiv), BMVC2021 A new backbone for self-supervised de

Hang 94 Dec 25, 2022