PyTorch implementation of "ContextNet: Improving Convolutional Neural Networks for Automatic Speech Recognition with Global Context" (INTERSPEECH 2020)

Last update: Nov 24, 2022

Overview

ContextNet

ContextNet is a CNN-RNN-transducer architecture with a fully convolutional encoder that incorporates global context information into its convolution layers by adding squeeze-and-excitation modules.
ContextNet comes in three model sizes: small, medium, and large. A global parameter alpha controls the scaling of the model by changing the number of channels in the convolution filters.
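As a rough illustration of how alpha scales the channel count, here is a minimal sketch. The alpha values per model size and the base width of 256 are assumptions recalled from the paper's configuration table, and `scaled_channels` is a hypothetical helper, not a function from this repository.

```python
# Assumed alpha per model size (check the paper's configuration table).
ALPHA_BY_SIZE = {"small": 0.5, "medium": 1.0, "large": 2.0}


def scaled_channels(base_channels: int, model_size: str) -> int:
    """Scale a base channel count by the alpha of the chosen model size."""
    alpha = ALPHA_BY_SIZE[model_size]
    return int(base_channels * alpha)


if __name__ == "__main__":
    # 256 is an assumed base width used only for illustration.
    for size in ALPHA_BY_SIZE:
        print(size, scaled_channels(256, size))
```

Under these assumptions, a wider model differs from a narrower one only in filter count; the block layout stays the same.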

This repository contains only the model code, but you can train ContextNet with openspeech.

Model Architecture

Configuration of the ContextNet encoder

Once you choose a model size (small, medium, or large), the number of channels in the convolution filters is set by the global parameter alpha. If the stride of a convolution block is 2, its last conv layer has a stride of two while the rest of the conv layers have a stride of one.
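The stride rule above can be sketched as a small helper that returns the stride of each conv layer in a block. `layer_strides` is a hypothetical name used for illustration only; the repository may organize this differently.

```python
from typing import List


def layer_strides(num_layers: int, block_stride: int) -> List[int]:
    """Return per-layer strides: only the last conv layer carries the block stride."""
    if num_layers < 1:
        raise ValueError("a block needs at least one conv layer")
    return [1] * (num_layers - 1) + [block_stride]


if __name__ == "__main__":
    print(layer_strides(5, 2))  # block with stride 2
    print(layer_strides(5, 1))  # block with stride 1
```

Placing the stride on the last layer means every earlier layer in the block still operates at the block's input time resolution.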

A convolution block architecture

ContextNet has 23 convolution blocks, C0, ..., C22. All convolution blocks have five convolution layers except C0 and C22, which have only one convolution layer each. A skip connection with projection is applied on the output of the squeeze-and-excitation (SE) block.
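To make the block layout concrete, the sketch below lists the number of conv layers in each of the 23 blocks as described above. `layers_per_block` is a hypothetical helper, not part of this repository.

```python
from typing import List

NUM_BLOCKS = 23  # C0 through C22


def layers_per_block() -> List[int]:
    """C0 and C22 have one conv layer each; every other block has five."""
    first, last = 0, NUM_BLOCKS - 1
    return [1 if i in (first, last) else 5 for i in range(NUM_BLOCKS)]


if __name__ == "__main__":
    counts = layers_per_block()
    print(counts)
    print("total conv layers:", sum(counts))
```

With 21 five-layer blocks and 2 single-layer blocks, the encoder has 21 * 5 + 2 = 107 convolution layers in total.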

1D Squeeze-and-excitation (SE) module

Average pooling is applied to condense the convolution result into a 1D vector, which is then followed by two fully connected (FC) layers with activation functions. The output goes through a sigmoid function to be mapped to (0, 1) and is then tiled and applied to the convolution output using pointwise multiplication.
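The SE steps described above can be sketched in PyTorch as follows. This is a minimal illustration, not the module used in this repository: the class name `SqueezeExcite1d`, the bottleneck `reduction` of 8, and the choice of SiLU (swish) as the inner activation are assumptions, and the sketch averages over all time steps without masking padded frames.

```python
import torch
import torch.nn as nn


class SqueezeExcite1d(nn.Module):
    """Sketch of a 1D squeeze-and-excitation gate (names and sizes are assumptions)."""

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction)
        self.fc2 = nn.Linear(channels // reduction, channels)
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time)
        context = x.mean(dim=2)                # squeeze: average over time
        hidden = self.act(self.fc1(context))   # first FC layer + activation
        gate = torch.sigmoid(self.fc2(hidden)) # second FC layer, mapped to (0, 1)
        return x * gate.unsqueeze(2)           # broadcast over time, scale pointwise


if __name__ == "__main__":
    x = torch.randn(2, 16, 50)
    y = SqueezeExcite1d(channels=16)(x)
    print(y.shape)
```

Because the gate is computed from a time average, every frame of a channel is scaled by the same value, which is how global context reaches each local convolution output.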

Please check the paper for more details.

Installation

pip install -e .

Usage

from contextnet.model import ContextNet
import torch

BATCH_SIZE, SEQ_LENGTH, INPUT_SIZE, NUM_VOCABS = 3, 500, 80, 10

cuda = torch.cuda.is_available()
device = torch.device('cuda' if cuda else 'cpu')

model = ContextNet(
    model_size='large',
    num_vocabs=NUM_VOCABS,
).to(device)

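# Random features stand in for real acoustic input; torch.FloatTensor(...) would leave the values uninitialized
inputs = torch.rand(BATCH_SIZE, SEQ_LENGTH, INPUT_SIZE).to(device)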
input_lengths = torch.IntTensor([500, 450, 350])
targets = torch.LongTensor([[1, 3, 3, 3, 3, 3, 4, 5, 6, 2],
                            [1, 3, 3, 3, 3, 3, 4, 5, 2, 0],
                            [1, 3, 3, 3, 3, 3, 4, 2, 0, 0]]).to(device)
target_lengths = torch.LongTensor([9, 8, 7])

# Forward propagate
outputs = model(inputs, input_lengths, targets, target_lengths)

# Recognize input speech
outputs = model.recognize(inputs, input_lengths)

License

Copyright 2021 Sangchun Ha.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.


Owner

Sangchun Ha
