Textlesslib - Library for Textless Spoken Language Processing

Overview

textlesslib

License: MIT Python 3.8 Code style: black

Textless NLP is an active area of research that aims to extend NLP techniques to work directly on spoken language. By using self-supervisedly learnt discrete speech representations, the area promises to unlock interesting NLP applications on languages without written form or on facets of spoken language that are unaccessable for text-based approaches, e.g. prosody. To learn more, please check some of the papers.

textlesslib is a library aimed to facilitate research in Textless NLP. The goal of the library is to speed up the research cycle and lower the learning curve for those who want to start. We provide highly configurable, off-the-shelf available tools to encode speech as sequences of discrete values and tools to decode such streams back into the audio domain.

Table of Contents

Installation

git clone [email protected]:facebookresearch/textlesslib.git
cd textlesslib
pip install -e .
pip install git+git://github.com:pytorch/[email protected]

Usage examples

We include a set of examples in the examples folder:

There is also a [Jupyter notebook] and a [Google Colab] that combine discrete resynthesis and speech continuation examples in a step-by-step mini-tutorial.

We believe those examples can serve both as illustrations for the provided components and provide a starting point for tinkering in interesting directions.

Encoding speech

Below is an example on loading an audio example and encoding it as a sequence of HuBERT-based discrete tokens (aka pseudo-units). Downloading of the required checkpoints is handled by textlesslib itself (by default they are stored in ~/.textless):

import torchaudio
from textless.data.speech_encoder import SpeechEncoder

dense_model_name = "hubert-base-ls960"
quantizer_name, vocab_size = "kmeans", 100
input_file = "input.wav"

# now let's load an audio example
waveform, sample_rate = torchaudio.load(input_file)

# We can build a speech encoder module using names of pre-trained
# dense and quantizer models.  The call below will download
# appropriate checkpoints as needed behind the scenes. We can
# also construct an encoder by directly passing model instances
encoder = SpeechEncoder.by_name(
    dense_model_name=dense_model_name,
    quantizer_model_name=quantizer_name,
    vocab_size=vocab_size,
    deduplicate=True,
).cuda()


# now convert it in a stream of deduplicated units (as in GSLM)
encoded = encoder(waveform.cuda())
# encoded is a dict with keys ('dense', 'units', 'durations').
# It can also contain 'f0' if SpeechEncoder was initialized
# with need_f0=True flag.
units = encoded["units"]  # tensor([71, 12, 57, ...], ...)

Now it can be casted back into the audio domain:

# as with encoder, we can setup vocoder by passing checkpoints
# directly or by specifying the expected format by the names
# of dense and quantizer models (these models themselves
# won't be loaded)
vocoder = TacotronVocoder.by_name(
    dense_model_name,
    quantizer_name,
    vocab_size,
).cuda()

# now we turn those units back into the audio.
audio = vocoder(units)

# save the audio
torchaudio.save(output_file, audio.cpu().float().unsqueeze(0), vocoder.output_sample_rate)

Dataset helpers

Below is an example on using textless view on the LibriSpeech dataset:

encoder = SpeechEncoder.by_name(
  dense_model_name=dense_model_name,
  quantizer_model_name=quantizer_name,
  vocab_size=vocab_size,
  deduplicate=True,
).cuda()

quantized_dataset = QuantizedLibriSpeech(
  root=existing_root, speech_encoder=encoder, url=url)

datum = quantized_dataset[0]
sample_rate, utterance, speaker_id, chapter_id, utterance_id = datum['rest']
# datum['units'] = tensor([71, 12, 63, ...])

In the probing example we illustrate how such a dataset can be used with a standard Pytorch dataloader in a scalable manner.

Data preprocessing

We also provide a multi-GPU/multi-node preprocessing tool for the cases where on-the-fly processing of audio should be avoided.

Provided models

We provide implementations and pre-trained checkpoints for the following models:

  • Dense representations: HuBERT-base (trained on LibriSpeech 960h) and CPC (trained on 6Kh subset of LibriLight);
  • Quantizers: k-means quantizers with vocabulary sizes of 50, 100, 200 for both the dense models (trained on LibriSpeech 960h);
  • Decoders: Tacotron2 models for all (dense model x quantizer) combinations (trained on LJSpeech).

Finally, the pitch extraction is done via YAAPT.

Testing

We use pytest (pip install pytest pytest-xdist ). Our unit tests are located in the tests directory:

cd tests && pytest -n 8

Licence

textlesslib is licensed under MIT, the text of the license can be found here. Internally, it uses

Owner
Meta Research
Meta Research
Question and answer retrieval in Turkish with BERT

trfaq Google supported this work by providing Google Cloud credit. Thank you Google for supporting the open source! 🎉 What is this? At this repo, I'm

M. Yusuf Sarıgöz 13 Oct 10, 2022
Code for producing Japanese GPT-2 provided by rinna Co., Ltd.

japanese-gpt2 This repository provides the code for training Japanese GPT-2 models. This code has been used for producing japanese-gpt2-medium release

rinna Co.,Ltd. 491 Jan 07, 2023
Text Analysis & Topic Extraction on Android App user reviews

AndroidApp_TextAnalysis Hi, there! This is code archive for Text Analysis and Topic Extraction from user_reviews of Android App. Dataset Source : http

Fitrie Ratnasari 1 Feb 14, 2022
Analyse japanese ebooks using MeCab to determine the difficulty level for japanese learners

japanese-ebook-analysis This aim of this project is to make analysing the contents of a japanese ebook easy and streamline the process for non-technic

Christoffer Aakre 14 Jul 23, 2022
Machine learning classifiers to predict American Sign Language .

ASL-Classifiers American Sign Language (ASL) is a natural language that serves as the predominant sign language of Deaf communities in the United Stat

Tarek idrees 0 Feb 08, 2022
Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context

Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context This repository contains the code in both PyTorch and TensorFlow for our paper

Zhilin Yang 3.3k Dec 28, 2022
Mednlp - Medical natural language parsing and utility library

Medical natural language parsing and utility library A natural language medical

Paul Landes 3 Aug 24, 2022
A deep learning-based translation library built on Huggingface transformers

DL Translate A deep learning-based translation library built on Huggingface transformers and Facebook's mBART-Large 💻 GitHub Repository 📚 Documentat

Xing Han Lu 244 Dec 30, 2022
NLP-based analysis of poor Chinese movie reviews on Douban

douban_embedding 豆瓣中文影评差评分析 1. NLP NLP(Natural Language Processing)是指自然语言处理,他的目的是让计算机可以听懂人话。 下面是我将2万条豆瓣影评训练之后,随意输入一段新影评交给神经网络,最终AI推断出的结果。 "很好,演技不错

3 Apr 15, 2022
Non-Autoregressive Predictive Coding

Non-Autoregressive Predictive Coding This repository contains the implementation of Non-Autoregressive Predictive Coding (NPC) as described in the pre

Alexander H. Liu 43 Nov 15, 2022
Repository to hold code for the cap-bot varient that is being presented at the SIIC Defence Hackathon 2021.

capbot-siic Repository to hold code for the cap-bot varient that is being presented at the SIIC Defence Hackathon 2021. Problem Inspiration A plethora

Aryan Kargwal 19 Feb 17, 2022
HuggingTweets - Train a model to generate tweets

HuggingTweets - Train a model to generate tweets Create in 5 minutes a tweet generator based on your favorite Tweeter Make my own model with the demo

Boris Dayma 318 Jan 04, 2023
Pre-training BERT masked language models with custom vocabulary

Pre-training BERT Masked Language Models (MLM) This repository contains the method to pre-train a BERT model using custom vocabulary. It was used to p

Stella Douka 14 Nov 02, 2022
ChatterBot is a machine learning, conversational dialog engine for creating chat bots

ChatterBot ChatterBot is a machine-learning based conversational dialog engine build in Python which makes it possible to generate responses based on

Gunther Cox 12.8k Jan 03, 2023
Code for EMNLP20 paper: "ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training"

ProphetNet-X This repo provides the code for reproducing the experiments in ProphetNet. In the paper, we propose a new pre-trained language model call

Microsoft 394 Dec 17, 2022
A natural language modeling framework based on PyTorch

Overview PyText is a deep-learning based NLP modeling framework built on PyTorch. PyText addresses the often-conflicting requirements of enabling rapi

Facebook Research 6.4k Dec 27, 2022
Text vectorization tool to outperform TFIDF for classification tasks

WHAT: Supervised text vectorization tool Textvec is a text vectorization tool, with the aim to implement all the "classic" text vectorization NLP meth

186 Dec 29, 2022
RecipeReduce: Simplified Recipe Processing for Lazy Programmers

RecipeReduce This repo will help you figure out the amount of ingredients to buy for a certain number of meals with selected recipes. RecipeReduce Get

Qibin Chen 9 Apr 22, 2022
Voilà turns Jupyter notebooks into standalone web applications

Rendering of live Jupyter notebooks with interactive widgets. Introduction Voilà turns Jupyter notebooks into standalone web applications. Unlike the

Voilà Dashboards 4.5k Jan 03, 2023
Library for Russian imprecise rhymes generation

TOM RHYMER Library for Russian imprecise rhymes generation. Quick Start Generate rhymes by any given rhyme scheme (aabb, abab, aaccbb, etc ...): from

Alexey Karnachev 6 Oct 18, 2022