Model for recasing and repunctuating ASR transcripts

Overview

Recasing and punctuation model based on Bert

Benoit Favre 2021

This system converts a sequence of lowercase tokens without punctuation to a sequence of cased tokens with punctuation.

It is trained to predict both aspects at the token level in a multitask fashion, from fine-tuned BERT representations.

The model predicts the following recasing labels:

  • lower: keep lowercase
  • upper: convert to upper case
  • capitalize: set first letter as upper case
  • other: left as is

And the following punctuation labels:

  • o: no punctuation
  • period: .
  • comma: ,
  • question: ?
  • exclamation: !

Input tokens are batched as sequences of length 256 that are processed independently without overlap.

In training, batches containing less that 256 tokens are simulated by drawing uniformly a length and replacing all tokens and labels after that point with padding (called Cut-drop).

Changelong:

  • Fix generation when input is smaller than max length

Installation

Use your favourite method for installing Python requirements. For example:

python -mvenv env
. env/bin/activate
pip3 install -r requirements.txt -f https://download.pytorch.org/whl/torch_stable.html

Prediction

Predict from raw text:

python recasepunc.py predict checkpoint/path.iteration < input.txt > output.txt

Models

  • French: fr-txt.large.19000 trained on 160M tokens from Common Crawl
    • Iterations: 19000
    • Batch size: 16
    • Max length: 256
    • Seed: 871253
    • Cut-drop probability: 0.1
    • Train loss: 0.021128975618630648
    • Valid loss: 0.015684964135289192
    • Recasing accuracy: 96.73
    • Punctuation accuracy: 95.02
      • All punctuation F-score: 67.79
      • Comma F-score: 67.94
      • Period F-score: 72.91
      • Question F-score: 57.57
      • Exclamation mark F-score: 15.78
    • Training data: First 100M words from Common Crawl

Training

Notes: You need to modify file names adequately. Training tensors are precomputed and loaded in CPU memory.

Stage 0: download text data

Stage 1: tokenize and normalize text with Moses tokenizer, and extract recasing and repunctuation labels

python recasepunc.py preprocess < input.txt > input.case+punc

Stage 2: sub-tokenize with Flaubert tokenizer, and generate pytorch tensors

python recasepunc.py tensorize input.case+punc input.case+punc.x input.case+punc.y

Stage 3: train model

python recasepunc.py train train.x train.y valid.x valid.y checkpoint/path

Stage 4: evaluate performance on a test set

python recasepunc.py eval checkpoint/path.iteration test.x test.y
Comments
  • Is it possible to customize for new language?

    Is it possible to customize for new language?

    Dear Benoit Favre,

    Your project is really important! Is it possible to customize for new language? If yes, could you tell short hints for it?

    Thank you in advance!

    opened by ican24 5
  • Can't get attribute 'WordpieceTokenizer'

    Can't get attribute 'WordpieceTokenizer'

    Hi thanks for your effort on developing recasepunc! I know that you can't provide help for models not trained by you, but maybe you have an idea what's going wrong here:

    I'm loading the model vosk-recasepunc-de-0.21 from https://alphacephei.com/vosk/models. When I do so, torch tells me that it can't find WordpieceTokenizer. Do you know why? Is the model incompatible?

    Punc predict path: C:\Users\admin\meety\vosk-recasepunc-de-0.21\checkpoint Traceback (most recent call last): File "main2.py", line 120, in t = transcriber() File "main2.py", line 32, in init self.casePuncPredictor = CasePuncPredictor(punc_predict_path, lang="de") File "C:\Users\admin\meety\recasepunc.py", line 273, in init loaded = torch.load(checkpoint_path, map_location=device if torch.cuda.is_available() else 'cpu') File "C:\Users\admin\Anaconda3\envs\meety\lib\site-packages\torch\serialization.py", line 607, in load return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args) File "C:\Users\admin\Anaconda3\envs\meety\lib\site-packages\torch\serialization.py", line 882, in _load result = unpickler.load() File "C:\Users\admin\Anaconda3\envs\meety\lib\site-packages\torch\serialization.py", line 875, in find_class return super().find_class(mod_name, name) AttributeError: Can't get attribute 'WordpieceTokenizer' on <module 'main' from 'main2.py'>

    opened by padmalcom 4
  • Can't do inference

    Can't do inference

    Hello, I'm trying to use example.py on a french model (fr.22000 or fr-txt.large.19000) But I have this error: raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format( RuntimeError: Error(s) in loading state_dict for Model: Unexpected key(s) in state_dict: "bert.position_ids". I also tried with the following command, same error in output. python recasepunc.py predict fr.22000 < toto.txt > output.txt Do you have any advice? Thanks

    opened by MatFrancois 3
  • Memory usage

    Memory usage

    Hi, on start punctuation app use about 9Gb RAM, but in one moment(in load model ). Then we need about 1.5GB. Can we reduce 9GB on start? maybe on start we check our model and it feature can be turn off?

    opened by gubri 1
  • Russian model doesn't work, while English does

    Russian model doesn't work, while English does

    When I use Russian model, it gives me this error:

    WARNING: reverting to cpu as cuda is not available
    Some weights of the model checkpoint at DeepPavlov/rubert-base-cased were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.predictions.decoder.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight']
    
     File "C:\pypy\rus\recasepunc.py", line 741, in <module>
        main(config, config.action, config.action_args)
      File "C:\pypy\rus\recasepunc.py", line 715, in main
        generate_predictions(config, *args)
      File "C:\pypy\rus\recasepunc.py", line 349, in generate_predictions
        for line in sys.stdin:
      File "C:\Users\Xenia\AppData\Local\Programs\Python\Python39\lib\codecs.py", line 322, in decode
        (result, consumed) = self._buffer_decode(data, self.errors, final)
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xef in position 0: invalid continuation byte
    
     File "C:\Users\Xenia\AppData\Local\Programs\Python\Python39\lib\site-packages\flask\app.py", line 1796, in dispatch_request
        return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)      
      File "C:\pypy\app.py", line 32, in process_audio
        cased = subprocess.check_output('python rus/recasepunc.py predict rus/checkpoint', shell=True, text=True, input=text)
      File "C:\Users\Xenia\AppData\Local\Programs\Python\Python39\lib\subprocess.py", 
    line 420, in check_output
        return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
      File "C:\Users\Xenia\AppData\Local\Programs\Python\Python39\lib\subprocess.py", 
    line 524, in run
        raise CalledProcessError(retcode, process.args,
    subprocess.CalledProcessError: Command 'python rus/recasepunc.py predict rus/checkpoint' returned non-zero exit status 1.
    

    Sorry for a long message, I'm not sure which of these messages are the most important. Should I use another version of transformers? I use transformers==4.16.2 and it works fine with English model.

    opened by xenia19 0
  • Export model to be Used in C++

    Export model to be Used in C++

    Is it possible that export model to something that can be used in C++ using libtorch?

    1. export existing model(checkpoint provided in this repo)
    2. export model after I train with my own data which option above possible, or both?
    opened by leohuang2013 0
  • RuntimeError when predicting with the french models

    RuntimeError when predicting with the french models

    I tried to use the french models (both fr.22000 and fr-txt.large.19000) on a very simple text:

    j'aime les fleurs les olives et la raclette

    When running python3 recasepunc.py predict fr.22000 < input.txt > output.txt (or with the other model), I get the following RuntimeError:

    Traceback (most recent call last):
      File "/home/mael/charly/recasepunc/recasepunc.py", line 733, in <module>
        main(config, config.action, config.action_args)
      File "/home/mael/charly/recasepunc/recasepunc.py", line 707, in main
        generate_predictions(config, *args)
      File "/home/mael/charly/recasepunc/recasepunc.py", line 336, in generate_predictions
        model.load_state_dict(loaded['model_state_dict'])
      File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1497, in load_state_dict
        raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
    RuntimeError: Error(s) in loading state_dict for Model:
    	Unexpected key(s) in state_dict: "bert.position_ids".
    

    I tried the same with the english model, and it worked perfectly. Looks like something is broken with the french ones?

    opened by maelchiotti 2
  • parameters like --dab_rate can't be set from cmd line bc they are bool

    parameters like --dab_rate can't be set from cmd line bc they are bool

    look at parameters below. They really became bool, i find this bug while debugging it. ''' if name == 'main': parser = argparse.ArgumentParser() parser.add_argument("action", help="train|eval|predict|tensorize|preprocess", type=str) ... parser.add_argument("--updates", help="number of training updates to perform", default=default_config.updates, type=bool) parser.add_argument("--period", help="validation period in updates", default=default_config.period, type=bool) parser.add_argument("--lr", help="learning rate", default=default_config.lr, type=bool) parser.add_argument("--dab-rate", help="drop at boundaries rate", default=default_config.dab_rate, type=bool) config = Config(**parser.parse_args().dict)

    main(config, config.action, config.action_args)
    

    '''

    opened by al-zatv 0
  • Cannot use trained model for validation or prediction

    Cannot use trained model for validation or prediction

    Hi, thank you for this repo! I'm trying to reproduce results for different language, so I'm using multilingual-bert fine-tuned to my language dataset. Everything goes well during preprocessing and training, the resuls are comparable with those for English and French (97-99% for case and punctuation).

    But when I try to use trained model, it gives very poor results even for sentences from training dataset. It works, sometimes it puts capital letters or dots, but it's rare and mostly model can't handle. Also when I try to evaluate model with command from the README (also tried it for already used validation sets, for instance with command python recasepunc.py eval bertugan_casepunc.24000 valid.case+punc.x valid.case+punc.y) it gives error:

    File "recasepunc.py", line 220, in batchify x = x[:(len(x) // max_length) * max_length].reshape(-1, max_length) TypeError: unhashable type: 'slice'

    Sorry for pointing to two different problems in one Issue, but I though maybe it can be one common mistake for both cases.

    opened by khusainovaidar 5
Releases(0.3)
Owner
Benoit Favre
Benoit Favre
Question and answer retrieval in Turkish with BERT

trfaq Google supported this work by providing Google Cloud credit. Thank you Google for supporting the open source! 🎉 What is this? At this repo, I'm

M. Yusuf Sarıgöz 13 Oct 10, 2022
[Preprint] Escaping the Big Data Paradigm with Compact Transformers, 2021

Compact Transformers Preprint Link: Escaping the Big Data Paradigm with Compact Transformers By Ali Hassani[1]*, Steven Walton[1]*, Nikhil Shah[1], Ab

SHI Lab 367 Dec 31, 2022
ProtFeat is protein feature extraction tool that utilizes POSSUM and iFeature.

Description: ProtFeat is designed to extract the protein features by employing POSSUM and iFeature python-based tools. ProtFeat includes a total of 39

GOKHAN OZSARI 5 Dec 16, 2022
Multilingual Emotion classification using BERT (fine-tuning). Published at the WASSA workshop (ACL2022).

XLM-EMO: Multilingual Emotion Prediction in Social Media Text Abstract Detecting emotion in text allows social and computational scientists to study h

MilaNLP 35 Sep 17, 2022
Repo for Enhanced Seq2Seq Autoencoder via Contrastive Learning for Abstractive Text Summarization

ESACL: Enhanced Seq2Seq Autoencoder via Contrastive Learning for AbstractiveText Summarization This repo is for our paper "Enhanced Seq2Seq Autoencode

Rachel Zheng 14 Nov 01, 2022
This is the code for the EMNLP 2021 paper AEDA: An Easier Data Augmentation Technique for Text Classification

The baseline code is for EDA: Easy Data Augmentation techniques for boosting performance on text classification tasks

Akbar Karimi 81 Dec 09, 2022
Creating an Audiobook (mp3 file) using a Ebook (epub) using BeautifulSoup and Google Text to Speech

epub2audiobook Creating an Audiobook (mp3 file) using a Ebook (epub) using BeautifulSoup and Google Text to Speech Input examples qual a pasta do seu

7 Aug 25, 2022
Text Classification in Turkish Texts with Bert

You can watch the details of the project on my youtube channel Project Interface Project Second Interface Goal= Correctly guessing the classification

42 Dec 31, 2022
**NSFW** A chatbot based on GPT2-chitchat

DangBot -- 好怪哦,再来一句 卡群怪话bot,powered by GPT2 for Chinese chitchat Training Example: python train.py --lr 5e-2 --epochs 30 --max_len 300 --batch_size 8

Tommy Yang 11 Jul 21, 2022
Tools for curating biomedical training data for large-scale language modeling

Tools for curating biomedical training data for large-scale language modeling

BigScience Workshop 242 Dec 25, 2022
BERTAC (BERT-style transformer-based language model with Adversarially pretrained Convolutional neural network)

BERTAC (BERT-style transformer-based language model with Adversarially pretrained Convolutional neural network) BERTAC is a framework that combines a

6 Jan 24, 2022
Official PyTorch code for ClipBERT, an efficient framework for end-to-end learning on image-text and video-text tasks

Official PyTorch code for ClipBERT, an efficient framework for end-to-end learning on image-text and video-text tasks. It takes raw videos/images + text as inputs, and outputs task predictions. ClipB

Jie Lei 雷杰 612 Jan 04, 2023
PyTorch implementation of the paper: Text is no more Enough! A Benchmark for Profile-based Spoken Language Understanding

Text is no more Enough! A Benchmark for Profile-based Spoken Language Understanding This repository contains the official PyTorch implementation of th

Xiao Xu 26 Dec 14, 2022
DataCLUE: 国内首个以数据为中心的AI测评(含模型分析报告)

DataCLUE 以数据为中心的AI测评(DataCLUE) DataCLUE: A Chinese Data-centric Language Evaluation Benchmark 内容导引 章节 描述 简介 介绍以数据为中心的AI测评(DataCLUE)的背景 任务描述 任务描述 实验结果

CLUE benchmark 135 Dec 22, 2022
Repository for Project Insight: NLP as a Service

Project Insight NLP as a Service Contents Introduction Features Installation Setup and Documentation Project Details Demonstration Directory Details H

Abhishek Kumar Mishra 286 Dec 06, 2022
This Project is based on NLTK It generates a RANDOM WORD from a predefined list of words, From that random word it read out the word, its meaning with parts of speech , its antonyms, its synonyms

This Project is based on NLTK(Natural Language Toolkit) It generates a RANDOM WORD from a predefined list of words, From that random word it read out the word, its meaning with parts of speech , its

SaiVenkatDhulipudi 2 Nov 17, 2021
precise iris segmentation

PI-DECODER Introduction PI-DECODER, a decoder structure designed for Precise Iris Segmentation and Location. The decoder structure is shown below: Ple

8 Aug 08, 2022
☀️ Measuring the accuracy of BBC weather forecasts in Honolulu, USA

Accuracy of BBC Weather forecasts for Honolulu This repository records the forecasts made by BBC Weather for the city of Honolulu, USA. Essentially, t

Max Halford 12 Oct 15, 2022
Text to speech for Vietnamese, ez to use, ez to update

Chào mọi người, đây là dự án mở nhằm giúp việc đọc được trở nên dễ dàng hơn. Rất cảm ơn đội ngũ Zalo đã cung cấp hạ tầng để mình có thể tạo ra app này

Trần Cao Minh Bách 32 Jul 29, 2022
List of GSoC organisations with number of times they have been selected.

Welcome to GSoC Organisation Frequency And Details 👋 List of GSoC organisations with number of times they have been selected, techonologies, topics,

Shivam Kumar Jha 41 Oct 01, 2022