Model for recasing and repunctuating ASR transcripts

Overview

Recasing and punctuation model based on BERT

Benoit Favre 2021

This system converts a sequence of lowercase tokens without punctuation to a sequence of cased tokens with punctuation.

It is trained to predict both aspects at the token level in a multitask fashion, from fine-tuned BERT representations (see the sketch after the label lists below).

The model predicts the following recasing labels:

  • lower: keep lowercase
  • upper: convert to upper case
  • capitalize: set first letter as upper case
  • other: leave as is

And the following punctuation labels:

  • o: no punctuation
  • period: .
  • comma: ,
  • question: ?
  • exclamation: !
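
As an illustration of this multitask setup, here is a minimal sketch of a two-headed token classifier in PyTorch. The class and layer names are assumptions for illustration, not the project's actual code; pass the BERT variant matching your language (e.g. a Flaubert model for French).

import torch.nn as nn
from transformers import AutoModel

class RecasePuncSketch(nn.Module):
    # Hypothetical layout: one shared BERT encoder and one linear
    # classifier per task, applied to every token position.
    def __init__(self, pretrained_name, n_case=4, n_punc=5):
        super().__init__()
        self.bert = AutoModel.from_pretrained(pretrained_name)
        hidden = self.bert.config.hidden_size
        self.case_head = nn.Linear(hidden, n_case)  # lower/upper/capitalize/other
        self.punc_head = nn.Linear(hidden, n_punc)  # o/period/comma/question/exclamation

    def forward(self, input_ids, attention_mask=None):
        h = self.bert(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        return self.case_head(h), self.punc_head(h)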

Input tokens are batched into sequences of length 256, which are processed independently and without overlap.
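
A minimal reconstruction of this batching step, based on the batchify excerpt quoted in the issues at the end of this page (the exact signature is otherwise an assumption):

def batchify(x, max_length=256):
    # Truncate the 1-D token tensor to a multiple of max_length, then
    # reshape it into non-overlapping rows of max_length tokens each.
    x = x[:(len(x) // max_length) * max_length]
    return x.reshape(-1, max_length)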

In training, batches containing fewer than 256 tokens are simulated by drawing a length uniformly and replacing all tokens and labels after that point with padding (a technique called Cut-drop).
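
A minimal sketch of Cut-drop under these assumptions (the function and padding names are hypothetical; the French model below was trained with a Cut-drop probability of 0.1):

import random

def cut_drop(x, y, pad_token_id, pad_label_id, rate=0.1, max_length=256):
    # With probability `rate`, draw a cut point uniformly and replace all
    # tokens and labels after it with padding, simulating a short batch.
    if random.random() < rate:
        cut = random.randint(1, max_length - 1)
        x, y = x.clone(), y.clone()
        x[cut:] = pad_token_id
        y[cut:] = pad_label_id
    return x, y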

Changelog:

  • Fix generation when input is smaller than max length

Installation

Use your favourite method for installing Python requirements. For example:

python -m venv env
. env/bin/activate
pip3 install -r requirements.txt -f https://download.pytorch.org/whl/torch_stable.html

Prediction

Predict from raw text:

python recasepunc.py predict checkpoint/path.iteration < input.txt > output.txt
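
For example, with the French checkpoint listed below (the output shown in the comment is an assumption; results depend on the checkpoint):

echo "j'aime les fleurs les olives et la raclette" > input.txt
python recasepunc.py predict fr-txt.large.19000 < input.txt > output.txt
cat output.txt  # expected along the lines of: J'aime les fleurs, les olives et la raclette.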

Models

  • French: fr-txt.large.19000 trained on 160M tokens from Common Crawl
    • Iterations: 19000
    • Batch size: 16
    • Max length: 256
    • Seed: 871253
    • Cut-drop probability: 0.1
    • Train loss: 0.021128975618630648
    • Valid loss: 0.015684964135289192
    • Recasing accuracy: 96.73
    • Punctuation accuracy: 95.02
      • All punctuation F-score: 67.79
      • Comma F-score: 67.94
      • Period F-score: 72.91
      • Question F-score: 57.57
      • Exclamation mark F-score: 15.78
    • Training data: First 100M words from Common Crawl

Training

Note: adjust the file names in the commands below as needed. Training tensors are precomputed and loaded into CPU memory.

Stage 0: download text data

Stage 1: tokenize and normalize the text with the Moses tokenizer, and extract recasing and repunctuation labels

python recasepunc.py preprocess < input.txt > input.case+punc
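
For intuition, label extraction could look roughly like the following; this is an assumed reconstruction based on the label sets above, not the project's actual preprocessing code:

def labels_for(token, next_punct):
    # Map a cased token and the punctuation mark following it to the
    # (lowercased token, case label, punctuation label) training triple.
    if token.islower():
        case = 'lower'
    elif token.isupper():
        case = 'upper'
    elif token[:1].isupper() and token[1:].islower():
        case = 'capitalize'
    else:
        case = 'other'
    punct = {'.': 'period', ',': 'comma', '?': 'question', '!': 'exclamation'}.get(next_punct, 'o')
    return token.lower(), case, punct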

Stage 2: sub-tokenize with the Flaubert tokenizer, and generate PyTorch tensors

python recasepunc.py tensorize input.case+punc input.case+punc.x input.case+punc.y

Stage 3: train model

python recasepunc.py train train.x train.y valid.x valid.y checkpoint/path
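
Training optimizes both tasks jointly; a plausible objective is the sum of per-task cross-entropies over each 256-token row (an assumption for illustration; the ignore_index for padded labels is hypothetical):

import torch.nn.functional as F

def multitask_loss(case_logits, punc_logits, case_labels, punc_labels, ignore_index=-100):
    # Logits have shape (batch, seq, n_labels) and labels (batch, seq);
    # cross_entropy expects (batch, n_labels, seq), hence the transposes.
    loss_case = F.cross_entropy(case_logits.transpose(1, 2), case_labels, ignore_index=ignore_index)
    loss_punc = F.cross_entropy(punc_logits.transpose(1, 2), punc_labels, ignore_index=ignore_index)
    return loss_case + loss_punc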

Stage 4: evaluate performance on a test set

python recasepunc.py eval checkpoint/path.iteration test.x test.y
Comments
  • Is it possible to customize for new language?


    Dear Benoit Favre,

    Your project is really important! Is it possible to customize it for a new language? If so, could you give some short hints on how to do it?

    Thank you in advance!

    opened by ican24 5
  • Can't get attribute 'WordpieceTokenizer'


    Hi, thanks for your effort on developing recasepunc! I know that you can't provide help for models not trained by you, but maybe you have an idea of what's going wrong here:

    I'm loading the model vosk-recasepunc-de-0.21 from https://alphacephei.com/vosk/models. When I do so, torch tells me that it can't find WordpieceTokenizer. Do you know why? Is the model incompatible?

    Punc predict path: C:\Users\admin\meety\vosk-recasepunc-de-0.21\checkpoint

    Traceback (most recent call last):
      File "main2.py", line 120, in <module>
        t = transcriber()
      File "main2.py", line 32, in __init__
        self.casePuncPredictor = CasePuncPredictor(punc_predict_path, lang="de")
      File "C:\Users\admin\meety\recasepunc.py", line 273, in __init__
        loaded = torch.load(checkpoint_path, map_location=device if torch.cuda.is_available() else 'cpu')
      File "C:\Users\admin\Anaconda3\envs\meety\lib\site-packages\torch\serialization.py", line 607, in load
        return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
      File "C:\Users\admin\Anaconda3\envs\meety\lib\site-packages\torch\serialization.py", line 882, in _load
        result = unpickler.load()
      File "C:\Users\admin\Anaconda3\envs\meety\lib\site-packages\torch\serialization.py", line 875, in find_class
        return super().find_class(mod_name, name)
    AttributeError: Can't get attribute 'WordpieceTokenizer' on <module '__main__' from 'main2.py'>

    opened by padmalcom 4
  • Can't do inference


    Hello, I'm trying to use example.py on a French model (fr.22000 or fr-txt.large.19000), but I get this error:

        raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
    RuntimeError: Error(s) in loading state_dict for Model:
        Unexpected key(s) in state_dict: "bert.position_ids".

    I also tried the following command, with the same error in the output: python recasepunc.py predict fr.22000 < toto.txt > output.txt. Do you have any advice? Thanks

    opened by MatFrancois 3
  • Memory usage


    Hi, at startup the punctuation app uses about 9 GB of RAM, but only momentarily (while loading the model). After that, it needs about 1.5 GB. Can we reduce the 9 GB at startup? Maybe the model is checked at startup, and that feature can be turned off?

    opened by gubri 1
  • Russian model doesn't work, while English does


    When I use the Russian model, it gives me this error:

    WARNING: reverting to cpu as cuda is not available
    Some weights of the model checkpoint at DeepPavlov/rubert-base-cased were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.predictions.decoder.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight']
    
     File "C:\pypy\rus\recasepunc.py", line 741, in <module>
        main(config, config.action, config.action_args)
      File "C:\pypy\rus\recasepunc.py", line 715, in main
        generate_predictions(config, *args)
      File "C:\pypy\rus\recasepunc.py", line 349, in generate_predictions
        for line in sys.stdin:
      File "C:\Users\Xenia\AppData\Local\Programs\Python\Python39\lib\codecs.py", line 322, in decode
        (result, consumed) = self._buffer_decode(data, self.errors, final)
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xef in position 0: invalid continuation byte
    
     File "C:\Users\Xenia\AppData\Local\Programs\Python\Python39\lib\site-packages\flask\app.py", line 1796, in dispatch_request
        return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)      
      File "C:\pypy\app.py", line 32, in process_audio
        cased = subprocess.check_output('python rus/recasepunc.py predict rus/checkpoint', shell=True, text=True, input=text)
      File "C:\Users\Xenia\AppData\Local\Programs\Python\Python39\lib\subprocess.py", 
    line 420, in check_output
        return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
      File "C:\Users\Xenia\AppData\Local\Programs\Python\Python39\lib\subprocess.py", 
    line 524, in run
        raise CalledProcessError(retcode, process.args,
    subprocess.CalledProcessError: Command 'python rus/recasepunc.py predict rus/checkpoint' returned non-zero exit status 1.
    

    Sorry for the long message; I'm not sure which of these messages is the most important. Should I use another version of transformers? I use transformers==4.16.2 and it works fine with the English model.

    opened by xenia19 0
  • Export model to be Used in C++


    Is it possible to export the model to something that can be used in C++ via libtorch?

    1. export an existing model (a checkpoint provided in this repo)
    2. export a model after training with my own data

    Which of the options above is possible, or both?
    opened by leohuang2013 0
  • RuntimeError when predicting with the french models


    I tried to use the French models (both fr.22000 and fr-txt.large.19000) on a very simple text:

    j'aime les fleurs les olives et la raclette

    When running python3 recasepunc.py predict fr.22000 < input.txt > output.txt (or with the other model), I get the following RuntimeError:

    Traceback (most recent call last):
      File "/home/mael/charly/recasepunc/recasepunc.py", line 733, in <module>
        main(config, config.action, config.action_args)
      File "/home/mael/charly/recasepunc/recasepunc.py", line 707, in main
        generate_predictions(config, *args)
      File "/home/mael/charly/recasepunc/recasepunc.py", line 336, in generate_predictions
        model.load_state_dict(loaded['model_state_dict'])
      File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1497, in load_state_dict
        raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
    RuntimeError: Error(s) in loading state_dict for Model:
    	Unexpected key(s) in state_dict: "bert.position_ids".
    

    I tried the same with the English model, and it worked perfectly. It looks like something is broken with the French ones?

    opened by maelchiotti 2
  • parameters like --dab_rate can't be set from cmd line bc they are bool


    Look at the parameters below. They really became bool; I found this bug while debugging.

    if __name__ == '__main__':
        parser = argparse.ArgumentParser()
        parser.add_argument("action", help="train|eval|predict|tensorize|preprocess", type=str)
        ...
        parser.add_argument("--updates", help="number of training updates to perform", default=default_config.updates, type=bool)
        parser.add_argument("--period", help="validation period in updates", default=default_config.period, type=bool)
        parser.add_argument("--lr", help="learning rate", default=default_config.lr, type=bool)
        parser.add_argument("--dab-rate", help="drop at boundaries rate", default=default_config.dab_rate, type=bool)
        config = Config(**parser.parse_args().__dict__)

        main(config, config.action, config.action_args)

    opened by al-zatv 0
  • Cannot use trained model for validation or prediction


    Hi, thank you for this repo! I'm trying to reproduce the results for a different language, so I'm using multilingual BERT fine-tuned on my language's dataset. Everything goes well during preprocessing and training, and the results are comparable with those for English and French (97-99% for case and punctuation).

    But when I try to use the trained model, it gives very poor results, even for sentences from the training dataset. It works in that it sometimes puts capital letters or dots, but this is rare and mostly the model can't handle the input. Also, when I try to evaluate the model with the command from the README (I also tried it on already-used validation sets, for instance with python recasepunc.py eval bertugan_casepunc.24000 valid.case+punc.x valid.case+punc.y), it gives this error:

    File "recasepunc.py", line 220, in batchify x = x[:(len(x) // max_length) * max_length].reshape(-1, max_length) TypeError: unhashable type: 'slice'

    Sorry for pointing to two different problems in one issue, but I thought maybe one common mistake could explain both cases.

    opened by khusainovaidar 5
Releases (0.3)
Owner
Benoit Favre