📝An easy-to-use package to restore punctuation of the text.

Last update: Dec 30, 2022

Overview

✏️ rpunct - Restore Punctuation

This repo contains code for punctuation restoration.

This package is intended for direct use as a punctuation restoration model for general English text. Alternatively, you can fine-tune it further on domain-specific texts for punctuation restoration tasks. It uses HuggingFace's bert-base-uncased model weights, fine-tuned for punctuation restoration.

Punctuation restoration works on arbitrarily large text and uses a GPU if one is available; otherwise it defaults to the CPU.
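One way to handle arbitrarily large text is to split it into fixed-size word windows and punctuate each window separately. The sketch below is an illustration under that assumption, not the package's internal logic: the names split_into_chunks and punctuate_long_text and the default of 250 words are hypothetical, and punctuate stands for any callable that maps a string to a string.

```python
def split_into_chunks(text, words_per_chunk=250):
    """Split text into consecutive chunks of at most words_per_chunk words."""
    if words_per_chunk <= 0:
        raise ValueError("words_per_chunk must be positive")
    words = text.split()
    return [
        " ".join(words[i:i + words_per_chunk])
        for i in range(0, len(words), words_per_chunk)
    ]


def punctuate_long_text(text, punctuate, words_per_chunk=250):
    """Apply a punctuate(str) -> str callable chunk by chunk and rejoin the results.

    Chunks do not overlap, so context is lost at chunk boundaries;
    this is a simplification made for the sketch.
    """
    chunks = split_into_chunks(text, words_per_chunk)
    return " ".join(punctuate(chunk) for chunk in chunks)
```

With the package installed, the punctuate method of a RestorePuncts instance could be passed as the callable.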

Punctuation marks and casing we restore:

Upper-casing
Period: .
Exclamation: !
Question Mark: ?
Comma: ,
Colon: :
Semi-colon: ;
Apostrophe: '
Dash: -
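To make the list above concrete, here is a minimal sketch of how per-word predictions could be turned back into text. It assumes a hypothetical two-character label per word: the first character is the punctuation mark to append ("O" for none) and the second is "U" to capitalize the word ("O" to leave it as is). Both the label scheme and the function name apply_labels are assumptions for illustration, not the package's documented format.

```python
def apply_labels(words, labels):
    """Rebuild text from words and hypothetical two-character labels.

    label[0]: punctuation mark to append, or "O" for none.
    label[1]: "U" to capitalize the first letter, "O" to leave the word unchanged.
    """
    if len(words) != len(labels):
        raise ValueError("words and labels must have the same length")
    restored = []
    for word, label in zip(words, labels):
        punct, case = label[0], label[1]
        if case == "U":
            word = word[:1].upper() + word[1:]
        if punct != "O":
            word += punct
        restored.append(word)
    return " ".join(restored)


words = "as successful as it was that approach had a weakness".split()
labels = ["OU", "OO", "OO", "OO", ",O", "OO", "OO", "OO", "OO", ".O"]
print(apply_labels(words, labels))
```

The sketch joins words with single spaces, so it would not produce correct spacing for a dash or apostrophe that sits inside a word.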

🚀 Usage

Below is a quick way to get up and running with the model.

First, install the package.

pip install rpunct

Sample Python code:

from rpunct import RestorePuncts
# The default language is 'english'
rpunct = RestorePuncts()
rpunct.punctuate("""in 2018 cornell researchers built a high-powered detector that in combination with an algorithm-driven process called ptychography set a world record
by tripling the resolution of a state-of-the-art electron microscope as successful as it was that approach had a weakness it only worked with ultrathin samples that were
a few atoms thick anything thicker would cause the electrons to scatter in ways that could not be disentangled now a team again led by david muller the samuel b eckert
professor of engineering has bested its own record by a factor of two with an electron microscope pixel array detector empad that incorporates even more sophisticated
3d reconstruction algorithms the resolution is so fine-tuned the only blurring that remains is the thermal jiggling of the atoms themselves""")
# Outputs the following:
# In 2018, Cornell researchers built a high-powered detector that, in combination with an algorithm-driven process called Ptychography, set a world record by tripling the
# resolution of a state-of-the-art electron microscope. As successful as it was, that approach had a weakness. It only worked with ultrathin samples that were a few atoms
# thick. Anything thicker would cause the electrons to scatter in ways that could not be disentangled. Now, a team again led by David Muller, the Samuel B. 
# Eckert Professor of Engineering, has bested its own record by a factor of two with an Electron microscope pixel array detector empad that incorporates even more
# sophisticated 3d reconstruction algorithms. The resolution is so fine-tuned the only blurring that remains is the thermal jiggling of the atoms themselves.

🎯 Accuracy

Here is the number of product reviews we used for fine-tuning the model:

Language	Number of text samples
English	560,000

We found the best convergence at around 3 epochs; that is the model presented here and available for download.

The fine-tuned model obtained the following accuracy on 45,990 held-out text samples:

Accuracy	Overall F1	Eval Support
91%	90%	45,990
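The README does not say how the accuracy and F1 figures were computed. As a rough illustration only, the sketch below computes token-level accuracy and a macro-averaged F1 over per-word labels; the function names and the example labels are hypothetical, and the real evaluation may use a different averaging scheme.

```python
from collections import Counter


def accuracy(gold, pred):
    """Fraction of positions where the predicted label equals the gold label."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        return 0.0
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred):
    """Unweighted mean of per-label F1 scores."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p where it was wrong
            fn[g] += 1  # missed the gold label g
    labels = set(gold) | set(pred)
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


gold = ["OO", ".O", "OU", "OO"]
pred = ["OO", ".O", "OO", "OO"]
print(accuracy(gold, pred), macro_f1(gold, pred))
```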

💻 🎯 Further Fine-Tuning

To start fine-tuning or training, see the training/train.py file. Running python training/train.py will replicate the results of this model.
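For fine-tuning on domain-specific text, punctuated text first has to be turned into training examples. The sketch below shows one possible way to do that, assuming a hypothetical two-character label per word (punctuation mark or "O", then "U" for a capitalized word or "O"). The function name text_to_examples and the label scheme are illustrative assumptions; training/train.py may prepare its data differently.

```python
PUNCT_MARKS = ".!?,:;"


def text_to_examples(text):
    """Turn punctuated text into (lowercased word, label) pairs.

    Only a single trailing mark from PUNCT_MARKS is stripped per token;
    apostrophes and dashes inside words are left untouched.
    """
    examples = []
    for token in text.split():
        punct = "O"
        if token[-1] in PUNCT_MARKS:
            punct = token[-1]
            token = token[:-1]
        if not token:
            continue  # the token was a lone punctuation mark
        case = "U" if token[0].isupper() else "O"
        examples.append((token.lower(), punct + case))
    return examples


print(text_to_examples("Now, a team led by David Muller."))
```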

☕ Contact

Contact Daulet Nurmanbetov for questions, feedback and/or requests for similar models.

Comments

Update requirements.txt

ERROR: Could not find a version that satisfies the requirement torch==1.8.1 (from rpunct) (from versions: 1.11.0, 1.12.0, 1.12.1, 1.13.0)
ERROR: No matching distribution found for torch==1.8.1

opened by Rukaya-lab 0
Forked repo with fixes
I forked this repository (link here) to fix the outdated dependencies and incompatibility with non-CUDA machines. If anyone needs these fixes, feel free to install from the fork:

pip install git+https://github.com/samwaterbury/rpunct.git

Hopefully this repository is updated or another maintainer is assigned. And thanks to the creator @Felflare, this is a useful tool!
opened by samwaterbury 2
Requirements shouldn't ask for such specific versions

First, thanks a lot for providing this package :)

Currently, the requirements.txt, and thus the dependencies in setup.py, pin very specific versions of PyTorch and other packages. This shouldn't be the case if you want this package to be used as a general library (think of a second package that does the same but asks for an incompatible version of PyTorch, which would prevent installing the two together). The end user might also need a more recent version of PyTorch. Given that PyTorch is almost always backward compatible, and quite stable, I think the requirement for it could be changed from ==1.8.1 to >=1.8.1. I believe the same would be true for the other packages.

opened by adefossez 2
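The change proposed in this comment can be sketched as a small helper that rewrites exact pins into minimum-version constraints. This is an illustration of the suggestion only: relax_pins is a hypothetical name, and the sample lines other than torch==1.8.1 are made up for the example rather than taken from the project's requirements.txt.

```python
import re

# Matches "==" but not "===", "!=", "<=", ">=" or "~=".
_EXACT_PIN = re.compile(r"(?<![=!<>~])==(?!=)")


def relax_pins(lines):
    """Rewrite 'name==X' as 'name>=X'; comment lines are left untouched."""
    relaxed = []
    for line in lines:
        if line.lstrip().startswith("#"):
            relaxed.append(line)
        else:
            relaxed.append(_EXACT_PIN.sub(">=", line, count=1))
    return relaxed


print(relax_pins(["torch==1.8.1", "# pinned == on purpose", "numpy"]))
```

Whether a lower bound alone is safe depends on how compatible later releases of each dependency actually are, so the result should still be tested.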
Added ability to pass additional parameters to simpletransformer ner in RestorePuncts class.
Thanks for the great library! When running this without a GPU I had problems. I think there is a simple fix. The simpletransformers NER model defaults to enabling CUDA. This PR allows the user to pass a dictionary of arguments specifically for the simpletransformers NER model. So you can now run the code on a CPU by initializing rpunct like so:

rpunct = RestorePuncts(ner_args={"use_cuda": False})

Before this change, when running rpunct examples on the CPU the following error occurs:

from rpunct import RestorePuncts
# The default language is 'english'
rpunct = RestorePuncts()
rpunct.punctuate("""in 2018 cornell researchers built a high-powered detector that in combination with an algorithm-driven process called ptychography set a world record
by tripling the resolution of a state-of-the-art electron microscope as successful as it was that approach had a weakness it only worked with ultrathin samples that were
a few atoms thick anything thicker would cause the electrons to scatter in ways that could not be disentangled now a team again led by david muller the samuel b eckert
professor of engineering has bested its own record by a factor of two with an electron microscope pixel array detector empad that incorporates even more sophisticated
3d reconstruction algorithms the resolution is so fine-tuned the only blurring that remains is the thermal jiggling of the atoms themselves""")

ValueError                                Traceback (most recent call last)
/var/folders/hx/dhzhl_x51118fm5cd13vzh2h0000gn/T/ipykernel_10548/194907560.py in <module>
      1 from rpunct import RestorePuncts
      2 # The default language is 'english'
----> 3 rpunct = RestorePuncts()
      4 rpunct.punctuate("""in 2018 cornell researchers built a high-powered detector that in combination with an algorithm-driven process called ptychography set a world record
      5 by tripling the resolution of a state-of-the-art electron microscope as successful as it was that approach had a weakness it only worked with ultrathin samples that were

~/repos/rpunct/rpunct/punctuate.py in __init__(self, wrds_per_pred, ner_args)
     19         if ner_args is None:
     20             ner_args = {}
---> 21         self.model = NERModel("bert", "felflare/bert-restore-punctuation", labels=self.valid_labels,
     22                               args={"silent": True, "max_seq_length": 512}, **ner_args)
     23

~/repos/transformers/transformer-env/lib/python3.8/site-packages/simpletransformers/ner/ner_model.py in __init__(self, model_type, model_name, labels, args, use_cuda, cuda_device, onnx_execution_provider, **kwargs)
    209                 self.device = torch.device(f"cuda:{cuda_device}")
    210             else:
--> 211                 raise ValueError(
    212                     "'use_cuda' set to True when cuda is unavailable."
    213                     "Make sure CUDA is available or set use_cuda=False."

ValueError: 'use_cuda' set to True when cuda is unavailable.Make sure CUDA is available or set use_cuda=False.
opened by nbertagnolli 1
add use_cuda parameter

Using the package in an environment without CUDA support causes it to fail. Adding a parameter to turn CUDA off when necessary allows it to function normally.

opened by mjfox3 1
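The two CUDA-related comments above suggest choosing the device setting at runtime instead of hard-coding it. The sketch below is one hedged way to do that. It assumes an rpunct build whose RestorePuncts accepts the ner_args dictionary described in the pull request (the released 1.0.1 package may not), and the helper names cuda_available, build_ner_args and make_restorer are hypothetical.

```python
def cuda_available():
    """Return True only if PyTorch is importable and reports a usable CUDA device."""
    try:
        import torch
    except ImportError:
        return False
    return bool(torch.cuda.is_available())


def build_ner_args():
    """Build the arguments dictionary to forward to the simpletransformers NER model."""
    return {"use_cuda": cuda_available()}


def make_restorer():
    """Create a RestorePuncts instance that only asks for CUDA when it exists.

    Assumption: RestorePuncts takes a ner_args keyword, as in the pull
    request described above. Imported lazily so the helpers above can be
    used without rpunct installed.
    """
    from rpunct import RestorePuncts
    return RestorePuncts(ner_args=build_ner_args())
```

On a machine without a GPU, build_ner_args() returns {"use_cuda": False}, which matches the manual workaround shown in the pull request.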

Releases(1.0.1)

1.0.1(May 24, 2021)

Owner

Daulet Nurmanbetov

Deep Learning, AI and Finance
