🌴 dobbi 🦕

Takes care of all of this boring NLP stuff

Description

An open-source NLP library: fast text cleaning and preprocessing.

TL;DR

This library provides quick, ready-to-use text preprocessing tools for text cleaning and normalization. You can easily remove hashtags, nicknames, emoji, URLs, punctuation, whitespace, and more.

Installation

To get dobbi, either fork this GitHub repo or simply install it from PyPI via pip:

$ pip install dobbi

Usage

Import the library:

import dobbi

Interaction

The library uses method chaining to simplify text processing:

dobbi.clean() \
    .hashtag() \
    .nickname() \
    .url() \
    .execute('Check here: https://some-url.com')

Supported methods and patterns

The process consists of three stages (a combined sketch follows the method lists below):

  1. Initialization methods: initialize a dobbi Work object
  2. Intermediate methods: chain the patterns you need, in the order they should be applied
  3. Terminal methods: choose whether you need a reusable function or an immediate result

Initialization methods:

  • dobbi.clean()
  • dobbi.collect()
  • dobbi.replace()

Intermediate methods (pattern processing choice):

  • regexp() - custom regular expressions
  • url() - URLs
  • html() - HTML and other "<...>"-style markup
  • punctuation() - punctuation
  • hashtag() - hashtags
  • emoji() - emoji
  • emoticons() - emoticons
  • whitespace() - any type of whitespace
  • nickname() - nicknames starting with @

Terminal methods:

  • execute(str) - executes the chosen methods on the provided string.
  • function() - returns a function that combines the chosen methods.
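
Putting the three stages together: the sketch below chains a few of the patterns listed above and finishes with function(). The input string and the expected output are illustrative assumptions, not outputs verified against the library.

import dobbi

# Stage 1: initialization - clean() means "remove whatever the patterns match".
# Stage 2: intermediate methods - chained in the order they should be applied.
# Stage 3: terminal method - function() returns a reusable callable instead of an immediate result.
cleaner = dobbi.clean() \
    .html() \
    .emoji() \
    .whitespace() \
    .function()

cleaner('<b>Nice!</b> 🙂   So   much   fun')  # assumed result: 'Nice! So much fun'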

Examples

1) Clean a random Twitter message

dobbi.clean() \
    .hashtag() \
    .nickname() \
    .url() \
    .execute('#fun #lol    Why  @Alex33 is so funny? Check here: https://some-url.com')

Result:

'Why is so funny? Check here:'

2) Replace nicknames and URLs with tokens

dobbi.replace() \
    .hashtag('') \
    .nickname() \
    .url('__CUSTOM_URL_TOKEN__') \
    .execute('#fun #lol    Why  @Alex33 is so funny? Check here: https://some-url.com')

Result:

'Why TOKEN_NICKNAME is so funny? Check here: __CUSTOM_URL_TOKEN__'
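
You can also give every pattern an explicit token, assuming each intermediate method accepts an optional replacement string the way .url() does above. A hedged sketch (the result shown is assumed, not taken from the library docs):

dobbi.replace() \
    .hashtag('TAG') \
    .nickname('USER') \
    .url('LINK') \
    .execute('#fun Why @Alex33 is so funny? Check here: https://some-url.com')

# Assumed result: 'TAG Why USER is so funny? Check here: LINK'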

3) Get the text cleanup function (one-liner)

Please try to avoid in-line method chaining, as it is less readable. But do as your heart tells you.

func = dobbi.clean().url().hashtag().punctuation().whitespace().html().function()
func('\t #fun #lol    Why  @Alex33 is so... funny? \nCheck\there: https://some-url.com')

Result:

'Why Alex33 is so funny Check here'
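
Because function() returns a plain callable, you can reuse it across a whole batch of messages, for example with a list comprehension. A small usage sketch (the messages are made up):

messages = [
    '#fun #lol    Why  @Alex33 is so funny?',
    'Check here: https://some-url.com',
]
cleaned = [func(m) for m in messages]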

4) Chain regexp methods

dobbi.clean() \
    .regexp(r'#\w+') \
    .regexp(r'@\w+') \
    .regexp(r'https?://\S+') \
    .execute('#fun #lol    Why  @Alex33 is so funny? Check here: https://some-url.com')

Result:

'Why is so funny? Check here:'

Additional

Please note that the functions are applied in the order you specify them, so it is better to chain .punctuation() as one of the last methods.
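
For example, if .punctuation() runs before .url(), the ':' and '/' characters are stripped out of the link before the URL pattern ever sees it, so fragments of the address may survive the cleanup. A hedged sketch of the two orderings (the behavior described in the comments is an assumption, not verified output):

# Safer: the URL is removed while it is still intact, punctuation is cleaned afterwards.
good = dobbi.clean().url().punctuation().whitespace().function()

# Risky: punctuation() may mangle 'https://some-url.com' before url() can match it.
risky = dobbi.clean().punctuation().url().whitespace().function()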

Call for collaboration 🤗

If you enjoy the project, I would be grateful if you supported it :)

Here are some ways you could contribute:

  • Finding bugs
  • Making code optimizations
  • Writing tests
  • Helping with new feature development