Basic yet complete Machine Learning pipeline for NLP tasks

Last update: Aug 22, 2022

Related tags

Text Data & NLP ml-pipeline

Overview

This repository accompanies the article on building basic yet complete ML pipelines for solving NLP tasks.

Requirements

Docker

telnet

Please refer to installation instructions for your system if needed.

Running the pipeline

The whole pipeline of 4 services (mail server, database, prediction service and orchestrator) can be started with one command:

docker-compose -f docker-compose.yaml up --build

It should start printing log messages from the services.
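Besides watching the logs, one way to tell when the stack is ready is to check that the ports used later in this README accept TCP connections. The following is a minimal sketch, not part of the repository: the helper names port_open and wait_for_ports are hypothetical, and an open port only shows that something is listening, not that the service behind it is fully initialised.

```python
import socket
import time


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_ports(host: str, ports: list[int], attempts: int = 30, delay: float = 1.0) -> bool:
    """Poll until every port accepts connections, or give up after `attempts` tries."""
    for _ in range(attempts):
        if all(port_open(host, p) for p in ports):
            return True
        time.sleep(delay)
    return False


if __name__ == "__main__":
    # 3025 (mail server) and 5432 (database) are the ports referenced in this README.
    ready = wait_for_ports("localhost", [3025, 5432])
    print("services reachable" if ready else "services not reachable yet")
```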

Sending an email

The pipeline is triggered when an unread email appears in the mailbox. One way to send such an email is with the telnet utility.

Connecting to the mail server over SMTP (the commands below are SMTP commands, not IMAP): telnet localhost 3025

Sending the email with telnet:

EHLO user
MAIL FROM:<[email protected]>
RCPT TO:<user>
DATA
Subject: Hello World
 
Hello!

She works at Apple now but before that she worked at Microsoft.
.
QUIT
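The same message can probably be sent without telnet, using Python's standard smtplib. This is a sketch under assumptions rather than the repository's own tooling: sender@example.com is a placeholder sender address, the helper names are hypothetical, and the message carries From and To headers that the telnet session above omits.

```python
import smtplib
from email.message import EmailMessage

SMTP_HOST = "localhost"
SMTP_PORT = 3025  # same port as in the telnet session


def build_message(sender: str, recipient: str) -> EmailMessage:
    """Build the sample email with the subject and body used in the telnet session."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = "Hello World"
    msg.set_content(
        "Hello!\n\n"
        "She works at Apple now but before that she worked at Microsoft.\n"
    )
    return msg


def send(msg: EmailMessage, host: str = SMTP_HOST, port: int = SMTP_PORT) -> None:
    """Deliver the message over plain SMTP; the envelope is taken from the headers."""
    with smtplib.SMTP(host, port) as smtp:
        smtp.send_message(msg)


if __name__ == "__main__":
    # "sender@example.com" is an assumed placeholder; "user" matches RCPT TO above.
    send(build_message("sender@example.com", "user"))
```

Running the script while the pipeline is up should produce log lines similar to the ones that follow a telnet-sent email.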

If everything went well, output similar to this should appear in the logs:

orchestrator_1                   | Polling mailbox...
prediction-worker_1              | INFO:     172.19.0.5:55294 - "POST /predict HTTP/1.1" 200 OK
orchestrator_1                   | Recorded to DB with id=34: [{'entity_text': 'Apple', 'start': 24, 'end': 29}, {'entity_text': 'Microsoft', 'start': 58, 'end': 67}]
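In the recorded entities, end minus start equals the length of entity_text (29 - 24 = 5 for Apple, 67 - 58 = 9 for Microsoft), which is consistent with end-exclusive character offsets. The sketch below shows how such records can be checked against a text. It assumes that interpretation, uses hypothetical helper names, and works on the bare sentence, so its offsets differ from the ones in the log, which refer to the text as the prediction service received it.

```python
def span_text(text: str, record: dict) -> str:
    """Slice the text with the record's offsets, treating `end` as exclusive."""
    return text[record["start"]:record["end"]]


def spans_are_consistent(text: str, records: list[dict]) -> bool:
    """Return True if every record's offsets point at its entity_text."""
    return all(span_text(text, r) == r["entity_text"] for r in records)


if __name__ == "__main__":
    sentence = "She works at Apple now but before that she worked at Microsoft."
    # Offsets computed for this sentence alone (assumed, not taken from the log).
    records = [
        {"entity_text": "Apple", "start": 13, "end": 18},
        {"entity_text": "Microsoft", "start": 53, "end": 62},
    ]
    print(spans_are_consistent(sentence, records))
```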

Checking the result

The result should also be recorded in the database. To check, connect with any DB client using the following parameters:

host: localhost
port: 5432
database: maildb
username: pguser
password: password

Then run the query SELECT * FROM mail LIMIT 10.
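The same check can be scripted. The sketch below assumes the database is PostgreSQL (port 5432 and the pguser name suggest it, but this README does not say so) and that the third-party psycopg2 driver is installed; fetch_rows is a hypothetical helper name.

```python
# Connection parameters from this README, using psycopg2's keyword names.
CONN_PARAMS = {
    "host": "localhost",
    "port": 5432,
    "dbname": "maildb",
    "user": "pguser",
    "password": "password",
}
QUERY = "SELECT * FROM mail LIMIT 10"


def fetch_rows(params: dict = CONN_PARAMS, query: str = QUERY) -> list:
    """Run the query and return all rows (assumes a PostgreSQL server)."""
    # Imported lazily so the module loads even where the driver is missing.
    import psycopg2  # third-party: pip install psycopg2-binary

    conn = psycopg2.connect(**params)
    try:
        with conn.cursor() as cur:
            cur.execute(query)
            return cur.fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in fetch_rows():
        print(row)
```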

Owner

Ivan
