A simple implementation of an N-gram language model.

Last update: Nov 24, 2021

Related tags

Text Data & NLP n-gram

Overview

About

A simple implementation of an N-gram language model.

Requirements

numpy

Data preparation

Corpus

The training data for the N-gram model is a plain text file, for example:

曼联加油
懂球直播
有也免费高清的额
直播挺全的
曼联这局肯定胜利

During training, each text line is split into tokens by a delimiter. If no delimiter is given (the default), each line is split into individual characters.
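
The splitting rule above can be sketched as follows. This is a minimal illustration, not the code used by train_n_gram.py; the function name split_line and the parameter name delimiter are assumptions made for the example.

```python
def split_line(line, delimiter=None):
    """Split one corpus line into tokens (hypothetical helper).

    With no delimiter, every character becomes a token. Otherwise the
    line is split on the delimiter and empty pieces are dropped.
    """
    line = line.strip()
    if not delimiter:
        return list(line)
    return [tok for tok in line.split(delimiter) if tok]


# Default: character-level tokens.
print(split_line("曼联加油"))
# With a delimiter: one token per delimited piece.
print(split_line("a b  c", delimiter=" "))
```

Character-level splitting is a natural default for Chinese text, where words are not separated by spaces.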

Tokens

The dictionary for the model is a text file with one token per line. Each token appears only once in the file.

光
衰
戒
颅
阖
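
One way to read such a token file into an index is sketched below. It assumes only the format described above (one unique token per line); the function name load_tokens and the token-to-id mapping are illustrative and may differ from what the scripts do internally.

```python
def load_tokens(lines):
    """Build a token -> id mapping from an iterable of lines (hypothetical helper).

    Accepts an open file object or any list of strings. Blank lines are
    skipped and a repeated token raises an error, since the file is
    expected to list every token exactly once.
    """
    token2id = {}
    for line in lines:
        token = line.rstrip("\r\n")
        if not token:
            continue
        if token in token2id:
            raise ValueError(f"duplicate token: {token!r}")
        token2id[token] = len(token2id)
    return token2id


# Usage with an in-memory sample; for a real file, pass
# open(path, encoding="utf-8") instead of the list.
sample = ["光\n", "衰\n", "戒\n", "颅\n", "阖\n"]
token2id = load_tokens(sample)
print(len(token2id))
```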

Training

Run the script train_n_gram.py to train an N-gram model.

python train_n_gram.py --corpus_path data/tieba.dialogues --token_path data/charset.txt --model_path data/2-gram.model --n 2
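
For readers who want to see what training an N-gram model amounts to, here is a minimal counting sketch. It is not the code of train_n_gram.py: the function count_ngrams and the layout of the table (a head of n-1 tokens mapped to counts of the following token) are assumptions, with the name head2tail borrowed from the test log only as a label. It assumes character-level tokens and n >= 2.

```python
from collections import Counter, defaultdict


def count_ngrams(lines, n=2):
    """Count which token follows each (n-1)-token head (illustrative sketch)."""
    if n < 2:
        raise ValueError("this sketch assumes n >= 2")
    head2tail = defaultdict(Counter)
    for line in lines:
        tokens = list(line.strip())  # character-level split
        for i in range(len(tokens) - n + 1):
            head = tuple(tokens[i:i + n - 1])
            tail = tokens[i + n - 1]
            head2tail[head][tail] += 1
    return head2tail


corpus = ["曼联加油", "曼联这局肯定胜利"]
table = count_ngrams(corpus, n=2)
# Counts of the tokens seen after "曼" in the two sample lines.
print(table[("曼",)])
```

Dividing each count by the total for its head gives the conditional probability of the next token given the head.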

Testing

Run the script test_n_gram.py to test the trained N-gram model.

python test_n_gram.py --token_path data/charset.txt --model_path data/2-gram.model --text 哈哈

The test output will look like this:

INFO - Loaded model from data/2-gram.model
INFO - Model info:
	n: 2
	head2tail length: 5947
	tokens: 5952
The most probable next token of the '哈哈' is '哈'.
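
The prediction in the last line of the output can be understood with the following sketch, which picks the most frequent follower of the last n-1 tokens of the input. The function most_probable_next and the toy counts are hypothetical and are not taken from test_n_gram.py or from the trained model; the sketch assumes character-level tokens and n >= 2.

```python
from collections import Counter


def most_probable_next(head2tail, text, n=2):
    """Return the most frequent token after the last n-1 characters of text.

    Returns None when the head was never seen (illustrative sketch).
    """
    head = tuple(text[-(n - 1):])
    tails = head2tail.get(head)
    if not tails:
        return None
    return tails.most_common(1)[0][0]


# Toy counts made up for illustration only.
toy_table = {("哈",): Counter({"哈": 3, "了": 1})}
print(most_probable_next(toy_table, "哈哈"))
```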
