Blazing fast language detection using fastText model

Last update: Dec 20, 2022

Overview

Luga

Blazing fast language detection using fastText's language models

Luga is a Swahili word for language. fastText provides blazing fast language detection, but downloading and loading its models is a bit awkward, and the fastText API is not pretty. This is why luga was born.

Installation

python -m pip install -U luga

Usage:

Note: First usage downloads the model for you. This is done only once.

from luga import language

print(language("the world has ended yesterday"))

Coming soon ...

TODO:

refactor artifacts.py
auto checkers with pre-commit | invoke
write more tests
write github actions
create a smart data checker (a fast List[str] check; decide what to do with non-string values)
make it faster with Cython

Comments

fix: Fix invalid pytest dependency version
poetry does not want to accept flake8 as a valid version. Fixes issue #13

fix: Fix invalid pytest dependency version

fix: Use fasttext-wheel instead of fasttext
opened by saevarb 1
Installation fails with recent poetry due to `fasttext` issues

Hey!

As is explained in this issue: https://github.com/python-poetry/poetry/issues/6113 trying to install fasttext with a recent poetry version fails. This is because fasttext does some really funky things and tries to run a global pip during install. So this means that building luga or using any package that depends on it doesn't work. :/

This means that columbus doesn't build either, since it depends on luga. However, as is outlined in the issue there is a solution: using fasttext-wheel.

I pulled down luga and columbus and updated luga to use fasttext-wheel instead, and managed to get it to install, which also allowed me to build a new version of columbus using the new luga build.

opened by saevarb 1

SSL WRONG_VERSION_NUMBER

Solution from httpx

import httpx
import ssl

ssl_context = httpx.create_ssl_context()
ssl_context.options ^= ssl.OP_NO_TLSv1  # Enable TLS 1.0 back
resp = httpx.get(..., verify=ssl_context)

opened by Proteusiq 0
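
One caveat about the httpx snippet in the issue above: `^=` is XOR, which toggles the `OP_NO_TLSv1` bit rather than clearing it. If that bit is already clear in the context's options, XOR sets it, which disables TLS 1.0 instead of enabling it. A minimal sketch of the difference, using plain integers rather than a real SSL context (the variable names are illustrative and not part of httpx or luga):

```python
import ssl

# Integer value of the "do not use TLS 1.0" option bit.
NO_TLS1 = int(ssl.OP_NO_TLSv1)

# Two hypothetical starting states for a context's option bits.
bit_set = NO_TLS1   # TLS 1.0 disabled by this bit
bit_clear = 0       # TLS 1.0 not disabled by this bit

# XOR flips the bit in both cases.
xor_from_set = bit_set ^ NO_TLS1      # bit is now clear
xor_from_clear = bit_clear ^ NO_TLS1  # bit is now set, the opposite of the intent

# AND with the complement clears the bit regardless of the starting state.
clear_from_set = bit_set & ~NO_TLS1
clear_from_clear = bit_clear & ~NO_TLS1

print(xor_from_set, xor_from_clear == NO_TLS1, clear_from_set, clear_from_clear)
```

Writing `ssl_context.options &= ~ssl.OP_NO_TLSv1` would state the intent of the comment more directly. Whether TLS 1.0 is then actually negotiated also depends on the local OpenSSL build and the context's minimum protocol version.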

Return array for compatibility with pandas

This fails since pandas expects an array and luga returns a list

texts.loc[languages(texts["texts"].to_list(), only_language=True) == "da"]

But this works

texts.loc[np.array(languages(texts["texts"].to_list(), only_language=True)) == "da"]

opened by nthomsencph 0
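
The behaviour reported in the pandas issue above comes from how Python compares a list with a string: the comparison is made once, on the whole list, and yields a single bool instead of one bool per element. A minimal sketch, assuming that `languages(..., only_language=True)` returns a plain list of language codes as the issue describes (the `labels` list below is a hypothetical stand-in, so nothing is downloaded):

```python
# Hypothetical stand-in for languages(texts, only_language=True):
# a plain Python list with one language code per input text.
labels = ["da", "en", "da", "sw"]

# A list compared with a string is compared as a whole object,
# so the result is a single bool, not an element-wise mask.
whole_list_comparison = labels == "da"

# An element-wise boolean mask has to be built explicitly.
mask = [label == "da" for label in labels]

print(whole_list_comparison)
print(mask)
```

A list of booleans like `mask` can be passed to `texts.loc[...]` when its length matches the frame; wrapping the labels in `np.array(...)` before comparing gives the same element-wise result.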

Releases(v0.2.7)

v0.2.7 (Dec 18, 2022): luga-0.2.7-py3-none-any.whl (5.55 KB), luga-0.2.7.tar.gz (5.34 KB)
v0.2.6 (Sep 28, 2022): luga-0.2.6-py3-none-any.whl (5.51 KB), luga-0.2.6.tar.gz (5.32 KB)
v0.2.5 (Apr 19, 2022): luga-0.2.5-py3-none-any.whl (5.50 KB), luga-0.2.5.tar.gz (5.39 KB)
v0.2.4 (Dec 23, 2021): luga-0.2.4-py3-none-any.whl (4.60 KB), luga-0.2.4.tar.gz (4.52 KB)
v0.2.3 (Dec 22, 2021): luga-0.2.3-py3-none-any.whl (4.56 KB), luga-0.2.3.tar.gz (4.46 KB)
v0.2.2 (Dec 3, 2021): luga-0.2.2-py3-none-any.whl (4.42 KB), luga-0.2.2.tar.gz (4.28 KB)
v0.2.1 (Nov 26, 2021): luga-0.2.1-py3-none-any.whl (4.07 KB), luga-0.2.1.tar.gz (3.95 KB)
v0.2.0 (Nov 26, 2021): luga-0.2.0-py3-none-any.whl (4.07 KB), luga-0.2.0.tar.gz (3.95 KB)
v0.1.8 (Nov 20, 2021): luga-0.1.8-py3-none-any.whl (3.88 KB), luga-0.1.8.tar.gz (3.76 KB)
v0.1.7 (Nov 17, 2021): luga-0.1.7-py3-none-any.whl (3.81 KB), luga-0.1.7.tar.gz (3.66 KB)

Owner

Prayson Wilfred Daniel

🍺 Data Scientist | 🍺 Automating Data Mining & Analysis With Python

