Natural language Understanding Toolkit

Related tags

Text Data & NLPnut
Overview

Natural language Understanding Toolkit

TOC

Requirements

To install nut you need:

  • Python 2.5 or 2.6
  • Numpy (>= 1.1)
  • Sparsesvd (>= 0.1.4) [1] (only CLSCL)

Installation

To clone the repository run,

git clone git://github.com/pprett/nut.git

To build the extension modules inplace run,

python setup.py build_ext --inplace

Add project to python path,

export PYTHONPATH=$PYTHONPATH:$HOME/workspace/nut

Documentation

CLSCL

An implementation of Cross-Language Structural Correspondence Learning (CLSCL). See [Prettenhofer2010] for a detailed description and [Prettenhofer2011] for more experiments and enhancements.

The data for cross-language sentiment classification that has been used in the above study can be found here [2].

clscl_train

Training script for CLSCL. See ./clscl_train --help for further details.

Usage:

$ ./clscl_train en de cls-acl10-processed/en/books/train.processed cls-acl10-processed/en/books/unlabeled.processed cls-acl10-processed/de/books/unlabeled.processed cls-acl10-processed/dict/en_de_dict.txt model.bz2 --phi 30 --max-unlabeled=50000 -k 100 -m 450 --strategy=parallel

|V_S| = 64682
|V_T| = 106024
|V| = 170706
|s_train| = 2000
|s_unlabeled| = 50000
|t_unlabeled| = 50000
debug: DictTranslator contains 5012 translations.
mutualinformation took 5.624 sec
select_pivots took 7.197 sec
|pivots| = 450
create_inverted_index took 59.353 sec
Run joblib.Parallel
[Parallel(n_jobs=-1)]: Done   1 out of 450 |elapsed:    9.1s remaining: 67.8min
[Parallel(n_jobs=-1)]: Done   5 out of 450 |elapsed:   15.2s remaining: 22.6min
[..]
[Parallel(n_jobs=-1)]: Done 449 out of 450 |elapsed: 14.5min remaining:    1.9s
train_aux_classifiers took 881.803 sec
density: 0.1154
Ut.shape = (100,170706)
learn took 903.588 sec
project took 175.483 sec

Note

If you have access to a hadoop cluster, you can use --strategy=hadoop to train the pivot classifiers even faster, however, make sure that the hadoop nodes have Bolt (feature-mask branch) [3] installed.

clscl_predict

Prediction script for CLSCL.

Usage:

$ ./clscl_predict cls-acl10-processed/en/books/train.processed model.bz2 cls-acl10-processed/de/books/test.processed 0.01
|V_S| = 64682
|V_T| = 106024
|V| = 170706
load took 0.681 sec
load took 0.659 sec
classes = {negative,positive}
project took 2.498 sec
project took 2.716 sec
project took 2.275 sec
project took 2.492 sec
ACC: 83.05

Named-Entity Recognition

A simple greedy left-to-right sequence labeling approach to named entity recognition (NER).

pre-trained models

We provide pre-trained named entity recognizers for place, person, and organization names in English and German. To tag a sentence simply use:

>>> from nut.io import compressed_load
>>> from nut.util import WordTokenizer

>>> tagger = compressed_load("model_demo_en.bz2")
>>> tokenizer = WordTokenizer()
>>> tokens = tokenizer.tokenize("Peter Prettenhofer lives in Austria .")

>>> # see tagger.tag.__doc__ for input format
>>> sent = [((token, "", ""), "") for token in tokens]
>>> g = tagger.tag(sent)  # returns a generator over tags
>>> print(" ".join(["/".join(tt) for tt in zip(tokens, g)]))
Peter/B-PER Prettenhofer/I-PER lives/O in/O Austria/B-LOC ./O

You can also use the convenience demo script ner_demo.py:

$ python ner_demo.py model_en_v1.bz2

The feature detector modules for the pre-trained models are en_best_v1.py and de_best_v1.py and can be found in the package nut.ner.features. In addition to baseline features (word presence, shape, pre-/suffixes) they use distributional features (brown clusters), non-local features (extended prediction history), and gazetteers (see [Ratinov2009]). The models have been trained on CoNLL03 [4]. Both models use neither syntactic features (e.g. part-of-speech tags, chunks) nor word lemmas, thus, minimizing the required pre-processing. Both models provide state-of-the-art performance on the CoNLL03 shared task benchmark for English [Ratinov2009]:

processed 46435 tokens with 4946 phrases; found: 4864 phrases; correct: 4455.
accuracy:  98.01%; precision:  91.59%; recall:  90.07%; FB1:  90.83
              LOC: precision:  91.69%; recall:  90.53%; FB1:  91.11  1648
              ORG: precision:  87.36%; recall:  85.73%; FB1:  86.54  1630
              PER: precision:  95.84%; recall:  94.06%; FB1:  94.94  1586

and German [Faruqui2010]:

processed 51943 tokens with 2845 phrases; found: 2438 phrases; correct: 2168.
accuracy:  97.92%; precision:  88.93%; recall:  76.20%; FB1:  82.07
              LOC: precision:  87.67%; recall:  79.83%; FB1:  83.57  957
              ORG: precision:  82.62%; recall:  65.92%; FB1:  73.33  466
              PER: precision:  93.00%; recall:  78.02%; FB1:  84.85  1015

To evaluate the German model on the out-domain data provided by [Faruqui2010] use the raw flag (-r) to write raw predictions (without B- and I- prefixes):

./ner_predict -r model_de_v1.bz2 clner/de/europarl/test.conll - | clner/scripts/conlleval -r
loading tagger... [done]
use_eph:  True
use_aso:  False
processed input in 40.9214s sec.
processed 110405 tokens with 2112 phrases; found: 2930 phrases; correct: 1676.
accuracy:  98.50%; precision:  57.20%; recall:  79.36%; FB1:  66.48
              LOC: precision:  91.47%; recall:  71.13%; FB1:  80.03  563
              ORG: precision:  43.63%; recall:  83.52%; FB1:  57.32  1673
              PER: precision:  62.10%; recall:  83.85%; FB1:  71.36  694

Note that the above results cannot be compared directly to the resuls of [Faruqui2010] since they use a slighly different setting (incl. MISC entity).

ner_train

Training script for NER. See ./ner_train --help for further details.

To train a conditional markov model with a greedy left-to-right decoder, the feature templates of [Rationov2009]_ and extended prediction history (see [Ratinov2009]) use:

./ner_train clner/en/conll03/train.iob2 model_rr09.bz2 -f rr09 -r 0.00001 -E 100 --shuffle --eph
________________________________________________________________________________
Feature extraction

min count:  1
use eph:  True
build_vocabulary took 24.662 sec
feature_extraction took 25.626 sec
creating training examples... build_examples took 42.998 sec
[done]
________________________________________________________________________________
Training

num examples: 203621
num features: 553249
num classes: 9
classes:  ['I-LOC', 'B-ORG', 'O', 'B-PER', 'I-PER', 'I-MISC', 'B-MISC', 'I-ORG', 'B-LOC']
reg: 0.00001000
epochs: 100
9 models trained in 239.28 seconds.
train took 282.374 sec

ner_predict

You can use the prediction script to tag new sentences formatted in CoNLL format and write the output to a file or to stdout. You can pipe the output directly to conlleval to assess the model performance:

./ner_predict model_rr09.bz2 clner/en/conll03/test.iob2 - | clner/scripts/conlleval
loading tagger... [done]
use_eph:  True
use_aso:  False
processed input in 11.2883s sec.
processed 46435 tokens with 5648 phrases; found: 5605 phrases; correct: 4799.
accuracy:  96.78%; precision:  85.62%; recall:  84.97%; FB1:  85.29
              LOC: precision:  87.29%; recall:  88.91%; FB1:  88.09  1699
             MISC: precision:  79.85%; recall:  75.64%; FB1:  77.69  665
              ORG: precision:  82.90%; recall:  78.81%; FB1:  80.80  1579
              PER: precision:  88.81%; recall:  91.28%; FB1:  90.03  1662

References

[1] http://pypi.python.org/pypi/sparsesvd/0.1.4
[2] http://www.webis.de/research/corpora/corpus-webis-cls-10/cls-acl10-processed.tar.gz
[3] https://github.com/pprett/bolt/tree/feature-mask
[4] For German we use the updated version of CoNLL03 by Sven Hartrumpf.
[Prettenhofer2010] Prettenhofer, P. and Stein, B., Cross-language text classification using structural correspondence learning. In Proceedings of ACL '10.
[Prettenhofer2011] Prettenhofer, P. and Stein, B., Cross-lingual adaptation using structural correspondence learning. ACM TIST (to appear). [preprint]
[Ratinov2009] (1, 2, 3) Ratinov, L. and Roth, D., Design challenges and misconceptions in named entity recognition. In Proceedings of CoNLL '09.
[Faruqui2010] (1, 2, 3) Faruqui, M. and Padó S., Training and Evaluating a German Named Entity Recognizer with Semantic Generalization. In Proceedings of KONVENS '10

Developer Notes

  • If you copy a new version of bolt into the externals directory make sure to run cython on the *.pyx files. If you fail to do so you will get a PickleError in multiprocessing.
Owner
Peter Prettenhofer
Peter Prettenhofer
This repository contains all the source code that is needed for the project : An Efficient Pipeline For Bloom’s Taxonomy Using Natural Language Processing and Deep Learning

Pipeline For NLP with Bloom's Taxonomy Using Improved Question Classification and Question Generation using Deep Learning This repository contains all

Rohan Mathur 9 Jul 17, 2021
Saptak Bhoumik 14 May 24, 2022
Transformer-based Text Auto-encoder (T-TA) using TensorFlow 2.

T-TA (Transformer-based Text Auto-encoder) This repository contains codes for Transformer-based Text Auto-encoder (T-TA, paper: Fast and Accurate Deep

Jeong Ukjae 13 Dec 13, 2022
PyTorch implementation of Tacotron speech synthesis model.

tacotron_pytorch PyTorch implementation of Tacotron speech synthesis model. Inspired from keithito/tacotron. Currently not as much good speech quality

Ryuichi Yamamoto 279 Dec 09, 2022
ByT5: Towards a token-free future with pre-trained byte-to-byte models

ByT5: Towards a token-free future with pre-trained byte-to-byte models ByT5 is a tokenizer-free extension of the mT5 model. Instead of using a subword

Google Research 409 Jan 06, 2023
A retro text-to-speech bot for Discord

hawking A retro text-to-speech bot for Discord, designed to work with all of the stuff you might've seen in Moonbase Alpha, using the existing command

Nick Schorr 23 Dec 25, 2022
An official implementation for "CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval"

The implementation of paper CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval. CLIP4Clip is a video-text retrieval model based

ArrowLuo 456 Jan 06, 2023
Addon for adding subtitle files to blender VSE as Text sequences. Using pysub2 python module.

Import Subtitles for Blender VSE Addon for adding subtitle files to blender VSE as Text sequences. Using pysub2 python module. Supported formats by py

4 Feb 27, 2022
Multilingual text (NLP) processing toolkit

polyglot Polyglot is a natural language pipeline that supports massive multilingual applications. Free software: GPLv3 license Documentation: http://p

RAMI ALRFOU 2.1k Jan 07, 2023
Various Algorithms for Short Text Mining

Short Text Mining in Python Introduction This package shorttext is a Python package that facilitates supervised and unsupervised learning for short te

Kwan-Yuet 466 Dec 06, 2022
A Domain Specific Language (DSL) for building language patterns. These can be later compiled into spaCy patterns, pure regex, or any other format

RITA DSL This is a language, loosely based on language Apache UIMA RUTA, focused on writing manual language rules, which compiles into either spaCy co

Šarūnas Navickas 60 Sep 26, 2022
A versatile token stream for handwritten parsers.

Writing recursive-descent parsers by hand can be quite elegant but it's often a bit more verbose than expected, especially when it comes to handling indentation and reporting proper syntax errors. Th

Valentin Berlier 8 Nov 30, 2022
Code for the Findings of NAACL 2022(Long Paper): AdapterBias: Parameter-efficient Token-dependent Representation Shift for Adapters in NLP Tasks

AdapterBias: Parameter-efficient Token-dependent Representation Shift for Adapters in NLP Tasks arXiv link: upcoming To be published in Findings of NA

Allen 16 Nov 12, 2022
Crie tokens de autenticação íntegros e seguros com UToken.

UToken - Tokens seguros. UToken (ou Unhandleable Token) é uma bilioteca criada para ser utilizada na geração de tokens seguros e íntegros, ou seja, nã

Jaedson Silva 0 Nov 29, 2022
Pipeline for fast building text classification TF-IDF + LogReg baselines.

Text Classification Baseline Pipeline for fast building text classification TF-IDF + LogReg baselines. Usage Instead of writing custom code for specif

Dani El-Ayyass 57 Dec 07, 2022
Python library for processing Chinese text

SnowNLP: Simplified Chinese Text Processing SnowNLP是一个python写的类库,可以方便的处理中文文本内容,是受到了TextBlob的启发而写的,由于现在大部分的自然语言处理库基本都是针对英文的,于是写了一个方便处理中文的类库,并且和TextBlob

Rui Wang 6k Jan 02, 2023
基于“Seq2Seq+前缀树”的知识图谱问答

KgCLUE-bert4keras 基于“Seq2Seq+前缀树”的知识图谱问答 简介 博客:https://kexue.fm/archives/8802 环境 软件:bert4keras=0.10.8 硬件:目前的结果是用一张Titan RTX(24G)跑出来的。 运行 第一次运行的时候,会给知

苏剑林(Jianlin Su) 65 Dec 12, 2022
[NeurIPS 2021] Code for Learning Signal-Agnostic Manifolds of Neural Fields

Learning Signal-Agnostic Manifolds of Neural Fields This is the uncleaned code for the paper Learning Signal-Agnostic Manifolds of Neural Fields. The

60 Dec 12, 2022
Python bindings to the dutch NLP tool Frog (pos tagger, lemmatiser, NER tagger, morphological analysis, shallow parser, dependency parser)

Frog for Python This is a Python binding to the Natural Language Processing suite Frog. Frog is intended for Dutch and performs part-of-speech tagging

Maarten van Gompel 46 Dec 14, 2022
The ability of computer software to identify words and phrases in spoken language and convert them to human-readable text

speech-recognition-py Speech recognition is the ability of computer software to identify words and phrases in spoken language and convert them to huma

Deepangshi 1 Apr 03, 2022