SpikeX - SpaCy Pipes for Knowledge Extraction

Overview

SpikeX - SpaCy Pipes for Knowledge Extraction

SpikeX is a collection of pipes ready to be plugged in a spaCy pipeline. It aims to help in building knowledge extraction tools with almost-zero effort.

Build Status pypi Version Code style: black

What's new in SpikeX 0.5.0

WikiGraph has never been so lightning fast:

  • 🌕 Performance mooning, thanks to the adoption of a sparse adjacency matrix to handle pages graph, instead of using igraph
  • 🚀 Memory optimization, with a consumption cut by ~40% and a compressed size cut by ~20%, introducing new bidirectional dictionaries to manage data
  • 📖 New APIs for a faster and easier usage and interaction
  • 🛠 Overall fixes, for a better graph and a better pages matching

Pipes

  • WikiPageX links Wikipedia pages to chunks in text
  • ClusterX picks noun chunks in a text and clusters them based on a revisiting of the Ball Mapper algorithm, Radial Ball Mapper
  • AbbrX detects abbreviations and acronyms, linking them to their long form. It is based on scispacy's one with improvements
  • LabelX takes labelings of pattern matching expressions and catches them in a text, solving overlappings, abbreviations and acronyms
  • PhraseX creates a Doc's underscore extension based on a custom attribute name and phrase patterns. Examples are NounPhraseX and VerbPhraseX, which extract noun phrases and verb phrases, respectively
  • SentX detects sentences in a text, based on Splitta with refinements

Tools

  • WikiGraph with pages as leaves linked to categories as nodes
  • Matcher that inherits its interface from the spaCy's one, but built using an engine made of RegEx which boosts its performance

Install SpikeX

Some requirements are inherited from spaCy:

  • spaCy version: 2.3+
  • Operating system: macOS / OS X · Linux · Windows (Cygwin, MinGW, Visual Studio)
  • Python version: Python 3.6+ (only 64 bit)
  • Package managers: pip

Some dependencies use Cython and it needs to be installed before SpikeX:

pip install cython

Remember that a virtual environment is always recommended, in order to avoid modifying system state.

pip

At this point, installing SpikeX via pip is a one line command:

pip install spikex

Usage

Prerequirements

SpikeX pipes work with spaCy, hence a model its needed to be installed. Follow official instructions here. The brand new spaCy 3.0 is supported!

WikiGraph

A WikiGraph is built starting from some key components of Wikipedia: pages, categories and relations between them.

Auto

Creating a WikiGraph can take time, depending on how large is its Wikipedia dump. For this reason, we provide wikigraphs ready to be used:

Date WikiGraph Lang Size (compressed) Size (memory)
2021-04-01 enwiki_core EN 1.1GB 5.9GB
2021-04-01 simplewiki_core EN 19MB 120MB
2021-04-01 itwiki_core IT 189MB 1.1GB
More coming...

SpikeX provides a command to shortcut downloading and installing a WikiGraph (Linux or macOS, Windows not supported yet):

spikex download-wikigraph simplewiki_core

Manual

A WikiGraph can be created from command line, specifying which Wikipedia dump to take and where to save it:

spikex create-wikigraph \
  <YOUR-OUTPUT-PATH> \
  --wiki <WIKI-NAME, default: en> \
  --version <DUMP-VERSION, default: latest> \
  --dumps-path <DUMPS-BACKUP-PATH> \

Then it needs to be packed and installed:

spikex package-wikigraph \
  <WIKIGRAPH-RAW-PATH> \
  <YOUR-OUTPUT-PATH>

Follow the instructions at the end of the packing process and install the distribution package in your virtual environment. Now your are ready to use your WikiGraph as you wish:

from spikex.wikigraph import load as wg_load

wg = wg_load("enwiki_core")
page = "Natural_language_processing"
categories = wg.get_categories(page, distance=1)
for category in categories:
    print(category)

>>> Category:Speech_recognition
>>> Category:Artificial_intelligence
>>> Category:Natural_language_processing
>>> Category:Computational_linguistics

Matcher

The Matcher is identical to the spaCy's one, but faster when it comes to handle many patterns at once (order of thousands), so follow official usage instructions here.

A trivial example:

from spikex.matcher import Matcher
from spacy import load as spacy_load

nlp = spacy_load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
matcher.add("TEST", [[{"LOWER": "nlp"}]])
doc = nlp("I love NLP")
for _, s, e in matcher(doc):
  print(doc[s: e])

>>> NLP

WikiPageX

The WikiPageX pipe uses a WikiGraph in order to find chunks in a text that match Wikipedia page titles.

from spacy import load as spacy_load
from spikex.wikigraph import load as wg_load
from spikex.pipes import WikiPageX

nlp = spacy_load("en_core_web_sm")
doc = nlp("An apple a day keeps the doctor away")
wg = wg_load("simplewiki_core")
wpx = WikiPageX(wg)
doc = wpx(doc)
for span in doc._.wiki_spans:
  print(span._.wiki_pages)

>>> ['An']
>>> ['Apple', 'Apple_(disambiguation)', 'Apple_(company)', 'Apple_(tree)']
>>> ['A', 'A_(musical_note)', 'A_(New_York_City_Subway_service)', 'A_(disambiguation)', 'A_(Cyrillic)')]
>>> ['Day']
>>> ['The_Doctor', 'The_Doctor_(Doctor_Who)', 'The_Doctor_(Star_Trek)', 'The_Doctor_(disambiguation)']
>>> ['The']
>>> ['Doctor_(Doctor_Who)', 'Doctor_(Star_Trek)', 'Doctor', 'Doctor_(title)', 'Doctor_(disambiguation)']

ClusterX

The ClusterX pipe takes noun chunks in a text and clusters them using a Radial Ball Mapper algorithm.

from spacy import load as spacy_load
from spikex.pipes import ClusterX

nlp = spacy_load("en_core_web_sm")
doc = nlp("Grab this juicy orange and watch a dog chasing a cat.")
clusterx = ClusterX(min_score=0.65)
doc = clusterx(doc)
for cluster in doc._.cluster_chunks:
  print(cluster)

>>> [this juicy orange]
>>> [a cat, a dog]

AbbrX

The AbbrX pipe finds abbreviations and acronyms in the text, linking short and long forms together:

from spacy import load as spacy_load
from spikex.pipes import AbbrX

nlp = spacy_load("en_core_web_sm")
doc = nlp("a little snippet with an abbreviation (abbr)")
abbrx = AbbrX(nlp.vocab)
doc = abbrx(doc)
for abbr in doc._.abbrs:
  print(abbr, "->", abbr._.long_form)

>>> abbr -> abbreviation

LabelX

The LabelX pipe matches and labels patterns in text, solving overlappings, abbreviations and acronyms.

from spacy import load as spacy_load
from spikex.pipes import LabelX

nlp = spacy_load("en_core_web_sm")
doc = nlp("looking for a computer system engineer")
patterns = [
  [{"LOWER": "computer"}, {"LOWER": "system"}],
  [{"LOWER": "system"}, {"LOWER": "engineer"}],
]
labelx = LabelX(nlp.vocab, ("TEST", patterns), validate=True, only_longest=True)
doc = labelx(doc)
for labeling in doc._.labelings:
  print(labeling, f"[{labeling.label_}]")

>>> computer system engineer [TEST]

PhraseX

The PhraseX pipe creates a custom Doc's underscore extension which fulfills with matches from phrase patterns.

from spacy import load as spacy_load
from spikex.pipes import PhraseX

nlp = spacy_load("en_core_web_sm")
doc = nlp("I have Melrose and McIntosh apples, or Williams pears")
patterns = [
  [{"LOWER": "mcintosh"}],
  [{"LOWER": "melrose"}],
]
phrasex = PhraseX(nlp.vocab, "apples", patterns)
doc = phrasex(doc)
for apple in doc._.apples:
  print(apple)

>>> Melrose
>>> McIntosh

SentX

The SentX pipe splits sentences in a text. It modifies tokens' is_sent_start attribute, so it's mandatory to add it before parser pipe in the spaCy pipeline:

from spacy import load as spacy_load
from spikex.pipes import SentX
from spikex.defaults import spacy_version

if spacy_version >= 3:
  from spacy.language import Language

    @Language.factory("sentx")
    def create_sentx(nlp, name):
        return SentX()

nlp = spacy_load("en_core_web_sm")
sentx_pipe = SentX() if spacy_version < 3 else "sentx"
nlp.add_pipe(sentx_pipe, before="parser")
doc = nlp("A little sentence. Followed by another one.")
for sent in doc.sents:
  print(sent)

>>> A little sentence.
>>> Followed by another one.

That's all folks

Feel free to contribute and have fun!

Owner
Erre Quadro Srl
Erre Quadro Srl
In this project, we compared Spanish BERT and Multilingual BERT in the Sentiment Analysis task.

Applying BERT Fine Tuning to Sentiment Classification on Amazon Reviews Abstract Sentiment analysis has made great progress in recent years, due to th

Alexander Leonardo Lique Lamas 5 Jan 03, 2022
Implementing SimCSE(paper, official repository) using TensorFlow 2 and KR-BERT.

KR-BERT-SimCSE Implementing SimCSE(paper, official repository) using TensorFlow 2 and KR-BERT. Training Unsupervised python train_unsupervised.py --mi

Jeong Ukjae 27 Dec 12, 2022
💬 Open source machine learning framework to automate text- and voice-based conversations: NLU, dialogue management, connect to Slack, Facebook, and more - Create chatbots and voice assistants

Rasa Open Source Rasa is an open source machine learning framework to automate text-and voice-based conversations. With Rasa, you can build contextual

Rasa 15.3k Jan 03, 2023
中文空间语义理解评测

中文空间语义理解评测 最新消息 2021-04-10 🚩 排行榜发布: Leaderboard 2021-04-05 基线系统发布: SpaCE2021-Baseline 2021-04-05 开放数据提交: 提交结果 2021-04-01 开放报名: 我要报名 2021-04-01 数据集 pa

40 Jan 04, 2023
[WWW 2021 GLB] New Benchmarks for Learning on Non-Homophilous Graphs

New Benchmarks for Learning on Non-Homophilous Graphs Here are the codes and datasets accompanying the paper: New Benchmarks for Learning on Non-Homop

94 Dec 21, 2022
Yodatranslator is a simple translator English to Yoda-language

yodatranslator Overview yodatranslator is a simple translator English to Yoda-language. Project is created for educational purposes. It is intended to

1 Nov 11, 2021
Code for CodeT5: a new code-aware pre-trained encoder-decoder model.

CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation This is the official PyTorch implementation

Salesforce 564 Jan 08, 2023
Searching keywords in PDF file folders

keyword_searching Steps to use this Python scripts: (1)Paste this script into the file folder containing the PDF files you need to search from; (2)Thi

1 Nov 08, 2021
A fast Text-to-Speech (TTS) model. Work well for English, Mandarin/Chinese, Japanese, Korean, Russian and Tibetan (so far). 快速语音合成模型,适用于英语、普通话/中文、日语、韩语、俄语和藏语(当前已测试)。

简体中文 | English 并行语音合成 [TOC] 新进展 2021/04/20 合并 wavegan 分支到 main 主分支,删除 wavegan 分支! 2021/04/13 创建 encoder 分支用于开发语音风格迁移模块! 2021/04/13 softdtw 分支 支持使用 Sof

Atomicoo 161 Dec 19, 2022
Simple, Pythonic, text processing--Sentiment analysis, part-of-speech tagging, noun phrase extraction, translation, and more.

TextBlob: Simplified Text Processing Homepage: https://textblob.readthedocs.io/ TextBlob is a Python (2 and 3) library for processing textual data. It

Steven Loria 8.4k Dec 26, 2022
Th2En & Th2Zh: The large-scale datasets for Thai text cross-lingual summarization

Th2En & Th2Zh: The large-scale datasets for Thai text cross-lingual summarization 📥 Download Datasets 📥 Download Trained Models INTRODUCTION TH2ZH (

Nakhun Chumpolsathien 5 Jan 03, 2022
The aim of this task is to predict someone's English proficiency based on a text input.

English_proficiency_prediction_NLP The aim of this task is to predict someone's English proficiency based on a text input. Using the The NICT JLE Corp

1 Dec 13, 2021
Live Speech Portraits: Real-Time Photorealistic Talking-Head Animation (SIGGRAPH Asia 2021)

Live Speech Portraits: Real-Time Photorealistic Talking-Head Animation This repository contains the implementation of the following paper: Live Speech

OldSix 575 Dec 31, 2022
PyTorch implementation of the paper: Text is no more Enough! A Benchmark for Profile-based Spoken Language Understanding

Text is no more Enough! A Benchmark for Profile-based Spoken Language Understanding This repository contains the official PyTorch implementation of th

Xiao Xu 26 Dec 14, 2022
Uses Google's gTTS module to easily create robo text readin' on command.

Tool to convert text to speech, creating files for later use. TTRS uses Google's gTTS module to easily create robo text readin' on command.

0 Jun 20, 2021
DataCLUE: 国内首个以数据为中心的AI测评(含模型分析报告)

DataCLUE 以数据为中心的AI测评(DataCLUE) DataCLUE: A Chinese Data-centric Language Evaluation Benchmark 内容导引 章节 描述 简介 介绍以数据为中心的AI测评(DataCLUE)的背景 任务描述 任务描述 实验结果

CLUE benchmark 135 Dec 22, 2022
Code to use Augmented Shapiro Wilks Stopping, as well as code for the paper "Statistically Signifigant Stopping of Neural Network Training"

This codebase is being actively maintained, please create and issue if you have issues using it Basics All data files are included under losses and ea

Justin Terry 32 Nov 09, 2021
An open source framework for seq2seq models in PyTorch.

pytorch-seq2seq Documentation This is a framework for sequence-to-sequence (seq2seq) models implemented in PyTorch. The framework has modularized and

International Business Machines 1.4k Jan 02, 2023
Code for Findings of ACL 2022 Paper "Sentiment Word Aware Multimodal Refinement for Multimodal Sentiment Analysis with ASR Errors"

SWRM Code for Findings of ACL 2022 Paper "Sentiment Word Aware Multimodal Refinement for Multimodal Sentiment Analysis with ASR Errors" Clone Clone th

14 Jan 03, 2023
A text augmentation tool for named entity recognition.

neraug This python library helps you with augmenting text data for named entity recognition. Augmentation Example Reference from An Analysis of Simple

Hiroki Nakayama 48 Oct 11, 2022