A Domain Specific Language (DSL) for building language patterns. These can later be compiled into spaCy patterns, pure regex, or any other format.

Last update: Sep 26, 2022

Overview

RITA DSL

RITA is a language, loosely based on Apache UIMA RUTA, for writing manual language rules. It compiles into either spaCy-compatible patterns or pure regex. These patterns can be used for manual NER as well as in other processes, such as retokenizing and pure matching.

Install

pip install rita-dsl

Simple Rules example

rules = """
cuts = {"fitted", "wide-cut"}
lengths = {"short", "long", "calf-length", "knee-length"}
fabric_types = {"soft", "airy", "crinkled"}
fabrics = {"velour", "chiffon", "knit", "woven", "stretch"}

{IN_LIST(cuts)?, IN_LIST(lengths), WORD("dress")}->MARK("DRESS_TYPE")
{IN_LIST(lengths), IN_LIST(cuts), WORD("dress")}->MARK("DRESS_TYPE")
{IN_LIST(fabric_types)?, IN_LIST(fabrics)}->MARK("DRESS_FABRIC")
"""

Loading in spaCy

import spacy
from rita.shortcuts import setup_spacy


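# The "en" shortcut only exists in spaCy v2; on v3 load a full model name
# such as "en_core_web_sm" (assumes that model is installed).
nlp = spacy.load("en_core_web_sm")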
setup_spacy(nlp, rules_string=rules)

And using it:

>>> r = nlp("She was wearing a short wide-cut dress")
>>> [{"label": e.label_, "text": e.text} for e in r.ents]
[{'label': 'DRESS_TYPE', 'text': 'short wide-cut dress'}]

Loading using Regex (standalone)

import rita

patterns = rita.compile_string(rules, use_engine="standalone")

And using it:

>>> list(patterns.execute("She was wearing a short wide-cut dress"))
[{'end': 38, 'label': 'DRESS_TYPE', 'start': 18, 'text': 'short wide-cut dress'}]
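The result dictionaries above carry character offsets, so each match can be checked or highlighted against the original text. A minimal sketch, using the sample output shown above as literal data rather than calling rita; `highlight` is a hypothetical helper and not part of the library:

```python
def highlight(text, matches):
    """Wrap each matched span in [LABEL: ...].

    Works right to left so earlier offsets stay valid while the string grows.
    """
    for m in sorted(matches, key=lambda m: m["start"], reverse=True):
        span = text[m["start"]:m["end"]]
        # The offsets should reproduce the reported text exactly.
        assert span == m["text"], (span, m["text"])
        text = f'{text[:m["start"]]}[{m["label"]}: {span}]{text[m["end"]:]}'
    return text


sample_text = "She was wearing a short wide-cut dress"
sample_matches = [
    {"end": 38, "label": "DRESS_TYPE", "start": 18, "text": "short wide-cut dress"},
]
highlighted = highlight(sample_text, sample_matches)
print(highlighted)
```

Because `start` and `end` index directly into the input string, no extra tokenization is needed to recover the matched text.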

Comments

JetBrains RITA plugin not compatible with PyCharm 2020.2.1

Plugin Version: 1.2 https://plugins.jetbrains.com/plugin/15011-rita-language/versions/

Tested Version: PyCharm 2020.2.1 (Professional Edition)

Error when trying to install from disk:

The plugin site https://plugins.jetbrains.com/plugin/15011-rita-language/versions/ says that the plugin should be incompatible with all IntelliJ-based IDEs in the 2020.2 version:

The list of supported products was determined by dependencies defined in the plugin.xml:

Android Studio: build 201.7223 to 201.*
DataGrip: 2020.1.3 to 2020.1.5
IntelliJ IDEA Ultimate: 2020.1.1 to 2020.1.4
Rider: 2020.1.3
PyCharm Professional: 2020.1.1 to 2020.1.4
PyCharm Community: 2020.1.1 to 2020.1.4
PhpStorm: 2020.1.1 to 2020.1.4
IntelliJ IDEA Educational: 2020.1.1 to 2020.1.2
CLion: 2020.1.1 to 2020.1.3
PyCharm Educational: 2020.1.1 to 2020.1.2
GoLand: 2020.1.1 to 2020.1.4
AppCode: 2020.1.2 to 2020.1.6
RubyMine: 2020.1.1 to 2020.1.4
MPS: 2020.1.1 to 2020.1.4
IntelliJ IDEA Community: 2020.1.1 to 2020.1.4
WebStorm: 2020.1.1 to 2020.1.4

opened by rolandmueller 3

IN_LIST ignores OP quantifier

Somehow I get this unexpected behaviour when using OP quantifiers (?, *, +, etc) with the IN_LIST element:

import rita

rules = """
list_elements = {"one", "two"}
{IN_LIST(list_elements)?}->MARK("LABEL")
"""
rules = rita.compile_string(rules)
expected_result = "[{'label': 'LABEL', 'pattern': [{'LOWER': {'REGEX': '^(one|two)$'}, 'OP': '?'}]}]"
print("expected_result:", expected_result)
print("result:", rules)
assert str(rules) == expected_result

Version: 0.5.0

bug

opened by rolandmueller 3

Add module regex
This feature would introduce the REGEX element as a module.

Matches words based on a regex pattern, e.g. all words that start with an 'a' would be REGEX("^a").

!IMPORT("rita.modules.regex")
{REGEX("^a")}->MARK("TAGGED_MATCH")
opened by rolandmueller 2
Feature/pluralize
Add a new module for a PLURALIZE tag. For a noun or a list of nouns, it will match any singular or plural word. Usage for a single word, e.g.:

PLURALIZE("car")

Usage for lists, e.g.:

vehicles = {"car", "bicycle", "ship"}
PLURALIZE(vehicles)

Will work even with regex, or if spaCy's lemmatizer makes an error. Depends on the Python inflect package: https://pypi.org/project/inflect/
opened by rolandmueller 2
Feature/regex tag

This feature would introduce the TAG element as a module. It needs a new parser for the spaCy translation step. It would allow more flexible matching of detailed part-of-speech tags, such as all adjectives or nouns: TAG("^NN|^JJ").

opened by rolandmueller 2
Feature/improve robustness

In general: measure how long compilation takes, and avoid situations where a pattern creates an infinite loop (this situation can be reached using regex).

Closes: https://github.com/zaibacu/rita-dsl/issues/78

opened by zaibacu 1
Add TAG_WORD macro to Tag module
This feature would introduce the TAG_WORD element to the Tag module.

TAG_WORD generates TAG patterns for a word or a list of words.

E.g. match "proposed" only when it is used as a verb in the sentence (and not as an adjective):

!IMPORT("rita.modules.tag")
TAG_WORD("^VB", "proposed")

Or, e.g., match a list of words only when they are verbs:

!IMPORT("rita.modules.tag")
words = {"perceived", "proposed"}
{TAG_WORD("^VB", words)}->MARK("LABEL")
opened by rolandmueller 1
Add Orth module
This feature would introduce the ORTH element as a module.

Ignores the case-insensitive configuration and checks words exactly as written, i.e. case-sensitively, even if the configuration is case-insensitive. Especially useful for acronyms and proper names.

Works only with the spaCy engine.

Usage:

!IMPORT("rita.modules.orth")
{ORTH("IEEE")}->MARK("TAGGED_MATCH")
opened by rolandmueller 1
Add configuration for implicit hyphen characters between words

Add a new configuration implicit_hyphon (default false) for automatically adding hyphen characters (-) to the rules. Enabling implicit_hyphon disables implicit_punct. Rationale: implicit_punct is often too inclusive. It includes the hyphen token, but it also adds (at least in my use case) unwanted tokens, such as parentheses, to the matches, especially for more complex rules. So implicit_hyphon is a bit stricter than implicit_punct.

opened by rolandmueller 1
Fix sequential optional

Closes https://github.com/zaibacu/rita-dsl/issues/69

It turns out to be a bug related to the - character, which in most cases is used as a splitter, but in this case is a standalone word.

opened by zaibacu 1

Method to validate syntax

Currently it can be partially done:

from rita.parser import RitaParser
from rita.config import SessionConfig
config = SessionConfig()
p = RitaParser(config)
p.build()
result = p.parse(rules)
if result is None:
    raise RuntimeError("... Something is wrong with syntax")

But it would be nice to have a single method for that, one that reports actual error information.

enhancement

opened by zaibacu 0

Dynamic case sensitivity for Standalone Engine

We want to be able to make a specified word inside a pattern case-sensitive, while the rest of the pattern is case-insensitive.

It looks like this can be achieved using the regex inline modifier groups feature, which requires Python 3.6+.
enhancement

opened by zaibacu 0
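The inline modifier groups mentioned in the case-sensitivity request above are a feature of Python's standard `re` module. A minimal sketch of the idea, not how the standalone engine implements it; the pattern and the `matches` helper are illustrative assumptions:

```python
import re

# The pattern as a whole is case-insensitive, but the (?-i:...) group
# switches case-insensitivity off for the acronym only (Python 3.6+).
pattern = re.compile(r"\b(?-i:IEEE) standard\b", re.IGNORECASE)


def matches(text):
    """Return True if the pattern is found anywhere in the text."""
    return pattern.search(text) is not None


print(matches("the IEEE Standard"))  # True: "Standard" may vary in case
print(matches("the ieee standard"))  # False: "IEEE" must match exactly
```

Scoping the modifier to a group keeps a single compiled pattern per rule, instead of splitting the rule into case-sensitive and case-insensitive parts.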
JS rule engine

Should work similarly to the standalone engine, maybe even inherit most of it, but it should produce valid JavaScript code, preferably a single function that takes raw text and returns multiple parsed entities.
enhancement help wanted

opened by zaibacu 0
Allow LOAD macro to load from external locations

Currently the LOAD(file_name) macro searches for a text file in the current path.

Reading from a local file is usually best, but it would be nice to be able to pass, for example, a GitHub Gist URL and load everything we need from there. This would be very useful for the demo page.
good first issue

opened by zaibacu 0
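One way to picture the external LOAD request above is a loader that accepts either a URL or a local path. This is only a sketch under assumptions: `is_remote` and `load_lines` are hypothetical names, not rita functions, and the one-entry-per-line file format is assumed rather than taken from the source.

```python
import tempfile
import urllib.parse
import urllib.request
from pathlib import Path


def is_remote(source):
    """Treat http(s) URLs as remote; everything else as a local path."""
    return urllib.parse.urlparse(source).scheme in ("http", "https")


def load_lines(source, timeout=10):
    """Return the non-empty lines of a word list from a URL or a local file."""
    if is_remote(source):
        with urllib.request.urlopen(source, timeout=timeout) as response:
            raw = response.read().decode("utf-8")
    else:
        raw = Path(source).read_text(encoding="utf-8")
    return [line.strip() for line in raw.splitlines() if line.strip()]


# Local usage; a raw Gist URL would go through the same function.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "colors.txt"
    path.write_text("red\ngreen\n\nblue\n", encoding="utf-8")
    local_words = load_lines(str(path))

print(local_words)
```

Keeping the local and remote branches behind one function means the rule file would not need to change when a list moves from disk to a URL.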

Releases(0.7.0)

0.7.0(Feb 2, 2021)

Features

The standalone engine now returns a submatches list containing the start and end of each part of a match #93

Partially covered https://github.com/zaibacu/rita-dsl/issues/70

Allow nested patterns, like:

num_with_fractions = {NUM, WORD("-")?, IN_LIST(fractions)}
complex_number = {NUM|PATTERN(num_with_fractions)}
{PATTERN(complex_number)}->MARK("NUMBER")

#95

Submatches for rita-rust engine #96

Regex module which allows specifying a word pattern, e.g. REGEX("^a") means the word must start with the letter "a"

Implemented by: Roland M. Mueller (https://github.com/rolandmueller) #101

ORTH module which allows you to specify a case-sensitive entry while the rest of the rules ignore case. Used for acronyms and proper names

Implemented by: Roland M. Mueller (https://github.com/rolandmueller) #102

Additional macro for the tag module, allowing a specific word or list of words to be tagged

Implemented by: Roland M. Mueller (https://github.com/rolandmueller) #103

Added names module which allows generating variations of person names #105

spaCy v3 Support #109
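The submatches feature listed above (#93) reports a start and end for each part of a match. With plain `re`, the same idea can be sketched using named groups. The pattern, group names, and output shape below are illustrative assumptions, not the format rita actually returns:

```python
import re

# Hypothetical pattern: each part of the rule becomes a named group.
pattern = re.compile(
    r"(?P<length>short|long)\s(?P<cut>fitted|wide-cut)\s(?P<item>dress)"
)


def find_with_submatches(text):
    """Return each full match together with the span of every named part."""
    results = []
    for m in pattern.finditer(text):
        results.append({
            "text": m.group(0),
            "start": m.start(),
            "end": m.end(),
            # One entry per named group that took part in the match.
            "submatches": [
                {"key": name, "start": m.start(name), "end": m.end(name)}
                for name, value in m.groupdict().items()
                if value is not None
            ],
        })
    return results


found = find_with_submatches("She was wearing a short wide-cut dress")
print(found)
```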

Fix

Optimizations for Rust Engine

No need for passing text forward and backward, we can calculate from text[start:end]

Grouping and sorting logic can be done in binary code #88

Fix NUM parsing bug #90

Switch from (^\s) to \b when doing IN_LIST. Should solve several corner cases #91

Fix floating point number matching #92

Revert #91 changes. Keep the old way of handling word boundaries #94

0.6.0(Aug 29, 2020)

Features

Implemented ability to alias macros, eg.:

numbers = {"one", "two", "three"}
@alias IN_LIST IL
IL(numbers) -> MARK("NUMBER")

Now using "IL" will actually call "IN_LIST" macro. #66

Introduced the TAG element as a module. It needs a new parser for the spaCy translation step. It allows more flexible matching of detailed part-of-speech tags, such as all adjectives or nouns: TAG("^NN|^JJ").

Implemented by: Roland M. Mueller (https://github.com/rolandmueller) #81

Added a new module for a PLURALIZE tag. For a noun or a list of nouns, it will match any singular or plural word.

Implemented by: Roland M. Mueller (https://github.com/rolandmueller) #82

Added a new configuration implicit_hyphon (default false) for automatically adding hyphen characters (-) to the rules.

Implemented by: Roland M. Mueller (https://github.com/rolandmueller) #84

Allow passing a custom regex implementation. By default, re is used #86

An interface to be able to use the Rust engine.

In general it is identical to standalone, but differs in one crucial part: all of the rules are compiled into actual binary code, which provides a large performance boost. It is proprietary, because there are various caveats: the engine itself is a bit more fragile and needs to be tuned to a very specific case (e.g. a few long texts with many matches vs. many short texts with few matches). #87

Fix

Fix bug with the - character when it is used as a standalone word #71

Fix regex matching when the shortest word is selected from IN_LIST #72

Fix IN_LIST regex so that it does not match part of a word #75

Fix IN_LIST operator bug: operators were being ignored #77

Use list branching only when using spaCy Engine #80
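Two of the IN_LIST fixes above (#72 and #75) concern a shorter list entry winning over a longer one and a list entry matching part of a word. The general regex pitfalls behind such bugs can be sketched as follows. `list_to_regex` is a hypothetical helper, and the whitespace lookarounds are one possible choice, not necessarily what rita uses:

```python
import re


def list_to_regex(words):
    """Build an alternation that prefers longer entries and matches whole words."""
    # Longest first: alternation takes the first branch that matches,
    # so "knee" listed before "knee-length" would win otherwise.
    ordered = sorted(words, key=len, reverse=True)
    body = "|".join(re.escape(w) for w in ordered)
    # Whitespace lookarounds instead of \b, so an entry is only accepted
    # when it is not glued to other non-space characters.
    return re.compile(rf"(?<!\S)(?:{body})(?!\S)")


naive = re.compile("knee|knee-length")
careful = list_to_regex({"knee", "knee-length"})

text = "a knee-length skirt"
print(naive.search(text).group(0))    # knee
print(careful.search(text).group(0))  # knee-length
```

With the lookarounds in place, `careful` also rejects "knees", where the naive pattern would match the first four letters.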

0.5.0(Jun 18, 2020)
Features

Added PREFIX macro which allows attaching a word in front of list items or words #47

Allow passing variables directly when calling compile and compile_string #51

Allow compiling (and later loading) rules using the rita CLI while using the standalone engine (spaCy is already supported) #53

Added ability to import rule files into a rule file. Recursive import is supported as well. #55

Added possibility to define a pattern as a variable and reuse it in other patterns:

Example:

ComplexNumber = {NUM+, WORD("/")?, NUM?}
{PATTERN(ComplexNumber), WORD("inches"), WORD("Height")}->MARK("HEIGHT")
{PATTERN(ComplexNumber), WORD("inches"), WORD("Width")}->MARK("WIDTH")

#64

Fix

Fix issue with multiple wildcard words using standalone engine #46

Don't crash when no rules are provided #50

Fix Number and ANY-OF parsing #59

Allow escape characters inside LITERAL #62

0.4.0(Jan 25, 2020)

Features

Support for deaccent. In general, if an accented version of a word is given, both the deaccented and accented versions will be used to match. To turn it off: !CONFIG("deaccent", "N") #38

Added shortcuts module to simplify injecting into spaCy #42
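The deaccent support described above matches both the accented and the deaccented form of a word. A minimal sketch of how such variants can be produced with the standard library; `deaccent` and `variants` are hypothetical helpers, and rita's own implementation may differ:

```python
import unicodedata


def deaccent(word):
    """Strip combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def variants(word):
    """Return the word plus its deaccented form, without duplicates."""
    plain = deaccent(word)
    return [word] if plain == word else [word, plain]


print(variants("caf\u00e9"))  # the accented form and "cafe"
print(variants("dress"))      # ['dress']
```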

Fix

Fix issue with spaCy rules that use IN_LIST in case-sensitive mode. It was creating a regex pattern which is not a valid spaCy pattern #40

0.3.2(Dec 19, 2019)
Features

Introduced towncrier to track changes

Added linter flake8

Refactored code to match pep8 #32

Fix

Fix WORD split by -

Split by (empty space) as well

Coverage score increase #35

0.3.0(Dec 14, 2019)

Now there's one global config and child config created per-session (one session = one rule file compilation). Imports and variables are stored in this config as well.

Remove context argument from MACROS, making code cleaner and easier to read
0.2.2(Dec 8, 2019)
Features up to this point:

Standalone parser: can use internal regex rather than spaCy if you need to

Ability to do logical OR in a rule, e.g. {WORD(w1)|WORD(w2),WORD(w3)} would result in two rules: {WORD(w1),WORD(w3)} and {WORD(w2),WORD(w3)}

Exclude operator: {WORD(w1), WORD(w2)!} would match w1 and anything but w2
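The logical OR expansion described above can be pictured as a Cartesian product over the alternatives in each slot of a rule. This is a sketch of the idea only; the list-of-slots representation and `expand_alternatives` are assumptions for illustration, not rita's internal data model:

```python
from itertools import product


def expand_alternatives(rule):
    """Expand a rule whose slots may hold several alternatives.

    `rule` is a list of slots; each slot is a list of alternative elements.
    """
    return [list(choice) for choice in product(*rule)]


# {WORD(w1)|WORD(w2),WORD(w3)} modelled as two slots.
rule = [["WORD(w1)", "WORD(w2)"], ["WORD(w3)"]]
expanded = expand_alternatives(rule)
for r in expanded:
    print("{" + ",".join(r) + "}")
```

The number of generated rules is the product of the alternatives per slot, so rules with many OR slots grow quickly.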

0.1(Nov 16, 2019)


Owner

Šarūnas Navickas

Data Engineer @ TokenMill. Doing BJJ @ Voras-Bjj. Dad @ Home.
