Simple multilingual lemmatizer for Python, especially useful for speed and efficiency

Overview

Simplemma: a simple multilingual lemmatizer for Python


Purpose

Lemmatization is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form. Unlike stemming, lemmatization outputs word units that are still valid linguistic forms.

In modern natural language processing (NLP), this task is often indirectly tackled by more complex systems encompassing a whole processing pipeline. However, there appears to be no straightforward way to address lemmatization on its own in Python, although the task is useful in information retrieval and natural language processing.

Simplemma provides a simple and multilingual approach to look for base forms or lemmata. It may not be as powerful as full-fledged solutions but it is generic, easy to install and straightforward to use. In particular, it doesn't need morphosyntactic information and can process a raw series of tokens or even a text with its built-in (simple) tokenizer. By design it should be reasonably fast and work in a large majority of cases, without being perfect.

With its comparatively small footprint it is especially useful when speed and simplicity matter, for educational purposes or as a baseline system for lemmatization and morphological analysis.

Currently, 38 languages are partly or fully supported (see table below).

Installation

The current library is written in pure Python with no dependencies:

pip install simplemma

  • pip3 where applicable
  • pip install -U simplemma for updates

Usage

Word-by-word

Simplemma is used by selecting a language of interest and then applying its data to a list of words.

>>> import simplemma
# get a word
>>> myword = 'masks'
# decide which language data to load
>>> langdata = simplemma.load_data('en')
# apply it on a word form
>>> simplemma.lemmatize(myword, langdata)
'mask'
# grab a list of tokens
>>> mytokens = ['Hier', 'sind', 'Vaccines']
>>> langdata = simplemma.load_data('de')
>>> for token in mytokens:
...     simplemma.lemmatize(token, langdata)
'hier'
'sein'
'Vaccines'
# list comprehensions can be faster
>>> [simplemma.lemmatize(t, langdata) for t in mytokens]
['hier', 'sein', 'Vaccines']

Chaining several languages can improve coverage:

>>> langdata = simplemma.load_data('de', 'en')
>>> simplemma.lemmatize('Vaccines', langdata)
'vaccine'
>>> langdata = simplemma.load_data('it')
>>> simplemma.lemmatize('spaghettis', langdata)
'spaghettis'
>>> langdata = simplemma.load_data('it', 'fr')
>>> simplemma.lemmatize('spaghettis', langdata)
'spaghetti'
>>> simplemma.lemmatize('spaghetti', langdata)
'spaghetto'

There are cases in which a greedier decomposition and lemmatization algorithm is better. It is deactivated by default:

# same example as before, resolved in one step
>>> simplemma.lemmatize('spaghettis', langdata, greedy=True)
'spaghetto'
# a German case
>>> langdata = simplemma.load_data('de')
>>> simplemma.lemmatize('angekündigten', langdata, greedy=True)
'ankündigen' # infinitive verb
>>> simplemma.lemmatize('angekündigten', langdata, greedy=False)
'angekündigt' # past participle

Tokenization

A simple tokenization function is included for convenience:

>>> from simplemma import simple_tokenizer
>>> simple_tokenizer('Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.')
['Lorem', 'ipsum', 'dolor', 'sit', 'amet', ',', 'consectetur', 'adipiscing', 'elit', ',', 'sed', 'do', 'eiusmod', 'tempor', 'incididunt', 'ut', 'labore', 'et', 'dolore', 'magna', 'aliqua', '.']
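One way such a tokenizer can be implemented is a single regular expression separating word characters from punctuation; the pattern below is an illustrative sketch, not simplemma's actual implementation:

```python
import re

# Illustrative sketch only: split text into word tokens and punctuation.
# This is NOT simplemma's actual pattern, just a minimal approximation.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def sketch_tokenizer(text):
    """Return a list of word and punctuation tokens."""
    return TOKEN_RE.findall(text)

print(sketch_tokenizer("Lorem ipsum dolor sit amet, consectetur."))
# ['Lorem', 'ipsum', 'dolor', 'sit', 'amet', ',', 'consectetur', '.']
```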

The function text_lemmatizer() chains tokenization and lemmatization. It can take greedy (affecting lemmatization) and silent (affecting errors and logging) as arguments:

>>> from simplemma import text_lemmatizer
>>> langdata = simplemma.load_data('pt')
>>> text_lemmatizer('Sou o intervalo entre o que desejo ser e os outros me fizeram.', langdata)
# caveat: desejo is also a noun, should be desejar here
['ser', 'o', 'intervalo', 'entre', 'o', 'que', 'desejo', 'ser', 'e', 'o', 'outro', 'me', 'fazer', '.']

Caveats

# don't expect too much though
>>> langdata = simplemma.load_data('it')
# this diminutive form isn't in the model data
>>> simplemma.lemmatize('spaghettini', langdata)
'spaghettini' # should read 'spaghettino'
# the algorithm cannot choose between valid alternatives yet
>>> langdata = simplemma.load_data('es')
>>> simplemma.lemmatize('son', langdata)
'son' # valid common noun, but what about the verb form?

As the focus lies on overall coverage, some short frequent words (typically pronouns) may need post-processing; this generally concerns 10-20 tokens per language.
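A hedged workaround for these few tokens is a small, language-specific override table applied after lemmatization; the mapping and function below are purely hypothetical and not part of simplemma:

```python
# Hypothetical post-processing for a handful of short frequent words.
# The override entries below are illustrative, not simplemma data.
OVERRIDES = {
    "de": {"die": "die", "das": "das"},
}

def postprocess(token, lemma, lang):
    """Prefer a manual override over the lemmatizer's output."""
    return OVERRIDES.get(lang, {}).get(token.lower(), lemma)

print(postprocess("das", "der", "de"))   # das
print(postprocess("Haus", "Haus", "de"))  # Haus
```

Since only a dozen or two tokens per language are concerned, such a table stays small and cheap to maintain.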

Additionally, the current absence of morphosyntactic information is both an advantage in terms of simplicity and a hard limit on lemmatization accuracy, e.g. when disambiguating between past participles and adjectives derived from verbs in Germanic and Romance languages. In such cases simplemma mostly leaves the input unchanged.

The greedy algorithm rarely produces invalid forms. It is designed to work best in the low-frequency range, notably for compound words and neologisms. Aggressive decomposition is only useful as a general approach for morphologically rich languages; it can also act as a linguistically motivated stemmer.
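The greedy strategy can be pictured as repeated affix stripping with a dictionary lookup at each step; the toy dictionary and suffix list below are invented for illustration and do not reflect simplemma's actual data or rules:

```python
# Toy sketch of greedy decomposition: repeatedly shorten the token and
# retry the dictionary lookup. Dictionary and suffixes are invented.
LEMMAS = {"spaghetti": "spaghetto"}
SUFFIXES = ("s", "es")

def greedy_lemmatize(token):
    candidate = token
    for _ in range(2):  # limit the decomposition depth
        if candidate in LEMMAS:
            candidate = LEMMAS[candidate]
            continue
        for suffix in SUFFIXES:
            if candidate.endswith(suffix) and len(candidate) > len(suffix) + 2:
                candidate = candidate[: -len(suffix)]
                break
        else:
            break  # nothing left to strip
    return candidate

print(greedy_lemmatize("spaghettis"))  # spaghetto
```

The length guard keeps very short tokens from being over-stripped, which is why the approach mostly stays safe outside the low-frequency range.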

Bug reports via the issues page are welcome.

Supported languages

The following languages are available using their ISO 639-1 code:

Available languages (2021-10-19)
Code Language Word pairs Acc. Comments
bg Bulgarian 73,847   low coverage
ca Catalan 579,507    
cs Czech 34,674   low coverage
cy Welsh 360,412    
da Danish 554,238   alternative: lemmy
de German 683,207 0.95 on UD DE-GSD, see also German-NLP list
el Greek 76,388   low coverage
en English 136,162 0.94 on UD EN-GUM, alternative: LemmInflect
es Spanish 720,623 0.94 on UD ES-GSD
et Estonian 133,104   low coverage
fa Persian 10,967   low coverage
fi Finnish 2,106,359   alternatives: voikko or NLP list
fr French 217,213 0.94 on UD FR-GSD
ga Irish 383,448    
gd Gaelic 48,661    
gl Galician 384,183    
gv Manx 62,765    
hu Hungarian 458,847    
hy Armenian 323,820    
id Indonesian 17,419 0.91 on UD ID-CSUI
it Italian 333,680 0.92 on UD IT-ISDT
ka Georgian 65,936    
la Latin 850,283    
lb Luxembourgish 305,367    
lt Lithuanian 247,337    
lv Latvian 57,153    
mk Macedonian 57,063    
nb Norwegian (Bokmål) 617,940    
nl Dutch 254,073 0.91 on UD NL-Alpino
pl Polish 3,723,580    
pt Portuguese 933,730 0.92 on UD PT-GSD
ro Romanian 311,411    
ru Russian 607,416   alternative: pymorphy2
sk Slovak 846,453 0.87 on UD SK-SNK
sl Slovenian 97,050   low coverage
sv Swedish 658,606   alternative: lemmy
tr Turkish 1,333,137 0.88 on UD TR-Boun
uk Ukrainian 190,472   alternative: pymorphy2

A "low coverage" mention means you would probably be better off with a language-specific library, but simplemma will still work to a limited extent. Open-source Python alternatives are referenced where possible.

The scores are calculated on Universal Dependencies treebanks, on single-word tokens (including some contractions but not merged prepositions); they describe to what extent simplemma can accurately map tokens to their lemma form. They can be reproduced with the script udscore.py in the tests/ folder.
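The evaluation boils down to token-level accuracy over (token, gold lemma) pairs; a minimal version of that computation, with made-up gold data instead of a real treebank, might look like:

```python
# Minimal token-level accuracy over (token, gold lemma) pairs.
# The gold data here is invented for illustration.
gold = [("masks", "mask"), ("is", "be"), ("cats", "cat")]

def accuracy(predict, pairs):
    correct = sum(1 for token, lemma in pairs if predict(token) == lemma)
    return correct / len(pairs)

# toy stand-in predictor instead of a real lemmatizer call:
toy = {"masks": "mask", "cats": "cat"}
score = accuracy(lambda t: toy.get(t, t), gold)
print(f"{score:.2f}")  # 0.67
```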

Roadmap

  • [-] Add further lemmatization lists
  • [ ] Grammatical categories as option
  • [ ] Function as a meta-package?
  • [ ] Integrate optional, more complex models?

Credits

The software is under MIT license; for the linguistic information databases, see the licenses folder.

The surface lookups (non-greedy mode) use lemmatization lists gathered from various sources.

This rule-based approach, relying on inflection and lemmatization dictionaries, is still used today in popular libraries such as spaCy.

Contributions

Feel free to contribute, notably by filing issues for feedback, bug reports, or links to further lemmatization lists, rules and tests.

You can also contribute to this lemmatization list repository.

Other solutions

See lists: German-NLP and other awesome-NLP lists.

For a more complex and universal approach in Python see universal-lemmatizer.

References

Barbaresi A. (2021). Simplemma: a simple multilingual lemmatizer for Python. Zenodo. http://doi.org/10.5281/zenodo.4673264

This work draws on lexical analysis algorithms used in prior lemmatization research.

Comments
  • Huge memory usage for some languages, e.g. Finnish

    While testing https://github.com/NatLibFi/Annif/pull/626, I noticed that Simplemma needs a lot of memory for Finnish language lemmatization and language detection. I did a little comparison to see how the choice of language affects memory consumption.

    First the baseline, English:

    #!/usr/bin/env python
    
    from simplemma import text_lemmatizer
    
    print(text_lemmatizer("she sells seashells on a seashore", lang='en'))
    

    Results (partial) of running this with /usr/bin/time -v:

    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.14
    Maximum resident set size (kbytes): 41652
    

    So far so good. Then other languages I know.

    Estonian:

    print(text_lemmatizer("jääääre kuuuurija töööö", lang='et'))
    
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.16
    Maximum resident set size (kbytes): 41252
    
    • practically same as English

    Swedish:

    print(text_lemmatizer("sju sköna sjuksköterskor skötte sju sjösjuka sjömän", lang='sv'))
    
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.70
    Maximum resident set size (kbytes): 190920
    
    • somewhat slower, needs 150MB more memory

    ...

    Finnish:

    print(text_lemmatizer("mustan kissan paksut posket", lang='fi'))
    
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:04.44
    Maximum resident set size (kbytes): 1059328
    
    • oops, now takes at least an order of magnitude longer and 1GB of memory! Why?

    I checked the sizes of the data files for these four languages:

     14M fi.plzma
    2.6M sv.plzma
    604K et.plzma
    479K en.plzma
    

    There seems to be a correlation with memory usage here. The data file for Finnish is much bigger than the others, and Swedish is also big compared to English and Estonian. But I think something must be wrong if a data file that is ~10MB larger than the others leads to 1GB of extra memory usage.

    enhancement 
    opened by osma 19
  • Support for Northern Sami language

    I propose that Simplemma could support the Northern Sami language (ISO 639-1 code se). I understood from this discussion that adding a new language would require a corpus of word + lemma pairs. My colleague @nikopartanen found at least these two corpora that could perhaps be used as raw material:

    SIKOR North Saami free corpus - this is a relatively large (9M tokens) corpus. From the description:

    The corpus has been automatically processed and linguistically analyzed with the Giellatekno/Divvun tools. Therefore, it may contain wrong annotations.

    Another obvious corpus is the Universal Dependency treebank for Northern Sami.

    Thoughts on these two corpora? What needs to be done to make this happen?

    enhancement 
    opened by osma 12
  • Add some simple suffix rules for Finnish

    This PR adds some generated suffix rules for Finnish language, as discussed in #19.

    All these rules have an accuracy above 90% as evaluated on the Finnish language dictionary included with Simplemma. Collectively they cover around 6% of the dictionary entries.

    opened by osma 10
  • Porting to C++

    This seems like a relatively simple library since I imagine most of the work is done on the data collection part. Would you consider porting it to C++ (and possibly other languages) so it can be more easily integrated to user applications?

    question 
    opened by 1over137 3
  • Sourcery refactored main branch

    Branch main refactored by Sourcery.

    If you're happy with these changes, merge this Pull Request using the Squash and merge strategy.

    See our documentation here.

    Run Sourcery locally

    Reduce the feedback loop during development by using the Sourcery editor plugin:

    Review changes via command line

    To manually merge these changes, make sure you're on the main branch, then run:

    git fetch origin sourcery/main
    git merge --ff-only FETCH_HEAD
    git reset HEAD^
    

    Help us improve this pull request!

    opened by sourcery-ai[bot] 3
  • German lemmatization: Definite articles

    Hi @adbar,

    thanks for providing this really cool (and capable) library.

    Currently, all German definite articles ("der", "die", "das") are lemmatized to "der", which is wrong.

    import simplemma
    simplemma.lemmatize("Das", lang=('de',))
    # Output: 'der'
    simplemma.lemmatize("Die", lang=('de',))
    # Output: 'der'
    simplemma.lemmatize("Der", lang=('de',))
    # Output: 'der'
    

Normally, I would not be too picky about wrong lemmatization (this happens all the time). But since these are among the most common German words, this should be correct. Their respective lemmas should be the words themselves: "Das" -> "Das", "Die" -> "Die", "Der" -> "Der". The lemma should be lowercase if the original token is lowercased, too.

    question 
    opened by GrazingScientist 2
  • Add some simple suffix rules for Finnish (Sourcery refactored)

    Pull Request #23 refactored by Sourcery.

    Since the original Pull Request was opened as a fork in a contributor's repository, we are unable to create a Pull Request branching from it.

    To incorporate these changes, you can either:

    1. Merge this Pull Request instead of the original, or

    2. Ask your contributor to locally incorporate these commits and push them to the original Pull Request

      Incorporate changes via command line
      git fetch https://github.com/adbar/simplemma pull/23/head
      git merge --ff-only FETCH_HEAD
      git push

    NOTE: As code is pushed to the original Pull Request, Sourcery will re-run and update (force-push) this Pull Request with new refactorings as necessary. If Sourcery finds no refactorings at any point, this Pull Request will be closed automatically.

    See our documentation here.

    Run Sourcery locally

    Reduce the feedback loop during development by using the Sourcery editor plugin:

    Help us improve this pull request!

    opened by sourcery-ai[bot] 2
  • English lemmatizer returns funny nonsense lemma 'spitshine' for token 'e'

    Probably not intended, but this comes up with the current version 0.8.0

    import simplemma
    simplemma.lemmatize("e", "en")
    # Output: 'spitshine'

    BR, Bart Depoortere

    bug 
    opened by bartdpt 2
  • How can I speed up Simplemma for Polish?

    First of all, thank you for creating this great lemmatizer!

I've used it for English and it's blazing fast (in my trials, it found the lemma in less than 10 ms). For other languages, such as Portuguese and Spanish, it's still reasonably fast, with lemmatization completing in under 50 ms.

    For Polish, however, lemmatization is taking over 2 seconds. I know Polish is a more complicated language for lemmatizing because it's inflected, but is there a way I can speed it up? Ideally, I'd like to have it below 100ms, but even under 500ms I think it would be good enough.

    In my trials, I'm passing a single word and doing it simply like this:

    import simplemma
    import time
    
    start = time.time()
    langdata = simplemma.load_data('pl')
    print(simplemma.lemmatize('robi', langdata))
    end = time.time()
    print(end-start)
    
    opened by rafaelsaback 2
  • Support for Asturian

    Hi, the README says that this project uses data from Lemmatization lists by Michal Měchura, but the Asturian language is not supported in simplemma.

I'm wondering if there are any plans to add support for Asturian?

    enhancement 
    opened by BLKSerene 1
  • Documentation: Please elaborate on the `greedy` parameter

    Hi @adbar ,

could you please go into more detail in your documentation on what the greedy parameter actually does? In the documentation, it is mentioned as if it were self-explanatory. However, I really cannot estimate its potential/dangers.

    Thanks! :smiley:

    documentation 
    opened by GrazingScientist 0
  • Czech tokenizer is very slow

    Thanks for this great tool! I'm using it to align English lemmas with Czech lemmas for an MT use case. I'm finding that while the English lemmatizer is extremely fast (haven't benchmarked, but your ~250k/s estimate seems reasonably close), Czech is quite slow.

    A tiny microbenchmark (admittedly misleading) shows disparity between the two:

    In [1]: from simplemma import lemmatize
    
    In [2]: %timeit -n1 lemmatize("fanoušků", lang="cs")
    The slowest run took 768524.54 times longer than the fastest. This could mean that an intermediate result is being cached.
    27.6 ms ± 67.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    
    In [3]: %timeit -n1 lemmatize("banks", lang="en")
    The slowest run took 464802.09 times longer than the fastest. This could mean that an intermediate result is being cached.
    16.9 ms ± 41.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    

I see that lemmas are cached in the source code; with a single loop we see a nearly double penalty for Czech over English. I'm not exactly sure how to improve this immediately, so I'm reporting it here for awareness and for others. 😄
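One possible caller-side mitigation is memoizing repeated lookups with functools.lru_cache; the wrapped function below is a toy stand-in, not simplemma's actual API:

```python
from functools import lru_cache

@lru_cache(maxsize=65536)
def cached_lemmatize(token, lang):
    # Stand-in for an expensive lookup; in practice the body would call
    # a real lemmatizer instead of this toy dictionary.
    return {"fanoušků": "fanoušek"}.get(token, token)

cached_lemmatize("fanoušků", "cs")  # first call pays the lookup cost
cached_lemmatize("fanoušků", "cs")  # repeated calls hit the cache
print(cached_lemmatize.cache_info().hits)  # 1
```

This only helps workloads with many repeated tokens; it does nothing for the first-lookup latency discussed above.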

    question 
    opened by erip 3
  • Language detector can be tricked

I'm not sure if this is even worth reporting, but I was looking closer at how the language detection works, and I realized that for languages that have rules (currently English and German), it can quite easily be tricked into accepting nonsensical input and responding with high confidence that it is English (or probably German):

    >>> from simplemma.langdetect import in_target_language, lang_detector
    >>> in_target_language('grxlries cnwated pltdoms', lang='en')
    1.0
    >>> lang_detector('grxlries cnwated pltdoms', lang=('en','sv','de','fr','es'))
    [('en', 1.0), ('unk', 0.0), ('sv', 0.0), ('de', 0.0), ('fr', 0.0), ('es', 0.0)]
    

    This happens because the English language rules match the word suffixes (in this case, -ries, -ated and -doms) in the input words without looking too carefully at the preceding characters (the word length matters though).

    This is unlikely to happen with real world text inputs, but it's possible for an attacker to deliberately trick the detector.
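The failure mode can be reproduced with a naive suffix check; the toy rule list below is an invented stand-in for the real English rules:

```python
# Toy suffix-based "is this English?" check, showing why nonsense words
# with plausible endings slip through. The suffix list is invented.
ENGLISH_SUFFIXES = ("ries", "ated", "doms")

def looks_english(token):
    return token.endswith(ENGLISH_SUFFIXES)

words = "grxlries cnwated pltdoms".split()
print(sum(looks_english(w) for w in words) / len(words))  # 1.0
```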

    question 
    opened by osma 2
  • Additional inflection data for RU & UK

    Hi, I'm the author of SSM which is a language learning utility for quickly making vocabulary flashcards. Thanks for this project! Without this it would have been difficult to provide multilingual lemmatization, which is an essential aspect of this tool.

    However, I found that this is not particularly accurate for Russian. PyMorphy2 is a lemmatizer for Russian that I used in other projects. It's very fast and accurate in my experience, much more than spacy or anything else. Any chance you can include PyMorphy2's data in this library?

    question 
    opened by 1over137 10
  • Use additional sources for better coverage

    • [x] http://unimorph.ethz.ch/languages
    • [x] https://github.com/lenakmeth/Wikinflection-Corpus
    • [x] https://github.com/TALP-UPC/FreeLing/tree/master/data/
    • [x] https://github.com/explosion/spacy-lookups-data/tree/master/spacy_lookups_data/data
    • [x] https://github.com/tatuylonen/wiktextract
    • [ ] http://pauillac.inria.fr/~sagot/index.html#udlexicons
    enhancement 
    opened by adbar 0
Releases(v0.9.0)
  • v0.9.0(Oct 18, 2022)

  • v0.8.2(Sep 5, 2022)

    • languages added: Albanian, Hindi, Icelandic, Malay, Middle English, Northern Sámi, Nynorsk, Serbo-Croatian, Swahili, Tagalog
    • fix for slow language detection introduced in 0.7.0

    Full Changelog: https://github.com/adbar/simplemma/compare/v0.8.1...v0.8.2

    Source code(tar.gz)
    Source code(zip)
  • v0.8.1(Sep 1, 2022)

    • better rules for English and German
    • inconsistencies fixed for cy, de, en, ga, sv (#16)
    • docs: added language detection and citation info

    Full Changelog: https://github.com/adbar/simplemma/compare/v0.8.0...v0.8.1

    Source code(tar.gz)
    Source code(zip)
  • v0.8.0(Aug 2, 2022)

    • code fully type checked, optional pre-compilation with mypyc
    • fixes: logging error (#11), input type (#12)
    • code style: black

    Full Changelog: https://github.com/adbar/simplemma/compare/v0.7.0...v0.8.0

    Source code(tar.gz)
    Source code(zip)
  • v0.7.0(Jun 16, 2022)

    • breaking change: language data pre-loading now occurs internally, language codes are now directly provided in lemmatize() call, e.g. simplemma.lemmatize("test", lang="en")
    • faster lemmatization and result cache
    • sentence-aware text_lemmatizer()
    • optional iterators for tokenization and lemmatization

    Full Changelog: https://github.com/adbar/simplemma/compare/v0.6.0...v0.7.0

    Source code(tar.gz)
    Source code(zip)
  • v0.6.0(Apr 6, 2022)

    • improved language models
    • improved tokenizer
    • maintenance and code efficiency
    • added basic language detection (undocumented)

    Full Changelog: https://github.com/adbar/simplemma/compare/v0.5.0...v0.6.0

    Source code(tar.gz)
    Source code(zip)
  • v0.5.0(Nov 19, 2021)

  • v0.4.0(Oct 19, 2021)

    • new languages: Armenian, Greek, Macedonian, Norwegian (Bokmål), and Polish
    • language data reviewed for: Dutch, Finnish, German, Hungarian, Latin, Russian, and Swedish
    • Urdu removed from the language list due to issues with the data
    • add support for Python 3.10 and drop support for Python 3.4
    • improved decomposition and tokenization algorithms
    Source code(tar.gz)
    Source code(zip)
  • v0.3.0(Apr 8, 2021)

  • v0.2.2(Feb 24, 2021)

  • v0.2.1(Feb 2, 2021)

  • v0.2.0(Jan 25, 2021)

    • Languages added: Danish, Dutch, Finnish, Georgian, Indonesian, Latin, Latvian, Lithuanian, Luxembourgish, Turkish, Urdu
    • Improved word pair coverage
    • Tokenization functions added
    • Limit greediness and range of potential candidates
    Source code(tar.gz)
    Source code(zip)
Owner
Adrien Barbaresi
Research scientist – natural language processing, web scraping and text analytics. Mostly with Python.