Finally, decent dictionaries based on Wiktionary for your beloved eBook reader.

Overview

eBook Reader Dictionaries

Dictionaries

Update dictionaries

Requirements

Kobo

Kobo firmware >= 4.24. For older firmware versions, you can find outdated dictionaries here.

Updating Dictionaries

All dictionaries are automatically re-generated every day at midnight. The process uses the latest Wiktionary dump available at that time. Note that download links never change.

  • You should open an issue if:
    • you do not find a word;
    • a definition is not similar to the one on Wiktionary;
    • a definition is missing.
  • If a definition does not suit you, make the change on Wiktionary directly. Your change will likely be included in the next Wiktionary dump, and the dictionary regenerated at most 24 hours after that dump is published will contain it :)

Adding a new Dictionary

Pull requests are very welcome. Adding a new locale is quite straightforward: see HOWTO Add a New Locale.

Contributors

Thanks go to these wonderful people (emoji key):


Nicolas Froment

💻 📖

Attilio

💻

chopinesque

💻

Saeed Rasooli

🚇

This project follows the all-contributors specification. Contributions of any kind welcome!

Comments
  • Generate SVG rather than GIF for embedded pictures

    A successful experiment was done in https://github.com/BoboTiG/ebook-reader-dict/issues/1182#issuecomment-1027245425 about moving embedded pictures from GIF to SVG. Results are way better, so let's do the move.

    We first need to ensure this works with PyGlossary and StarDict display.

    Note: PyGlossary 4.4.2 or newer is required.

    opened by BoboTiG 60
  • PyGlossary conversion errors (missing images)

    Note from @BoboTiG: issue tightly coupled to #1182, interesting details can be found there too.


    I just downloaded, parsed and rendered the EN Wiktionary, and it apparently has some problems with erroneous and/or missing GIFs:

    output.txt

    All of the .gif files in data/en/res appear to be very ugly rendered formulae (?).

    opened by Moonbase59 51
  • New locale: DE

    My goal is to have (and share) a good German Wiktionary-based dictionary that displays well on small e-reader screens and is a little more informative (i.e., has word form, gender, hyphenation, IPA pronunciation, meaning, abbreviations, synonyms and examples). My main target format would be StarDict, with possible spinoff formats for Kobo (dicthtml?), PocketBook (?) and Tolino (quickdic).

    Too bad pyglossary doesn’t support R. Döffinger’s quickdic format, because Tolino devices use that, and we do have a rather large Tolino user base in Germany. Not everybody wants to jailbreak their device…

    I currently use DE Wiktionary dumps and a rather brute-force Rexx script to generate a Tabfile, which I then convert to StarDict and dicthtml formats. (See attached screenshots for how it looks in GoldenDict on Linux.)

    This is of course a flakey way to do it, and I’d prefer to collaborate with a more sound foundation like yours and integrate it there, also because yours gets auto-updated.

    Unfortunately, the HOWTO Add a New Locale section in the wiki here isn’t too detailed, and I’d probably need quite a bit of help to get started. I’m especially unsure about the first two steps and the "Remove all data from the old lang."

    So my questions are:

    1. Would you be interested in a German dictionary that should look approximately like the screenshots show?
    2. Is it possible to do, without investing too much time? (There’s a lot of other things I have to spend my time on, but I’d be willing to invest a substantial amount of time to get it started and polished a little.)
    3. Is there any assistance possible in getting me set up to get the first steps done? I reckon that’d be to set up a working environment on my Linux Mint 20.3 machine, do a fork, and start adding a language "de".
    4. Since I know almost nothing about Wiktionary’s internal structures, I fear the templates most. But having had a glance at your code, I think there is some expertise here…

    Screenshots: This is how I envision it to look like. Users on MobileRead and the German E-Reader Forum have been quite enthusiastic about the first version. Screenshots show the StarDict version used by GoldenDict on a Linux desktop.

    [Screenshots: wiktionary - GoldenDict_001, Wiktionarys - GoldenDict_001, Auswahl_194]

    Links to what exists already:

    locale:German 
    opened by Moonbase59 49
  • [EL] Add EL locale

    I am trying to add Greek. I wonder if you could give me some feedback on the regexes. Below you see some examples and what I have come up with so far (I tried editing the IT file). The pronunciation appears to have variant structures, not sure how to accommodate that.

    # Regex to find the pronunciation
    # {{ΔΦΑ|tɾeˈlos|γλ=el}}
    # {{ΔΦΑ|γλ=el|ˈni.xta}}
    pronunciation = r"{ΔΦΑ\|γλ=el\|/([^/]+)/"
    # Regex to find the gender
    # '''{{PAGENAME}}''' {{θ}}
    # '''{{PAGENAME}}''' {{ο}}
    # '''{{PAGENAME}}''' {{α}}
    gender = r"'''{{PAGENAME}}''' ([θαο])"
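
One possible way to accommodate both parameter orders shown in the comments above is to skip any named argument (such as γλ=el) and capture the first unnamed one. This is only a sketch: it assumes the IPA string is always the first positional argument of {{ΔΦΑ}} and that the gender template contains one or more of the letters θ, α, ο (as in {{αθ}}). The patterns are compiled here for the demonstration; whether the project expects raw strings or compiled patterns is not something this sketch asserts.

```python
import re

# Assumption: named arguments look like "name=value" and may appear before
# or after the pronunciation, which is the first argument without "=".
pronunciation = re.compile(r"\{\{ΔΦΑ\|(?:[^|}=]+=[^|}]*\|)*([^|}=]+)")

# Assumption: the gender is a template made only of the letters θ, α, ο.
gender = re.compile(r"'''\{\{PAGENAME\}\}''' \{\{([θαο]+)\}\}")

samples = ["{{ΔΦΑ|tɾeˈlos|γλ=el}}", "{{ΔΦΑ|γλ=el|ˈni.xta}}"]

# Extract the pronunciation from each variant
found = [pronunciation.search(sample).group(1) for sample in samples]

# Extract the gender letters from a headword line
genders = gender.search("'''{{PAGENAME}}''' {{αθ}}").group(1)
```

Both sample templates yield their IPA string regardless of where γλ=el sits.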
    
    

    I tried running it and I got

    >> Processing data\el\pages-20210620.xml ...
    Traceback (most recent call last):
      File "C:\Users\spiros\AppData\Local\Programs\Python\Python39\lib\runpy.py", line 197, in _run_module_as_main
        return _run_code(code, main_globals, None,
      File "C:\Users\spiros\AppData\Local\Programs\Python\Python39\lib\runpy.py", line 87, in _run_code
        exec(code, run_globals)
      File "C:\path1\wikidict\wikidict\__main__.py", line 118, in <module>
        sys.exit(main())
      File "C:\path1\wikidict\wikidict\__main__.py", line 110, in main
        parse.main(args["LOCALE"])
      File "C:\path1\wikidict\wikidict\parse.py", line 103, in main
        words = process(file, locale)
      File "C:\path1\wikidict\wikidict\parse.py", line 70, in process
        word, code = xml_parse_element(element, locale)
      File "C:\path1\wikidict\wikidict\parse.py", line 57, in xml_parse_element
        if all(section not in code for section in head_sections[locale]):
    KeyError: 'el'
    
    

    This is the whole file:

    """Greek language."""
    from typing import Dict, Tuple
    
    # Regex to find the pronunciation
    # {{ΔΦΑ|tɾeˈlos|γλ=el}}
    # {{ΔΦΑ|γλ=el|ˈni.xta}}
    pronunciation = r"{ΔΦΑ\|γλ=el\|/([^/]+)/"
    # Regex to find the gender
    # '''{{PAGENAME}}''' {{θ}}
    # '''{{PAGENAME}}''' {{ο}}
    # '''{{PAGENAME}}''' {{α}}
    gender = r"'''{{PAGENAME}}''' ([θαο])"
    
    # Float number separator
    float_separator = ","
    
    # Thousands separator
    thousands_separator = " "
    
    # Markers for sections that contain interesting text to analyse.
    head_sections = ("{{-el-}}",)
    etyl_section = ["{{ετυμολογία}}"]
    sections = (
        *head_sections,
        *etyl_section,
        "{{ουσιαστικό}}",
        "{{ρήμα}}",
        "{{επίθετο}}",
        "{{επίρρημα}}",
        "{{σύνδεσμος}}",
        "{{συντομομορφή}}",
        "{{κύριο όνομα}}",
        "{{αριθμητικό}}",
        "{{άρθρο}}",
        "{{μετοχή}}",
        "{{μόριο}}",
        "{{αντωνυμία}}",
        "{{επιφώνημα}}",
        "{{ρηματική έκφραση}}",
        "{{επιρρηματική έκφραση}}",
    )
    
    # Some definitions are not good to keep (plural, gender, ... )
    definitions_to_ignore = (
        "{{μορφή ουσιαστικού",
        "{{μορφή ρήματος",
        "{{μορφή επιθέτου",
        "{{εκφράσεις",
    )
    
    # Templates to ignore: the text will be deleted.
    templates_ignored: Tuple[str, ...] = tuple()
    
    # Templates that will be completed/replaced using italic style.
    templates_italic: Dict[str, str] = {}
    
    # Templates more complex to manage.
    templates_multi: Dict[str, str] = {
        # {{Term|statistica|it}}   
        # "term": "small(term(parts[1]))",
    }
    
    # Release content on GitHub
    # https://github.com/BoboTiG/ebook-reader-dict/releases/tag/el
    release_description = """\
    Αριθμός λέξεων: {words_count}
    Εξαγωγή Wiktionary: {dump_date}
    
    Διαθέσιμα αρχεία:
    
    - [Kobo]({url_kobo}) (dicthtml-{locale}.zip)
    - [StarDict]({url_stardict}) (dict-{locale}.zip)
    - [DictFile]({url_dictfile}) (dict-{locale}.df)
    
    <sub>Aggiornato il {creation_date}</sub>
    """  # noqa
    
    # Dictionary name that will be printed below each definition
    wiktionary = "Βικιλεξικό (ɔ) {year}"
    
    
    locale:Greek 
    opened by chopinesque 47
  • [FR] Redirect conjugated verbs to their infinitive form

    As requested, it would be cool to have conjugated verbs redirect to their infinitive form instead of returning nothing.

    I already tried some things, but without success. I think we could make use of variants, but it is not clear yet how to do that.

    locale:French 
    opened by BoboTiG 31
  • Support <hiero> mediawiki extension

    • Wiktionary page: https://fr.wiktionary.org/wiki/djed

    Wikicode:

    <hiero>R11</hiero>
    

    Output:

    R11
    

    Expected:

    
    

    Model link, if any: https://www.mediawiki.org/wiki/Extension:WikiHiero https://www.mediawiki.org/wiki/Special:MyLanguage/Extension:WikiHiero/Syntax https://github.com/wikimedia/mediawiki-extensions-wikihiero/blob/366b1226891e609650b4c7f7d925b718c779517c/includes/WikiHiero.php

    opened by lasconic 26
  • [Meta] Project refactoring

    Note: the description is updated with comments and changes requested in comments.

    The goal is to rework the script module to allow more flexibility and clearly separate concerns.

    First, about the module name: script. It has been decided to change to wikidict.

    Overview

    I would like to see the module split into 4 parts (each part will be independent from the others and can be replayed & extended easily). This will also help leverage multithreading to speed up the whole process.

    1. [x] Download the data (#466)
    2. [x] Parse and store raw data (#469)
    3. [x] Render templates and store results (#469)
    4. [ ] Output to the proper eBook reader format

    I have in mind a SQLite database where raw data will be stored and updated when needed. Then, the parts will only use the data from the database. It should speed up regenerating a whole dictionary when we update a template.

    Then, each and every part will have its own CLI:

    $ python -m wikidict --download ...
    $ python -m wikidict --parse ...
    $ python -m wikidict --render ...
    $ python -m wikidict --output ...
    

    And the all-in-one operation would be:

    $ python -m wikidict --run ...
    

    Side note: we could use an entry point so that we only have to type wikidict instead of python -m wikidict.
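
The CLI proposed above could be sketched with argparse from the standard library. This is only an illustration under assumptions: the positional locale argument and the steps_to_run helper are hypothetical names, the issue leaves the remaining arguments elided, and the project may well rely on a different argument parser.

```python
import argparse

# The four parts from the overview, in execution order
STEPS = ("download", "parse", "render", "output")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="wikidict")
    # Hypothetical positional argument: the locale to work on
    parser.add_argument("locale")
    # Exactly one action must be chosen per invocation
    group = parser.add_mutually_exclusive_group(required=True)
    for step in STEPS:
        group.add_argument(f"--{step}", dest="step", action="store_const", const=step)
    group.add_argument("--run", dest="step", action="store_const", const="run")
    return parser


def steps_to_run(argv: list) -> list:
    """Return the ordered list of parts to execute for the given arguments."""
    args = build_parser().parse_args(argv)
    return list(STEPS) if args.step == "run" else [args.step]
```

With this layout, --run is simply shorthand for the four parts executed in order, which keeps each part replayable on its own.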

    Splitting get.py

    Here we are talking about parts 1 and 2.

    Part 1 is already almost fine as-is: we just need to move the code into its own submodule. We could improve the CLI by allowing the Wiktionary dump date to be passed as an argument, instead of relying on an environment variable.

    Part 2 is only a matter of parsing the big XML file and storing raw data into a SQLite database. I am thinking of using this schema:

    table: Word
    fields:
        - word: varchar(256)
        - code: text
    index on: word
    
    table: Render
    fields:
        - word_id: int
        - nature: varchar(16)
        - text: text
    foreign key: word_id (Word._rowid_)
    
    • The Word table will contain raw data from the Wiktionary.
    • The Render table will be used to store the transformed text for a given word (after it has been cleaned up and its templates processed). It allows several texts for a given word (noun 1, noun 2, verb, adjective, ...).

    We will have one database per locale, located at data/$LOCALE/$WIKIDUMP_DATE.db.
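
The schema above could be expressed with the standard sqlite3 module roughly as follows. This is a sketch, not the project's implementation: it adds an explicit id INTEGER PRIMARY KEY column to Word (in SQLite such a column is an alias for the rowid, which lets the foreign key reference it), and the index name, the in-memory database and the sample rows are placeholders.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS Word (
    id   INTEGER PRIMARY KEY,   -- alias for the rowid
    word VARCHAR(256) NOT NULL,
    code TEXT NOT NULL          -- raw wikicode
);
CREATE INDEX IF NOT EXISTS idx_word ON Word (word);

CREATE TABLE IF NOT EXISTS Render (
    word_id INTEGER NOT NULL,
    nature  VARCHAR(16) NOT NULL,
    text    TEXT NOT NULL,
    FOREIGN KEY (word_id) REFERENCES Word (id)
);
"""


def create_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a database and make sure both tables exist."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")  # enforcement is off by default
    conn.executescript(SCHEMA)
    return conn


conn = create_db()
cur = conn.execute("INSERT INTO Word (word, code) VALUES (?, ?)", ("djed", "<hiero>R11</hiero>"))
# One word may have several rendered texts, one per nature
conn.executemany(
    "INSERT INTO Render (word_id, nature, text) VALUES (?, ?, ?)",
    [(cur.lastrowid, "noun 1", "placeholder text"), (cur.lastrowid, "verb", "placeholder text")],
)
rows = conn.execute(
    "SELECT w.word, r.nature FROM Render r JOIN Word w ON w.id = r.word_id ORDER BY r.nature"
).fetchall()
```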

    At the download step, if no database exists, it will be retrieved from GitHub releases, where databases will be saved alongside dictionaries. This is a cool thing IMO: everyone will have the correct, up-to-date local database. Of course, we will have options to skip it if the local file already exists or if we would like to force the download.

    At the parse step, we will have to find a way to prevent parsing again if we run the command twice on the same Wiktionary dump. I was thinking of using the PRAGMA user_version, which would contain the Wiktionary dump date as an integer. It would be set only after the full parsing has completed successfully.
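
A minimal sketch of that idea with the standard sqlite3 module, assuming the dump date is stored as an integer such as 20210620. The helper names is_parsed and mark_parsed are hypothetical.

```python
import sqlite3


def is_parsed(conn: sqlite3.Connection, dump_date: int) -> bool:
    """Tell whether this database was already fully parsed for the given dump."""
    (version,) = conn.execute("PRAGMA user_version").fetchone()
    return version == dump_date


def mark_parsed(conn: sqlite3.Connection, dump_date: int) -> None:
    """Record the dump date; to be called only after a successful full parse."""
    # PRAGMA statements do not accept bound parameters, so the value is
    # formatted into the statement after being forced to an integer.
    conn.execute(f"PRAGMA user_version = {int(dump_date)}")


conn = sqlite3.connect(":memory:")
before = is_parsed(conn, 20210620)  # a fresh database has user_version 0
mark_parsed(conn, 20210620)
after = is_parsed(conn, 20210620)
```

Because user_version is a 32-bit signed integer stored in the database header, a date written as YYYYMMDD fits without needing an extra table.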

    Splitting convert.py

    Here we are talking about parts 3 and 4.

    Part 3 will call clean() and process_templates() on the wikicode, and store the result into the rendered field. This is the most time- and CPU-consuming part, so it will be parallelized.

    Part 4 will rethink how we are handling dictionary output to easily add more formats.

    I was thinking of using a class with those methods (not really thought about it, I am just proposing the idea):

    from pathlib import Path


    class BaseFormat:
        """Base class that every output format derives from."""

        __slots__ = ("locale", "output_dir")

        def __init__(self, locale: str, output_dir: Path) -> None:
            self.locale = locale
            self.output_dir = output_dir

        def process(self, words) -> None:
            raise NotImplementedError()

        def save(self, wordlist, groups, variants) -> None:
            raise NotImplementedError()


    class KoboFormat(BaseFormat):
        def process(self, words) -> None:
            # make_groups(), make_variants() and process_word() remain to be written
            groups = self.make_groups(words)
            variants = self.make_variants(words)

            wordlist = [self.process_word(word) for word in words]

            self.save(wordlist, groups, variants)

        def save(self, wordlist, groups, variants) -> None:
            ...
    

    That part is far from finished, but once we have a fully working format, we will use this kind of code to generate the dictionary file:

    from multiprocessing import Pool

    # Get all registered formats
    formaters = get_formaters()

    # Get all words from the database
    words = get_words()

    # And distribute the workload: one process per format
    def run_formater(cls):
        formater = cls(locale, output_dir)
        formater.process(words)

    with Pool(len(formaters)) as pool:
        pool.map(run_formater, formaters)
    
    opened by BoboTiG 26
  • Use a custom docker image for tests

    For each PR test job, most of the time is spent installing LaTeX. For instance, the installation takes about 2m40s, against 30s to run the tests.

    Maybe we should investigate creating a custom Docker image with LaTeX preinstalled. If so, I would favor a light Debian-based distribution, but I am open to any distribution as long as tests pass as-is (i.e. without modifying the source code).

    QA/CI 
    opened by BoboTiG 22
  • [EN] Discover unhandled templates

    I added some code at the end of the English last_template_handler in order to log the templates that are rendered by default. To limit the number of templates, I print only templates with more than 2 parts and with data, especially if nocat is not the only data.
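
The filter described above might look roughly like the following. This is not the code from the gist, only a sketch under assumptions: parts is taken to be the template name followed by its positional arguments, data the mapping of its named arguments, and note_unhandled a hypothetical helper called just before falling back to default rendering.

```python
from collections import Counter

# Number of hits per template name that fell through to default rendering
unhandled = Counter()


def note_unhandled(parts, data) -> bool:
    """Record a template rendered by default; return True when it was recorded."""
    # Keep only templates carrying named data other than "nocat"
    has_data = any(key != "nocat" for key in data)
    if len(parts) > 2 and has_data:
        unhandled[parts[0]] += 1
        return True
    return False


first = note_unhandled(["foo", "a", "b"], {"nocat": "1"})    # only nocat: skipped
second = note_unhandled(["foo", "a", "b"], {"lang": "en"})   # recorded
third = note_unhandled(["bar", "a"], {"lang": "en"})         # too few parts: skipped
```

Sorting the counter with unhandled.most_common() then gives the templates whose support would have the largest impact.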

    The code and the results are available here: https://gist.github.com/lasconic/139942e3761200eaa62e0a3a9be3d4f6

    • The first file is the code.
    • The second file lists each template name with its number of hits, which gives a sense of the impact of supporting that template.
    • The third file is the full list, convenient for finding one or more examples of a template used on Wiktionary.

    I discovered a couple of templates that should be ignored (https://github.com/BoboTiG/ebook-reader-dict/issues/395) and many others that need to be implemented...

    I was not sure where to put this, so I opened an issue. Please let me know if it's not the right place.

    locale:English 
    opened by lasconic 22
  • utils: <math> formulas rendered to SVGs without using LaTeX tools

    Fixes #1427. Fixes #1198. Closes #1209.

    Tests to pass before merging (the rendering is good, but not the display):

    • [x] $ python -m wikidict fr --gen-dict "cercle unité" --output issue-1427
    • [x] $ python -m wikidict en --gen-dict "Wallis product,primitive recursion,Horner's rule" --output issue-1427
    opened by BoboTiG 21
  • Rendering errors (<chem> and <math>)

    Note from @BoboTiG: issue tightly coupled to #1183, interesting details can be found there too.


    I did a fresh download and render of the EN wiktionary today, and got the following errors:

    >>> Loading data/en/data_wikicode-20220120.json ...
    >>> Loaded 1,038,672 words from data/en/data_wikicode-20220120.json
    <chem> ERROR with ^-N=\overset{+}N=N^- in [azide]
    <math> ERROR with \begin{align}\frac{\pi}{2} & = \prod_{n=1}^{\infty} \frac{ 4n^2 }{ 4n^2 - 1 } = \prod_{n=1}^{\infty} \left(\frac{2n}{2n-1} \cdot \frac{2n}{2n+1}\right) \\[6pt]& = \Big(\frac{2}{1} \cdot \frac{2}{3}\Big) \cdot \Big(\frac{4}{3} \cdot \frac{4}{5}\Big) \cdot \Big(\frac{6}{5} \cdot \frac{6}{7}\Big) \cdot \Big(\frac{8}{7} \cdot \frac{8}{9}\Big) \cdot \; \cdots \\\end{align} in [Wallis product]
    <math> ERROR with \begin{align}a_0 &+ a_1x + a_2x^2 + a_3x^3 + \cdots + a_nx^n \\ &= a_0 + x \bigg(a_1 + x \Big(a_2 + x \big(a_3 + \cdots + x(a_{n-1} + x \, a_n) \cdots \big) \Big) \bigg).\end{align} in [Horner's rule]
    <math> ERROR with \frac = \frac in [circle of Apollonius]
    <math> ERROR with \begin{align}\rho(g, h) (0,x_1,\ldots,x_k) &= g(x_1,\ldots,x_k) \\\rho(g, h) (y+1,x_1,\ldots,x_k) &= h(y,\rho(g, h) (y,x_1,\ldots,x_k),x_1,\ldots,x_k)\,\end{align} in [primitive recursion]
    >>> Saved 697,169 words into data/en/data-20220120.json
    >>> Render done!
    
    bug 
    opened by Moonbase59 19
  • [FR] Handle "équiv-pour" additional arguments

    • Wiktionary page: https://fr.wiktionary.org/wiki/chercheureuse

    Wikicode:

    {{équiv-pour|une femme|chercheuse|chercheure|langue=fr|2egenre=un homme|2egenre1=chercheur}}
    

    Output:

    <i>(pour une femme, on peut dire</i>&nbsp: chercheuse, chercheure<i>)</i>
    

    Expected:

    <i>(pour une femme, on peut dire</i>&nbsp: chercheuse, chercheure<i>&nbsp; <i>pour un homme, on dit<i>&nbsp: chercheur<i>)</i>
    

    Model link, if any: https://fr.wiktionary.org/wiki/Mod%C3%A8le:%C3%A9quiv-pour

    locale:French 
    opened by BoboTiG 0
  • [FR] Add "siècle2" HTML filter

    • Wiktionary page: https://fr.wiktionary.org/wiki/t%C5%8D-on
    • Model link, if any: https://fr.wiktionary.org/wiki/Mod%C3%A8le:si%C3%A8cle2
    $ python -m wikidict fr --check-word "tō-on"
    
    locale:French 
    opened by BoboTiG 0
  • [FR] Adapt "composé de" output

    • Wiktionary page: https://fr.wiktionary.org/wiki/hexavalent

    Wikicode:

    {{composé de|lang=fr|hexa-|-valent|m=1}}
    

    Output:

    Composé de <i>hexa-</i> et de <i>-valent</i>
    

    Expected:

    Dérivé du préfix <i>hexa-</i>, avec le suffixe <i>-valent</i>
    

    Model link, if any: https://fr.wiktionary.org/wiki/Mod%C3%A8le:compos%C3%A9_de

    locale:French 
    opened by BoboTiG 2
  • [CA] Improve 'etim-lang' support

    • Wiktionary page: https://ca.wiktionary.org/wiki/feocromocitoma

    Wikicode:

    {{etim-lang|grc|ca|φαιός|trad=gris}}
    

    Output:

    Del grec antic <i>φαιός</i> («gris»)
    

    Expected:

    Del grec antic <i>φαιός</i> (<i>phaiós</i>, «gris»)
    

    Model link, if any: https://ca.wiktionary.org/wiki/Plantilla:etim-lang

    locale:Catalan 
    opened by BoboTiG 1
  • [EL] missing αγγειοχειρουργός

    • Wiktionary page: https://el.wiktionary.org/w/index.php?title=%CE%B1%CE%B3%CE%B3%CE%B5%CE%B9%CE%BF%CF%87%CE%B5%CE%B9%CF%81%CE%BF%CF%85%CF%81%CE%B3%CF%8C%CF%82&action=edit

    Wikicode:

    '''{{PAGENAME}}''' {{αθ}}
    * {{ετ|ιατρική}} ο [[χειρουργός]] που ειδικεύεται στην αποκατάσταση βλαβών στα αιμοφόρα [[αγγείο|αγγεία]]
    *: {{μορφ}} [[αγγειοχειρούργος]]
    

    Output:

    αγγειοχειρουργός el '<i>αρσενικό ή θηλυκό</i>.'
    
    '<b>αγγειοχειρουργός</b> < <i>(Π)</i> + χειρουργός'
    

    Expected:

    αγγειοχειρουργός αρσενικό ή θηλυκό
    (ιατρική) ο χειρουργός που ειδικεύεται στην αποκατάσταση βλαβών στα αιμοφόρα αγγεία
    άλλες μορφές: αγγειοχειρούργος
    

    Model link, if any:

    I guess the {{μορφ}} template can be resolved via

        if tpl == "μορφ":
            phrase = "άλλες μορφές"
            if not data["0"]:
                phrase += ":"
            return phrase
    

    Not sure how to resolve the other issues, or whether one should expect pronunciation data to be included too.

    locale:Greek 
    opened by chopinesque 24
Owner
Mickaël Schoentgen
Software Engineer. Creator of Python module MSS, FOSS contributor. Maintainer of watchdog, and MARISA Trie.