Train BPE with fastBPE, and load to Huggingface Tokenizer.

Last update: Dec 23, 2021

BPEer

Description

Huggingface's BpeTrainer consumes a lot of memory on a large corpus (e.g. 50000 merges on a 20GB corpus), and I ran into a memory error.

So I use fastBPE (implemented in C++) instead, which outputs a list of merge operations.

However, I still want to use the Huggingface Tokenizer API, so I wrote a simple converter that generates the JSON file for the Huggingface Tokenizer.
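The conversion idea can be sketched in a few lines. This is a minimal illustration, not the actual convertjs.py: it assumes each line of the fastBPE output has the form `left right count` and that `</w>` marks the end of a word. The function names, the `<unk>` special token, and the id ordering are hypothetical choices made for this example.

```python
import json


def parse_fastbpe_codes(lines):
    """Turn 'left right count' lines into an ordered list of merge pairs."""
    merges = []
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        left, right, _count = parts
        merges.append((left, right))
    return merges


def build_vocab(merges, special_tokens=("<unk>",)):
    """Assign ids in first-seen order: special tokens, then merge symbols."""
    vocab = {}
    for token in special_tokens:
        vocab.setdefault(token, len(vocab))
    for left, right in merges:
        # Both inputs and the merged symbol must exist in the vocabulary.
        for token in (left, right, left + right):
            vocab.setdefault(token, len(vocab))
    return vocab


def to_bpe_model_dict(merges, vocab):
    """Bundle vocab and merges as plain data that can be dumped to JSON."""
    return {"vocab": vocab, "merges": [f"{l} {r}" for l, r in merges]}


# Tiny in-memory sample standing in for a real `allvocab` file (assumed format).
sample_codes = ["t h 100", "th e</w> 50", ""]
merges = parse_fastbpe_codes(sample_codes)
vocab = build_vocab(merges)
model_json = json.dumps(to_bpe_model_dict(merges, vocab), ensure_ascii=False)
```

With the `tokenizers` package installed, `vocab` and `merges` could then be passed to `tokenizers.models.BPE(vocab=vocab, merges=merges, unk_token="<unk>")` and the wrapping `Tokenizer` saved to a JSON file. Note that this sketch only adds symbols that take part in a merge, so single characters that never merge would still need to be added to the vocabulary.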

Usage

Train BPE:

    cd fastBPE
    ./fast learnbpe [merges, e.g. 50000] [train.txt] > allvocab

Convert to json:

    python convertjs.py

Warning

This tokenizer does not mark where a word starts.

E.g. the BPE results for "I am" and "Iam" may be identical. Split the sentence on spaces before tokenizing, and mark the first subword of each word:

    tokens = []
    for word in "I am".split():
        subs = tokenizer.tokenize(word)
        # Prepend a space to the first subword to mark the start of the word.
        subs[0] = " " + subs[0]
        tokens.extend(subs)

This results in [" I", "am"] for "Iam" and [" I", " am"] for "I am".
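The effect can be reproduced without a trained model. The sketch below wraps the loop in a helper and uses `toy_tokenize`, a hypothetical greedy stand-in for `tokenizer.tokenize` with a two-entry vocabulary; both function names and the space marker are assumptions made for illustration.

```python
def toy_tokenize(word, vocab=("I", "am")):
    """Hypothetical stand-in for tokenizer.tokenize: greedy longest match."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # No vocabulary entry matched: fall back to a single character.
            pieces.append(word[i])
            i += 1
    return pieces


def tokenize_with_word_starts(text, tokenize_word, marker=" "):
    """Split on whitespace, tokenize each word, and mark its first subword."""
    tokens = []
    for word in text.split():
        subs = tokenize_word(word)
        if subs:
            subs[0] = marker + subs[0]
        tokens.extend(subs)
    return tokens


joined = tokenize_with_word_starts("Iam", toy_tokenize)
spaced = tokenize_with_word_starts("I am", toy_tokenize)
print(joined, spaced)
```

Without the marker, both inputs reduce to the same subword sequence; with it, only the subwords that begin a word carry the leading space.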

Owner

Lizhuo

A dual personality of antinomies
