BPE algorithm that can fine-tune a tokenizer

Last update: Feb 02, 2022

Overview

bpe_algorithm_can_finetune_tokenizer

This is an implementation for https://github.com/huggingface/transformers/issues/15153

I added only a few dozen lines of code to the py_bpe algorithm. The main addition is the function finetune_tokenizer.

Details can be seen in example.py; it is actually very simple. The official Python tokenizer library is written in Rust. I am learning it and hope to provide a Rust version of this code.

PS: the_factor_of_new_added_token_divided_unk_number is the only parameter you need to set. I hope to find an algorithm that sets it automatically.
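The idea can be illustrated with a minimal sketch. It is not the code from this repository: the helper names (count_unk, most_frequent_pair, merge_pair, finetune_vocab_sketch) are hypothetical, and it assumes, going only by the parameter's name, that the factor is the ratio of newly added tokens to the number of unknown symbols found in the new corpus. The merge loop itself is the ordinary BPE procedure of repeatedly merging the most frequent adjacent pair.

```python
from collections import Counter


def count_unk(corpus_words, vocab):
    """Count character occurrences that the existing vocabulary does not cover."""
    return sum(1 for word in corpus_words for ch in word if ch not in vocab)


def most_frequent_pair(words):
    """Return the most frequent adjacent symbol pair, or None if no pair is left."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def finetune_vocab_sketch(corpus_words, vocab, factor):
    """Learn up to int(factor * unk_count) new tokens with plain BPE merges.

    ASSUMPTION: `factor` plays the role of
    the_factor_of_new_added_token_divided_unk_number; the real
    finetune_tokenizer in example.py may work differently.
    """
    budget = int(factor * count_unk(corpus_words, vocab))
    words = dict(Counter(tuple(word) for word in corpus_words))
    new_tokens = []
    while len(new_tokens) < budget:
        pair = most_frequent_pair(words)
        if pair is None:  # every word is already a single symbol
            break
        words = merge_pair(words, pair)
        token = pair[0] + pair[1]
        if token not in vocab and token not in new_tokens:
            new_tokens.append(token)
    return new_tokens


# Hypothetical usage: the old vocabulary does not know the character "é".
old_vocab = set("helo")
new_corpus = ["héllo", "héllo", "héllo", "hé"]
added = finetune_vocab_sketch(new_corpus, old_vocab, factor=0.5)
```

Under this reading, a larger factor adds more tokens per unknown symbol, which shortens tokenized sequences on the new corpus at the cost of a larger vocabulary.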

Owner

张博

I am a Chinese coder who shares code and notes for machine learning and math books.

GitHub Repository

Yet Another Compiler Visualizer

yacv: Yet Another Compiler Visualizer yacv is a tool for visualizing various aspects of typical LL(1) and LR parsers. Check out demo on YouTube to see

129 Dec 17, 2022

A Lightweight NLP Data Loader for All Deep Learning Frameworks in Python

LineFlow: Framework-Agnostic NLP Data Loader in Python LineFlow is a simple text dataset loader for NLP deep learning tasks. LineFlow was designed to

177 Jan 04, 2023

Code for "Generative adversarial networks for reconstructing natural images from brain activity".

Reconstruct handwritten characters from brains using GANs Example code for the paper "Generative adversarial networks for reconstructing natural image

2 May 17, 2022

Awesome-NLP-Research (ANLP)

72 Dec 19, 2022

Pervasive Attention: 2D Convolutional Networks for Sequence-to-Sequence Prediction

This is a fork of Fairseq(-py) with implementations of the following models: Pervasive Attention - 2D Convolutional Neural Networks for Sequence-to-Se

490 Dec 15, 2022

Code release for "COTR: Correspondence Transformer for Matching Across Images"

COTR: Correspondence Transformer for Matching Across Images This repository contains the inference code for COTR. We plan to release the training code

358 Dec 24, 2022

Predict an emoji that is associated with a text

Sentiment Analysis Sentiment analysis in computational linguistics is a general term for techniques that quantify sentiment or mood in a text. Can you

30 Sep 07, 2022

(ACL 2022) The source code for the paper "Towards Abstractive Grounded Summarization of Podcast Transcripts"

Towards Abstractive Grounded Summarization of Podcast Transcripts We provide the source code for the paper "Towards Abstractive Grounded Summarization

10 Jul 01, 2022

This github repo is for Neurips 2021 paper, NORESQA A Framework for Speech Quality Assessment using Non-Matching References.

NORESQA: Speech Quality Assessment using Non-Matching References This is a Pytorch implementation for using NORESQA. It contains minimal code to predi

36 Dec 08, 2022

Phrase-Based & Neural Unsupervised Machine Translation

Unsupervised Machine Translation This repository contains the original implementation of the unsupervised PBSMT and NMT models presented in Phrase-Bas

1.5k Dec 28, 2022

nlp-tutorial is a tutorial for who is studying NLP(Natural Language Processing) using Pytorch

nlp-tutorial is a tutorial for who is studying NLP(Natural Language Processing) using Pytorch. Most of the models in NLP were implemented with less than 100 lines of code.(except comments or blank li

11.9k Jan 08, 2023

Simple, Pythonic, text processing--Sentiment analysis, part-of-speech tagging, noun phrase extraction, translation, and more.

TextBlob: Simplified Text Processing Homepage: https://textblob.readthedocs.io/ TextBlob is a Python (2 and 3) library for processing textual data. It

8.4k Dec 26, 2022

NLP Core Library and Model Zoo based on PaddlePaddle 2.0

Neural network sequence labeling model

Understanding the Difficulty of Training Transformers

Smart discord chatbot integrated with Dialogflow

CLIPfa: Connecting Farsi Text and Images

Simple text to phones converter for multiple languages

LightSpeech: Lightweight and Fast Text to Speech with Neural Architecture Search

Unofficial implementation of Google's FNet: Mixing Tokens with Fourier Transforms