mkultra

mkultra is a prompt tuning toolkit for GPT-2 and GPT-Neo.

Prompt tuning injects a sequence of 20-100 special tokens into the context to influence text generation. These tokens are trained on a corpus much like a finetune, but take up a fraction of the space: the Neuromancer example is only 401 KB for 100 tokens.
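
As a rough check on the size claim, the arithmetic can be sketched as below. The numbers are assumptions, not values read from mkultra: 768 is the embedding width of the base gpt2 checkpoint, storage is taken to be float32, and 124 million is the approximate parameter count of base GPT-2. The base64 line only illustrates what a text encoding would add; it is not a description of the actual file format.

```python
# Back-of-envelope size of a soft prompt versus a full GPT-2 finetune.
N_TOKENS = 100             # soft prompt length, as in the Neuromancer example
HIDDEN_SIZE = 768          # embedding width of the base "gpt2" checkpoint
BYTES_PER_FLOAT = 4        # assumption: float32 storage
GPT2_PARAMS = 124_000_000  # approximate parameter count of base GPT-2


def soft_prompt_bytes(n_tokens, hidden_size, bytes_per_float=BYTES_PER_FLOAT):
    """Raw size of one embedding vector per soft token."""
    return n_tokens * hidden_size * bytes_per_float


raw = soft_prompt_bytes(N_TOKENS, HIDDEN_SIZE)

# Assumption: if the vectors were stored as base64 text, every 3 bytes
# would become 4 characters (rounded up to a whole group).
base64_size = -(-raw // 3) * 4

full_model = GPT2_PARAMS * BYTES_PER_FLOAT

print(f"raw soft prompt: {raw / 1024:.0f} KiB")
print(f"base64 encoded:  {base64_size / 1024:.0f} KiB")
print(f"full model:      {full_model / 1024 ** 2:.0f} MiB")
```

Under these assumptions the soft prompt is over a thousand times smaller than a full set of finetuned weights, which is the point of the comparison above.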

Read the original paper: https://arxiv.org/abs/2104.08691

Text Generation

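from transformers import pipeline
# Module paths below are assumed from the mkultra package layout; adjust if they differ.
from mkultra.inference import GPT2SoftPromptLM
from mkultra.tokenizers import GPT2SPTokenizerFast
from mkultra.soft_prompt import SoftPrompt

model = GPT2SoftPromptLM.from_pretrained("gpt2")
tokenizer = GPT2SPTokenizerFast.from_pretrained("gpt2")
generator = pipeline('text-generation', model=model, tokenizer=tokenizer)

# Load a trained soft prompt and prepend it to the text like a string.
sp = SoftPrompt.from_file("sample_sps/finetune/neuromancer_gpt2.json")
prompt = sp + "The sky over the port"
output = generator(prompt)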

SoftPrompts can be concatenated at any point into your context as if they were strings. When the context is printed, SoftPrompts show up as human-readable tags for debugging. They also tokenize to the underlying number of tokens for easy budgeting.
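
The behaviour described above can be illustrated with a toy stand-in. This is a hypothetical sketch, not mkultra's SoftPrompt class: ToySoftPrompt, the tag format and count_tokens are all invented for illustration, and plain text is counted by whitespace-separated words rather than by a real tokenizer.

```python
import re

# Hypothetical tag format: <name:n_tokens>
TAG = re.compile(r"<([\w-]+):(\d+)>")


class ToySoftPrompt:
    """Toy stand-in that behaves like a string when concatenated."""

    def __init__(self, name, n_tokens):
        self.name = name
        self.n_tokens = n_tokens

    def __str__(self):
        # Human-readable tag that shows up when the context is printed.
        return f"<{self.name}:{self.n_tokens}>"

    def __add__(self, other):
        # sp + "text"
        return str(self) + str(other)

    def __radd__(self, other):
        # "text" + sp
        return str(other) + str(self)


def count_tokens(context):
    """Budget a context: each tag costs its stated number of tokens."""
    soft = sum(int(n) for _, n in TAG.findall(context))
    plain = len(TAG.sub(" ", context).split())  # crude word count
    return soft + plain


sp = ToySoftPrompt("neuromancer", 100)
context = sp + "The sky over the port"
print(context)
print(count_tokens(context))
```

The tag is a single short piece of text, but it is budgeted as the full 100 tokens it stands for, which is what makes it easy to see how much context remains.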

See the text generation notebook for pointers on adding mkultra to your generator.

Training

For finetune-like soft prompts, the finetune notebook demonstrates training on a corpus.

For AI text adventures or writing, the World Info notebook demonstrates tuning a soft prompt to describe a character or setting. This is highly experimental.

Limitations (for now)

The Huggingface Trainer class should work as long as you set params=[model.get_soft_params()] on the optimizer, but it will still save full model checkpoints.
mkultra syncs a set of special tokens between its tokenizers behind the scenes. Adding your own tokens may result in unexpected behaviour.
