Fake Shakespearean Text Generator

Overview

Fake Shakespearean Text Generator

This project contains an impelementation of stateful Char-RNN model to generate fake shakespearean texts.

Files and folders of the project.

models folder

This folder contains to zip file, one for stateful model and the other for stateless model (this model files are fully saved model architectures,not just weights).

weights.zip

As you its name implies, this zip file contains the model's weights as checkpoint format (see tensorflow model save formats).

tokenizer.save

This file is an saved and trained (sure on the dataset) instance of Tensorflow Tokenizer (used at inference time).

shakespeare.txt

This file is the dataset and composed of regular texts (see below what does it look like).

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

train.py

Contains codes for training.

inference.py

Contains codes for inference.

How to Train the Model

A more depth look into train.py file


First, it gets the dataset from the specified url (line 11). Then reads the dataset to train the tokenizer object just mentioned above and trains the tokenizer (line 18). After training, encodes the dataset (line 24). Since this is a stateful model, all sequences in batch should be start where the sequences at the same index number in the previous batch left off. Let's say a batch composes of 32 sequences. The 33th sequence (i.e. the first sequence in the second batch) should exactly start where the 1st sequence (i.e. first sequence in the first batch) ended up. The second sequence in the 2nd batch should start where 2nd sequnce in first batch ended up and so on. Codes between line 28 and line 48 do this and result the dataset. Codes between line 53 and line 57 create the stateful model. Note that to be able to adjust recurrent_dropout hyperparameter you have to train the model on a GPU. After creation of model, a callback to reset states at the beginning of each epoch is created. Then the training start with the calling fit method and then model (see tensorflow' entire model save), model's weights and the tokenizer is saved.

Usage of the Model

Where the magic happens (inference.py file)


To be able use the model, it should first converted to a stateless model due to a stateful model expects a batch of inputs instead of just an input. To do this a stateless model with the same architecture of stateful model should be created. Codes between line 44 and line 49 do this. To load weights the model should be builded. After building weight are loaded to the stateless model. This model uses predicted character at time step t as an inputs at time t + 1 to predict character at t + 2 and this operation keep goes until the prediction of last character (in this case it 100 but you can change it whatever you want. Note that the longer sequences end up with more inaccurate results). To predict the next characters, first the provided initial character should be tokenized. preprocess function does this. To prevent repeated characters to be shown in the generated text, the next character should be selected from candidate characters randomly. The next_char function does this. The randomness can be controlled with temperature parameter (to learn usage of it check the comment at line 30). The complete_text function, takes a character as an argument, predicts the next character via next_char function and concatenates the predicted character to the text. It repeats the process until to reach n_chars. Last, the stateless model will be saved also.

Results

Effects of the magic


print(complete_text("a"))

arpet:
like revenge borning and vinged him not.

lady good:
then to know to creat it; his best,--lord


print(complete_text("k"))

ken countents.
we are for free!

first man:
his honour'd in the days ere in any since
and all this ma


print(complete_text("f"))

ford:
hold! we must percy and he was were good.

gabes:
by fair lord, my courters,
sir.

nurse:
well


print(complete_text("h"))

holdred?
what she pass myself in some a queen
and fair little heartom in this trumpet our hands?
the

Owner
Recep YILDIRIM
Software Imagineering
Recep YILDIRIM
Phrase-BERT: Improved Phrase Embeddings from BERT with an Application to Corpus Exploration

Phrase-BERT: Improved Phrase Embeddings from BERT with an Application to Corpus Exploration This is the official repository for the EMNLP 2021 long pa

70 Dec 11, 2022
The aim of this task is to predict someone's English proficiency based on a text input.

English_proficiency_prediction_NLP The aim of this task is to predict someone's English proficiency based on a text input. Using the The NICT JLE Corp

1 Dec 13, 2021
A script that automatically creates a branch name using google translation api and jira api

About google translation api와 jira api을 사용하여 자동으로 브랜치 이름을 만들어주는 스크립트 Setup 환경변수에 다음 3가지를 등록해야 한다. JIRA_USER : JIRA email (ex: hyunwook.kim 2 Dec 20, 2021

A complete NLP guideline for enthusiasts

NLP-NINJA A complete guide for Natural Language Processing in Python Table of Contents S.No. Topic Level Meaning 1 Tokenization 🤍 Beginner 2 Stemming

MAINAK CHAUDHURI 22 Dec 27, 2022
This repository contains the code for "Generating Datasets with Pretrained Language Models".

Datasets from Instructions (DINO 🦕 ) This repository contains the code for Generating Datasets with Pretrained Language Models. The paper introduces

Timo Schick 154 Jan 01, 2023
Sorce code and datasets for "K-BERT: Enabling Language Representation with Knowledge Graph",

K-BERT Sorce code and datasets for "K-BERT: Enabling Language Representation with Knowledge Graph", which is implemented based on the UER framework. R

Weijie Liu 834 Jan 09, 2023
Enterprise Scale NLP with Hugging Face & SageMaker Workshop series

Workshop: Enterprise-Scale NLP with Hugging Face & Amazon SageMaker Earlier this year we announced a strategic collaboration with Amazon to make it ea

Philipp Schmid 161 Dec 16, 2022
Snowball compiler and stemming algorithms

Snowball is a small string processing language for creating stemming algorithms for use in Information Retrieval, plus a collection of stemming algori

Snowball Stemming language and algorithms 613 Jan 07, 2023
Community and sentiment analysis based on tweets

The project has set itself the goal of analyzing the thoughts and interaction of Italian users through the social posts expressed through the Twitter platform on the day of the entry into force of th

3 Nov 17, 2022
Russian GPT3 models.

Russian GPT-3 models (ruGPT3XL, ruGPT3Large, ruGPT3Medium, ruGPT3Small) trained with 2048 sequence length with sparse and dense attention blocks. We also provide Russian GPT-2 large model (ruGPT2Larg

Sberbank AI 1.6k Jan 05, 2023
Bot to connect a real Telegram user, simulating responses with OpenAI's davinci GPT-3 model.

AI-BOT Bot to connect a real Telegram user, simulating responses with OpenAI's davinci GPT-3 model.

Thempra 2 Dec 21, 2022
Sentiment Classification using WSD, Maximum Entropy & Naive Bayes Classifiers

Sentiment Classification using WSD, Maximum Entropy & Naive Bayes Classifiers

Pulkit Kathuria 173 Jan 04, 2023
Hierarchical unsupervised and semi-supervised topic models for sparse count data with CorEx

Anchored CorEx: Hierarchical Topic Modeling with Minimal Domain Knowledge Correlation Explanation (CorEx) is a topic model that yields rich topics tha

Greg Ver Steeg 592 Dec 18, 2022
Python library for Serbian Natural language processing (NLP)

SrbAI - Python biblioteka za procesiranje srpskog jezika SrbAI je projekat prikupljanja algoritama i modela za procesiranje srpskog jezika u jedinstve

Serbian AI Society 3 Nov 22, 2022
💥 Fast State-of-the-Art Tokenizers optimized for Research and Production

Provides an implementation of today's most used tokenizers, with a focus on performance and versatility. Main features: Train new vocabularies and tok

Hugging Face 6.2k Dec 31, 2022
Sentence Embeddings with BERT & XLNet

Sentence Transformers: Multilingual Sentence Embeddings using BERT / RoBERTa / XLM-RoBERTa & Co. with PyTorch This framework provides an easy method t

Ubiquitous Knowledge Processing Lab 9.1k Jan 02, 2023
Residual2Vec: Debiasing graph embedding using random graphs

Residual2Vec: Debiasing graph embedding using random graphs This repository contains the code for S. Kojaku, J. Yoon, I. Constantino, and Y.-Y. Ahn, R

SADAMORI KOJAKU 5 Oct 12, 2022
Azure Text-to-speech service for Home Assistant

Azure Text-to-speech service for Home Assistant The Azure text-to-speech platform uses online Azure Text-to-Speech cognitive service to read a text wi

Yassine Selmi 2 Aug 06, 2022
Data preprocessing rosetta parser for python

datapreprocessing_rosetta_parser I've never done any NLP or text data processing before, so I wanted to use this hackathon as a learning opportunity,

ASReview hackathon for Follow the Money 2 Nov 28, 2021
Natural language processing summarizer using 3 state of the art Transformer models: BERT, GPT2, and T5

NLP-Summarizer Natural language processing summarizer using 3 state of the art Transformer models: BERT, GPT2, and T5 This project aimed to provide in

Samuel Sharkey 1 Feb 07, 2022