Yet Another Sequence Encoder - Encode sequences to vector of vector in python !

Last update: Aug 19, 2021

Overview

Yase

Yet Another Sequence Encoder - encode sequences to vector of vectors in python !

Why Yase ?

Yase enable you to encode any sequence which can be represented by string to be encoded into a list of word-vector representation.

When searching over a tool to encode a sentence as a list of word-vector, it was clear that there was no simple tool to use. And so, i decided to create Yase.

Note : If you only want to get the word-vector of a word, or average of word-vector in sentence, you should probably better check Spacy.

Requirements

Yase requirements are :

numpy
tqdm

Mapping file

The mapping should be a columnar file like :

<token> <vector value>
token1 0.1 0.6 -1.2
token2 0.6 -2.3 3.4

All data should be separated by space, thus no space is allowed in token. You should be able to directly use Facebook Fast Text pretrained word vector as mapping.

Input file

Input file should be a list of text, with one sample per line.

hello world
Yase is awesome !

The default separator is a space " " but any regular expression can be provided.

Note that Yase is case insensitive

How to use

yase is command line tool. You can install by with :pip install git+https://github.com/PPACI/yase.git

>> yase
usage: yase [-h] --input input.txt [--input-encoding UTF8] --output
               output.txt --mapping mapping.vec [--mapping-encoding UTF8]
               [--separator \ |\.|\,] [--no-replace]
               [--cleaning-json cleaning.json]

Yet Another Sequence Encoder

optional arguments:
  -h, --help            show this help message and exit
  --input input.txt     Path to file to encode
  --input-encoding UTF8
                        encoding of input file. UTF8 by default
  --output output.txt   Path to output file
  --mapping mapping.vec
                        Path to mapping file
  --mapping-encoding UTF8
                        encoding of mapping file. UTF8 by default
  --separator \ |\.|\,  regular expression used to split the input sequence
  --no-replace          don't clean input data
  --cleaning-json cleaning.json
                        Path to your own json replacement file for cleaning.
                        Will use the included replacement file otherwise.

If you wanted to use the english word vector for an input file like previously described :

yase --input "input.txt" --output "output.csv" --mapping "wiki.en.vec"

Output format

The idea behind yase is to be as easy as possible to integrate it in all data science processing.

Yase output it's your data as CSV.

The only problem with CSV is that it's difficult to integrate multi-dimensional array. So we had to find a compromise..

Yase encode the vector columns in JSON format, which is easily readable and is very similar to python array representation.

The output file will be similar to :

inputs	vectors
hello world	[[1,1,1],[2,2,2]]
yase is awesome !	[[3,3,3],[4,4,4]]

Cleaning

Yase will automatically try to clean your input file by applying regex in the right order.

For example : Hello I'm yase.Nice to meet you will magically become Hello I m yase . Nice to meet you.

Remember that yase is case insensitive. So yase will understand as hello i m yase . nice to meet you.

Lastly, if your mapping doesn't include a mapping for ".", you will obtain vectors for hello i m yase nice to meet you

Of course, you can disable this behaviour by providing --no-replace argument.

Providing your own replacement file

You can do this by providing a path to your file with --cleaning-json.

The replacement file is a json like :

{
  "\"": "",
  "'": "",
  ",": " , ",
  "\\.": " . ",
  "  ": " "
}

Input are regex, so remind to escape . or *.

Note that replacement are made in the same order as in the json. So here, the first replacement will be to remove "

How to load a yase output ?

As said previously, the choice made with Yase make it possible to use it as simply as :

import pandas, json

csv = pandas.read_csv("output.csv")
csv.vectors = csv.vectors.apply(json.loads)

csv.head()

Note that Pandas is not mandatory but very recommended for data science.

TODO

Optimize Mapping loading time
Optional argument to output fixed size vectors for all input sequences
Surely lot of thing !

Can i contribute ?

Off course ! If you want to improve Yase, your idea / pull requests / issues are welcomed !

Yet Another Sequence Encoder - Encode sequences to vector of vector in python !

Related tags

Overview

Yase

Why Yase ?

Requirements

Mapping file

Input file

How to use

Output format

Cleaning

Providing your own replacement file

How to load a yase output ?

TODO

Can i contribute ?

Owner

Pierre PACI

Code for the paper "Are Sixteen Heads Really Better than One?"

Concept Modeling: Topic Modeling on Images and Text

Treemap visualisation of Maya scene files

GPT-2 Model for Leetcode Questions in python

Material for GW4SHM workshop, 16/03/2022.

Samantha, A covid-19 information bot which will provide basic information about this pandemic in form of conversation.

SAINT PyTorch implementation

Text-Summarization-using-NLP - Text Summarization using NLP to fetch BBC News Article and summarize its text and also it includes custom article Summarization

Official PyTorch implementation of SegFormer

translate using your voice

Twitter-NLP-Analysis - Twitter Natural Language Processing Analysis

Contract Understanding Atticus Dataset

Code for "Parallel Instance Query Network for Named Entity Recognition", accepted at ACL 2022.

The model is designed to train a single and large neural network in order to predict correct translation by reading the given sentence.

Demo programs for the Talking Head Anime from a Single Image 2: More Expressive project.

Code for the paper "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer"

Submit issues and feature requests for our API here.

GPT-3 command line interaction

Natural Language Processing Tasks and Examples.

NLPIR tutorial: pretrain for IR. pre-train on raw textual corpus, fine-tune on MS MARCO Document Ranking