The tutorial is a collection of many other resources and my own notes

Overview
# TOC

Before reading
the tutorial is a collection of many other resources and my own notes. Note that the ref if any in the tutorial means the whole passage. And part to be referred if any means the part has been summarized or detailed by me. Feel free to click the [the part to be referred] to read the original.

CTC_pytorch

1. Why we need CTC? ---> looking back on history

Feel free to skip it if you already know the purpose of CTC coming into being.

1.1. About CRNN

We need to learn CRNN because in the context we need an output to be a sequence.

ref: the overview from CRNN to CTC !! highly recommended !!

part to be referred

multi-digit sequence recognition

  • Characted-based
  • word-based
  • sequence-to-sequence
  • CRNN = CNN + RNN
    • CNN --> relationship between pixel
    • (the small fonts) Specifially, each feature vec of a feature seq is generated from left to right on the feature maps. That means the i-th feature vec is the concatenation of the columns of all the maps. So the shape of the tensor can be reshaped as e.g. (batch_size, 32, 256)

image1



1.2. from Cross Entropy Loss to CTC Loss

Usually, CE is applied to compute loss as the following way. And gt(also target) can be encoded as a stable matrix or vector.

image2

However, in OCR or audio recognition, each target input/gt has various forms. e.g. "I like to play piano" can be unpredictable in handwriting.

image3

Some stroke is longer than expected. Others are short.
Assume that the above example is encoded as number sequence [5, 3, 8, 3, 0].

image4

  • Tips: blank(the blue box symbol here) is introduced because we allow the model to predict a blank label due to unsureness or the end comes, which is similar with human when we are not pretty sure to make a good prediction. ref:lihongyi lecture starting from 3:45

Therefore, we see that this is an one-to-many question where e.g. "I like to play piano" has many target forms. But we not just have one sequence. We might also have other sequence e.g. "I love you", "Not only you but also I like apple" etc, none of which have a same sentence length. And this is what cross entropy cannot achieve in one batch. But now we can encode all sequences/sentences into a new sequence with a max length of all sequences.

e.g.
"I love you" --> len = 10
"How are you" --> len = 11
"what's your name" --> len = 16

In this context the input_length should be >= 16.

For dealing with the expanded targets, CTC is introduced by using the ideas of (1) HMM forward algorithm and (2) dynamic programing.

2. Details about CTC

2.1. intuition: forward algorithm

image5

image6

Tips: the reason we have - inserted between each two token is because, for each moment/horizontal(Note) position we allow the model to predict a blank representing unsureness.

Note that moment is for audio recognition analogue. horizontal position is for OCR analogue.



2.2. implementation: forward algorithm with dynamic programming

the complete code is CTC.py

given 3 samples, they are
"orange" :[15, 18, 1, 14, 7, 5]    len = 6
"apple" :[1, 16, 16, 12, 5]    len = 5
"watermelon" :[[23, 1, 20, 5, 18, 13, 5, 12, 15, 14]  len = 10

{0:blank, 1:A, 2:B, ... 26:Z}

2.2.1. dummy input ---> what the input looks like

# ------------ a dummy input ----------------
log_probs = torch.randn(15, 3, 27).log_softmax(2).detach().requires_grad_()# 15:input_length  3:batchsize  27:num of token(class)
# targets = torch.randint(0, 27, (3, 10), dtype=torch.long)
targets = torch.tensor([[15, 18, 1,  14, 7, 5,  0, 0,  0,  0],
                        [1,  16, 16, 12, 5, 0,  0, 0,  0,  0],
                        [23, 1,  20, 5, 18, 13, 5, 12, 15, 14]]
                        )

# assume that the prediction vary within 15 input_length.But the target length is still the true length.
""" 
e.g. [a,0,0,0,p,0,p,p,p, ...l,e] is one of the prediction
 """
input_lengths = torch.full((3,), 15, dtype=torch.long)
target_lengths = torch.tensor([6,5,10], dtype = torch.long)



2.2.2. expand the target ---> what the target matrix look like

Recall that one target can be encoded in many different forms. So we introduce a targets mat to represent it as follows.

"-d-o-g-" ">
target_prime = targets.new_full((2 * target_length + 1,), blank) # create a targets_prime full of zero

target_prime[1::2] = targets[i, :target_length] # equivalent to insert blanks in targets. e.g. targets = "dog" --> "-d-o-g-"

Now we got target_prime(also expanded target) for e.g. "apple"
target_prime is
tensor([ 0, 1, 0, 16, 0, 16, 0, 12, 0, 5, 0]) which is visualized as the red part(also t1)

image7

Note that the t8 is only for illustration. In the example, the width of target matrix should be 15(input_length).

probs = log_probs[:input_length, i].exp()

Then we convert original inputs from log-space like this, referring to "In practice, the above recursion ..." in original paper https://www.cs.toronto.edu/~graves/icml_2006.pdf

2.3. Alpha Matrix

image8

# alpha matrix init at t1 indicated by purple boxes.
alpha_col = log_probs.new_zeros((target_length * 2 + 1,))
alpha_col[0] = probs[0, blank] # refers to green box
alpha_col[1] = probs[0, target_prime[1]]
  • blank is the index of blank(here it's 0)
  • target_prime[1] refers to the 1-st index of the token. e.g. "apple": "a", "orange": "o"

2.4. Dynamic programming based on 3 conditions

refer to the details in CTC.py

reference:

Owner
手写AI
手写AI
JTEX is a command line tool (CLI) for rendering LaTeX documents from jinja-style templates.

JTEX JTEX is a command line tool (CLI) for rendering LaTeX documents from jinja-style templates. This package uses Jinja2 as the template engine with

Curvenote 15 Dec 21, 2022
advance python series: Data Classes, OOPs, python

Working With Pydantic - Built-in Data Process ========================== Normal way to process data (reading json file): the normal princiople, it's f

Phung Hưng Binh 1 Nov 08, 2021
A web app builds using streamlit API with python backend to analyze and pick insides from multiple data formats.

Data-Analysis-Web-App Data Analysis Web App can analysis data in multiple formates(csv, txt, xls, xlsx, ods, odt) and gives shows you the analysis in

Kumar Saksham 19 Dec 09, 2022
Paper and Code for "Curriculum Learning by Optimizing Learning Dynamics" (AISTATS 2021)

Curriculum Learning by Optimizing Learning Dynamics (DoCL) AISTATS 2021 paper: Title: Curriculum Learning by Optimizing Learning Dynamics [pdf] [appen

Tianyi Zhou 15 Dec 06, 2022
xeuledoc - Fetch information about a public Google document.

xeuledoc - Fetch information about a public Google document.

Malfrats Industries 651 Dec 27, 2022
Software engineering course project. Secondhand trading system.

PigeonSale Software engineering course project. Secondhand trading system. Documentation API doumenatation: list of APIs Backend documentation: notes

Harry Lee 1 Sep 01, 2022
:blue_book: Automatic documentation from sources, for MkDocs.

mkdocstrings Automatic documentation from sources, for MkDocs. Features Python handler features Requirements Installation Quick usage Features Languag

Timothée Mazzucotelli 1.1k Dec 31, 2022
Docov - Light-weight, recursive docstring coverage analysis for python modules

docov Light-weight, recursive docstring coverage analysis for python modules. Ov

Richard D. Paul 3 Feb 04, 2022
Python-slp - Side Ledger Protocol With Python

Side Ledger Protocol Run python-slp node First install Mongo DB and run the mong

Solar 3 Mar 02, 2022
A fast time mocking alternative to freezegun that wraps libfaketime.

python-libfaketime: fast date/time mocking python-libfaketime is a wrapper of libfaketime for python. Some brief details: Linux and OS X, Pythons 3.5

Simon Weber 68 Jun 10, 2022
Official Matplotlib cheat sheets

Official Matplotlib cheat sheets

Matplotlib Developers 6.7k Jan 09, 2023
EasyModerationKit is an open-source framework designed to moderate and filter inappropriate content.

EasyModerationKit is a public transparency statement. It declares any repositories and legalities used in the EasyModeration system. It allows for implementing EasyModeration into an advanced charact

Aarav 1 Jan 16, 2022
FireEye Related Projects

FireEye FireEye Related Projects Tor-IP-Collector Simple python script that will collect a list of TOR IPs from the SecOps Institute Github and inject

Taran Ulrich 2 Nov 12, 2022
Course materials and handouts for #100DaysOfCode in Python course

#100DaysOfCode with Python course Course details page: talkpython.fm/100days Course Summary #100DaysOfCode in Python is your perfect companion to take

Talk Python 1.9k Dec 31, 2022
A document format conversion service based on Pandoc.

reformed Document format conversion service based on Pandoc. Usage The API specification for the Reformed server is as follows: GET /api/v1/formats: L

David Lougheed 3 Jul 18, 2022
Mayan EDMS is a document management system.

Mayan EDMS is a document management system. Its main purpose is to store, introspect, and categorize files, with a strong emphasis on preserving the contextual and business information of documents.

3 Oct 02, 2021
Fast syllable estimation library based on pattern matching.

Syllables: A fast syllable estimator for Python Syllables is a fast, simple syllable estimator for Python. It's intended for use in places where speed

ProseGrinder 26 Dec 14, 2022
Poetry plugin to export the dependencies to various formats

Poetry export plugin This package is a plugin that allows the export of locked packages to various formats. Note: For now, only the requirements.txt f

Poetry 90 Jan 05, 2023
Parser manager for parsing DOC, DOCX, PDF or HTML files

Parser manager Description Parser gets PDF, DOC, DOCX or HTML file via API and saves parsed data to the database. Implemented in Ruby 3.0.1 using Acti

Эдем 4 Dec 04, 2021
Modified fork of CPython's ast module that parses `# type:` comments

Typed AST typed_ast is a Python 3 package that provides a Python 2.7 and Python 3 parser similar to the standard ast library. Unlike ast up to Python

Python 217 Dec 06, 2022