Implementation of DocFormer: End-to-End Transformer for Document Understanding, a multi-modal transformer based architecture for the task of Visual Document Understanding (VDU)

Last update: Jan 06, 2023

Related tags

Deep Learning docformer

Overview

DocFormer - PyTorch

Implementation of DocFormer: End-to-End Transformer for Document Understanding, a multi-modal transformer based architecture for the task of Visual Document Understanding (VDU) 📄 📄 📄 .

DocFormer is a multi-modal transformer based architecture for the task of Visual Document Understanding (VDU). In addition, DocFormer is pre-trained in an unsupervised fashion using carefully designed tasks which encourage multi-modal interaction. DocFormer uses text, vision and spatial features and combines them using a novel multi-modal self-attention layer. DocFormer also shares learned spatial embeddings across modalities which makes it easy for the model to correlate text to visual tokens and vice versa. DocFormer is evaluated on 4 different datasets each with strong baselines. DocFormer achieves state-of-the-art results on all of them, sometimes beating models 4x its size (in no. of parameters).

The official implementation was not released by the authors.

Install

There might be some issues with the import of pytessaract, so in order to debug that, we need to write

pip install pytesseract
sudo apt install tesseract-ocr

And then,

pip install git+https://github.com/shabie/docformer

Usage

from docformer import modeling, dataset
from transformers import BertTokenizerFast


config = {
  "coordinate_size": 96,
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "image_feature_pool_shape": [7, 7, 256],
  "intermediate_ff_size_factor": 4,
  "max_2d_position_embeddings": 1000,
  "max_position_embeddings": 512,
  "max_relative_positions": 8,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "shape_size": 96,
  "vocab_size": 30522,
  "layer_norm_eps": 1e-12,
}

fp = "filepath/to/the/image.tif"

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
encoding = dataset.create_features(fp, tokenizer)

feature_extractor = modeling.ExtractFeatures(config)
docformer = modeling.DocFormerEncoder(config)
v_bar, t_bar, v_bar_s, t_bar_s = feature_extractor(encoding)
output = docformer(v_bar, t_bar, v_bar_s, t_bar_s)  # shape (1, 512, 768)

License

MIT

Maintainers

Contribute

Citations

@InProceedings{Appalaraju_2021_ICCV,
    author    = {Appalaraju, Srikar and Jasani, Bhavan and Kota, Bhargava Urala and Xie, Yusheng and Manmatha, R.},
    title     = {DocFormer: End-to-End Transformer for Document Understanding},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2021},
    pages     = {993-1003}
}

Implementation of DocFormer: End-to-End Transformer for Document Understanding, a multi-modal transformer based architecture for the task of Visual Document Understanding (VDU)

Related tags

Overview

DocFormer - PyTorch

Install

Usage

License

Maintainers

Contribute

Citations

Owner

KE-Dialogue: Injecting knowledge graph into a fully end-to-end dialogue system.

MediaPipeのPythonパッケージのサンプルです。2020/12/11時点でPython実装のある4機能(Hands、Pose、Face Mesh、Holistic)について用意しています。

Photo2cartoon - 人像卡通化探索项目 (photo-to-cartoon translation project)

Solving reinforcement learning tasks which require language and vision

(CVPR2021) ClassSR: A General Framework to Accelerate Super-Resolution Networks by Data Characteristic

Face recognize system

Tutorial on active learning with the Nvidia Transfer Learning Toolkit (TLT).

A light and fast one class detection framework for edge devices. We provide face detector, head detector, pedestrian detector, vehicle detector......

FlingBot: The Unreasonable Effectiveness of Dynamic Manipulations for Cloth Unfolding

To propose and implement a multi-class classification approach to disaster assessment from the given data set of post-earthquake satellite imagery.

A Python Package for Portfolio Optimization using the Critical Line Algorithm

Conversion between units used in magnetism

NAS-Bench-x11 and the Power of Learning Curves

Text-to-Image generation

This is code of book "Learn Deep Learning with PyTorch"

Omnidirectional camera calibration in python

A community run, 5-day PyTorch Deep Learning Bootcamp

Boundary-preserving Mask R-CNN (ECCV 2020)

SCALoss: Side and Corner Aligned Loss for Bounding Box Regression (AAAI2022).

SOTR: Segmenting Objects with Transformers [ICCV 2021]