CDLA: A Chinese document layout analysis (CDLA) dataset

Last update: Dec 28, 2022

Related tags

Overview

CDLA: A Chinese document layout analysis (CDLA) dataset

介绍

CDLA是一个中文文档版面分析数据集，面向中文文献类（论文）场景。包含以下10个label：

正文	标题	图片	图片标题	表格	表格标题	页眉	页脚	注释	公式
Text	Title	Figure	Figure caption	Table	Table caption	Header	Footer	Reference	Equation

共包含5000张训练集和1000张验证集，分别在train和val目录下。每张图片对应一个同名的标注文件(.json)。

样例展示：

下载链接

百度云下载：https://pan.baidu.com/s/1449mhds2ze5JLk-88yKVAA, 提取码: tp0d
Google Drive Download：https://drive.google.com/file/d/14SUsp_TG8OPdK0VthRXBcAbYzIBjSNLm/view?usp=sharing

标注格式

我们的标注工具是labelme，所以标注格式和labelme格式一致。这里说明一下比较重要的字段。

"shapes": shapes字段是一个list，里面有多个dict，每个dict代表一个标注实例。

"labels": 类别。

"points": 实例标注。因为我们的标注是Polygon形式，所以points里的坐标数量可能大于4。

"shape_type": "polygon"

"imagePath": 图片路径/名

"imageHeight": 高

"imageWidth": 宽

展示一个完整的标注样例:

{
  "version":"4.5.6",
  "flags":{},
  "shapes":[
    {
      "label":"Title",
      "points":[
        [
          553.1111111111111,
          166.59259259259258
        ],
        [
          553.1111111111111,
          198.59259259259258
        ],
        [
          686.1111111111111,
          198.59259259259258
        ],
        [
          686.1111111111111,
          166.59259259259258
        ]
      ],
      "group_id":null,
      "shape_type":"polygon",
      "flags":{}
    },
    {
      "label":"Text",
      "points":[
        [
          250.5925925925925,
          298.0740740740741
        ],
        [
          250.5925925925925,
          345.0740740740741
        ],
        [
          188.5925925925925,
          345.0740740740741
        ],
        [
          188.5925925925925,
          410.0740740740741
        ],
        [
          188.5925925925925,
          456.0740740740741
        ],
        [
          324.5925925925925,
          456.0740740740741
        ],
        [
          324.5925925925925,
          410.0740740740741
        ],
        [
          1051.5925925925926,
          410.0740740740741
        ],
        [
          1051.5925925925926,
          345.0740740740741
        ],
        [
          1052.5925925925926,
          345.0740740740741
        ],
        [
          1052.5925925925926,
          298.0740740740741
        ]
      ],
      "group_id":null,
      "shape_type":"polygon",
      "flags":{}
    },
    {
      "label":"Footer",
      "points":[
        [
          1033.7407407407406,
          1634.5185185185185
        ],
        [
          1033.7407407407406,
          1646.5185185185185
        ],
        [
          1052.7407407407406,
          1646.5185185185185
        ],
        [
          1052.7407407407406,
          1634.5185185185185
        ]
      ],
      "group_id":null,
      "shape_type":"polygon",
      "flags":{}
    }
  ],
  "imagePath":"val_0031.jpg",
  "imageData":null,
  "imageHeight":1754,
  "imageWidth":1240
}

转coco格式

执行命令:

# train
python3 labelme2coco.py CDLA_dir/train train_save_path  --labels labels.txt

# val
python3 labelme2coco.py CDLA_dir/val val_save_path  --labels labels.txt

转换结果保存在train_save_path/val_save_path目录下。

labelme2coco.py取自labelme，更多信息请参考labelme官方项目

CDLA: A Chinese document layout analysis (CDLA) dataset

Related tags

Overview

CDLA: A Chinese document layout analysis (CDLA) dataset

介绍

下载链接

标注格式

转coco格式

Owner

buptlihang

A program that uses real statistics to choose the best times to bet on BloxFlip's crash gamemode

SimCTG - A Contrastive Framework for Neural Text Generation

Codename generator using WordNet parts of speech database

A relatively simple python program to generate one of those reddit text to speech videos dominating youtube.

This repository will contain the code for the CVPR 2021 paper "GIRAFFE: Representing Scenes as Compositional Generative Neural Feature Fields"

Code for "Parallel Instance Query Network for Named Entity Recognition", accepted at ACL 2022.

SASE : Self-Adaptive noise distribution network for Speech Enhancement with heterogeneous data of Cross-Silo Federated learning

Script to generate VAD dataset used in Asteroid recipe

Understanding the Difficulty of Training Transformers

This project converts your human voice input to its text transcript and to an automated voice too.

Unsupervised text tokenizer focused on computational efficiency

pytorch implementation of Attention is all you need

Binary LSTM model for text classification

Fake news detector filters - Smart filter project allow to classify the quality of information and web pages

Deploying a Text Summarization NLP use case on Docker Container Utilizing Nvidia GPU

Implementation of Memorizing Transformers (ICLR 2022), attention net augmented with indexing and retrieval of memories using approximate nearest neighbors, in Pytorch

NLP Text Classification

用Resnet101+GPT搭建一个玩王者荣耀的AI

Text Classification in Turkish Texts with Bert

Example code for "Real-World Natural Language Processing"