CDLA: A Chinese document layout analysis (CDLA) dataset

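Introduction

CDLA is a Chinese document layout analysis dataset aimed at Chinese academic literature (papers). It contains the following 10 labels:

Text (正文), Title (标题), Figure (图片), Figure caption (图片标题), Table (表格), Table caption (表格标题), Header (页眉), Footer (页脚), Reference (注释), Equation (公式)

The dataset contains 5,000 training images and 1,000 validation images, stored in the train and val directories respectively. Each image has an annotation file (.json) with the same name.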
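The one-to-one pairing between images and annotation files can be checked with a short script. This is a minimal sketch, not part of the dataset tooling: the function name `pair_images_with_annotations` is hypothetical, and the `.jpg` suffix is an assumption based on the file name in the annotation example (`val_0031.jpg`).

```python
from pathlib import Path


def pair_images_with_annotations(split_dir, image_suffix=".jpg"):
    """Pair each image in split_dir with its same-name .json annotation.

    Returns (pairs, missing): pairs is a list of (image_path, json_path),
    missing is a list of images that have no annotation file.
    The ".jpg" default is an assumption, not something the dataset guarantees.
    """
    split_dir = Path(split_dir)
    pairs, missing = [], []
    for image_path in sorted(split_dir.glob(f"*{image_suffix}")):
        ann_path = image_path.with_suffix(".json")
        if ann_path.is_file():
            pairs.append((image_path, ann_path))
        else:
            missing.append(image_path)
    return pairs, missing
```

Calling `pair_images_with_annotations("CDLA_dir/train")` should, if the layout described above holds, return 5,000 pairs and an empty `missing` list.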

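Sample:

Download links

Annotation format

The annotations were made with labelme, so the annotation format follows the labelme format. The most important fields are described below.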

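"shapes": a list of dicts; each dict represents one annotated instance.

"label": the category of the instance.

"points": the instance annotation. Annotations are polygons, so points may contain more than 4 coordinate pairs.

"shape_type": "polygon"

"imagePath": image path/file name

"imageHeight": image height in pixels

"imageWidth": image width in pixels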
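The fields above can be read with Python's standard `json` module. The sketch below counts instances per label in one annotation and flags labels outside the ten listed in the introduction. The helper names and the in-memory `example` annotation are hypothetical, and the exact spelling of the multi-word labels inside the files (for instance `Figure caption`) is an assumption.

```python
import json
from collections import Counter

# The ten label names from the introduction (spelling in the files is assumed).
CDLA_LABELS = {
    "Text", "Title", "Figure", "Figure caption", "Table",
    "Table caption", "Header", "Footer", "Reference", "Equation",
}


def load_annotation(path):
    """Load one labelme-style annotation file into a dict."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def count_labels(annotation):
    """Count annotated instances per label in a single annotation dict."""
    return Counter(shape["label"] for shape in annotation["shapes"])


def unexpected_labels(annotation):
    """Return, sorted, any labels that are not among the ten expected ones."""
    return sorted(set(count_labels(annotation)) - CDLA_LABELS)


# Hypothetical in-memory annotation, used only to show the field layout.
example = {
    "shapes": [
        {"label": "Title", "points": [[0, 0], [0, 10], [50, 10], [50, 0]],
         "shape_type": "polygon"},
        {"label": "Text", "points": [[0, 20], [0, 60], [50, 60], [50, 20]],
         "shape_type": "polygon"},
        {"label": "Text", "points": [[0, 70], [0, 90], [50, 90], [50, 70]],
         "shape_type": "polygon"},
    ],
    "imagePath": "example.jpg",
    "imageHeight": 100,
    "imageWidth": 100,
}
print(count_labels(example))
print(unexpected_labels(example))
```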

A complete annotation example:

{
  "version":"4.5.6",
  "flags":{},
  "shapes":[
    {
      "label":"Title",
      "points":[
        [
          553.1111111111111,
          166.59259259259258
        ],
        [
          553.1111111111111,
          198.59259259259258
        ],
        [
          686.1111111111111,
          198.59259259259258
        ],
        [
          686.1111111111111,
          166.59259259259258
        ]
      ],
      "group_id":null,
      "shape_type":"polygon",
      "flags":{}
    },
    {
      "label":"Text",
      "points":[
        [
          250.5925925925925,
          298.0740740740741
        ],
        [
          250.5925925925925,
          345.0740740740741
        ],
        [
          188.5925925925925,
          345.0740740740741
        ],
        [
          188.5925925925925,
          410.0740740740741
        ],
        [
          188.5925925925925,
          456.0740740740741
        ],
        [
          324.5925925925925,
          456.0740740740741
        ],
        [
          324.5925925925925,
          410.0740740740741
        ],
        [
          1051.5925925925926,
          410.0740740740741
        ],
        [
          1051.5925925925926,
          345.0740740740741
        ],
        [
          1052.5925925925926,
          345.0740740740741
        ],
        [
          1052.5925925925926,
          298.0740740740741
        ]
      ],
      "group_id":null,
      "shape_type":"polygon",
      "flags":{}
    },
    {
      "label":"Footer",
      "points":[
        [
          1033.7407407407406,
          1634.5185185185185
        ],
        [
          1033.7407407407406,
          1646.5185185185185
        ],
        [
          1052.7407407407406,
          1646.5185185185185
        ],
        [
          1052.7407407407406,
          1634.5185185185185
        ]
      ],
      "group_id":null,
      "shape_type":"polygon",
      "flags":{}
    }
  ],
  "imagePath":"val_0031.jpg",
  "imageData":null,
  "imageHeight":1754,
  "imageWidth":1240
}
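Detection-style pipelines often need an axis-aligned bounding box and an area for each polygon. The following is a minimal sketch of one way to derive both from a `points` list, using the Title polygon from the example above. The function names are hypothetical, the box layout `[x, y, width, height]` follows the COCO convention, and the area uses the shoelace formula, which is not necessarily how labelme2coco.py computes it.

```python
def polygon_to_bbox(points):
    """Return the axis-aligned box [x, y, width, height] enclosing a polygon."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_min, y_min = min(xs), min(ys)
    return [x_min, y_min, max(xs) - x_min, max(ys) - y_min]


def polygon_area(points):
    """Return the area of a simple polygon using the shoelace formula."""
    total = 0.0
    n = len(points)
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]  # wrap around to close the polygon
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


# "points" of the Title instance in the annotation example above.
title_points = [
    [553.1111111111111, 166.59259259259258],
    [553.1111111111111, 198.59259259259258],
    [686.1111111111111, 198.59259259259258],
    [686.1111111111111, 166.59259259259258],
]
bbox = polygon_to_bbox(title_points)
area = polygon_area(title_points)
print(bbox, area)
```

For this rectangle the box is roughly 133 pixels wide and 32 pixels high. For non-rectangular polygons such as the Text instance in the example, the polygon area is smaller than the area of the enclosing box.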

Converting to COCO format

Run the following commands:

# train
python3 labelme2coco.py CDLA_dir/train train_save_path  --labels labels.txt

# val
python3 labelme2coco.py CDLA_dir/val val_save_path  --labels labels.txt

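The converted results are saved in the train_save_path and val_save_path directories, respectively.

labelme2coco.py is taken from labelme; for more information, see the official labelme project.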
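After conversion, a quick sanity check of the COCO file can confirm that the image and annotation counts look right. This is a sketch under assumptions: `summarize_coco` is a hypothetical helper, and the output file name is not stated here. The upstream labelme example script writes `annotations.json` into the output directory, but that should be verified for this copy of the script. The top-level keys `images`, `annotations` and `categories` are standard COCO.

```python
import json
from collections import Counter


def summarize_coco(path):
    """Summarize a COCO-format annotation file.

    Returns the number of images, the number of annotations and the
    number of annotations per category name.
    """
    with open(path, encoding="utf-8") as f:
        coco = json.load(f)
    # Map category ids to names so the per-category counts are readable.
    names = {c["id"]: c["name"] for c in coco["categories"]}
    per_category = Counter(names[a["category_id"]] for a in coco["annotations"])
    return {
        "images": len(coco["images"]),
        "annotations": len(coco["annotations"]),
        "per_category": dict(per_category),
    }
```

For example, `summarize_coco("val_save_path/annotations.json")` (path assumed, as noted) would be expected to report 1,000 images for the validation split.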

Owner
buptlihang