A 30000+ Chinese MRC dataset - Delta Reading Comprehension Dataset

Delta Reading Comprehension Dataset

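The Delta Reading Comprehension Dataset (DRCD) is a general-domain Traditional Chinese machine reading comprehension dataset. It is intended to serve as a standard Chinese reading comprehension dataset for transfer learning. The dataset contains 10,014 paragraphs compiled from 2,108 Wikipedia articles, with more than 30,000 questions annotated on those paragraphs.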

For more information about the dataset, please refer to the paper: https://arxiv.org/abs/1806.00920

Data format

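  • version : dataset version
  • data :
    • title : article title
    • id : article ID
    • paragraphs :
      • id : article ID and paragraph number (for example "2128-2")
      • context : paragraph text
      • qas :
        • question : question text
        • id : article ID, paragraph number, and question number (for example "2128-2-1")
        • answers :
          • answer_start : position of text within context (a 0-based character offset in the example below)
          • id : "1" marks the answer annotated by the question writer; "2" and above mark answers given by human respondents
          • text : answer text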
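
The nesting described above can be walked with three loops. The following is a minimal sketch, assuming a DRCD file has already been parsed into a Python dict; the helper name `iter_questions` and the miniature inline sample are illustrative and are not part of the dataset's own tooling.

```python
from typing import Any, Dict, Iterator


def iter_questions(dataset: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield one flat record per question from a DRCD-style dict."""
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                yield {
                    "article_id": article["id"],
                    "title": article["title"],
                    "paragraph_id": paragraph["id"],
                    "context": paragraph["context"],
                    "question_id": qa["id"],
                    "question": qa["question"],
                    "answers": qa["answers"],
                }


# Hypothetical miniature sample that follows the field layout described above.
sample = {
    "version": "1.3",
    "data": [{
        "title": "基督新教",
        "id": "2128",
        "paragraphs": [{
            "id": "2128-3",
            "context": "主教制源自天主教的主教制度",
            "qas": [{
                "id": "2128-3-1",
                "question": "新教的主教制度源自於哪一教?",
                "answers": [{"id": "1", "text": "天主教", "answer_start": 5}],
            }],
        }],
    }],
}

records = list(iter_questions(sample))
print(len(records), records[0]["question_id"])
```

For a real file, the dict would typically come from `json.load` on a file opened with `encoding="utf-8"`; the file name depends on the release you download and is not assumed here.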

Example

{
"version": "1.3",
"data": [
  {
    "title": "基督新教",
    "id": "2128",
    "paragraphs": [
      {
        "context": "基督新教與天主教均繼承普世教會歷史上許多傳統教義,如三位一體、聖經作為上帝的啟示、原罪、認罪、最後審判等等,但有別於天主教和東正教,新教在行政上沒有單一組織架構或領導,而且在教義上強調因信稱義、信徒皆祭司, 以聖經作為最高權威,亦因此否定以教宗為首的聖統制、拒絕天主教教條中關於聖傳與聖經具同等地位的教導。新教各宗派間教義不盡相同,但一致認同五個唯獨:唯獨恩典:人的靈魂得拯救唯獨是神的恩典,是上帝送給人的禮物。唯獨信心:人唯獨藉信心接受神的赦罪、拯救。唯獨基督:作為人類的代罪羔羊,耶穌基督是人與上帝之間唯一的調解者。唯獨聖經:唯有聖經是信仰的終極權威。唯獨上帝的榮耀:唯獨上帝配得讚美、榮耀",
        "id": "2128-2",
        "qas": [
          {
            "id": "2128-2-1",
            "question": "新教在教義上強調信徒皆祭司以及什麼樣的理念?",
            "answers": [
              {
                "id": "1",
                "text": "因信稱義",
                "answer_start": 92
              }
            ]
          },
          {
            "id": "2128-2-2",
            "question": "哪本經典為新教的最高權威?",
            "answers": [
              {
                "id": "1",
                "text": "聖經",
                "answer_start": 105
              }
            ]
          },
          {
            "id": "2128-2-3",
            "question": "新教認同幾個唯獨?",
            "answers": [
              {
                "id": "1",
                "text": "五個",
                "answer_start": 171
              }
            ]
          },
          {
            "id": "2128-2-4",
            "question": "文中提及,人唯獨藉信心接受神的赦罪、拯救,此為哪一種唯獨?",
            "answers": [
              {
                "id": "1",
                "text": "唯獨信心",
                "answer_start": 206
              }
            ]
          }
        ]
      },
      {
        "context": "主教制源自天主教的主教制度,幾乎和天主教的主教制度一模一樣,唯一不同的是主教亦可以結婚。天主教的主教制是在使徒們去世後於第二、三世紀興起的主教制度,所以可以說主教制是整個基督宗教中歷史最悠久的神職人員制度。現在行主教制的新教教會已經很少,聖公會就是沿用主教制,從教會制度和禮儀上看來,聖公會基本上屬大公教會傳統。路德宗和衛理公會則由各區會自行選擇使用主教制還是長老制;在香港和澳門,路德會和衛理公會就選用了長老制。然而,在歐洲,例如瑞典、芬蘭、挪威、德國等地,他們則通常採用主教制。長老制,是一個以議會形式管理區會的制度。議會內的成員由各教會選出長老,代表該教會出席會議。顧名思義,長老會就是採用長老制的教會。採用長老制的教會有基督教改革宗長老會、台灣基督長老教會、韓國基督長老教會等。",
        "id": "2128-3",
        "qas": [
          {
            "id": "2128-3-1",
            "question": "新教的主教制度源自於哪一教?",
            "answers": [
              {
                "id": "1",
                "text": "天主教",
                "answer_start": 5
              }
            ]
          },
          {
            "id": "2128-3-2",
            "question": "文中提及,新教的主教可以做什麼?",
            "answers": [
              {
                "id": "1",
                "text": "結婚",
                "answer_start": 41
              }
            ]
          },
          {
            "id": "2128-3-3",
            "question": "哪個會屬於大公教會傳統?",
            "answers": [
              {
                "id": "1",
                "text": "聖公會",
                "answer_start": 142
              }
            ]
          },
          {
            "id": "2128-3-4",
            "question": "以議會形式管理區會的制度,名為?",
            "answers": [
              {
                "id": "1",
                "text": "長老制",
                "answer_start": 241
              }
            ]
          }
        ]
      }
    ]
  }
]
}
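
In the example, `answer_start` behaves as a 0-based character offset into `context`: the first answer of the second paragraph, "天主教", starts at index 5. A minimal sketch of that consistency check follows, using only the opening clause of that paragraph; the function name `answer_is_aligned` is illustrative.

```python
def answer_is_aligned(context: str, answer: dict) -> bool:
    """Return True if answer["text"] occurs in context at answer["answer_start"]."""
    start = answer["answer_start"]
    text = answer["text"]
    return context[start:start + len(text)] == text


# Opening clause of the second example paragraph.
context = "主教制源自天主教的主教制度"
good = {"id": "1", "text": "天主教", "answer_start": 5}
bad = {"id": "1", "text": "天主教", "answer_start": 6}  # deliberately shifted by one

print(answer_is_aligned(context, good))
print(answer_is_aligned(context, bad))
```

Running a check like this over every answer is a cheap way to detect offsets that were shifted by re-encoding or punctuation normalization.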

Copyright Notice

DRCD is compiled and adapted from Wikipedia, and its content is published under the terms of CC-BY-SA 3.0. Delta Electronics, Inc. makes no representations or warranties as to the correctness of the contents of DRCD and will not be liable for any loss or damage arising from the use of or reliance on DRCD.

The CC-BY-SA 3.0 terms can be found at http://creativecommons.org/licenses/by-sa/3.0/

Contact us

Comments
  • Evaluation problem

    According to the paper:

    F1 score and exact match from Rajpurkar et al. (2016) are used as the evaluation metrics. Both metrics ignore punctuations. In F1 score metric, we consider predictions and ground truth as bag of Chinese character.

    Does the ignored punctuation include fullwidth marks? Is there an original evaluation script for this dataset?

    opened by penut85420 2
  • Format conversion to SQuAD2.0

    When you used the Bert-Chinese model for the DRCD task as described in your paper, did you have to apply any format conversion first before the model could be run on DRCD?

    P.S. By format conversion I mean converting the DRCD format to the SQuAD2.0 format.

    opened by allenyummy 1
  • Training set problem

    The paper says that "the training set contains 26,932 questions in 8,014 paragraphs". However, when I counted them, I found 26,936 question IDs in the JSON file.

    opened by Liangtaiwan 1
  • Dev problem

    In the dev set, the answers to the same question are duplicates of each other. Also, the SQuAD dataset has 3 different answers per question in its dev and test sets, so its human EM performance is much higher than on your dataset. Are you going to provide more answers?

    opened by Liangtaiwan 1
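
The metric description quoted in the "Evaluation problem" issue can be sketched as follows. This is not the official evaluation script (the issue itself asks whether one exists). It is an illustration that treats the prediction and the ground truth as bags of characters, and it assumes that "punctuation" means every character in a Unicode punctuation category, fullwidth marks included, which is exactly the point the issue leaves open. Dropping whitespace is a further assumption, and the function names are illustrative.

```python
import unicodedata
from collections import Counter


def strip_punctuation(text: str) -> str:
    """Drop whitespace and every character whose Unicode category starts with 'P'.

    Assumption: fullwidth marks such as "。" count as punctuation.
    """
    return "".join(
        ch for ch in text
        if not ch.isspace() and not unicodedata.category(ch).startswith("P")
    )


def exact_match(prediction: str, truth: str) -> bool:
    """Exact match after punctuation and whitespace are removed."""
    return strip_punctuation(prediction) == strip_punctuation(truth)


def char_f1(prediction: str, truth: str) -> float:
    """F1 over bags of characters, ignoring punctuation and whitespace."""
    pred_chars = Counter(strip_punctuation(prediction))
    true_chars = Counter(strip_punctuation(truth))
    overlap = sum((pred_chars & true_chars).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_chars.values())
    recall = overlap / sum(true_chars.values())
    return 2 * precision * recall / (precision + recall)


print(exact_match("因信稱義。", "因信稱義"))
print(char_f1("信稱義", "因信稱義"))
```

When a question has several reference answers, a common convention from SQuAD-style evaluation is to take the maximum score over the references; whether the DRCD authors did so is not stated in the text quoted above.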