A collection of Classical Chinese natural language processing models, including Classical Chinese related models and resources on the Internet.

Overview



GitHub issues GitHub stars GitHub license

古文自然语言处理模型合集,收录互联网上的古文相关模型及资源。

更多内容请参考:

古文预训练语言模型

古文预训练语言模型是处理各种古文任务的基础模型,需要结合各种下游任务数据微调,才能发挥最大作用。这里收集了所有互联网上公开的古文预训练语言模型:

名称 简/繁 下载链接 备注
guwenbert-base Hugging Face 基于殆知阁语料和中文模型训练
guwenbert-large Hugging Face
guwenbert-fs-base One Drive 基于殆知阁语料从头训练
roberta-classical-chinese-base-char 简繁 Hugging Face 基于guwenbert训练,扩展了繁体词表
roberta-classical-chinese-large-char 简繁 Hugging Face
sikubert Hugging Face 基于四库全书语料和中文模型训练
sikuroberta Hugging Face

古文应用模型

古文应用模型是基于古文预训练模型,结合特定领域数据微调得到的模型,能够实现古文的各种实际应用。其中guwen-X模型使用的训练数据可以在CCLUE中下载,如果输入包含繁体字请先使用本页最下方提到的工具进行转换。

古文断句

guwen-seg: 基于guwenbert-fs-base的断句模型。

古文标点

guwen-punc: 基于guwenbert-fs-base的标点模型。

古文引号检测

guwen-quote: 基于guwenbert-fs-base的引号检测模型。

注意:如下图所示,使用Transformers自带的序列标注模型存在一定误差,请在实际场景中使用CRF模型解码。相关代码参考 crf_example.ipynb

古文命名实体识别

guwen-ner: 基于guwenbert-base的命名实体识别模型。

注意:为取得最好表现,推荐在实际场景中使用CRF模型解码。相关代码参考 crf_example.ipynb

古文分类

guwen-cls: 基于guwenbert-fs-base的古文分类模型。

古诗情感分类

guwen-sent: 基于guwenbert-base的古文分类模型。

其他古文相关资源

  • OpenCC: 简繁转换工具
  • zhconv: 简繁转换工具 (注意需使用zh-hans选项,只转换单字,避免转换地区词)
  • 甲言Jiayan: 古汉语处理的NLP工具包,古文分词,词性标注,断句,标点等工具
  • UD-Kanbun: 古文分词,词性标注,句法解析
  • daizhigev20: 殆知阁古代文献v2.0语料库
  • chinese-poetry: 最全中文诗歌古典文集数据库
  • Classical-Chinese: 古文现代文翻译平行语料库

关于

本仓库旨在收集互联网上的开源古文NLP模型,版权归原作者所有,欢迎补充更多资源,如有问题可以在Issue区讨论,或邮件联系ethanyt at qq.com

Owner
Ethan
Natural Language Processing, Deep Learning, Information Retrieval, Full-stack Development.
Ethan
SGMC: Spectral Graph Matrix Completion

SGMC: Spectral Graph Matrix Completion Code for AAAI21 paper "Scalable and Explainable 1-Bit Matrix Completion via Graph Signal Learning". Data Format

Chao Chen 8 Dec 12, 2022
Mesh TensorFlow: Model Parallelism Made Easier

Mesh TensorFlow - Model Parallelism Made Easier Introduction Mesh TensorFlow (mtf) is a language for distributed deep learning, capable of specifying

1.3k Dec 26, 2022
An extension for asreview implements a version of the tf-idf feature extractor that saves the matrix and the vocabulary.

Extension - matrix and vocabulary extractor for TF-IDF and Doc2Vec An extension for ASReview that adds a tf-idf extractor that saves the matrix and th

ASReview 4 Jun 17, 2022
SimpleChinese2 集成了许多基本的中文NLP功能,使基于 Python 的中文文字处理和信息提取变得简单方便。

SimpleChinese2 SimpleChinese2 集成了许多基本的中文NLP功能,使基于 Python 的中文文字处理和信息提取变得简单方便。 声明 本项目是为方便个人工作所创建的,仅有部分代码原创。

Ming 30 Dec 02, 2022
Knowledge Oriented Programming Language

KoPL: 面向知识的推理问答编程语言 安装 | 快速开始 | 文档 KoPL全称 Knowledge oriented Programing Language, 是一个为复杂推理问答而设计的编程语言。我们可以将自然语言问题表示为由基本函数组合而成的KoPL程序,程序运行的结果就是问题的答案。目前,

THU-KEG 62 Dec 12, 2022
CredData is a set of files including credentials in open source projects

CredData is a set of files including credentials in open source projects. CredData includes suspicious lines with manual review results and more information such as credential types for each suspicio

Samsung 19 Sep 07, 2022
Python library for processing Chinese text

SnowNLP: Simplified Chinese Text Processing SnowNLP是一个python写的类库,可以方便的处理中文文本内容,是受到了TextBlob的启发而写的,由于现在大部分的自然语言处理库基本都是针对英文的,于是写了一个方便处理中文的类库,并且和TextBlob

Rui Wang 6k Jan 02, 2023
A deep learning-based translation library built on Huggingface transformers

DL Translate A deep learning-based translation library built on Huggingface transformers and Facebook's mBART-Large 💻 GitHub Repository 📚 Documentat

Xing Han Lu 244 Dec 30, 2022
This code extends the neural style transfer image processing technique to video by generating smooth transitions between several reference style images

Neural Style Transfer Transition Video Processing By Brycen Westgarth and Tristan Jogminas Description This code extends the neural style transfer ima

Brycen Westgarth 110 Jan 07, 2023
Super Tickets in Pre-Trained Language Models: From Model Compression to Improving Generalization (ACL 2021)

Structured Super Lottery Tickets in BERT This repo contains our codes for the paper "Super Tickets in Pre-Trained Language Models: From Model Compress

Chen Liang 16 Dec 11, 2022
ByT5: Towards a token-free future with pre-trained byte-to-byte models

ByT5: Towards a token-free future with pre-trained byte-to-byte models ByT5 is a tokenizer-free extension of the mT5 model. Instead of using a subword

Google Research 409 Jan 06, 2023
Paradigm Shift in NLP - "Paradigm Shift in Natural Language Processing".

Paradigm Shift in NLP Welcome to the webpage for "Paradigm Shift in Natural Language Processing". Some resources of the paper are constantly maintaine

Tianxiang Sun 41 Dec 30, 2022
Bot to connect a real Telegram user, simulating responses with OpenAI's davinci GPT-3 model.

AI-BOT Bot to connect a real Telegram user, simulating responses with OpenAI's davinci GPT-3 model.

Thempra 2 Dec 21, 2022
The official implementation of "BERT is to NLP what AlexNet is to CV: Can Pre-Trained Language Models Identify Analogies?, ACL 2021 main conference"

BERT is to NLP what AlexNet is to CV This is the official implementation of BERT is to NLP what AlexNet is to CV: Can Pre-Trained Language Models Iden

Asahi Ushio 20 Nov 03, 2022
Grading tools for Advanced NLP (11-711)Grading tools for Advanced NLP (11-711)

Grading tools for Advanced NLP (11-711) Installation You'll need docker and unzip to use this repo. For docker, visit the official guide to get starte

Hao Zhu 2 Sep 27, 2022
Text editor on python to convert english text to malayalam(Romanization/Transiteration).

Manglish Text Editor This is a simple transiteration (romanization ) program which is used to convert manglish to malayalam (converts njaan to ഞാൻ ).

Merin Rose Tom 1 May 11, 2022
Enterprise Scale NLP with Hugging Face & SageMaker Workshop series

Workshop: Enterprise-Scale NLP with Hugging Face & Amazon SageMaker Earlier this year we announced a strategic collaboration with Amazon to make it ea

Philipp Schmid 161 Dec 16, 2022
code for modular summarization work published in ACL2021 by Krishna et al

This repository contains the code for running modular summarization pipelines as described in the publication Krishna K, Khosla K, Bigham J, Lipton ZC

Kundan Krishna 6 Jun 04, 2021
Turkish Stop Words Türkçe Dolgu Sözcükleri

trstop Turkish Stop Words Türkçe Dolgu Sözcükleri In this repository I put Turkish stop words that is contained in the first 10 thousand words with th

Ahmet Aksoy 103 Nov 12, 2022
NLP-SentimentAnalysis - Coursera Course ( Duration : 5 weeks ) offered by DeepLearning.AI

Coursera Natural Language Processing Specialization This repository contains material related to Coursera Natural Language Processing Specialization.

Nishant Sharma 1 Jun 05, 2022