PyTorch implementations of BERT-based spelling error correction models.

Last update: Dec 30, 2022

Overview

BertBasedCorrectionModels

BERT-based text correction models, implemented in PyTorch.

Data preparation

Download the SIGHAN datasets from http://nlp.ee.ncu.edu.tw/resource/csc.html
Extract the datasets and copy every .sgml file in the extracted folders to the datasets/csc/ directory
Copy SIGHAN15_CSC_TestInput.txt and SIGHAN15_CSC_TestTruth.txt to the datasets/csc/ directory
Download https://github.com/wdimmy/Automatic-Corpus-Generation/blob/master/corpus/train.sgml to the datasets/csc/ directory

Make sure the following files are present in datasets/csc:

train.sgml
B1_training.sgml
C1_training.sgml  
SIGHAN15_CSC_A2_Training.sgml  
SIGHAN15_CSC_B2_Training.sgml  
SIGHAN15_CSC_TestInput.txt
SIGHAN15_CSC_TestTruth.txt
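
Before training, it can help to confirm that nothing in the list above is missing. The following is a minimal sketch, not part of the project: the helper name missing_csc_files is hypothetical, and it assumes the script is run from the project root so that datasets/csc resolves correctly.

```python
from pathlib import Path

# File names copied from the list above.
REQUIRED_FILES = [
    "train.sgml",
    "B1_training.sgml",
    "C1_training.sgml",
    "SIGHAN15_CSC_A2_Training.sgml",
    "SIGHAN15_CSC_B2_Training.sgml",
    "SIGHAN15_CSC_TestInput.txt",
    "SIGHAN15_CSC_TestTruth.txt",
]


def missing_csc_files(data_dir="datasets/csc"):
    """Return the required file names that are not present in data_dir.

    Hypothetical helper for illustration; the project does not ship it.
    """
    root = Path(data_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_csc_files()
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All required files are in place.")
```

An empty result means every expected file was found; file names are compared exactly, so a differently cased name is reported as missing on case-sensitive file systems.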

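Environment setup

Use an existing Python environment, or create a new one with conda create -n <env_name> python=3.7 (recommended)
Clone this repository and enter its root directory
Install the dependencies: pip install -r requirements.txt
If you hit an error saying the GLIBC version is too low, install an older version of OpenCC (for example 1.1.0) instead; upgrading GLIBC itself is error-prone and not recommended
Add the project root to the Python path in the current terminal: export PYTHONPATH=.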

Training

Run the following command to train a model. The data is preprocessed automatically on the first run.

python tools/train_csc.py --config_file csc/train_SoftMaskedBert.yml

Choose a different config file to train a different model. The following config files are currently supported:

train_bert4csc.yml
train_macbert4csc.yml
train_SoftMaskedBert.yml

Adjust the parameters in the config file as needed.

Results

SoftMaskedBert

component	sentence level acc	p	r	f
Detection	0.5045	0.8252	0.8416	0.8333
Correction	0.8055	0.9395	0.8748	0.9060

BERT-based models

char level

MODEL	p	r	f
BERT4CSC	0.9269	0.8651	0.8949
MACBERT4CSC	0.9380	0.8736	0.9047

sentence level

model	acc	p	r	f
BERT4CSC	0.7990	0.8482	0.7214	0.7797
MACBERT4CSC	0.8027	0.8525	0.7251	0.7836
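
The f column in the tables above appears to be the F1 score, that is, the harmonic mean of p (precision) and r (recall). That reading is an assumption, since the README does not define the column, but it agrees with the reported numbers to within rounding. A minimal sketch of the check, using rows copied from the tables:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (p, r, reported f) copied from the tables above.
REPORTED = {
    "SoftMaskedBert detection": (0.8252, 0.8416, 0.8333),
    "SoftMaskedBert correction": (0.9395, 0.8748, 0.9060),
    "BERT4CSC char level": (0.9269, 0.8651, 0.8949),
    "MACBERT4CSC char level": (0.9380, 0.8736, 0.9047),
    "BERT4CSC sentence level": (0.7990 and 0.8482, 0.7214, 0.7797),
    "MACBERT4CSC sentence level": (0.8525, 0.7251, 0.7836),
}

for name, (p, r, f) in REPORTED.items():
    print(f"{name}: computed {f1_score(p, r):.4f}, reported {f:.4f}")
```

Because p and r are themselves rounded to four decimals, the recomputed value can differ from the reported one in the last digit.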

Inference

Method 1: use the inference script

python inference.py --ckpt_fn epoch=0-val_loss=0.03.ckpt --texts "我今天很高心"
# or pass the path to a text file with one sentence per line
python inference.py --ckpt_fn epoch=0-val_loss=0.03.ckpt --text_file /ml/data/text.txt

where /ml/data/text.txt contains:

我今天很高心
你这个辣鸡模型只能做错别字纠正
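
If the sentences live in a Python list rather than a file, the one-sentence-per-line input can be generated first. This is a minimal sketch: the helper name write_text_file is hypothetical, and UTF-8 is an assumed encoding, since the README does not say how inference.py reads the file.

```python
from pathlib import Path


def write_text_file(texts, path):
    """Write one sentence per line (UTF-8 assumed) and return the path.

    Hypothetical helper for illustration; the project does not ship it.
    """
    path = Path(path)
    # Newlines inside a sentence would split it across lines, so reject them.
    for text in texts:
        if "\n" in text:
            raise ValueError("each sentence must fit on a single line")
    path.write_text("\n".join(texts) + "\n", encoding="utf-8")
    return path
```

For example, write_text_file(["我今天很高心"], "text.txt") creates a file that can then be passed to the script with --text_file text.txt.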

Method 2: call the model directly

texts = ['今天我很高心', '测试', '继续测试']
model.predict(texts)

Method 3: export the BERT weights and load them with transformers or pycorrector

Export the BERT weights with convert_to_pure_state_dict.py
For the remaining steps, see https://github.com/shibing624/pycorrector/blob/master/pycorrector/macbert/README.md

Citation

If you use this project in your research, please cite it as follows:

@article{cai2020pre,
  title={BERT Based Correction Models},
  author={Cai, Heng and Chen, Dian},
  journal={GitHub. Note: https://github.com/gitabtion/BertBasedCorrectionModels},
  year={2020}
}

License

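The source code is licensed under the Apache License 2.0 and may be used commercially free of charge. Please include a link to this project and the license in your product documentation. This project is protected by copyright law; infringement will be pursued.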

