A Multi-modal Model Chinese Spell Checker Released on ACL2021.

Last update: Dec 29, 2022

Related tags

Overview

ReaLiSe

ReaLiSe is a multi-modal Chinese spell checking model.

This the office code for the paper Read, Listen, and See: Leveraging Multimodal Information Helps Chinese Spell Checking.

The paper has been accepted in ACL Findings 2021.

Environment

Python: 3.6
Cuda: 10.0
Packages: pip install -r requirements.txt

Data

Raw Data

SIGHAN Bake-off 2013: http://ir.itc.ntnu.edu.tw/lre/sighan7csc.html
SIGHAN Bake-off 2014: http://ir.itc.ntnu.edu.tw/lre/clp14csc.html
SIGHAN Bake-off 2015: http://ir.itc.ntnu.edu.tw/lre/sighan8csc.html
Wang271K: https://github.com/wdimmy/Automatic-Corpus-Generation

Data Processing

The code and cleaned data are in the data_process directory.

You can also directly download the processed data from this and put them in the data directory. The data directory would look like this:

data
|- trainall.times2.pkl
|- test.sighan15.pkl
|- test.sighan15.lbl.tsv
|- test.sighan14.pkl
|- test.sighan14.lbl.tsv
|- test.sighan13.pkl
|- test.sighan13.lbl.tsv

Pretrain

BERT: chinese-roberta-wwm-ext

Huggingface hfl/chinese-roberta-wwm-ext: https://huggingface.co/hfl/chinese-roberta-wwm-ext
Local: /data/dobby_ceph_ir/neutrali/pretrained_models/roberta-base-ch-for-csc/
Phonetic Encoder: pretrain_pho.sh
Graphic Encoder: pretrain_res.sh
Merge: merge.py

You can also directly download the pretrained and merged BERT, Phonetic Encoder, and Graphic Encoder from this, and put them in the pretrained directory:

pretrained
|- pytorch_model.bin
|- vocab.txt
|- config.json

Train

After preparing the data and pretrained model, you can train ReaLiSe by executing the train.sh script. Note that you should set up the PRETRAINED_DIR, DATE_DIR, and OUTPUT_DIR in it.

sh train.sh

Test

Test ReaLiSe using the test.sh script. You should set up the DATE_DIR, CKPT_DIR, and OUTPUT_DIR in it. CKPT_DIR is the OUTPUT_DIR of the training process.

sh test.sh

Well-trained Model

You can also download well-trained model from this direct using. The performance scores of RealiSe and some baseline models on the SIGHAN13, SIGHAN14, SIGHAN15 test set are here:

Methods

FASpell: FASPell: A Fast, Adaptable, Simple, Powerful Chinese Spell Checker Based On DAE-Decoder Paradigm
Soft-Masked BERT: Spelling Error Correction with Soft-Masked BERT
SpellGCN: SpellGCN: Incorporating Phonological and Visual Similarities into Language Models for Chinese Spelling Check
BERT: Our implementation

Metrics

"D" means "Detection Level", "C" means "Correction Level".
"A", "P", "R", "F" means "Accuracy", "Precision", "Recall", and "F1" respectively.

SIGHAN15

Method	D-A	D-P	D-R	D-F	C-A	C-P	C-R	C-F
FASpell	74.2	67.6	60.0	63.5	73.7	66.6	59.1	62.6
Soft-Masked BERT	80.9	73.7	73.2	73.5	77.4	66.7	66.2	66.4
SpellGCN	-	74.8	80.7	77.7	-	72.1	77.7	75.9
BERT	82.4	74.2	78.0	76.1	81.0	71.6	75.3	73.4
ReaLiSe	84.7	77.3	81.3	79.3	84.0	75.9	79.9	77.8

SIGHAN14

Method	D-A	D-P	D-R	D-F	C-A	C-P	C-R	C-F
Pointer Network	-	63.2	82.5	71.6	-	79.3	68.9	73.7
SpellGCN	-	65.1	69.5	67.2	-	63.1	67.2	65.3
BERT	75.7	64.5	68.6	66.5	74.6	62.4	66.3	64.3
ReaLiSe	78.4	67.8	71.5	69.6	77.7	66.3	70.0	68.1

SIGHAN13

Method	D-A	D-P	D-R	D-F	C-A	C-P	C-R	C-F
FASpell	63.1	76.2	63.2	69.1	60.5	73.1	60.5	66.2
SpellGCN	78.8	85.7	78.8	82.1	77.8	84.6	77.8	81.0
BERT	77.0	85.0	77.0	80.8	77.4	83.0	75.2	78.9
ReaLiSe	82.7	88.6	82.5	85.4	81.4	87.2	81.2	84.1

Citation

@misc{xu2021read,
      title={Read, Listen, and See: Leveraging Multimodal Information Helps Chinese Spell Checking}, 
      author={Heng-Da Xu and Zhongli Li and Qingyu Zhou and Chao Li and Zizhen Wang and Yunbo Cao and Heyan Huang and Xian-Ling Mao},
      year={2021},
      eprint={2105.12306},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

A Multi-modal Model Chinese Spell Checker Released on ACL2021.

Related tags

Overview

ReaLiSe

Environment

Data

Raw Data

Data Processing

Pretrain

Train

Test

Well-trained Model

SIGHAN15

SIGHAN14

SIGHAN13

Citation

Owner

DaDa

Optical machine for senses sensing using speckle and deep learning

Learning to Estimate Hidden Motions with Global Motion Aggregation

Tensorflow implementation of soft-attention mechanism for video caption generation.

Pytorch implementation of the paper SPICE: Semantic Pseudo-labeling for Image Clustering

this is a lite easy to use virtual keyboard project for anyone to use

NBEATSx: Neural basis expansion analysis with exogenous variables

Adversarial Self-Defense for Cycle-Consistent GANs

Project page for our ICCV 2021 paper "The Way to my Heart is through Contrastive Learning"

Selene is a Python library and command line interface for training deep neural networks from biological sequence data such as genomes.

Learning from Synthetic Humans, CVPR 2017

Python package provinding tools for artistic interactive applications using AI

PINN(s): Physics-Informed Neural Network(s) for von Karman vortex street

A Pytorch Implementation of [Source data‐free domain adaptation of object detector through domain

Single-Shot Motion Completion with Transformer

Le dataset des images du projet d'IA de 2021

'A C2C E-COMMERCE TRUST MODEL BASED ON REPUTATION' Python implementation

Graph WaveNet apdapted for brain connectivity analysis.

Pytorch port of Google Research's LEAF Audio paper

DANA paper supplementary materials

A PyTorch implementation of the baseline method in Panoptic Narrative Grounding (ICCV 2021 Oral)