General and Domain-adaptive Chinese Spelling Check with Error-consistent Pretraining


Publisher
Association for Computing Machinery
Copyright
Copyright © 2023 Association for Computing Machinery.
ISSN
2375-4699
eISSN
2375-4702
DOI
10.1145/3564271

Abstract

The lack of labeled data is one of the major bottlenecks for Chinese Spelling Check (CSC). Existing research expands the supervised corpus by automatically generating training examples from unlabeled data. However, there is a large gap between real input scenarios and such automatically generated corpora. We therefore develop a competitive general speller, ECSpell, which adopts an error-consistent masking strategy to create pretraining data. This strategy constrains the error types of the automatically generated sentences to be consistent with those observed in real scenes. Experimental results indicate that our model outperforms previous state-of-the-art models on the general benchmark. Moreover, spellers often work within a particular domain in real life. Experiments on the domain-specific datasets we build show that general models perform poorly there, largely because of uncommon domain terms. Inspired by the common practice of input methods, we propose adding an alterable user dictionary to handle the zero-shot domain adaptation problem. Specifically, we attach a User Dictionary guided inference module (UD) to a general token classification based speller. Our experiments demonstrate that ECSpellUD, namely ECSpell combined with UD, surpasses all the other baselines by a wide margin, even approaching its performance on the general benchmark.
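
As a rough illustration of the error-consistent idea described in the abstract, the Python sketch below corrupts clean sentences so that the synthetic error types follow a target distribution. The confusion tables, the mixture weights, and the function name are illustrative assumptions, not values or code from the paper.

import random

# Minimal sketch of error-consistent corruption. The confusion sets and the
# error-type mixture below are placeholders; a real system would derive them
# from pinyin tables, glyph-similarity resources, and observed error statistics.
PHONETIC_CONFUSIONS = {"的": ["地", "得"], "在": ["再"], "做": ["作"]}
VISUAL_CONFUSIONS = {"己": ["已", "巳"], "未": ["末"]}
ERROR_TYPE_PROBS = {"phonetic": 0.8, "visual": 0.15, "keep": 0.05}

def corrupt_sentence(sentence, corruption_rate=0.15, seed=None):
    """Replace some characters with confusable ones, sampling the error type
    so the synthetic noise roughly matches a real error-type distribution."""
    rng = random.Random(seed)
    chars = list(sentence)
    for i, ch in enumerate(chars):
        if rng.random() >= corruption_rate:
            continue
        error_type = rng.choices(list(ERROR_TYPE_PROBS),
                                 weights=list(ERROR_TYPE_PROBS.values()))[0]
        if error_type == "phonetic" and ch in PHONETIC_CONFUSIONS:
            chars[i] = rng.choice(PHONETIC_CONFUSIONS[ch])
        elif error_type == "visual" and ch in VISUAL_CONFUSIONS:
            chars[i] = rng.choice(VISUAL_CONFUSIONS[ch])
        # "keep" (or a character with no confusion entry) is left unchanged.
    return "".join(chars)

if __name__ == "__main__":
    clean = "他已经在图书馆做完作业了"
    print(clean, "->", corrupt_sentence(clean, corruption_rate=0.3, seed=0))

Each (corrupted, original) pair could then serve as a pretraining example for a token-classification speller; the actual ECSpell data construction may differ in its confusion sources and sampling details.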

Journal

ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP), Association for Computing Machinery

Published: May 9, 2023

Keywords: Chinese spelling check
