Resources and Evaluations of Automated Chinese Error Diagnosis for Language Learners

Lung-Hao Lee, Yuen-Hsien Tseng*, and Li-Ping Chang. In X. Lu & B. Chen (Eds.), Computational and Corpus Approaches to Chinese Language Learning, pages 235-252.

Abstract

Learners of Chinese as a foreign language (CFL) may, in their language production, generate inappropriate linguistic usages, including character-level confusions (commonly known as spelling errors) and word-, sentence-, and discourse-level grammatical errors. Chinese spelling errors frequently arise from confusions among multiple-character words that are phonologically and visually similar but semantically distinct. Chinese grammatical errors are classified at a coarse-grained level by surface differences: missing, redundant, incorrectly selected, and misordered linguistic components. Fine-grained error types further represent linguistic morphology and syntax, covering categories such as verb, noun, preposition, conjunction, and adverb. Annotated learner corpora are important language resources for understanding these error patterns and for supporting the development of error diagnosis systems. In this chapter, we describe two representative Chinese learner corpora: the HSK Dynamic Composition Corpus constructed by Beijing Language and Culture University and the TOCFL Learner Corpus built by National Taiwan Normal University. In addition, we introduce several evaluations based on both learner corpora designed for computer-assisted Chinese learning. One is the series of SIGHAN bakeoffs for Chinese spelling checkers; the other is the series of NLPTEA workshop shared tasks for Chinese grammatical error identification. The purpose of this chapter is to summarize these resources and evaluations in order to better understand the current research developments and challenges of automated Chinese error diagnosis for CFL learners.