Title: A Similarity-Based Smoothing Algorithm for Chinese Language Modeling and Its Application to Pinyin-to-Character Conversion
Abstract: Data sparseness is a common and inherent problem of statistical language models; it greatly damages model performance and limits the models' applications. Current smoothing methods, however, are too simple to exploit linguistic knowledge, which holds back further performance improvement. Using word semantic information, this paper introduces a similarity-based smoothing algorithm for Chinese language modeling that combines word similarity calculation with the back-off smoothing method, and presents an iterative method for optimizing the algorithm's parameters. Furthermore, the similarity-based smoothing algorithm is extended from low-level language models to high-level models. Experiments applying our method to a Pinyin-to-Character conversion system show that it significantly improves language model performance and effectively reduces the error rate of Pinyin-to-Character conversion.
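The abstract does not give the formulas, so the following is only a minimal sketch of the general idea of letting similar words lend evidence to a sparse bigram context. Everything in it is an assumption made for illustration: the function names, the toy similarity table `similar`, the fixed parameters `discount` and `beta` (which the paper tunes iteratively by a method not described here), and the use of interpolated absolute discounting in place of strict back-off to keep the estimate normalized.

```python
from collections import Counter, defaultdict


def train_counts(sentences):
    """Collect unigram and bigram counts from pre-segmented sentences."""
    unigrams = Counter()
    bigrams = defaultdict(Counter)
    for sent in sentences:
        unigrams.update(sent)
        for prev, cur in zip(sent, sent[1:]):
            bigrams[prev][cur] += 1
    return unigrams, bigrams


def similarity_estimate(prev, cur, bigrams, similar):
    """Similarity-weighted average of P_ML(cur | neighbor) over neighbors of prev.

    `similar` is a hypothetical table {word: {neighbor: weight}}; returns None
    when no neighbor has any observed successor.
    """
    num = 0.0
    norm = 0.0
    for neighbor, weight in similar.get(prev, {}).items():
        followers = bigrams.get(neighbor)
        if not followers:
            continue
        num += weight * followers[cur] / sum(followers.values())
        norm += weight
    return num / norm if norm > 0 else None


def smoothed_prob(prev, cur, unigrams, bigrams, similar, discount=0.5, beta=0.6):
    """Assumed estimate of P(cur | prev), not the paper's exact formula.

    The discounted bigram estimate keeps most of the mass; the reserved mass is
    spread by a lower-order distribution that mixes the similarity estimate
    with the unigram distribution.
    """
    total = sum(unigrams.values())
    p_uni = unigrams[cur] / total
    p_sim = similarity_estimate(prev, cur, bigrams, similar)
    p_lower = p_uni if p_sim is None else beta * p_sim + (1 - beta) * p_uni

    followers = bigrams.get(prev)
    if not followers:
        return p_lower  # unseen context: rely entirely on the lower-order mix
    context_count = sum(followers.values())
    reserved = discount * len(followers) / context_count
    return max(followers[cur] - discount, 0) / context_count + reserved * p_lower


if __name__ == "__main__":
    # Toy corpus and similarity weight, both invented for this example.
    corpus = [["我", "吃", "苹果"], ["我", "吃", "米饭"],
              ["他", "喝", "茶"], ["他", "吞", "米饭"]]
    unigrams, bigrams = train_counts(corpus)
    similar = {"吞": {"吃": 0.8}}
    print(smoothed_prob("吞", "苹果", unigrams, bigrams, similar))
    print(smoothed_prob("吞", "苹果", unigrams, bigrams, {}))
```

In this toy setup the unseen bigram ("吞", "苹果") receives more probability when "吃" is listed as a similar word than when only the unigram distribution fills the reserved mass, which is the effect a similarity-based smoother aims for.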