Chinese Pinyin-to-character Conversion Based on Cascaded Reranking

LI Xin-Xin; WANG Xuan; YAO Lin; GUAN Jian

doi:10.3724/SP.J.1004.2014.00624

cstr: 32138.14.SP.J.1004.2014.00624

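LI Xin-Xin^1,2, WANG Xuan^1,2, YAO Lin^1,3, GUAN Jian^1,3

1. Computer Application Research Center, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen 518055
2. Shenzhen Applied Technology Engineering Laboratory for Internet Multimedia Application, Shenzhen 518055
3. Public Service Platform of Mobile Internet Application Security Industry, Shenzhen 518057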

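Funding:

Supported by the Major Science and Technology Project of the Ministry of Science and Technology of China (2011ZX03002-004-01) and the Shenzhen Basic Research Key Projects (JC201104210032A, JC201005260112A)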

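Author biography:
WANG Xuan Professor at Harbin Institute of Technology Shenzhen Graduate School. Research interests include artificial intelligence and networked multimedia information processing. E-mail: wangxuan@insun.hit.edu.cn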

Publication history
- Received: 2013-04-22
- Revised: 2013-09-22
- Published: 2014-04-20

Chinese Pinyin-to-character Conversion Based on Cascaded Reranking

LI Xin-Xin^1,2, WANG Xuan^1,2, YAO Lin^1,3, GUAN Jian^1,3

1. Harbin Institute of Technology Shenzhen Graduate School, Shenzhen 518055
2. Shenzhen Applied Technology Engineering Laboratory for Internet Multimedia Application, Shenzhen 518055
3. Public Service Platform of Mobile Internet Application Security Industry, Shenzhen 518057

Funds:

Supported by the Major Science and Technology Project of the Ministry of Science and Technology of China (2011ZX03002-004-01) and the Shenzhen Basic Research Key Projects (JC201104210032A, JC201005260112A)

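Abstract

The N-gram language model is the most commonly used method for Chinese pinyin-to-character conversion. During decoding, however, each new word is determined only by its neighboring preceding words, so the model lacks syntactic and grammatical constraints between distant words. We introduce sub-models such as part-of-speech tagging and dependency parsing to strengthen these constraints, and adopt two reranking methods to exploit the information these sub-models provide: 1) a linear reranking method, which uses minimum error learning to obtain the weight of each sub-model and then produces the probability of each candidate word sequence; 2) an averaged perceptron method, which reranks the candidate word sequences and can exploit complex features such as part-of-speech tags and dependency relations. Experimental results show that both methods effectively improve on the performance of the word N-gram language model. Performance is best when the two methods are cascaded, that is, when linear reranking is applied first and the probability it produces is used as the initial probability of the perceptron reranking method.

Key words:
- Chinese pinyin-to-character conversion
- reranking
- minimum error learning
- perceptron method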
Abstract: The word n-gram language model is the most common approach to Chinese pinyin-to-character conversion. It is simple, efficient, and widely used in practice. However, in the decoding phase of the word n-gram model, each word is determined only by the words preceding it, so the model lacks long-distance grammatical or syntactic constraints. In this paper, we propose two reranking approaches to solve this problem. The linear reranking approach uses the minimum error learning method to combine different sub-models, which include word and character n-gram language models, a part-of-speech tagging model, and a dependency model. The averaged perceptron reranking approach reranks the candidates generated by the word n-gram model by employing features extracted from the word sequence, part-of-speech tags, and dependency tree. Experimental results on the "Lancaster Corpus of Mandarin Chinese" and "People's Daily" show that both reranking approaches effectively utilize syntactic structure information and outperform the word n-gram model. The perceptron reranking approach that takes the probability output of the linear reranking approach as its initial weight achieves the best performance.
Key words:
- Chinese pinyin-to-character conversion
- reranking approach
- minimum error learning
- averaged perceptron
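
The cascade described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. All function names, feature templates, sub-model scores, and weights below are hypothetical. The sub-model weights are set by hand here, whereas the paper learns them with minimum error learning. Dependency-tree features are omitted for brevity. The linear stage scores each candidate as a weighted sum of sub-model log-probabilities, and the averaged perceptron stage adds sparse feature weights on top of that score.

```python
from collections import defaultdict


def linear_score(sub_scores, weights):
    """Stage 1: weighted sum of sub-model log-probabilities."""
    return sum(weights[name] * value for name, value in sub_scores.items())


def extract_features(candidate):
    """Toy sparse features: (word, POS) pairs and word bigrams."""
    feats = defaultdict(float)
    words, tags = candidate["words"], candidate["pos"]
    for word, tag in zip(words, tags):
        feats[("word_pos", word, tag)] += 1.0
    for left, right in zip(["<s>"] + words, words + ["</s>"]):
        feats[("bigram", left, right)] += 1.0
    return feats


def cascaded_score(candidate, weights, theta):
    """Stage 2: linear score as the initial score plus perceptron features."""
    base = linear_score(candidate["sub_scores"], weights)
    feats = extract_features(candidate)
    return base + sum(theta.get(f, 0.0) * v for f, v in feats.items())


def rerank(candidates, weights, theta):
    """Return the index of the highest-scoring candidate."""
    return max(range(len(candidates)),
               key=lambda i: cascaded_score(candidates[i], weights, theta))


def train_averaged_perceptron(data, weights, epochs=5):
    """Train on (candidates, gold_index) pairs and return averaged weights."""
    theta = defaultdict(float)  # current feature weights
    total = defaultdict(float)  # running sum used for averaging
    steps = 0
    for _ in range(epochs):
        for candidates, gold in data:
            pred = rerank(candidates, weights, theta)
            if pred != gold:
                # Promote the reference candidate, demote the wrong prediction.
                for f, v in extract_features(candidates[gold]).items():
                    theta[f] += v
                for f, v in extract_features(candidates[pred]).items():
                    theta[f] -= v
            steps += 1
            for f, v in theta.items():
                total[f] += v
    return {f: v / steps for f, v in total.items()}


# Hand-set sub-model weights (made-up values; the paper learns them).
WEIGHTS = {"word_lm": 1.0, "pos_model": 0.5}

# Two made-up candidates for one pinyin sentence; index 0 is the reference.
CANDIDATES = [
    {"words": ["他", "是", "学生"], "pos": ["PN", "VC", "NN"],
     "sub_scores": {"word_lm": -6.0, "pos_model": -2.0}},
    {"words": ["它", "市", "学生"], "pos": ["PN", "NN", "NN"],
     "sub_scores": {"word_lm": -5.0, "pos_model": -3.0}},
]

theta = train_averaged_perceptron([(CANDIDATES, 0)], WEIGHTS)
best = rerank(CANDIDATES, WEIGHTS, theta)
```

On this toy input the linear stage alone prefers the second candidate, and the perceptron stage trained on the single example moves the reference candidate to the top. Using the linear score as an additive base keeps the sub-model evidence available to the perceptron without re-learning it as features.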
