A Self-adaptive Lexicon Construction Algorithm for Chinese Language Modeling

XIAO Jing-Hui; LIU Bing-Quan; WANG Xiao-Long

doi:10.3724/SP.J.1004.2008.00040

1. School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001

Corresponding author: XIAO Jing-Hui

CLC number: TP391.12
Publication history
- Received: 2006-09-12
- Revised: 2007-04-26
- Published: 2008-01-20

Abstract

Lexicon quality directly affects the performance of a Chinese language model. However, Chinese lexicon compilation is currently carried out separately from language modeling, which causes two problems. First, existing Chinese language models are limited by the scale of their lexicons and cannot reach their best performance. Second, because domain-specific lexicons are lacking, it is difficult to build language models for particular domains. This paper aims to improve the performance of existing Chinese language models by constructing an optimized lexicon, and to make the models adapt automatically to the domain of the training corpus. First, it combines automatic lexicon construction with Chinese language modeling in a unified iterative algorithm framework, so that a high-performance Chinese language model is obtained while the optimized lexicon is generated. Under this framework, it introduces the concept of character lexical significance (CLS) to describe Chinese lexical information, and combines this lexical feature with statistical features to build a multi-feature algorithm for automatic Chinese lexicon construction. Finally, it proposes two heuristic methods that adjust the system parameters automatically according to the characteristics of the training corpus, so that the system adapts to the corpus domain. Experiments show that the method produces a high-quality lexicon together with a high-performance language model, and that it adapts effectively to the domain of the training corpus.

Keywords:
- Chinese lexicon construction
- language modeling
- character lexical significance
- self-adaptive