Automatic Text Corpus Generation Algorithm towards Oral Statistical Language Modeling

SI Yu-Jing, XIAO Ye-Ming, XU Ji, PAN Jie-Lin, YAN Yong-Hong

Citation: SI Yu-Jing, XIAO Ye-Ming, XU Ji, PAN Jie-Lin, YAN Yong-Hong. Automatic Text Corpus Generation Algorithm towards Oral Statistical Language Modeling. ACTA AUTOMATICA SINICA, 2014, 40(12): 2808-2814. doi: 10.3724/SP.J.1004.2014.02808

doi: 10.3724/SP.J.1004.2014.02808
Funds:

Supported by National High Technology Research and Development Program of China (863 Program) (2012AA012503), National Natural Science Foundation of China (10925419, 90920302, 61072124, 11074275, 11161140319, 91120001, 61271426), the Strategic Priority Research Program of Chinese Academy of Sciences (XDA06030100, XDA06030500), and the Chinese Academy of Sciences Priority Deployment Project (KGZD-EW-103-2)

Details
    Author information:

    XIAO Ye-Ming Ph.D. candidate at the Institute of Acoustics, Chinese Academy of Sciences. He received his bachelor degree from Beihang University in 2008. His research interest covers large-vocabulary continuous speech recognition, deep learning, and neural network techniques. E-mail: xiaoyeming@hccl.ioa.ac.cn

    Corresponding author:

    SI Yu-Jing Ph.D. candidate at the Institute of Acoustics, Chinese Academy of Sciences. He received his bachelor degree from the Department of Information Engineering, College of Communication Engineering, Jilin University in 2009. His research interest covers statistical language modeling, speech recognition decoding, machine learning, deep neural networks, and automatic speech-text synchronization. Corresponding author of this paper. E-mail: siyujinglj@126.com

  • Abstract: In relatively resource-scarce domains of automatic speech recognition (ASR), such as telephone-conversation recognition systems, statistical language models (LMs) suffer from severe data sparsity. This paper proposes a sampled-corpus generation algorithm based on equiprobable events, which automatically generates domain-related text to reinforce statistical language modeling. Experimental results show that adding the sampled corpus generated by this algorithm alleviates the sparsity of the language model and thereby improves the performance of the whole speech recognition system: on the development set, LM perplexity is reduced by 7.5% relatively and the character error rate (CER) by 0.2 points absolutely; on the test set, perplexity is reduced by 6% relatively and the CER by 0.4 points absolutely.
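The abstract describes the algorithm only at a high level. As a minimal sketch, assuming the "equiprobable events" are uniform draws over in-domain sentence templates and word-class members, corpus generation could look like the following; the TEMPLATES, WORD_CLASSES, and function names are illustrative assumptions, not the paper's actual implementation.

```python
import random

# Hypothetical in-domain sentence templates with word-class slots. The paper's
# exact event definition is not given in the abstract, so treating each
# template choice and each slot filling as an equiprobable (uniform) event is
# an assumption made for illustration only.
TEMPLATES = [
    "我 想 <ACTION> 一下 <TOPIC>",
    "请 帮 我 <ACTION> <TOPIC>",
    "<TOPIC> 怎么 <ACTION>",
]
WORD_CLASSES = {
    "<ACTION>": ["查询", "办理", "取消"],
    "<TOPIC>": ["话费", "套餐", "流量"],
}


def sample_sentence(rng: random.Random) -> str:
    """Draw one sentence: pick a template uniformly at random, then fill each
    slot by picking a class member uniformly (every outcome equiprobable)."""
    tokens = rng.choice(TEMPLATES).split()
    return " ".join(
        rng.choice(WORD_CLASSES[tok]) if tok in WORD_CLASSES else tok
        for tok in tokens
    )


def generate_corpus(n_sentences: int, seed: int = 0) -> list[str]:
    """Generate a sampled corpus intended to be appended to LM training text."""
    rng = random.Random(seed)
    return [sample_sentence(rng) for _ in range(n_sentences)]


if __name__ == "__main__":
    for line in generate_corpus(5):
        print(line)
```

Text generated this way would then be merged with the original training data before the LM is retrained; the perplexity and CER reductions reported above are attributed to augmenting the training corpus with such sampled text.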
Publication history
  • Received: 2013-12-18
  • Revised: 2014-06-03
  • Published: 2014-12-20
