An Entity Linking Method for Microblog Based on Semantic Categorization by Word Embeddings

FENG Chong, SHI Ge, GUO Yu-Hang, GONG Jing, HUANG He-Yan

Citation: FENG Chong, SHI Ge, GUO Yu-Hang, GONG Jing, HUANG He-Yan. An Entity Linking Method for Microblog Based on Semantic Categorization by Word Embeddings. ACTA AUTOMATICA SINICA, 2016, 42(6): 915-922. doi: 10.16383/j.aas.2016.c150715

doi: 10.16383/j.aas.2016.c150715
Funds:

Specialized Research Fund for the Doctoral Program of Higher Education (20121101120026)

National Natural Science Foundation of China (61502035)

National Basic Research Program of China (973 Program) (2013CB329303)

National High Technology Research and Development Program of China (863 Program) (2015AA015404)

More Information
    Author Bio:

    SHI Ge  Ph.D. candidate at the College of Computer Science and Technology, Beijing Institute of Technology. His research interests cover natural language processing, entity linking, and question answering systems. E-mail: zhi@126.com

    GUO Yu-Hang  Lecturer at the College of Computer Science and Technology, Beijing Institute of Technology. He received his Ph.D. degree from the School of Computer Science and Technology, Harbin Institute of Technology in 2014. His research interests cover natural language processing, information extraction, and machine translation. E-mail: guoyuhang@bit.edu.cn

    GONG Jing  Master student at the College of Computer Science and Technology, Beijing Institute of Technology. Her research interests cover natural language processing, machine translation, and question answering systems. E-mail: gongjing@bit.edu.cn

    HUANG He-Yan  Professor at the College of Computer Science and Technology, Beijing Institute of Technology. She received her Ph.D. degree in computer science and technology from the Institute of Computing Technology, Chinese Academy of Sciences in 1989. Her research interests cover natural language processing, machine translation, social networks, information retrieval, and intelligent processing systems. E-mail: hhy63@bit.edu.cn

    Corresponding author:

    FENG Chong  Associate professor at the College of Computer Science and Technology, Beijing Institute of Technology. He received his Ph.D. degree from the Department of Computer Science, University of Science and Technology of China in 2005. His research interests cover natural language processing, information extraction, and machine translation. Corresponding author of this paper. E-mail: fengchong@bit.edu.cn
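  • Abstract: Microblog entity linking is the task of linking a given mention in a microblog post to its entry in a knowledge base. It is widely used in natural language processing (NLP) tasks such as information extraction and question answering. Because microblog posts are short, algorithms designed for entity linking in long texts do not transfer well to microblogs. Most previous work disambiguates by modeling the mention and its context, which makes it hard to tell apart candidate entities that share similar lexical and syntactic features. This paper exploits the semantic information carried by the mention and the candidate entities themselves, models the task at the word embedding level, and proposes a microblog entity linking method based on semantic categorization by word embeddings. The method first trains word embeddings with a neural network, then clusters entities to obtain category labels that serve as features, and finally uses a multi-class classifier to predict the topic category of the target entity, which completes the disambiguation. Experiments on the public NLPCC 2014 evaluation dataset show that both the precision and the recall of the method exceed the best previously reported results, with a notable gain in entity linking precision.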
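The three steps named in the abstract (embed, cluster entities into categories, predict the category of the target entity) can be sketched as follows. This is only an illustration under stated assumptions, not the paper's implementation: the 2-D vectors are made up, the k-means initialisation is fixed by hand, and a nearest-centroid rule stands in for the multi-class classifier. All names (`ENTITY_VECS`, `WORD_VECS`, `kmeans`, `link`) are hypothetical.

```python
import math

# Hypothetical 2-D "embeddings"; real vectors would come from a trained model.
ENTITY_VECS = {
    "苹果(公司)": (0.9, 0.1),
    "微软(公司)": (0.8, 0.2),
    "苹果(果实)": (0.1, 0.9),
    "香蕉(果实)": (0.2, 0.8),
}
WORD_VECS = {"手机": (0.85, 0.15), "发布": (0.7, 0.3), "好吃": (0.1, 0.95)}


def mean(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return tuple(sum(v[i] for v in vectors) / n for i in range(len(vectors[0])))


def nearest(vec, centroids):
    """Index of the centroid closest to vec (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(vec, centroids[i]))


def kmeans(vectors, init_centroids, iters=10):
    """Plain Lloyd iterations from fixed initial centroids (deterministic)."""
    centroids = list(init_centroids)
    for _ in range(iters):
        groups = [[] for _ in centroids]
        for v in vectors:
            groups[nearest(v, centroids)].append(v)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [mean(g) if g else c for g, c in zip(groups, centroids)]
    return centroids


def link(context_words, candidates, centroids):
    """Pick the candidate whose cluster matches the predicted context category."""
    vecs = [WORD_VECS[w] for w in context_words if w in WORD_VECS]
    if not vecs:
        return None
    # Assumption: nearest centroid stands in for the multi-class classifier.
    predicted = nearest(mean(vecs), centroids)
    matches = [c for c in candidates
               if nearest(ENTITY_VECS[c], centroids) == predicted]
    return matches[0] if matches else None


centroids = kmeans(list(ENTITY_VECS.values()), [(1.0, 0.0), (0.0, 1.0)])
tech_link = link(["手机", "发布"], ["苹果(果实)", "苹果(公司)"], centroids)
food_link = link(["好吃"], ["苹果(果实)", "苹果(公司)"], centroids)
```

In this toy setup the ambiguous mention 苹果 resolves to the company when the context words fall in the technology cluster and to the fruit when they fall in the food cluster. That is the intuition behind using category labels for disambiguation.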
  • Fig. 1  Model of semantic categorization by word embeddings

    Fig. 2  Example of the training data

    Fig. 3  Process of entity linking

    Fig. 4  F1 scores of the proposed method under different values of the parameter λ

    Fig. 5  Average F1 scores of SCWE under different values of the parameter k

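    Table 1  Statistics of the training data

    | Item | Value |
    |---|---|
    | Average number of nouns per microblog post | 7.91 |
    | Posts with more than 7 nouns in the same semantic category | 34 |
    | Posts with more than 6 nouns in the same semantic category | 81 |
    | Posts with more than 5 nouns in the same semantic category | 207 |
    | Posts with more than 4 nouns in the same semantic category | 416 |
    | Posts with more than 3 nouns in the same semantic category | 502 |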

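    Table 2  Examples of synonym lexicon

    | Mention in text (Key) | Standard entity form (Value) |
    |---|---|
    | 迈克尔乔丹 | 迈克尔·乔丹 |
    | 飞人 | 迈克尔·乔丹 |
    | 篮球之神 | 迈克尔·乔丹 |
    | 迈克尔·杰弗里·乔丹 | 迈克尔·乔丹 |
    | Michael Jordan | 迈克尔·乔丹 |
    | Michael Jeffrey Jordan | 迈克尔·乔丹 |
    | 乔丹 | 迈克尔·乔丹 |

    Here 迈克尔·乔丹 is Michael Jordan; 飞人 ("flying man") and 篮球之神 ("god of basketball") are nicknames, and 乔丹 is the surname alone.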

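    Table 3  Examples of ambiguity lexicon

    | Standard entity form (Key) | Unambiguous real entity (List) | English gloss |
    |---|---|---|
    | 苹果 | 苹果(果树) | apple (fruit tree) |
    | 苹果 | 苹果(果实) | apple (fruit) |
    | 苹果 | 苹果(公司) | Apple (company) |
    | 苹果 | 苹果(人物) | Apple (person) |
    | 苹果 | 苹果(动漫角色) | Apple (anime character) |
    | 苹果 | 苹果(歌曲) | Apple (song) |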

    Table 4  Examples of entity frequency

    | Unambiguous real entity (Entity) | Number of occurrences (Frequency) |
    |---|---|
    | 苹果(果树) | 26 |
    | 苹果(果实) | 39 |
    | 苹果(公司) | 158 |
    | 苹果(人物) | 2 |

    Table 5  Weights of entity frequency

    Weights by popularity rank, in the order listed: 1, 0.8, 0.7, 0.6, 0.5
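The lexicon and popularity tables above suggest a simple lookup chain: mention, then standard form, then list of unambiguous entities, then popularity weight. The following is a minimal sketch of that chain using the table excerpts as toy data. The function names are hypothetical. It assumes that the five weights in Table 5 correspond to ranks 1 through 5 in order and that lower ranks reuse the last weight; neither assumption is confirmed by this page.

```python
# Toy excerpts of the three lexicons; the real tables are far larger.
SYNONYMS = {"飞人": "迈克尔·乔丹", "Michael Jordan": "迈克尔·乔丹"}
AMBIGUOUS = {"苹果": ["苹果(果树)", "苹果(果实)", "苹果(公司)", "苹果(人物)"]}
FREQUENCY = {"苹果(果树)": 26, "苹果(果实)": 39, "苹果(公司)": 158, "苹果(人物)": 2}

# Weights in the order listed in Table 5 (assumed to be ranks 1..5).
RANK_WEIGHTS = [1.0, 0.8, 0.7, 0.6, 0.5]


def candidates(mention):
    """Map a mention to its standard form, then expand to unambiguous entities."""
    standard = SYNONYMS.get(mention, mention)
    return AMBIGUOUS.get(standard, [standard])


def popularity_weights(entities):
    """Rank entities by frequency (descending) and attach the rank weight.

    Assumption: ranks beyond the fifth reuse the last weight (0.5).
    """
    ranked = sorted(entities, key=lambda e: FREQUENCY.get(e, 0), reverse=True)
    last = len(RANK_WEIGHTS) - 1
    return {e: RANK_WEIGHTS[min(i, last)] for i, e in enumerate(ranked)}


weights = popularity_weights(candidates("苹果"))
```

With the Table 4 counts, 苹果(公司) is the most frequent candidate and receives the top weight, while 苹果(人物) ranks fourth.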

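    Table 6  Scale of experiment data

    | Data type | Size |
    |---|---|
    | Synonym lexicon, total Keys | 4 293 406 |
    | Synonym lexicon, total Values | 1 948 277 |
    | Ambiguity lexicon, total Keys | 213 764 |
    | Ambiguity lexicon, total Values | 2 354 687 |
    | Total entities | 4 369 348 |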

    Table 7  Results of in-KB

    | System | Precision | Recall | F1 |
    |---|---|---|---|
    | NLPCC | 0.7927 | 0.8488 | 0.8198 |
    | SCWE+EF | 0.8137 | 0.8593 | 0.8358 |
    | EF* | 0.7641 | 0.8142 | 0.7884 |
    | CMEL | 0.7951 | 0.8345 | 0.8143 |
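As a sanity check on the in-KB table above, F1 can be recomputed from the listed precision and recall using the standard harmonic-mean definition. The sketch below does this; the names `f1_score` and `ROWS` are illustrative. The recomputed values match the reported ones to within rounding.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# in-KB rows from Table 7: (precision, recall, reported F1)
ROWS = {
    "NLPCC": (0.7927, 0.8488, 0.8198),
    "SCWE+EF": (0.8137, 0.8593, 0.8358),
    "EF*": (0.7641, 0.8142, 0.7884),
    "CMEL": (0.7951, 0.8345, 0.8143),
}

recomputed = {name: f1_score(p, r) for name, (p, r, _) in ROWS.items()}
# Largest absolute difference between recomputed and reported F1.
max_gap = max(abs(recomputed[name] - ROWS[name][2]) for name in ROWS)
```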

    Table 8  Results of NIL

    | System | Precision | Recall | F1 |
    |---|---|---|---|
    | NLPCC | 0.9024 | 0.8653 | 0.8835 |
    | SCWE+EF | 0.9144 | 0.8763 | 0.8949 |
    | EF* | 0.8871 | 0.8648 | 0.8758 |
    | CMEL | 0.8543 | 0.8694 | 0.8461 |

    Table 9  Results of in-KB

    | λ | Precision | Recall | F1 |
    |---|---|---|---|
    | 0 | 0.7532 | 0.8016 | 0.7766 |
    | 0.2 | 0.7621 | 0.8158 | 0.788 |
    | 0.4 | 0.7943 | 0.8375 | 0.8153 |
    | 0.6 | 0.8137 | 0.8593 | 0.8358 |
    | 0.8 | 0.8032 | 0.8432 | 0.8227 |
    | 1 | 0.7983 | 0.8488 | 0.8228 |

    Table 10  Results of NIL

    | λ | Precision | Recall | F1 |
    |---|---|---|---|
    | 0 | 0.8432 | 0.8532 | 0.8482 |
    | 0.2 | 0.8643 | 0.8713 | 0.8678 |
    | 0.4 | 0.8917 | 0.8732 | 0.8824 |
    | 0.6 | 0.9148 | 0.8762 | 0.8951 |
    | 0.8 | 0.9032 | 0.8754 | 0.8891 |
    | 1 | 0.9013 | 0.8743 | 0.8876 |
Publication history
  • Received: 2015-10-29
  • Accepted: 2016-05-03
  • Published: 2016-06-20