多尺度视觉语义增强的多模态命名实体识别方法

王海荣; 徐玺; 王彤; 陈芳萍

doi:10.16383/j.aas.c230573

多尺度视觉语义增强的多模态命名实体识别方法

doi: 10.16383/j.aas.c230573

王海荣^{1, 2,},
徐玺^1,,
王彤^1,,
陈芳萍^1,

1.
北方民族大学计算机科学与工程学院银川 750021
2.
北方民族大学图像图形智能处理国家民委重点实验室银川 750021

基金项目: 宁夏自然科学基金(2023AAC03316), 宁夏回族自治区教育厅高等学校科学研究重点项目 (NYG2022051)资助

详细信息

作者简介:
王海荣：北方民族大学教授. 2015年获得东北大学博士学位. 主要研究方向为大数据知识工程与智能信息处理. 本文通信作者. E-mail: wanghr@nun.edu.cn

徐玺：北方民族大学计算机科学与工程学院硕士研究生. 主要研究方向为多模态信息抽取. E-mail: 20217403@stu.nmu.edu.cn

王彤：北方民族大学计算机科学与工程学院硕士研究生. 主要研究方向为多模态信息抽取. E-mail: is_wangtong@163.com

陈芳萍：北方民族大学计算机科学与工程学院硕士研究生. 主要研究方向为多模态信息抽取. E-mail: 17393213357@163.com

计量
- 文章访问数: 1052
- HTML全文浏览量: 916
- PDF下载量: 240
- 被引次数: 0
出版历程
- 收稿日期: 2023-09-13
- 录用日期: 2024-02-22
- 网络出版日期: 2024-04-01
- 刊出日期: 2024-06-27

Multi-scale Visual Semantic Enhancement for Multimodal Named Entity Recognition Method

WANG Hai-Rong^{1, 2
,},
XU Xi^1
,,
WANG Tong^1
,,
CHEN Fang-Ping^1
,

1.
School of Computer Science and Engineering, North Minzu University, Yinchuan 750021
2.
The Key Laboratory of Images & Graphics Intelligent Processing of State Ethnic Affairs Commission, North Minzu University, Yinchuan 750021

Funds: Supported by Natural Science Foundation of Ningxia (2023AAC03316) and Key Research Project of Education Department of Ningxia Hui Autonomous Region (NYG2022051)

More Information

Author Bio:
WANG Hai-Rong　Professor at No-rth Minzu University. She received her Ph.D. degree from Northeastern University in 2015. Her research interest covers big data knowledge engineering and intelligent information processing. Corresponding author of this paper

XU Xi　Master student at the Sch-ool of Computer Science and Engineering, North Minzu University. His main research interest is multimodal information extraction

WANG Tong　Master student at the School of Computer Science and Engineering, North Minzu Univer-sity. Her main research interest is multimodal information extraction

CHEN Fang-Ping　Master student at the School of Computer Science and Engineering, North Minzu University. Her main research interest is multimodal information extraction

摘要

摘要: 为解决多模态命名实体识别(Multimodal named entity recognition, MNER)方法研究中存在的图像特征语义缺失和多模态表示语义约束较弱等问题, 提出多尺度视觉语义增强的多模态命名实体识别方法(Multi-scale visual semantic enhancement for multimodal named entity recognition method, MSVSE). 该方法提取多种视觉特征用于补全图像语义, 挖掘文本特征与多种视觉特征间的语义交互关系, 生成多尺度视觉语义特征并进行融合, 得到多尺度视觉语义增强的多模态文本表示; 使用视觉实体分类器对多尺度视觉语义特征解码, 实现视觉特征的语义一致性约束; 调用多任务标签解码器挖掘多模态文本表示和文本特征的细粒度语义, 通过联合解码解决语义偏差问题, 从而进一步提高命名实体识别准确度. 为验证该方法的有效性, 在Twitter-2015和Twitter-2017数据集上进行实验, 并与其他10种方法进行对比, 该方法的平均F1值得到提升.
- 多模态命名实体识别 /
- 多任务学习 /
- 多模态融合 /
- Transformer
Abstract: To address the issues of semantic loss in image features and weak semantic constraints in multimodal representations encountered in the research of multimodal named entity recognition (MNER) methods, multi-scale visual semantic enhancement for multimodal named entity recognition method (MSVSE) is proposed. After supplementing image semantics by extracting multiple visual features, the semantic interaction and feature fusion between text features and various visual features are explored through a multimodal feature fusion module. This process outputs multi-scale visual semantic-enhanced multimodal text representations. The visual entity classifier is used to decode multi-scale visual semantic features to learn the semantic consistency between various visual features. The multi-task decoder is invoked to mine the fine-grained semantic representation in multimodal text repre-sentation and text features, and carry out joint decoding to solve the semantic bias problem, thereby further improving the accuracy of named entity recognition. To verify the effectiveness of the method, experiments were carried out on Twitter-2015 and Twitter-2017 respectively, and compared with other 10 methods. The average F1 values of the MSVSE on the two datasets have increased.
- Multimodal named entity recognition (MNER) /
- multi-task learning /
- multimodal fusion /
- Transformer

HTML全文

图 1 MSVSE模型框架

Fig. 1 The framework of MSVSE model

下载: 全尺寸图片幻灯片

图 2 多模态特征融合模块

Fig. 2 The multimodal feature fusion module

下载: 全尺寸图片幻灯片

图 3 多任务标签解码器

Fig. 3 The multi-task label decoder

下载: 全尺寸图片幻灯片

图 4 在Twitter-2015上的视觉实体分类性能比较

Fig. 4 Performance comparison of visual entity classification on Twitter-2015

下载: 全尺寸图片幻灯片

图 5 在Twitter-2017上的视觉实体分类性能比较

Fig. 5 Performance comparison of visual entity classification on Twitter-2017

下载: 全尺寸图片幻灯片

表 1 数据集上方法性能比较(%)

Table 1 Performance comparison of method on dataset (%)

方法	Twitter-2015					Twitter-2017
方法	PER	LOC	ORG	MISC	F1	PER	LOC	ORG	MISC	F1
MSB	86.44	77.16	52.91	36.05	73.47	—	—	—	—	84.32
MAF	84.67	81.18	63.35	41.82	73.42	91.51	85.80	85.10	68.79	86.25
UMGF	84.26	83.17	62.45	42.42	74.85	91.92	85.22	83.13	69.83	85.51
M3S	86.05	81.32	62.97	41.36	75.03	92.73	84.81	82.49	69.53	86.06
UMT	85.24	81.58	63.03	39.45	73.41	91.56	84.73	82.24	70.10	85.31
UAMNer	84.95	81.28	61.41	38.34	73.10	90.49	81.52	82.09	64.32	84.90
VAE	85.82	81.56	63.20	43.67	75.07	91.96	81.89	84.13	74.07	86.37
MNER-QG	85.68	81.42	63.62	41.53	74.94	93.17	86.02	84.64	71.83	87.25
RGCN	86.36	82.08	60.78	41.56	75.00	92.86	86.10	84.05	72.38	87.11
HvpNet	85.74	81.78	61.92	40.81	74.33	92.28	84.81	84.37	65.20	85.80
MSVSE	86.72	81.63	64.08	38.91	75.11	93.24	85.96	85.22	70.00	87.34
–HvpNet	0.98	–0.15	2.16	–1.90	0.78	0.96	1.15	0.85	4.80	1.54

下载: 导出CSV

表 2 模型结构消融实验(%)

Table 2 Structural ablation experiments for the model (%)

方法	Twitter-2015					Twitter-2017
方法	PER	LOC	ORG	MISC	F1	PER	LOC	ORG	MISC	F1
MSVSE	86.72	81.63	64.08	38.91	75.11	93.24	85.96	85.22	70.00	87.34
w/o自注意力机制	86.49	81.20	63.21	41.56	74.83	93.05	86.52	84.37	67.34	86.79
w/o相似度	86.33	81.59	63.15	40.84	74.91	92.94	86.59	84.07	68.24	86.75
w/o自注意力机制加相似度	86.80	81.38	63.32	39.62	74.67	92.97	85.87	84.41	67.96	86.67
w/o多任务标签解码器	86.49	81.78	62.68	37.60	74.69	92.98	84.83	85.02	71.66	87.14
w/o视觉实体分类器	86.52	81.64	63.06	39.89	74.79	93.37	84.83	85.82	66.24	86.92

下载: 导出CSV

表 3 联合编码器中视觉特征消融实验(%)

Table 3 Visual feature ablation experiments in the joint encoder (%)

文本	视觉标签	图像描述	Twitter-2015					Twitter-2017
文本	视觉标签	图像描述	PER	LOC	ORG	MISC	F1	PER	LOC	ORG	MISC	F1
$ \checkmark$	—	$ \checkmark$	86.72	81.63	64.08	38.91	75.11	93.24	85.96	85.22	70.00	87.34
$ \checkmark$	—	—	86.76	81.68	61.21	39.46	74.73	92.95	86.20	84.60	70.82	87.11
$ \checkmark$	$ \checkmark$	—	86.87	81.74	63.72	37.80	74.87	93.03	85.71	84.43	71.71	87.16
$ \checkmark$	$ \checkmark$	$ \checkmark$	86.51	81.85	62.20	38.36	74.72	93.73	85.96	84.62	70.97	87.38

下载: 导出CSV

表 4 多尺度视觉语义前缀中视觉特征消融实验(%)

Table 4 Visual feature ablation experiments in multi-scale visual semantic prefixes (%)

区域视觉特征	视觉标签	图像描述	Twitter-2015					Twitter-2017
区域视觉特征	视觉标签	图像描述	PER	LOC	ORG	MISC	F1	PER	LOC	ORG	MISC	F1
$ \checkmark$	$ \checkmark$	$ \checkmark$	86.72	81.63	64.08	38.91	75.11	93.24	85.96	85.22	70.00	87.34
$ \checkmark$	—	—	86.25	81.93	63.99	38.23	74.76	93.16	84.83	85.47	69.10	87.13
$ \checkmark$	$ \checkmark$	—	86.56	81.60	64.01	38.59	74.93	93.02	85.79	85.97	68.67	87.28
$ \checkmark$	—	$ \checkmark$	86.87	81.79	63.36	38.68	74.98	92.94	86.52	85.14	68.94	87.14

下载: 导出CSV

表 5 单尺度视觉特征下方法性能对比(%)

Table 5 Performance comparison of methods under single scale visual feature (%)

方法	单尺度视觉特征	Twitter-2015 F1	Twitter-2017 F1
MAF	区域视觉特征	73.42	86.25
MSB	图像标签	73.47	84.32
ITA	视觉标签	75.18	85.67
ITA	5个视觉描述	75.17	85.75
ITA	光学字符识别	75.01	85.64
MSVSE	only区域视觉特征	74.84	86.75
MSVSE	only视觉标签	74.66	87.17
MSVSE	only视觉描述	74.56	87.23
MSVSE	w/o视觉前缀	74.89	87.08
MSVSE (本文方法)		75.11	87.34

下载: 导出CSV

表 6 不同学习率的方法性能对比(%)

Table 6 Performance comparison of methods under different learning rates (%)

数据集	学习率($\times\; { {10}^{-5} }$)
数据集	1	2	3	4	5	6
Twitter-2015	73.4	75.0	75.1	74.8	74.6	74.5
Twitter-2017	87.1	86.8	87.3	87.5	87.2	87.3

下载: 导出CSV

表 7 参数量及时间效率对比

Table 7 Comparison of parameter number and time efficiency

方法	参数量(MB)	训练时间(s)	验证时间(s)
MSB	122.97	45.80	3.31
UMGF	191.32	314.42	18.73
MAF	136.09	103.39	6.37
ITA	122.97	65.40	4.69
UMT	148.10	156.73	8.59
HvpNet	143.34	70.36	9.34
MSVSE (本文方法)	119.27	75.81	7.03

下载: 导出CSV

表 8 基于预训练语言模型的MNER方法性能对比(%)

Table 8 Performance comparison of MNER method based on pre-trained language model (%)

方法	Twitter-2015	Twitter-2017
Glove-BiLSTM-CRF	69.15	79.37
BERT-CRF	71.81	83.44
BERT-large-CRF	73.53	86.81
XLMR-CRF	77.37	89.39
Prompting ChatGPT	79.33	91.43
MSVSE	75.11	87.34

下载: 导出CSV

参考文献(28)

[1]	Moon S, Neves L, Carvalho V. Multimodal named entity recognition for short social media posts. In: Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. New Orleans, USA: NAACL Press, 2018. 852−860
[2]	Lu D, Neves L, Carvalho V, Zhang N, Ji H. Visual attention model for name tagging in multimodal social media. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics. Melbourne, Australia: 2018. 1990−1999
[3]	Asgari-Chenaghlu M, Farzinvash M R, Farzinvash L, Balafar M A, Motamed C. CWI: A multimodal deep learning approach for named entity recognition from social media using character, word and image features. Neural Computing and Applications, 2022, 34(3): 1905−1922 doi: 10.1007/s00521-021-06488-4
[4]	Zhang Q, Fu J L, Liu X Y, Huang X J. Adaptive co-attention network for named entity recognition in tweets. In: Proceedings of the 32nd AAAI Conference on Artificial Intelligence, the 30th Innovative Applications of Artificial Intelligence Conference, and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence. New Orleans, USA: AAAI Press, 2018. 5674− 5681
[5]	Zheng C M, Wu Z W, Wang T, Cai Y, Li Q. Object-aware multimodal named entity recognition in social media posts with adversarial learning. IEEE Transactions on Multimedia, 2020, 23: 2520−2532
[6]	Wu Z W, Zheng C M, Cai Y, Chen J Y, Leung H F, Li Q. Multimodal representation with embedded visual guiding objects for named entity recognition in social media posts. In: Proceedings of the 28th ACM International Conference on Multimedia. Seattle, USA: ACM, 2020. 1038−1046
[7]	Yu J F, Jiang J, Yang L, Xia R. Improving multimodal named entity recognition via entity span detection with unified multimodal transformer. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Virtual Event: 2020. 3342−3352
[8]	Xu B, Huang S Z, Sha C F, Wang H Y. MAF: A general matching and alignment framework for multimodal named entity recognition. In: Proceedings of the 15th ACM International Conference on Web Search and Data Mining. New York, USA: Association for Computing Machinery, 2022. 1215−1223
[9]	Wang X W, Ye J B, Li Z X, Tian J F, Jiang Y, Yan M, et al. CAT-MNER: Multimodal named entity recognition with knowledge refined cross-modal attention. In: Proceedings of the IEEE International Conference on Multimedia and Exposition. Taipei, China: 2022. 1−6
[10]	Zhang D, Wei S Z, Li S S, Wu H Q, Zhu Q M, Zhou G D. Multimodal graph fusion for named entity recognition with targeted visual guidance. In: Proceedings of the AAAI Conference on Artificial Intelligence. Palo Alto, USA: AAAI Press, 2021. 14347− 14355
[11]	钟维幸, 王海荣, 王栋, 车淼. 多模态语义协同交互的图文联合命名实体识别方法. 广西科学, 2022, 29(4): 681−690 Zhong Wei-Xing, Wang Hai-Rong, Wang Dong, Che Miao. lmage-text joint named entity recognition method based on multimodal semantic interaction. Guangxi Sciences, 2022, 29(4): 681−690
[12]	Yu T, Sun X, Yu H F, Li Y, Fu K. Hierarchical self-adaptation network for multimodal named entity recognition in social media. Neurocomputing, 2021, 439 : 12−21
[13]	Wang X Y, Gui M, Jiang Y, Jia Z X, Bach N, Wang T, et al. ITA: Image-text alignments for multimodal named entity recognition. In: Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Seattle, USA: Association for Computational Linguistics, 2022. 3176−3189
[14]	Liu L P, Wang M L, Zhang M Z, Qing L B, He X H. UAMNer: Uncertainty aware multimodal named entity recognition in social media posts. Applied Intelligence, 2022, 52(4): 4109−4125 doi: 10.1007/s10489-021-02546-5
[15]	李晓腾, 张盼盼, 勾智楠, 高凯. 基于多任务学习的多模态命名实体识别方法. 计算机工程, 2023, 49(4): 114−119 Li Xiao-Teng, Zhang Pan-Pan, Gou Zhi-Nan, Gao Kai. Multimodal named entity recognition method based on multi-task learning. Computer Engineering, 2023, 49(4): 114−119
[16]	Wang J, Yang Y, Liu K Y, Zhu Z P, Liu X R. M3S: Scene graph driven multi-granularity multi-task learning for multimodal NER. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023, 31 : 111−120
[17]	Chen X, Zhang N Y, Li L, Yao Y Z, Deng S M, Tan C Q, et al. Good visual guidance make a better extractor: Hierarchical visual prefix for multimodal entity and relation extraction. In: Proceedings of the Association for Computational Linguistics. Seattle, USA: Association for Computational Linguistics, 2022. 1607−1618
[18]	Jia M, Shen L, Shen X, Liao L J, Chen M, He X D, et al. MNER-QG: An end-to-end MRC framework for multimodal named entity recognition with query grounding. AAAI, 2022, 37(7): 8032−8040
[19]	Sun L, Wang J Q, Su Y D, Weng F S, Sun Y X, Zheng Z W, et al. RIVA: A pre-trained tweet multimodal model based on text-image relation for multimodal NER. In: Proceedings of the 28th International Conference on Computational Linguistics. Virtual Event: 2022. 1852−1862
[20]	Sun L, Wang J Q, Zhang K, Su Y D, Weng F S. RpBERT: A text-image relation propagation-based BERT model for multimodal NER. In: Proceedings of the AAAI Conference on Artificial Intelligence. Virtual Event: 2021. 13860−13868
[21]	Xu B, Huang S, Du M, Wang H Y, Song H, Sha C F, et al. Different data, different modalities reinforced data splitting for effective multimodal information extraction from social media posts. In: Proceedings of the 29th International Conference on Computational Linguistics. Virtual Event: 2022. 1855−1864
[22]	Zhao F, Li C H, Wu Z, Xing S Y, Dai X Y. Learning from different text-image pairs: A relation-enhanced graph convolutional network for multimodal NER. In: Proceedings of the 30th ACM International Conference on Multimedia, Association for Computing Machinery. New York, USA: 2022. 3983−3992
[23]	Zhou B H, Zhang Y, Song K H, Guo W Y, Zhao G Q, Wang W B, et al. A span-based multimodal variational autoencoder for semi-supervised multimodal named entity recognition. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing. Abu Dhabi, United Arab Emirates: Association for Computational Linguistics, 2022. 6293−6302
[24]	He K M, Gkioxari G, Dollár P, Girshick R. Mask-RCNN. In: Proceedings of the IEEE International Conference on Computer Vision. Venice, Italy: 2017. 2980−2988
[25]	Vinyals O, Toshev A, Bengio S, Erhan D. Show and tell: A neural image caption generator. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: 2015. 3156−3164
[26]	He K M, Zhang X Y, Ren S Q, Sun J. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: 2016. 770−778
[27]	王海荣, 徐玺, 王彤, 荆博祥. 多模态命名实体识别方法研究进展. 多模态命名实体识别方法研究进展. 郑州大学学报(工学版), 2024, 45 (2): 60−71 doi: 10.13705/j.issn.1671-6833 Wang Hai-Rong, Xu Xi, Wang Tong, Jing Bo-Xiang. Research progress of multimodal named entity recognition. Journal of Zhengzhou University (Engineering Science), 2024, 45 (2): 60−71 doi: 10.13705/j.issn.1671-6833
[28]	Li J Y, Li H, Pan Z, Sun D, Wang J H, Zhang W K, et al. Prompting ChatGPT in MNER: Enhanced multimodal named entity recognition with auxiliary refined knowledge. In: Proceedings of the Association for Computational Linguistics. Singapore: 2023. 2787−2802