Wang Hai-Rong, Xu Xi, Wang Tong, Chen Fang-Ping

Wang Hai-Rong, Xu Xi, Wang Tong, Chen Fang-Ping. Multi-scale visual semantic enhancement for multimodal named entity recognition method. Acta Automatica Sinica, 2024, 50(6): 1234−1245. doi: 10.16383/j.aas.c230573
Multi-scale Visual Semantic Enhancement for Multimodal Named Entity Recognition Method

Funds: Supported by Natural Science Foundation of Ningxia (2023AAC03316) and Key Research Project of Education Department of Ningxia Hui Autonomous Region (NYG2022051)
    Author Bio:

    WANG Hai-Rong  Professor at North Minzu University. She received her Ph.D. degree from Northeastern University in 2015. Her research interests cover big data knowledge engineering and intelligent information processing. Corresponding author of this paper. E-mail: wanghr@nun.edu.cn

    XU Xi  Master student at the School of Computer Science and Engineering, North Minzu University. His main research interest is multimodal information extraction. E-mail: 20217403@stu.nmu.edu.cn

    WANG Tong  Master student at the School of Computer Science and Engineering, North Minzu University. Her main research interest is multimodal information extraction. E-mail: is_wangtong@163.com

    CHEN Fang-Ping  Master student at the School of Computer Science and Engineering, North Minzu University. Her main research interest is multimodal information extraction. E-mail: 17393213357@163.com

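  • Abstract: Research on multimodal named entity recognition (MNER) suffers from missing semantics in image features and weak semantic constraints on multimodal representations. To address these problems, a multi-scale visual semantic enhancement method for multimodal named entity recognition (MSVSE) is proposed. The method extracts several kinds of visual features to complete the image semantics, mines the semantic interactions between the text features and these visual features, and generates and fuses multi-scale visual semantic features to obtain a multimodal text representation enhanced by multi-scale visual semantics. A visual entity classifier decodes the multi-scale visual semantic features, which imposes a semantic consistency constraint on the visual features. A multi-task label decoder mines the fine-grained semantics of the multimodal text representation and the text features, and joint decoding resolves semantic bias, further improving named entity recognition accuracy. To verify the effectiveness of the method, experiments are conducted on the Twitter-2015 and Twitter-2017 datasets and the method is compared with 10 other methods. Its average F1 score improves, reaching 75.11% on Twitter-2015 and 87.34% on Twitter-2017.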
  • Fig. 1  Framework of the MSVSE model

    Fig. 2  The multimodal feature fusion module

    Fig. 3  The multi-task label decoder

    Fig. 4  Performance comparison of visual entity classification on Twitter-2015

    Fig. 5  Performance comparison of visual entity classification on Twitter-2017

    Table 1  Performance comparison of methods on the datasets (%)

    Method          Twitter-2015                       Twitter-2017
                    PER    LOC    ORG    MISC   F1     PER    LOC    ORG    MISC   F1
    MSB             86.44  77.16  52.91  36.05  73.47  -      -      -      -      84.32
    MAF             84.67  81.18  63.35  41.82  73.42  91.51  85.80  85.10  68.79  86.25
    UMGF            84.26  83.17  62.45  42.42  74.85  91.92  85.22  83.13  69.83  85.51
    M3S             86.05  81.32  62.97  41.36  75.03  92.73  84.81  82.49  69.53  86.06
    UMT             85.24  81.58  63.03  39.45  73.41  91.56  84.73  82.24  70.10  85.31
    UAMNer          84.95  81.28  61.41  38.34  73.10  90.49  81.52  82.09  64.32  84.90
    VAE             85.82  81.56  63.20  43.67  75.07  91.96  81.89  84.13  74.07  86.37
    MNER-QG         85.68  81.42  63.62  41.53  74.94  93.17  86.02  84.64  71.83  87.25
    RGCN            86.36  82.08  60.78  41.56  75.00  92.86  86.10  84.05  72.38  87.11
    HvpNet          85.74  81.78  61.92  40.81  74.33  92.28  84.81  84.37  65.20  85.80
    MSVSE           86.72  81.63  64.08  38.91  75.11  93.24  85.96  85.22  70.00  87.34
    MSVSE − HvpNet  0.98   −0.15  2.16   −1.90  0.78   0.96   1.15   0.85   4.80   1.54
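The final row of Table 1 reports the gain of MSVSE over HvpNet, score by score. A minimal Python check of that arithmetic is sketched below; it uses only the values printed in the table, and the variable names are illustrative rather than taken from the paper.

```python
# Rows copied from Table 1: five scores per dataset, Twitter-2015 first.
msvse = [86.72, 81.63, 64.08, 38.91, 75.11, 93.24, 85.96, 85.22, 70.00, 87.34]
hvpnet = [85.74, 81.78, 61.92, 40.81, 74.33, 92.28, 84.81, 84.37, 65.20, 85.80]

# Element-wise gain of MSVSE over HvpNet, rounded to the table's two decimals.
gain = [round(m - h, 2) for m, h in zip(msvse, hvpnet)]

# Count how many of the ten scores improve over HvpNet.
improved = sum(1 for g in gain if g > 0)

print(gain)
print(f"{improved} of {len(gain)} scores improve")
```

Two of the ten differences are negative (the second and fourth Twitter-2015 columns); the remaining eight are gains.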

    Table 2  Structural ablation experiments for the model (%)


    Table 3  Visual feature ablation experiments in the joint encoder (%)

    $\checkmark$ $\checkmark$               86.72  81.63  64.08  38.91  75.11  93.24  85.96  85.22  70.00  87.34
    $\checkmark$                            86.76  81.68  61.21  39.46  74.73  92.95  86.20  84.60  70.82  87.11
    $\checkmark$ $\checkmark$               86.87  81.74  63.72  37.80  74.87  93.03  85.71  84.43  71.71  87.16
    $\checkmark$ $\checkmark$ $\checkmark$  86.51  81.85  62.20  38.36  74.72  93.73  85.96  84.62  70.97  87.38

    Table 4  Visual feature ablation experiments in multi-scale visual semantic prefixes (%)

    $\checkmark$ $\checkmark$ $\checkmark$  86.72  81.63  64.08  38.91  75.11  93.24  85.96  85.22  70.00  87.34
    $\checkmark$                            86.25  81.93  63.99  38.23  74.76  93.16  84.83  85.47  69.10  87.13
    $\checkmark$ $\checkmark$               86.56  81.60  64.01  38.59  74.93  93.02  85.79  85.97  68.67  87.28
    $\checkmark$ $\checkmark$               86.87  81.79  63.36  38.68  74.98  92.94  86.52  85.14  68.94  87.14

    Table 5  Performance comparison of methods under single-scale visual features (%)

    MSVSE (proposed method)  75.11  87.34

    Table 6  Performance comparison of methods under different learning rates (%)

    Dataset       Learning rate ($\times 10^{-5}$)
                  1     2     3     4     5     6
    Twitter-2015  73.4  75.0  75.1  74.8  74.6  74.5
    Twitter-2017  87.1  86.8  87.3  87.5  87.2  87.3
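To read Table 6 at a glance, the short Python sketch below picks the learning rate with the highest score for each dataset. The scores are copied from the table; the dictionary layout and names are assumptions made for illustration only.

```python
# Scores (%) from Table 6, keyed by learning rate in units of 1e-5.
scores_by_lr = {
    "Twitter-2015": {1: 73.4, 2: 75.0, 3: 75.1, 4: 74.8, 5: 74.6, 6: 74.5},
    "Twitter-2017": {1: 87.1, 2: 86.8, 3: 87.3, 4: 87.5, 5: 87.2, 6: 87.3},
}

# For each dataset, select the learning rate whose score is largest.
best = {name: max(scores, key=scores.get) for name, scores in scores_by_lr.items()}

for name, lr in best.items():
    print(f"{name}: best score {scores_by_lr[name][lr]}% at {lr}e-5")
```

On these numbers the peak falls at $3\times 10^{-5}$ for Twitter-2015 and $4\times 10^{-5}$ for Twitter-2017, and the spread across the six settings is under two points on either dataset.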

    Table 7  Comparison of parameter count and time efficiency

    MSVSE (proposed method)  119.27  75.81  7.03

    Table 8  Performance comparison of MNER methods based on pre-trained language models (%)

    Prompting ChatGPT  79.33  91.43
