Multi-agent Interactive Modeling and Authenticity Evaluation for Classroom Teaching
Abstract: With the development of large language models, multi-agent virtual classrooms are becoming an important tool for low-risk teaching experiments and strategy validation. However, existing methods often neglect the discourse structure, latent student states, and peer interaction mechanisms of real classrooms, and they lack systematic modeling and evaluation of the authenticity of teaching interactions and of intervention effects. To address this, the IRF-Smi framework is proposed: it constrains teaching dialogues with the initiation-response-feedback (IRF) discourse chain and combines first-person latent state modeling with a small-world social network to capture the dynamic evolution of teacher and student behaviors and peer influence. A benchmark for teaching interaction authenticity is also constructed, and simulation results are quantitatively evaluated using the Pearson correlation coefficient (PCC), the intraclass correlation coefficient (ICC), and the mean absolute error (MAE). Experiments on 50 K-12 classroom sessions show that IRF-Smi achieves better consistency in teacher and student behavior distributions than AutoGen and MetaGPT. Moreover, gamified teaching strategies yield significant gains, demonstrating the framework's potential for teaching mechanism research and agent behavior validation.

1) https://github.com/SumnerLab/TalkMoves
2) https://github.com/huggingface/peft
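The three agreement metrics named in the abstract can be illustrated with a small sketch. It assumes that a real lesson and its simulated counterpart are each summarized as a vector of behavior-category proportions, and it uses the two-way, absolute-agreement, single-measure form ICC(A,1); the source does not state which ICC form was used, so that choice is an assumption, and the function names and example proportions are hypothetical.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def mae(x, y):
    """Mean absolute error between two equal-length vectors."""
    return mean(abs(a - b) for a, b in zip(x, y))


def icc_a1(x, y):
    """ICC(A,1): two-way, absolute agreement, single measure.

    Assumption: the real and simulated distributions are treated as
    two 'raters' scoring the same n behavior categories.
    """
    n, k = len(x), 2
    rows = list(zip(x, y))
    grand = mean(v for r in rows for v in r)
    row_means = [mean(r) for r in rows]
    col_means = [mean(x), mean(y)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for r in rows for v in r)
    ss_err = ss_total - ss_rows - ss_cols
    msr = ss_rows / (n - 1)              # between-category mean square
    msc = ss_cols / (k - 1)              # between-rater mean square
    mse = ss_err / ((n - 1) * (k - 1))   # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


# Hypothetical proportions over seven teacher behavior categories.
real = [0.30, 0.10, 0.05, 0.15, 0.10, 0.10, 0.20]
simulated = [0.25, 0.12, 0.08, 0.15, 0.12, 0.08, 0.20]

print("PCC:", round(pearson(real, simulated), 4))
print("ICC:", round(icc_a1(real, simulated), 4))
print("MAE:", round(mae(real, simulated), 4))
```

PCC and ICC answer different questions: a constant offset between the two vectors, such as [1, 2, 3] against [2, 3, 4], leaves PCC at 1 but lowers ICC(A,1) to 2/3, which is one reason to report both alongside MAE.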
Table 1 Comparison of IRF-Smi with other methods on the teaching behavior authenticity evaluation benchmark

| Metric | Role | Method | Grade 4 | Grade 5 | Grade 6 | MS | HS |
|---|---|---|---|---|---|---|---|
| PCC | Teacher | AutoGen | 0.5778 | 0.6060 | 0.6029 | 0.5897 | 0.5981 |
| PCC | Teacher | MetaGPT | 0.6050 | 0.6217 | 0.6269 | 0.6263 | 0.6291 |
| PCC | Teacher | IRF-Smi | 0.6431 | 0.6786 | 0.6805 | 0.6607 | 0.6938 |
| PCC | Student | AutoGen | 0.5701 | 0.5917 | 0.5676 | 0.5696 | 0.5725 |
| PCC | Student | MetaGPT | 0.5886 | 0.5814 | 0.5976 | 0.5958 | 0.5870 |
| PCC | Student | IRF-Smi | 0.6363 | 0.6765 | 0.6792 | 0.6375 | 0.6815 |
| ICC | Teacher | AutoGen | 0.5644 | 0.5627 | 0.5614 | 0.5545 | 0.5464 |
| ICC | Teacher | MetaGPT | 0.5840 | 0.5676 | 0.5761 | 0.5745 | 0.5816 |
| ICC | Teacher | IRF-Smi | 0.6457 | 0.6621 | 0.6715 | 0.6437 | 0.6699 |
| ICC | Student | AutoGen | 0.5795 | 0.5896 | 0.5841 | 0.5865 | 0.5867 |
| ICC | Student | MetaGPT | 0.5817 | 0.6039 | 0.5974 | 0.6056 | 0.5786 |
| ICC | Student | IRF-Smi | 0.6309 | 0.6753 | 0.6761 | 0.6473 | 0.6663 |
| MAE | Teacher | AutoGen | 0.1071 | 0.1330 | 0.1286 | 0.1384 | 0.1374 |
| MAE | Teacher | MetaGPT | 0.1091 | 0.1284 | 0.1328 | 0.1420 | 0.1350 |
| MAE | Teacher | IRF-Smi | 0.1001 | 0.1190 | 0.1245 | 0.1248 | 0.1273 |
| MAE | Student | AutoGen | 0.1324 | 0.1249 | 0.1186 | 0.1216 | 0.1489 |
| MAE | Student | MetaGPT | 0.1263 | 0.1223 | 0.1238 | 0.1210 | 0.1443 |
| MAE | Student | IRF-Smi | 0.1074 | 0.1098 | 0.1063 | 0.1139 | 0.1325 |
Table 2 Ablation results for the core components of IRF-Smi

| Variant | Teacher PCC | Teacher ICC | Teacher MAE | Student PCC | Student ICC | Student MAE |
|---|---|---|---|---|---|---|
| IRF-Smi | 0.6431 | 0.6457 | 0.1001 | 0.6363 | 0.6309 | 0.1074 |
| w/o IRF | 0.6218 | 0.6027 | 0.1316 | 0.6104 | 0.6079 | 0.1268 |
| w/o First-Person | 0.6489 | 0.6315 | 0.1238 | 0.6287 | 0.6216 | 0.1217 |
| w/o Small-World | 0.6532 | 0.6384 | 0.1226 | 0.6411 | 0.6368 | 0.1189 |
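Table 3 Changes in knowledge mastery before and after class (number of correct answers out of 10)

| Model | Period | Sophia Liu | Alex Wang | Jason | Emily | Leo |
|---|---|---|---|---|---|---|
| GPT-4o | Before class | 7 | 7 | 8 | 7 | 7 |
| GPT-4o | After class | 9 | 10 | 10 | 7 | 7 |
| LLaMA3-7B | Before class | 6 | 5 | 6 | 5 | 6 |
| LLaMA3-7B | After class | 10 | 9 | 9 | 6 | 6 |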
Table 4 Computational cost comparison under different scales and connection densities

| Setting | Time/IRF (s) | Tokens/IRF | Tokens/student/IRF |
|---|---|---|---|
| 5 students, $k=1$ | 224 | 620 | 34.7 |
| 5 students, $k=2$ | 257 | 935 | 119.2 |
| 100 students, $k=1$ | 249 | 1461 | 37.5 |
| 100 students, $k=2$ | 261 | 2192 | 126.6 |
A1 Prompt for TalkMoves automatic annotation (LLaMA3-8B with LoRA fine-tuning)
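# Role
You are a classroom discourse analysis system. Your task is to assign each classroom utterance to the most appropriate TalkMoves category. Use the speaker's role (teacher/student) and the definitions below to judge the communicative intent, and output the correct label.

# Input format
Utterance: <one sentence spoken by a teacher or student in class>
Speaker role: <teacher or student>

# Available labels
## Teacher talk moves
1. No evident talk move: a general statement or off-topic remark that fits none of the categories below.
2. Keeping everyone together: guiding students to listen actively and directing their attention to peers' ideas.
3. Getting students to relate to another's ideas: prompting students to respond to or evaluate a classmate's contribution.
4. Restating: repeating what a student said verbatim or nearly verbatim.
5. Pressing for accuracy: asking students to use more precise mathematical expressions or standard language.
6. Revoicing: rephrasing or slightly extending a student's idea before expressing it again.
7. Pressing for reasoning: encouraging students to explain their reasons, provide evidence, or connect concepts.

## Student talk moves
1. No evident talk move: a general statement or off-topic remark.
2. Relating to another student: mentioning, commenting on, or questioning a classmate's idea.
3. Asking for more information: expressing confusion, requesting clarification, or seeking help.
4. Making a claim: giving a factual mathematical statement or a solution step.
5. Providing evidence or reasoning: explaining one's thinking, giving an argument, or showing a derivation.

# Examples
[Input] Utterance: Okay someone to tell me how do we write five tenths, Regina. Speaker role: <teacher>
[Output] Label: Pressing for accuracy

[Input] Utterance: Wait hang on, I meant Conrad was right. Speaker role: <student>
[Output] Label: Relating to another student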
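A prompt like this is usually wrapped in a small amount of glue code that formats each utterance and checks that the returned label is valid for the speaker's role. The following is a minimal sketch of that glue, not the authors' pipeline: the function names, the English label strings (renderings of the categories listed in the prompt), and the choice to return `None` for an invalid label are all assumptions made for illustration, and no model call is included.

```python
from typing import Optional

# English renderings of the categories in the prompt (assumed wording).
TEACHER_LABELS = (
    "No evident talk move",
    "Keeping everyone together",
    "Getting students to relate to another's ideas",
    "Restating",
    "Pressing for accuracy",
    "Revoicing",
    "Pressing for reasoning",
)
STUDENT_LABELS = (
    "No evident talk move",
    "Relating to another student",
    "Asking for more information",
    "Making a claim",
    "Providing evidence or reasoning",
)
LABELS_BY_ROLE = {"teacher": TEACHER_LABELS, "student": STUDENT_LABELS}


def build_query(utterance: str, role: str) -> str:
    """Format one utterance in the input layout described by the prompt."""
    if role not in LABELS_BY_ROLE:
        raise ValueError(f"unknown speaker role: {role!r}")
    return f"Utterance: {utterance.strip()}\nSpeaker role: {role}"


def parse_label(model_output: str, role: str) -> Optional[str]:
    """Return the canonical label if the output is valid for this role.

    Hypothetical policy: anything that is not an exact (case-insensitive)
    match for one of the role's labels is rejected with None, so the
    caller can decide whether to retry or discard the utterance.
    """
    text = model_output.strip()
    if text.lower().startswith("label:"):
        text = text[len("label:"):].strip()
    for label in LABELS_BY_ROLE[role]:
        if text.lower() == label.lower():
            return label
    return None


query = build_query("Wait hang on, I meant Conrad was right.", "student")
print(query)
print(parse_label("Label: Relating to another student", "student"))
```

Validating against the role-specific label set catches one easy failure mode: a teacher-only label such as "Pressing for accuracy" returned for a student utterance is rejected rather than silently counted.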