GBDT Based Railway Accident Type Prediction and Cause Analysis

Zhong Min-Hui, Zhang Wan-Lu, Li You-Ru, Zhu Zhen-Feng, Zhao Yao

Citation: Zhong Min-Hui, Zhang Wan-Lu, Li You-Ru, Zhu Zhen-Feng, Zhao Yao. GBDT based railway accident type prediction and cause analysis. Acta Automatica Sinica, 2020, 45(x): 1−9 doi: 10.16383/j.aas.c190630

doi: 10.16383/j.aas.c190630
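Funds: Supported by Science and Technology Innovation 2030 Major Program: New Generation Artificial Intelligence (2018AAA0102101), the Fundamental Research Funds for the Central Universities (2018JBZ001), and National Natural Science Foundation of China (61976018 and 61532005)
    Author biographies: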

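    Zhong Min-Hui: Master's student at the Institute of Information Science, Beijing Jiaotong University. Research interests: computer vision and machine learning. Corresponding author of this paper. E-mail: mhzhong@bjtu.edu.cn

    Zhang Wan-Lu: Master's student at the Institute of Information Science, Beijing Jiaotong University. Research interests: computer vision and deep learning. E-mail: wlzhang@bjtu.edu.cn

    Li You-Ru: Master's student at the Institute of Information Science, Beijing Jiaotong University. Research interests: data mining and machine learning. E-mail: liyouru@bjtu.edu.cn

    Zhu Zhen-Feng: Professor at the Institute of Information Science, Beijing Jiaotong University. Received the Ph.D. degree in engineering from the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, in 2005. Research interests: image and video analysis and understanding, computer vision, and machine learning. E-mail: zhfzhu@bjtu.edu.cn

    Zhao Yao: Professor and director of the Institute of Information Science, Beijing Jiaotong University. Received the Ph.D. degree in engineering from Beijing Jiaotong University in 1996. Research interests: image and video coding, digital watermarking and forensics, video analysis and understanding, and artificial intelligence. E-mail: yzhao@bjtu.edu.cn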

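  • Abstract: Applying data mining techniques to railway accident type prediction and cause analysis is of great significance for establishing an early-warning mechanism for railway accidents. To this end, this paper proposes a railway accident type prediction and cause analysis algorithm based on the gradient boosting decision tree (GBDT). To address missing values in railway accident records, a completion algorithm based on attribute distribution probability is proposed; it preserves the original data distribution as far as possible and thereby reduces the impact of missing data on accident type prediction. To address class imbalance in railway accident records, an ensemble GBDT model is proposed to predict accident types robustly. On this basis, accident causes are analyzed according to the feature importance ranking of the GBDT prediction model. Experiments on an open database verify the effectiveness of the proposed model.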
  • Fig. 1  The framework of GBDT-based railroad accident type prediction and cause analysis

    Fig. 2  Comparison of the results of the three completion methods

    Fig. 3  Classification accuracy for different numbers of GBDTs in the ensemble

    Fig. 4  Confusion matrix

    Fig. 5  Prediction results for different numbers of features

    Fig. 6  Proportion of different factors in the causes of two types of railroad accidents

    Table 1  Description of original data

    |        | Record | Accident type | Attribute |
    | ------ | ------ | ------------- | --------- |
    | Number | 5 434  | 11            | 144       |

    Table 2  Description of accident types

    | Type | Description |
    | ---- | ----------- |
    | 1 | Derailment |
    | 2 | Head-on collision |
    | 3 | Rear-end collision |
    | 4 | Side collision |
    | 5 | Raking collision |
    | 6 | Broken train collision |
    | 7 | Hwy-rail crossing |
    | 8 | RR grade crossing |
    | 9 | Obstruction |
    | 10 | Fire |
    | 11 | Other impacts |

    Table 3  Examples of the dataset

    | Name | Description | Count | Type |
    | ---- | ----------- | ----- | ---- |
    | RAILROAD | Railroad code | 5 434 | Object |
    | CARS | Num. of cars carrying hazmat | 5 434 | Int64 |
    | TYPSPD | Train speed type | 5 086 | Object |
    | TRNDIR | Train direction | 5 161 | Float64 |
    | TONS | Gross tonnage, excluding power units | 5 434 | Int64 |
    | TYPEQ | Type of consist | 5 081 | Object |
    | EQATT | Equipment attended | 5 074 | Object |
    | CDTRHR | Num. of hours conductors on duty | 3 628 | Int64 |
    | ENGHR | Num. of hours engineers on duty | 4 201 | Int64 |
    | TRKNAME | Track identification | 5 434 | Object |

    Table 4  Description of preprocessed data

    |        | Record | Accident type | Attribute |
    | ------ | ------ | ------------- | --------- |
    | Number | 5 434  | 11            | 119       |

    Table 5  Distribution of the values of attribute TRNDIR before and after completion by the three methods

    | Algorithm | $a_j=1$ | $a_j=2$ | $a_j=3$ | $a_j=4$ |
    | --------- | ------- | ------- | ------- | ------- |
    | Before completion | 0.22 | 0.20 | 0.31 | 0.27 |
    | Interpolation | 0.21 | 0.19 | 0.30 | 0.30 |
    | Mode | 0.21 | 0.19 | 0.34 | 0.26 |
    | Our algorithm | 0.22 | 0.20 | 0.31 | 0.27 |
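This page names the completion algorithm but does not list its steps. One minimal reading, offered only as an assumption, is to fill each missing entry by sampling from the empirical distribution of the observed values of the same attribute, so the value proportions stay roughly unchanged. The function names and the toy counts below are hypothetical; the counts were chosen to match the "Before completion" row of Table 5 and the 5 161 of 5 434 TRNDIR entries listed in Table 3.

```python
import random
from collections import Counter


def complete_by_distribution(values, seed=0):
    """Fill None entries by sampling from the empirical distribution
    of the observed entries (an assumed reading, not the paper's code)."""
    rng = random.Random(seed)
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to estimate a distribution from")
    counts = Counter(observed)
    categories = list(counts)
    weights = [counts[c] for c in categories]
    return [
        v if v is not None else rng.choices(categories, weights=weights, k=1)[0]
        for v in values
    ]


def distribution(values):
    """Relative frequency of each observed (non-None) value."""
    counts = Counter(v for v in values if v is not None)
    total = sum(counts.values())
    return {c: counts[c] / total for c in sorted(counts)}


# Hypothetical TRNDIR column: 5 161 observed entries plus 273 missing ones.
observed = [1] * 1135 + [2] * 1032 + [3] * 1600 + [4] * 1394
column = observed + [None] * 273
filled = complete_by_distribution(column, seed=1)

print("before:", distribution(column))
print("after: ", distribution(filled))
```

With these hypothetical counts, filling all 273 gaps with the mode (value 3) would raise its share to (1600 + 273) / 5434 ≈ 0.34, which is the kind of shift the Mode row of Table 5 shows; sampling from the observed distribution avoids concentrating the filled values in one category.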

    Table 6  Accuracy of the ensemble GBDT classifier at different sampling rates

    | $\alpha$ | 0.6 | 0.7 | 0.8 | 0.9 | 1.0 |
    | -------- | --- | --- | --- | --- | --- |
    | Accuracy | 0.841 | 0.846 | 0.845 | 0.852 | 0.848 |
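How the ensemble is built is not described on this page, so the following is only a sketch under stated assumptions: each member GBDT is trained on a random fraction $\alpha$ of the training rows drawn without replacement, and members are combined by averaging class probabilities. It uses scikit-learn's `GradientBoostingClassifier` on synthetic imbalanced data, not the railway accident records; `fit_gbdt_ensemble`, `predict_gbdt_ensemble`, and all parameter values are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split


def fit_gbdt_ensemble(X, y, n_models=5, alpha=0.9, seed=0):
    """Train n_models GBDTs, each on a fraction alpha of the rows
    sampled without replacement (an assumed sampling scheme)."""
    rng = np.random.default_rng(seed)
    n_rows = X.shape[0]
    n_sub = max(1, int(round(alpha * n_rows)))
    models = []
    for i in range(n_models):
        idx = rng.choice(n_rows, size=n_sub, replace=False)
        model = GradientBoostingClassifier(n_estimators=30, random_state=seed + i)
        model.fit(X[idx], y[idx])
        models.append(model)
    return models


def predict_gbdt_ensemble(models, X, classes):
    """Sum class probabilities over members and return the top class."""
    classes = np.asarray(classes)
    column = {c: j for j, c in enumerate(classes)}
    proba = np.zeros((X.shape[0], len(classes)))
    for model in models:
        member = model.predict_proba(X)
        # A subsample may miss a rare class, so align via model.classes_.
        for j, c in enumerate(model.classes_):
            proba[:, column[c]] += member[:, j]
    return classes[np.argmax(proba, axis=1)]


# Synthetic, imbalanced 3-class data standing in for accident records.
X, y = make_classification(
    n_samples=600, n_features=12, n_informative=6, n_classes=3,
    weights=[0.7, 0.2, 0.1], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
models = fit_gbdt_ensemble(X_train, y_train, n_models=5, alpha=0.9)
y_pred = predict_gbdt_ensemble(models, X_test, classes=np.unique(y_train))
accuracy = float(np.mean(y_pred == y_test))
print(f"ensemble accuracy on synthetic data: {accuracy:.3f}")
```

Sweeping `alpha` over 0.6 to 1.0 in this sketch would mirror the layout of Table 6, though the numbers would not be comparable to the reported ones.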

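    Table 7  Performance comparison of classifiers

    | Classifier | Accuracy | Precision | Recall | F1 |
    | ---------- | -------- | --------- | ------ | ---- |
    | DT | 0.728 | 0.73 | 0.73 | 0.73 |
    | RF | 0.773 | 0.74 | 0.77 | 0.75 |
    | ET | 0.734 | 0.70 | 0.73 | 0.71 |
    | GBDT | 0.841 | 0.84 | 0.84 | 0.84 |
    | ensemble GBDT | 0.852 | 0.85 | 0.85 | 0.85 |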
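The page does not say how precision, recall, and F1 are averaged over the 11 accident types. In every row of Table 7 the recall equals the accuracy to two decimals, which is consistent with support-weighted averaging, because weighted recall always equals accuracy; this is an inference, not something the source states. The sketch below computes the weighted metrics from a confusion matrix; the function name and the 3-class matrix are hypothetical.

```python
def weighted_metrics(cm):
    """Support-weighted precision, recall, and F1 plus accuracy.

    cm[i][j] is the number of samples of true class i predicted as class j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[i][i] for i in range(n)) / total
    precision = recall = f1 = 0.0
    for i in range(n):
        support = sum(cm[i])                          # true samples of class i
        predicted = sum(cm[r][i] for r in range(n))   # samples predicted as i
        p = cm[i][i] / predicted if predicted else 0.0
        r = cm[i][i] / support if support else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = support / total
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical confusion matrix for three classes (100 samples).
toy_cm = [
    [50, 5, 5],
    [4, 20, 6],
    [2, 3, 5],
]
metrics = weighted_metrics(toy_cm)
print(metrics)
```

The identity follows from the weights: each class contributes (support / total) × (correct / support) = correct / total, and summing over classes gives the accuracy.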

    Table 8  Top 15 features by importance

    | No. | Name | Description |
    | --- | ---- | ----------- |
    | 1 | Latitude | Latitude in decimal degrees |
    | 2 | Longitude | Longitude in decimal degrees |
    | 3 | CNTYCD | FIPS County Code |
    | 4 | HIGHSPD | Maximum speed |
    | 5 | TRKNAME | Track identification |
    | 6 | RRCAR1 | Car initials (first involved) |
    | 7 | TEMP | Temperature in degrees Fahrenheit |
    | 8 | MILEPOST | Milepost |
    | 9 | STATION | Nearest city and town |
    | 10 | TRNSPD | Speed of train in miles per hour |
    | 11 | RRCAR2 | Car initials (causing) |
    | 12 | SUBDIV | Railroad subdivision |
    | 13 | ENGHR | Num. of hours engineers on duty |
    | 14 | CDTRHR | Num. of hours conductors on duty |
    | 15 | TONS | Gross tonnage |
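A ranking like Table 8 can be read off a fitted GBDT. As a sketch only: scikit-learn's `GradientBoostingClassifier` exposes impurity-based importances as `feature_importances_`, and sorting them gives a top-k list. Whether the authors used this importance measure or this library is not stated here, so both are assumptions; `top_features`, the feature names, and the synthetic data are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier


def top_features(model, feature_names, k=15):
    """Return the k (name, importance) pairs with the largest importance."""
    ranked = sorted(
        zip(feature_names, model.feature_importances_),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:k]


# Synthetic data: with shuffle=False the 3 informative features come first.
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)
names = [f"feat_{i}" for i in range(X.shape[1])]
model = GradientBoostingClassifier(n_estimators=50, random_state=0).fit(X, y)

ranking = top_features(model, names, k=5)
for rank, (name, score) in enumerate(ranking, start=1):
    print(rank, name, round(float(score), 3))
```

For an ensemble of several GBDTs, one simple option would be to average `feature_importances_` across members before ranking; that choice is also an assumption.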
Publication history
  • Received: 2019-09-11
  • Accepted: 2020-01-17
