Recurrent Neural Networks With Recursive Least Squares

Zhao Jie, Zhang Chun-Yuan, Liu Chao, Zhou Hui, Ou Yi-Gui, Song Qi

Citation: Zhao Jie, Zhang Chun-Yuan, Liu Chao, Zhou Hui, Ou Yi-Gui, Song Qi. Recurrent neural networks with recursive least squares. Acta Automatica Sinica, 2022, 48(8): 2050−2061. doi: 10.16383/j.aas.c190847


Recurrent Neural Networks With Recursive Least Squares

Funds: Supported by National Natural Science Foundation of China (61762032, 61662019, 11961018)
More Information
    Author Bio:

    ZHAO Jie  Master student at the School of Computer Science and Technology, Hainan University. His research interest covers deep learning and reinforcement learning. E-mail: zhaojie@lonelyme.cn

    ZHANG Chun-Yuan  Associate professor at the School of Computer Science and Technology, Hainan University. He received his Ph.D. degree in computer software and theory from University of Electronic Science and Technology of China in 2016. His research interest covers deep learning and reinforcement learning. Corresponding author of this paper. E-mail: zcy7566@126.com

    LIU Chao  Master student at the School of Computer Science and Technology, Hainan University. His research interest covers deep learning and reinforcement learning. E-mail: lcdyx0618@126.com

    ZHOU Hui  Associate professor at the School of Computer Science and Technology, Hainan University. He received his Ph.D. degree from the Institute of Software, Chinese Academy of Sciences in 2008. His research interest covers natural language processing, artificial intelligence writing, and data visualization. E-mail: zhouhui@hainanu.edu.cn

    OU Yi-Gui  Professor at the School of Science, Hainan University. He received his Ph.D. degree from University of Science and Technology of China in 2003. His main research interest is numerical optimization algorithms. E-mail: ouyigui@126.com

    SONG Qi  Master student at the School of Computer Science and Technology, Hainan University. Her research interest covers deep learning and reinforcement learning. E-mail: songqihnu@163.com

  • Abstract: To address the low learning efficiency of first-order optimization algorithms and the excessive time and space overhead of second-order optimization algorithms for recurrent neural networks (RNNs), a new mini-batch recursive least squares (RLS) optimization algorithm is proposed. The proposed algorithm replaces the conventional back-propagation of activated output errors with non-activated linear output errors and, combined with the equivalent gradient of a weighted linear least squares objective with respect to the linear output of the hidden layer, derives the mini-batch recursive least squares solution of the RNN parameters layer by layer. Compared with stochastic gradient descent (SGD), the proposed algorithm only adds one covariance matrix to the hidden layer and one to the output layer of the RNN, so its time and space complexity are only about three times those of SGD. In addition, solutions are given for the forgetting factor adaptation problem and the overfitting problem of the proposed algorithm. Simulation results show that, for both classification and prediction tasks on sequential data, the proposed algorithm converges faster than the current mainstream first-order optimization algorithms and is fairly robust to the hyperparameter settings.
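The abstract only summarizes the method, so the exact layer-wise recursions are not reproduced on this page. As a rough, generic sketch of the mini-batch recursive least squares idea it describes (a covariance matrix per layer, a forgetting factor, and updates driven by linear-output errors), the snippet below applies standard block RLS to a single linear layer. The names (rls_update, lam, W, P) and the P = I / alpha initialization are illustrative assumptions, not the paper's notation or its exact update equations.

```python
import numpy as np

def rls_update(W, P, X, Y, lam=0.99):
    """One mini-batch recursive least squares step for a linear map Y ~= X @ W.

    W   : (f, o) weight matrix being learned
    P   : (f, f) covariance matrix maintained by RLS
    X   : (m, f) mini-batch of inputs
    Y   : (m, o) mini-batch of (linear) targets
    lam : forgetting factor in (0, 1]
    """
    m = X.shape[0]
    # Gain via the matrix inversion lemma: G = P X^T (lam*I + X P X^T)^(-1)
    G = P @ X.T @ np.linalg.inv(lam * np.eye(m) + X @ P @ X.T)
    # Weights move along the linear-output error, not an activated error
    W = W + G @ (Y - X @ W)
    # Covariance update with exponential forgetting
    P = (P - G @ X @ P) / lam
    return W, P

# Toy usage: recover a known linear map from streaming mini-batches
rng = np.random.default_rng(0)
f, o, m, alpha = 8, 2, 4, 0.5
W_true = rng.normal(size=(f, o))
W, P = np.zeros((f, o)), np.eye(f) / alpha   # assumed initialization P = I / alpha
for _ in range(50):
    X = rng.normal(size=(m, f))
    W, P = rls_update(W, P, X, X @ W_true, lam=0.99)
print(np.max(np.abs(W - W_true)))            # should shrink towards 0
```

In the paper's setting such an update would be applied layer by layer, with the hidden layer and the output layer each keeping their own covariance matrix, which is where the two extra matrices mentioned in the abstract come from.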
  • Fig.  1  RNN model structure

    Fig.  2  Experimental results on the convergence comparisons

    Table  1  Complexity analysis of SGD-RNN and RLS-RNN

                              SGD-RNN                             RLS-RNN
    Time complexity
      $O_{s}$                 ${\rm{O}}(\tau mdh)$                —
      $Z_{s}$                 —                                   ${\rm{O}}(\tau mdh)$
      $H_{s}$                 ${\rm{O}}(\tau mh(h+a))$            ${\rm{O}}(\tau mh(h+a))$
      ${\Delta}^O_{s}$        ${\rm{O}}(4\tau md)$                ${\rm{O}}(3\tau md)$
      ${\Delta}^H_{s}$        ${\rm{O}}(\tau mh(h+d))$            ${\rm{O}}(\tau mh(h+d))$
      ${P}_{s}^O$             —                                   ${\rm{O}}(2\tau mh^2)$
      ${P}_{s}^H$             —                                   ${\rm{O}}(2\tau m(h+a)^2)$
      ${\Theta}_{s}^O$        ${\rm{O}}(\tau mdh)$                ${\rm{O}}(\tau mdh)$
      ${\Theta}_{s}^H$        ${\rm{O}}(\tau mh(h+a))$            ${\rm{O}}(\tau mh(h+a))$
      Total                   ${\rm{O}}(\tau m(3dh+3h^2+2ha))$    ${\rm{O}}(\tau m(7h^2+2a^2+3dh+6ha))$
    Space complexity
      $\Theta_{s}^O$          ${\rm{O}}(hd)$                      ${\rm{O}}(hd)$
      $\Theta_{s}^H$          ${\rm{O}}(h(h+a))$                  ${\rm{O}}(h(h+a))$
      ${P}_{s}^H$             —                                   ${\rm{O}}((h+a)^2)$
      ${P}_{s}^O$             —                                   ${\rm{O}}(h^2)$
      Total                   ${\rm{O}}(h^2+hd+ha)$               ${\rm{O}}(hd+3ha+a^2+3h^2)$
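    A quick sanity check of the abstract's "about three times SGD" figure (an estimate made here, not a computation taken from the paper): assuming the hidden size $h$ dominates the input size $a$ and the output size $d$, the totals in Table 1 are driven by their $h^2$ terms, giving a time-complexity ratio of roughly $7h^2/3h^2 = 7/3 \approx 2.3$ and a space-complexity ratio of roughly $3h^2/h^2 = 3$.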

    Table  2  Robustness analysis of the initializing factor $\alpha$

    $\alpha$                                        0.01   0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9    1.0
    MNIST classification accuracy (%)               97.10  97.36  97.38  97.35  97.57  97.70  97.19  97.27  97.42  97.25  97.60
    IMDB classification accuracy (%)                72.21  73.50  73.24  73.32  74.02  73.01  73.68  73.25  73.20  73.42  73.12
    Stock price prediction MSE ($\times 10^{-4}$)   5.32   5.19   5.04   5.43   5.42   5.30   4.87   4.85   5.32   5.54   5.27
    PM2.5 prediction MSE ($\times 10^{-3}$)         1.58   1.55   1.53   1.55   1.61   1.55   1.55   1.54   1.57   1.58   1.57

    Table  3  Robustness analysis of the scaling factor $\eta$

    $\eta$                                          0.1    1.0    2.0    3.0    4.0    5.0    6.0    7.0    8.0    9.0    10.0
    MNIST classification accuracy (%)               97.80  97.59  97.48  97.61  97.04  97.62  97.44  97.33  97.38  97.37  97.45
    IMDB classification accuracy (%)                73.58  73.46  73.62  73.76  73.44  73.82  73.71  72.97  72.86  73.12  73.69
    Stock price prediction MSE ($\times 10^{-4}$)   5.70   5.32   5.04   5.06   5.61   4.73   5.04   5.14   4.85   4.97   5.19
    PM2.5 prediction MSE ($\times 10^{-3}$)         1.53   1.55   1.56   1.59   1.56   1.53   1.58   1.55   1.54   1.50   1.52
Publication history
  • Received:  2019-12-12
  • Accepted:  2020-04-07
  • Published online:  2022-07-12
  • Issue date:  2022-06-01
