| [1] | Tikhonov A N, Arsenin V Y. Solution of Ill-posed Problems. Washington:Winston and Sons, 1977. |
| [2] | Akaike H. A new look at the statistical model identification. IEEE Transactions on Automatic Control, 1974, 19(6):716-723 doi: 10.1109/TAC.1974.1100705 |
| [3] | Akaike H. A Bayesian analysis of the minimum AIC procedure. Annals of the Institute of Statistical Mathematics, 1978, 30(1):9-14 doi: 10.1007/BF02480194 |
| [4] | Schwarz G. Estimating the dimension of a model. Annals of Statistics, 1978, 6(2):461-464 doi: 10.1214/aos/1176344136 |
| [5] | Kolmogorov A N. On tables of random numbers. Sankhya:The Indian Journal of Statistics, 1963, 25:369-376 |
| [6] | Kolmogorov A N. Three approaches to the quantitative definition of information. Problems of Information Transmission, 1965, 1(1):1-7 |
| [7] | Chaitin G J. On the length of programs for computing finite binary sequences. Journal of the ACM, 1966, 13(4):547-569 doi: 10.1145/321356.321363 |
| [8] | Solomonoff R J. A formal theory of inductive inference. Part II. Information and Control, 1964, 7(2):224-254 doi: 10.1016/S0019-9958(64)90131-7 |
| [9] | Wallace C S, Boulton D M. An information measure for classification. The Computer Journal, 1968, 11(2):185-194 doi: 10.1093/comjnl/11.2.185 |
| [10] | Vapnik V N, Chervonenkis A Y. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and Its Applications, 1971, 16(2):264-280 doi: 10.1137/1116025 |
| [11] | Vapnik V N. The Nature of Statistical Learning Theory. New York:Springer, 1995. |
| [12] | Vapnik V N. An overview of statistical learning theory. IEEE Transactions on Neural Networks, 1999, 10(5):988-999 doi: 10.1109/72.788640 |
| [13] | Popper K R. The Logic of Scientific Discovery. United Kingdom:Hutchinson, 1959. |
| [14] | Valiant L G. A theory of the learnable. Communications of the ACM, 1984, 27(11):1134-1142 doi: 10.1145/1968.1972 |
| [15] | Kearns M J, Schapire R E. Efficient distribution-free learning of probabilistic concepts. Journal of Computer and System Sciences, 1994, 48(3):464-497 doi: 10.1016/S0022-0000(05)80062-5 |
| [16] | Bartlett P L, Long P M, Williamson R C. Fat-shattering and the learnability of real-valued functions. Journal of Computer and System Sciences, 1996, 52(3):434-452 doi: 10.1006/jcss.1996.0033 |
| [17] | Shawe-Taylor J, Bartlett P L, Williamson R C, Anthony M. Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 1998, 44(5):1926-1940 doi: 10.1109/18.705570 |
| [18] | Koltchinskii V. Rademacher penalties and structural risk minimization. IEEE Transactions on Information Theory, 2001, 47(5):1902-1914 doi: 10.1109/18.930926 |
| [19] | Bartlett P L, Mendelson S. Rademacher and Gaussian complexities:risk bounds and structural results. The Journal of Machine Learning Research, 2003, 3:463-482 |
| [20] | Bartlett P L, Bousquet O, Mendelson S. Local Rademacher complexities. The Annals of Statistics, 2005, 33(4):1497-1537 doi: 10.1214/009053605000000282 |
| [21] | Koltchinskii V. Local Rademacher complexities and oracle inequalities in risk minimization. The Annals of Statistics, 2006, 34(6):2593-2656 doi: 10.1214/009053606000001019 |
| [22] | Koltchinskii V, Panchenko D. Rademacher processes and bounding the risk of function learning. High Dimensional Probability II. Boston:Birkhäuser, 2004. 443-457 |
| [23] | Bousquet O, Koltchinskii V, Panchenko D. Some local measures of complexity of convex hulls and generalization bounds. Computational Learning Theory. Sydney, Australia:Springer, 2004. 59-73 |
| [24] | 张海, 徐宗本. 学习理论综述(I):稳定性与泛化性. 工程数学学报, 2008, 25(1):1-9 Zhang Hai, Xu Zong-Ben. A survey on learning theory (I):stability and generalization. Chinese Journal of Engineering Mathematics, 2008, 25(1):1-9 http://www.cnki.com.cn/Article/CJFDTOTAL-GCSX200801004.htm |
| [25] | 胡政发. 统计量复杂性估计及其在机器学习中的应用. 自动化学报, 2008, 34(10):1332-1336 Hu Zheng-Fa. The statistic complexity estimates and its application to machine learning. Acta Automatica Sinica, 2008, 34(10):1332-1336 http://www.aas.net.cn/CN/abstract/abstract17904.shtml |
| [26] | 陈将宏. Rademacher复杂性与支持向量机的推广性能[硕士学位论文], 湖北大学, 中国, 2005. Chen Jiang-Hong. Rademacher Complexities and the Generalization Performance of SVM[Master dissertation], Hubei University, China, 2005. |
| [27] | Lei Y W, Ying Y M. Generalization analysis of multi-modal metric learning[Online], available:https://www.researchgate.net/publication/282969709, November 3, 2015 |
| [28] | Lei Y W, Ding L X, Bi Y Z. Local Rademacher complexity bounds based on covering numbers[Online], available:http://arxiv.org/pdf/1510.01463v1.pdf, October 6, 2015 |
| [29] | Lei Y W, Ding L X. Refined Rademacher chaos complexity bounds with applications to the multikernel learning problem. Neural Computation, 2014, 26(4):739-760 doi: 10.1162/NECO_a_00566 |
| [30] | Lei Y W, Ding L X, Zhang W S. Generalization performance of radial basis function networks. IEEE Transactions on Neural Networks and Learning Systems, 2015, 26(3):551-564 doi: 10.1109/TNNLS.2014.2320280 |
| [31] | Lei Y W, Ding L X, Wu W L. Universal learning using free multivariate splines. Neurocomputing, 2013, 119:253-263 |
| [32] | 雷云文. Rademacher复杂度及其在机器学习中的应用[博士学位论文], 武汉大学, 中国, 2015. Lei Yun-Wen. Rademacher Complexities with Applications to Machine Learning[Ph.D. dissertation], Wuhan University, China, 2015. |
| [33] | Xu C, Liu T L, Tao D C, Xu C. Local Rademacher complexity for multi-label learning[Online], available:http://arxiv.org/pdf/1410.6990.pdf, October 26, 2014 |
| [34] | Gao W, Zhou Z H. Dropout Rademacher complexity of deep neural networks. Science China:Information Sciences, 2016, 59(7):072104:1-072104:12 doi: 10.1007/s11432-015-5470-z |
| [35] | de la Peña V, Giné E. Decoupling:From Dependence to Independence. New York:Springer-Verlag, 1999. |
| [36] | Cao Q, Guo Z C, Ying Y M. Generalization bounds for metric and similarity learning. Machine Learning, 2016, 102(1):115-132 doi: 10.1007/s10994-015-5499-7 |
| [37] | Mohri M, Medina A M. New analysis and algorithm for learning with drifting distributions. In:Proceedings of the 23rd International Conference on Algorithmic Learning Theory. Lyon, France:Springer-Verlag, 2012. 124-138 |
| [38] | Mohri M, Rostamizadeh A. Rademacher complexity bounds for non-I.I.D. processes. In:Proceedings of the 2008 Advances in Neural Information Processing Systems 21. Vancouver, BC, Canada:Curran Associates, Inc., 2008. 1097-1104 |
| [39] | Rakhlin A, Sridharan K. On martingale extensions of Vapnik-Chervonenkis theory with applications to online learning. Measures of Complexity. Switzerland:Springer International Publishing, 2015. 197-215 |
| [40] | Rakhlin A, Sridharan K, Tewari A. Sequential complexities and uniform martingale laws of large numbers. Probability Theory and Related Fields, 2015, 161(1-2):111-153 doi: 10.1007/s00440-013-0545-5 |
| [41] | Rakhlin A, Sridharan K, Tewari A. Online learning:random averages, combinatorial parameters, and learnability. In:Proceedings of the 2010 Advances in Neural Information Processing Systems 23. Vancouver, BC, Canada:Curran Associates, Inc., 2010. 1984-1992 |
| [42] | Kuznetsov V, Mohri M. Generalization bounds for time series prediction with non-stationary processes. In:Proceedings of the 25th International Conference on Algorithmic Learning Theory. Bled, Slovenia:Springer International Publishing, 2014. 260-274 |
| [43] | Kuznetsov V, Mohri M. Forecasting non-stationary time series:from theory to algorithms[Online], available:http://www.cims.nyu.edu/~vitaly/pub/fts.pdf, July 12, 2016 |
| [44] | Anguita D, Ghio A, Oneto L, Ridella S. A deep connection between the Vapnik-Chervonenkis entropy and the Rademacher complexity. IEEE Transactions on Neural Networks and Learning Systems, 2014, 25(12):2202-2211 doi: 10.1109/TNNLS.2014.2307359 |
| [45] | Kääriäinen M. Relating the Rademacher and VC Bounds, Technical Report, C-2004-57, Department of Computer Science, University of Helsinki, Finland, 2004. |
| [46] | Srebro N, Sridharan K. Note on refined dudley integral covering number bound[Online], available:http://ttic.uchicago.edu/~karthik/dudley.pdf, July 12, 2016 |
| [47] | Srebro N, Sridharan K, Tewari A. Smoothness, low noise and fast rates. In:Proceedings of the 2010 Advances in Neural Information Processing Systems 23. Vancouver, BC, Canada:Curran Associates, Inc., 2010. 2199-2207 |
| [48] | V'Yugin V V. VC dimension, fat-shattering dimension, Rademacher averages, and their applications. Measures of Complexity. Switzerland:Springer International Publishing, 2015. 57-74 |
| [49] | Ying Y M, Campbell C. Rademacher chaos complexities for learning the kernel problem. Neural Computation, 2010, 22(11):2858-2886 doi: 10.1162/NECO_a_00028 |
| [50] | Massart P. Some applications of concentration inequalities to statistics. Annales De La Faculté Des Sciences De Toulouse, 2000, 9(2):245-303 doi: 10.5802/afst.961 |
| [51] | Gnecco G, Sanguineti M. Approximation error bounds via Rademacher's complexity. Applied Mathematical Sciences, 2008, 2(4):153-176 |
| [52] | Boucheron S, Bousquet O, Lugosi G. Theory of classification:a survey of some recent advances. ESAIM:Probability and Statistics, 2005, 9:323-375 doi: 10.1051/ps:2005018 |
| [53] | Oneto L, Ghio A, Ridella S, Anguita D. Global Rademacher complexity bounds:from slow to fast convergence rates. Neural Processing Letters, 2016, 43(2):567-602 doi: 10.1007/s11063-015-9429-2 |
| [54] | Oneto L, Ghio A, Ridella S, Anguita D. Fast convergence of extended Rademacher complexity bounds. In:Proceedings of the 2015 International Joint Conference on Neural Networks. Killarney, Ireland:IEEE, 2015. 1-10 |
| [55] | Oneto L, Ghio A, Anguita D, Ridella S. An improved analysis of the Rademacher data-dependent bound using its self bounding property. Neural Networks, 2013, 44:107-111 doi: 10.1016/j.neunet.2013.03.017 |
| [56] | Yu H F, Jain P, Kar P, Dhillon I S. Large-scale multi-label learning with missing labels. In:Proceedings of the 31st International Conference on Machine Learning. Beijing, China, 2014. 593-601 |
| [57] | 丁万鼎. 测度论概要. 合肥:安徽人民出版社, 2005. Ding Wan-Ding. Essentials of Measure Theory. Hefei:Anhui People's Publishing House, 2005. |
| [58] | Hastie T, Friedman J, Tibshirani R. The Elements of Statistical Learning:Data Mining, Inference, and Prediction. New York:Springer, 2001. |
| [59] | van der Vaart A W, Wellner J A. Weak Convergence and Empirical Processes. New York:Springer-Verlag, 1996. |
| [60] | Ledoux M, Talagrand M. Probability in Banach Spaces:Isoperimetry and Processes. Berlin:Springer-Verlag, 1991. |
| [61] | Dudley R M. The sizes of compact subsets of Hilbert space and continuity of Gaussian processes. Journal of Functional Analysis, 1967, 1(3):290-330 doi: 10.1016/0022-1236(67)90017-1 |
| [62] | Lozano F. Model selection using Rademacher penalization. In:Proceedings of the 2nd ICSC Symposium on Neural Networks. London:Academic Press, 2000. |
| [63] | Boucheron S, Lugosi G, Massart P. Concentration Inequalities:A Nonasymptotic Theory of Independence. United Kingdom:Oxford University Press, 2013. |
| [64] | 周志华. 机器学习. 北京:清华大学出版社, 2016. Zhou Zhi-Hua. Machine Learning. Beijing:Tsinghua University Press, 2016. |
| [65] | Mohri M, Rostamizadeh A, Talwalkar A. Foundations of Machine Learning. Cambridge:MIT Press, 2012. |
| [66] | 汪洪桥, 孙富春, 蔡艳宁, 陈宁, 丁林阁. 多核学习方法. 自动化学报, 2010, 36(8):1037-1050 Wang Hong-Qiao, Sun Fu-Chun, Cai Yan-Ning, Chen Ning, Ding Lin-Ge. On multiple kernel learning methods. Acta Automatica Sinica, 2010, 36(8):1037-1050 doi: 10.3724/SP.J.1004.2010.01037 |
| [67] | McFee B, Lanckriet G. Learning multi-modal similarity. Journal of Machine Learning Research, 2011, 12:491-523 |
| [68] | Xie P T, Xing E P. Multi-modal distance metric learning. In:Proceedings of the 23rd International Joint Conference on Artificial Intelligence. Beijing, China:AAAI, 2013. 1806-1812 |
| [69] | Hinton G E, Srivastava N, Krizhevsky A, Sutskever I, Salakhutdinov R R. Improving neural networks by preventing co-adaptation of feature detectors[Online], available:http://arxiv.org/pdf/1207.0580v1.pdf, July 3, 2012 |
| [70] | Wan L, Zeiler M, Zhang S X, LeCun Y, Fergus R. Regularization of neural networks using dropconnect. In:Proceedings of the 30th International Conference on Machine Learning. Atlanta, Georgia, USA, 2013. 1058-1066 |
| [71] | Rudin W. Functional Analysis (2nd edition). New York:McGraw-Hill, 1991. |
| [72] | Mendelson S. A few notes on statistical learning theory. Advanced Lectures on Machine Learning. Berlin Heidelberg:Springer, 2003. 1-40 |
| [73] | Sauer N. On the density of families of sets. Journal of Combinatorial Theory, Series A, 1972, 13(1):145-147 doi: 10.1016/0097-3165(72)90019-2 |
| [74] | Pollard D. Convergence of Stochastic Processes. New York:Springer-Verlag, 1984. |
| [75] | Duan H H. Bounding the fat shattering dimension of a composition function class built using a continuous logic connective[Online], available:http://arxiv.org/pdf/1105.4618v1.pdf, May 23, 2011 |
| [76] | Anthony M, Bartlett P L. Neural Network Learning:Theoretical Foundations. Cambridge:Cambridge University Press, 2009. |
| [77] | Alon N, Ben-David S, Cesa-Bianchi N, Haussler D. Scale-sensitive dimensions, uniform convergence, and learnability. Journal of the ACM, 1997, 44(4):615-631 doi: 10.1145/263867.263927 |
| [78] | Kolmogorov A N, Tihomirov V M. ε-entropy and ε-capacity of sets in functional space. American Mathematical Society Translations, 1961, 17(2):277-364 |
| [79] | Bousquet O. New approaches to statistical learning theory. Journal of the Institute of Statistical Mathematics, 2003, 55(2):371-389 |
| [80] | Wu P C, Hoi S C H, Xia H, Zhao P L, Wang D Y, Miao C Y. Online multimodal deep similarity learning with application to image retrieval. In:Proceedings of the 21st ACM International Conference on Multimedia. New York, USA:ACM, 2013. 153-162 |
| [81] | Xia H, Wu P C, Hoi S C H. Online multi-modal distance learning for scalable multimedia retrieval. In:Proceedings of the 6th International Conference on Web Search and Data Mining. New York, USA:ACM, 2013. 455-464 |
| [82] | Krzyzak A, Linder T. Radial basis function networks and complexity regularization in function learning. IEEE Transactions on Neural Networks, 1998, 9(2):247-256 doi: 10.1109/72.661120 |
| [83] | Niyogi P, Girosi F. On the relationship between generalization error, hypothesis complexity, and sample complexity for radial basis functions. Neural Computation, 1996, 8(4):819-842 doi: 10.1162/neco.1996.8.4.819 |
| [84] | Park J, Sandberg I W. Universal approximation using radial-basis-function networks. Neural Computation, 1991, 3(2):246-257 doi: 10.1162/neco.1991.3.2.246 |
| [85] | Girosi F. Approximation error bounds that use VC-bounds[Online], available:https://www.researchgate.net/publication/2782224_Approximation_Error_Bounds_That_Use_Vc-Bounds, February 18, 2013 |
| [86] | Györfi L, Kohler M, Krzyżak A, Walk H. A Distribution-Free Theory of Nonparametric Regression. Berlin:Springer-Verlag, 2010. |
| [87] | Haussler D. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 1992, 100(1):78-150 doi: 10.1016/0890-5401(92)90010-D |
| [88] | Yu B. Rates of convergence for empirical processes of stationary mixing sequences. The Annals of Probability, 1994, 22(1):94-116 doi: 10.1214/aop/1176988849 |
| [89] | Meir R. Nonparametric time series prediction through adaptive model selection. Machine Learning, 2000, 39(1):5-34 doi: 10.1023/A:1007602715810 |
| [90] | Bartlett P L. Learning with a slowly changing distribution. In:Proceedings of the 5th Annual Workshop on Computational Learning Theory. New York, USA:ACM, 1992. 243-252 |
| [91] | Mansour Y, Mohri M, Rostamizadeh A. Domain adaptation:learning bounds and algorithms. In:Proceedings of the 22nd Annual Conference on Learning Theory. Montreal, QC:Morgan Kaufmann Publishers, 2009. |
| [92] | Anguita D, Ghio A, Oneto L, Ridella S. In-sample and out-of-sample model selection and error estimation for support vector machines. IEEE Transactions on Neural Networks and Learning Systems, 2012, 23(9):1390-1406 doi: 10.1109/TNNLS.2012.2202401 |
| [93] | El-Yaniv R, Pechyony D. Transductive Rademacher complexity and its applications. Journal of Artificial Intelligence Research, 2009, 35:193-234 |
| [94] | Tolstikhin I, Blanchard G, Kloft M. Localized complexities for transductive learning[Online], available:http://arxiv.org/pdf/1411.7200.pdf, November 26, 2014 |
| [95] | Tolstikhin I, Zhivotovskiy N, Blanchard G. Permutational Rademacher complexity. Algorithmic Learning Theory. Switzerland:Springer International Publishing, 2015. 209-223 |
| [96] | Chazottes J R, Collet P, Külske C, Redig F. Concentration inequalities for random fields via coupling. Probability Theory and Related Fields, 2007, 137(1):201-225 |
| [97] | Belomestny D, Spokoiny V. Concentration inequalities for smooth random fields. Theory of Probability and Its Applications, 2014, 58(2):314-323 doi: 10.1137/S0040585X9798659X |
| [98] | Zhu X J, Rogers T T, Gibson B R. Human Rademacher complexity. In:Proceedings of the 2009 Advances in Neural Information Processing Systems 22. Vancouver, BC, Canada:Curran Associates, Inc., 2009. 2322-2330 |
| [99] | Vahdat M, Oneto L, Ghio A, Anguita D, Funk M, Rauterberg M. Human algorithmic stability and human Rademacher complexity. In:Proceedings of the 23rd European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning. Bruges, Belgium:Katholieke Universiteit Leuven, 2015. 313-318 |
| [100] | Nolan D, Pollard D. U-processes:rates of convergence. The Annals of Statistics, 1987, 15(2):780-799 doi: 10.1214/aos/1176350374 |
| [101] | Karlin S, Taylor H M. A First Course in Stochastic Processes (2nd edition). New York:Academic Press, 1975. |