Fast Kernel Density Estimate Theorem and Scaling up Graph-based Relaxed Clustering Method
Abstract: This paper first proves the fast kernel density estimation (FKDE) theorem: the integrated squared error between the Gaussian kernel density estimate (KDE) built on a sampled subset and the KDE built on the whole dataset depends on the subset size and the kernel width, but not on the size of the whole dataset. It then shows that the objective function of the graph-based relaxed clustering (GRC) algorithm with a Gaussian kernel decomposes into two parts, a weighted sum of Parzen windows plus a quadratic entropy term, so that GRC can itself be viewed as a kernel density estimation problem. Based on this KDE approximation strategy and the FKDE theorem, the scaling-up GRC by KDE approximation (SUGRC-KDEA) method is proposed. Compared with previous work, the advantage of this method is that it provides a simpler and more easily implemented way to apply GRC to large datasets.
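To make the two claims above concrete, here is a minimal sketch of the quantities involved, under notation assumed here (Gaussian kernel $G_\sigma$, full sample size $N$, subset size $n$); the precise error bound and the full GRC decomposition are given in the paper itself:

```latex
% Gaussian KDEs built on the whole dataset (size N) and on a sampled subset (size n):
\hat{p}_N(x) = \frac{1}{N}\sum_{i=1}^{N} G_{\sigma}(x - x_i),
\qquad
\hat{p}_n(x) = \frac{1}{n}\sum_{j=1}^{n} G_{\sigma}(x - x_{i_j}).

% The FKDE theorem concerns the integrated squared error between the two,
%   ISE(\hat{p}_N, \hat{p}_n) = \int \bigl(\hat{p}_N(x) - \hat{p}_n(x)\bigr)^2 \, dx,
% and states that it is bounded by a quantity depending on n and \sigma, not on N.

% The "quadratic entropy" part of the GRC objective corresponds to Renyi's
% quadratic entropy H_2 = -\log \int \hat{p}(x)^2 \, dx, whose integral has a
% closed form via the Gaussian convolution identity:
\int \hat{p}_n(x)^2 \, dx
  = \frac{1}{n^2}\sum_{j=1}^{n}\sum_{k=1}^{n} G_{\sqrt{2}\,\sigma}(x_{i_j} - x_{i_k}).
```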
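The theorem's practical consequence, that the subset-based KDE error is governed by n and σ rather than N, can be checked numerically. The following self-contained sketch is illustrative only, not the paper's code; the function names `gauss_kde` and `ise`, the 1-D standard-normal data, and the grid-based ISE approximation are all assumptions made for this example:

```python
import numpy as np

def gauss_kde(centers, sigma, grid, chunk=5000):
    """Gaussian KDE evaluated on a 1-D grid, summed in chunks to bound memory."""
    total = np.zeros_like(grid)
    for start in range(0, len(centers), chunk):
        c = centers[start:start + chunk]
        d2 = (grid[:, None] - c[None, :]) ** 2        # pairwise squared distances
        total += np.exp(-d2 / (2.0 * sigma ** 2)).sum(axis=1)
    return total / (len(centers) * np.sqrt(2.0 * np.pi) * sigma)

def ise(p, q, dx):
    """Integrated squared error between two densities sampled on the same grid."""
    return float(np.sum((p - q) ** 2) * dx)

rng = np.random.default_rng(0)
grid = np.linspace(-6.0, 6.0, 1201)
dx = grid[1] - grid[0]
sigma, n = 0.5, 200                                   # kernel width and subset size, fixed

for N in (1_000, 10_000, 100_000):                    # growing full-dataset size
    data = rng.standard_normal(N)                     # "whole dataset"
    subset = rng.choice(data, size=n, replace=False)  # uniformly sampled subset
    err = ise(gauss_kde(data, sigma, grid),
              gauss_kde(subset, sigma, grid), dx)
    print(f"N = {N:>7d}   ISE = {err:.6f}")
```

Across the three runs the ISE should stay in the same ballpark rather than growing with N, which is the behavior the FKDE theorem formalizes.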
Key words:
- Kernel density estimate (KDE)
- large data set
- clustering
- sampled subset