A Semi-supervised Agglomerative Hierarchical Clustering Method Based on Dynamically Updating Constraints
Abstract: We propose a semi-supervised agglomerative hierarchical clustering method based on dynamically updated constraints. Like existing semi-supervised hierarchical clustering algorithms, the method uses must-link and cannot-link constraints. Unlike them, however, it does not pre-cluster the instances connected by must-link constraints before agglomerating under the cannot-link constraints. Instead, it first expands the constraints into a closure and then runs a standard agglomeration guided by the cannot-link constraints, dynamically updating both constraint sets according to the intermediate clustering results. This updating process guarantees that the final clustering satisfies all must-link and cannot-link constraints. The fundamental advantage of the method is that it omits the pre-clustering step for must-link instances; this yields a more reasonable agglomeration order for the data points and hence more accurate clustering results. We also give a concrete implementation based on Ward's method, leading to the C-Ward algorithm. Experiments on both synthetic and real-world datasets show that the proposed method is more accurate and more stable than comparable algorithms.
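The constraint-closure step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names (`must_link_closure`, `can_merge`) and the union-find representation are our own, and a full C-Ward algorithm would additionally use Ward's variance-based linkage to choose which allowed pair of clusters to merge.

```python
import itertools

def must_link_closure(must_link, cannot_link):
    """Expand must-link pairs into their transitive closure (connected
    components via union-find), then propagate each cannot-link pair to
    every member of the components it touches."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in must_link:
        parent[find(a)] = find(b)

    # Group must-linked points into components (the closure).
    comps = {}
    for x in list(parent):
        comps.setdefault(find(x), set()).add(x)

    # If any member of component A cannot link to any member of
    # component B, then no member of A may ever join B.
    expanded_cl = set()
    for a, b in cannot_link:
        ca = comps.get(find(a), {a})
        cb = comps.get(find(b), {b})
        for x, y in itertools.product(ca, cb):
            expanded_cl.add(frozenset((x, y)))
    return list(comps.values()), expanded_cl

def can_merge(cluster_a, cluster_b, expanded_cl):
    """During agglomeration, a merge is allowed only if no cannot-link
    pair spans the two clusters; a merged cluster implicitly inherits
    all cannot-link pairs of both parents (the dynamic update)."""
    return not any(frozenset((x, y)) in expanded_cl
                   for x in cluster_a for y in cluster_b)
```

For example, with must-links (1,2) and (2,3) and cannot-link (3,4), the closure puts 1, 2, 3 in one component, and the propagated constraints forbid merging that component with any cluster containing point 4.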