Research on Density-based Clustering Algorithm for Mixed Data with Determine Cluster Centers Automatically

CHEN Jin-Yin; HE Hui-Hao

doi: 10.16383/j.aas.2015.c150062

1. Institute of Information Engineering, Zhejiang University of Technology, Hangzhou 310023

Funds:

Supported by the Natural Science Foundation of Zhejiang Province (Y14F020092) and the Natural Science Foundation of Ningbo City (2013A610070)

About the author:
HE Hui-Hao  Master's student at the Institute of Information, Zhejiang University of Technology. Research interests include data mining and its applications, and cluster analysis. E-mail: hhh zjut@163.com

Corresponding author:
CHEN Jin-Yin  Ph.D., associate professor at the Institute of Information Engineering, Zhejiang University of Technology. Research interests include intelligent computing, optimization computing, and network security. E-mail: chenjinyin@zjut.edu.cn

Publication history
- Received: 2015-02-03
- Revised: 2015-07-14
- Published: 2015-10-20

Abstract: Mixed-attribute data are widespread, yet most existing clustering algorithms for such data suffer from low clustering quality, strong dependence on parameters, and an inability to determine the number of clusters and the cluster centers accurately and automatically. To address these problems, this paper proposes a density-based clustering algorithm for mixed-attribute data that determines the cluster centers automatically. First, by analyzing the attributes of the data, mixed-attribute data are divided into three types (numeric-dominant, categorical-dominant, and balanced), and a corresponding distance measure is chosen for each type. Then, from the distribution of density and distance over all points in the data set, the following rule is derived: a data point that has high density and lies far from every point of higher density is the most likely to be a cluster center. A linear regression model and residual analysis are used to identify the singular points (outliers) of this distribution, and it is argued theoretically that these singular points are the cluster centers, so the centers are determined automatically. Because the value of the parameter dc strongly affects the clustering result, particle swarm optimization (PSO) is used to search for the optimal dc. Given dc, the density of every data object and its minimum distance to any object of higher density can be computed; the center of each cluster is then found by the automatic method, and each remaining point is assigned to the same cluster as its nearest neighbor of higher density. Finally, the proposed algorithm is compared with several existing mixed-attribute clustering algorithms on multiple real-world data sets, and the results indicate that it achieves high clustering quality.

Keywords: data mining; mixed attributes; data clustering; peak density; mixed distance measure
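The pipeline described in the abstract (density, distance to the nearest denser point, regression-based center detection, assignment) can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several details are assumptions because the abstract does not specify them: the density is taken as the number of neighbors closer than dc, the regression is an ordinary least-squares line of distance on density, and a point is flagged as a center when its residual exceeds `z_threshold` residual standard deviations. The names `density_peaks_cluster` and `z_threshold` are hypothetical. The function takes a precomputed distance matrix, so any of the three mixed-attribute distance measures could be plugged in; the PSO search for dc is left out.

```python
import numpy as np


def density_peaks_cluster(dist, dc, z_threshold=3.0):
    """Sketch of density-based clustering with automatic center detection.

    dist        : (n, n) symmetric distance matrix (any metric).
    dc          : distance parameter used to compute the density.
    z_threshold : ASSUMED cut on the standardized residual (hypothetical).
    """
    dist = np.asarray(dist, dtype=float)
    n = dist.shape[0]

    # Density (assumed form): number of other points closer than dc.
    rho = (dist < dc).sum(axis=1) - 1

    # Visit points from highest to lowest density; ties are broken by index.
    order = np.argsort(-rho, kind="stable")

    # delta[i]: distance to the nearest point that precedes i in this order.
    delta = np.zeros(n)
    nearest_higher = np.full(n, -1)
    delta[order[0]] = dist[order[0]].max()  # convention for the densest point
    for pos in range(1, n):
        i = order[pos]
        denser = order[:pos]
        j = denser[np.argmin(dist[i, denser])]
        delta[i] = dist[i, j]
        nearest_higher[i] = j

    # Assumed regression model: least-squares line of delta on rho.
    slope, intercept = np.polyfit(rho, delta, 1)
    residuals = delta - (slope * rho + intercept)
    scale = np.sqrt((residuals ** 2).sum() / max(n - 2, 1))

    # Centers: points whose delta lies far above the fitted line.
    if scale > 0:
        is_center = residuals > z_threshold * scale
    else:
        is_center = np.zeros(n, dtype=bool)
    is_center[order[0]] = True  # the densest point always starts a cluster
    centers = np.flatnonzero(is_center)

    # Each remaining point joins the cluster of its nearest denser neighbor.
    labels = np.full(n, -1)
    labels[centers] = np.arange(len(centers))
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[nearest_higher[i]]
    return rho, delta, centers, labels


if __name__ == "__main__":
    # Toy input: two well-separated groups of evenly spaced 1-D points.
    x = np.concatenate([np.arange(0, 21), np.arange(200, 221)]).astype(float)
    dist = np.abs(x[:, None] - x[None, :])
    rho, delta, centers, labels = density_peaks_cluster(dist, dc=3.5)
    print("centers:", centers)
    print("labels:", labels)
```

On the toy input, one center is flagged in each group and every other point inherits the label of its nearest denser neighbor. Because the points are visited in descending order of density, each point's nearest denser neighbor is already labeled when the point is reached, so a single pass suffices.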


