Abstract: A key problem in cluster ensembles is how to combine multiple base clusterings into a single, better final clustering. This paper brings the idea of spectral clustering to the document cluster ensemble problem. Spectral clustering, however, requires the eigenvalue decomposition of a large-scale matrix to obtain the low-dimensional embedding of the documents that is used for subsequent clustering. We first propose an ensemble algorithm that uses algebraic transformations to convert the large-scale eigenvalue decomposition into an equivalent singular value decomposition problem, and then into the eigenvalue decomposition of a much smaller matrix. We then investigate the properties of spectral clustering further and propose a second ensemble algorithm, which obtains the low-dimensional embedding of the documents indirectly by solving for the low-dimensional embedding of the hyperedges. Experiments on the TREC and Reuters document sets show that both proposed spectral algorithms outperform, and are more robust than, other cluster ensemble techniques based on graph partitioning, and that they are effective for the document cluster ensemble problem.
Key words: clustering analysis, cluster ensemble, spectral clustering, document clustering
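The reduction described in the abstract, from a large eigenvalue problem to a much smaller one, rests on a standard linear-algebra identity: if `H` is an n x m matrix with m much smaller than n, the nonzero eigenvalues of `H @ H.T` (n x n) equal those of `H.T @ H` (m x m), and each eigenvector `v` of the small matrix with eigenvalue `lam > 0` gives an eigenvector `H @ v / sqrt(lam)` of the large one. The sketch below illustrates only that identity. It assumes an unnormalized binary document-by-hyperedge incidence matrix in which every cluster of every base clustering is one hyperedge; the normalization actually used by the paper is not stated in this section, and the function names `build_incidence` and `embed_via_small_eig` are hypothetical.

```python
import numpy as np


def build_incidence(labelings):
    """Stack one-hot cluster indicators of all base clusterings into H (n x m).

    Each column is one hyperedge (one cluster of one base clustering).
    Assumption: plain 0/1 entries, with no degree or weight normalization.
    """
    blocks = []
    for labels in labelings:
        labels = np.asarray(labels)
        ids = np.unique(labels)
        blocks.append((labels[:, None] == ids[None, :]).astype(float))
    return np.hstack(blocks)


def embed_via_small_eig(H, k):
    """Return the top-k eigenpairs of H @ H.T using only the m x m matrix H.T @ H.

    Requires k <= rank(H) so that the selected eigenvalues are positive.
    """
    small = H.T @ H                      # m x m instead of n x n
    vals, vecs = np.linalg.eigh(small)   # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:k]   # indices of the k largest
    vals, vecs = vals[order], vecs[:, order]
    U = H @ vecs / np.sqrt(vals)         # map hyperedge vectors to documents
    return vals, U


if __name__ == "__main__":
    # Three hypothetical base clusterings of six documents.
    labelings = [[0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 2, 2], [0, 0, 0, 0, 1, 1]]
    H = build_incidence(labelings)
    vals, U = embed_via_small_eig(H, k=2)
    print(H.shape, U.shape)
```

The rows of `U` serve as low-dimensional document embeddings, which could then be passed to a conventional clustering step such as k-means. Note that the columns of `vecs` play the role of hyperedge embeddings, so the same identity also shows how document embeddings can be recovered indirectly from hyperedge embeddings.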