An Exemplar Selection Algorithm for Protein Structures Clustering
Abstract: This paper proposes an exemplar selection algorithm (ESA) for protein structure clustering. Clustering is an indispensable post-processing step in protein structure prediction, but the quality threshold (QT) clustering algorithm commonly used there depends on an empirically chosen clustering radius, and other clustering algorithms, such as affinity propagation (AP), likewise have parameters that affect the resulting cluster distribution. To remove this dependence on subjective empirical parameters, ESA analyzes the clustering results obtained under different parameter settings, selects the best exemplar (cluster center), and from it determines empirical parameters such as the clustering radius. In experiments on real protein structure data sets, the algorithm selected the best exemplar without prior knowledge of the empirical parameters, and it also supports finding objective clustering parameters suited to a given data set for different clustering algorithms.
Key words:
- protein structure
- clustering
- quality threshold (QT)
- affinity propagation (AP)
- exemplar selection
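
The abstract's remark that QT clustering depends on an empirically chosen radius can be illustrated with a small sketch. The code below is not the paper's ESA, and it is only a simplified, radius-based variant of QT-style clustering: it repeatedly picks the item with the most neighbors within `radius` as the exemplar and removes that neighborhood. The function name `radius_cluster`, the toy 1-D points, and the use of absolute difference in place of a structural distance such as RMSD are all assumptions made for illustration.

```python
def radius_cluster(dist, radius):
    """Greedy radius-based clustering over a symmetric distance matrix.

    Simplified, hypothetical sketch: returns a list of
    (exemplar_index, member_indices) pairs.
    """
    remaining = set(range(len(dist)))
    clusters = []
    while remaining:
        best_center, best_members = None, []
        for i in sorted(remaining):
            # Remaining items within `radius` of candidate i (i itself included).
            members = [j for j in sorted(remaining) if dist[i][j] <= radius]
            if len(members) > len(best_members):
                best_center, best_members = i, members
        clusters.append((best_center, best_members))
        remaining -= set(best_members)
    return clusters


# Toy data: 1-D points stand in for structures; |a - b| stands in for RMSD.
points = [0.0, 1.0, 2.0, 10.0, 11.0]
dist = [[abs(a - b) for b in points] for a in points]

# The chosen exemplars change with the radius.
exemplars_by_radius = {
    r: [center for center, _ in radius_cluster(dist, r)]
    for r in (0.5, 1.5, 20.0)
}
print(exemplars_by_radius)
```

On this toy input, a small radius leaves every point as its own exemplar, an intermediate radius yields two clusters, and a large radius merges everything into one, which is the parameter sensitivity the abstract describes.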