Pattern Matching with Arbitrary-length Wildcards
-
摘要: 基因序列中,许多病毒并不是简单的直接复制自己,而是相邻字符间插入或者删除序列片段,如何从序列数据中检索这些病毒具有重要的研究价值.提出了一个更普遍的问题,带任意长度通配符的模式匹配问题(Pattern matching with arbitrary-length wildcards,PMAW),这里模式中不仅可以有多个通配符约束,而且每个通配符的约束可以是两个整数,也可以从整数到无穷大.给定序列S和带通配符的模式P,目标是从S中检索P的所有出现和每一次出现的匹配位置,并且要求任意两次出现不能共享序列中同一位置.为了有效地解决该问题,设计了两个基于位并行的匹配算法MOTW (Method of ocurrence then window)算法和MWTO (Method of window then ocurrence)算法.同时,MWTO算法进行细微改动就可以满足全局长度约束.实验结果既验证了算法求解问题的正确性,又验证了比相关的模式匹配算法具有更好的时间性能.Abstract: In genetic sequences, many viruses rarely reproduce themselves, but rather appear with a slightly different form in each of the occurrences. That is, sequence fragments may be inserted or deleted in adjacent characters. How to search for these viruses from the sequences has become an important research task. The paper presents a more general problem, named pattern matching with arbitrary-length wildcards (PMAW). Here, a pattern can have many wildcard constraints where the range of the wildcards may vary between two integer bounds or from an integer lower bound to infinity. Given sequence S and pattern P with arbitrary-length wildcards, this paper aims to search for all occurrences of P in S, and locate matching positions of each occurrence, where any two occurrences can not share the same position of S. In order to solve the problem effectively, two algorithms, MOTW (Method of ocurrence then window) and MWTO (Method of window then ocurrence), are proposed based on the bit-parallel technique. The MWTO algorithm can also meet the global length constraint with a minor modification. Experimental results validate the correctness of the proposed algorithms, and show that they perform better than the existing pattern matching algorithms.
-
Key words:
- Wildcard /
- pattern matching /
- bit-parallel /
- genetic sequence
-
[1] Pisanti N, Crochemore M, Grossi R, Sagot M F. Bases of motifs for generating repeated patterns with wildcards. IEEE/ACM Transactions on Computational Biology Bioinformatics, 2005, 2(1): 40-50 [2] Yue Feng, Sun Liang, Wang Kuan-Quan, Wang Yong-Ji, Zuo Wang-Meng. State-of-the-art of cluster analysis of gene expression data. Acta Automatica Sinica, 2008, 34(2): 113-120(岳峰, 孙亮, 王宽全, 王永吉, 左旺孟. 基因表达数据的聚类分析研究进展. 自动化学报, 2008, 34(2): 113-120) [3] Rahman M S, Iliopoulos C S, Lee I, Mohamed M, Smyth W F. Finding patterns with variable length gaps or don't cares. Computing and Combinatorics. Berlin, Heidelberg: Springer, 2006. 146-155 [4] Ding B L, Lo D, Han J W, Khoo S C. Efficient mining of closed repetitive gapped subsequences from a sequence database. In: Proceedings of the 25th International Conference on Data Engineering. Shanghai, China: IEEE, 2009. 1024-1035 [5] Wu X D, Zhu X Q, He Y, Arslan A N. PMBC: Pattern mining from biological sequences with wildcard constraints. Computers in Biology and Medicine, 2013, 43(5): 481-492 [6] Wang Bao-Xun, Liu Bing-Quan, Sun Cheng-Jie, Wang Xiao-Long, Sun Lin. Thread segmentation based answer detection in Chinese online forums. Acta Automatica Sinica, 2009, 39(1): 11-20(王宝勋, 刘秉权, 孙承杰, 王晓龙, 孙林. 基于论坛话题段落划分的答案识别. 自动化学报, 2013, 39(1): 11-20) [7] Pan Yun-He, Wang Jin-Long, Xu Cong-Fu. State-of-the-art on frequent pattern mining in data streams. Acta Automatica Sinica, 2006, 32(4): 594-602(潘云鹤, 王金龙, 徐从富. 数据流频繁模式挖掘研究进展. 自动化学报, 2006, 32(4): 594-602) [8] Tanbeer S K, Ahmed C F, Jeong B S, Lee Y K. Efficient frequent pattern mining over data streams. In: Proceedings of the 17th ACM Conference on Information and Knowledge Management. California, USA: ACM, 2008. 1447-1448 [9] Altschul S F, Gish W, Miller W, Myers E W, Lipman D J. Basic local alignment search tool. Journal of Molecular Biology, 1990, 215(3): 403-410 [10] Fischer M J, Paterson M S. String-matching and other products. Proceeding of the 7th SIAM AMS Complexity of Computation. Cambridge, USA, 1974. 113-125 [11] Clifford P, Clifford R. Simple deterministic wildcard matching. Knowledge and Information Systems, 2007, 101(2): 53-54 [12] Cole R, Gottlieb L A, Lewenstein M. Dictionary matching and indexing with errors and don't cares. In: Proceedings of the 36th Annual ACM Symposium on Theory of Computing. New York, USA: ACM, 2004. 91-100 [13] Kucherov G, Rusinowitch M. Matching a set of strings with variable length don't cares. Theoretical Computer Science, 1997, 78(1-2): 129-154 [14] Chen G, Wu X D, Zhu X Q, Arslan A N, He Y. Efficient string matching with wildcards and length constraints. Knowledge and Information Systems, 2006, 10(4): 399-419 [15] Wu You-Xi, Liu Ya-Wei, Guo Lei, Wu Xin-Dong. Subnettrees for strict pattern matching with general gaps and length constraints. Journal of Software, 2013, 24(5): 915-932(武优西, 刘亚伟, 郭磊, 吴信东. 子网树求解一般间隙和长度约束严格模式匹配. 软件学报, 2013, 24(5): 915-932) [16] Wu You-Xi, Wu Xin-Dong, Jiang He, Min Fan. A heuristic algorithm for MPMGOOC. Chinese Journal of Computers, 2011, 34(8): 1452-1462(武优西, 吴信东, 江贺, 闵帆. 一种求解MPMGOOC问题的启发式算法. 计算机学报, 2011, 34(8): 1452-1462) [17] Bille P, Görtz I L, Vildhój H W, Wind D K. String matching with variable length gaps. Theoretical Computer Science, 2012, 443(1): 25-34 [18] Wu Xin-Dong, Xie Fei, Huang Yong-Ming, Hu Xue-Gang, Gao Jun. Mining sequential patterns with wildcards and the one-off condition. Journal of Software, 2013, 24(8): 1804-1815(吴信东, 谢飞, 黄咏明, 胡学钢, 高隽. 带通配符和One-Off条件的序列模式挖掘. 软件学报, 2013, 24(8): 1804-1815) [19] Ji X N, Bailey J, Dong G Z. Mining minimal distinguishing subsequence patterns with gap constraints. Knowledge and Information Systems, 2007, 11(3): 259-286 [20] Bille P, Thorup M. Languages and Programming. Berlin: Springer Heidelberg, 2009. 171-182 [21] Navarro G, Raffinot M. New techniques for regular expression searching. Algorithmica, 2005, 41(2): 89-116 [22] Indyk P. Faster algorithms for string matching problems: matching the convolution bound. In: Proceedings of the 39th Symposium on Foundations of Computer Science. Washington, DC, USA: IEEE, 1998. 166-173 [23] Kalai A. Efficient pattern-matching with don't cares. In: Proceedings of the 13th ACM-SIAM Symposium on Discrete Algorithms. Philadelphia, PA, USA: ACM, 2002. 655-656 [24] Manber U, Baeza-Yates R. An algorithm for string matching with a sequence of don't cares. Information Processing Letters, 1991, 37(3): 133-136 [25] Navarro G, Raffinot M. Fast and simple character classes and bounded gaps pattern matching, with applications to protein searching. Journal of Computational Biology, 2003, 10(6): 903-923 [26] Min F, Wu X D, Lu Z Y. Pattern matching with independent wildcard gaps. In: Proceedings of the 8th IEEE International Conference on Dependable, Autonomic and Secure Computing. Chengdu, China: IEEE, 2009. 194-199 [27] Ferreira P G, Azevedo P J. Protein sequence pattern mining with constraints. In: Proceedings of the 9th European Conference on Principles and Practice of Knowledge Discovery in Databases. Berlin Heidelberg: Springer, 2005. 96-107 [28] Aho A V, Corasick M J. Efficient string matching: an aid to bibliographic search. Communications of the ACM, 1975, 18(6): 333-340 [29] Guo D, Hong X L, Hu X G, Gao J, Liu Y L, Wu G Q, Wu X D. A bit-parallel algorithm for sequential pattern matching with wildcards. Cybernetics and Systems, 2011, 42(6): 382-401 [30] Blumer A, Blumer J, Haussler D, Ehrenfeucht A, Chen M T, Seiferas J. The smallest automation recognizing the subwords of a text. Theoretical Computer Science, 1985, 40(1): 31-55 [31] Zhang M, Zhang Y, Hu L. A faster algorithm for matching a set of patterns with variable length don't cares. Information Processing Letters, 2010, 110(6): 216-220 [32] Haapasalo T, Silvasti P, Sippu S, Soisalon-Soininen E. Online dictionary matching with variable-length gaps. In: Proceedings of the 10th International Symposium, SEA Kolimpari, Chania, Crete, Greece. Berlin Heidelberg: Springer, 2011. 76-87
点击查看大图
计量
- 文章访问数: 1939
- HTML全文浏览量: 93
- PDF下载量: 1041
- 被引次数: 0