Verbatim Implementation of a Fast and Space-Efficient Indexed Pattern Matching Algorithm

Institution: | University of Helsinki |
---|---|

Department: | |

Year: | 2016 |

Keywords: | Tietojenkäsittelytiede |

Posted: | 02/05/2017 |

Record ID: | 2086480 |

Full text PDF: | http://hdl.handle.net/10138/167173 |

Approximate string matching considers finding a given text pattern in another text while allowing some number of differences. In the offline version of the problem, the text is known beforehand and is processed to generate an indexing data structure. While the problem has received a lot of attention and it has many practical uses in bioinformatics, the common tools often do not make use of the algorithms the time and space complexities of which are the best ones known. Hence it is interesting to compare the performance of an efficient algorithm to tools that make use of heuristics. In this work, a pattern matching algorithm by T.-W. Lam, W.-K. Sung and S.-S. Wong is described. An implementation of the algorithm is provided and tested against two other tools, namely Erne 2 and readaligner 2012. The algorithm by Lam, Sung and Wong searches the text for the pattern while allowing one mismatch or difference, that is, also allowing character insertion and deletion. It makes use of certain types of compressed suffix array and compressed suffix tree that provide fast operations. Additionally, to restrict the search to relevant parts of the suffix tree, a sample is taken from the suffix array and the sampled indices are stored into a data structure that provides double logarithmic worst case range queries. To find the pattern in the text while allowing k errors, the algorithm is combined with a dynamic programming algorithm. The latter is used to find partial matches with k - 1 errors. The candidate occurrences are located from the suffix tree and these branches are used with Lam, Sung and Wong's algorithm. For a text of length n and a pattern of length m drawn from an alphabet of size σ, the time complexity of the algorithm is O(σ^k m^k (k + log log n) + occ) using an O(n (log n)^(1/2) log σ)-bit indexing data structure, where occ is the number of occurrences in the text, given that σ is O(2^((log n)^(1/2))). For short patterns, this is the best known time complexity with an indexing data structure of the given size. The results indicate that in practice relying on heuristics yields better results in terms of time and memory use. While such an outcome is not remarkable, some important data structures were implemented in the process. An implementation of S. S. Rao's compressed suffix array already existed but it was rewritten to allow using different supporting data structures for e.g. rank and select support. The inverse suffix array described by Lam, Sung and Wong was implemented. Also, while implementations of X and Y-fast were available, to the author's knowledge a publicly available implementation of these combined with perfect hashing had not been produced before. Moreover, no implementation of Lam, Sung and Wong's algorithm was known to exist before.