|Institution:||University of New South Wales|
|Department:||Computer Science & Engineering|
|Keywords:||Protein Sequence Similarity Search; Locality-Sensitive Hashing; MapReduce; Big Data; Cloud Computing; Metagenomics; ScalLoPS|
|Full text PDF:||http://handle.unsw.edu.au/1959.4/53681|
Metagenomic studies produce large datasets that are estimated to grow at a faster rate than the available computational capacity. A key step in the metagenomic workflows is protein sequence similarity search, which is computationally intensive over large datasets. Several techniques and tools have been developed for this purpose. However, these were designed and implemented for single-node computing infrastructure and are not able to handle the exponential increase in data. Several efforts, directed to improve the performance of protein sequence search through parallel and distributed computing, have enormously improved the performance of searching over large bioinformatic datasets. However, these are best suited for static deployment over large-scale computing infrastructure such as clusters or supercomputers. Access to such infrastructure is not available to most of the active scientists in the world. The recent arrival of cloud computing that enables availing computing as a utility, has resulted in broad access to distributed system infrastructure. Recently, popular protein sequence similarity search tools, such as BLAST, have been re-engineered to capitalise on cloud computing. Most of these efforts, however, follow a naive, static, ‘embarrassingly parallel’ approach and are not able to take advantage of scaling up through resource elasticity in cloud computing. This thesis presents a novel approach for protein sequence similarity search that builds on Locality-Sensitive Hashing (LSH), a well-known methodology for high-performance similarity search over large-scale, multi-dimensional sets of documents. LSH has been demonstrated to be easily adaptable for distribution over cloud infrastructure through Key-Value schemes such as MapReduce. Firstly, this thesis introduces a set of algorithms based on LSH that solves the challenges outlined previously. Secondly, the thesis details the design of ScalLoPS, a search tool similar to BLAST, that implements these and subsequent sequence alignment algorithms, using the MapReduce distributed programming paradigm. Finally, the performance, sensitivity, and scalability of ScalLoPS is compared to current-state tools such BLAST using datasets derived from actual metagenomic workflows, and on resources of a cloud provider. ScalLoPS is shown to be able to scale up in an inverse exponential distribution over the number of processing nodes, while approximating the quality of BLAST results.