Accurate and fast taxonomic profiling of microbial communities

by Damon Shahrivar

Institution: KTH Royal Institute of Technology
Year: 2015
Keywords: Engineering and Technology; Electrical Engineering, Electronic Engineering, Information Engineering; Other Electrical Engineering, Electronic Engineering, Information Engineering; Master of Science - Wireless Systems
Record ID: 1357442
Full text PDF: http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-162919


With the advent of next generation sequencing, which yields millions of base pairs of DNA, there has been an explosion in the amount of data that needs to be processed. The rate at which the data grows outpaces Moore's law, so there is a huge demand for methods that find meaningful labels for sequenced data. Studying the microbial diversity of a sample is one such challenge in the field of metagenomics. Finding the distribution of a bacterial community has many uses, for example obesity control. Existing methods often resort to read-by-read classification, which can take several days of computing time in a regular desktop environment and shuts out genomic scientists without access to large clusters of computational units. By using sparsity-enforcing methods from the general field of sparse signal processing (such as compressed sensing), solutions have been found to the bacterial community composition estimation problem through a simultaneous assignment of all sample reads to a pre-processed reference database. The inference task is reduced to a general statistical model based on kernel density estimation techniques, which is solved with existing convex optimization tools. The objective is to offer a reasonably fast community composition estimation method. This report proposes clustering as a means of aggregating data to improve the run-time and biological fidelity of existing techniques. The use of convex optimization tools to increase the accuracy of mixture model parameters is also explored and tested. The work is concluded by experiments on the proposed improvements, with satisfactory results. The use of Dirichlet mixtures is explored as a parametric model of the sample distribution; the Dirichlet is deemed a good choice for aggregating k-mer feature vectors, but Expectation Maximization is found unfit for parameter estimation of bacterial 16S rRNA samples.
Finally, a semi-supervised learning method founded on distance-based classification of taxa has been implemented and tested on real biological data with high biological fidelity. ; New DNA sequencing techniques have given rise to an explosion in the amount of available data. Next generation DNA sequencing generates base pairs numbering in the millions, and the amount of data is growing at an exponential rate, which is why there is a great need for new scalable methods that can analyze quantitative data to extract relevant information. The bacterial species distribution of a test-tube sample is one such problem in metagenomics, with several areas of application, for example studies of obesity. At present, the most common way to obtain the species distribution is to classify the DNA strands of the bacteria, a time-consuming solution that can take up to a day to process…
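
The abstract describes estimating community composition by assigning all sample reads at once to a reference database of k-mer feature vectors, using convex optimization. The sketch below illustrates that general idea only and is not the thesis's implementation. It assumes a plain least-squares objective over the probability simplex, solved by projected gradient descent, in place of the kernel density estimation model mentioned above. The function names (kmer_vector, project_to_simplex, estimate_composition) and the toy sequences are hypothetical.

```python
from collections import Counter
from itertools import product


def kmer_vector(seq, k=2):
    """Return the normalized k-mer frequency vector of a DNA sequence."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]


def project_to_simplex(v):
    """Euclidean projection of v onto {x : x >= 0, sum(x) = 1}."""
    u = sorted(v, reverse=True)
    cumulative, theta = 0.0, 0.0
    for i, ui in enumerate(u, start=1):
        cumulative += ui
        t = (cumulative - 1.0) / i
        if ui - t > 0:
            theta = t  # keep the threshold from the last index that passes
    return [max(x - theta, 0.0) for x in v]


def estimate_composition(reference, sample, steps=2000):
    """Minimize ||sum_j x_j * reference[j] - sample||^2 over the simplex."""
    n, d = len(reference), len(sample)
    # Step size 1/L, where L is bounded by twice the squared Frobenius norm.
    lipschitz = 2.0 * sum(a * a for row in reference for a in row)
    step = 1.0 / lipschitz
    x = [1.0 / n] * n  # start from the uniform mixture
    for _ in range(steps):
        residual = [sum(x[j] * reference[j][i] for j in range(n)) - sample[i]
                    for i in range(d)]
        grad = [2.0 * sum(reference[j][i] * residual[i] for i in range(d))
                for j in range(n)]
        x = project_to_simplex([x[j] - step * grad[j] for j in range(n)])
    return x


# Toy data (assumed for illustration): three synthetic "reference genomes".
references = [kmer_vector("AC" * 20), kmer_vector("GT" * 20), kmer_vector("AG" * 20)]
true_mix = [0.7, 0.3, 0.0]
# A noise-free sample profile built as a mixture of the reference profiles.
sample = [sum(w * r[i] for w, r in zip(true_mix, references)) for i in range(16)]
estimate = estimate_composition(references, sample)
print([round(w, 3) for w in estimate])
```

The simplex constraint keeps the estimated weights non-negative and summing to one, so they can be read as relative abundances, and it tends to push the weights of absent taxa to exactly zero. Real 16S rRNA data would need longer k-mers, a much larger reference set, and a noise model, none of which are covered here.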