
Patterns and algorithms in high-throughput sequencing count data

by Alessandro Mammana




Institution: Freie Universität Berlin
Year: 2016
Posted: 02/05/2017
Record ID: 2087123
Full text PDF: http://edocs.fu-berlin.de/diss/receive/FUDISS_thesis_000000102558


Abstract

Proteins that interact with the genome, such as histones and transcription factors, play a major role in regulating gene expression. These interactions can be detected with ChIP-seq, which yields sequences of non-negative integers, called count signals, that quantify the presence of a given protein at each genomic locus. However, the computational analysis of count signals is challenging because the biological patterns are complex and the datasets are large. In this thesis, we propose accurate and efficient algorithms for three pattern detection problems in count signals.

First, we present an algorithm that infers the genomic locations of positioned nucleosomes from histone ChIP-seq experiments. This method can integrate measurements for different histone marks and uses a wavelet to detect the count pattern corresponding to positioned nucleosomes. When compared with previous approaches on biological and simulated data, our method shows higher precision and shorter runtimes.

Next, we introduce an algorithm that annotates genomic regions according to the regulatory processes acting on them. The labels of this annotation, called chromatin states, are learned automatically from the measurements of multiple histone marks. Unlike previous approaches, our method characterizes chromatin states with a rigorous probabilistic model of the count signals. The resulting annotation is shown to be more strongly associated with DNA accessibility and transcription, as well as more robust and comprehensive, than those produced by previous approaches.

Lastly, we present an algorithm for finding transcription factor binding sites in data from ChIP-exo, a method similar to ChIP-seq. Our algorithm simultaneously learns the genomic sequences that attract the transcription factor (the motif) and the count pattern observable at binding sites (the footprint). We show that our method finds the correct motif and detects interpretable footprints in four different datasets. Moreover, our approach can distinguish different categories of binding sites in the same experiment.

Overall, the proposed algorithms represent an advancement in the automatic detection of biological patterns, as they are more accurate and in some cases considerably faster than existing approaches. They are also based on a mathematical framework that is general and likely to be important for future research.
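
As a rough illustration of the first problem named in the abstract, the sketch below builds a count signal from read positions and scores each locus by convolving the signal with a wavelet. It is not the thesis's algorithm: the abstract does not say which wavelet is used, so the Ricker ("Mexican hat") wavelet, the scale `sigma`, the threshold `min_score`, and all function names are assumptions chosen only to make the idea concrete.

```python
import numpy as np


def count_signal(read_positions, genome_length):
    """Count reads per locus, giving a sequence of non-negative integers."""
    return np.bincount(np.asarray(read_positions), minlength=genome_length)


def ricker(width, sigma):
    """Ricker wavelet sampled at `width` integer points, centered on the middle.

    Assumption: the wavelet family is an illustrative choice, not the one
    used in the thesis.
    """
    t = np.arange(width) - (width - 1) / 2.0
    x = (t / sigma) ** 2
    return (1.0 - x) * np.exp(-x / 2.0)


def candidate_positions(counts, sigma=40.0, min_score=0.0):
    """Return loci where the wavelet score has a strict local maximum.

    `sigma` and `min_score` are hypothetical tuning parameters. The signal
    should be longer than the kernel (8 * sigma + 1 points), otherwise
    np.convolve(mode="same") returns an array of the kernel's length.
    """
    kernel = ricker(int(8 * sigma) + 1, sigma)
    # The kernel is symmetric, so convolution equals correlation here.
    score = np.convolve(counts, kernel, mode="same")
    inner = score[1:-1]
    is_peak = (inner > score[:-2]) & (inner > score[2:]) & (inner > min_score)
    return np.flatnonzero(is_peak) + 1, score
```

Calling `candidate_positions(count_signal(reads, n))` on a list of read start positions returns the candidate loci together with the full score track. A single fixed-scale convolution like this ignores the integration of several histone marks that the abstract mentions.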