
Methods for DNA Methylation Sequencing Analysis and their Application on Cancer Data

by Helene Kretzmer




Institution: Universität Leipzig
Department:
Year: 2016
Posted: 02/05/2017
Record ID: 2065871
Full text PDF: http://nbn-resolving.de/urn:nbn:de:bsz:15-qucosa-203416


Abstract

The fundamental subject of this thesis is the development of tools for the analysis of DNA methylation data as well as their application to bisulfite sequencing data comprising a large number of samples. DNA methylation is one of the major epigenetic modifications. It affects the cytosines of the DNA and is essential for the normal development of cells and tissues. Unusual alterations are associated with a variety of diseases and, especially in cancerous tissues, global changes in the DNA methylation level have been detected. To sequence DNA methylation at single-nucleotide resolution, the sequences are treated with sodium bisulfite before sequencing, whereby unmethylated cytosines are represented as thymines. Thus, specialized techniques are required to process and analyze this kind of data.

Here, the bisulfite analysis toolkit BAT is introduced, which is designed to facilitate a quick analysis of bisulfite-treated DNA methylation sequencing data. It covers all steps from processing raw sequencing data up to calling of differential DNA methylation. At the beginning of the analysis, sodium bisulfite-treated sequence data are aligned and DNA methylation rates for each covered cytosine in the reference genome are called. Subsequently, BAT integrates annotation data and performs basic analyses, i.e., methylation rate distribution plots and hierarchical clustering of the samples. In addition, calling of differentially methylated regions is performed and statistics of the called regions are automatically created. Finally, the integration of DNA methylation and gene expression data is covered by the calculation of correlating regions.

Secondly, a novel algorithm, metilene, for the calculation of differentially methylated regions (DMRs) between two groups of samples is introduced. Existing methods are limited in terms of detection sensitivity as well as time and memory consumption.
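To make the methylation rate calling step concrete, the following is a minimal sketch and not BAT's actual implementation. It assumes that, for each covered cytosine, the aligner has already counted the reads showing C (methylated) and T (unmethylated) on the strand carrying that cytosine. The function names and the input layout are hypothetical.

```python
from typing import Dict, Optional, Tuple


def methylation_rate(c_count: int, t_count: int) -> Optional[float]:
    """Fraction of reads that support methylation at one cytosine.

    After bisulfite treatment, a read showing C indicates a methylated
    cytosine and a read showing T indicates an unmethylated one.
    Returns None when the position is not covered by any read.
    """
    covered = c_count + t_count
    if covered == 0:
        return None
    return c_count / covered


def call_rates(counts: Dict[int, Tuple[int, int]]) -> Dict[int, Optional[float]]:
    """Map each position to its methylation rate.

    `counts` is a hypothetical layout: position -> (C reads, T reads).
    """
    return {pos: methylation_rate(c, t) for pos, (c, t) in counts.items()}


if __name__ == "__main__":
    example = {101: (3, 1), 105: (0, 4), 110: (0, 0)}
    for pos, rate in call_rates(example).items():
        print(pos, rate)
```

A real pipeline would additionally have to handle strand assignment, sequencing errors, and minimum coverage thresholds, which this sketch leaves out.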
Our approach is based on a circular binary segmentation, using a scoring function to detect sub-regions that show a stronger difference between the mean methylation levels of two groups than the surrounding background. These sub-regions are tested using a two-dimensional Kolmogorov-Smirnov test (2D-KS test) [Fasano 1987] for significant differences, taking all samples of each group into account. The use of the non-parametric 2D-KS test avoids assumptions about a background distribution. Furthermore, the two dimensions of the problem, i.e., (i) the detection of a region, such that (ii) the methylation rates of the samples in the groups are significantly different, are taken into account in a single test. The algorithm calls DMRs in a sufficiently short time on single-sample comparisons as well as on about 50 samples per group. Furthermore, it works on whole-genome bisulfite sequencing (WGBS) and reduced representation bisulfite sequencing (RRBS) data and is able to estimate missing data points from the methylation rates of other samples in the group. Benchmarks on simulated and real data sets show that metilene outperforms other existing methods and is…

Advisors/Committee Members: Stadler, Peter F. (referee), Vingron, Martin (referee).
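The idea of scoring sub-regions by the difference between group mean methylation levels, as described in the abstract, can be illustrated with a small sketch. This is not metilene's algorithm: it swaps in a plain maximum-scoring-segment scan (Kadane's algorithm) in place of circular binary segmentation, uses a made-up score (absolute mean difference minus a fixed penalty), and performs no 2D-KS test. All names and the `penalty` parameter are assumptions chosen for illustration.

```python
from statistics import mean
from typing import List, Tuple


def mean_differences(group_a: List[List[float]],
                     group_b: List[List[float]]) -> List[float]:
    """Per-position difference of the two group mean methylation rates.

    Each group is a list of samples; each sample is a list of rates,
    one per position, with all samples covering the same positions.
    """
    n_positions = len(group_a[0])
    return [
        mean(sample[i] for sample in group_a)
        - mean(sample[i] for sample in group_b)
        for i in range(n_positions)
    ]


def best_segment(diffs: List[float],
                 penalty: float = 0.1) -> Tuple[int, int, float]:
    """Find the contiguous segment maximizing sum(|diff| - penalty).

    Returns (start, end, score) with a half-open range [start, end).
    Using |diff| ignores the direction of the change, which is a
    simplification of this sketch.
    """
    best_score, best_start, best_end = 0.0, 0, 0
    cur_score, cur_start = 0.0, 0
    for i, d in enumerate(diffs):
        gain = abs(d) - penalty
        if cur_score <= 0:
            # Restart the running segment at this position.
            cur_score, cur_start = gain, i
        else:
            cur_score += gain
        if cur_score > best_score:
            best_score, best_start, best_end = cur_score, cur_start, i + 1
    return best_start, best_end, best_score


if __name__ == "__main__":
    group_a = [[0.1, 0.8, 0.9, 0.1], [0.2, 0.9, 0.8, 0.2]]
    group_b = [[0.1, 0.2, 0.1, 0.1], [0.2, 0.1, 0.2, 0.2]]
    print(best_segment(mean_differences(group_a, group_b)))
```

In this toy input, only the two middle positions differ between the groups, so the scan selects them as the candidate region. A candidate found this way would still need a significance test across all samples, which is the role the abstract assigns to the 2D-KS test.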