Prediction of transcription factor co-occurrence using rank based statistics

by Alena van Bömmel

Institution: Freie Universität Berlin
Department: FB Mathematik und Informatik
Degree: PhD
Year: 2015
Record ID: 1118645
Full text PDF: http://edocs.fu-berlin.de/diss/receive/FUDISS_thesis_000000099081


One of the key questions in molecular biology is how cells with the same genetic code are able to differentiate into a large variety of cell types. Cell differentiation is controlled through the regulation of gene expression, a cellular mechanism that activates only a specific part of the genetic information. Among the main components of gene regulatory mechanisms are specific proteins called transcription factors (TFs). TFs bind with sequence preferences to regulatory regions in the DNA to control the expression of their target genes. They usually do not act alone but in a combinatorial manner, thus regulating cell-type-specific gene expression. This combinatorial cooperation of TFs is critical for achieving cell-type specificity. However, experimental techniques that can detect the combinatorial cooperation of TFs on the DNA are scarce. The aim of this thesis is to predict the co-occurrence of TFs in regulatory genomic regions using the estimated binding affinity of TFs to DNA. In detail, transcription factors are represented by ranked lists of their target genes, and several rank-based statistics are then applied to detect significant associations between TF pairs. In the second part, tissue-specific co-occurrence of TFs is assessed, which is of much greater interest than general TF co-occurrence. Including additional information about the tissue specificity of the corresponding genomic regions introduces a third dimension (a third ranked list) into the association measure. Thus, the problem of the association of two TFs in tissue-specific promoters is translated into a 3-way contingency table, and the significance of the association of the two TFs can then be assessed with the corresponding statistical tests. However, the choice of the null model for the table has a major impact on the results obtained.
Since there is no general rule for choosing the underlying null model in the analysis of TF co-occurrence, we developed a new strategy to select the most appropriate model. These results were previously published (Myšicková and Vingron, 2012). We then use newly available experimental data on DNA accessibility, measured with the DNase-seq technique across many different cell types. This novel data set requires a new method to find associated TFs. Here, we define a log ratio of two p-values of Fisher’s exact test: the first is derived from cell-type-specific open DNase-hypersensitive sites (CTS-DHSs); the second is derived from ubiquitously open DHSs. Thus, TF pairs with a large log ratio are strongly associated in the CTS-DHSs but not associated in the ubiquitous DHSs. This approach ensures that the predicted associated TF pairs co-occur in a cell-type-specific manner. With both methods, we are able to predict a large number of co-occurring TF pairs in various human tissues. The predicted co-occurring TF pairs are in significant agreement with other computational studies and are enriched for known protein-protein interactions. In addition,…
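
The log ratio of two Fisher's exact test p-values described above can be illustrated with a minimal sketch. It is not the implementation used in the thesis, and several details are assumptions made only for illustration: the test is taken to be one-sided (testing for enrichment of co-occurrence), the logarithm is base 10, the ratio is oriented so that a large value means "associated in CTS-DHSs but not in ubiquitous DHSs", and TF binding is reduced to hypothetical sets of site identifiers. All function names are hypothetical.

```python
from math import comb, log10


def cooccurrence_table(sites_a, sites_b, all_sites):
    """Build the 2x2 table (a, b, c, d) for two TFs over a set of sites.

    a: sites bound by both TFs, b: only TF A, c: only TF B, d: neither.
    """
    sites_a = set(sites_a) & set(all_sites)
    sites_b = set(sites_b) & set(all_sites)
    a = len(sites_a & sites_b)
    b = len(sites_a - sites_b)
    c = len(sites_b - sites_a)
    d = len(set(all_sites)) - a - b - c
    return a, b, c, d


def fisher_right_tail(a, b, c, d):
    """One-sided Fisher's exact test: P(overlap >= a) under independence.

    Computed from the hypergeometric distribution with fixed margins.
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    tail = sum(comb(row1, x) * comb(n - row1, col1 - x)
               for x in range(a, upper + 1))
    return tail / comb(n, col1)


def log_ratio(p_cts, p_ubiquitous, floor=1e-300):
    """log10(p_ubiquitous / p_cts); the orientation is an assumption.

    The floor guards against taking the log of a p-value that
    underflowed to zero.
    """
    return log10(max(p_ubiquitous, floor)) - log10(max(p_cts, floor))


# Toy data (invented): 20 cell-type-specific and 20 ubiquitous sites.
cts_sites = range(20)
ubiq_sites = range(100, 120)

# In the CTS sites the two TFs always occur together ...
p_cts = fisher_right_tail(
    *cooccurrence_table(range(0, 10), range(0, 10), cts_sites))
# ... in the ubiquitous sites they overlap only as much as chance predicts.
p_ubiq = fisher_right_tail(
    *cooccurrence_table(range(100, 110), range(105, 115), ubiq_sites))

score = log_ratio(p_cts, p_ubiq)
print(f"p_cts={p_cts:.3g}  p_ubiq={p_ubiq:.3g}  log ratio={score:.2f}")
```

In this toy setting the pair receives a positive score because its co-occurrence is significant only in the cell-type-specific sites. A pair that co-occurs equally strongly in both site classes would score near zero.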