
Bayesian Stochastic Partition Models For Markovian Dependence Structures

by Väinö Jääskinen




Institution: University of Helsinki
Department: Department of Mathematics and Statistics
Year: 2015
Keywords: statistics
Record ID: 1145684
Full text PDF: http://hdl.handle.net/10138/152780


Abstract

Across many fields, the amount of potentially useful data is growing fast. A prime example is DNA sequence data. This growth is both an opportunity and a challenge, as new methods are needed to benefit from big data sets. This has sparked a fruitful line of research in statistics and computer science that can be called machine learning. In this thesis, we develop machine learning methods based on the Bayesian approach to statistics. We address a fairly general problem called clustering, i.e. dividing a set of objects into non-overlapping groups based on their similarity, and apply it to models with Markovian dependence structures. We consider sequence data over a finite alphabet and present a model class called the Sparse Markov chain (SMC). It is a special case of a Markov chain (MC) model and offers a parsimonious description of the data-generating mechanism. A Variable length Markov chain (VLMC) is a popular sparse model presented earlier in the literature, and it has a representation as an SMC model. We develop Bayesian clustering methodology for learning the SMC and other Markovian models. Another problem that we study in this thesis is causal inference. We present a model and an algorithm for learning causal mechanisms from data. The model can be considered a stochastic extension of the sufficient-component cause model that is popular in epidemiology. In our model there are several causal mechanisms, each with its own parameters. A mixture distribution gives the probability that an outcome variable is associated with a given mechanism. The applications considered in this thesis come mainly from computational biology. We cluster states of Markovian models estimated from DNA sequences. This gives an efficient description of the sequence data compared with methods reported in the literature.
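As a rough illustration of the sparse Markov chain idea described above, the following Python sketch groups the contexts of a second-order chain over the DNA alphabet into classes that share one next-symbol distribution. It is not the method of the thesis: the partition is fixed by hand rather than learned, the add-one smoothing (the posterior mean under a uniform Dirichlet prior) is an assumption, and the names `context_counts`, `pooled_estimates` and `free_parameters` are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import product

ALPHABET = "ACGT"
ORDER = 2  # assumed order of the underlying Markov chain


def context_counts(seq, order=ORDER):
    """Count next-symbol occurrences after every length-`order` context."""
    counts = defaultdict(Counter)
    for i in range(order, len(seq)):
        counts[seq[i - order:i]][seq[i]] += 1
    return counts


def pooled_estimates(counts, partition, alphabet=ALPHABET):
    """Return one shared next-symbol distribution per class of contexts.

    Counts are pooled over the contexts in a class; add-one smoothing is
    an assumption made for this sketch.
    """
    estimates = []
    for cls in partition:
        pooled = Counter()
        for ctx in cls:
            pooled.update(counts.get(ctx, {}))
        total = sum(pooled.values()) + len(alphabet)
        estimates.append({a: (pooled[a] + 1) / total for a in alphabet})
    return estimates


def free_parameters(n_classes, alphabet=ALPHABET):
    """Each class needs len(alphabet) - 1 free probabilities."""
    return n_classes * (len(alphabet) - 1)


# Hypothetical partition: contexts that end in the same symbol share a class.
all_contexts = ["".join(p) for p in product(ALPHABET, repeat=ORDER)]
partition = [[c for c in all_contexts if c[-1] == a] for a in ALPHABET]

seq = "ACGTACGTAC"  # toy sequence, not real data
estimates = pooled_estimates(context_counts(seq), partition)

print("full chain parameters:  ", free_parameters(len(all_contexts)))
print("sparse chain parameters:", free_parameters(len(partition)))
```

With this particular partition the sparse model coincides with a first-order chain, which shows how grouping contexts reduces the number of free parameters. In the thesis the partition itself is the object of Bayesian learning.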
We also cluster DNA sequences with Markov chains, which results in a method that can be used, for example, to estimate the composition of a bacterial community in a sample from which DNA is extracted. The causal model and the related learning algorithm are able to estimate mechanisms from fairly challenging data. We have developed the learning algorithms with big data sets in mind. Still, they need further development to handle ever larger data sets.

Abstract (translated from Finnish)

In science and technology, the amount of potentially useful data grows year by year. A good example is DNA sequence data, whose volume is growing especially as measurement instruments improve. This growth is both an opportunity and a challenge, because new methods are needed to make use of ever larger data sets. A new scientific field, machine learning, has emerged that combines methods and theory from statistics and computer science. In this doctoral thesis, which belongs to the field of statistics, machine learning methods have been developed starting from the Bayesian paradigm of statistics, which is based on modeling uncertainty by means of probabilities. A central problem is clustering: how…