
Truth finding in databases

by Bo Zhao

Institution: University of Illinois at Urbana-Champaign
Year: 2013
Keywords: data integration
Record ID: 2000839
Full text PDF: http://hdl.handle.net/2142/42470


In practical data integration systems, it is common for the data sources being integrated to provide conflicting information about the same entity. Consequently, a major challenge for data integration is to derive the most complete and accurate integrated records from diverse and sometimes conflicting sources. We term this challenge the truth finding problem. We observe that some sources are generally more reliable than others, and therefore a good model of source quality is the key to solving the truth finding problem. In this thesis, we propose probabilistic models that can automatically infer true records and source quality, without any supervision, on both categorical and numerical data. We further develop a new entity matching framework that incorporates source quality based on truth-finding models.

On categorical data, in contrast to previous methods, our principled approach models two distinct aspects of source quality by capturing the generative process behind two types of errors: false positives and false negatives. In so doing, ours is also the first approach designed to merge multi-valued attribute types. Our method is scalable thanks to an efficient sampling-based inference algorithm that needs very few iterations in practice and enjoys linear time complexity, with an even faster incremental variant. Experiments on two real-world datasets show that our new method outperforms existing state-of-the-art approaches to the truth finding problem on categorical data.

In practice, numerical data is not only ubiquitous but also of high value, e.g., prices, weather readings, census figures, polls, and economic statistics. Moreover, quality issues can be even more common and severe for numerical data than for categorical data, owing to its characteristics. Therefore, in this thesis we also propose a new truth-finding method designed specifically for numerical data.
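Before turning to numerical data, the two-sided source-quality idea for categorical claims can be illustrated with a minimal sketch. This is a toy iterative approximation, not the thesis's actual generative model or its sampling-based inference; the sources, claims, initialization, and smoothing constants below are all hypothetical:

```python
import math

# Hypothetical multi-valued claims: each source asserts a set of
# (entity, value) pairs, e.g. the authors it lists for a book.
claims = {
    "src_a": {("book1", "alice"), ("book1", "bob")},
    "src_b": {("book1", "alice"), ("book1", "carol")},
    "src_c": {("book1", "alice"), ("book1", "bob")},
}
candidates = set().union(*claims.values())

def truth_find(claims, candidates, iters=20):
    """Alternate between truth probabilities and two-sided source quality:
    sensitivity (1 - false-negative rate) and specificity
    (1 - false-positive rate), with light smoothing."""
    n = len(claims)
    # Initialize with the fraction of sources supporting each fact.
    prob = {c: sum(c in cl for cl in claims.values()) / n for c in candidates}
    for _ in range(iters):
        sens, spec = {}, {}
        for s, cl in claims.items():
            tp = sum(prob[c] for c in cl)               # true facts it claims
            fn = sum(prob[c] for c in candidates - cl)  # true facts it misses
            fp = sum(1 - prob[c] for c in cl)           # false facts it claims
            tn = sum(1 - prob[c] for c in candidates - cl)
            sens[s] = (tp + 0.1) / (tp + fn + 0.2)
            spec[s] = (tn + 0.1) / (tn + fp + 0.2)
        # Re-score each candidate fact by its log-odds under the quality model.
        for c in candidates:
            log_odds = 0.0  # uniform prior over true/false
            for s, cl in claims.items():
                if c in cl:
                    log_odds += math.log(sens[s] / (1 - spec[s]))
                else:
                    log_odds += math.log((1 - sens[s]) / spec[s])
            prob[c] = 1 / (1 + math.exp(-log_odds))
    return prob

probs = truth_find(claims, candidates)
```

Because two of the three sources agree on "bob" while only one claims "carol", the iteration drives the probabilities of "alice" and "bob" up and "carol" down, even though every value has at least one supporter.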
Based on Bayesian probabilistic models, our method leverages the characteristics of numerical data in a principled way when modeling the dependencies among source quality, truths, and claimed values. Experiments on two real-world datasets show that our new method outperforms existing state-of-the-art approaches in both effectiveness and efficiency. We further observe that modeling source quality not only helps decide the truth but also helps match entities across different sources. Therefore, as a natural next step, we integrate truth finding with entity matching so that entity matchings, true attribute values, and source quality can all be inferred jointly. This is the first entity matching approach that incorporates source-quality modeling and truth finding. Experiments show that our approach outperforms state-of-the-art baselines.
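A simplified flavor of quality-aware aggregation for numerical claims can be sketched as follows. This is a toy iterative scheme that derives source weights from normalized squared error, not the thesis's Bayesian model; the sources, entities, and values are all hypothetical:

```python
# Hypothetical numeric claims: source -> {entity: claimed value}.
claims = {
    "s1": {"city_pop": 100.0, "gdp": 50.0},
    "s2": {"city_pop": 102.0, "gdp": 51.0},
    "s3": {"city_pop": 180.0, "gdp": 90.0},  # a consistently-off source
}

def numeric_truth(claims, iters=10):
    """Alternate between quality-weighted truth estimates and per-source
    quality scores derived from normalized squared deviation, a
    simplified stand-in for a full Bayesian treatment."""
    entities = {e for vals in claims.values() for e in vals}
    # Start from the unweighted mean of the claimed values.
    truth = {
        e: sum(v[e] for v in claims.values() if e in v)
           / sum(1 for v in claims.values() if e in v)
        for e in entities
    }
    weight = {s: 1.0 for s in claims}
    for _ in range(iters):
        # Source quality: inverse of its mean squared relative error.
        for s, vals in claims.items():
            err = sum((vals[e] - truth[e]) ** 2 / max(truth[e] ** 2, 1e-9)
                      for e in vals) / len(vals)
            weight[s] = 1.0 / (err + 1e-6)
        # Truth: quality-weighted average of the claims.
        for e in entities:
            num = sum(weight[s] * v[e] for s, v in claims.items() if e in v)
            den = sum(weight[s] for s, v in claims.items() if e in v)
            truth[e] = num / den
    return truth

est = numeric_truth(claims)
```

Sources that deviate consistently from the consensus (like `s3` here) receive rapidly shrinking weight, so the estimates settle near the values of the mutually consistent sources rather than the naive three-way mean.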