
Visualization and interpretability in probabilistic dimensionality reduction models

by Alessandra Tosi




Institution: Universitat Politècnica de Catalunya
Department:
Year: 2014
Record ID: 1125852
Full text PDF: http://hdl.handle.net/10803/285013


Abstract

Over the last few decades, data analysis has swiftly evolved from being a task addressed mainly within the remit of multivariate statistics to an endeavour in which data heterogeneity, complexity and even sheer size, driven by computational advances, call for alternative strategies, such as those provided by pattern recognition and machine learning. Any data analysis process aims to extract new knowledge from data. Knowledge extraction is not a trivial task and it is not limited to the generation of data models or the recognition of patterns. The use of machine learning techniques for multivariate data analysis should in fact aim to achieve a dual target: interpretability and good performance. At best, both aspects of this target should not conflict with each other. This gap between data modelling and knowledge extraction must be acknowledged, in the sense that we can only extract knowledge from models through a process of interpretation. Exploratory information visualization is becoming a very promising tool for interpretation. When exploring multivariate data through visualization, high data dimensionality can be a big constraint, and the use of dimensionality reduction techniques is often compulsory. The need to find flexible methods for data modelling has led to the development of non-linear dimensionality reduction techniques, and many state-of-the-art approaches of this type fall in the domain of probabilistic modelling. These non-linear techniques can provide a flexible data representation and a more faithful model of the observed data compared to linear ones, but often at the expense of model interpretability, which has an impact on the model visualization results.

In manifold-learning non-linear dimensionality reduction methods, when a high-dimensional space is mapped onto a lower-dimensional one, the resulting embedded manifold is subject to local geometrical distortion induced by the non-linear mapping. This kind of distortion can often lead to misinterpretations of the data set structure and of the obtained patterns. Quantifying and visualizing this distortion is therefore important in order to interpret the data in a more faithful way.

The research reported in this thesis focuses on the development of methods and techniques for explicitly reintroducing the local distortion created by non-linear dimensionality reduction models into the low-dimensional visualization of the data that they produce, as well as on the definition of metrics for probabilistic geometries to address this problem. We provide methods not only for static data, but also for multivariate time series. The reintegration of the quantified non-linear distortion into the visualization space of the analysed non-linear dimensionality reduction methods is a goal in itself, but we go beyond it and consider alternative, adequate metrics for probabilistic manifold learning. For that, we study the role of random geometries, that is, distributions of manifolds, in machine learning and data…
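
As context (a sketch of the standard formulation, not text quoted from the thesis): in this family of models the local distortion of a smooth latent-to-data mapping is commonly quantified through the Riemannian metric that the mapping induces on the latent space, whose volume element is often called the magnification factor. Assuming a mapping f from latent coordinates x to data space with Jacobian J, one common formulation is:

% Assumed standard formulation of induced metric and magnification factor,
% not necessarily the exact definitions used in the thesis.
\[
  G(\mathbf{x}) \;=\; J(\mathbf{x})^{\top} J(\mathbf{x}),
  \qquad
  J_{ij}(\mathbf{x}) \;=\; \frac{\partial f_i(\mathbf{x})}{\partial x_j},
\]
\[
  \mathrm{MF}(\mathbf{x}) \;=\; \sqrt{\det G(\mathbf{x})}.
\]
% MF(x) close to 1 indicates a locally volume-preserving mapping; large values flag
% regions of the low-dimensional visualization that are strongly stretched in data space.

A quantity of this kind can be overlaid on the low-dimensional visualization so that strongly stretched or compressed regions of the embedding are visible at a glance.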