
Dimensionality reduction for visual exploration of similarity structures

by Jarkko Venna




Institution: Helsinki University of Technology; Teknillinen korkeakoulu
Department: Department of Computer Science and Engineering
Year: 2007
Keywords: Computer science; dimensionality reduction; exploratory data analysis; information retrieval; information visualization; manifold learning; Markov Chain Monte Carlo
Record ID: 1142760
Full text PDF: https://aaltodoc.aalto.fi/handle/123456789/2875


Abstract

Visualizations of similarity relationships between data points are commonly used in exploratory data analysis to gain insight into new data sets. Answers are sought to questions such as: Does the data consist of separate groups of points? What is the relationship of the previously known interesting data points to other data points? Which points are similar to the points known to be of interest? Visualizations can be used both to amplify the cognition of the analyst and to help in communicating interesting similarity structures found in the data to other people. One of the main problems faced in information visualization is that while the data is typically very high-dimensional, the display is limited to only two or at most three dimensions. Thus, for visualization, the dimensionality of the data has to be reduced. In general, it is not possible to preserve all pairwise relationships between data points in the dimensionality reduction process. This has led to the development of a large number of dimensionality reduction methods that focus on preserving different aspects of the data. Most of these methods were not developed to be visualization methods, which makes it hard to assess their suitability for the task of visualizing similarity structures. This problem is made more severe by the lack of suitable quality measures in the information visualization field. In this thesis, a new visualization task, visual neighbor retrieval, is introduced. It formulates information visualization as an information retrieval task. To assess the performance of dimensionality reduction methods in this task, two pairs of new quality measures are introduced and the performance of several dimensionality reduction methods is analyzed. Based on the insight gained into the existing methods, three new dimensionality reduction methods (NeRV, fNeRV and LocalMDS) aimed at the visual neighbor retrieval task are introduced.
All three new methods outperform other methods in numerical experiments; they vary in their speed and accuracy. A new color coding scheme, similarity-based color coding, is introduced in this thesis for visualization of similarity structures, and the applicability of the new methods in the task of creating graph layouts is studied. Finally, new approaches to visually studying the results and convergence of Markov Chain Monte Carlo methods are introduced.

Visualization of similarity relationships is often used in exploratory data analysis as the first step in examining a new data set. The aim is to form a preliminary picture of the structure of the data and to answer questions such as: Does the data divide into separate groups? What is the relationship of previously observed interesting data points to new, unknown data points? Which points are similar to the points known to be of interest? Visualization can both ease the analysis of the data and help in communicating the structures found. In information visualization, the data is typically high-dimensional. This is problematic, because on a display it is not possible to…
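The idea of treating a visualization as an information retrieval result can be illustrated with a small sketch. The abstract does not define the thesis's quality measures, so the code below is only an assumption-laden illustration using ordinary precision and recall: the neighbors of a point in the original space are taken as the relevant items, and its neighbors on the display as the retrieved items. The function names (`knn_sets`, `neighbor_precision_recall`), the neighborhood sizes, and the toy data are all hypothetical and are not taken from the thesis.

```python
import numpy as np


def knn_sets(points, k):
    """Return, for each point, the set of indices of its k nearest neighbors."""
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    order = np.argsort(dists, axis=1)[:, :k]
    return [set(row.tolist()) for row in order]


def neighbor_precision_recall(high, low, k_relevant, k_retrieved):
    """Mean precision and recall of display neighbors against input neighbors.

    Assumption: "relevant" = k_relevant nearest neighbors in the original
    space, "retrieved" = k_retrieved nearest neighbors on the display.
    """
    relevant = knn_sets(high, k_relevant)
    retrieved = knn_sets(low, k_retrieved)
    hits = np.array([len(r & s) for r, s in zip(relevant, retrieved)])
    precision = hits.mean() / k_retrieved
    recall = hits.mean() / k_relevant
    return precision, recall


# Toy data: 50 points in 10 dimensions (purely illustrative).
rng = np.random.default_rng(0)
data = rng.normal(size=(50, 10))

# A deliberately naive "reduction": keep only the first two coordinates.
projection = data[:, :2]

precision, recall = neighbor_precision_recall(data, projection, 5, 10)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

Under these assumptions, low precision means the display shows neighbors that are not true neighbors in the original data, and low recall means true neighbors are missing from the display neighborhood. A dimensionality reduction method generally has to trade one kind of error against the other.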