Comparison of Data Preprocessing Techniques on Software Sources for Topic Modeling

by John Willems




Institution: Open Universiteit Nederland
Department:
Year: 2014
Keywords: Latent Dirichlet Allocation; Topic Modeling; Rascal; Metrics; source code
Record ID: 1245218
Full text PDF: http://hdl.handle.net/1820/5350


Abstract

Studies have shown that topic modeling with Latent Dirichlet Allocation (LDA) is a useful (semi-)unsupervised technique for revealing previously unknown information about a software system. Topic modeling operates on unstructured data, yet we found no consensus in the literature on how to preprocess software source code to extract that data. In this thesis we look for the data preprocessing technique that leads to the best topic distribution for a given software system. To that end we set up an experiment that compares four data preprocessing techniques: two selected from the literature, one that we define ourselves, and one that takes the software source code as-is. To measure the differences between the four techniques we use structural coupling metrics. We develop software dedicated to our experiment in the domain-specific language Rascal and in Java. Results for two software systems suggest that there are only minor differences between the four techniques. This implies that we can use the software source code as-is for topic modeling. If future work confirms this preliminary result, it would significantly reduce the effort of applying topic modeling to software systems.
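
To make the contrast between an as-is input and a preprocessed input concrete, the following is a minimal sketch in Python. It is not the thesis software (which is written in Rascal and Java), and the abstract does not specify the four techniques. The steps shown (splitting identifiers on underscores and camelCase, lowercasing, dropping language keywords) are assumed here as typical examples of source code preprocessing, and the function names and the shortened keyword list are hypothetical.

```python
import re

# Hypothetical, deliberately short keyword list (an assumption for this sketch);
# a real run would use the full keyword set of the language being analyzed.
JAVA_KEYWORDS = {"public", "private", "class", "void", "int", "return", "new", "static"}


def tokens_as_is(source: str) -> list[str]:
    """Baseline: return every identifier-like token unchanged."""
    return re.findall(r"[A-Za-z_][A-Za-z0-9_]*", source)


def tokens_preprocessed(source: str) -> list[str]:
    """Split identifiers into words, lowercase them, and drop keywords."""
    words = []
    for token in tokens_as_is(source):
        for part in token.split("_"):
            # Split camelCase and PascalCase while keeping acronyms such as
            # "HTTP" together; digits are discarded by this pattern.
            pieces = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", part)
            words.extend(piece.lower() for piece in pieces)
    return [word for word in words if word not in JAVA_KEYWORDS]


if __name__ == "__main__":
    snippet = (
        "public class OrderService { private int max_retry_count; "
        "void parseHTTPResponse() {} }"
    )
    print(tokens_as_is(snippet))
    print(tokens_preprocessed(snippet))
```

Either token list could then be handed to an LDA implementation as the "document" for a source file. The preprocessed variant yields fewer, more word-like terms, whereas the as-is variant skips this step entirely.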