Defining Tags by Linking to Knowledge Bases

by Geir Ivar Hanssen




Institution: Norwegian University of Science and Technology
Year: 2014
Record ID: 1279335
Full text PDF: http://urn.kb.se/resolve?urn=urn:nbn:no:ntnu:diva-24813


Abstract

This thesis looks into the process of automatically expanding image searches based on tags and the definitions of terms from public knowledge bases. To this end, we try to extract terms related to a query. The process of finding these terms is known as feature extraction. In this thesis, the text collection on which we perform feature extraction consists of text retrieved from public knowledge bases using the original query. In other words, the program first retrieves related documents. It then picks out related terms using either Chi-Squared or an approach I have coined "Neighbouring Terms", or NT. The latter is much quicker to process and may give good precision in term extraction without requiring such demanding processing beforehand.

This thesis also looks into different variables in these kinds of processes to find the best configuration for both Chi-Squared and NT. Because the automatic term extraction is set to work on a limited number of articles, there is a question of how many articles are needed to get the best results. There are also several similarity models to consider when building a system like this. For that reason, this thesis compares the results obtained with the Vector Space model, Okapi BM25 and the Language Model. Other variables examined are whether term pre-processing, such as stop word removal and stemming, is beneficial; whether searching for abstracts by their title or by their contents gives the best results; and how many terms a query can be expanded with without losing too much relatedness.

To evaluate the terms suggested by these methods, this thesis looks into the P@n values for 20 queries and uses metrics such as MAP (Mean Average Precision) to evaluate the overall results for each approach. To avoid biased evaluation, we also perform a user survey.
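As an illustration of the evaluation metrics named above, the following is a minimal sketch of P@n and MAP over ranked term suggestions. The function names, the example data and the choice to normalise average precision by the number of relevant terms are assumptions made for this sketch, not details taken from the thesis.

```python
def precision_at_n(ranked_terms, relevant, n):
    """Fraction of the top-n suggested terms that were judged relevant."""
    if n <= 0:
        raise ValueError("n must be positive")
    hits = sum(1 for term in ranked_terms[:n] if term in relevant)
    return hits / n


def average_precision(ranked_terms, relevant):
    """Mean of the precision values at each rank where a relevant term appears.

    Assumption: normalised by the total number of relevant terms.
    """
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, term in enumerate(ranked_terms, start=1):
        if term in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(runs):
    """MAP over a list of (ranked_terms, relevant_set) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Hypothetical judgements for two queries.
runs = [
    (["cat", "dog", "car", "kitten"], {"cat", "kitten"}),
    (["road", "engine"], {"engine"}),
]
p_at_2 = precision_at_n(runs[0][0], runs[0][1], 2)
map_score = mean_average_precision(runs)
```

With 20 queries, `runs` would hold 20 such pairs, one per query and approach being compared.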
We present the results of a survey in which 32 people gave their opinion on the terms suggested by the system and on how related they are to a given query. The main conclusion of this thesis is that NT does run faster than Chi-Squared, but while results varied, the average precision values fell in favour of Chi-Squared. That said, Chi-Squared did not perform much better, and with future improvements NT could prove a viable solution for automatically generating semantically related terms without heavy processing.
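For readers unfamiliar with Chi-Squared term selection, the sketch below shows one common formulation based on a 2x2 contingency table of term occurrence in documents retrieved for the query versus a background set. It is an assumption-laden illustration: the thesis may compute the statistic differently, and the function names, the background set and the example documents are hypothetical.

```python
def chi_squared(a, b, c, d):
    """Chi-squared statistic for a 2x2 term/document-set contingency table.

    a: related documents containing the term
    b: background documents containing the term
    c: related documents without the term
    d: background documents without the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def rank_terms(related_docs, background_docs, top_k=5):
    """Rank terms from the related documents by chi-squared score.

    Each document is a set of (already pre-processed) terms.
    """
    vocabulary = set().union(*related_docs)
    scores = {}
    for term in vocabulary:
        a = sum(term in doc for doc in related_docs)
        b = sum(term in doc for doc in background_docs)
        c = len(related_docs) - a
        d = len(background_docs) - b
        # Keep only terms positively associated with the related set.
        if a * d > c * b:
            scores[term] = chi_squared(a, b, c, d)
    # Sort by descending score, breaking ties alphabetically.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_k]


# Hypothetical documents retrieved for the query "jaguar" (animal sense).
related = [{"jaguar", "cat", "animal"}, {"jaguar", "cat"}]
background = [{"jaguar", "car"}, {"car", "engine"}]
suggested = rank_terms(related, background)
```

Scoring every term against a background set is what makes this kind of approach comparatively expensive, which is the cost that NT is described as avoiding.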