Translation-based Ranking in Cross-Language Information Retrieval

Felix Hieber

Abstracts Computer Science


Institution:	Universität Heidelberg
Department:	Neuphilologische Fakultät
Degree:	PhD
Year:	2015
Record ID:	1117826
Full text PDF:	http://www.ub.uni-heidelberg.de/archiv/18696

Abstract

The current volume of user-generated, multilingual textual data creates a need for information processing systems in which cross-linguality, i.e., the ability to work on more than one language, is fully integrated into the underlying models. In the particular context of Information Retrieval (IR), this amounts to ranking and retrieving relevant documents from a large repository in language A, given a user's information need expressed in a query in language B. This kind of application is commonly termed a Cross-Language Information Retrieval (CLIR) system. Such CLIR systems typically involve a translation component of varying complexity, which is responsible for translating the user input into the document language. Using query translations from modern, phrase-based Statistical Machine Translation (SMT) systems and subsequently retrieving monolingually is thus a straightforward choice. However, little work has been devoted to integrating such SMT models into CLIR, or even to jointly modeling translation and retrieval.

In this thesis, I focus on the shared aspect of ranking in translation-based CLIR: both translation and retrieval models induce rankings over a set of candidate structures through the assignment of scores. The subject of this thesis is to exploit this commonality in three different ranking tasks.

(1) "Mate-ranking" refers to the task of mining comparable data for SMT domain adaptation through translation-based CLIR. "Cross-lingual mates" are direct or close translations of the query. I will show that such a CLIR system is able to find in-domain comparable data in noisy user-generated corpora and that this data improves the in-domain translation performance of an SMT system. Conversely, the CLIR system itself relies on a translation model that is tailored for retrieval.

This leads to the second direction of research, in which I develop two ways to optimize an SMT model for retrieval, namely (2) by SMT parameter optimization towards a retrieval objective ("translation ranking"), and (3) by presenting a joint model of translation and retrieval for "document ranking". Translation ranking refers to optimizing for a preference for translation candidates that work well for retrieval. The joint model abandons the common architecture of modeling both components separately. In the core task of "document ranking" for CLIR, I present a model that directly ranks documents using an SMT decoder. I present substantial improvements over state-of-the-art translation-based CLIR baseline systems, indicating that a joint model of translation and retrieval is a promising direction of research in the field of CLIR.

The volume of multilingual, user-generated text data increasingly creates a need for information processing systems in which cross-lingual processing is fully integrated into the underlying models. In the context of searching text documents, referred to in the following as Information Retrieval (IR), this means producing a ranking over documents in language…
