AbstractsLanguage, Literature & Linguistics

Robust handling of out-of-vocabulary words in deep language processing

by João Ricardo Silva

Institution: Universidade de Lisboa
Year: 2014
Keywords: Teses de doutoramento - 2014
Record ID: 1318069
Full text PDF: http://www.rcaap.pt/detail.jsp?id=oai:repositorio.ul.pt:10451/11956


Tese de doutoramento, Informática (Ciências da Computação), Universidade de Lisboa, Faculdade de Ciências, 2014 Deep grammars handle with precision complex grammatical phenomena and are able to provide a semantic representation of their input sentences in some logic form amenable to computational processing, making such grammars desirable for advanced Natural Language Processing tasks. The robustness of these grammars still has room to be improved. If any of the words in a sentence is not present in the lexicon of the grammar, i.e. if it is an out-of-vocabulary (OOV) word, a full parse of that sentence may not be produced. Given that the occurrence of such words is inevitable, e.g. due to the property of lexical novelty that is intrinsic to natural languages, deep grammars need some mechanism to handle OOV words if they are to be used in applications to analyze unrestricted text. The aim of this work is thus to investigate ways of improving the handling of OOV words in deep grammars. The lexicon of a deep grammar is highly thorough, with words being assigned extremely detailed linguistic information. Accurately assigning similarly detailed information to OOV words calls for the development of novel approaches, since current techniques mostly rely on shallow features and on a limited window of context, while there are many cases where the relevant information is to be found in wider linguistic structure and in long-distance relations. The solution proposed here consists of a classifier, SVM-TK, that is placed between the input to the grammar and the grammar itself. This classifier can take a variety of features and assign to words deep lexical types which can then be used by the grammar when faced with OOV words. The classifier is based on support-vector machines which, through the use of kernels, allows the seamless use of features encoding linguistic structure in the classifier. This dissertation focuses on the HPSG framework, but the method can be used in any framework where the lexical information can be encoded as a word tag. As a case study, we take LX-Gram, a computational grammar for Portuguese, to improve its robustness with respect to OOV verbs. Given that the subcategorization frame of a word is a substantial part of what is encoded in an HPSG deep lexical type, the classifier takes graph encoding grammatical dependencies as features. At runtime, these dependencies are produced by a probabilistic dependency parser. The SVM-TK classifier is compared against the state-of-the-art approaches for OOV handling, which consist of using a standard POS-tagger to assign lexical types, in essence doing POS-tagging with a highly granular tagset. Results show that SVM-TK is able to improve on the state-of-the-art, with the usual data-sparseness bottleneck issues imposing this to happen when the amount of training data is large enough. As gramáticas de processamento profundo lidam de forma precisa com fenómenos linguisticos complexos e são capazes de providenciar uma representação semântica das frases que lhes são…