
Automatic extraction of definitions

by Rosa Del Gaudio

Institution: Universidade de Lisboa
Year: 2014
Keywords: Computer engineering; Natural language processing; Information extraction; Doctoral theses - 2014
Record ID: 1318183
Full text PDF: http://www.rcaap.pt/detail.jsp?id=oai:repositorio.ul.pt:10451/10818


Doctoral thesis, Informatics (Computer Engineering), Universidade de Lisboa, Faculdade de Ciências, 2014

This doctoral research provides a set of methods and heuristics for building a definition extractor or for fine-tuning an existing one. To develop and test the architecture, a generic definition extractor for Portuguese was built. The methods were also tested in building extractors for two other languages: English and, less extensively, Dutch. The approach differs in nature from other work in the field: most systems that automatically extract definitions were built for a specific corpus on a specific topic and rely on a manually constructed set of rules or patterns capable of identifying a definition in a text. This research focused on three types of definitions, characterized by the connector between the defined term and its description: copula, verbal and punctuation definitions. The strategy adopted can be seen as a "divide and conquer" approach. Unlike other state-of-the-art work, specific heuristics were developed for each type of definition. A different methodology is proposed for each type: rule-based methods for punctuation definitions, machine learning with sampling algorithms for copula definitions, and machine learning with a method to increase the number of positive examples for verbal definitions. This architecture is justified by the increasing linguistic complexity that characterizes the different types of definitions. Numerous experiments led to the conclusion that punctuation definitions are easily described by a set of rules.
These rules can be easily adapted to the relevant context and translated into other languages. For the other two definition types, however, rules alone are not enough to achieve good performance; more advanced methods are needed, in particular a machine learning based approach. Unlike similar systems, which were built with a specific corpus or a specific domain in mind, the one reported here is meant to obtain good results regardless of the domain or context. All decisions made in constructing the definition extractor take this central objective into consideration.

This doctoral work aims to provide a set of methods and heuristics for building a definition extractor or for improving the performance of an existing system when used with a specific corpus. To develop and test the architecture, a generic definition extractor for Portuguese was built. In addition, the methods were tested in building an extractor for a language other than Portuguese, namely English; some heuristics were also tested…
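
Two of the ideas in the abstract, rules for punctuation definitions and sampling to balance training data, can be illustrated with a minimal Python sketch. Nothing here comes from the thesis itself: the regular expression, the function names `extract_punctuation_definition` and `undersample`, the three-word minimum, and the choice of random undersampling are all assumptions made for illustration, since the abstract names neither the actual rules nor the sampling algorithms used.

```python
import random
import re
from typing import List, Optional, Tuple

# Hypothetical rule (not from the thesis): a term of one to five words,
# a colon or a spaced dash as the connector, then a description.
PUNCT_DEF = re.compile(
    r"^(?P<term>\w[\w'-]*(?: \w[\w'-]*){0,4})"
    r"(?:\s*:\s+| [-\u2013] )"
    r"(?P<desc>\S.*)$"
)


def extract_punctuation_definition(sentence: str) -> Optional[Tuple[str, str]]:
    """Return (term, description) if the sentence matches the rule, else None."""
    match = PUNCT_DEF.match(sentence.strip())
    if match is None:
        return None
    description = match.group("desc")
    # Assumed threshold: very short right-hand sides are unlikely to define anything.
    if len(description.split()) < 3:
        return None
    return match.group("term"), description


def undersample(
    examples: List[Tuple[str, bool]], ratio: float = 1.0, seed: int = 0
) -> List[Tuple[str, bool]]:
    """Keep all positives and at most ratio * len(positives) random negatives.

    Random undersampling is one common way to balance a training set in which
    non-definitions far outnumber definitions; it stands in here for the
    unspecified sampling algorithms mentioned in the abstract.
    """
    rng = random.Random(seed)
    positives = [e for e in examples if e[1]]
    negatives = [e for e in examples if not e[1]]
    keep = min(len(negatives), int(ratio * len(positives)))
    return positives + rng.sample(negatives, keep)


if __name__ == "__main__":
    print(extract_punctuation_definition(
        "Corpus: a collection of texts used for linguistic analysis."
    ))
    data = [("definition sentence", True)] + [
        (f"other sentence {i}", False) for i in range(10)
    ]
    print(len(undersample(data, ratio=2.0)))
```

Requiring whitespace after the colon keeps strings such as clock times from matching. A rule this simple would still need tuning per corpus and language, which is consistent with the abstract's point that such rules are easy to adapt.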