
Improved Named Entity Recognition for Information Extraction in Record-Based Data

by Chun Yong Moon




Institution: University of New South Wales
Department: Computer Science & Engineering
Year: 2014
Keywords: Record-Based Data; Information Extraction; Named Entity Recognition; Named Entity; Named Entity Recognizer
Record ID: 1053373
Full text PDF: http://handle.unsw.edu.au/1959.4/53888


Abstract

The massive increase of information on the internet in recent years has brought about so-called “information overload”, in which users find it difficult to process vast amounts of information in a meaningful and efficient way. For that reason, Information Extraction (IE) has been actively researched to assist users in recognizing core Named Entities in data. One problem with existing Named Entity Recognizers is that they do not perform well on record-based (rather than sentence-based) text. This study aimed to improve Named Entity Recognition for IE in record-based data. Our first approach exploited the repeated rows and columns that characterize record-based data. In particular, our method identifies the most likely pattern that is repeated in a given record-based text region. Using the key patterns from every record-based text region, we could then identify missing or mis-categorized entities and correct the mistakes to improve the recognition rate. According to our experimental results, this approach, column-based Named Entity Recognition, correctly categorized Person, Organization and Location/Country entities with precision/recall rates of up to 96%, 95% and 98% respectively. In our second approach, we ignored the columns in record-based data and treated the text region as a sequence without new lines. We then identified numerous features in such a sequence that might help indicate repetitive records. Our method does not require any training beforehand. With the second approach, Person, Organization and Location/Country entities were correctly categorized with precision/recall rates of up to 79%, 84% and 98% respectively. Our approaches can help boost the Named Entity recognition rate for record-based data in many forms and with numerous column types. The results showed increased performance compared to existing Named Entity Recognizers.
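
The column-based idea described in the abstract can be illustrated with a minimal sketch. This is not the thesis implementation. It assumes that a base recognizer has already labelled each cell of a record-based region, that a missing label is represented by `None`, and that the "most likely pattern" is approximated by a simple per-column majority vote. The function names `column_pattern` and `apply_pattern` and the `min_support` threshold are hypothetical and introduced only for illustration.

```python
from collections import Counter

MISSING = None  # assumed marker: the base tagger found no entity in this cell


def column_pattern(rows, min_support=0.5):
    """Estimate the repeated label pattern of a region, one label per column.

    A column's label is the most common non-missing label in that column,
    kept only if it covers at least `min_support` of all rows (an assumed
    threshold, not taken from the thesis).
    """
    width = max(len(row) for row in rows)
    pattern = []
    for col in range(width):
        labels = [row[col] for row in rows
                  if col < len(row) and row[col] is not MISSING]
        if not labels:
            pattern.append(MISSING)
            continue
        label, count = Counter(labels).most_common(1)[0]
        pattern.append(label if count / len(rows) >= min_support else MISSING)
    return pattern


def apply_pattern(rows, pattern):
    """Fill missing labels and overwrite labels that disagree with the pattern.

    Columns without a trusted pattern label are left unchanged.
    """
    return [
        [pattern[col] if pattern[col] is not MISSING else label
         for col, label in enumerate(row)]
        for row in rows
    ]


# Labels from a hypothetical base recognizer for a four-row region.
region = [
    ["PERSON", "ORG", "LOC"],
    ["PERSON", None, "LOC"],   # missing entity in column 2
    ["ORG", "ORG", "LOC"],     # mis-categorized entity in column 1
    ["PERSON", "ORG", None],   # missing entity in column 3
]
pattern = column_pattern(region)
corrected = apply_pattern(region, pattern)
```

In this toy region the majority vote yields one label per column, and every cell that is missing or disagrees with its column is replaced by that label. A real system would also need to detect the region and its column boundaries first, which this sketch takes as given.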