Lena Pons
Evaluating Relation Extractors to Integrate into a Knowledge Base Population Pipeline
Natural language processing (NLP) is the process of converting unstructured natural language into a structured representation, such that computers can extract information from it. Information Extraction (IE) is the related discipline that focuses on creating the structured representation. A structured representation, or knowledge base, can then be queried to extract information. A knowledge base can be used to answer questions, find relationships or patterns in the text, or extract lists of all matching examples of some pattern from the text base.
The structured data in a knowledge base is comprised of entities and relations. Entities can be thought of as proper nouns, while relations are concepts that link two entities. For example, from the sentence “Albert Einstein was born in Ulm, Germany,” the entities are “Albert Einstein” and “Ulm, Germany” and the relation that connects them is “born in.” The National Institute of Standards and Technology (NIST) supports research in several areas of NLP, including generating knowledge bases from a large dataset of text. Information extraction from the text base can be divided into sub tasks of extracting the entities and extracting the relations. Working from an existing software system that generates a knowledge base from a large collection of text, one approach to extracting more information from the text is to increase the number of relations extracted.
In the literature, different relation extractors perform better or worse at finding different types of relations. One potential approach to improving overall relation extraction is to employ more than one software tool to extract relations. This approach requires devising some means of discovering and aligning the differences extracted relations from multiple relation extractors.
A relation extractor called OLLIE was run on a small test set of newswire documents (Mausam 2012). Based initially on human inspection, some relations were extracted by OLLIE that were not present in extractions over the same documents using the existing software pipeline. A program that could reconcile the extractions in both documents using their location in the text was used to align the extractions that were present in both extractions. This step is necessary to determine the additional relations that were not captured in the existing system.
The new relations extracted using OLLIE will then be incorporated into the knowledge base generated by the existing system. The knowledge base is tested using queries that can be answered by information expected to be found in the knowledge base.
Running OLLIE on a larger subset of documents will be used to quantify how many additional relations can be captured that were not in the original extraction. The extractions should be evaluated in two ways. First, a count of the number of additional relations that were extracted by OLLIE. Second, additional relations should be sampled to evaluate what percent of them are accurate. A relation can be considered accurate if a human reviewer finds that the relation extracted from a sentence represents the information present in the original text.