
Building an Entity-Centric Stream Filtering Test Collection for TREC 2012

Authors :
MASSACHUSETTS INST OF TECH CAMBRIDGE
Frank, John R
Kleiman-Weiner, Max
Roberts, Daniel A
Niu, Feng
Zhang, Ce
Re, Christopher
Soboroff, Ian
Source :
DTIC
Publication Year :
2012

Abstract

The Knowledge Base Acceleration (KBA) track in TREC 2012 focused on a single task: filter a time-ordered corpus for documents that are highly relevant to a predefined list of entities. KBA differs from previous filtering evaluations in two primary ways: the stream corpus is 100x larger than previous filtering collections, and the use of entities as topics enables systems to incorporate structured knowledge bases (KBs), such as Wikipedia, as external data sources. A successful KBA system must do more than resolve the meaning of entity mentions by linking documents to the KB: it must also distinguish centrally relevant documents that are worth citing in the entity's Wikipedia article. This combines thinking from natural language processing (NLP) and information retrieval (IR). Filtering tracks in TREC have typically used topics described by a set of keyword queries or short descriptions, and annotators have generated relevance judgments based on their personal interpretation of the topic. For TREC 2012, we selected a set of filter topics based on Wikipedia entities: 27 people and 2 organizations. Such named entities are more familiar in NLP than in IR. We also constructed an entirely new stream corpus spanning 4,973 consecutive hours from October 2011 through April 2012. It contains over 400M documents, which we augmented with named entity classification tagging for the 40% of the documents identified as English. Each document has a timestamp that places it in the stream. The 29 target entities were mentioned infrequently enough in the corpus that NIST assessors could judge the relevance of most of the mentioning documents (91%). Judgments for documents from before January 2012 were provided to TREC teams as training data for filtering documents from the remaining hours. Run submissions were evaluated against the assessor-generated list of citation-worthy documents. We present peak F_1 scores averaged across the entities for all run submissions. High scoring system

Presented at the Twenty-First Text REtrieval Conference (TREC 2012), held in Gaithersburg, Maryland, November 6-9, 2012. The conference was co-sponsored by the National Institute of Standards and Technology (NIST), the Defense Advanced Research Projects Agency (DARPA), and the Advanced Research and Development Activity (ARDA). U.S. Government or Federal Rights License.
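The evaluation described above (peak F_1 per entity, averaged across entities) can be illustrated with a short sketch. This is not the official KBA scorer; the data structures and the confidence-cutoff sweep below are assumptions for illustration only.

```python
# Minimal sketch of "peak F_1 averaged across entities" scoring.
# Assumptions (not from the official scorer): a run is a dict mapping
# entity -> list of (doc_id, confidence) pairs, and qrels is a dict
# mapping entity -> set of citation-worthy doc_ids.

from typing import Dict, List, Set, Tuple


def peak_f1(scored_docs: List[Tuple[str, float]], relevant: Set[str]) -> float:
    """Peak F_1 over all confidence cutoffs for a single entity."""
    best = 0.0
    # Sweep every confidence value that appears in the run as a cutoff.
    for cutoff in sorted({score for _, score in scored_docs}):
        retrieved = {doc for doc, score in scored_docs if score >= cutoff}
        if not retrieved or not relevant:
            continue
        tp = len(retrieved & relevant)
        precision = tp / len(retrieved)
        recall = tp / len(relevant)
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best


def average_peak_f1(run: Dict[str, List[Tuple[str, float]]],
                    qrels: Dict[str, Set[str]]) -> float:
    """Average the per-entity peak F_1 across all target entities."""
    entities = list(qrels)
    return sum(peak_f1(run.get(e, []), qrels[e]) for e in entities) / len(entities)
```

Under these assumptions, a run that ranks citation-worthy documents with higher confidence than merely mentioning documents achieves a higher peak F_1, since some cutoff separates the two sets cleanly.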

Details

Database :
OAIster
Journal :
DTIC
Notes :
text/html, English
Publication Type :
Electronic Resource
Accession number :
edsoai.ocn872732731
Document Type :
Electronic Resource