Improving corpus reproducibility through modular text transformations and connected data set.

Authors :: Pulliza, Jonathan
Shah, Chirag
Source :: Proceedings of the Association for Information Science & Technology; 2018, Vol. 55 Issue 1, p883-884, 2p
Publication Year :: 2018
Abstract: The Enron Email Corpus is one of the most utilized collections of documents in Natural Language Processing, Machine Learning, and Network Analysis. Different groups of researchers have transformed the corpus, changing the content and format to meet their needs. The many distinct versions can all claim to be the Enron Email Corpus, though they are as distinct from the original publicly available collection as they are from each other. Researchers then have to determine the usefulness of a particular version in comparison to the many others available, as well as ascertain what has been done to the collection and how it would affect their specific research goal. This is especially important for reproducing a particular research method onto a different corpus, as transposing a method necessitates a deep understanding of the original data in the experiment. This project models the various transformations performed on different versions of the collection to form a network of connected datasets, highlighting the most important nodes as well as the most common transformations. Traversing different paths between nodes offers the community a way to model and reproduce the necessary data work performed on one collection that can be transposed onto other collections. [ABSTRACT FROM AUTHOR]

Subjects :: EMAIL
BIG data
NATURAL language processing
NETWORK analysis (Communication)
MACHINE learning
CITATION analysis

Details

Language :: English
ISSN :: 23739231
Volume :: 55
Issue :: 1
Database :: Complementary Index
Journal :: Proceedings of the Association for Information Science & Technology
Publication Type :: Conference
Accession number :: 134431160
Full Text :: https://doi.org/10.1002/pra2.2018.14505501159
