Efficient n-gram analysis in R with cmscu

Authors :
Rick Dale
Jason K. Davis
Suzanne Sindi
David W. Vinson
Source :
Behavior research methods. 48(3)
Publication Year :
2016

Abstract

We present a new R package, cmscu, which implements a Count-Min-Sketch with conservative updating (Cormode and Muthukrishnan, Journal of Algorithms, 55(1), 58–75, 2005), and its application to n-gram analyses (Goyal et al. 2012). By writing the core implementation in C++ and exposing it to R via Rcpp, we are able to provide a memory-efficient, high-throughput, and easy-to-use library. As a proof of concept, we implemented the computationally challenging (Heafield et al. 2013) modified Kneser–Ney n-gram smoothing algorithm using cmscu as the querying engine. We then explore information density measures (Jaeger, Cognitive Psychology, 61(1), 23–62, 2010) from n-gram frequencies (for n = 2, 3) derived from a corpus of over 2.2 million reviews in a dataset provided by Yelp, Inc. We demonstrate that these text data are at a scale beyond the reach of other more common, more general-purpose libraries available through CRAN. Using the cmscu library and the smoothing implementation, we find a positive relationship between review information density and reader review ratings. We end by highlighting the importance of new, efficient tools for exploring behavioral phenomena in large, relatively noisy data sets.
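
For readers unfamiliar with the data structure named in the abstract, the following is a minimal Python sketch of a Count-Min-Sketch with conservative updating applied to bigram counts. It is an illustration only, not the cmscu implementation or its API: the class name `CountMinSketchCU`, the helper `ngrams`, the `width` and `depth` parameters, and the choice of salted BLAKE2b as the hash family are all assumptions made for this example.

```python
import hashlib


class CountMinSketchCU:
    """Illustrative Count-Min-Sketch with conservative updating (hypothetical names)."""

    def __init__(self, width=2048, depth=4):
        self.width = width    # number of counters per row
        self.depth = depth    # number of rows, one hash function per row
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        # One salted hash per row; salted BLAKE2b is an assumption of this sketch.
        data = item.encode("utf-8")
        cells = []
        for row in range(self.depth):
            digest = hashlib.blake2b(
                data, digest_size=8, salt=row.to_bytes(2, "little")
            ).digest()
            cells.append((row, int.from_bytes(digest, "little") % self.width))
        return cells

    def add(self, item, count=1):
        cells = self._cells(item)
        # Conservative update: raise a counter only as far as needed so that
        # every counter for this item is at least (current estimate + count).
        target = min(self.table[r][c] for r, c in cells) + count
        for r, c in cells:
            if self.table[r][c] < target:
                self.table[r][c] = target

    def query(self, item):
        # The estimate is the minimum over rows; it never undercounts.
        return min(self.table[r][c] for r, c in self._cells(item))


def ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sketch = CountMinSketchCU(width=2048, depth=4)
tokens = "the food was great and the food was cheap".split()
for gram in ngrams(tokens, 2):
    sketch.add(gram)

print(sketch.query("the food"))
```

The estimate for a bigram is never below its true count, and hash collisions can only push it upward. Conservative updating limits that overcount by leaving untouched any counter that already exceeds the new estimate. Memory use is fixed at `width * depth` counters regardless of how many distinct n-grams are seen, which is the property that makes this kind of structure attractive for large corpora.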

Details

ISSN :
1554-3528
Volume :
48
Issue :
3
Database :
OpenAIRE
Journal :
Behavior research methods
Accession number :
edsair.doi.dedup.....fc8ee2e04fe1a689f01b689a77e28d27