Efficient n-gram analysis in R with cmscu
- Source :
- Behavior research methods. 48(3)
- Publication Year :
- 2016
Abstract
- We present a new R package, cmscu, which implements a Count-Min-Sketch with conservative updating (Cormode and Muthukrishnan, Journal of Algorithms, 55(1), 58–75, 2005), and its application to n-gram analyses (Goyal et al. 2012). By writing the core implementation in C++ and exposing it to R via Rcpp, we are able to provide a memory-efficient, high-throughput, and easy-to-use library. As a proof of concept, we implemented the modified Kneser–Ney n-gram smoothing algorithm, which is computationally challenging (Heafield et al. 2013), using cmscu as the querying engine. We then explore information density measures (Jaeger, Cognitive Psychology, 61(1), 23–62, 2010) computed from n-gram frequencies (for n = 2, 3) derived from a corpus of over 2.2 million reviews in a dataset provided by Yelp, Inc. We demonstrate that these text data are at a scale beyond the reach of more common, general-purpose libraries available through CRAN. Using the cmscu library and the smoothing implementation, we find a positive relationship between review information density and reader review ratings. We end by highlighting the importance of new, efficient tools for exploring behavioral phenomena in large, relatively noisy data sets.
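To make the data structure named in the abstract concrete, here is a minimal Python sketch of a Count-Min-Sketch with conservative updating, applied to bigram counting. It is an illustration only and does not reproduce the cmscu package's API or its C++ implementation: the class name `CountMinSketchCU`, the helper `ngrams`, the `width` and `depth` parameters, and the choice of salted BLAKE2b as the hash family are all assumptions made for this example.

```python
import hashlib


class CountMinSketchCU:
    """Illustrative Count-Min-Sketch with conservative updating (hypothetical names)."""

    def __init__(self, width=1024, depth=4):
        self.width = width    # counters per row
        self.depth = depth    # number of independent hash rows
        self.tables = [[0] * width for _ in range(depth)]

    def _indexes(self, key):
        # One salted hash per row; the salt makes the rows behave as different hash functions.
        data = key.encode("utf-8")
        result = []
        for row in range(self.depth):
            digest = hashlib.blake2b(
                data, digest_size=8, salt=row.to_bytes(8, "little")
            ).digest()
            result.append(int.from_bytes(digest, "little") % self.width)
        return result

    def add(self, key, count=1):
        idx = self._indexes(key)
        # Conservative update: raise each counter only as far as the new estimate,
        # rather than incrementing every row unconditionally.
        target = min(self.tables[r][i] for r, i in enumerate(idx)) + count
        for r, i in enumerate(idx):
            if self.tables[r][i] < target:
                self.tables[r][i] = target

    def query(self, key):
        # The estimate is the minimum over rows; it never undercounts the true frequency.
        return min(self.tables[r][i] for r, i in enumerate(self._indexes(key)))


def ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sketch = CountMinSketchCU(width=1024, depth=4)
tokens = "the food was good and the food was cheap".split()
for gram in ngrams(tokens, 2):
    sketch.add(gram)

estimate = sketch.query("the food")  # true count is 2; the estimate is at least that
```

Memory use is fixed at `width * depth` counters regardless of how many distinct n-grams are seen, which is the property that makes this kind of structure attractive for corpora of the size described above. The price is that hash collisions can inflate an estimate; conservative updating reduces that inflation but does not eliminate it.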
- Subjects :
- Theoretical computer science
Computer science
Big data
Experimental and Cognitive Psychology
Scale (descriptive set theory)
computer.software_genre
Information theory
01 natural sciences
050105 experimental psychology
010104 statistics & probability
Arts and Humanities (miscellaneous)
Developmental and Educational Psychology
Humans
0501 psychology and cognitive sciences
0101 mathematics
General Psychology
business.industry
05 social sciences
Search Engine
R package
n-gram
Proof of concept
Data Interpretation, Statistical
Core (graph theory)
Psychology (miscellaneous)
Data mining
business
computer
Smoothing
Algorithms
Software
Behavioral Research
Details
- ISSN :
- 15543528
- Volume :
- 48
- Issue :
- 3
- Database :
- OpenAIRE
- Journal :
- Behavior research methods
- Accession number :
- edsair.doi.dedup.....fc8ee2e04fe1a689f01b689a77e28d27