Descriptor: "zipf's law" / Topic: computer.software_genre - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"zipf's law"' showing total 307 results

1. Characteristics of Malay translated hadith corpus

Author: Siti Syakirah Sazali, Nurazzah Abd Rahman, and Zainab Binti Abu Bakar
Subjects: General Computer Science, Zipf's law, business.industry, Computer science, media_common.quotation_subject, Search engine indexing, 020206 networking & telecommunications, Context (language use), 02 engineering and technology, Ambiguity, computer.software_genre, language.human_language, Field (computer science), 0202 electrical engineering, electronic engineering, information engineering, Benchmark (computing), language, 020201 artificial intelligence & image processing, Artificial intelligence, business, Cluster analysis, computer, Natural language processing, media_common, Malay
Abstract: Annotated corpora can greatly assist the natural language processing field. For example, computers can understand more of the document context, and indexing and clustering in information retrieval can be done precisely with little or no word ambiguity. However, there are only a few annotated corpora in the Malay language, and they are not publicly shared. In this paper, we analyse and annotate Malay translated hadith documents in terms of tagging and entities. There are three phases: manual filtering and cleaning, analysing the corpus, and creating the benchmark. As a result, an analysis and benchmark of the Malay translated hadith corpus were produced in terms of part-of-speech and named-entity tags that follow the Zipf’s law distribution.
Published: 2022
Full Text: View/download PDF

2. Zipf’s law analysis on the leaked Iranian users’ passwords

Author: Amir Jalaly Bidgoly and Zeinab Alebouyeh
Subjects: Password, Password policy, Security analysis, Authentication, Software_OPERATINGSYSTEMS, Zipf's law, Computer science, English language, Computer security, computer.software_genre, ComputingMilieux_MANAGEMENTOFCOMPUTINGANDINFORMATIONSYSTEMS, Computational Theory and Mathematics, Hardware and Architecture, Computer Science (miscellaneous), Selection (linguistics), computer, Software, Vulnerability (computing)
Abstract: Textual passwords are one of the most common methods of authentication and an important factor in systems security. Knowing the correct distribution of users’ passwords can play an important role in defining password policies and preventing various attacks. Culture and language can affect the pattern of users’ password selection and, consequently, influence the vulnerability of passwords to guessing attacks. Therefore, knowing the distribution of English users’ passwords may not be appropriate for the security analysis of non-English users’ passwords. The main purpose of this paper is to analyze the passwords of Iranian users and to investigate their differences from those of English-speaking users. The paper also examines the existence of Zipf’s law on Iranian passwords as the most well-known distribution for passwords. Password analysis of Iranian users shows that the popular password length does not differ much between Iranian users and users of other countries, but in terms of the combination of characters used in the passwords, Iranian users are more inclined to use numeric passwords while English language users are more inclined to use passwords made up of alphabetic characters. In this paper, Zipf’s law is reviewed on five datasets of Iranian users’ passwords using three different approaches: PDF, PDF with unpopular passwords removed, and CDF. Among these methods, the CDF method gave the best match, with the passwords fitting a Zipf’s law distribution between 0.02 and 0.07. Finally, the robustness of Iranians’ passwords to statistical guessing attacks has been measured, and it is concluded that the passwords of Iranian users are more vulnerable to guessing attacks than those of English language users.
Published: 2021
Full Text: View/download PDF

3. Domain-based Latent Personal Analysis and its use for impersonation detection in social media

Author: Osnat Mokryn and Hagit Ben-Shoshan
Subjects: FOS: Computer and information sciences, Vocabulary, Computer Science - Computation and Language, Correctness, Zipf's law, business.industry, Computer science, media_common.quotation_subject, computer.software_genre, Signature (logic), Computer Science Applications, Education, Domain (software engineering), Ranking (information retrieval), Human-Computer Interaction, Set (abstract data type), Artificial intelligence, business, Computation and Language (cs.CL), computer, Natural language processing, Word (computer architecture), media_common
Abstract: Zipf’s law defines an inverse proportion between a word’s ranking in a given corpus and its frequency in it, roughly dividing the vocabulary into frequent words and infrequent ones. Here, we stipulate that within a domain an author’s signature can be derived from, in loose terms, the author’s missing popular words and frequently used infrequent words. We devise a method, termed Latent Personal Analysis (LPA), for finding domain-based attributes for entities in a domain: their distance from the domain and their signature, which determines how they most differ from a domain. We identify the most suitable distance metric for the method among several and construct the distances and personal signatures for authors, the domain’s entities. The signature consists of both over-used terms (compared to the average) and missing popular terms. We validate the correctness and power of the signatures in identifying users and set existence conditions. We test LPA in several domains, both textual and non-textual. We then demonstrate the use of the method in explainable authorship attribution: we define algorithms that utilize LPA to identify two types of impersonation in social media: (1) authors with sockpuppets (multiple) accounts and (2) front-users accounts, operated by several authors. We validate the algorithms and employ them over a large-scale dataset obtained from a social media site with over 4000 users. We corroborate these results using temporal rate analysis. LPA can further be used to devise personal attributes in a wide range of scientific domains in which the constituents have a long-tail distribution of elements.
Published: 2021
Full Text: View/download PDF

4. Efficient communication in written and performed music

Author: Laurent Bonnasse-Gahot, Centre d'Analyse et de Mathématique sociales (CAMS), Centre National de la Recherche Scientifique (CNRS)-École des hautes études en sciences sociales (EHESS), and École des hautes études en sciences sociales (EHESS)-Centre National de la Recherche Scientifique (CNRS)
Subjects: Male, Range (music), efficient communication, Computer science, harmony, Writing, [SHS.INFO]Humanities and Social Sciences/Library and information sciences, Cognitive Neuroscience, Experimental and Cognitive Psychology, Information theory, computer.software_genre, 050105 experimental psychology, 03 medical and health sciences, 0302 clinical medicine, Artificial Intelligence, Humans, 0501 psychology and cognitive sciences, Relevance (information retrieval), music, Language, information theory, [SHS.MUSIQ]Humanities and Social Sciences/Musicology and performing arts, MIDI, Zipf's law, Syntax (programming languages), business.industry, Communication, 05 social sciences, Piano, Phonetics, computer.file_format, [SCCO.LING]Cognitive science/Linguistics, Authorship, Artificial intelligence, business, computer, 030217 neurology & neurosurgery, Natural language processing
Abstract: Since its inception, Shannon's information theory has attracted interest for the study of language and music. Recently, a wide range of converging studies have shown how efficient communication pervades language, from phonetics to syntax. Efficient principles imply that more resources should be assigned to highly informative items. For instance, average information content was shown to be a better predictor of word length than frequency, revisiting one of Zipf's famous laws. However, in spite of the success of the efficient communication framework in the study of language and speech, very little work has investigated its relevance in the analysis of music. Here, we examine the organization of harmonic information in two large corpora of Western music, one made of MIDI files directly sequenced from scores, and the other made of MIDI recordings of live performances of highly skilled piano players. We show that there is a clear positive relationship between (contextual) information content of harmonic sequences and two essential musical properties, namely duration and loudness: the more unexpected a harmonic event is, the longer and the louder it is.
Published: 2022
Full Text: View/download PDF

5. Why Do Parameter Values in the Zipf-Mandelbrot Distribution Sometimes Explode?

Author: Ján Mačutek
Subjects: 050101 languages & linguistics, Linguistics and Language, Zipf's law, Distribution (number theory), business.industry, 05 social sciences, Mandelbrot set, computer.software_genre, 01 natural sciences, Language and Linguistics, 010305 fluids & plasmas, 0103 physical sciences, 0501 psychology and cognitive sciences, Artificial intelligence, business, computer, Word (computer architecture), Natural language processing, Mathematics
Abstract: The Zipf-Mandelbrot distribution serves as a mathematical model for ranked frequencies in many areas of scientific research, including linguistics. Many linguistic units, like e.g., words or word n...
Published: 2021
Full Text: View/download PDF

6. A Methodology of Using a Concordancer and Table Processor for Authorship Attribution

Author: V. A. Yatsko
Subjects: Zipf's law, business.industry, Computer science, 05 social sciences, Process (computing), 050905 science studies, computer.software_genre, Authorship attribution, Table (database), Artificial intelligence, 0509 other social sciences, 050904 information & library sciences, business, General Economics, Econometrics and Finance, Concordancer, computer, Natural language processing
Abstract: The paper proposes an original methodology of authorship attribution based on deviations from the Zipf distribution and on statistical data obtained with the help of a concordance program and computations performed in a table processor. The methodology involves finding distances between input texts and a reference text based on deviations of stop-word frequencies. The results achieved show that the proposed methodology allows efficient authorship attribution and that it can be used in the educational process to develop student skills and competencies pertaining to natural language processing.
Published: 2020
Full Text: View/download PDF

7. Estimating Video Popularity From Past Request Arrival Times in a VoD System

Author: Tianjiao Wang, Chamil Jayasundara, Moshe Zukerman, Ampalavanapillai Nirmalathas, Elaine Wong, Chathurika Ranaweera, Chang Xing, and Bill Moran
Subjects: General Computer Science, Zipf's law, Computer science, Popularity prediction, General Engineering, 020206 networking & telecommunications, 02 engineering and technology, Variation (game tree), computer.software_genre, Popularity, Measure (mathematics), request statistic, video-on-demand, Zipf distribution, 0202 electrical engineering, electronic engineering, information engineering, non-homogeneous Poisson process, 020201 artificial intelligence & image processing, General Materials Science, lcsh:Electrical engineering. Electronics. Nuclear engineering, Cache, Data mining, pre-placement, lcsh:TK1-9971, computer
Abstract: Efficient provision of Video-on-Demand (VoD) services requires that popular videos are stored in a cache close to users. Prediction of video popularity (defined by request count) is, therefore, important for the optimal choice of videos to be cached. The popularity of a video depends on many factors and, as a result, changes dynamically with time. Accurate video popularity estimation that can promptly respond to variations in video popularity then becomes crucial. In this paper, we analyze a method, called Minimal Inverted Pyramid Distance (MIPD), to estimate a video popularity measure called the Inverted Pyramid Distance (IPD). MIPD requires the choice of a parameter, $k$, representing the number of past requests from each video used to calculate its IPD. We derive, analytically, expressions to determine an optimal value for $k$, given the requirement on ranking a certain number of videos with specified confidence. In order to assess the prediction efficiency of MIPD, we have compared it by simulations against four other prediction methods: Least Recently Used (LRU), Least Frequently Used (LFU), Least Recently/Frequently Used (LRFU), and Exponential Weighted Moving Average (EWMA). Lacking real data, we have, based on an extensive literature review of real-life VoD systems, designed a model of a VoD system to provide a realistic simulation of videos with different patterns of popularity variation, using the Zipf (heavy-tailed) distribution of popularity and a non-homogeneous Poisson process for requests. From a large number of simulations, we conclude that the performance of MIPD is, in general, superior to all of the other four methods.
Published: 2020
Full Text: View/download PDF
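
Note: The Zipf popularity model mentioned in this abstract can be illustrated with a short sketch. It is not the paper's MIPD method and it omits the non-homogeneous Poisson request process. It only shows, under assumed parameters, how a Zipf popularity distribution over a catalogue translates into the hit ratio of a cache holding the most popular videos, plus a plain request-count (LFU-style) popularity estimate. The function names, the exponent `s`, and the catalogue size are hypothetical choices for illustration.

```python
import random
from collections import Counter


def zipf_pmf(n_items, s=1.0):
    """Zipf probabilities for ranks 1..n_items with (assumed) exponent s."""
    weights = [1.0 / (rank ** s) for rank in range(1, n_items + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def top_k_hit_ratio(pmf, cache_size):
    """Probability that a request hits a cache holding the cache_size most popular items."""
    return sum(sorted(pmf, reverse=True)[:cache_size])


def sample_requests(pmf, n_requests, seed=0):
    """Draw independent requests (item indices) from a static popularity distribution."""
    rng = random.Random(seed)
    return rng.choices(range(len(pmf)), weights=pmf, k=n_requests)


def estimate_top_k(requests, k):
    """Rank items by raw request count and return the k most requested."""
    return [item for item, _ in Counter(requests).most_common(k)]


if __name__ == "__main__":
    pmf = zipf_pmf(1000, s=0.8)
    print("hit ratio, cache of 50:", round(top_k_hit_ratio(pmf, 50), 3))
    print("estimated top 5:", estimate_top_k(sample_requests(pmf, 20000), 5))
```

Because the popularity here is static, the count-based estimate converges to the true ranking as requests accumulate. The abstract's point is that real popularity changes over time, which is what estimators such as MIPD or EWMA are meant to track.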

8. Incremental word processing influences the evolution of phonotactic patterns

Author: Adam King, Andrew Wedel, and Adam Ussishkin
Subjects: Phonotactics, Linguistics and Language, Zipf's law, business.industry, Computer science, 05 social sciences, Word processing, computer.software_genre, Information theory, 050105 experimental psychology, Language and Linguistics, 030507 speech-language pathology & audiology, 03 medical and health sciences, 0501 psychology and cognitive sciences, Artificial intelligence, 0305 other medical science, business, computer, Natural language processing
Abstract: Listeners incrementally process words as they hear them, progressively updating inferences about what word is intended as the phonetic signal unfolds in time. As a consequence, phonetic cues positioned early in the signal for a word are on average more informative about word-identity because they disambiguate the intended word from more lexical alternatives than cues late in the word. In this contribution, we review two new findings about structure in lexicons and phonological grammars, and argue that both arise through the same biases on phonetic reduction and enhancement resulting from incremental processing. (i) Languages optimize their lexicons over time with respect to the amount of signal allocated to words relative to their predictability: words that are on average less predictable in context tend to be longer, while those that are on average more predictable tend to be shorter. However, the fact that phonetic material earlier in the word plays a larger role in word identification suggests that languages should also optimize the distribution of that information across the word. In this contribution we review recent work on a range of different languages that supports this hypothesis: less frequent words are not only on average longer, but also contain more highly informative segments early in the word. (ii) All languages are characterized by phonological grammars of rules describing predictable modifications of pronunciation in context. Because speakers appear to pronounce informative phonetic cues more carefully than less informative cues, it has been predicted that languages should be less likely to evolve phonological rules that reduce lexical contrast at word beginnings. A recent investigation through a statistical analysis of a cross-linguistic dataset of phonological rules strongly supports this hypothesis. 
Taken together, we argue that these findings suggest that the incrementality of lexical processing has wide-ranging effects on the evolution of phonotactic patterns.
Published: 2019
Full Text: View/download PDF

9. Scale Characteristics and Optimization of Park Green Space in Megacities Based on the Fractal Measurement Model: A Case Study of Beijing, Shanghai, Guangzhou, and Shenzhen

Author: Miaoyao Nie, Zhen Li, and Wanmin Zhao
Subjects: Scale (ratio), Computer science, Geography, Planning and Development, 0211 other engineering and technologies, TJ807-830, scale hierarchical structure, 02 engineering and technology, 010501 environmental sciences, Management, Monitoring, Policy and Law, computer.software_genre, TD194-195, 01 natural sciences, Fractal dimension, Renewable energy sources, Fractal, Beijing, GE1-350, urban park green space, fractal theory, 0105 earth and related environmental sciences, Zipf's law, Basis (linear algebra), Environmental effects of industries and plants, Renewable Energy, Sustainability and the Environment, optimization discussion, Urban spatial structure, 021107 urban & regional planning, Environmental sciences, Megacity, Data mining, Zipf model, computer
Abstract: This paper applies fractal theory to research of green space in megacity parks due to the lack of a sufficient qualitative description of the scale structure of park green space, a quantifiable evaluation system, and operable planning methods in traditional studies. Taking Beijing, Shanghai, Guangzhou, and Shenzhen as examples, GIS spatial analysis technology and the Zipf model are used to calculate the fractal dimension (q), the goodness of fit (R²), and the degree of difference (C) to deeply interpret the connotation of indicators and conduct a comparative analysis between cities to reveal fractal characteristics and laws. The research results show that (1) the fractal dimension is related to the complexity of the park green space system, (2) the fractal dimension characterizes the hierarchical iteration of the park green space to a certain extent and reflects the internal order of the scale distribution, (3) the scale distribution of green space in megacity parks deviates from the ideal pyramid configuration, and (4) there are various factors affecting the scale structure of park green space, such as natural base conditions, urban spatial structure, and the continuation of historical genes working together. On this basis, a series of targeted optimization strategies are proposed.
Published: 2021

10. Leveraging Zipf’s Law to Analyze Statistical Distribution of Chinese Corpus

Author: Rongbin Wei, Qing Lei, and Haifeng Li
Subjects: Vocabulary, Zipf's law, business.industry, Computer science, media_common.quotation_subject, Feature extraction, Sentiment analysis, Text segmentation, Context (language use), computer.software_genre, Segmentation, Artificial intelligence, business, computer, Word (computer architecture), Natural language processing, media_common
Abstract: In Chinese natural language processing, one particular problem is how to choose the word segmentation strategy, which is commonly either char-based or word-based. For sentiment analysis of short text, as compared with long text, word-based segmentation faces the further problem that there are more ambiguous or unregistered words in the context of short text. The feature extraction performed by different Chinese Word Segmentation methods affects the statistical distribution of features and, in turn, the accuracy of sentiment analysis. This paper evaluates the effect of five Chinese segmentation strategies on sentiment analysis of short text. We chose two word-based Chinese Word Segmentation (CWS) methods and three char-based n-gram methods, and used Zipf’s law to quantify and present the results of word segmentation.
Published: 2021
Full Text: View/download PDF
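
Note: As a rough illustration of the char-based n-gram strategy named in this abstract, the sketch below splits a string into overlapping character n-grams and builds the rank-frequency table that a Zipf's law analysis would start from. It is a minimal example under stated assumptions, not the authors' pipeline: the function names are hypothetical, whitespace is simply dropped, and no word-based segmenter is included.

```python
from collections import Counter


def char_ngrams(text, n):
    """Overlapping character n-grams, ignoring whitespace (an assumed simplification)."""
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]


def rank_frequency_table(units):
    """List of (rank, unit, count), most frequent first; ties keep first-seen order."""
    ranked = Counter(units).most_common()
    return [(rank, unit, count) for rank, (unit, count) in enumerate(ranked, start=1)]


if __name__ == "__main__":
    sample = "我爱北京 我爱上海"
    for row in rank_frequency_table(char_ngrams(sample, 2)):
        print(row)
```

Plotting count against rank on log-log axes for each segmentation strategy would then show how closely each one follows a Zipf-like curve.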

11. How (Non-)Optimal is the Lexicon?

Author: Damián E. Blasi, Ryan Cotterell, Irene Nikkarinen, Kyle Mahowald, and Tiago Pimentel; Editors: Kristina Toutanova, Anna Rumshisky, Luke Zettlemoyer, Dilek Hakkani-Tur, Iz Beltagy, Steven Bethard, Ryan Cotterell, Tanmoy Chakraborty, and Yichao Zhou
Subjects: FOS: Computer and information sciences, Vocabulary, Computer science, media_common.quotation_subject, computer.software_genre, Lexicon, 050105 experimental psychology, 03 medical and health sciences, 0302 clinical medicine, 0501 psychology and cognitive sciences, media_common, Structure (mathematical logic), Computer Science - Computation and Language, Zipf's law, business.industry, 05 social sciences, Statistical model, Feature (linguistics), Artificial intelligence, business, Computation and Language (cs.CL), computer, 030217 neurology & neurosurgery, Natural language processing, Natural language, Generative grammar
Abstract: The mapping of lexical meanings to wordforms is a major feature of natural languages. While usage pressures might assign short words to frequent meanings (Zipf’s law of abbreviation), the need for a productive and open-ended vocabulary, local constraints on sequences of symbols, and various other factors all shape the lexicons of the world’s languages. Despite their importance in shaping lexical structure, the relative contributions of these factors have not been fully quantified. Taking a coding-theoretic view of the lexicon and making use of a novel generative statistical model, we define upper bounds for the compressibility of the lexicon under various constraints. Examining corpora from 7 typologically diverse languages, we use those upper bounds to quantify the lexicon’s optimality and to explore the relative costs of major constraints on natural codes. We find that (compositional) morphology and graphotactics can sufficiently account for most of the complexity of natural codes, as measured by code length. Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, ISBN:978-1-954085-46-6
Published: 2021

12. Linguistic Complexity: Relationships Between Phoneme Inventory Size, Syllable Complexity, Word and Clause Length, and Population Size

Author: Jürgen Pilz and Gertraud Fenk-Oczlon
Subjects: 050101 languages & linguistics, Population, Sample (statistics), computer.software_genre, lcsh:Communication. Mass media, 030507 speech-language pathology & audiology, 03 medical and health sciences, Linguistic sequence complexity, parallel texts, 0501 psychology and cognitive sciences, education, Mathematics, cross-linguistic correlations, education.field_of_study, Zipf's law, business.industry, Population size, 05 social sciences, phoneme inventory size, lcsh:P87-96, clause length, word length, Variable (computer science), Artificial intelligence, syllable complexity, Syllable, 0305 other medical science, business, computer, Natural language processing, Word (group theory)
Abstract: Starting from a view on language as a complex, hierarchically organized system composed of many parts that have many interactions, this paper investigates statistical relationships between the linguistic variables ‘phoneme inventory size’, ‘syllable complexity’, ‘length of words’, ‘length of clauses’, and the nonlinguistic variable ‘population size’. By analyzing parallel textual material of 61 languages (18 language families) we found strong positive correlations between phoneme inventory size and mean syllable complexity (measured as number of phonemes) and between phoneme inventory size and mean number of monosyllabic words. We observed significant negative correlations between phoneme inventory size and the mean length of words and the mean length of clauses, measured as number of syllables. We then correlated the linguistic complexity data with estimated speaker population sizes and found that languages with more speakers tend to have more phonemes per syllable, shorter words in number of syllables, a higher number of monosyllabic words, and a higher number of words per clause. Moreover, we reproduce the results of former studies that found a positive correlation between population size and phoneme inventory size for our language sample. The findings are discussed in light of previous research and within the framework of Systemic Typology. We propose that syllable complexity is a key factor in the correlations identified in this study, and that Zipf’s law of Abbreviation explains the associations between ‘word length’, ‘phoneme inventory size’ and the extralinguistic variable ‘population size’.
Published: 2021
Full Text: View/download PDF

13. Formation of vocabularies in a decentralized graph-based approach to human language

Author: Wenceslao Palma, Felipe Urbina, and Javier Vera
Subjects: Text corpus, Optimization problem, Computer science, Population, computer.software_genre, Vocabulary, Computer Science::Digital Libraries, 01 natural sciences, 010305 fluids & plasmas, 0103 physical sciences, Computer Graphics, Humans, 010306 general physics, education, Language, education.field_of_study, Zipf's law, business.industry, Computer Science::Computation and Language (Computational Linguistics and Natural Language and Speech Processing), Models, Theoretical, Maxima and minima, Word lists by frequency, Bipartite graph, Artificial intelligence, business, computer, Natural language processing
Abstract: Zipf's law establishes a scaling behavior for word frequencies in large text corpora. The appearance of Zipfian properties in vocabularies (viewed as an intermediate phase between referentially useless one-word systems and one-to-one word-meaning vocabularies) has been previously explained as an optimization problem for the interests of speakers and hearers. Remarkably, humanlike vocabularies can be viewed also as bipartite graphs. Thus, the aim here is double: within a bipartite-graph approach to human vocabularies, to propose a decentralized language game model for the formation of Zipfian properties. To do this, we define a language game in which a population of artificial agents is involved in idealized linguistic interactions. Numerical simulations show the appearance of a drastic transition from an initially disordered state towards three kinds of vocabularies. Our results open ways to study Zipfian properties in language, reconciling models seeing communication as a global minima of information entropic energies and models focused on self-organization.
Published: 2021
Full Text: View/download PDF

14. Clustering Word Embeddings with Self-Organizing Maps. Application on LaRoSeDa -- A Large Romanian Sentiment Data Set

Author: Radu Tudor Ionescu, Anca Maria Tache, and Gaman Mihaela
Subjects: Self-organizing map, FOS: Computer and information sciences, Computer Science - Computation and Language, Zipf's law, Generalization, Computer science, business.industry, Romanian, computer.software_genre, language.human_language, language, Artificial intelligence, Computational linguistics, business, Cluster analysis, computer, Computation and Language (cs.CL), Word (computer architecture), Natural language processing, Natural language
Abstract: Romanian is one of the understudied languages in computational linguistics, with few resources available for the development of natural language processing tools. In this paper, we introduce LaRoSeDa, a Large Romanian Sentiment Data Set, which is composed of 15,000 positive and negative reviews collected from one of the largest Romanian e-commerce platforms. We employ two sentiment classification methods as baselines for our new data set, one based on low-level features (character n-grams) and one based on high-level features (bag-of-word-embeddings generated by clustering word embeddings with k-means). As an additional contribution, we replace the k-means clustering algorithm with self-organizing maps (SOMs), obtaining better results because the generated clusters of word embeddings are closer to the Zipf's law distribution, which is known to govern natural language. We also demonstrate the generalization capacity of using SOMs for the clustering of word embeddings on another recently-introduced Romanian data set, for text categorization by topic. Comment: Accepted at EACL 2021
Published: 2021
Full Text: View/download PDF

15. Empirical Laws of Natural Language Processing for Neural Language Generated Text

Author: Sumedha and Rajesh Rohilla
Subjects: Sequence, Zipf's law, Computer science, business.industry, Natural language generation, computer.software_genre, Law, Natural (music), Artificial intelligence, Language model, business, computer, Natural language, Generative grammar, Natural language processing, Heap (data structure)
Abstract: In the domain of Natural Language Generation and Processing, a lot of work is being done on text generation. As machines become able to understand text and language, human involvement is significantly reduced. Many sequence models do well at generating human-like text, but little research has checked the extent to which their results match human-written text. In this paper, text is generated using Long Short Term Memory networks (LSTMs) and Generative Pretrained Transformer-2 (GPT-2). The text generated by neural language models based on LSTMs and GPT-2 follows Zipf’s law and Heaps’ law, two statistical regularities followed by every natural-language text. One of the main findings concerns the influence of the Temperature parameter on the text produced. The LSTM-generated text improves as the value of Temperature increases. The comparison between GPT-2 and LSTM-generated text also shows that text generated using GPT-2 is more similar to natural text than that generated by LSTMs.
Published: 2021
Full Text: View/download PDF

16. High-Throughput Analysis of Urban Textures using Methods from Molecular Simulation

Author: Gerald J. Wang and Ryan Rusali
Subjects: Structure (mathematical logic), Relation (database), Zipf's law, Computer science, Rank (computer programming), Perspective (graphical), Molecular simulation, 02 engineering and technology, 021001 nanoscience & nanotechnology, computer.software_genre, 01 natural sciences, Workflow, 0103 physical sciences, Data mining, 010306 general physics, 0210 nano-technology, computer, Scaling
Abstract: A central pursuit of micro- and nano-scale engineering is describing the small-scale structure of materials in terms of spatial relationships between positions of particles (representing, e.g., individual atoms). In order to carry out such analyses, molecular simulation practitioners have developed a large array of fast methods to compute spatial relationships on particle-resolved data sets. We have developed a computational workflow that directly applies molecular simulation methods to GIS building data, in which individual buildings (instead of, e.g., atoms) are treated as the particles of interest. In so doing, we enable efficient quantification of "urban textures" consisting of $\gg O(10^3)$ buildings. This interdisciplinary toolkit potentially opens the door to new vistas for urban-systems modeling. As one of a few early examples, we provide evidence for a novel scaling relationship between size and ordinal rank of building clusters within several American cities, reminiscent of Zipf's Law. This scaling relation suggests a new perspective on fractal-like organization of urban environments.
Published: 2020
Full Text: View/download PDF

17. Empirical Laws of Natural Language Processing for Hindi Language

Author: Mahesh R. Shirsath, Hrishikesh Khandare, Manali Musale, Arun Babhulgaonkar, Adwait Tekale, and Atharv A. Kurdukar
Subjects: Hindi, Zipf's law, business.industry, Computer science, 05 social sciences, ComputingMilieux_LEGALASPECTSOFCOMPUTING, Mandelbrot set, computer.software_genre, 050105 experimental psychology, language.human_language, Type–token distinction, 03 medical and health sciences, 0302 clinical medicine, Law, language, Text normalization, 0501 psychology and cognitive sciences, Artificial intelligence, business, computer, 030217 neurology & neurosurgery, Natural language processing, Heap (data structure)
Abstract: Empirical laws are the statistical laws that describe the relation between entities in a large dataset. They are readily found in nature, and findings have been proven by observations [1]. The primary objective of this study is to verify some of the empirical laws, such as Zipf’s law, Mandelbrot’s approximation, and Heaps’ law, for a Hindi language corpus. This involves collecting a corpus, performing text normalization, tokenizing it to get a list of words, identifying word types and their frequency, sorting and ranking the data based on frequency, and representing the relation between the frequency and rank of the word types to validate Zipf’s law and Mandelbrot’s approximation. For Heaps’ law, the relation between the number of word types and tokens for different subsets of the corpus is considered. Based on our observations, the Hindi language satisfies the laws mentioned above.
Published: 2020
Full Text: View/download PDF
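
Illustration for record 17: the rank-frequency and type-token steps listed in the abstract can be sketched roughly as below. This is a minimal sketch, not the authors' code. It assumes the corpus has already been normalized and tokenized into a list of strings, and the function names (rank_frequency, loglog_slope, vocabulary_growth) are hypothetical names chosen for this example.

```python
from collections import Counter
import math


def rank_frequency(tokens):
    """Return (rank, frequency) pairs, most frequent word type first."""
    counts = Counter(tokens)
    ordered = sorted(counts.values(), reverse=True)
    return list(enumerate(ordered, start=1))


def loglog_slope(pairs):
    """Least-squares slope of log(frequency) against log(rank).

    A slope near -1 is the classic Zipf pattern; this is a simple
    diagnostic, not a rigorous goodness-of-fit test.
    """
    xs = [math.log(r) for r, _ in pairs]
    ys = [math.log(f) for _, f in pairs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


def vocabulary_growth(tokens, step):
    """(tokens seen, word types seen) every `step` tokens, for Heaps' law."""
    seen, points = set(), []
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok)
        if i % step == 0 or i == len(tokens):
            points.append((i, len(seen)))
    return points


if __name__ == "__main__":
    # Toy input (assumption): whitespace tokenization of a tiny string.
    toy = "a a a a a a b b b c c".split()
    print(rank_frequency(toy))
    print(loglog_slope(rank_frequency(toy)))
    print(vocabulary_growth(toy, step=4))
```

Plotting the pairs from rank_frequency on log-log axes, or the points from vocabulary_growth, gives the kind of curves that are usually inspected when checking Zipf's law and Heaps' law on a corpus.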

18. Classification Between Machine Translated Text and Original Text By Part Of Speech Tagging Representation

Author: Nancirose Piazza
Subjects: Vocabulary, Frequentist probability, Word embedding, Zipf's law, Computer science, business.industry, media_common.quotation_subject, computer.software_genre, Word lists by frequency, Decision boundary, Probability distribution, Trigram, Artificial intelligence, business, computer, Natural language processing, media_common
Abstract: Classifiers that distinguish machine-translated text from original text often tokenize on the vocabulary of the corpora. With N-grams larger than unigrams, one can create a model that estimates a decision boundary based on the word frequency probability distribution; however, this approach is exponentially expensive because of high dimensionality and sparsity. Instead, we let samples of the corpora be represented by part-of-speech tags, which form a significantly smaller vocabulary. With fewer trigram permutations, we can create a model from the trigram frequency probability distribution. In this paper, we explore less conventional techniques for handling documents, dictionaries, and the like.
Published: 2020
Full Text: View/download PDF

19. Research on the Applicability of Benford’s Law in Chinese Texts

Author: Rongyi Cui, Junlong Gao, and Yahui Zhao
Subjects: 0209 industrial biotechnology, Kullback–Leibler divergence, Zipf's law, Computer science, business.industry, Text segmentation, 02 engineering and technology, computer.software_genre, Benford's law, Word lists by frequency, 020901 industrial engineering & automation, 0202 electrical engineering, electronic engineering, information engineering, Range (statistics), Entropy (information theory), Probability distribution, 020201 artificial intelligence & image processing, Artificial intelligence, business, computer, Natural language processing
Abstract: This paper investigates the applicability of Benford’s Law in Chinese texts. Firstly, a Chinese corpus was collected and word segmentation was performed. The distributions of the first digit of frequency were calculated for words, low-frequency words and single characters respectively in Chinese texts, and the relative entropy (Kullback-Leibler distance) between these distributions and the general Benford’s law was computed. Secondly, the parameter value range of the Generalized Benford’s law was studied, and, in view of the limitation that Zipf’s law is only applicable to large amounts of data, we carried out a statistical analysis of small-scale data. Then, an experimental analysis of the probability of the first digit of the word frequency of single-character data was carried out to explore the applicability of the Generalized Benford’s law to single-character data. Finally, the applicability of Benford’s law was investigated for an artificially modified corpus. The results show that the words and characters in Chinese texts conform to Benford’s law, that Benford’s law overcomes the limitation of Zipf’s law on the size of the data sets, and that the Generalized Benford’s law has the ability to discriminate the natural quality of the corpus, which has important practical significance for Chinese information processing.
Published: 2020
Full Text: View/download PDF
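
Illustration for record 19: the first-digit comparison described in the abstract can be sketched as below. This is a minimal sketch under stated assumptions, not the authors' implementation. It assumes the input is a list of positive integer word frequencies, computes the relative entropy in the direction D(observed || Benford) in bits (the abstract does not state the direction or the log base), and uses hypothetical function names.

```python
from collections import Counter
import math


def benford_expected():
    """Benford's law: P(d) = log10(1 + 1/d) for leading digits d = 1..9."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def first_digit_distribution(frequencies):
    """Observed share of each leading digit among positive integer counts."""
    digits = Counter(int(str(f)[0]) for f in frequencies if f > 0)
    total = sum(digits.values())
    return {d: digits.get(d, 0) / total for d in range(1, 10)}


def kl_divergence(observed, expected):
    """Relative entropy D(observed || expected) in bits.

    Terms with observed probability 0 contribute 0 by convention.
    """
    return sum(
        p * math.log2(p / expected[d]) for d, p in observed.items() if p > 0
    )


if __name__ == "__main__":
    # Toy word-frequency counts (assumption, not data from the paper).
    counts = [1, 12, 150, 2, 9]
    observed = first_digit_distribution(counts)
    print(observed)
    print(kl_divergence(observed, benford_expected()))
```

A divergence close to zero indicates that the observed leading digits are close to the Benford distribution; how close is "close enough" is a choice the sketch leaves open.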

20. tax2vec: Constructing Interpretable Features from Taxonomies for Short Text Classification

Author: Jan Kralj, Nada Lavrač, Senja Pollak, Matej Martinc, and Blaž Škrlj
Subjects: FOS: Computer and information sciences, text classification, Computer science, semantic enrichment, Semantic space, Parallel algorithm, 02 engineering and technology, computer.software_genre, 01 natural sciences, Theoretical Computer Science, Robustness (computer science), 0103 physical sciences, 0202 electrical engineering, electronic engineering, information engineering, vectorization, feature construction, 010301 acoustics, Computer Science - Computation and Language, Zipf's law, Artificial neural network, business.industry, 020206 networking & telecommunications, taxonomies, Human-Computer Interaction, Personality type, Drug side effects, short documents, Artificial intelligence, business, computer, Computation and Language (cs.CL), Software, Natural language processing
Abstract: The use of background knowledge is largely unexploited in text classification tasks. This paper explores word taxonomies as a means for constructing new semantic features, which may improve the performance and robustness of the learned classifiers. We propose tax2vec, a parallel algorithm for constructing taxonomy-based features, and demonstrate its use on six short text classification problems: prediction of gender, personality type, age, news topics, drug side effects and drug effectiveness. The constructed semantic features, in combination with fast linear classifiers, tested against strong baselines such as hierarchical attention neural networks, achieve comparable classification results on short text documents. The algorithm's performance is also tested in a few-shot learning setting, indicating that the inclusion of semantic features can improve the performance in data-scarce situations. The tax2vec capability to extract corpus-specific semantic keywords is also demonstrated. Finally, we investigate the semantic space of potential features, where we observe a similarity with the well-known Zipf's law. Comment: Accepted at CSL journal
Published: 2020

21. Evolving Losses for Unsupervised Video Representation Learning

Author: AJ Piergiovanni, Anelia Angelova, and Michael S. Ryoo
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Computer science, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, 02 engineering and technology, 010501 environmental sciences, Machine learning, computer.software_genre, 01 natural sciences, Evolutionary computation, Machine Learning (cs.LG), Search algorithm, 0202 electrical engineering, electronic engineering, information engineering, 0105 earth and related environmental sciences, Zipf's law, business.industry, Representation (systemics), Constraint (information theory), ComputingMethodologies_PATTERNRECOGNITION, Metric (mathematics), Unsupervised learning, 020201 artificial intelligence & image processing, Artificial intelligence, business, Feature learning, computer
Abstract: We present a new method to learn video representations from large-scale unlabeled video data. Ideally, this representation will be generic and transferable, directly usable for new tasks such as action recognition and zero or few-shot learning. We formulate unsupervised representation learning as a multi-modal, multi-task learning problem, where the representations are shared across different modalities via distillation. Further, we introduce the concept of loss function evolution by using an evolutionary search algorithm to automatically find an optimal combination of loss functions capturing many (self-supervised) tasks and modalities. Thirdly, we propose an unsupervised representation evaluation metric using distribution matching to a large unlabeled dataset as a prior constraint, based on Zipf's law. This unsupervised constraint, which is not guided by any labeling, produces similar results to weakly-supervised, task-specific ones. The proposed unsupervised representation learning results in a single RGB network and outperforms previous methods. Notably, it is also more effective than several label-based methods (e.g., ImageNet), with the exception of large, fully labeled video datasets. Comment: arXiv admin note: text overlap with arXiv:1906.03248
Published: 2020
Full Text: View/download PDF

22. Concept detection using text exemplars aligned with a specialized ontology

Author: Fred N. Davis, David A. Juckett, Eric P. Kasten, and Mark Gostine
Subjects: 0303 health sciences, Class (computer programming), Information Systems and Management, Zipf's law, business.industry, Computer science, 02 engineering and technology, Ontology (information science), computer.software_genre, Plot (graphics), Domain (software engineering), Set (abstract data type), 03 medical and health sciences, Knowledge extraction, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Artificial intelligence, business, computer, Sentence, Natural language processing, 030304 developmental biology
Abstract: Knowledge extraction from text documents requires identifying and classifying semantic content. Utilizing an appropriate domain ontology can facilitate this process if words and phrases can be linked to the classes and relationships within the ontology. This paper presents an exemplar-based algorithm to link text to semantically similar classes within an ontology constructed for the chronic pain medicine domain. Human annotators linked classes to text segments within a random document set for construction of an exemplar dictionary, which we examined for completeness using Zipf plot analysis. An algorithm was created to use this dictionary on previously unseen text to form a map between sentence text and probable class assignments. We performed a 5 × 5 cross-validation between human and algorithm annotations and examined both ROC and precision versus recall curves to show that the algorithm can identify the many medical and biopsychosocial components from the texts. We briefly describe a use case for detecting pain relief from various interventions utilizing the word-by-class maps. We conclude that an exemplar-based method can be a valuable tool in knowledge extraction from texts that share similar construction, such as medical progress notes.
Published: 2019
Full Text: View/download PDF

23. Brevity is not a universal in animal communication: evidence for compression depends on the unit of analysis in small ape vocalizations

Author: Abdul Hamid Ahmad, Dena J. Clink, and Holger Klinck
Subjects: 0106 biological sciences, Phrase, Computer science, hylobates, Type (model theory), computer.software_genre, 010603 evolutionary biology, 01 natural sciences, unsupervised clustering, 03 medical and health sciences, Menzerath's law, Hylobates, Code (cryptography), menzerath's law, Animal communication, lcsh:Science, 030304 developmental biology, 0303 health sciences, Multidisciplinary, biology, Zipf's law, business.industry, biology.organism_classification, compression, Support vector machine, Organismal and Evolutionary Biology, zipf's law of abbreviation, lcsh:Q, Artificial intelligence, business, computer, Natural language processing, Research Article
Abstract: Evidence for compression, or minimization of code length, has been found across biological systems from genomes to human language and music. Two linguistic laws—Menzerath's Law (which states that longer sequences consist of shorter constituents) and Zipf's Law of abbreviation (a negative relationship between signal length and frequency of use)—are predictions of compression. It has been proposed that compression is a universal in animal communication, but there have been mixed results, particularly in reference to Zipf's Law of abbreviation. Like songbirds, male gibbons (Hylobates muelleri) engage in long solo bouts with unique combinations of notes which combine into phrases. We found strong support for Menzerath's Law as the longer a phrase, the shorter the notes. To identify phrase types, we used state-of-the-art affinity propagation clustering, and were able to predict phrase types using support vector machines with a mean accuracy of 74%. Based on unsupervised phrase type classification, we did not find support for Zipf's Law of abbreviation. Our results indicate that adherence to linguistic laws in male gibbon solos depends on the unit of analysis. We conclude that principles of compression are applicable outside of human language, but may act differently across levels of organization in biological systems.
Published: 2020
Full Text: View/download PDF

24. A Statistic and Analysis of Access Pattern for Online VoD Multimedia

Author: Weibei Fan, Zhijie Han, Ji_ao Ma, and Xin He
Subjects: Multimedia, Zipf's law, business.industry, Computer science, 020206 networking & telecommunications, 02 engineering and technology, Object (computer science), computer.software_genre, Theoretical Computer Science, Variety (cybernetics), Mode (computer interface), Hardware and Architecture, Control and Systems Engineering, Modeling and Simulation, Signal Processing, Pattern recognition (psychology), 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, The Internet, Long tail, business, computer, Statistic, Information Systems
Abstract: It is generally accepted that the Zipf distribution is a convincing model of access patterns for the text-based Web. However, with the dramatic increase in VoD media traffic on the Internet, such as Flash P2P, the inconsistency between the access patterns of media objects and the Zipf model has been studied by many scholars. In this paper, we study a large variety of media workloads collected from both the browser and server sides of Adobe Flash P2P systems, which are applied in Youku, YouTube, etc. Through extensive analysis and modeling, we find that the object reference ranks of all these workloads follow the logistic (LOG) distribution despite their different media systems and delivery methods. This means that they do not follow the long-tail effect. Furthermore, we construct mathematical models that can be applied to access patterns in Flash P2P traffic. Analyzing the model of media traffic access makes it possible to better describe users' access modes. It is also well suited to the configuration and allocation of network resources, which can then be used more efficiently.
Published: 2018
Full Text: View/download PDF

25. Analysing the features of negative sentiment tweets

Author: Ling Zhang, Wei Dong, and Xiangming Mu
Subjects: Zipf's law, Social network, business.industry, Computer science, Document classification, 05 social sciences, Sentiment analysis, 02 engineering and technology, Library and Information Sciences, computer.software_genre, Part of speech, Computer Science Applications, Content analysis, 0202 electrical engineering, electronic engineering, information engineering, Ontology, 020201 artificial intelligence & image processing, Social media, Artificial intelligence, 0509 other social sciences, 050904 information & library sciences, business, computer, Natural language processing
Abstract: Purpose: This paper aims to address the challenge of analysing the features of negative sentiment tweets. The method adopted in this paper elucidates the classification of social network documents and paves the way for sentiment analysis of tweets in further research. Design/methodology/approach: This study classifies negative tweets and analyses their features. Findings: Through negative tweet content analysis, tweets are divided into ten topics. Many related words and negative words were found. Some indicators of negative word use could reflect the degree to which users release negative emotions: part of speech, the density and frequency of negative words and negative word distribution. Furthermore, the distribution of negative words obeys Zipf’s law. Research limitations/implications: This study manually analysed only a small sample of negative tweets. Practical implications: The research explored how many categories of negative sentiment tweets there are on Twitter. Related words are helpful to construct an ontology of tweets, which helps people with information retrieval in a fixed research area. The analysis of extracted negative words determined the features of negative tweets, which is useful for detecting the polarity of tweets by machine learning methods. Originality/value: The research provides an initial exploration of a negative document classification method and classifies the negative tweets into ten topics. By analysing the features of negative tweets, related words, negative words, the density of negative words, etc. are presented. This work is the first step to extend Plutchik’s emotion wheel theory into social media data analysis by constructing field-specific thesauri, referred to as local sentimental thesauri.
Published: 2018
Full Text: View/download PDF

26. Constructing a WordNet for Turkish Using Manual and Automatic Annotation

Author: Razieh Ehsani, Olcay Taner Yıldız, and Ercan Solak (Işık University, Faculty of Engineering, Department of Computer Engineering)
Subjects: Connected component, General Computer Science, Zipf's law, Turkish, business.industry, Computer science, WordNet, 02 engineering and technology, computer.software_genre, language.human_language, Lexicography, Annotation, Synset, 020204 information systems, 0202 electrical engineering, electronic engineering, information engineering, language, 020201 artificial intelligence & image processing, Artificial intelligence, Variation of information, business, Cluster analysis, computer, Natural language processing
Abstract: In this article, we summarize the methodology and the results of our 2-year-long efforts to construct a comprehensive WordNet for Turkish. In our approach, we mine a dictionary for synonym candidate pairs and manually mark the senses in which the candidates are synonymous. We marked every pair twice by different human annotators. We derive the synsets by finding the connected components of the graph whose edges are synonym senses. We also mined Turkish Wikipedia for hypernym relations among the senses. We analyzed the resulting WordNet to highlight the difficulties brought about by the dictionary construction methods of lexicographers. After splitting the unusually large synsets, we used random walk-based clustering that resulted in a Zipfian distribution of synset sizes. We compared our results to BalkaNet and automatic thesaurus construction methods using variation of information metric. Our Turkish WordNet is available online.
Published: 2018
Full Text: View/download PDF

27. Using the parameters of the Zipf–Mandelbrot law to measure diachronic lexical, syntactical and stylistic changes – a large-scale corpus analysis

Author: Alexander Koplenig
Subjects: 060201 languages & linguistics, Zipf–Mandelbrot law, Linguistics and Language, Measure (data warehouse), Corpus analysis, Lexical density, Zipf's law, Computer science, business.industry, 06 humanities and the arts, computer.software_genre, 01 natural sciences, Language and Linguistics, Linguistics, Varieties of English, 010104 statistics & probability, 0602 languages and literature, Artificial intelligence, 0101 mathematics, Time series, Scale (map), business, computer, Natural language processing
Abstract: Using the Google Ngram Corpora for six different languages (including two varieties of English), a large-scale time series analysis is conducted. It is demonstrated that diachronic changes of the parameters of the Zipf–Mandelbrot law (and the parameter of the Zipf law, all estimated by maximum likelihood) can be used to quantify and visualize important aspects of linguistic change (as represented in the Google Ngram Corpora). The analysis also reveals that there are important cross-linguistic differences. It is argued that the Zipf–Mandelbrot parameters can be used as a first indicator of diachronic linguistic change, but more thorough analyses should make use of the full spectrum of different lexical, syntactical and stylometric measures to fully understand the factors that actually drive those changes.
Published: 2018
Full Text: View/download PDF

28. The predictive capabilities of mathematical models for the type-token relationship in English language corpora

Author: Martin J. Tunnicliffe and Gordon Hunter
Subjects: Vocabulary, Computer science, media_common.quotation_subject, 02 engineering and technology, English language, Type (model theory), Security token, computer.software_genre, 01 natural sciences, Theoretical Computer Science, 0103 physical sciences, 0202 electrical engineering, electronic engineering, information engineering, Bernoulli trial, 010301 acoustics, media_common, Zipf's law, Mathematical model, business.industry, 020206 networking & telecommunications, Human-Computer Interaction, Probability distribution, Artificial intelligence, business, computer, Software, Natural language processing
Abstract: We investigate the predictive capability of mathematical models of the type-token relationship applied to the vocabulary growth profiles of selected English language documents. We compare the existing Good-Toulmin and Heaps formulae with an alternative approach based on Bernoulli trial word selection from a fixed finite vocabulary using the Zipf and Zipf-Mandelbrot probability distributions. We make two major observations: firstly, while the Zipf-Mandelbrot model makes better predictions of vocabulary growth than the Zipf model, the optimized parameters of the latter correlate better than those of the former with statistics gleaned independently from the data. Secondly, the mean of the Zipf-Mandelbrot, Good-Toulmin and Heaps models provides a more consistent and unbiased prediction of vocabulary than any individual model alone.
Published: 2021
Full Text: View/download PDF

29. Zipf’s Law in Passwords

Author: Ping Wang, Xinyi Huang, Ding Wang, Haibo Cheng, and Gaopeng Jian
Subjects: Password, Authentication, Zipf's law, Computer Networks and Communications, Computer science, Cumulative distribution function, 020206 networking & telecommunications, 02 engineering and technology, computer.software_genre, Empirical distribution function, Data modeling, Data set, Metric (mathematics), 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Data mining, Safety, Risk, Reliability and Quality, computer
Abstract: Despite three decades of intensive research efforts, it remains an open question as to what is the underlying distribution of user-generated passwords. In this paper, we make a substantial step forward toward understanding this foundational question. By introducing a number of computational statistical techniques and based on 14 large-scale data sets, which consist of 113.3 million real-world passwords, we, for the first time, propose two Zipf-like models (i.e., PDF-Zipf and CDF-Zipf) to characterize the distribution of passwords. More specifically, our PDF-Zipf model can well fit the popular passwords and obtain a coefficient of determination larger than 0.97; our CDF-Zipf model can well fit the entire password data set, with the maximum cumulative distribution function (CDF) deviation between the empirical distribution and the fitted theoretical model being 0.49%~4.59% (on average 1.85%). With the concrete knowledge of password distributions, we suggest a new metric for measuring the strength of password data sets. Extensive experimental results show the effectiveness and general applicability of the proposed Zipf-like models and security metric.
Published: 2017
Full Text: View/download PDF

30. A Synergetic Approach to the Relationship between the Length and Frequency among English Multiword Formulaic Sequences

Author: Yunhua Qu, Xueting Dai, and Zhiwei Feng
Subjects: 060201 languages & linguistics, Structure (mathematical logic), Linguistics and Language, 060101 anthropology, Zipf's law, business.industry, 06 humanities and the arts, computer.software_genre, Language and Linguistics, Word lists by frequency, Lexical bundles, 0602 languages and literature, 0601 history and archaeology, Production (computer science), Artificial intelligence, business, computer, Word length, Natural language processing, Mathematics, Meaning (linguistics)
Abstract: The present paper employs a synergetic approach to exploring the relationship between the length and frequency among English multiword formulaic sequences. It attempts to test whether Zipf’s assumption on the inverse relationship between word length and word frequency, as well as the synergetic model constructed at lexical level, can be extended to multiword formulaic sequences. A corpus-driven approach is adopted to acquire sufficient data. Results show partial applicability of the assumption to a small part of formulaic sequences: sequences of words with whole structure and complete meaning. However, the majority of formulaic sequences, lexical bundles, are proved to be an exception to the conventional rule. The paper tries to offer explanation by exploring further the features of lexical bundles. It concludes that the synergetic system of length and frequency among lexical bundles is operated under the compromise between the requirements of minimization of production efforts and the requirement...
Published: 2017
Full Text: View/download PDF

31. Visual statistical learning is facilitated in Zipfian distributions

Author: Inbal Arnon and Ori Lavi-Rotbain
Subjects: Adult, Linguistics and Language, Databases, Factual, Cognitive Neuroscience, Intelligence, Spatial Learning, Experimental and Cognitive Psychology, Information theory, computer.software_genre, 050105 experimental psychology, Language and Linguistics, 03 medical and health sciences, 0302 clinical medicine, Hearing, Developmental and Educational Psychology, Humans, 0501 psychology and cognitive sciences, Child, Point (typography), Zipf's law, Learnability, business.industry, 05 social sciences, Text segmentation, Contrast (statistics), Object (computer science), Artificial intelligence, Frequency distribution, Psychology, business, computer, 030217 neurology & neurosurgery, Natural language processing, Statistical Distributions
Abstract: Humans can extract co-occurrence regularities from their environment, and use them for learning. This statistical learning ability (SL) has been studied extensively as a way to explain how we learn the structure of our environment. These investigations have illustrated the impact of various distributional properties on learning. However, almost all SL studies present the regularities to be learned in uniform frequency distributions where each unit (e.g., image triplet) appears the same number of times: While the regularities themselves are informative, the appearance of the units cannot be predicted. In contrast, real-world learning environments, including the words children hear and the objects they see, are not uniform. Recent research shows that word segmentation is facilitated in a skewed (Zipfian) distribution. Here, we examine the domain-generality of the effect and ask if visual SL is also facilitated in a Zipfian distribution. We use an existing database to show that object combinations have a skewed distribution in children's environment. We then show that children and adults showed better learning in a Zipfian distribution compared to a uniform one, overall, and for low-frequency triplets. These results illustrate the facilitative impact of skewed distributions on learning across modality and age; suggest that the use of uniform distributions may underestimate performance; and point to the possible learnability advantage of such distributions in the real world.
Published: 2020

32. Evaluating the Impact of Region Based Content Popularity of Videos on the Cost of CDN Deployment

Author: Prateek Yadav and Subrat Kar
Subjects: Content popularity, Zipf's law, Multimedia, Computer science, User-generated content, 02 engineering and technology, computer.software_genre, Popularity, Video sharing, Metadata, Software deployment, 020204 information systems, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Video streaming, computer
Abstract: We investigate the content popularity of video streams in a local region and its impact on the CDN deployment cost. We have gathered real-world data from four popular classroom video streaming sites and analyzed the content caching to optimize the CDN deployment. Our traces contain metadata of around 31 thousand educational videos and approximately 100 million views in the education category of YouTube. From this analysis, we assert that region-based content (e.g., NPTEL, India) follows Zipf's law with a low popularity exponent, and that it is the region-based content popularity which most significantly impacts the CDN deployment cost.
Published: 2020
Full Text: View/download PDF

33. Analysis of user behavior in a large-scale internet video-on-demand (VoD) system

Author: Yaohui Yuan, Guangxiang Bin, and Xingjun Wang
Subjects: Multimedia, Zipf's law, Computer science, business.industry, Scale (chemistry), 020206 networking & telecommunications, 02 engineering and technology, Internet traffic, computer.software_genre, Popularity, Content distribution, On demand, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, The Internet, business, computer, Internet video
Abstract: With the development and popularity of the Internet, Internet traffic has increased dramatically. Numerous studies have shown that video accounts for a large percentage of Internet traffic, and this percentage is still rising. A good understanding of user behavior in online video systems can help us design, configure and manage video content distribution to alleviate network stress. In this paper, we present a detailed analysis of user behavior data for Internet video. Our research shows that users' daily access and online activity follow a fixed pattern, and that users' access behavior conforms to Zipf's law. In addition, we optimized the fit of the Zipf-like distribution of video popularity. Finally, we built a reliable simulation system that simulates user behavior data. Overall, we believe that the results presented in this paper are very important and valuable to the whole network.
Published: 2020
Full Text: View/download PDF

34. Basic Lexical Concepts and Measurements

Author: Jacques Savoy
Subjects: Background information, Lexical density, Lemma (mathematics), Vocabulary, Sentence length, Zipf's law, business.industry, Computer science, Computation, media_common.quotation_subject, Mathematical notation, computer.software_genre, Artificial intelligence, business, computer, Natural language processing, media_common
Abstract: This second chapter presents background information that will be useful throughout the book. In particular, this chapter presents and defines precisely the notions of word-type, word-token, and lemma. The mathematical notation used in the entire book is also presented and commented. A running example (the Federalist Papers) is described and will serve to illustrate the concepts and computations done in the book. Next, the most important overall stylometric measurements are explained and numerical examples are provided. In this case, Zipf’s Law and various vocabulary richness measures (e.g., type-token ratio (TTR), Herdan’s C, Yule’s K) are discussed. Finally, other global stylistic measurements are presented, such as lexical density (LD), percentage of big words (BW), or the mean sentence length (MSL).
Published: 2020
Full Text: View/download PDF
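
The vocabulary richness measures named in the abstract of entry 34 can be illustrated with a minimal Python sketch. It is not taken from the book: the function name `vocabulary_richness` and the whitespace tokenization in the usage line are assumptions made for illustration, and the formulas are the commonly cited definitions (TTR = V/N, Herdan's C = log V / log N, Yule's K = 10^4 (Σ i² V_i − N) / N², where V_i is the number of word-types occurring exactly i times).

```python
import math
from collections import Counter


def vocabulary_richness(tokens):
    """Return (TTR, Herdan's C, Yule's K) for a list of word-tokens.

    Hypothetical helper for illustration; requires at least two tokens
    so that log(N) is non-zero.
    """
    n_tokens = len(tokens)
    if n_tokens < 2:
        raise ValueError("need at least two tokens")

    type_counts = Counter(tokens)          # word-type -> frequency
    n_types = len(type_counts)

    ttr = n_types / n_tokens
    herdan_c = math.log(n_types) / math.log(n_tokens)

    # Frequency spectrum: V_i = number of word-types occurring exactly i times.
    spectrum = Counter(type_counts.values())
    sum_sq = sum(i * i * v_i for i, v_i in spectrum.items())
    yule_k = 1e4 * (sum_sq - n_tokens) / (n_tokens ** 2)

    return ttr, herdan_c, yule_k


# Usage: naive whitespace tokenization (an assumption; real studies lemmatize).
ttr, herdan_c, yule_k = vocabulary_richness("the cat sat on the mat the end".split())
```

TTR and Herdan's C increase with lexical variety, while Yule's K increases with repetition, so the three are usually read side by side rather than in isolation.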

35. Feature Extraction with TF-IDF and Game-Theoretic Shadowed Sets

Author: Yan Zhang, Yue Zhou, and JingTao Yao
Subjects: Zipf's law, Computer science, business.industry, Feature extraction, 020207 software engineering, 02 engineering and technology, computer.software_genre, Weighting, Set (abstract data type), Statistical classification, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Artificial intelligence, tf–idf, business, computer, Game theory, Natural language processing, Word (computer architecture)
Abstract: TF-IDF is one of the most commonly used weighting metrics for measuring the relationship of words to documents. It is widely used for word feature extraction. In much research and many applications, the TF-IDF thresholds for selecting relevant words are based only on trial and error or experience. Some cut-off strategies have been proposed in which the thresholds are selected based on Zipf’s law or feedback from model performance. However, the existing approaches are restricted to specific domains or tasks, and they ignore the imbalance in the number of representative words in different categories of documents. To address these issues, we apply a game-theoretic shadowed set model to select the word features given TF-IDF information. Game-theoretic shadowed sets determine the thresholds of TF-IDF using game theory and a repetition learning mechanism. Experimental results on a real-world news category dataset show that our model not only outperforms all baseline cut-off approaches, but also speeds up the classification algorithms.
Published: 2020
Full Text: View/download PDF

36. An Application of Zipf's Law for Prose and Verse Corpora Neutrality for Hindi and Marathi Languages

Author: R Jatinderkumar and Prafulla B. Bafna
Subjects: Hindi, General Computer Science, Zipf's law, business.industry, Computer science, Lemmatisation, Context (language use), computer.software_genre, Automatic summarization, language.human_language, Text mining, language, Official language, Artificial intelligence, Marathi, tf–idf, business, computer, Natural language processing
Abstract: Text has become available in many languages, as almost all websites now offer a multilingual option. Hindi is an official language in several states of India. Hindi text analysis is dominated by corpora of stories and poems. Before performing any text analysis, token extraction is an important step that supports many applications such as text summarization, text categorization and so on. Token extraction is a part of natural language processing (NLP). NLP includes many steps such as preprocessing the corpus, lemmatization and so on. In this paper, tokens are extracted by two methods and on two corpora. BaSa, a context-based term extraction technique having different NLP activities, e.g. Term Frequency Inverse Document Frequency (TF-IDF), and Zipf’s law are used to count and compare extracted tokens. The tokens produced by the two methods are then compared. The corpus contains prose and verse in Hindi as well as Marathi. Common tokens from the verse and prose corpora of Marathi as well as Hindi are identified to show that both behave the same as far as NLP activities are concerned. BaSa is shown to perform better than Zipf’s law. The Hindi corpus includes 820 stories and 710 poems, and the Marathi corpus includes 610 stories and 505 poems.
Published: 2020
Full Text: View/download PDF

37. On the Economics of Offline Password Cracking

Author: Samson Zhou, Benjamin Harsha, and Jeremiah Blocki
Subjects: Password, Password hashing, FOS: Computer and information sciences, 021110 strategic, defence & security studies, Authentication, Computer Science - Cryptography and Security, Zipf's law, Computer science, Hash function, 0211 other engineering and technologies, Password cracking, 02 engineering and technology, Adversary, scrypt, Computer security, computer.software_genre, Authentication server, PBKDF2, ComputingMilieux_MANAGEMENTOFCOMPUTINGANDINFORMATIONSYSTEMS, 020204 information systems, 0202 electrical engineering, electronic engineering, information engineering, Cryptographic hash function, Message authentication code, computer, Cryptography and Security (cs.CR)
Abstract: We develop an economic model of an offline password cracker which allows us to make quantitative predictions about the fraction of accounts that a rational password attacker would crack in the event of an authentication server breach. We apply our economic model to analyze recent massive password breaches at Yahoo!, Dropbox, LastPass and AshleyMadison. All four organizations were using key-stretching to protect user passwords. In fact, LastPass' use of PBKDF2-SHA256 with $10^5$ hash iterations exceeds the 2017 NIST minimum recommendation by an order of magnitude. Nevertheless, our analysis paints a bleak picture: the adopted key-stretching levels provide insufficient protection for user passwords. In particular, we present strong evidence that most user passwords follow a Zipf's law distribution, and characterize the behavior of a rational attacker when user passwords are selected from a Zipf's law distribution. We show that there is a finite threshold which depends on the Zipf's law parameters that characterizes the behavior of a rational attacker -- if the value of a cracked password (normalized by the cost of computing the password hash function) exceeds this threshold then the adversary's optimal strategy is always to continue attacking until each user password has been cracked. In all cases (Yahoo!, Dropbox, LastPass and AshleyMadison) we find that the value of a cracked password almost certainly exceeds this threshold, meaning that a rational attacker would crack all passwords that are selected from the Zipf's law distribution (i.e., most user passwords). This prediction holds even if we incorporate an aggressive model of diminishing returns for the attacker (e.g., the total value of $500$ million cracked passwords is less than $100$ times the total value of $5$ million passwords). See paper for full abstract. Comment: IEEE Symposium on Security and Privacy (S&P) 2018
Published: 2020
Full Text: View/download PDF

38. Scaling Laws for Phonotactic Complexity in Spoken English Language Data

Author: Andreas Baumann, Kamil Kaźmierski, and Theresa Matzinger
Subjects: 050101 languages & linguistics, Linguistics and Language, Scaling law, Sociology and Political Science, Computer science, Boundary spanning, Complex system, computer.software_genre, 050105 experimental psychology, Language and Linguistics, Domain (mathematical analysis), diversity, Heaps’ law, Speech and Hearing, phonotactics, Humans, 0501 psychology and cognitive sciences, Scaling, Language, Phonotactics, Zipf's law, business.industry, 05 social sciences, Linguistics, Articles, General Medicine, Models, Theoretical, inventory size, Zipf’s law, Artificial intelligence, business, computer, Word (computer architecture), Natural language processing
Abstract: Two prominent statistical laws in language and other complex systems are Zipf’s law and Heaps’ law. We investigate the extent to which these two laws apply to the linguistic domain of phonotactics—that is, to sequences of sounds. We analyze phonotactic sequences with different lengths within words and across word boundaries taken from a corpus of spoken English (Buckeye). We demonstrate that the expected relationship between the two scaling laws can only be attested when boundary spanning phonotactic sequences are also taken into account. Furthermore, it is shown that Zipf’s law exhibits both high goodness-of-fit and a high scaling coefficient if sequences of more than two sounds are considered. Our results support the notion that phonotactic cognition employs information about boundary spanning phonotactic sequences.
Published: 2020

39. Spotting Urdu Stop Words By Zipf's Statistical Approach

Author: Gul Sahar, Muhammad Paend Bakht, Abdul Samad, Nuzhat Khan, and Muhammad Junaid Khan
Subjects: Stop words, Zipf's law, Computer science, business.industry, Python (programming language), Spotting, computer.software_genre, language.human_language, Text processing, language, Urdu, Artificial intelligence, Language analysis, business, computer, Natural language, Natural language processing, computer.programming_language
Abstract: This paper presents an innovative method to extract stop words from large Urdu text. Stop words are less meaningful words in natural language that slow down language processing and affect language analysis negatively. For language analysis, stop words are removed first to ensure fast data processing. For the Urdu language, however, there is no reliable stop word removal method. In this work, we applied Zipf's law of two factors dependency with a least effort approach to spot stop words in an Urdu language corpus. The Urdu corpus was created specifically for this research. All Urdu text processing and investigation is carried out in Python 3.4. Previous work on stop word removal is also investigated and found to be less helpful. Using the Zipfian approach, out of 500 high frequency words, 358 words are identified as stop words. It is observed that by focusing on only 0.01% of a large corpus, almost all the stop words can be spotted to create a stop word list with least manual effort. Furthermore, statistical patterns in stop words, content words, the ratio of stop words to content words in data samples, and the dependency of stop words and content words on data size are also examined. In terms of data size, frequency and ranks, Zipf's law and Heaps' law coexist in Urdu stop words. Stop words tend to follow some predictable and measurable patterns that can lead to reliable probabilistic methods for Urdu processing. This deterministic approach provides a strong research ground to explore stop words in Urdu text statistically.
Published: 2019
Full Text: View/download PDF

40. Exploiting Data Skew for Improved Query Performance

Author: Kenneth A. Ross and Wangda Zhang
Subjects: Structure (mathematical logic), FOS: Computer and information sciences, Speedup, Zipf's law, CPU cache, Computer science, Locality, Skew, Databases (cs.DB), computer.software_genre, Computer Science Applications, Computational Theory and Mathematics, Computer Science - Databases, Cache, Data mining, computer, Information Systems
Abstract: Analytic queries enable sophisticated large-scale data analysis within many commercial, scientific and medical domains today. Data skew is a ubiquitous feature of these real-world domains. In a retail database, some products are typically much more popular than others. In a text database, word frequencies follow a Zipf distribution with a small number of very common words, and a long tail of infrequent words. In a geographic database, some regions have much higher populations (and therefore data measurements) than others. Current systems do not make the most of caches for exploiting skew. In particular, a whole cache line may remain cache resident even though only a small part of the cache line corresponds to a popular data item. In this paper, we propose a novel index structure for repositioning data items to concentrate popular items into the same cache lines. The net result is better spatial locality, and better utilization of limited cache resources. We develop a theoretical model for analyzing the cache utilization, and implement database operators that are efficient in the presence of skew. Our experimental evaluation on real and synthetic data shows that exploiting skew can significantly improve in-memory query performance. In some cases, our techniques can speed up queries by over an order of magnitude.
Published: 2019

41. A Study on the Characteristics of Douyin Short Videos and Implications for Edge Caching

Author: Zhuang Chen, Qian He, Zhifei Mao, Hwei-Ming Chung, and Sabita Maharjan
Subjects: FOS: Computer and information sciences, Multimedia, Zipf's law, Computer science, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, computer.software_genre, Popularity, Multimedia (cs.MM), Systems design, Quality of experience, Enhanced Data Rates for GSM Evolution, Service improvement, computer, Computer Science - Multimedia
Abstract: Douyin, internationally known as TikTok, has become one of the most successful short-video platforms. To maintain its popularity, Douyin has to provide better Quality of Experience (QoE) to its growing user base. Understanding the characteristics of Douyin videos is thus critical to its service improvement and system design. In this paper, we present an initial study on the fundamental characteristics of Douyin videos based on a dataset of over 260 thousand short videos collected across three months. The characteristics of Douyin videos are found to be significantly different from those of traditional online videos, from video bitrate and size to popularity. In particular, the distributions of the bitrate and size of videos follow a Weibull distribution. We further observe that the most popular Douyin videos follow Zipf's law on video popularity, but the rest of the videos do not. We also investigate the correlation between popularity metrics used for Douyin videos. It is found that the correlation between the number of views and the number of likes is strong, while other correlations are relatively low. Finally, by using a case study, we demonstrate that the above findings can provide important guidance on designing an efficient edge caching system.
Published: 2019

42. Zipfian regularities in 'non-point' word representations

Author: Aykut Koc and Furkan Şahinuç
Subjects: Computer science, 02 engineering and technology, Meaning (non-linguistic), Meaning-frequency relation, Library and Information Sciences, Management Science and Operations Research, computer.software_genre, 01 natural sciences, Word variances, Word entailment, 0103 physical sciences, 0202 electrical engineering, electronic engineering, information engineering, Media Technology, Semantic breadth, Polysemy, 010306 general physics, Textual entailment, Zipf's law, business.industry, 020206 networking & telecommunications, Semantic property, Variance (accounting), Computer Science Applications, Word lists by frequency, Zipf’s law, Word frequencies, Artificial intelligence, Zipfian regularities, business, computer, Natural language processing, Word (computer architecture), Information Systems
Abstract: Being one of the most common empirical regularities, the Zipf’s law for word frequencies is a power law relation between word frequencies and frequency ranks of words. We quantitatively study semantic uncertainty of words through non-point distribution-based word embeddings and reveal the Zipfian regularities. Uncertainty of a word can increase due to polysemy, the word having “broad” meaning (such as the relation between broader emotion and narrower exasperation) or a combination of both. Variances of Gaussian embeddings are utilized to quantify the extent a word can be used in different senses or contexts. By using the variance information embedded in the non-point Gaussian embeddings, we quantitatively show that semantic breadth of words also exhibits Zipfian patterns, when polysemy is controlled. This outcome is complementary to Zipf’s law of meaning distribution and the related meaning-frequency law by indicating the existence of Zipfian patterns: more frequent words tend to be generic while less frequent ones tend to be specific. Results for two languages, English and Turkish that belong to different language families, are also provided. Such regularities provide valuable information to extract and understand relationships between semantic properties of words and word frequencies. In various applications, performance improvements can be obtained by employing these regularities. We also propose a method that leverages the Zipfian regularity to improve the performance of baseline textual entailment detection algorithms. To the best of our knowledge, our approach is the first quantitative study that uses Gaussian embeddings to examine the relationships between word frequencies and semantic breadth.
Published: 2021
Full Text: View/download PDF

43. Speech tested for Zipfian fit using rigorous statistical techniques

Author: Joseph P. Stover, Paul De Palma, Mark VanDam, Jeb Kilfoyle, and Leon Antonio Garcia-Camargo
Subjects: Zipf's law, business.industry, Rank (computer programming), General Medicine, computer.software_genre, Normal distribution, symbols.namesake, Curve fitting, symbols, Pareto distribution, Artificial intelligence, Computational linguistics, business, computer, Word (computer architecture), Statistic, Natural language processing
Abstract: Zipf’s law describes the relationship between the frequencies of words in a corpus and their rank. Its most basic form is a simple series, indicating that the frequency of a word is inversely proportional to its rank: 1/2, 1/3, 1/4, ... The past two decades have seen the emergence of usage-based and cognitive approaches to language study. A key observation of these approaches, along with the importance of frequency, is that speech differs in substantial and structural ways from writing. Yet, except for a few older analyses performed on very small corpora, most studies of Zipf’s law have been done on written corpora. Further, a judgement of Zipfianness in much of this work is based on loose and informal criteria. In fact, sophisticated statistical techniques have been developed for curve fitting in recent years in the mathematics and physics literature. These include the use of the Kolmogorov-Smirnov statistic, along with maximum likelihood estimation to generate p-values, and the use of the complementary error function for normal distributions. The latter helps determine if a corpus, failing a Zipfian fit, might be better described by another distribution. In this paper, we will: (1) show, using rigorous statistical techniques, that three corpora of recorded speech (Buckeye, Santa Barbara, MiCase) follow a power law distribution; (2) describe preliminary results showing that the techniques outlined in this paper may be useful in the diagnosis of those conditions that can include disordered speech; (3) explain how to do the analyses described in this paper; and (4) explain how to download and use the R/Python code we have written and packaged as the Zipf Tool Kit.
Published: 2021
Full Text: View/download PDF
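
As a rough illustration of the Kolmogorov-Smirnov idea mentioned in the abstract of entry 43, the following Python sketch measures the largest gap between an observed rank-frequency distribution and an ideal Zipf distribution with a fixed exponent. It is not the authors' Zipf Tool Kit: the function names `zipf_pmf` and `ks_distance_to_zipf` are hypothetical, the exponent `s` is supplied by hand rather than estimated by maximum likelihood, and no p-value is computed.

```python
from collections import Counter


def zipf_pmf(n_ranks, s=1.0):
    """Ideal Zipf probabilities for ranks 1..n_ranks: p(r) proportional to 1 / r**s."""
    weights = [1.0 / (rank ** s) for rank in range(1, n_ranks + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def ks_distance_to_zipf(tokens, s=1.0):
    """Largest absolute gap between the empirical and ideal cumulative
    distributions over frequency ranks (a Kolmogorov-Smirnov distance)."""
    counts = sorted(Counter(tokens).values(), reverse=True)  # rank 1 = most frequent
    n_tokens = sum(counts)
    model = zipf_pmf(len(counts), s)

    empirical_cdf = 0.0
    model_cdf = 0.0
    worst_gap = 0.0
    for count, prob in zip(counts, model):
        empirical_cdf += count / n_tokens
        model_cdf += prob
        worst_gap = max(worst_gap, abs(empirical_cdf - model_cdf))
    return worst_gap


# Usage: frequencies 6, 3, 2 are exactly proportional to 1, 1/2, 1/3,
# so the distance to the s = 1 model is zero up to rounding.
sample = ["a"] * 6 + ["b"] * 3 + ["c"] * 2
distance = ks_distance_to_zipf(sample)
```

A small distance alone does not establish a Zipfian fit; the abstract's point is that the statistic must be paired with a fitted exponent and a significance test.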

44. Statistical Analysis of Word Frequency Distribution in Lithuanian Texts of Different Genres

Author: Neringa Bružaitė and Tomas Rekašius
Subjects: Vocabulary, Jaccard index, Distribution (number theory), Zipf's law, business.industry, media_common.quotation_subject, Lithuanian, computer.software_genre, Measure (mathematics), language.human_language, Linguistics, Hierarchical clustering, Word lists by frequency, language, Artificial intelligence, business, computer, Natural language processing, Mathematics, media_common
Abstract: The paper examines Lithuanian texts of different authors and genres. The main points of interest are the number of words, the number of different words and word frequencies. Structural type distribution and Zipf’s law are applied for describing the frequency distribution of words in the text. It is obvious that the lexical diversity of any text can be defined by different words that are used in the text, also called vocabulary. It is shown that the information contained in a reduced vocabulary is enough for dividing the texts analyzed in this article into groups by genre and author using a hierarchical clustering method. In this case, distances between clusters are measured using the Jaccard distance measure, and clusters are aggregated using the Ward method.
Published: 2016
Full Text: View/download PDF
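
The Jaccard distance between vocabularies used in entry 44 can be sketched in Python as follows. The abstract does not say how the vocabulary is reduced, so keeping the `top_k` most frequent words is an assumption made here for illustration; the helper names are hypothetical, and the Ward aggregation step is not shown.

```python
from collections import Counter


def reduced_vocabulary(tokens, top_k):
    """Set of the top_k most frequent word-types (assumed reduction rule)."""
    return {word for word, _ in Counter(tokens).most_common(top_k)}


def jaccard_distance(vocab_a, vocab_b):
    """1 - |A intersect B| / |A union B|; defined as 0.0 when both sets are empty."""
    union = vocab_a | vocab_b
    if not union:
        return 0.0
    return 1.0 - len(vocab_a & vocab_b) / len(union)


# Usage: two toy "texts" whose reduced vocabularies share one word out of three.
text_a = "sun sun sun moon moon star".split()
text_b = "moon moon moon rain rain star".split()
distance = jaccard_distance(reduced_vocabulary(text_a, 2),
                            reduced_vocabulary(text_b, 2))
```

The resulting pairwise distances would then form the matrix that a hierarchical clustering routine takes as input.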

45. Understanding the API usage in Java

Author: Bixin Li, Hareton Leung, and Dong Qiu
Subjects: Source code, Source lines of code, Zipf's law, Java, Programming language, business.industry, Computer science, media_common.quotation_subject, 020207 software engineering, 02 engineering and technology, computer.software_genre, Computer Science Applications, Software, Java API for XML-based RPC, 020204 information systems, Abstract syntax, 0202 electrical engineering, electronic engineering, information engineering, business, computer, Scope (computer science), Information Systems, computer.programming_language, media_common
Abstract: Context: Application Programming Interfaces (APIs) facilitate the use of programming languages. They define sets of rules and specifications for software programs to interact with. The design of a language API is usually artistic, driven by aesthetic concerns and the intuitions of language architects. Despite recent studies on limited scopes of API usage, there is a lack of comprehensive, quantitative analyses that explore and seek to understand how real-world source code uses language APIs. Objective: This study aims to understand how APIs are employed in practical development and to explore their potential applications based on the results of API usage analysis. Method: We conduct a large-scale, comprehensive, empirical analysis of the actual usage of APIs in Java, a modern, mature, and widely-used programming language. Our corpus contains over 5000 open-source Java projects, totaling 150 million source lines of code (SLoC). We study the usage of both the core (official) API library and third-party (unofficial) API libraries. We resolve project dependencies automatically, generate accurate resolved abstract syntax trees (ASTs), capture used API entities from over 1.5 million ASTs, and measure the usage based on our defined metrics: frequency, popularity and coverage. Results: Our study provides detailed quantitative information and yields insight; in particular, it (1) confirms the conventional wisdom that the usage of APIs obeys a Zipf distribution; (2) demonstrates that the core API is not fully used (many classes, methods and fields have never been used); (3) discovers that deprecated API entities (some of which were deprecated long ago) are still widely used; (4) finds that current compact profiles are under-utilized; (5) identifies API library coldspots and hotspots. Conclusions: Our findings are suggestive of potential applications across language API design, optimization and restriction, API education, library recommendation and compact profile construction.
Published: 2016
Full Text: View/download PDF

46. Building(s and) Cities: Delineating Urban Areas with a Machine Learning Algorithm

Author: Miquel-Àngel Garcia-López, Daniel Arribas-Bel, and Elisabet Viladecans-Marsal
Subjects: education.field_of_study, Zipf's law, Exploit, business.industry, Population, Distribution (economics), Space (commercial competition), Machine learning, computer.software_genre, Geolocation, Geography, Robustness (computer science), Artificial intelligence, Dimension (data warehouse), education, business, computer, Algorithm
Abstract: This paper proposes a novel methodology for delineating urban areas based on a machine learning algorithm that groups buildings within portions of space of sufficient density. To do so, we use the precise geolocation of all 12 million buildings in Spain. We exploit building heights to create a new dimension for urban areas, namely, the vertical land, which provides a more accurate measure of their size. To better understand their internal structure and to illustrate an additional use for our algorithm, we also identify employment centers within the delineated urban areas. We test the robustness of our method and compare our urban areas to other delineations obtained using administrative borders and commuting-based patterns. We show that: 1) our urban areas are more similar to the commuting-based delineations than the administrative boundaries but that they are more precisely measured; 2) when analyzing the urban areas' size distribution, Zipf's law appears to hold for their population, surface and vertical land; and 3) the impact of transportation improvements on the size of the urban areas is not underestimated.
Published: 2019
Full Text: View/download PDF

47. SEMANTIC ANALYSIS AS AN EXPRESS METHOD OF EVALUATING RISK REPRESENTATIONS IN PROFESSIONAL ACTIVITY

Author: Aleksandr V. Bulgakov
Subjects: Cultural Studies, business.industry, psychologists, Semantic analysis (machine learning), Religious studies, lcsh:Political science, referent, perceptions of risk, computer.software_genre, Professional activity, semantic analysis, Zipf’s law, psychological diagnosis, Artificial intelligence, business, Psychology, computer, lcsh:J, Natural language processing
Abstract: The possibilities of the semantic analysis technique developed in linguistics and used in working with texts for the tasks of psychological study of the professional activities of specialists are substantiated. On the example of the study of risk perceptions in the work of police psychologists and practicing psychologists and teachers of the university, the advantages and limitations of the methodology are shown. The similarity of the structures of perceptions of the risk of professional activity among psychologists has been ascertained: emotional burnout and professional deformation, the use of psychological assistance methods, fear of error, legal liability, proper interaction with the client, relationships with colleagues. Double differences in the levels of professional risk reflection, the banality of risk ideas, and the harmony of the structure of representations (“Zipf’s Law”) are revealed. All indicators are optimal for police psychologists. It is concluded that it is advisable to use the methods of semantic analysis as an express risk diagnosis in professional activities.
Published: 2019
Full Text: View/download PDF

48. Combination of Natural Laws (Benford’s Law and Zipf’s Law) for Fake News Detection

Author: Aamo Iorliam
Subjects: Zipf's law, Natural law, Computer science, InformationSystems_INFORMATIONSTORAGEANDRETRIEVAL, ComputerSystemsOrganization_COMPUTER-COMMUNICATIONNETWORKS, ComputingMilieux_LEGALASPECTSOFCOMPUTING, Computer security, computer.software_genre, GeneralLiterature_MISCELLANEOUS, Benford's law, News analytics, Fake news, InformationSystems_MISCELLANEOUS, computer
Abstract: With the recent increase in character assassination and fake news in Nigeria, we combine Zipf’s law and Benford’s law to analyse and detect fake news. The problem of fake news has become one of the most prominent issues in Nigeria recently. In this chapter, the challenges fake news poses to Nigeria are briefly presented. Due to these challenges, we propose the combination of Benford’s law and Zipf’s law in news analysis such that the hybrid of the two laws will obey the Power law for real news and deviate for fake news. We carried out various tests on different real news sources and the results show that real news obeys the Power law. We, therefore, propose that fake news should not obey the Power law, even though we could not test on fake news sources because of the lack of a verified fake news dataset.
Published: 2019
Full Text: View/download PDF

49. On the Application of Agglomerative Hierarchical Clustering for Cache-Assisted D2D Networks

Author: Yi Yin, Abbas Jamalipour, and Komal S. Khan
Subjects: Hardware_MEMORYSTRUCTURES, Zipf's law, Wireless network, Computer science, Quality of service, 020206 networking & telecommunications, 020302 automobile design & engineering, 02 engineering and technology, computer.software_genre, Telecommunications network, Base station, 0203 mechanical engineering, Similarity (network science), 0202 electrical engineering, electronic engineering, information engineering, Data mining, Cache, Cluster analysis, computer
Abstract: With the rapid increase in the use of traffic-intensive applications, approaches that improve users' experience by reducing delay have recently received enormous attention. Caching in Device-to-Device (D2D) networks, in particular, is considered an effective technique to improve the service quality of the network. In this paper, an agglomerative hierarchical clustering algorithm is proposed for a cache-assisted D2D communication network. The algorithm considers users' preferences and groups users into the same cluster based on the similarity of their requested content. An optimal caching strategy has been applied and the cache hit probability has further been optimized within each cluster. Performance of the algorithm has been examined in different clusters, considering both sparse and dense user environments. Simulation results show that the cache hit probability within each cluster is higher for a higher Zipf parameter, denser domains, and a larger number of participating devices. The D2D cache hit probability has also been examined with a changing number of clusters under a base station. In this scenario, the results show that the clustering-based D2D cache hit probability is higher than in the non-clustered case, and the cache hit probability increases with an increasing number of clusters.
Published: 2019
Full Text: View/download PDF

50. Estimating the Total Volume of Queries to Google

Author: Salvatore Ruggieri, Fabrizio Lillo, Ling Liu, and Ryen White
Subjects: Zipf's law, Computer science, Google Trends, Sample (statistics), 02 engineering and technology, Crawling, computer.software_genre, 01 natural sciences, Domain (software engineering), 010104 statistics & probability, Search engine, 020204 information systems, Search engine query, Zipf’s law, 0202 electrical engineering, electronic engineering, information engineering, Volume estimation, Data mining, 0101 mathematics, computer, Volume (compression)
Abstract: We study the problem of estimating the total volume of queries of a specific domain, which were submitted to the Google search engine in a given time period. Our statistical model assumes a Zipf’s law distribution of the population in the reference domain, and a non-uniform or noisy sampling of queries. Parameters of the distribution are estimated using nonlinear least squares regression. Estimates with errors are then derived for the total number of queries and for the total number of searches (volume). We apply the method to the recipes and cooking domain, where a sample of queries is collected by crawling popular Italian websites specialized in this domain. The relative volumes of queries in the sample are computed using Google Trends, and transformed to absolute frequencies after estimating a scaling factor. Our model estimates that the volume of Italian recipes and cooking queries submitted to Google in 2017 and with at least 10 monthly searches consists of 7.2B searches.
Published: 2019
