Characterizing E-Science File Access Behavior via Latent Dirichlet Allocation

Authors :
Cecile Germain-Renaud
Yusik Kim
Machine Learning and Optimisation (TAO)
Laboratoire de Recherche en Informatique (LRI)
Université Paris-Sud - Paris 11 (UP11)
CentraleSupélec
Centre National de la Recherche Scientifique (CNRS)
Inria Saclay - Ile de France
Institut National de Recherche en Informatique et en Automatique (Inria)
Germain, Cecile
Source :
4th IEEE International Conference on Utility and Cloud Computing (UCC 2011), Dec 2011, Melbourne, Australia
Publication Year :
2011
Publisher :
HAL CCSD, 2011.

Abstract

E-science is moving from grids to clouds. Getting the best of both worlds requires building on the experience gained from several years of steady production-grid operation. Through the Grid Observatory initiative, trace data are publicly available to the computer science and engineering community and can be used for dimensioning and optimizing infrastructure. This paper proposes a new approach to analyzing behavioral traces: because most of them are in fact text documents, state-of-the-art text mining techniques, specifically Latent Dirichlet Allocation, can be applied to them. The advantages are twofold: the approach provides some level of explanation inferred from the data, and it offers a relatively scalable way to capture the temporal variability of the behavior of interest while retaining the full dimensionality of the problem at hand. We test this text mining analogy by characterizing file access behavior. We validate the resulting probabilistic model by showing that it can generate synthetic traces that are statistically consistent with the real ones.
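
As an illustration only, the following Python sketch shows the standard Latent Dirichlet Allocation generative process in a toy file-access setting. It is not the authors' implementation. Treating each time window as a "document" and each file identifier as a "word" is an assumption made here, and every name and parameter value (`generate_trace`, `topic_file_probs`, `alpha`) is hypothetical.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_categorical(probs, rng):
    """Draw an index with probability proportional to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_trace(n_windows, accesses_per_window, topic_file_probs, alpha, rng):
    """Generate a synthetic trace: one list of file ids per time window.

    Assumption: a time window plays the role of an LDA document and a
    file id plays the role of a word.
    """
    trace = []
    for _ in range(n_windows):
        # Per-window mixture over latent access patterns ("topics").
        theta = sample_dirichlet(alpha, rng)
        window = []
        for _ in range(accesses_per_window):
            topic = sample_categorical(theta, rng)
            file_id = sample_categorical(topic_file_probs[topic], rng)
            window.append(file_id)
        trace.append(window)
    return trace


# Hypothetical toy model: 2 latent access patterns over 5 files.
topic_file_probs = [
    [0.7, 0.2, 0.1, 0.0, 0.0],  # pattern 0 favors files 0-2
    [0.0, 0.0, 0.1, 0.3, 0.6],  # pattern 1 favors files 2-4
]
alpha = [0.5, 0.5]  # values below 1 favor windows dominated by one pattern

rng = random.Random(0)
trace = generate_trace(4, 10, topic_file_probs, alpha, rng)
for window in trace:
    print(window)
```

In a real study the topic distributions would be fitted to observed traces rather than written by hand, and the synthetic output would then be compared statistically with the real traces, as the abstract describes.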

Details

Language :
English
Database :
OpenAIRE
Journal :
4th IEEE International Conference on Utility and Cloud Computing (UCC 2011), Dec 2011, Melbourne, Australia
Accession number :
edsair.doi.dedup.....81292b940499b443fc08812427757656