50 results for '"H.2.4"'
Search Results
2. TSEXPLAIN: Explaining Aggregated Time Series by Surfacing Evolving Contributors
- Author
- Chen, Yiru and Huang, Silu
- Subjects
- Computer Science - Databases, H.2.4, H.2.8, G.3
- Abstract
Aggregated time series are generated effortlessly everywhere, e.g., "total confirmed covid-19 cases since 2019" and "total liquor sales over time." Understanding "how" and "why" these key performance indicators (KPI) evolve over time is critical to making data-informed decisions. Existing explanation engines focus on explaining one aggregated value or the difference between two relations. However, this falls short of explaining KPIs' continuous changes over time. Motivated by this, we propose TSEXPLAIN, a system that explains aggregated time series by surfacing the underlying evolving top contributors. Under the hood, we leverage prior works on two-relations diff as a building block and formulate a K-Segmentation problem to segment the time series such that each segment after segmentation shares consistent explanations, i.e., contributors. To quantify consistency in each segment, we propose a novel within-segment variance design that is explanation-aware; to derive the optimal K-Segmentation scheme, we develop an efficient dynamic programming algorithm. Experiments on synthetic and real-world datasets show that our explanation-aware segmentation can effectively identify evolving explanations for aggregated time series and outperform explanation-agnostic segmentation. Further, we propose an optimal strategy for selecting K and several optimizations to speed up TSEXPLAIN for an interactive user experience, achieving up to a 13X efficiency improvement., Comment: 17 pages; Accepted by ICDE 2023
- Published
- 2022
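The K-Segmentation formulation above admits the textbook dynamic program over segment boundaries. A minimal sketch, with a caller-supplied segment_cost standing in for the paper's explanation-aware within-segment variance (the function names and the plain-variance placeholder are illustrative, not taken from TSEXPLAIN):

```python
def k_segmentation(series, k, segment_cost):
    """Split series into k contiguous segments minimizing total segment cost.

    Classic O(k * n^2) dynamic program (times the cost of segment_cost);
    TSEXPLAIN plugs its explanation-aware variance into segment_cost.
    """
    n = len(series)
    INF = float("inf")
    # dp[j][i]: best cost of splitting the first i points into j segments.
    dp = [[INF] * (n + 1) for _ in range(k + 1)]
    cut = [[0] * (n + 1) for _ in range(k + 1)]
    dp[0][0] = 0.0
    for j in range(1, k + 1):
        for i in range(j, n + 1):
            for s in range(j - 1, i):  # last segment is series[s:i]
                c = dp[j - 1][s] + segment_cost(series[s:i])
                if c < dp[j][i]:
                    dp[j][i], cut[j][i] = c, s
    bounds, i = [], n
    for j in range(k, 0, -1):  # walk the cut table back to recover segments
        bounds.append((cut[j][i], i))
        i = cut[j][i]
    return dp[k][n], bounds[::-1]

def within_variance(seg):  # placeholder, explanation-agnostic cost
    m = sum(seg) / len(seg)
    return sum((x - m) ** 2 for x in seg)

print(k_segmentation([1, 1, 2, 9, 9, 8, 3, 3], 3, within_variance))
# -> (1.333..., [(0, 3), (3, 6), (6, 8)])
```

The triple loop costs O(K n^2) evaluations of segment_cost; the paper's speedups and its strategy for choosing K are beyond this sketch.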
3. Analyzing Partitioned FAIR Health Data Responsibly
- Author
- Sun, Chang, Ippel, Lianne, Wouters, Birgit, van Soest, Johan, Malic, Alexander, Adekunle, Onaopepo, Berg, Bob van den, Puts, Marco, Mussmann, Ole, Koster, Annemarie, van der Kallen, Carla, Townend, David, Dekker, Andre, and Dumontier, Michel
- Subjects
- Computer Science - Computers and Society, E.1, E.3, H.2.4, H.2.8
- Abstract
It is widely anticipated that the use of health-related big data will enable further understanding and improvements in human health and wellbeing. Our current project, funded through the Dutch National Research Agenda, aims to explore the relationship between the development of diabetes and socio-economic factors such as lifestyle and health care utilization. The analysis involves combining data from the Maastricht Study (DMS), a prospective clinical study, and data collected by Statistics Netherlands (CBS) as part of its routine operations. However, a wide array of social, legal, technical, and scientific issues hinder the analysis. In this paper, we describe these challenges and our progress towards addressing them., Comment: 6 pages, 1 figure, preliminary result, project report
- Published
- 2018
4. Time Series Management Systems: A Survey
- Author
- Jensen, Søren Kejser, Pedersen, Torben Bach, and Thomsen, Christian
- Subjects
- Computer Science - Databases, G.1.2, D.2.11, E.4, E.2, E.1, H.2, H.2.4, C.2.4, H.2.8, G.3
- Abstract
The collection of time series data increases as more monitoring and automation are being deployed. These deployments range in scale from an Internet of Things (IoT) device located in a household to enormous distributed Cyber-Physical Systems (CPSs) producing large volumes of data at high velocity. To store and analyze these vast amounts of data, specialized Time Series Management Systems (TSMSs) have been developed to overcome the limitations of general-purpose Database Management Systems (DBMSs) for time series management. In this paper, we present a thorough analysis and classification of TSMSs developed through academic or industrial research and documented through publications. Our classification is organized into categories based on the architectures observed during our analysis. In addition, we provide an overview of each system with a focus on the motivational use case that drove its development, the functionality it implements for storing and querying time series, the components it is composed of, and its capabilities with regard to Stream Processing and Approximate Query Processing (AQP). Last, we provide a summary of research directions proposed by other researchers in the field and present our vision for a next-generation TSMS., Comment: 20 Pages, 15 Figures, 2 Tables, Accepted for publication in IEEE TKDE
- Published
- 2017
5. Accelerated Nearest Neighbor Search with Quick ADC
- Author
- André, Fabien, Kermarrec, Anne-Marie, and Scouarnec, Nicolas Le
- Subjects
- Computer Science - Computer Vision and Pattern Recognition, Computer Science - Databases, Computer Science - Information Retrieval, Computer Science - Multimedia, Computer Science - Performance, H.5.1, H.2.4, H.2.8
- Abstract
Efficient Nearest Neighbor (NN) search in high-dimensional spaces is a foundation of many multimedia retrieval systems. Because it offers low response times, Product Quantization (PQ) is a popular solution. PQ compresses high-dimensional vectors into short codes using several sub-quantizers, which enables in-RAM storage of large databases. This allows fast answers to NN queries, without accessing the SSD or HDD. The key feature of PQ is that it can compute distances between short codes and high-dimensional vectors using cache-resident lookup tables. The efficiency of this technique, named Asymmetric Distance Computation (ADC), remains limited because it performs many cache accesses. In this paper, we introduce Quick ADC, a novel technique that achieves a 3 to 6 times speedup over ADC by exploiting Single Instruction Multiple Data (SIMD) units available in current CPUs. Efficiently exploiting SIMD requires algorithmic changes to the ADC procedure. Namely, Quick ADC relies on two key modifications of ADC: (i) the use of 4-bit sub-quantizers instead of the standard 8-bit sub-quantizers and (ii) the quantization of floating-point distances. This allows Quick ADC to exceed the performance of state-of-the-art systems, e.g., it achieves a Recall@100 of 0.94 in 3.4 ms on 1 billion SIFT descriptors (128-bit codes)., Comment: 8 pages, 5 figures, published in Proceedings of ICMR'17, Bucharest, Romania, June 06-09, 2017
- Published
- 2017
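The cache-resident lookup-table step that ADC relies on is easy to reproduce without the SIMD layer. A minimal NumPy sketch of plain 8-bit product quantization and asymmetric distances (the randomly generated codebooks stand in for trained sub-quantizers; Quick ADC's 4-bit SIMD refinement is not shown):

```python
import numpy as np

rng = np.random.default_rng(0)
M, K, D = 8, 256, 128        # sub-quantizers, centroids each, vector dims
d = D // M                   # dimensions handled by each sub-quantizer

codebooks = rng.normal(size=(M, K, d)).astype(np.float32)  # "trained" centroids

def encode(x):
    """Compress a D-dim vector into M one-byte codes (nearest centroid per sub-space)."""
    return np.array([np.argmin(((codebooks[m] - x[m * d:(m + 1) * d]) ** 2).sum(1))
                     for m in range(M)], dtype=np.uint8)

def adc_tables(q):
    """Per-query tables: distance from each query sub-vector to all K centroids."""
    return np.stack([((codebooks[m] - q[m * d:(m + 1) * d]) ** 2).sum(1)
                     for m in range(M)])

def adc_distance(tables, codes):
    """Asymmetric distance: M table lookups plus a sum (the hot loop ADC speeds up)."""
    return tables[np.arange(M), codes].sum()

base = rng.normal(size=(1000, D)).astype(np.float32)
db = np.stack([encode(x) for x in base])
q = rng.normal(size=D).astype(np.float32)
t = adc_tables(q)
print(min(range(len(db)), key=lambda i: adc_distance(t, db[i])))  # approx NN id
```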
6. Constellation Queries over Big Data
- Author
- Porto, Fabio, Khatibi, Amir, Nobre, João R., Ogasawara, Eduardo, Valduriez, Patrick, and Shasha, Dennis
- Subjects
- Computer Science - Databases, H.2.4, H.2.8, H.3.1
- Abstract
A geometrical pattern is a set of points with all pairwise distances (or, more generally, relative distances) specified. Finding matches to such patterns has applications to spatial data in seismic, astronomical, and transportation contexts. For example, a particularly interesting geometric pattern in astronomy is the Einstein cross, which is an astronomical phenomenon in which a single quasar is observed as four distinct sky objects (due to gravitational lensing) when captured by Earth-based telescopes. Finding such crosses, as well as other geometric patterns, is a challenging problem as the potential number of sets of elements that compose shapes is exponentially large in the size of the dataset and the pattern. In this paper, we denote geometric patterns as constellation queries and propose algorithms to find them in large data applications. Our methods combine quadtrees, matrix multiplication, and unindexed join processing to discover sets of points that match a geometric pattern within some additive factor on the pairwise distances. Our distributed experiments show that the choice of composition algorithm (matrix multiplication or nested loops) depends on the freedom introduced in the query geometry through the distance additive factor. Three clearly identified blocks of threshold values guide the choice of the best composition algorithm. Finally, solving the problem for relative distances requires a novel continuous-to-discrete transformation. To the best of our knowledge, this paper is the first to investigate constellation queries at scale.
- Published
- 2017
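The matching predicate at the heart of a constellation query is compact to state. A brute-force sketch of the verification step only (the quadtree and matrix-multiplication machinery that makes candidate generation scale is deliberately omitted; names and data are illustrative):

```python
from itertools import combinations, permutations
from math import dist

def matches(candidate, pattern, eps):
    """True if some ordering of candidate reproduces the pattern's pairwise
    distances within the additive factor eps (translation/rotation invariant)."""
    n = len(pattern)
    target = {(i, j): dist(pattern[i], pattern[j])
              for i, j in combinations(range(n), 2)}
    return any(all(abs(dist(perm[i], perm[j]) - target[i, j]) <= eps
                   for i, j in combinations(range(n), 2))
               for perm in permutations(candidate))

square = [(0, 0), (0, 1), (1, 0), (1, 1)]                       # query pattern
found = [(10.0, 10.0), (10.02, 11.0), (11.0, 10.01), (11.0, 11.0)]
print(matches(found, square, eps=0.1))  # True: a slightly perturbed square
```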
7. NetClus: A Scalable Framework for Locating Top-K Sites for Placement of Trajectory-Aware Services
- Author
- Mitra, Shubhadip, Saraf, Priya, Sharma, Richa, Bhattacharya, Arnab, Bhandari, Harsh, and Ranu, Sayan
- Subjects
- Computer Science - Databases, H.2.8, H.2.4
- Abstract
Facility location queries identify the best locations to set up new facilities for providing service to their users. The majority of existing works in this space assume that the user locations are static. Such limitations are too restrictive for planning many modern real-life services such as fuel stations, ATMs, convenience stores, cellphone base-stations, etc. that are widely accessed by mobile users. The placement of such services should, therefore, factor in the mobility patterns or trajectories of the users rather than simply their static locations. In this work, we introduce the TOPS (Trajectory-Aware Optimal Placement of Services) query that locates the best k sites on a road network. The aim is to optimize a wide class of objective functions defined over the user trajectories. We show that the problem is NP-hard and even the greedy heuristic with an approximation bound of (1-1/e) fails to scale on urban-scale datasets. To overcome this challenge, we develop a multi-resolution clustering-based indexing framework called NetClus. Empirical studies on real road network trajectory datasets show that NetClus offers solutions that are comparable in terms of quality with those of the greedy heuristic, while having practical response times and low memory footprints. Additionally, the NetClus framework can absorb dynamic updates in mobility patterns, handle constraints such as site costs, capacity, and existing services, thereby providing an effective solution for modern urban-scale scenarios., Comment: ICDE 2017 poster
- Published
- 2017
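The greedy heuristic the paper uses as its quality baseline is the standard one for coverage-style objectives: pick, k times, the site that covers the most not-yet-covered trajectories. A minimal sketch (the site-to-trajectory encoding is illustrative; NetClus itself is a clustering index layered on top of this idea):

```python
def greedy_placement(site_coverage, k):
    """site_coverage: site -> set of trajectory ids it can serve.
    Greedy selection achieves the classic (1 - 1/e) approximation bound
    for monotone submodular objectives, but touches every site k times."""
    chosen, covered = [], set()
    for _ in range(k):
        best = max(site_coverage, key=lambda s: len(site_coverage[s] - covered))
        chosen.append(best)
        covered |= site_coverage[best]
    return chosen, covered

sites = {"s1": {1, 2, 3}, "s2": {3, 4}, "s3": {4, 5, 6}, "s4": {1, 6}}
print(greedy_placement(sites, k=2))  # (['s1', 's3'], {1, 2, 3, 4, 5, 6})
```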
8. Distributed Publish/Subscribe Query Processing on the Spatio-Textual Data Stream
- Author
- Chen, Zhida, Cong, Gao, Zhang, Zhenjie, Fu, Tom Z. J., and Chen, Lisi
- Subjects
- Computer Science - Databases, C.1.2, C.2.4, H.2.4, H.2.8
- Abstract
A huge amount of data with both spatial and textual information, e.g., geo-tagged tweets, is flooding the Internet. Such a spatio-textual data stream contains valuable information for millions of users with various interests in different keywords and locations. Publish/subscribe systems enable efficient and effective information distribution by allowing users to register continuous queries with both spatial and textual constraints. However, the explosive growth of data scale and user base has posed challenges to the existing centralized publish/subscribe systems for spatio-textual data streams. In this paper, we propose our distributed publish/subscribe system, called PS2Stream, which digests a massive spatio-textual data stream and directs the stream to target users with registered interests. Compared with existing systems, PS2Stream achieves a better workload distribution in terms of both minimizing the total amount of workload and balancing the load of workers. To achieve this, we propose a new workload distribution algorithm considering both space and text properties of the data. Additionally, PS2Stream supports dynamic load adjustments to adapt to changes in the workload, which makes PS2Stream adaptive. Extensive empirical evaluation, on a commercial cloud computing platform with real data, validates the superiority of our system design and the advantages of our techniques on system performance improvement., Comment: 13 pages, 16 figures, this paper has been accepted by ICDE2017
- Published
- 2016
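A single subscription check combines a spatial predicate with a textual one; the paper's contribution is distributing millions of such checks with balanced workload, but the per-message test itself fits in a few lines. A toy sketch (the rectangle-plus-keywords encoding is illustrative):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Subscription:
    min_x: float
    min_y: float
    max_x: float
    max_y: float
    keywords: frozenset  # all must appear in the message

def delivers(sub, x, y, terms):
    """Spatial containment AND textual containment."""
    inside = sub.min_x <= x <= sub.max_x and sub.min_y <= y <= sub.max_y
    return inside and sub.keywords <= terms

sub = Subscription(0.0, 0.0, 10.0, 10.0, frozenset({"coffee", "deal"}))
print(delivers(sub, 3.5, 7.2, {"coffee", "deal", "friday"}))  # True
print(delivers(sub, 3.5, 7.2, {"coffee"}))                    # False
```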
9. Show me the material evidence: Initial experiments on evaluating hypotheses from user-generated multimedia data
- Author
- Gonçalves, Bernardo
- Subjects
- Computer Science - Artificial Intelligence, Computer Science - Databases, Computer Science - Multimedia, I.2.6, I.2.7, H.1.2, H.2.4, H.2.8
- Abstract
Subjective questions such as 'does neymar dive', 'is clinton lying', or 'is trump a fascist', are popular queries to web search engines, as can be seen by autocompletion suggestions on Google, Yahoo and Bing. In the era of cognitive computing, beyond search, they could be handled as hypotheses issued for evaluation. Our vision is to leverage the unstructured data and metadata of the rich user-generated multimedia that is often shared as material evidence for or against hypotheses in social media platforms. In this paper we present two preliminary experiments along those lines and discuss challenges for a cognitive computing system that collects material evidence from user-generated multimedia towards aggregating it into some form of collective decision on the hypothesis., Comment: 6 pages, 6 figures, 3 tables in Proc. of the 1st Workshop on Multimedia Support for Decision-Making Processes, at IEEE Intl. Symposium on Multimedia (ISM'16), San Jose, CA, 2016
- Published
- 2016
10. Web Data Knowledge Extraction
- Author
- Tirado, Juan M., Serban, Ovidiu, Guo, Qiang, and Yoneki, Eiko
- Subjects
- Computer Science - Databases, Computer Science - Information Retrieval, 68U04, I.7, H.2, H.2.4, D.2.11, D.2.12, C.1.1, H.2.8, H.3, H.3.1, H.5.4
- Abstract
A constantly growing amount of information is available through the web. Unfortunately, extracting useful content from this massive amount of data remains an open issue. The lack of standard data models and structures forces developers to create ad hoc solutions from scratch. An expert is still needed in many situations where developers lack the correct background knowledge, forcing them to spend time acquiring it from the expert. In other directions, there are promising solutions employing machine learning techniques. However, increasing accuracy requires an increase in system complexity that cannot be endured in many projects. In this work, we approach the web knowledge extraction problem using an expert-centric methodology. This methodology defines a set of configurable, extendible and independent components that permit the reutilisation of large pieces of code among projects. Our methodology differs from similar solutions in its expert-driven design. This design makes it possible for a subject-matter expert to drive the knowledge extraction for a given set of documents. Additionally, we propose the utilisation of machine-assisted solutions that guide the expert during this process. To demonstrate the capabilities of our methodology, we present a real use case scenario in which public procurement data is extracted from the web-based repositories of several public institutions across Europe. We provide insightful details about the challenges we had to deal with in this use case and additional discussion of how to apply our methodology.
- Published
- 2016
11. Finding Desirable Objects under Group Categorical Preferences
- Author
- Bikakis, Nikos, Benouaret, Karim, and Sacharidis, Dimitris
- Subjects
- Computer Science - Databases, Computer Science - Data Structures and Algorithms, 97R50, 68P05, 68P15, E.1, H.2.8, H.3.1, I.3.5, H.2.4
- Abstract
Considering a group of users, each specifying individual preferences over categorical attributes, the problem of determining a set of objects that are objectively preferable to all users is challenging on two levels. First, we need to determine the preferable objects based on the categorical preferences for each user, and, second, we need to reconcile possible conflicts among users' preferences. A naive solution would first assign degrees of match between each user and each object, by taking into account all categorical attributes, and then for each object combine these matching degrees across users to compute the total score of an object. Such an approach, however, performs two series of aggregation, among categorical attributes and then across users, which completely obscure and blur individual preferences. Our solution, instead of combining individual matching degrees, is to directly operate on categorical attributes, and define an objective Pareto-based aggregation for group preferences. Building on our interpretation, we tackle two distinct but relevant problems: finding the Pareto-optimal objects, and objectively ranking objects with respect to the group preferences. To increase the efficiency when dealing with categorical attributes, we introduce an elegant transformation of categorical attribute values into numerical values, which exhibits certain nice properties and allows us to use well-known index structures to accelerate the solutions to the two problems. In fact, experiments on real and synthetic data show that our index-based techniques are an order of magnitude faster than baseline approaches, scaling up to millions of objects and thousands of users., Comment: To appear in Knowledge and Information Systems Journal (KAIS), Springer 2015
- Published
- 2015
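Once each object carries a vector of per-user matching degrees, the Pareto-optimal set is a skyline computation over those vectors. A minimal nested-loops sketch (the degree vectors are assumed given; the paper's index structures are what make this scale):

```python
def dominates(a, b):
    """a dominates b: at least as good for every user, strictly better for one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_optimal(objects):
    """objects: name -> tuple of matching degrees, one entry per user."""
    return [o for o, v in objects.items()
            if not any(dominates(w, v) for p, w in objects.items() if p != o)]

objs = {"a": (0.9, 0.2), "b": (0.5, 0.8), "c": (0.4, 0.7), "d": (0.9, 0.1)}
print(pareto_optimal(objs))  # ['a', 'b']; 'c' and 'd' are dominated
```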
12. FactorBase: SQL for Learning A Multi-Relational Graphical Model
- Author
- Schulte, Oliver and Qian, Zhensong
- Subjects
- Computer Science - Databases, Computer Science - Learning, H.2.8, H.2.4
- Abstract
We describe FactorBase, a new SQL-based framework that leverages a relational database management system to support multi-relational model discovery. A multi-relational statistical model provides an integrated analysis of the heterogeneous and interdependent data resources in the database. We adopt the BayesStore design philosophy: statistical models are stored and managed as first-class citizens inside a database. Whereas previous systems like BayesStore support multi-relational inference, FactorBase supports multi-relational learning. A case study on six benchmark databases evaluates how our system supports a challenging machine learning application, namely learning a first-order Bayesian network model for an entire database. Model learning in this setting has to examine a large number of potential statistical associations across data tables. Our implementation shows how the SQL constructs in FactorBase facilitate the fast, modular, and reliable development of highly scalable model learning systems., Comment: 14 pages, 10 figures, 10 tables, Published on 2015 IEEE International Conference on Data Science and Advanced Analytics (IEEE DSAA'2015), Oct 19-21, 2015, Paris, France
- Published
- 2015
13. SQL for SRL: Structure Learning Inside a Database System
- Author
- Schulte, Oliver and Qian, Zhensong
- Subjects
- Computer Science - Learning, Computer Science - Databases, H.2.8, H.2.4
- Abstract
The position we advocate in this paper is that relational algebra can provide a unified language for both representing and computing with statistical-relational objects, much as linear algebra does for traditional single-table machine learning. Relational algebra is implemented in the Structured Query Language (SQL), which is the basis of relational database management systems. To support our position, we have developed the FACTORBASE system, which uses SQL as a high-level scripting language for statistical-relational learning of a graphical model structure. The design philosophy of FACTORBASE is to manage statistical models as first-class citizens inside a database. Our implementation shows how our SQL constructs in FACTORBASE facilitate fast, modular, and reliable program development. Empirical evidence from six benchmark databases indicates that leveraging database system capabilities achieves scalable model structure learning., Comment: 3 pages, 1 figure, Position Paper of the Fifth International Workshop on Statistical Relational AI at UAI 2015
- Published
- 2015
14. Declarative Statistical Modeling with Datalog
- Author
- Barany, Vince, Cate, Balder ten, Kimelfeld, Benny, Olteanu, Dan, and Vagena, Zografoula
- Subjects
- Computer Science - Databases, Computer Science - Artificial Intelligence, Computer Science - Programming Languages, F.1.2, G.3, H.2.3, H.2.4, H.2.8, I.2.3
- Abstract
Formalisms for specifying statistical models, such as probabilistic-programming languages, typically consist of two components: a specification of a stochastic process (the prior), and a specification of observations that restrict the probability space to a conditional subspace (the posterior). Use cases of such formalisms include the development of algorithms in machine learning and artificial intelligence. We propose and investigate a declarative framework for specifying statistical models on top of a database, through an appropriate extension of Datalog. By virtue of extending Datalog, our framework offers a natural integration with the database, and has a robust declarative semantics. Our Datalog extension provides convenient mechanisms to include numerical probability functions; in particular, conclusions of rules may contain values drawn from such functions. The semantics of a program is a probability distribution over the possible outcomes of the input database with respect to the program; these outcomes are minimal solutions with respect to a related program with existentially quantified variables in conclusions. Observations are naturally incorporated by means of integrity constraints over the extensional and intensional relations. We focus on programs that use discrete numerical distributions, but even then the space of possible outcomes may be uncountable (as a solution can be infinite). We define a probability measure over possible outcomes by applying the known concept of cylinder sets to a probabilistic chase procedure. We show that the resulting semantics is robust under different chases. We also identify conditions guaranteeing that all possible outcomes are finite (and then the probability space is discrete). We argue that the framework we propose retains the purely declarative nature of Datalog, and allows for natural specifications of statistical models., Comment: 14 pages, 4 figures
- Published
- 2014
15. Computing Multi-Relational Sufficient Statistics for Large Databases
- Author
- Qian, Zhensong, Schulte, Oliver, and Sun, Yan
- Subjects
- Computer Science - Learning, Computer Science - Databases, H.2.8, H.2.4
- Abstract
Databases contain information about which relationships do and do not hold among entities. To make this information accessible for statistical analysis requires computing sufficient statistics that combine information from different database tables. Such statistics may involve any number of positive and negative relationships. With a naive enumeration approach, computing sufficient statistics for negative relationships is feasible only for small databases. We solve this problem with a new dynamic programming algorithm that performs a virtual join, where the requisite counts are computed without materializing join tables. Contingency table algebra is a new extension of relational algebra that facilitates the efficient implementation of this Möbius virtual join operation. The Möbius Join scales to large datasets (over 1M tuples) with complex schemas. Empirical evaluation with seven benchmark datasets showed that information about the presence and absence of links can be exploited in feature selection, association rule mining, and Bayesian network learning., Comment: 11 pages, 8 figures, 8 tables, CIKM'14, November 3-7, 2014, Shanghai, China
- Published
- 2014
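The gain from the virtual join comes from a Möbius/inclusion-exclusion identity: counts over absent links follow from counts over present links and the sizes of the entity tables, so the complement never has to be materialized. A toy sketch for a single relationship (table contents are illustrative; the paper generalizes this to full contingency tables via dynamic programming):

```python
from itertools import product

users = ["u1", "u2", "u3"]
movies = ["m1", "m2"]
rated = {("u1", "m1"), ("u2", "m1"), ("u2", "m2")}  # only positive tuples stored

# Naive: materialize the full cross product and filter (infeasible at scale).
naive = sum((u, m) not in rated for u, m in product(users, movies))

# Virtual join: count absent links without ever enumerating them.
virtual = len(users) * len(movies) - len(rated)

print(naive, virtual)  # 3 3
```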
16. Differential privacy for counting queries: can Bayes estimation help uncover the true value?
- Author
- Naldi, Maurizio and D'Acquisto, Giuseppe
- Subjects
- Computer Science - Databases, Computer Science - Cryptography and Security, H.2.8, H.2.4, K.4.1
- Abstract
Differential privacy is achieved by the introduction of Laplacian noise in the response to a query, establishing a precise trade-off between the level of differential privacy and the accuracy of the database response (via the amount of noise introduced). Multiple queries may improve the accuracy but erode the privacy budget. We examine the case where we submit just a single counting query. We show that even in that case a Bayesian approach may be used to improve the accuracy for the same amount of noise injected, if we know the size of the database and the probability of a positive response to the query.
- Published
- 2014
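The setting is concrete enough for a worked example: with the database size n known and a prior on the true count k, Bayes' rule can be applied to the Laplace-noised release. A small numeric sketch (the Binomial(n, p) prior and all parameter values are assumptions for illustration, not taken from the paper):

```python
import math, random

def laplace_pdf(x, scale):
    return math.exp(-abs(x) / scale) / (2 * scale)

def bayes_estimate(y, n, p, eps):
    """Posterior mean of the true count k given the noisy release y.
    Prior: k ~ Binomial(n, p); noise: Laplace with scale 1/eps."""
    ws = [math.comb(n, k) * p**k * (1 - p)**(n - k) * laplace_pdf(y - k, 1 / eps)
          for k in range(n + 1)]
    return sum(k * w for k, w in enumerate(ws)) / sum(ws)

random.seed(1)
n, p, eps, true_k = 100, 0.3, 0.2, 28
noisy = true_k + random.expovariate(eps) * random.choice([-1, 1])  # Laplace draw
print(f"noisy={noisy:.1f}  bayes={bayes_estimate(noisy, n, p, eps):.1f}")
```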
17. Three-Way Joins on MapReduce: An Experimental Study
- Author
- Kimmett, Ben, Thomo, Alex, and Venkatesh, S.
- Subjects
- Computer Science - Databases, Computer Science - Distributed, Parallel, and Cluster Computing, 68W15, H.2.4, H.2.8
- Abstract
We study three-way joins on MapReduce. Joins are very useful in a multitude of applications, from data integration and traversing social networks to mining graphs and automata-based constructions. However, joins are expensive, even for moderate data sets; we need efficient algorithms to perform distributed computation of joins using clusters of many machines. MapReduce has become an increasingly popular distributed computing system and programming paradigm. We consider a state-of-the-art MapReduce multi-way join algorithm by Afrati and Ullman and show when it is appropriate for use on very large data sets. By providing a detailed experimental study, we demonstrate that this algorithm scales much better than what is suggested by the original paper. However, if the join result needs to be summarized or aggregated, as opposed to being only enumerated, the aggregation step can be integrated into a cascade of two-way joins, which then becomes more efficient than the multi-way algorithm and is the preferred solution., Comment: 6 pages
- Published
- 2014
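The Afrati-Ullman one-round scheme for R(a,b) ⋈ S(b,c) ⋈ T(c,d) hashes the two join keys onto a grid of reducers: each S tuple lands on a single reducer, while R and T tuples are replicated along one grid axis. A toy in-memory simulation (grid size and relation contents are illustrative):

```python
from collections import defaultdict

m = 2                       # buckets per join key => m * m reducers
h = lambda v: hash(v) % m   # stand-in for the partitioning hash

R = [("a1", "b1"), ("a2", "b2")]
S = [("b1", "c1"), ("b2", "c2")]
T = [("c1", "d1"), ("c2", "d2")]

reducers = defaultdict(lambda: {"R": [], "S": [], "T": []})

# Map phase: S to one reducer; R replicated across columns, T across rows.
for a, b in R:
    for j in range(m):
        reducers[h(b), j]["R"].append((a, b))
for b, c in S:
    reducers[h(b), h(c)]["S"].append((b, c))
for c, d in T:
    for i in range(m):
        reducers[i, h(c)]["T"].append((c, d))

# Reduce phase: each reducer joins only its local fragments.
out = [(a, b, c, d)
       for frag in reducers.values()
       for a, b in frag["R"] for b2, c in frag["S"] if b == b2
       for c2, d in frag["T"] if c == c2]
print(sorted(out))  # the two chains a1-b1-c1-d1 and a2-b2-c2-d2
```

Each matching triple is joined at exactly one reducer, (h(b), h(c)), so no duplicate elimination is needed; the replication factor is the price paid for the single round.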
18. Constructing Gazetteers from Volunteered Big Geo-Data Based on Hadoop
- Author
- Gao, Song, Li, Linna, Li, Wenwen, Janowicz, Krzysztof, and Zhang, Yue
- Subjects
- Computer Science - Distributed, Parallel, and Cluster Computing, H.2.4, H.2.8, H.3.3
- Abstract
Traditional gazetteers are built and maintained by authoritative mapping agencies. In the age of Big Data, it is possible to construct gazetteers in a data-driven approach by mining rich volunteered geographic information (VGI) from the Web. In this research, we build a scalable distributed platform and a high-performance geoprocessing workflow based on the Hadoop ecosystem to harvest crowd-sourced gazetteer entries. Using experiments based on geotagged datasets in Flickr, we find that the MapReduce-based workflow running on the spatially enabled Hadoop cluster can reduce the processing time compared with traditional desktop-based operations by an order of magnitude. We demonstrate how to use such a novel spatial-computing infrastructure to facilitate gazetteer research. In addition, we introduce a provenance-based trust model for quality assurance. This work offers new insights on enriching future gazetteers with the use of Hadoop clusters, and makes contributions in connecting GIS to the cloud computing environment for the next frontier of Big Geo-Data analytics., Comment: 45 pages, 10 figures
- Published
- 2013
19. ENFrame: A Platform for Processing Probabilistic Data
- Author
- van Schaik, Sebastiaan J., Olteanu, Dan, and Fink, Robert
- Subjects
- Computer Science - Databases, H.2.4, H.2.8, H.3.5
- Abstract
This paper introduces ENFrame, a unified data processing platform for querying and mining probabilistic data. Using ENFrame, users can write programs in a fragment of Python with constructs such as bounded-range loops, list comprehension, aggregate operations on lists, and calls to external database engines. The program is then interpreted probabilistically by ENFrame. The realisation of ENFrame required novel contributions along several directions. We propose an event language that is expressive enough to succinctly encode arbitrary correlations, trace the computation of user programs, and allow for computation of discrete probability distributions of program variables. We exemplify ENFrame on three clustering algorithms: k-means, k-medoids, and Markov Clustering. We introduce sequential and distributed algorithms for computing the probability of interconnected events exactly or approximately with error guarantees. Experiments with k-medoids clustering of sensor readings from energy networks show orders-of-magnitude improvements of exact clustering using ENFrame over naïve clustering in each possible world, of approximate over exact, and of distributed over sequential algorithms., Comment: 12 pages
- Published
- 2013
20. Array Requirements for Scientific Applications and an Implementation for Microsoft SQL Server
- Author
- Dobos, László, Szalay, Alexander, Blakeley, José, Budavári, Tamás, Csabai, István, Tomic, Dragan, Milovanovic, Milos, Tintor, Marko, and Jovanovic, Andrija
- Subjects
- Computer Science - Databases, H.2.4, H.3.2, H.2.8, E.1, J.2
- Abstract
This paper outlines certain scenarios from the fields of astrophysics and fluid dynamics simulations which require high-performance data warehouses that support an array data type. A common feature of all these use cases is that subsetting and preprocessing the data on the server side (as far as possible inside the database server process) is necessary to avoid the client-server overhead and to minimize IO utilization. Analyzing and summarizing the requirements of the various fields helps software engineers come up with a comprehensive design of an array extension to relational database systems that covers a wide range of scientific applications. We also present a working implementation of an array data type for Microsoft SQL Server 2008 to support large-scale scientific applications. We introduce the design of the array type, results from a performance evaluation, and discuss the lessons learned from this implementation. The library can be downloaded from our website at http://voservices.net/sqlarray/
- Published
- 2011
21. Secure Mining of Association Rules in Horizontally Distributed Databases
- Author
- Tassa, Tamir
- Subjects
- Computer Science - Databases, Computer Science - Cryptography and Security, Computer Science - Distributed, Parallel, and Cluster Computing, H.2.4, H.2.8
- Abstract
We propose a protocol for secure mining of association rules in horizontally distributed databases. The current leading protocol is that of Kantarcioglu and Clifton (TKDE 2004). Our protocol, like theirs, is based on the Fast Distributed Mining (FDM) algorithm of Cheung et al. (PDIS 1996), which is an unsecured distributed version of the Apriori algorithm. The main ingredients in our protocol are two novel secure multi-party algorithms: one that computes the union of private subsets that each of the interacting players holds, and another that tests the inclusion of an element held by one player in a subset held by another. Our protocol offers enhanced privacy with respect to the protocol of Kantarcioglu and Clifton. In addition, it is simpler and is significantly more efficient in terms of communication rounds, communication cost and computational cost.
- Published
- 2011
22. Mining Multi-Level Frequent Itemsets under Constraints
- Author
- Gouider, Mohamed Salah and Farhat, Amine
- Subjects
- Computer Science - Databases, Computer Science - Artificial Intelligence, Computer Science - Data Structures and Algorithms, 68P04, 68Q04, 68T04, 68U04, H.2.4, H.2.8, I.2.6, I.2.4, I.1.2
- Abstract
Mining association rules is a data mining task that extracts knowledge in the form of significant implication relations among useful items (objects) in a database. Mining multilevel association rules uses concept hierarchies, also called taxonomies and defined as 'is-a' relations between objects, to extract rules whose items belong to different levels of abstraction. These rules are more useful, more refined, and more interpretable by the user. Several algorithms have been proposed in the literature to discover multilevel association rules. In this article, we are interested in the problem of discovering multi-level frequent itemsets under constraints, involving the user in the research process. We propose a technique for modeling and interpreting constraints in a context where concept hierarchies are used. Three approaches for discovering multi-level frequent itemsets under constraints are proposed and discussed: a basic approach, a "Test and Generate" approach, and a pruning-based approach., Comment: 20 pages
- Published
- 2010
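The standard route to multilevel itemsets is to extend each transaction with the taxonomy ancestors of its items, after which an ordinary frequent-itemset miner applies unchanged. A toy sketch of that extension step (the taxonomy, transactions, and counts are illustrative; the paper's constraint handling is not shown):

```python
from collections import Counter
from itertools import combinations

taxonomy = {"skim milk": "milk", "2% milk": "milk", "milk": "dairy",
            "cheddar": "cheese", "cheese": "dairy"}  # child -> parent ('is-a')

def extend(transaction):
    """Add every ancestor of every item (transitive 'is-a' closure)."""
    out = set(transaction)
    for item in transaction:
        while item in taxonomy:
            item = taxonomy[item]
            out.add(item)
    return out

tx = [{"skim milk", "cheddar"}, {"2% milk"}, {"skim milk", "cheddar"}]
counts = Counter(frozenset(p) for t in tx for p in combinations(sorted(extend(t)), 2))
print(counts[frozenset({"milk", "dairy"})])         # 3: frequent at ancestor level
print(counts[frozenset({"skim milk", "cheddar"})])  # 2: frequent at leaf level
```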
23. Publishing Math Lecture Notes as Linked Data
- Author
- David, Catalin, Kohlhase, Michael, Lange, Christoph, Rabe, Florian, Zhiltsov, Nikita, and Zholudev, Vyacheslav
- Subjects
- Computer Science - Digital Libraries, Computer Science - Artificial Intelligence, Mathematics - History and Overview, 68T35, 68T30, H.2.4, H.2.8, H.3.5, G.4, F.4.m, H.5.3, H.5.4, J.2
- Abstract
We mark up a corpus of LaTeX lecture notes semantically and expose them as Linked Data in XHTML+MathML+RDFa. Our application makes the resulting documents interactively browsable for students. Our ontology helps to answer queries from students and lecturers, and paves the path towards an integration of our corpus with external sites., Comment: 7th Extended Semantic Web Conference (http://www.eswc2010.org), Demo Track
- Published
- 2010
24. Adding HL7 version 3 data types to PostgreSQL
- Author
- Havinga, Yeb, Dijkstra, Willem, and de Keijzer, Ander
- Subjects
- Computer Science - Databases, 68N99, C.4, D.3.3, H.2.4, H.2.8, J.3
- Abstract
The HL7 standard is widely used to exchange medical information electronically. As a part of the standard, HL7 defines scalar communication data types like physical quantity, point in time and concept descriptor, but also complex types such as interval types, collection types and probabilistic types. Typical HL7 applications will store their communications in a database, resulting in a translation from HL7 concepts and types into database types. Since the data types were not designed to be implemented in a relational database server, this transition is cumbersome and fraught with programmer error. The purpose of this paper is twofold. First, we analyze the HL7 version 3 data type definitions and define a number of conditions that must be met for the data type to be suitable for implementation in a relational database. As a result of this analysis we describe a number of possible improvements in the HL7 specification. Second, we describe an implementation in the PostgreSQL database server and show that the database server can effectively execute scientific calculations with units of measure, supports a large number of operations on time points and intervals, and can perform operations that are akin to a medical terminology server. Experiments on synthetic data show that the user-defined types perform better than an implementation that uses only standard data types from the database server., Comment: 12 pages, 9 figures, 6 tables
- Published
- 2010
25. Perspects in astrophysical databases
- Author
- Frailis, M., De Angelis, A., and Roberto, V.
- Subjects
- Computer Science - Databases, Astrophysics, H.2.4, H.2.8
- Abstract
Astrophysics has become a domain extremely rich in scientific data. Data mining tools are needed for information extraction from such large datasets. This calls for an approach to data management emphasizing the efficiency and simplicity of data access; efficiency is obtained using multidimensional access methods and simplicity is achieved by properly handling metadata. Moreover, clustering and classification techniques on large datasets pose additional requirements in terms of computation and memory scalability and interpretability of results. In this study we review some possible solutions.
- Published
- 2004
26. Data Management and Mining in Astrophysical Databases
- Author
- Frailis, M., De Angelis, A., and Roberto, V.
- Subjects
- Computer Science - Databases, Astrophysics, Physics - Data Analysis, Statistics and Probability, H.2.4, H.2.8
- Abstract
We analyse the issues involved in the management and mining of astrophysical data. The traditional approach to data management in the astrophysical field is not able to keep up with the increasing size of the data gathered by modern detectors. An essential role in the astrophysical research will be assumed by automatic tools for information extraction from large datasets, i.e. data mining techniques, such as clustering and classification algorithms. This asks for an approach to data management based on data warehousing, emphasizing the efficiency and simplicity of data access; efficiency is obtained using multidimensional access methods and simplicity is achieved by properly handling metadata. Clustering and classification techniques, on large datasets, pose additional requirements: computational and memory scalability with respect to the data size, interpretability and objectivity of clustering or classification results. In this study we address some possible solutions., Comment: 10 pages, Latex
- Published
- 2003
27. An on-line Integrated Bookkeeping: electronic run log book and Meta-Data Repository for ATLAS
- Author
- Barczyc, M., Burckhart-Chromek, D., Caprini, M., Conceicao, J. Da Silva, Dobson, M., Flammer, J., Jones, R., Kazarov, A., Kolos, S., Liko, D., Mapelli, L., Soloviev, I., Hart, R. (NIKHEF), Amorim, A., Klose, D., Lima, J., Lucio, L., Pedro, L., Wolters, H., Badescu, E. (NIPNE), Alexandrov, I., Kotov, V., Mineev, M. (JINR), and Ryabov, Yu. (PNPI)
- Subjects
- Computer Science - Databases, H.2.4, H.2.8
- Abstract
In the context of the ATLAS experiment, there is growing evidence of the importance of different kinds of Meta-data, including all the important details of the detector and data acquisition that are vital for the analysis of the acquired data. The Online BookKeeper (OBK) is a component of the ATLAS online software that stores all information collected while running the experiment, including the Meta-data associated with event acquisition, triggering and storage. The facilities for acquisition of control data within the on-line software framework, together with a fully functional Web interface, make the OBK a powerful tool containing all information needed for event analysis, including an electronic log book. In this paper we explain how OBK plays a role as one of the main collectors and managers of Meta-data produced on-line, and we also focus on the Web facilities already available. The usage of the web interface as an electronic run logbook is also explained, together with the future extensions. We describe the technology used in OBK development and how we arrived at the present design, explaining our previous experience with various DBMS technologies. The extensive performance evaluations that have been performed and the usage in the production environment of the ATLAS test beams are also analysed.
- Published
- 2003
28. Configuration Database for BaBar On-line
- Author
- Bartoldus, R., Dubois-Felsmann, G., Kolomensky, Y., and Salnikov, A.
- Subjects
- Computer Science - Databases, Computer Science - Information Retrieval, H.2.4, H.2.8
- Abstract
The configuration database is one of the vital systems in the BaBar on-line system. It provides services for the different parts of the data acquisition system and control system that require run-time parameters. The original design and implementation of the configuration database played a significant role in the successful BaBar operations since the beginning of the experiment. Recent additions to the design of the configuration database provide better means for the management of data and add new tools to simplify the main configuration tasks. We describe the design of the configuration database, its implementation with the Objectivity/DB object-oriented database, and our experience collected during the years of operation., Comment: Talk from the 2003 Computing in High Energy and Nuclear Physics (CHEP03), La Jolla, Ca, USA, March 2003, 5 pages, 4 figures, PDF. PSN MOKT004
- Published
- 2003
29. Event Indexing Systems for Efficient Selection and Analysis of HERA Data
- Author
- Bauerdick, L. A. T., Fox-Murphy, Adrian, Haas, Tobias, Stonjek, Stefan, and Tassi, Enrico
- Subjects
- Computer Science - Databases, Computer Science - Information Retrieval, H.2.4, H.3.1, H.3.3, H.3.4, J.2, H.2.8
- Abstract
The design and implementation of two software systems introduced to improve the efficiency of offline analysis of event data taken with the ZEUS Detector at the HERA electron-proton collider at DESY are presented. Two different approaches were taken, one using a set of event directories and the other using a tag database based on a commercial object-oriented database management system. These are described and compared. Both systems provide quick direct access to individual collision events in a sequential data store of several terabytes, and they both considerably improve the event analysis efficiency. In particular, the tag database provides a very flexible selection mechanism and can dramatically reduce the computing time needed to extract small subsamples from the total event sample. Gains as large as a factor of 20 have been obtained., Comment: Accepted for publication in Computer Physics Communications
- Published
- 2001
30. Microsoft TerraServer
- Author
- Barclay, Tom, Eberl, Robert, Gray, Jim, Nordlinger, John, Raghavendran, Guru, Slutz, Don, Smith, Greg, Smoot, Phil, Hoffman, John, Robb III, Natt, Rossmeissl, Hedy, Duff, Beth, Lee, George, Mathesmier, Theresa, and Sunne, Randall
- Subjects
- Computer Science - Databases, Computer Science - Digital Libraries, H.2.4, H.2.8, H.3.5
- Abstract
The Microsoft TerraServer stores aerial and satellite images of the earth in a SQL Server Database served to the public via the Internet. It is the world's largest atlas, combining five terabytes of image data from the United States Geodetic Survey, Sovinformsputnik, and Encarta Virtual Globe. Internet browsers provide intuitive spatial and gazetteer interfaces to the data. The TerraServer is also an E-Commerce application. Users can buy the right to use the imagery using Microsoft Site Servers managed by the USGS and Aerial Images. This paper describes the TerraServer's design and implementation., Comment: Original file at http://research.microsoft.com/~gray/TerraServer_TR.doc
- Published
- 1998
31. Analyzing Partitioned FAIR Health Data Responsibly
- Subjects
E.1 ,E.3 ,H.2.4 ,H.2.8 ,cs.CY - Abstract
It is widely anticipated that the use of health-related big data will enable further understanding and improvements in human health and wellbeing. Our current project, funded through the Dutch National Research Agenda, aims to explore the relationship between the development of diabetes and socio-economic factors such as lifestyle and health care utilization. The analysis involves combining data from the Maastricht Study (DMS), a prospective clinical study, and data collected by Statistics Netherlands (CBS) as part of its routine operations. However, a wide array of social, legal, technical, and scientific issues hinder the analysis. In this paper, we describe these challenges and our progress towards addressing them.
- Published
- 2018
32. Constructing gazetteers from volunteered Big Geo-Data based on Hadoop
- Author
-
Yue Zhang, Krzysztof Janowicz, Song Gao, Linna Li, and Wenwen Li
- Subjects
FOS: Computer and information sciences ,Volunteered geographic information ,InformationSystems_INFORMATIONSTORAGEANDRETRIEVAL ,Geography, Planning and Development ,Big data ,0211 other engineering and technologies ,0507 social and economic geography ,Cloud computing ,02 engineering and technology ,computer.software_genre ,Geoprocessing workflow ,H.2.4 ,H.3.3 ,H.2.8 ,021101 geological & geomatics engineering ,General Environmental Science ,Database ,business.industry ,Ecological Modeling ,05 social sciences ,Construct (python library) ,Urban Studies ,Workflow ,Geography ,Computer Science - Distributed, Parallel, and Cluster Computing ,Analytics ,Scalable distributed ,Distributed, Parallel, and Cluster Computing (cs.DC) ,business ,050703 geography ,computer - Abstract
Traditional gazetteers are built and maintained by authoritative mapping agencies. In the age of Big Data, it is possible to construct gazetteers in a data-driven approach by mining rich volunteered geographic information (VGI) from the Web. In this research, we build a scalable distributed platform and a high-performance geoprocessing workflow based on the Hadoop ecosystem to harvest crowd-sourced gazetteer entries. Using experiments based on geotagged datasets in Flickr, we find that the MapReduce-based workflow running on the spatially enabled Hadoop cluster can reduce the processing time compared with traditional desktop-based operations by an order of magnitude. We demonstrate how to use such a novel spatial-computing infrastructure to facilitate gazetteer research. In addition, we introduce a provenance-based trust model for quality assurance. This work offers new insights on enriching future gazetteers with the use of Hadoop clusters, and makes contributions in connecting GIS to the cloud computing environment for the next frontier of Big Geo-Data analytics., Comment: 45 pages, 10 figures
- Published
- 2017
33. Analyzing Partitioned FAIR Health Data Responsibly
- Subjects
E.1 ,E.3 ,H.2.4 ,H.2.8 ,cs.CY - Abstract
It is widely anticipated that the use of health-related big data will enable further understanding and improvements in human health and wellbeing. Our current project, funded through the Dutch National Research Agenda, aims to explore the relationship between the development of diabetes and socio-economic factors such as lifestyle and health care utilization. The analysis involves combining data from the Maastricht Study (DMS), a prospective clinical study, and data collected by Statistics Netherlands (CBS) as part of its routine operations. However, a wide array of social, legal, technical, and scientific issues hinder the analysis. In this paper, we describe these challenges and our progress towards addressing them.
- Published
- 2018
34. Constellation Queries over Big Data
- Author
-
Porto, Fábio, Khatibi, Amir, Rittmeyer, Joao, Ogasawara, Eduardo, Valduriez, Patrick, Shasha, Dennis, Laboratorio Nacional de Computação Cientifica [Rio de Janeiro] (LNCC / MCT), Centro Federal de Educação Tecnológica Celso Suckow da Fonseca (Rio de Janeiro) ( CEFET/RJ), Scientific Data Management (ZENITH), Laboratoire d'Informatique de Robotique et de Microélectronique de Montpellier (LIRMM), Université de Montpellier (UM)-Centre National de la Recherche Scientifique (CNRS)-Université de Montpellier (UM)-Centre National de la Recherche Scientifique (CNRS)-Inria Sophia Antipolis - Méditerranée (CRISAM), Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria), Courant Institute of Mathematical Sciences [New York] (CIMS), New York University [New York] (NYU), NYU System (NYU)-NYU System (NYU), SBC, SciDISC Inria associated team with Brazil, European Project: 689772,H2020 Pilier Industrial Leadership,H2020-EUB-2015,HPC4E(2015), and Centre National de la Recherche Scientifique (CNRS)-Université de Montpellier (UM)-Centre National de la Recherche Scientifique (CNRS)-Université de Montpellier (UM)-Inria Sophia Antipolis - Méditerranée (CRISAM)
- Subjects
FOS: Computer and information sciences ,Big data ,[INFO.INFO-DB]Computer Science [cs]/Databases [cs.DB] ,Computer Science - Databases ,H.2.4 ,H.2.8 ,H.3.1 ,Databases (cs.DB) ,ACM: H.: Information Systems/H.2: DATABASE MANAGEMENT ,Pattern search - Abstract
International audience; A geometrical pattern is a set of points with all pairwise distances (or, more generally, relative distances) specified. Finding matches to such patterns has applications to spatial data in seismic, astronomical, and transportation contexts. Finding geometric patterns is a challenging problem as the potential number of sets of elements that compose shapes is exponentially large in the size of the dataset and the pattern. In this paper, we propose algorithms to find patterns in large data applications. Our methods combine quadtrees, matrix multiplication, and bucket join processing to discover sets of points that match a geometric pattern within some additive factor on the pairwise distances. Our distributed experiments show that the choice of composition algorithm (matrix multiplication or nested loops) depends on the freedom introduced in the query geometry through the distance additive factor. Three clearly identified blocks of threshold values guide the choice of the best composition algorithm.; Um padra ̃o geome ́trico e ́ definido por um conjunto de pontos e todos os pares de distaˆncias entre estes pontos. Encontrar casamentos de padro ̃es geome ́tricos em datasets tem aplicac ̧o ̃es na astronomia, na pesquisa s ́ısmica e no desenho de a ́reas urbanas. A soluc ̧a ̃o do problema impo ̃e um grande desafio, considerando-se o nu ́mero exponencial de candidatos, potencialmente func ̧a ̃o do nu ́mero de elementos no dataset e nu ́mero de pontos na forma geome ́trica. O me ́todo aqui apresentado inclui: quadtrees,multiplicac ̧a ̃o de matrizes e junc ̧o ̃es espaciais para encontrar conjuntos de pontos que se aproximem do padra ̃o fornecido, com um erro admiss ́ıvel. Apresentamos uma implementac ̧a ̃o dis- tribu ́ıda reveladora de que a escolha do algoritmo (multiplicac ̧a ̃o de matrizes ou junc ̧o ̃es espaciais) depende da liberdade introduzida por um fator de erro adi- tivo na geometria do padra ̃o. Identificamos treˆs regio ̃es baseadas nos valores de erro tolerados que determinam a escolha do algoritmo.
- Published
- 2018
35. Accelerated Nearest Neighbor Search with Quick ADC
- Author
-
Anne-Marie Kermarrec, Nicolas Le Scouarnec, Fabien André, Technicolor R & I [Cesson Sévigné], Technicolor, As Scalable As Possible: foundations of large scale dynamic distributed systems (ASAP), Inria Rennes – Bretagne Atlantique, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-SYSTÈMES LARGE ÉCHELLE (IRISA-D1), Institut de Recherche en Informatique et Systèmes Aléatoires (IRISA), Université de Rennes (UR)-Institut National des Sciences Appliquées - Rennes (INSA Rennes), Institut National des Sciences Appliquées (INSA)-Institut National des Sciences Appliquées (INSA)-Université de Bretagne Sud (UBS)-École normale supérieure - Rennes (ENS Rennes)-Institut National de Recherche en Informatique et en Automatique (Inria)-CentraleSupélec-Centre National de la Recherche Scientifique (CNRS)-IMT Atlantique (IMT Atlantique), Institut Mines-Télécom [Paris] (IMT)-Institut Mines-Télécom [Paris] (IMT)-Université de Rennes (UR)-Institut National des Sciences Appliquées - Rennes (INSA Rennes), Institut Mines-Télécom [Paris] (IMT)-Institut Mines-Télécom [Paris] (IMT)-Institut de Recherche en Informatique et Systèmes Aléatoires (IRISA), Institut National des Sciences Appliquées (INSA)-Institut National des Sciences Appliquées (INSA)-Université de Bretagne Sud (UBS)-École normale supérieure - Rennes (ENS Rennes)-CentraleSupélec-Centre National de la Recherche Scientifique (CNRS)-IMT Atlantique (IMT Atlantique), Institut Mines-Télécom [Paris] (IMT)-Institut Mines-Télécom [Paris] (IMT), Université de Rennes 1 (UR1), Université de Rennes (UNIV-RENNES)-Université de Rennes (UNIV-RENNES)-Institut National des Sciences Appliquées - Rennes (INSA Rennes), Institut National des Sciences Appliquées (INSA)-Université de Rennes (UNIV-RENNES)-Institut National des Sciences Appliquées (INSA)-Université de Bretagne Sud (UBS)-École normale supérieure - Rennes (ENS Rennes)-Institut National de Recherche en Informatique et en Automatique (Inria)-CentraleSupélec-Centre National de la Recherche Scientifique (CNRS)-IMT Atlantique Bretagne-Pays de la Loire (IMT Atlantique), Institut Mines-Télécom [Paris] (IMT)-Institut Mines-Télécom [Paris] (IMT)-Université de Rennes 1 (UR1), Institut National des Sciences Appliquées (INSA)-Université de Rennes (UNIV-RENNES)-Institut National des Sciences Appliquées (INSA)-Université de Bretagne Sud (UBS)-École normale supérieure - Rennes (ENS Rennes)-CentraleSupélec-Centre National de la Recherche Scientifique (CNRS)-IMT Atlantique Bretagne-Pays de la Loire (IMT Atlantique), SYSTÈMES LARGE ÉCHELLE (IRISA-D1), Université de Bretagne Sud (UBS)-Institut National des Sciences Appliquées - Rennes (INSA Rennes), Institut National des Sciences Appliquées (INSA)-Université de Rennes (UNIV-RENNES)-Institut National des Sciences Appliquées (INSA)-Université de Rennes (UNIV-RENNES)-Institut National de Recherche en Informatique et en Automatique (Inria)-École normale supérieure - Rennes (ENS Rennes)-Centre National de la Recherche Scientifique (CNRS)-Université de Rennes 1 (UR1), Université de Rennes (UNIV-RENNES)-CentraleSupélec-IMT Atlantique Bretagne-Pays de la Loire (IMT Atlantique), Institut Mines-Télécom [Paris] (IMT)-Institut Mines-Télécom [Paris] (IMT)-Université de Bretagne Sud (UBS)-Institut National des Sciences Appliquées - Rennes (INSA Rennes), Institut Mines-Télécom [Paris] (IMT)-Institut Mines-Télécom [Paris] (IMT)-Inria Rennes – Bretagne Atlantique, and Institut National de Recherche en 
Informatique et en Automatique (Inria)
- Subjects
FOS: Computer and information sciences ,0209 industrial biotechnology ,Speedup ,Computer science ,Nearest neighbor search ,Computer Vision and Pattern Recognition (cs.CV) ,Computer Science - Computer Vision and Pattern Recognition ,Scale-invariant feature transform ,02 engineering and technology ,H.5.1 ,H.2.4 ,H.2.8 ,Computer Science - Information Retrieval ,k-nearest neighbors algorithm ,020901 industrial engineering & automation ,Computer Science - Databases ,0202 electrical engineering, electronic engineering, information engineering ,SIMD ,Computer Science - Performance ,[INFO.INFO-DB]Computer Science [cs]/Databases [cs.DB] ,Quantization (signal processing) ,[INFO.INFO-CE]Computer Science [cs]/Computational Engineering, Finance, and Science [cs.CE] ,[INFO.INFO-MM]Computer Science [cs]/Multimedia [cs.MM] ,Databases (cs.DB) ,Multimedia (cs.MM) ,Performance (cs.PF) ,[INFO.INFO-PF]Computer Science [cs]/Performance [cs.PF] ,[INFO.INFO-IR]Computer Science [cs]/Information Retrieval [cs.IR] ,Lookup table ,020201 artificial intelligence & image processing ,Cache ,Algorithm ,Computer Science - Multimedia ,Information Retrieval (cs.IR) - Abstract
Efficient Nearest Neighbor (NN) search in high-dimensional spaces is a foundation of many multimedia retrieval systems. Because it offers low responses times, Product Quantization (PQ) is a popular solution. PQ compresses high-dimensional vectors into short codes using several sub-quantizers, which enables in-RAM storage of large databases. This allows fast answers to NN queries, without accessing the SSD or HDD. The key feature of PQ is that it can compute distances between short codes and high-dimensional vectors using cache-resident lookup tables. The efficiency of this technique, named Asymmetric Distance Computation (ADC), remains limited because it performs many cache accesses. In this paper, we introduce Quick ADC, a novel technique that achieves a 3 to 6 times speedup over ADC by exploiting Single Instruction Multiple Data (SIMD) units available in current CPUs. Efficiently exploiting SIMD requires algorithmic changes to the ADC procedure. Namely, Quick ADC relies on two key modifications of ADC: (i) the use 4-bit sub-quantizers instead of the standard 8-bit sub-quantizers and (ii) the quantization of floating-point distances. This allows Quick ADC to exceed the performance of state-of-the-art systems, e.g., it achieves a Recall@100 of 0.94 in 3.4 ms on 1 billion SIFT descriptors (128-bit codes)., Comment: 8 pages, 5 figures, published in Proceedings of ICMR'17, Bucharest, Romania, June 06-09, 2017
- Published
- 2017
36. NetClus: A Scalable Framework for Locating Top-K Sites for Placement of Trajectory-Aware Services
- Author
-
Arnab Bhattacharya, Richa Sharma, Shubhadip Mitra, Priya Saraf, Harsh Bhandari, and Sayan Ranuy
- Subjects
FOS: Computer and information sciences ,Service (systems architecture) ,Computer science ,Search engine indexing ,Mobile computing ,Databases (cs.DB) ,02 engineering and technology ,010501 environmental sciences ,computer.software_genre ,01 natural sciences ,Memory management ,Computer Science - Databases ,H.2.4 ,020204 information systems ,Scalability ,0202 electrical engineering, electronic engineering, information engineering ,H.2.8 ,Data mining ,Greedy algorithm ,computer ,0105 earth and related environmental sciences - Abstract
Facility location queries identify the best locations to set up new facilities for providing service to users. The majority of existing works in this space assume that user locations are static. Such an assumption is too restrictive for planning many modern real-life services such as fuel stations, ATMs, convenience stores, and cellphone base-stations that are widely accessed by mobile users. The placement of such services should, therefore, factor in the mobility patterns or trajectories of the users rather than simply their static locations. In this work, we introduce the TOPS (Trajectory-Aware Optimal Placement of Services) query that locates the best k sites on a road network. The aim is to optimize a wide class of objective functions defined over the user trajectories. We show that the problem is NP-hard and that even the greedy heuristic with an approximation bound of (1-1/e) fails to scale on urban-scale datasets. To overcome this challenge, we develop a multi-resolution clustering based indexing framework called NetClus. Empirical studies on real road network trajectory datasets show that NetClus offers solutions that are comparable in quality to those of the greedy heuristic, while having practical response times and low memory footprints. Additionally, the NetClus framework can absorb dynamic updates in mobility patterns, handle constraints such as site costs, capacity, and existing services, thereby providing an effective solution for modern urban-scale scenarios., Comment: ICDE 2017 poster
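As a point of reference for the (1-1/e) bound mentioned above, here is a small Python sketch of the standard greedy heuristic for coverage-style placement. The data and names are invented, and NetClus's multi-resolution clustering index, which is the paper's actual contribution, is not shown.

```python
# Toy (1 - 1/e) greedy for coverage-style placement: repeatedly pick
# the site covering the most not-yet-served trajectories. This is the
# baseline the paper measures against, not NetClus itself.

def greedy_top_k_sites(coverage, k):
    """coverage: dict mapping site -> set of trajectory ids it serves."""
    coverage = dict(coverage)            # work on a copy
    chosen, covered = [], set()
    for _ in range(k):
        best = max(coverage, key=lambda s: len(coverage[s] - covered),
                   default=None)
        if best is None or not (coverage[best] - covered):
            break                        # no site adds a new trajectory
        chosen.append(best)
        covered |= coverage.pop(best)
    return chosen, covered

sites = {"s1": {1, 2, 3}, "s2": {3, 4}, "s3": {5}, "s4": {1, 2}}
print(greedy_top_k_sites(sites, k=2))    # (['s1', 's2'], {1, 2, 3, 4})
```

Each greedy round rescans every candidate site, which is what fails to scale on urban-scale trajectory datasets and motivates the NetClus index.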
- Published
- 2017
37. Time Series Management Systems: A Survey
- Author
-
Søren Kejser Jensen, Torben Bach Pedersen, and Christian Thomsen
- Subjects
FOS: Computer and information sciences ,Computer science ,E.2 ,E.4 ,H.2 ,G.3 ,G.1.2 ,D.2.11 ,E.1 ,H.2.4 ,C.2.4 ,H.2.8 ,02 engineering and technology ,Field (computer science) ,Stream processing ,Computer Science - Databases ,020204 information systems ,0202 electrical engineering, electronic engineering, information engineering ,Time series ,Focus (computing) ,Series (mathematics) ,Scale (chemistry) ,Databases (cs.DB) ,Data science ,Computer Science Applications ,Computational Theory and Mathematics ,Management system ,020201 artificial intelligence & image processing ,Information Systems - Abstract
The collection of time series data increases as more monitoring and automation are being deployed. These deployments range in scale from an Internet of Things (IoT) device located in a household to enormous distributed Cyber-Physical Systems (CPSs) producing large volumes of data at high velocity. To store and analyze these vast amounts of data, specialized Time Series Management Systems (TSMSs) have been developed to overcome the limitations of general purpose Database Management Systems (DBMSs) for time series management. In this paper, we present a thorough analysis and classification of TSMSs developed through academic or industrial research and documented through publications. Our classification is organized into categories based on the architectures observed during our analysis. In addition, we provide an overview of each system with a focus on the motivational use case that drove its development, the functionality the system implements for storage and querying of time series, the components the system is composed of, and the capabilities of each system with regard to Stream Processing and Approximate Query Processing (AQP). Lastly, we provide a summary of research directions proposed by other researchers in the field and present our vision for a next-generation TSMS., Comment: 20 Pages, 15 Figures, 2 Tables, Accepted for publication in IEEE TKDE
- Published
- 2017
38. Secure Mining of Association Rules in Horizontally Distributed Databases
- Author
-
Tamir Tassa
- Subjects
FOS: Computer and information sciences ,Apriori algorithm ,Computer Science - Cryptography and Security ,Association rule learning ,Computer science ,General Inter-ORB Protocol ,Encryption ,computer.software_genre ,H.2.4 ,H.2.8 ,Computer Science - Databases ,Universal composability ,Protocol (object-oriented programming) ,Distributed database ,business.industry ,Databases (cs.DB) ,Computer Science Applications ,Computer Science - Distributed, Parallel, and Cluster Computing ,Computational Theory and Mathematics ,Distributed, Parallel, and Cluster Computing (cs.DC) ,Data mining ,Element (category theory) ,business ,Cryptography and Security (cs.CR) ,computer ,Information Systems ,Computer network - Abstract
We propose a protocol for secure mining of association rules in horizontally distributed databases. The current leading protocol is that of Kantarcioglu and Clifton (TKDE 2004). Our protocol, like theirs, is based on the Fast Distributed Mining (FDM) algorithm of Cheung et al. (PDIS 1996), which is an unsecured distributed version of the Apriori algorithm. The main ingredients in our protocol are two novel secure multi-party algorithms --- one that computes the union of private subsets that each of the interacting players hold, and another that tests the inclusion of an element held by one player in a subset held by another. Our protocol offers enhanced privacy with respect to the protocol of Kantarcioglu and Clifton. In addition, it is simpler and is significantly more efficient in terms of communication rounds, communication cost and computational cost.
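For context, a toy Python sketch of the unsecured FDM skeleton that both protocols build on: each site reports its locally frequent itemsets, the candidates are unioned, and global support is verified. The secure union and secure inclusion sub-protocols, which are the paper's contribution, are omitted, and all data here is illustrative.

```python
from itertools import combinations

# Toy sketch of the unsecured FDM skeleton: local frequency at each
# site, a (here insecure) union of candidates, then a global support
# check. The paper replaces the union and inclusion tests with secure
# multi-party algorithms.

def locally_frequent(transactions, min_support, size):
    counts = {}
    for t in transactions:
        for itemset in combinations(sorted(t), size):
            counts[itemset] = counts.get(itemset, 0) + 1
    return {s for s, c in counts.items() if c >= min_support}

def fdm_round(sites, local_min_support, global_min_support, size=2):
    candidates = set().union(
        *(locally_frequent(s, local_min_support, size) for s in sites))
    def global_support(itemset):
        return sum(sum(set(itemset) <= t for t in s) for s in sites)
    return {c for c in candidates if global_support(c) >= global_min_support}

site_a = [{"bread", "milk"}, {"bread", "beer"}]
site_b = [{"bread", "milk"}, {"milk", "beer"}]
print(fdm_round([site_a, site_b], 1, 2))   # {('bread', 'milk')}
```

In the insecure version above, every site's candidate itemsets are exposed to all others; hiding exactly this information is what the two secure multi-party algorithms in the paper achieve.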
- Published
- 2014
39. Show me the material evidence: Initial experiments on evaluating hypotheses from user-generated multimedia data
- Author
-
Bernardo Gonçalves
- Subjects
FOS: Computer and information sciences ,Computer science ,Computer Science - Artificial Intelligence ,Cognitive computing ,02 engineering and technology ,computer.software_genre ,World Wide Web ,Computer Science - Databases ,H.2.4 ,0202 electrical engineering, electronic engineering, information engineering ,H.2.8 ,Leverage (statistics) ,Social media ,I.2.6 ,I.2.7 ,H.1.2 ,Collective decision ,Multimedia ,Unstructured data ,Databases (cs.DB) ,Multimedia (cs.MM) ,Metadata ,Artificial Intelligence (cs.AI) ,020201 artificial intelligence & image processing ,Lying ,computer ,Computer Science - Multimedia - Abstract
Subjective questions such as 'does neymar dive', 'is clinton lying', or 'is trump a fascist' are popular queries to web search engines, as can be seen from the autocompletion suggestions on Google, Yahoo and Bing. In the era of cognitive computing, beyond search, they could be handled as hypotheses issued for evaluation. Our vision is to leverage the unstructured data and metadata of the rich user-generated multimedia that is often shared as material evidence for or against hypotheses on social media platforms. In this paper we present two preliminary experiments along these lines and discuss the challenges for a cognitive computing system that collects material evidence from user-generated multimedia and aggregates it into some form of collective decision on the hypothesis., Comment: 6 pages, 6 figures, 3 tables. In Proc. of the 1st Workshop on Multimedia Support for Decision-Making Processes, at IEEE Intl. Symposium on Multimedia (ISM'16), San Jose, CA, 2016
- Published
- 2016
40. Finding desirable objects under group categorical preferences
- Author
-
Dimitris Sacharidis, Karim Benouaret, and Nikos Bikakis (Laboratoire d'InfoRmatique en Image et Systèmes d'information (LIRIS), INSA Lyon)
- Subjects
FOS: Computer and information sciences ,Matching (statistics) ,02 engineering and technology ,Recommender system ,Machine learning ,computer.software_genre ,Synthetic data ,Computer Science - Databases ,H.2.4 ,Artificial Intelligence ,020204 information systems ,Computer Science - Data Structures and Algorithms ,H.2.8 ,0202 electrical engineering, electronic engineering, information engineering ,Data Structures and Algorithms (cs.DS) ,[INFO]Computer Science [cs] ,Set (psychology) ,H.3.1 ,Categorical variable ,ComputingMilieux_MISCELLANEOUS ,Mathematics ,I.3.5 ,business.industry ,Pareto principle ,97R50, 68P05, 68P15 ,Databases (cs.DB) ,Object (computer science) ,Human-Computer Interaction ,Ranking ,Hardware and Architecture ,E.1 ,020201 artificial intelligence & image processing ,Artificial intelligence ,business ,computer ,Software ,Information Systems - Abstract
Considering a group of users, each specifying individual preferences over categorical attributes, the problem of determining a set of objects that are objectively preferable to all users is challenging on two levels. First, we need to determine the preferable objects based on the categorical preferences of each user, and second, we need to reconcile possible conflicts among users' preferences. A naive solution would first assign degrees of match between each user and each object, by taking into account all categorical attributes, and then for each object combine these matching degrees across users to compute the object's total score. Such an approach, however, performs two rounds of aggregation, first among categorical attributes and then across users, which obscures and blurs individual preferences. Our solution, instead of combining individual matching degrees, is to operate directly on categorical attributes, and define an objective Pareto-based aggregation for group preferences. Building on our interpretation, we tackle two distinct but relevant problems: finding the Pareto-optimal objects, and objectively ranking objects with respect to the group preferences. To increase efficiency when dealing with categorical attributes, we introduce an elegant transformation of categorical attribute values into numerical values, which exhibits certain nice properties and allows us to use well-known index structures to accelerate the solutions to the two problems. In fact, experiments on real and synthetic data show that our index-based techniques are an order of magnitude faster than baseline approaches, scaling up to millions of objects and thousands of users., Comment: To appear in Knowledge and Information Systems Journal (KAIS), Springer 2015
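A minimal sketch of the Pareto step described above, assuming each object already carries a vector of per-user preference degrees; the paper instead operates directly on the categorical attributes and adds index support, so the names and numbers here are purely illustrative.

```python
# Pareto step: keep an object unless some other object is at least as
# preferred by every user and strictly more preferred by at least one.
# Per-user preference vectors are assumed given for this toy example.

def dominates(a, b):
    """True if vector a Pareto-dominates vector b."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))

def pareto_optimal(objects):
    """objects: dict name -> tuple of per-user preference degrees."""
    return {o for o, v in objects.items()
            if not any(dominates(w, v)
                       for p, w in objects.items() if p != o)}

objs = {"hotel_a": (0.9, 0.4), "hotel_b": (0.6, 0.8), "hotel_c": (0.5, 0.3)}
print(pareto_optimal(objs))   # {'hotel_a', 'hotel_b'}; hotel_c is dominated
```

The quadratic scan shown here is exactly what the paper's numerical transformation and index structures are designed to avoid at scale.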
- Published
- 2016
41. Three-Way Joins on MapReduce: An Experimental Study
- Author
-
Ben Kimmett, S. Venkatesh, and Alex Thomo
- Subjects
FOS: Computer and information sciences ,Traverse ,Theoretical computer science ,Computer science ,Efficient algorithm ,Computation ,Joins ,Databases (cs.DB) ,computer.software_genre ,H.2.4 ,H.2.8 ,68W15 ,Computer Science - Databases ,Computer Science - Distributed, Parallel, and Cluster Computing ,Three way ,Programming paradigm ,Join (sigma algebra) ,Distributed, Parallel, and Cluster Computing (cs.DC) ,computer ,Data integration - Abstract
We study three-way joins on MapReduce. Joins are very useful in a multitude of applications, from data integration and traversing social networks to mining graphs and automata-based constructions. However, joins are expensive, even for moderate data sets; we need efficient algorithms to perform distributed computation of joins using clusters of many machines. MapReduce has become an increasingly popular distributed computing system and programming paradigm. We consider a state-of-the-art MapReduce multi-way join algorithm by Afrati and Ullman and show when it is appropriate for use on very large data sets. Through a detailed experimental study, we demonstrate that this algorithm scales much better than suggested by the original paper. However, if the join result needs to be summarized or aggregated, as opposed to merely enumerated, then the aggregation step can be integrated into a cascade of two-way joins, making the cascade more efficient than the multi-way algorithm and thus the preferred solution., Comment: 6 pages
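To make the cascade alternative concrete, a small Python sketch that evaluates a three-way join as two hash joins and aggregates on the fly instead of enumerating the result. This is an illustration of the idea only, not the Afrati-Ullman one-round algorithm and not actual MapReduce code; all relations and keys are invented.

```python
from collections import defaultdict

# Cascade alternative: evaluate R(a,b) JOIN S(b,c) JOIN T(c,d) as two
# hash joins, consuming the intermediate result lazily and aggregating
# instead of materializing the full join output.

def hash_join(left, right, key):
    index = defaultdict(list)
    for row in right:                   # build phase on the right input
        index[row[key]].append(row)
    for l in left:                      # probe phase on the left input
        for r in index.get(l[key], []):
            yield {**l, **r}

R = [{"a": 1, "b": 10}, {"a": 2, "b": 20}]
S = [{"b": 10, "c": 100}, {"b": 20, "c": 200}]
T = [{"c": 100, "d": 7}, {"c": 100, "d": 8}]

rs = hash_join(R, S, "b")                       # first two-way join
count = sum(1 for _ in hash_join(rs, T, "c"))   # aggregate, don't store
print(count)                                    # 2
```

Because the aggregate folds into the second join, the cascade never stores the joined tuples, which is the efficiency argument the abstract makes for the summarized/aggregated case.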
- Published
- 2014
42. Computing Multi-Relational Sufficient Statistics for Large Databases
- Author
-
Zhensong Qian, Oliver Schulte, and Yan Sun
- Subjects
FOS: Computer and information sciences ,Association rule learning ,Database ,Computer science ,Bayesian network ,Feature selection ,Databases (cs.DB) ,Extension (predicate logic) ,Relational algebra ,computer.software_genre ,Machine Learning (cs.LG) ,Computer Science - Learning ,Computer Science - Databases ,H.2.4 ,Schema (psychology) ,H.2.8 ,Benchmark (computing) ,Table (database) ,Join (sigma algebra) ,Tuple ,computer - Abstract
Databases contain information about which relationships do and do not hold among entities. Making this information accessible for statistical analysis requires computing sufficient statistics that combine information from different database tables. Such statistics may involve any number of positive and negative relationships. With a naive enumeration approach, computing sufficient statistics for negative relationships is feasible only for small databases. We solve this problem with a new dynamic programming algorithm that performs a virtual join, where the requisite counts are computed without materializing join tables. Contingency table algebra is a new extension of relational algebra that facilitates the efficient implementation of this Möbius virtual join operation. The Möbius Join scales to large datasets (over 1M tuples) with complex schemas. Empirical evaluation with seven benchmark datasets showed that information about the presence and absence of links can be exploited in feature selection, association rule mining, and Bayesian network learning., Comment: 11 pages, 8 figures, 8 tables, CIKM'14, November 3-7, 2014, Shanghai, China
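The counting trick at the heart of the virtual join can be illustrated in a few lines: counts for a negative relationship follow from the positive counts and the domain sizes, with no need to enumerate non-links. The sketch below is a deliberately simplified one-relationship case with invented data; the paper's dynamic programming algorithm generalizes this to full contingency tables over complex multi-relational schemas.

```python
# Simplified one-relationship illustration of the counting idea:
# statistics for the negation of a link follow from positive counts
# and domain sizes, so the non-links never need to be materialized.

def relationship_counts(n_users, n_movies, rated_pairs):
    n_pos = len(rated_pairs)          # obtainable from an ordinary join
    n_total = n_users * n_movies      # size of the cross product
    return {"Rated": n_pos, "not Rated": n_total - n_pos}

print(relationship_counts(3, 4, {(1, 1), (1, 2), (2, 3)}))
# {'Rated': 3, 'not Rated': 9}
```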
- Published
- 2014
43. The Assistance Function of Cooperative Design Flows: Illustrated Using the Example of CONCORD
- Author
-
Ritter, Norbert and Mitschang, Bernhard
- Published
- 1997
44. Event based classification of Web 2.0 text streams
- Author
-
Bauer, Andreas and Wolff, Christian
- Subjects
ddc:004 ,FOS: Computer and information sciences ,H.3.3 ,H.2.4 ,I.5.4 ,H.2.8 ,information retrieval ,text mining ,event processing ,web 2.0 ,text streams ,real-time search ,neural network ,stream features ,004 Computer science ,H.3.1 ,Information Retrieval (cs.IR) ,Computer Science - Information Retrieval
Web 2.0 applications like Twitter or Facebook create a continuous stream of information. This demands new kinds of analysis that offer insight into the stream at the moment the information is created, because much of this data is relevant only for a short period of time. To address this problem, real-time search engines have recently received increased attention. They treat the continuous flow of information differently than traditional web search by incorporating temporal and social features that describe the context in which the information was created. Standard approaches, where data is first stored and then processed from persistent storage, suffer from latency. We address the fluid and rapid nature of text streams with an event-based approach that analyzes the stream of information directly. We first define the difference between real-time search and traditional search to clarify the demands of modern text filtering. We then show how event-based features can be used to support the tasks of real-time search engines. Using the example of Twitter, we present a way to combine an event-based approach with text mining and information filtering concepts in order to classify incoming information based on stream features. We calculate stream-dependent features and feed them into a neural network to classify the text streams. We show the discriminative power of event-based features as the foundation for a real-time search engine., Comment: 11 pages, 3 figures, 2 tables
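A compact sketch of the pipeline described above: features are derived from the stream itself (arrival rate, social context) rather than from the text alone, then scored by a classifier. A single logistic unit stands in for the paper's neural network, and every feature name and weight here is invented.

```python
import math
from collections import deque

# Illustrative stream-feature pipeline: a sliding window supplies an
# event-based arrival-rate feature alongside social and text features,
# and a logistic unit (stand-in for the paper's network) scores them.

class StreamFeatures:
    def __init__(self, window_seconds=60.0):
        self.window = window_seconds
        self.times = deque()

    def update(self, msg):
        self.times.append(msg["ts"])
        while self.times and msg["ts"] - self.times[0] > self.window:
            self.times.popleft()             # slide the time window
        return [
            len(self.times) / self.window,    # arrival rate (event feature)
            float(msg["retweets"]),           # social-context feature
            float(len(msg["text"].split())),  # simple text feature
        ]

def relevance(features, weights=(4.0, 0.3, -0.05), bias=-1.0):
    z = bias + sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))         # probability-like score

fx = StreamFeatures()
msg = {"ts": 1000.0, "retweets": 5, "text": "breaking news about the match"}
print(relevance(fx.update(msg)))              # ~0.58 for this toy message
```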
- Published
- 2012
45. Using Stream Features for Instant Document Filtering
- Author
-
Bauer, Andreas and Wolff, Christian
- Subjects
ddc:004 ,H.2.4 ,H.3.3 ,information retrieval ,information filtering ,event processing ,web 2.0 ,text streams ,real-time search ,tf/idf ,okapi ,stream features ,H.2.8 ,004 Computer science
In this paper, we discuss how event processing technologies can be employed for real-time text stream processing and information filtering in the context of the TREC 2012 microblog task. After introducing basic characteristics of stream and event processing, the technical architecture of our text stream analysis engine is presented. Employing well-known term weighting schemes from document-centric text retrieval on temporally dynamic text streams is discussed next, giving details of the ESPER Event Processing Agents (EPAs) we have implemented for this task. Finally, we describe our experimental setup, give details on the TREC microblog runs and the results obtained with our system including some extensions, and offer a short interpretation of the evaluation results.
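As an illustration of why classic term weighting needs rethinking on streams, here is a minimal incremental tf-idf scorer whose document frequencies evolve as messages arrive; it is a stand-in sketch, not the ESPER/EPA implementation described in the paper, and the example tweets are invented.

```python
import math
from collections import Counter

# Minimal incremental tf-idf over a stream: document frequencies are
# updated as messages arrive, so a term's weight decays as it becomes
# common in the stream.

class IncrementalTfIdf:
    def __init__(self):
        self.df = Counter()   # stream-wide document frequencies
        self.n_docs = 0

    def score(self, doc_tokens, query_terms):
        self.n_docs += 1
        self.df.update(set(doc_tokens))
        tf = Counter(doc_tokens)
        return sum(tf[t] * math.log((self.n_docs + 1) / self.df[t])
                   for t in query_terms if tf[t])

ranker = IncrementalTfIdf()
for tweet in (["oil", "spill", "gulf"], ["oil", "price"], ["gulf", "storm"]):
    print(round(ranker.score(tweet, ["oil", "spill"]), 3))
# 1.386, then 0.405, then 0.0: 'oil' weighs less once it is common
```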
- Published
- 2012
46. Array requirements for scientific applications and an implementation for Microsoft SQL Server
- Author
-
Dragan Tomic, Tamás Budavári, Andrija Jovanovic, Alexander S. Szalay, István Csabai, Milos Milovanovic, László Dobos, José A. Blakeley, and Marko Tintor
- Subjects
FOS: Computer and information sciences ,J.2 ,Computer science ,Overhead (engineering) ,H.2.4 ,H.3.2 ,H.2.8 ,E.1 ,Array data type ,02 engineering and technology ,computer.software_genre ,01 natural sciences ,Software ,Relational database management system ,Computer Science - Databases ,0103 physical sciences ,0202 electrical engineering, electronic engineering, information engineering ,010303 astronomy & astrophysics ,Server-side ,Database server ,Database ,business.industry ,Databases (cs.DB) ,Data warehouse ,020201 artificial intelligence & image processing ,business ,computer - Abstract
This paper outlines scenarios from the fields of astrophysics and fluid dynamics simulations that require high-performance data warehouses with support for an array data type. A common feature of all these use cases is that subsetting and preprocessing the data on the server side (as far as possible inside the database server process) is necessary to avoid client-server overhead and to minimize IO utilization. Analyzing and summarizing the requirements of the various fields helps software engineers come up with a comprehensive design for an array extension to relational database systems that covers a wide range of scientific applications. We also present a working implementation of an array data type for Microsoft SQL Server 2008 to support large-scale scientific applications. We introduce the design of the array type, present results from a performance evaluation, and discuss the lessons learned from this implementation. The library can be downloaded from our website at http://voservices.net/sqlarray/
- Published
- 2011
47. Perspects in astrophysical databases
- Author
-
Alessandro De Angelis, Marco Frailis, and Vito Roberto
- Subjects
FOS: Computer and information sciences ,Statistics and Probability ,Computer science ,Data management ,FOS: Physical sciences ,computer.software_genre ,Astrophysics ,Data modeling ,H.2.4 ,H.2.8 ,Computer Science - Databases ,Cluster analysis ,Data element ,business.industry ,Astrophysics (astro-ph) ,Databases (cs.DB) ,Condensed Matter Physics ,Data warehouse ,Metadata ,Data set ,Information extraction ,Data access ,Data mining ,business ,computer - Abstract
Astrophysics has become a domain extremely rich in scientific data. Data mining tools are needed for information extraction from such large data sets. This calls for an approach to data management that emphasizes the efficiency and simplicity of data access; efficiency is obtained using multidimensional access methods and simplicity is achieved by properly handling metadata. Moreover, clustering and classification techniques on large data sets pose additional requirements in terms of computation and memory scalability and interpretability of results. In this study we review some possible solutions.
- Published
- 2004
48. Data Management and Mining in Astrophysical Databases
- Author
-
Marco Frailis, Alessandro De Angelis, and Vito Roberto
- Subjects
FOS: Computer and information sciences ,H.2.4 ,H.2.8 ,ComputingMethodologies_PATTERNRECOGNITION ,Computer Science - Databases ,Physics - Data Analysis, Statistics and Probability ,Astrophysics (astro-ph) ,FOS: Physical sciences ,Databases (cs.DB) ,Astrophysics ,Data Analysis, Statistics and Probability (physics.data-an) - Abstract
We analyse the issues involved in the management and mining of astrophysical data. The traditional approach to data management in the astrophysical field is not able to keep up with the increasing size of the data gathered by modern detectors. An essential role in astrophysical research will be played by automatic tools for information extraction from large datasets, i.e. data mining techniques such as clustering and classification algorithms. This calls for an approach to data management based on data warehousing, emphasizing the efficiency and simplicity of data access; efficiency is obtained using multidimensional access methods and simplicity is achieved by properly handling metadata. Clustering and classification techniques on large datasets pose additional requirements: computational and memory scalability with respect to the data size, and interpretability and objectivity of clustering or classification results. In this study we address some possible solutions., Comment: 10 pages, Latex
- Published
- 2003
49. Event Indexing Systems for Efficient Selection and Analysis of HERA Data
- Author
-
Adrian Fox-Murphy, Stefan Stonjek, L. A. T. Bauerdick, Enrico Tassi, and Tobias Haas
- Subjects
FOS: Computer and information sciences ,J.2 ,Computer science ,Search engine indexing ,General Physics and Astronomy ,Databases (cs.DB) ,HERA ,Terabyte ,computer.software_genre ,Computer Science - Information Retrieval ,H.2.4 ,H.3.1 ,H.3.3 ,H.3.4 ,H.2.8 ,Computer Science - Databases ,Hardware and Architecture ,Software system ,Data mining ,computer ,Information Retrieval (cs.IR) - Abstract
We present the design and implementation of two software systems introduced to improve the efficiency of offline analysis of event data taken with the ZEUS detector at the HERA electron-proton collider at DESY. Two different approaches were taken: one uses a set of event directories, and the other a tag database built on a commercial object-oriented database management system. These are described and compared. Both systems provide quick direct access to individual collision events in a sequential data store of several terabytes, and both considerably improve event analysis efficiency. In particular, the tag database provides a very flexible selection mechanism and can dramatically reduce the computing time needed to extract small subsamples from the total event sample. Gains as large as a factor of 20 have been obtained., Comment: Accepted for publication in Computer Physics Communications
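The tag-database idea can be illustrated with a toy Python sketch: a compact per-event tag record is scanned instead of the multi-terabyte store, and matching events are then fetched by offset. Tag names, bit assignments, and offsets are all invented for the example.

```python
# Toy tag database: a small per-event tag word is scanned instead of
# the multi-terabyte event store, and only matching events are then
# fetched by offset from the sequential data store.

TAGS = {"two_jets": 0b001, "high_et": 0b010, "muon": 0b100}

def select_offsets(tag_table, required):
    """tag_table: list of (event_offset, tag_bits); return offsets of
    events carrying all required tags."""
    mask = 0
    for name in required:
        mask |= TAGS[name]
    return [off for off, bits in tag_table if bits & mask == mask]

tag_table = [(0, 0b011), (4096, 0b001), (8192, 0b111)]
print(select_offsets(tag_table, ["two_jets", "high_et"]))   # [0, 8192]
```

Because the tag table is tiny relative to the event store, subsample extraction touches only matching events, which is where the factor-of-20 savings reported above come from.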
- Published
- 2001
50. Analyzing Partitioned FAIR Health Data Responsibly
- Author
-
Chang Sun, Lianne Ippel, Birgit Wouters, Johan van Soest, Alexander Malic, Onaopepo Adekunle, Bob van den Berg, Marco Puts, Ole Mussmann, Annemarie Koster, Carla van der Kallen, David Townend, Andre Dekker, and Michel Dumontier
- Subjects
FOS: Computer and information sciences ,Computer Science - Computers and Society ,H.2.4 ,E.3 ,Computers and Society (cs.CY) ,H.2.8 ,E.1 - Abstract
It is widely anticipated that the use of health-related big data will enable further understanding and improvements in human health and wellbeing. Our current project, funded through the Dutch National Research Agenda, aims to explore the relationship between the development of diabetes and socio-economic factors such as lifestyle and health care utilization. The analysis involves combining data from the Maastricht Study (DMS), a prospective clinical study, and data collected by Statistics Netherlands (CBS) as part of its routine operations. However, a wide array of social, legal, technical, and scientific issues hinder the analysis. In this paper, we describe these challenges and our progress towards addressing them., Comment: 6 pages, 1 figure, preliminary result, project report