50 results on '"Leong, Hong Va"'
Search Results
2. OS-DRAM: A Delegation Administration Model in a Decentralized Enterprise Environment.
- Author
Yu, Jeffrey Xu, Kitsuregawa, Masaru, Leong, Hong Va, Byun, Changwoo, Park, Seog, and Oh, Sejong
- Abstract
In this paper, we propose an effective delegation administration model using the organizational structure. From a user-level delegation point of view, previous delegation models built on the (Administrative) Role-Based Access Control model cannot present the best solution to security problems such as the leakage of information and the abuse of delegation in a decentralized enterprise environment. Thus, we propose a new integrated management model of the administrative role-based access control model and delegation policy, which is called OS-DRAM. It defines the authority range in an organizational structure that is separated from the role hierarchy, and supports a clear criterion for user-level delegation administration. Consequently, OS-DRAM supports a decentralized user-level delegation policy in which a regular user can freely delegate his/her authority to other users within a security officer's authority range without the security officer's intervention. [ABSTRACT FROM AUTHOR]
- Published
- 2006
3. A Scientific Workflow Framework Integrated with Object Deputy Model for Data Provenance.
- Author
Yu, Jeffrey Xu, Kitsuregawa, Masaru, Leong, Hong Va, Wang, Liwei, Peng, Zhiyong, Luo, Min, Ji, Wenhao, and Huang, Zeqian
- Abstract
There is a critical need to automatically manage large volumes of scientific data and applications in scientific workflows. Database technologies seem to be well suited to handle highly complex data management. However, most workflow management systems (WFMSs) utilize database technologies only to a limited extent. In this paper, we present a DB-integrated scientific workflow framework which adopts the object deputy model to describe the execution of a series of scientific tasks. This framework allows WFMS management operations to be performed in a way analogous to traditional data management operations. Most important of all, the data provenance method of this framework provides much higher performance than other methods. Three kinds of schemas for data provenance are proposed, and the performance of each schema is analyzed in this paper. [ABSTRACT FROM AUTHOR]
- Published
- 2006
4. An Efficient Indexing Technique for Computing High Dimensional Data Cubes.
- Author
Yu, Jeffrey Xu, Kitsuregawa, Masaru, Leong, Hong Va, Leng, Fangling, Bao, Yubin, Yu, Ge, Wang, Daling, and Liu, Yuntao
- Abstract
The computation of a data cube is one of the most essential but challenging issues in data warehousing and OLAP. Partition-based algorithms are among the efficient methods for computing data cubes over high-dimensionality, low-cardinality, moderate-size datasets, which occur in real applications like bioinformatics, statistics, and text processing. To deal with such high-dimensional data cubes, we propose an efficient indexing technique consisting of a compressed bitmap index and two algorithms for cube construction and querying. Experimental results show that our method saves at least 25% on storage space and about 30% on computation time compared with the Frag-Cubing algorithm. [ABSTRACT FROM AUTHOR]
- Published
- 2006
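As an aside on entry 4: the core of a bitmap-index approach to cube queries can be sketched in a few lines, here with plain (uncompressed) Python-integer bitsets. The paper's compressed index and its construction and query algorithms are not reproduced, and all names are illustrative.

```python
from collections import defaultdict

def build_bitmaps(rows):
    """One bitmap per (dimension, value) pair; bit r is set iff row r has that value."""
    bitmaps = defaultdict(int)
    for rid, row in enumerate(rows):
        for dim, val in enumerate(row):
            bitmaps[(dim, val)] |= 1 << rid
    return bitmaps

def cell_count(bitmaps, cell):
    """cell: {dimension: value}; ANDing the bitmaps yields the rows in the cube cell."""
    acc = None
    for dim, val in cell.items():
        b = bitmaps.get((dim, val), 0)
        acc = b if acc is None else acc & b
    return 0 if acc is None else bin(acc).count("1")

rows = [("a", "x"), ("a", "y"), ("b", "x")]
print(cell_count(build_bitmaps(rows), {0: "a", 1: "x"}))  # -> 1
```

High dimensionality with low cardinality keeps the number of bitmaps manageable, which is exactly the regime the abstract targets.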
5. A Framework for Query Reformulation Between Knowledge Base Peers.
- Author
Yu, Jeffrey Xu, Kitsuregawa, Masaru, Leong, Hong Va, Qin, Biao, Wang, Shan, and Du, Xiaoyong
- Abstract
The problem of sharing data in peer-to-peer environments has received considerable attention in recent years. However, knowledge sharing in peer architectures has received very little attention. This paper proposes a framework for query reformulation in peer architectures. We first consider a mapping language based on a particular description logic that includes class connectors. Then a set of rules is proposed for building graphs. Because the axioms in a knowledge base have different properties, our graph generation algorithm classifies the generated graphs into four sets (Ugraph, Bgraph, Cgraph and Dgraph). Furthermore, based on the properties of the unification nodes, our algorithms can reformulate each kind of atom in a special way. Finally, we conduct extensive simulation experiments, and the results show that the proposed method performs better than Mork's [8]. [ABSTRACT FROM AUTHOR]
- Published
- 2006
6. Load Shedding for Window Joins over Streams.
- Author
Yu, Jeffrey Xu, Kitsuregawa, Masaru, Leong, Hong Va, Han, Donghong, Xiao, Chuan, Zhou, Rui, Wang, Guoren, Huo, Huan, and Hui, Xiaoyun
- Abstract
We present a novel load shedding technique over sliding window joins. We first construct a dual window architectural model including join-windows and aux-windows. With the statistics built on aux-windows, an effective load shedding strategy is developed to produce maximum subset join outputs. For streams with high arrival rates, we propose an approach incorporating front-shedding and rear-shedding, and then address the problem of how to coordinate these two shedding processes through a series of calculations. Based on extensive experimentation with synthetic and real-life data, we show that our load shedding strategy delivers superior join output performance and dominates the existing strategies. [ABSTRACT FROM AUTHOR]
- Published
- 2006
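To make the load-shedding setting of entry 6 concrete, here is a sketch of a symmetric sliding-window join with simple random shedding. The paper's actual strategy is smarter, using aux-window statistics to choose which tuples to drop for maximum subset output, and is not reproduced here.

```python
import random
from collections import deque

def shed_join(tuples, window, keep_prob):
    """tuples: iterable of (timestamp, side, key), side in {'a', 'b'}, assumed
    in timestamp order. Each arrival survives with probability keep_prob
    (random shedding) and is joined against the opposite window on equal keys."""
    win = {"a": deque(), "b": deque()}
    out = []
    for ts, side, key in tuples:
        other = "b" if side == "a" else "a"
        for w in win.values():                      # expire old tuples
            while w and w[0][0] < ts - window:
                w.popleft()
        if random.random() > keep_prob:             # shed under overload
            continue
        out.extend((key, ots) for ots, okey in win[other] if okey == key)
        win[side].append((ts, key))
    return out
```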
7. Validating Semistructured Data Using OWL.
- Author
Yu, Jeffrey Xu, Kitsuregawa, Masaru, Leong, Hong Va, Li, Yuan Fang, Sun, Jing, Dobbie, Gillian, Sun, Jun, and Wang, Hai H.
- Abstract
Semistructured data has become prevalent in both web applications and database systems. This rapid growth in use makes the design of good semistructured data essential. Formal semantics and automated reasoning tools enable us to reveal the inconsistencies in a semistructured data model and its instances. The Object Relationship Attribute model for Semistructured data (ORA-SS) is a graphical notation for designing and representing semistructured data. This paper presents a methodology of encoding the semantics of ORA-SS in the Web Ontology Language (OWL) and automatically validating the semistructured data design using the OWL reasoning tool - RACER. Our methodology provides automated consistency checking of an ORA-SS data model at both the schema and instance levels. Keywords: Semistructured Data, Semantic Web, OWL, Formal Verification. [ABSTRACT FROM AUTHOR]
- Published
- 2006
8. Designing Quality XML Schemas from E-R Diagrams.
- Author
Yu, Jeffrey Xu, Kitsuregawa, Masaru, Leong, Hong Va, Liu, Chengfei, and Li, Jianxin
- Abstract
XML has emerged as the standard for representing, exchanging and integrating data on the Web. To guarantee the quality of XML documents, the design of quality XML Schemas becomes essential. In this paper, we look into this problem by designing quality XML Schemas from given E-R diagrams. We first discuss several criteria for designing a good XML Schema. Following these criteria, transformation rules are then devised that take all constructs of an E-R diagram into account. Finally, a recursive algorithm is developed to transform an E-R diagram into a corresponding quality XML Schema. [ABSTRACT FROM AUTHOR]
- Published
- 2006
9. Supporting Efficient Distributed Top-k Monitoring.
- Author
Yu, Jeffrey Xu, Kitsuregawa, Masaru, Leong, Hong Va, Deng, Bo, Jia, Yan, and Yang, Shuqiang
- Abstract
This paper addresses the efficient processing of distributed top-k monitoring, which continuously reports the k largest values according to a user-specified ranking function over distributed data streams. To minimize communication requirements, the data to be transmitted must be selected carefully. We study the optimization problem of which objects need to be transmitted and present a new distributed top-k monitoring algorithm to reduce communication cost. In our approach, only a few objects are transmitted to maintain the top-k set, and the communication cost is independent of k. We verify the effectiveness of our approach empirically using both real-world and synthetic data sets. We show that our approach reduces overall communication cost by a factor ranging from 2 to over an order of magnitude compared with the previous approach when k is no less than 10. [ABSTRACT FROM AUTHOR]
- Published
- 2006
10. Classifying E-Mails Via Support Vector Machine.
- Author
Yu, Jeffrey Xu, Kitsuregawa, Masaru, Leong, Hong Va, Shou, Lidan, Cui, Bin, Chen, Gang, and Dong, Jinxiang
- Abstract
To address the growing problem of junk E-mail on the Internet, this paper proposes an effective E-mail classifying technique. Our work handles E-mail messages as semi-structured documents consisting of a set of fields with predefined semantics and a number of variable-length free-text contents. The main contributions of this paper include the following: First, we present a Support Vector Machine (SVM) based model that incorporates the Principal Component Analysis (PCA) technique to reduce the data in terms of size and dimensionality of the input feature space. As a result, the input data become classifiable with fewer features, and the training process converges faster. Second, we build the classification model using both the C-support vector machine and ν-support vector machine algorithms. Various control parameters for performance tuning are studied in an extensive set of experiments. The results of our performance evaluation indicate that the proposed technique is effective in E-mail classification. [ABSTRACT FROM AUTHOR]
- Published
- 2006
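The PCA-plus-SVM pipeline of entry 10 is easy to approximate with scikit-learn. This is not the authors' implementation: the field-aware features are reduced to a plain bag of words, and TruncatedSVD stands in for PCA on sparse text vectors.

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.svm import SVC

emails = ["cheap meds buy now", "win a free prize today",
          "meeting agenda for monday", "draft of the quarterly report"]
labels = [1, 1, 0, 0]                    # 1 = junk, 0 = legitimate

clf = make_pipeline(
    TfidfVectorizer(),                   # text -> sparse feature vectors
    TruncatedSVD(n_components=2),        # dimensionality reduction (PCA stand-in)
    SVC(kernel="linear", C=1.0))         # C-SVM; sklearn.svm.NuSVC gives the nu-variant
clf.fit(emails, labels)
print(clf.predict(["free meds prize"]))
```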
11. Dynamic Data Distribution of High Level Architecture Based on Publication and Subscription Tree.
- Author
Yu, Jeffrey Xu, Kitsuregawa, Masaru, Leong, Hong Va, Liu, Yintian, Tang, Changjie, Li, Chuan, Zhu, Minfang, and Zeng, Tao
- Abstract
To ensure the efficiency of data exchange between simulation members via multicast groups in simulation systems based on the High Level Architecture (HLA), this paper proposes a novel method of dynamic data distribution based on a publication and subscription tree (PS-Tree). The main contributions of this paper include: (1) proposing the structure of the PS-Tree, which manifests the data exchange relationships between simulation members; (2) describing the method of dynamic data distribution based on the PS-Tree by mining association rules; and (3) analyzing the performance. Experiments show that this dynamic data distribution method distributes data efficiently and effectively. [ABSTRACT FROM AUTHOR]
- Published
- 2006
12. Tight Bounds on the Estimation Distance Using Wavelet.
- Author
Yu, Jeffrey Xu, Kitsuregawa, Masaru, Leong, Hong Va, Liu, Bing, Wang, Zhihui, Li, Jingtao, Wang, Wei, and Shi, Baile
- Abstract
Time series similarity search is of growing importance in many applications. Wavelet transforms are used as a dimensionality reduction technique to permit efficient similarity search over high-dimensional time series data. This paper proposes tight upper and lower bounds on the estimated distance under the wavelet transform, and we show that the traditional distance estimate is only part of our lower bound. According to the lower bound, we can exclude more dissimilar time series than the traditional method. According to the upper bound, we can directly judge whether two time series are similar, and further reduce the number of time series to process in the original time domain. The experiments show that using the tight upper and lower bounds significantly improves filter efficiency and reduces running time compared with the traditional method. [ABSTRACT FROM AUTHOR]
- Published
- 2006
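For entry 12, the classical wavelet lower bound that the paper tightens can be sketched as follows: an orthonormal Haar transform preserves Euclidean distance, so the distance over a prefix of coefficients never overestimates the true distance. The paper's tighter upper and lower bounds are not reproduced here.

```python
import math

def haar(x):
    """Orthonormal Haar transform; len(x) must be a power of two."""
    out, n = list(x), len(x)
    while n > 1:
        half = n // 2
        nxt = out[:]
        for i in range(half):
            a, b = out[2 * i], out[2 * i + 1]
            nxt[i] = (a + b) / math.sqrt(2)         # average part
            nxt[half + i] = (a - b) / math.sqrt(2)  # detail part
        out = nxt
        n = half
    return out

def lb_dist(x, y, k):
    """Distance over the first k coefficients: a valid lower bound, usable
    to filter out dissimilar series without touching the time domain."""
    cx, cy = haar(x), haar(y)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(cx[:k], cy[:k])))
```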
13. Counting Graph Matches with Adaptive Statistics Collection.
- Author
Yu, Jeffrey Xu, Kitsuregawa, Masaru, Leong, Hong Va, Feng, Jianhua, Qian, Qian, Liao, Yuguo, and Zhou, Lizhu
- Abstract
High-performance query processing over large-scale graph-structured data creates a pressing demand for high-quality statistics collection and selectivity estimation. Precise and succinct statistics collection about graph-structured data plays a crucial role in graph query selectivity estimation. In this paper, we propose SMT, the Succinct Markov Table, which achieves high precision in selectivity estimation with low memory consumption. The four core operations on an SMT are constructing, refining, compressing and estimating. The efficient SMTBuilder algorithm builds an adaptive statistics model in the form of an SMT. Versatile optimization rules, which investigate local bi-directional reachability, are introduced in SMT refining. During compressing, effective SMT grouping techniques are introduced. Statistical methods based on the SMT are used for selectivity estimation of various graph queries, especially twig queries. Through a thorough experimental study, we demonstrate SMT's advantages in accuracy and space over the previously known alternative, as well as the preferred optimization rules and compression techniques for different real-life data. [ABSTRACT FROM AUTHOR]
- Published
- 2006
14. A Novel Web Page Categorization Algorithm Based on Block Propagation Using Query-Log Information.
- Author
Yu, Jeffrey Xu, Kitsuregawa, Masaru, Leong, Hong Va, Dai, Wenyuan, Yu, Yong, Zhang, Cong-Le, Han, Jie, and Xue, Gui-Rong
- Abstract
Most existing web page classification algorithms, including content-based, link-based, and query-log analysis methods, treat pages as the smallest units. However, web pages usually contain some noisy or biased information which can affect the performance of classification. In this paper, we propose a Block Propagation Categorization (BPC) algorithm which mines web structure deeply and views blocks as basic semantic units. Moreover, with query-log information, BPC propagates only suitable information (blocks) among web pages to emphasize their topics. We also optimize the BPC algorithm to significantly speed up the block propagation process without losing any precision. Our experiments on ODP and MSN search engine logs show that BPC achieves a great improvement over traditional approaches. [ABSTRACT FROM AUTHOR]
- Published
- 2006
15. Error-Adaptive and Time-Aware Maintenance of Frequency Counts over Data Streams.
- Author
Yu, Jeffrey Xu, Kitsuregawa, Masaru, Leong, Hong Va, Liu, Hongyan, Lu, Ying, Han, Jiawei, and He, Jun
- Abstract
Maintaining frequency counts for items over data streams has a wide range of applications, such as web advertisement fraud detection. Study of this problem has attracted great attention from both researchers and practitioners, and many algorithms have been proposed. In this paper, we propose a new method, the error-adaptive pruning method, to maintain frequency counts more accurately. We also propose a method called fractionization to record time information together with the frequency information. Using these two methods, we design three algorithms for finding frequent items and top-k frequent items. Experimental results show that these methods are effective in improving maintenance accuracy. [ABSTRACT FROM AUTHOR]
- Published
- 2006
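As background for entry 15, the classic lossy-counting baseline (Manku and Motwani) that such frequency-maintenance work builds on fits in a dozen lines. The paper's error-adaptive pruning and time-aware fractionization refine this kind of structure and are not shown.

```python
def lossy_count(stream, epsilon):
    """Approximate frequency counts with undercount at most epsilon * N."""
    counts, bucket, width = {}, 1, int(1 / epsilon)
    for n, item in enumerate(stream, 1):
        f, delta = counts.get(item, (0, bucket - 1))
        counts[item] = (f + 1, delta)
        if n % width == 0:                      # bucket boundary: prune
            counts = {k: (f, d) for k, (f, d) in counts.items()
                      if f + d > bucket}
            bucket += 1
    return counts                               # item -> (count, max undercount)
```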
16. Dynamic Incremental Data Summarization for Hierarchical Clustering.
- Author
Yu, Jeffrey Xu, Kitsuregawa, Masaru, Leong, Hong Va, Liu, Bing, Shi, Yuliang, Wang, Zhihui, Wang, Wei, and Shi, Baile
- Abstract
In many real-world applications, where databases undergo frequent insertions and deletions, the ability of a data mining technique to detect and react quickly to dynamic changes in the data distribution and clustering over time is highly desired. Data summarizations (e.g., data bubbles) have been proposed to compress large databases into representative points suitable for subsequent hierarchical cluster analysis. In this paper, we thoroughly investigate the quality measure (data summarization index) of incremental data bubbles. We show which factors affect the mean and standard deviation of the data summarization index under database updates and which do not. Based on these findings, a fully dynamic scheme to maintain data bubbles incrementally is proposed. An extensive experimental evaluation confirms our findings and shows that the fully dynamic incremental data bubbles are effective in preserving the quality of the data summarization for hierarchical clustering. [ABSTRACT FROM AUTHOR]
- Published
- 2006
17. A New Method for Finding Approximate Repetitions in DNA Sequences.
- Author
Yu, Jeffrey Xu, Kitsuregawa, Masaru, Leong, Hong Va, Wang, Di, Wang, Guoren, Wu, Qingquan, Chen, Baichen, and Zhao, Yi
- Abstract
Searching for approximate repetitions in a DNA sequence has been an important topic in gene analysis. One of the problems in this study is that, because of the varying lengths of patterns, the similarity between patterns cannot be judged accurately using only the concept of ED (Edit Distance). In this paper we define a new similarity function, which considers both the difference and the sameness between patterns at the same time. Given the computational complexity, we also propose two new filter methods based on frequency distance and Pearson correlation, with which we can sort out the candidate set of approximate repetitions efficiently. We use SUA instead of a sliding window to obtain the fragments of a DNA sequence, so that the patterns of an approximate repetition have no length limitation. The results show that our technique finds more approximate repetitions than tandem repeat finder. [ABSTRACT FROM AUTHOR]
- Published
- 2006
18. TreeCluster: Clustering Results of Keyword Search over Databases.
- Author
Yu, Jeffrey Xu, Kitsuregawa, Masaru, Leong, Hong Va, Peng, Zhaohui, Zhang, Jun, Wang, Shan, and Qin, Lu
- Abstract
A critical challenge in keyword search over relational databases (KSORD) is to improve its result presentation to facilitate users' quick browsing through search results. An effective method is to organize the results into clusters. However, traditional clustering methods are not applicable to KSORD search results. In this paper, we propose a novel clustering method named TreeCluster. In the first step, we use labels to represent schema information of each result tree and reformulate the clustering problem as a problem of judging whether labeled trees are isomorphic. In the second step, we rank user keywords according to their frequencies in databases, and further partition the large clusters based on keyword nodes. Furthermore, we give each cluster a readable description, and present the description and each result graphically to help users understand the results more easily. Experimental results verify our method's effectiveness and efficiency. [ABSTRACT FROM AUTHOR]
- Published
- 2006
19. Scalable Clustering Using Graphics Processors.
- Author
Yu, Jeffrey Xu, Kitsuregawa, Masaru, Leong, Hong Va, Cao, Feng, Tung, Anthony K.H., and Zhou, Aoying
- Abstract
We present new algorithms for scalable clustering using graphics processors. Our basic approach is based on k-means. By changing the order of determining object labels, and exploiting the high computational power and pipeline of graphics processing units (GPUs) for distance computing and comparison, we speed up the k-means algorithm substantially. We introduce two strategies for retrieving data from the GPU, taking into account the low bandwidth from the GPU back to the main memory. We also extend our GPU-based approach to data stream clustering. We implement our algorithms on a PC with a Pentium IV 3.4 GHz CPU and an NVIDIA GeForce 6800 GT graphics card. Our comprehensive performance study shows that the common GPU in desktop computers can serve as an efficient co-processor to the CPU in traditional and data stream clustering. [ABSTRACT FROM AUTHOR]
- Published
- 2006
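The data-parallel core that entry 19 offloads to the GPU, computing all point-to-center distances and taking an argmin per point, can be illustrated with NumPy broadcasting as a stand-in for the GPU pipeline. This sketch omits the paper's GPU readback strategies and stream extension.

```python
import numpy as np

def assign_labels(points, centers):
    """All pairwise squared distances at once, then an argmin per point --
    the step the paper maps onto the GPU's parallel pipeline."""
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def kmeans(points, k, iters=20, seed=0):
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iters):
        labels = assign_labels(points, centers)
        for j in range(k):                      # mean update on the CPU side
            if (labels == j).any():
                centers[j] = points[labels == j].mean(axis=0)
    return labels, centers
```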
20. DGCL: An Efficient Density and Grid Based Clustering Algorithm for Large Spatial Database.
- Author
Yu, Jeffrey Xu, Kitsuregawa, Masaru, Leong, Hong Va, Kim, Ho Seok, Gao, Song, Xia, Ying, Kim, Gyoung Bae, and Bae, Hae Young
- Abstract
Spatial clustering, which groups similar objects based on their distance, connectivity, or their relative density in space, is an important component of spatial data mining. Clustering large data sets has always been a serious challenge for clustering algorithms, because a huge data set makes the clustering process extremely costly. In this paper, we propose DGCL, an enhanced Density-Grid based Clustering algorithm for Large spatial databases. The characteristics of a dense area can be enhanced by considering the influence of the surrounding area. Dense areas are analytically identified as clusters by removing sparse areas or outliers with the help of a density threshold. Synthetic datasets are used for testing, and the results show the superiority of our approach. [ABSTRACT FROM AUTHOR]
- Published
- 2006
21. Discovery of Temporal Frequent Patterns Using TFP-Tree.
- Author
Yu, Jeffrey Xu, Kitsuregawa, Masaru, Leong, Hong Va, Jin, Long, Lee, Yongmi, Seo, Sungbo, and Ryu, Keun Ho
- Abstract
Mining temporal frequent patterns in transaction databases, time-series databases, and many other kinds of databases has been widely studied in data mining research. Most of the previous studies adopt an Apriori-like candidate set generation-and-test approach. However, candidate set generation is still costly, especially when there exist prolific patterns and long patterns. In this paper, we propose an efficient temporal frequent pattern mining method using the TFP-tree (Temporal Frequent Pattern tree). This approach has three advantages: (i) the transaction database is scanned only once, significantly reducing the I/O cost; (ii) all transactions are stored in leaf nodes while only the star calendar patterns are saved in the internal nodes, which saves a large amount of memory; moreover, we divide the transactions into many partitions by the maximum size domain, which saves further memory; (iii) each star calendar pattern in an internal node is discovered efficiently using the frequent calendar patterns of the leaf nodes, which significantly reduces the computational time. Our performance study shows that the TFP-tree is efficient and scalable for mining, and is about an order of magnitude faster than the classical frequent pattern mining algorithms. [ABSTRACT FROM AUTHOR]
- Published
- 2006
22. Compressing Spatial and Temporal Correlated Data in Wireless Sensor Networks Based on Ring Topology.
- Author
Yu, Jeffrey Xu, Kitsuregawa, Masaru, Leong, Hong Va, Zhou, Siwang, Lin, Yaping, Wang, Jiliang, Zhang, Jianming, and Ouyang, Jingcheng
- Abstract
In this paper, we propose an algorithm for wavelet-based spatio-temporal data compression in wireless sensor networks. By employing a ring topology, the algorithm is capable of supporting a broad scope of wavelets that can simultaneously explore the spatial and temporal correlations among the sensory data. Furthermore, the ring-based topology is particularly effective in eliminating the "border effect" generally encountered by wavelet-based schemes. We propose a "Hybrid" decomposition-based wavelet transform instead of a wavelet transform based on the common dyadic decomposition, since temporal compression is local and far cheaper than spatial compression in sensor networks. We show that the optimal level of the wavelet transform differs under diverse sensor network circumstances. Theoretically and experimentally, we conclude that the proposed algorithm can effectively explore the spatial and temporal correlation in the sensory data and provides a significant reduction in energy consumption and delay compared to other schemes. [ABSTRACT FROM AUTHOR]
- Published
- 2006
23. Finding the Plateau in an Aggregated Time Series.
- Author
Yu, Jeffrey Xu, Kitsuregawa, Masaru, Leong, Hong Va, Wang, Min, and Wang, X. Sean
- Abstract
Given d input time series, an aggregated series can be formed by aggregating the d values at each time position. It is often useful to find the time positions whose aggregated values are the greatest. Instead of looking for individual top-k time positions, this paper gives two algorithms for finding the time interval (called the plateau) in which the aggregated values are close to each other (within a given threshold) and are all no smaller than the aggregated values outside of the interval. The first algorithm is a centralized one assuming that all data are available at a central location, and the other is a distributed search algorithm that does not require such a central location. The centralized algorithm has a linear time complexity with respect to the length of the time series, and the distributed algorithm employs the Threshold Algorithm by Fagin et al. and is quite efficient in reducing the communication cost as shown by the experiments reported in the paper. [ABSTRACT FROM AUTHOR]
- Published
- 2006
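Entry 23's centralized case can be sketched in linear time under a simplifying assumption (a single near-maximal region). The paper's algorithm handles the general case and the distributed setting via Fagin et al.'s Threshold Algorithm, neither of which is reproduced here.

```python
def plateau(a, theta):
    """Grow an interval around the global maximum while values stay within
    theta of it, then check that everything inside dominates everything
    outside; fall back to the top-1 position if the check fails."""
    m = max(a)
    i = a.index(m)
    s = e = i
    while s > 0 and m - a[s - 1] <= theta:
        s -= 1
    while e < len(a) - 1 and m - a[e + 1] <= theta:
        e += 1
    outside = a[:s] + a[e + 1:]
    if outside and min(a[s:e + 1]) < max(outside):
        return i, i
    return s, e

print(plateau([1, 2, 8, 9, 8, 3, 1], theta=1))  # -> (2, 4)
```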
24. Bulkloading Updates for Moving Objects.
- Author
Yu, Jeffrey Xu, Kitsuregawa, Masaru, Leong, Hong Va, Wang, Xiaoyuan, Sun, Weiwei, and Wang, Wei
- Abstract
Supporting frequent updates is a key challenge in moving object indexing. Most of the existing work regards the update as an individual process for each object, and a large number of separate updates are issued in update-intensive environments. In this paper, we propose bulkloading updates for moving objects (BLU). Based on a common framework, we propose three bulkloading schemes of different spatial biases. By grouping objects with nearby positions, BLU prefetches the nodes accessed on the shared update path and combines multiple disk accesses to the same node into one, which avoids I/O overhead for objects within the same group. We also propose a novel MBR-driven flushing algorithm, which utilizes the dynamic spatial correlation and improves the buffer hit ratio. The theoretical analysis and experimental evaluation demonstrate that BLU achieves good update performance without affecting query performance. [ABSTRACT FROM AUTHOR]
- Published
- 2006
25. Cache Consistency in Mobile XML Databases.
- Author
Yu, Jeffrey Xu, Kitsuregawa, Masaru, Leong, Hong Va, and Böttcher, Stefan
- Abstract
Whenever an XML database is used to provide transactional access to mobile clients in multi-hop networks, standard database technologies like query processing and concurrency control have to be adapted to fundamentally different requirements, including limited bandwidth and unforeseeable lost connections. We present a query processing approach that reduces XML data exchange to the exchange of difference XML fragments wherever possible. Additionally, within our approach transactions can even use cached results of outdated queries and of neighbor clients, wherever they result in a reduction of data exchange. Furthermore, our approach supports a pipelined exchange of queries and partial answers. Finally, we present a timestamp-based approach to concurrency control that guarantees cache consistency and minimizes data exchange between the mobile clients and the XML database server. [ABSTRACT FROM AUTHOR]
- Published
- 2006
26. MiniTasking: Improving Cache Performance for Multiple Query Workloads.
- Author
Yu, Jeffrey Xu, Kitsuregawa, Masaru, Leong, Hong Va, Zhang, Yan, Chen, Zhifeng, and Zhou, Yuanyuan
- Abstract
This paper proposes a novel idea, called MiniTasking, to reduce the number of cache misses by improving the data temporal locality for multiple concurrent queries. Our idea is based on the observation that, in many workloads such as decision support systems (DSS), there is usually a significant amount of data sharing among different concurrent queries. MiniTasking exploits such data sharing characteristics to improve data temporal locality by scheduling query execution at three levels: (1) It batches queries based on their data sharing characteristics and the cache configuration. (2) It groups operators that share certain data. (3) It schedules mini-tasks, which are small pieces of computation in operator groups, according to their data locality without violating their execution dependencies. Our experimental results show that MiniTasking can reduce the execution time by up to 12% for joins. For the TPC-H throughput test workload, MiniTasking improves the end performance by up to 20%. Even with the Partition Attributes Across (PAX) layout, MiniTasking further reduces the cache misses by 65% and the execution time by 9%. [ABSTRACT FROM AUTHOR]
- Published
- 2006
27. CCWrapper: Adaptive Predefined Schema Guided Web Extraction.
- Author
Yu, Jeffrey Xu, Kitsuregawa, Masaru, Leong, Hong Va, Gao, Jun, Yang, Dongqing, and Wang, Tengjiao
- Abstract
In this paper, we propose a method called CCWrapper (Classification-Cluster) to extract target data items from web pages under the guidance of a predefined schema. CCWrapper extracts and combines different HTML node features, including style, structure, thesaurus and data type attributes, into one unified model, and generates the extraction rules with Bayes classification in the training step. When a new HTML page is processed, CCWrapper generates the probability of being a target element for each HTML node and clusters the HTML nodes for extraction based on the intra-document relationships in the HTML document tree. Preliminary experimental results on real-life web sites demonstrate that CCWrapper is a promising extraction method. [ABSTRACT FROM AUTHOR]
- Published
- 2006
28. RecipeCrawler: Collecting Recipe Data from WWW Incrementally.
- Author
Yu, Jeffrey Xu, Kitsuregawa, Masaru, Leong, Hong Va, Li, Yu, Meng, Xiaofeng, Wang, Liping, and Li, Qing
- Abstract
The WWW has established itself as the largest data repository ever available in the history of humankind. Utilizing the Internet as a data source seems natural, and many efforts have been made. In this paper we focus on establishing a robust system to collect structured recipe data from the Web incrementally, which, we believe, is a critical step towards practical, continuous, reliable web data extraction systems, and therefore towards utilizing the WWW as a data source for various database applications. The reasons for advocating such an incremental approach are two-fold: (1) it is impractical to crawl all the recipe pages from relevant web sites, as the Web is highly dynamic; (2) it is almost impossible to induce a general wrapper for future extraction from the initial batch of recipe web pages. In this paper, we describe such a system, called RecipeCrawler, which targets the incremental collection of recipe data from the WWW. General issues in establishing an incremental data extraction system are considered, and the techniques are applied to recipe data collection from the Web. Our RecipeCrawler is actually used as the backend of a fully-fledged multimedia recipe database system being developed jointly by City University of Hong Kong and Renmin University of China. [ABSTRACT FROM AUTHOR]
- Published
- 2006
29. Crawling Web Pages with Support for Client-Side Dynamism.
- Author
Yu, Jeffrey Xu, Kitsuregawa, Masaru, Leong, Hong Va, Álvarez, Manuel, Pan, Alberto, Raposo, Juan, and Hidalgo, Justo
- Abstract
There is a great amount of information on the web that cannot be accessed by conventional crawler engines. This portion of the web is usually known as the Hidden Web. To deal with this problem, it is necessary to solve two tasks: crawling the client-side and crawling the server-side hidden web. In this paper we present an architecture and a set of related techniques for accessing the information placed in web pages with support for client-side dynamism, dealing with aspects such as JavaScript technology, non-standard session maintenance mechanisms, client redirections, pop-up menus, etc. Our approach leverages current browser APIs and implements novel crawling models and algorithms. [ABSTRACT FROM AUTHOR]
- Published
- 2006
30. Service Matchmaking Based on Semantics and Interface Dependencies.
- Author
Yu, Jeffrey Xu, Kitsuregawa, Masaru, Leong, Hong Va, Deng, Shuiguang, Wu, Jian, Li, Ying, and Wu, Zhaohui
- Abstract
Most current service matchmaking algorithms are based on one presupposition: that all inputs of a service are indispensable to each output of that service. However, this presupposition does not always hold. This paper analyses this presupposition and argues that it exerts a negative influence on the recall rate and precision of current matchmaking algorithms. A formal service model is then introduced, which extends the service profile of OWL-S. A new service matchmaking algorithm based on the model and semantics is proposed. Compared with other algorithms, the proposed one takes interface dependencies into consideration while performing matchmaking. This algorithm has been applied in a service composition framework called DartFlow. Our experimental data show that the novel service matchmaking algorithm outperforms others in terms of recall rate and precision. [ABSTRACT FROM AUTHOR]
- Published
- 2006
31. Optimizing the Profit of On-Demand Multimedia Service Via a Server-Dependent Queuing System.
- Author
Yu, Jeffrey Xu, Kitsuregawa, Masaru, Leong, Hong Va, and Lin, Pei-chun
- Abstract
This study presents a profit maximization model that adopts the number of requests for image or voice transferring services on a network as decision variables for when to switch a second server on and off based on the costs of using a second server and of users waiting. A Markovian queue with a number of servers depending upon queue length and finite capacity is discussed. The data of interarrival time and service times of requests are collected by observing a queuing system. An empirical Bayesian method is then applied to estimate the traffic intensity of the system, which denotes the need for host computers. The mean number of transfer requests in the system and the queue length of transfer requests are calculated as the characteristic values of the system. [ABSTRACT FROM AUTHOR]
- Published
- 2006
32. QoS-Aware Web Services Composition Using Transactional Composition Operator.
- Author
Yu, Jeffrey Xu, Kitsuregawa, Masaru, Leong, Hong Va, Liu, An, Huang, Liusheng, and Li, Qing
- Abstract
As composite web services are often long-lasting, loosely coupled, and cross application and administrative boundaries, transactional support is required. Most work has so far focused on relaxing some ACID properties of the traditional transaction model, with little done on investigating how transactions can influence the quality of service (QoS) of a composite web service. In this paper, a composition model is proposed to evaluate the QoS of a composite service with various transactional requirements. The proposed model is based on a transactional composition operator, which extends the traditional workflow patterns and integrates transactional properties. Using a recursive approach, the QoS of a composite service can easily be calculated, regardless of the transactional requirements given by service providers or end users. [ABSTRACT FROM AUTHOR]
- Published
- 2006
33. Role-Based Peer-to-Peer Model: Capture Global Pseudonymity for Privacy Protection.
- Author
Yu, Jeffrey Xu, Kitsuregawa, Masaru, Leong, Hong Va, Li, Zude, Zhan, Guoqiang, and Ye, Xiaojun
- Abstract
Peer-to-peer (P2P) resource dissemination has raised security concerns about privacy protection and intellectual property rights protection during resource dissemination over the network. To address these challenges, we propose the Role-Based P2P model, in which the role notion functions as the bridge component between users and resources to enforce secure resource dissemination together with the relevant constraints. The property rights attached to resources and users' private identity information are both protected as promised, by taking each local role as a permission set in the local centralized network and each global role as a user's pseudonym in the global decentralized network. Furthermore, we propose an access control algorithm to describe how access requests are handled by the role policy in the role-based hybrid P2P model. In addition, we illustrate the intra- and inter-access schemas as two kinds of access processes. The model is feasible, as its role structure and the connections with users and resources in an open environment are consistent with the application objectives. The model is extensible, as the role structure is also available for Purpose-Based Privacy Protection technologies. Keywords: Peer-to-Peer, Privacy Protection, Role, Pseudonymity. [ABSTRACT FROM AUTHOR]
- Published
- 2006
34. A Reputation Management Scheme Based on Global Trust Model for Peer-to-Peer Virtual Communities.
- Author
Yu, Jeffrey Xu, Kitsuregawa, Masaru, Leong, Hong Va, Li, Jingtao, Wang, Xueping, Liu, Bing, Wang, Qian, and Zhang, Gendu
- Abstract
Peer-to-peer virtual communities are often established dynamically with peers that are unrelated and unknown to each other. Peers have to manage the risk involved in transactions without prior knowledge about each other's reputation. SimiTrust, a reputation management scheme, is proposed for P2P virtual communities. A unique global trust value, computed by aggregating similarity-weighted recommendations of the peers who have interacted with a given peer, and reflecting the degree to which the community as a whole trusts that peer, is assigned to each peer in the community. Different from previous global-trust based schemes, SimiTrust does not need any pre-trusted peers to ensure algorithm convergence, and it drops the assumption that peers with high trust values will give honest recommendations. Theoretical analyses and experiments show that the scheme is still robust under more general conditions where malicious peers cooperate in an attempt to deliberately subvert the system, converges more quickly, and decreases the number of inauthentic files downloaded more effectively than previous schemes. [ABSTRACT FROM AUTHOR]
- Published
- 2006
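For entry 34, the general shape of a global-trust computation can be shown as an EigenTrust-style power iteration. SimiTrust's specific contributions, similarity-weighted recommendations and convergence without pre-trusted peers, are not captured by this baseline sketch.

```python
def global_trust(C, iters=50):
    """C[i][j]: peer i's normalized recommendation of peer j. Repeatedly
    aggregating recommendations weighted by the recommenders' own trust
    converges to a community-wide trust vector."""
    n = len(C)
    t = [1.0 / n] * n
    for _ in range(iters):
        t = [sum(t[i] * C[i][j] for i in range(n)) for j in range(n)]
        s = sum(t) or 1.0
        t = [x / s for x in t]
    return t

C = [[0.0, 0.7, 0.3],
     [0.5, 0.0, 0.5],
     [0.9, 0.1, 0.0]]
print(global_trust(C))                  # one global trust value per peer
```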
35. Object Placement and Caching Strategies on AN.P2P.
- Author
Yu, Jeffrey Xu, Kitsuregawa, Masaru, Leong, Hong Va, Mu, Su, Chi, Chi-Hung, Liu, Lin, and Wang, HongGuang
- Abstract
This paper discusses object placement and caching strategies on AN.P2P, an Application Networking infrastructure that implements pervasive content delivery on peer-to-peer networks. In general, AN.P2P allows a peer to send a content's original object with the associated content adaptation workflow to other peers. The recipient peer can reuse the original object to generate different content presentations. To achieve reasonable performance, this paper proposes several object placement schemes and caching strategies. Our simulation results show that these methods can effectively improve system performance in terms of query hops, computation cost and retrieval latency. [ABSTRACT FROM AUTHOR]
- Published
- 2006
36. Tracking Network-Constrained Moving Objects with Group Updates.
- Author
Yu, Jeffrey Xu, Kitsuregawa, Masaru, Leong, Hong Va, Chen, Jidong, Meng, Xiaofeng, Li, Benzhao, and Lai, Caifeng
- Abstract
Advances in wireless sensors and positioning technologies such as GPS enable location-based services that rely on the tracking of continuously changing positions of moving objects. The key issue in tracking techniques is how to minimize the number of updates while providing accurate locations for query results. In this paper, for tracking network-constrained moving objects, we first propose a simulation-based prediction model that gives more accurate location prediction for object movements in a traffic road network, which lowers the update frequency and assures the location precision. Then, according to their predicted future functions, objects are grouped, and only the central object in each group reports its location to the server. The group update strategy further reduces the total number of objects reporting their locations. A simulation study shows that the group update policy based on simulation prediction is superior to traditional update policies, with fewer updates and higher location precision. [ABSTRACT FROM AUTHOR]
- Published
- 2006
37. Dynamic Configuring Service on Semantic Grid.
- Author
Yu, Jeffrey Xu, Kitsuregawa, Masaru, Leong, Hong Va, and Zhu, Qing
- Abstract
Dynamically configured service composition can automatically leverage distributed service components and resources to compose an optimal configuration according to the requirements on the Semantic Grid. One major challenge is how to comprehend service-specific semantics and how to generate workflows that reuse common service composition functionalities. Current ontological specifications for semantically describing the properties of Grid services are limited to their static interface descriptions. In this paper, we present an automaton [1, 2] model in which service providers express their service-specific knowledge in the form of a service template and create a composition plan that is used by a synthesizer to perform dynamically configured composition automatically. Our main contribution is to formally describe the dynamic processing of composition and to take the QoS-driven composition goal into account to find the best-quality composition on the Semantic Grid. [ABSTRACT FROM AUTHOR]
- Published
- 2006
38. An Effective Approach for Hiding Sensitive Knowledge in Data Publishing.
- Author
Yu, Jeffrey Xu, Kitsuregawa, Masaru, Leong, Hong Va, Wang, Zhihui, Liu, Bing, Wang, Wei, Zhou, Haofeng, and Shi, Baile
- Abstract
Recent efforts have been made to address the problem of privacy preservation in data publishing. However, they mainly focus on preserving data privacy. In this paper, we address another aspect of privacy preservation in data publishing, where some of the knowledge implied by a dataset is regarded as private or sensitive information. In particular, we consider that the data are stored in a transaction database, and the knowledge is represented in the form of patterns. We present a data sanitization algorithm, called SanDB, for effectively protecting a set of sensitive patterns while attempting to minimize the impact of data sanitization on the non-sensitive patterns. The experimental results show that SanDB can achieve significant improvement over the best approach presented in the literature. [ABSTRACT FROM AUTHOR]
- Published
- 2006
39. Web Image Retrieval Refinement by Visual Contents.
- Author
Yu, Jeffrey Xu, Kitsuregawa, Masaru, Leong, Hong Va, Gong, Zhiguo, Liu, Qian, and Zhang, Jingbai
- Abstract
For Web image retrieval, two basic methods can be used for representing and indexing Web images. One is based on the associated text around the Web images; the other utilizes visual features of images, such as color, texture and shape, as the descriptions of Web images. However, those two methods are often applied independently in practice, and in fact both have limitations in supporting Web image retrieval. This paper proposes a novel model called 'multiplied refinement', which is better suited to combining those two basic methods. Our experiments compare three integration models, including the multiplied refinement model, the linear refinement model and the expansion model, and show that the proposed model yields very good performance. [ABSTRACT FROM AUTHOR]
- Published
- 2006
40. LRD: Latent Relation Discovery for Vector Space Expansion and Information Retrieval.
- Author
Yu, Jeffrey Xu, Kitsuregawa, Masaru, Leong, Hong Va, Gonçalves, Alexandre, Zhu, Jianhan, Song, Dawei, Uren, Victoria, and Pacheco, Roberto
- Abstract
In this paper, we propose a text mining method called LRD (latent relation discovery), which extends the traditional vector space model of document representation in order to improve information retrieval (IR) on documents and document clustering. Our LRD method extracts terms and entities, such as person, organization, or project names, and discovers relationships between them by taking into account their co-occurrence in textual corpora. Given a target entity, LRD discovers other entities closely related to the target effectively and efficiently. With respect to such relatedness, a measure of relation strength between entities is defined. LRD uses relation strength to enhance the vector space model, and uses the enhanced vector space model for query-based IR on documents and for clustering documents in order to discover complex relationships among terms and entities. Our experiments on a standard dataset for query-based IR show that our LRD method performs significantly better than the traditional vector space model and five other standard statistical methods for vector expansion. [ABSTRACT FROM AUTHOR]
- Published
- 2006
41. Succinct and Informative Cluster Descriptions for Document Repositories.
- Author
Yu, Jeffrey Xu, Kitsuregawa, Masaru, Leong, Hong Va, Chen, Lijun, and Dong, Guozhu
- Abstract
Large document repositories need to be organized, summarized and labeled in order to be used effectively. Previous clustering studies focused on organizing and paid little attention to producing cluster labels. Without informative labels, users need to browse many documents to get a sense of what the clusters contain. Human labeling of clusters is not viable when clustering is performed on demand or for very few users. It is desirable to automatically generate informative cluster descriptions (CDs), in order to give users a high-level sense of the clusters and to help repository managers produce the final cluster labels. This paper studies CDs in the form of small term sets for document clusters, and investigates how to measure the quality, or fidelity, of CDs and how to construct high-quality CDs. We propose to use a CD-based classification to simulate how CDs are interpreted, and to use the F-score of the classification to measure CD quality. Since directly searching for good CDs using the F-score is too expensive, we consider a surrogate quality measure, the CDD measure, which combines three factors: coverage, disjointness, and diversity. We give a search strategy for constructing CDs, namely a layer-based replacement method called PagodaCD. Experimental results show that the algorithm is efficient and can produce high-quality CDs. CDs produced by PagodaCD also exhibit a monotone quality behavior. [ABSTRACT FROM AUTHOR]
- Published
- 2006
42. LSM: Language Sense Model for Information Retrieval.
- Author
Yu, Jeffrey Xu, Kitsuregawa, Masaru, Leong, Hong Va, Bao, Shenghua, Zhang, Lei, Chen, Erdong, Long, Min, Li, Rui, and Yu, Yong
- Abstract
Much work has been done on drawing word senses into retrieval to deal with the word sense ambiguity problem, but most of it achieved negative results. In this paper, we first implement a WSD system for nouns and verbs, and then propose the language sense model (LSM) for information retrieval. The LSM combines the terms and senses of a document seamlessly through an EM algorithm. Retrieval on TREC collections shows that the LSM significantly outperforms both the vector space model (BM25) and the traditional language model for both medium and long queries (7.53%-16.90%). Based on the experiments, we can also empirically draw the conclusion that fine-grained senses improve retrieval performance when they are properly used. [ABSTRACT FROM AUTHOR]
- Published
- 2006
43. Keyword Extraction Using Support Vector Machine.
- Author
Yu, Jeffrey Xu, Kitsuregawa, Masaru, Leong, Hong Va, Zhang, Kuo, Xu, Hui, Tang, Jie, and Li, Juanzi
- Abstract
This paper is concerned with keyword extraction. By keyword extraction, we mean extracting a subset of words/phrases from a document that can describe the 'meaning' of the document. Keywords are of benefit to many text mining applications. However, a large number of documents do not have keywords, and thus it is necessary to assign keywords before enjoying their benefit. Several research efforts have been made on keyword extraction. These methods make use only of 'global context information', which limits the performance of extraction. A thorough and systematic investigation of the issue is thus needed. In this paper, we propose to make use of not only 'global context information' but also 'local context information' for extracting keywords from documents. As far as we know, utilizing both 'global context information' and 'local context information' in keyword extraction has not been sufficiently investigated previously. Methods for performing the task on the basis of Support Vector Machines are also proposed in this paper, and the features in the model are defined. Experimental results indicate that the proposed SVM-based method can significantly outperform the baseline methods for keyword extraction. The proposed method has been applied to document classification, a typical text mining process. Experimental results show that the accuracy of document classification can be significantly improved by using the keyword extraction method. [ABSTRACT FROM AUTHOR]
- Published
- 2006
44. Automated Extraction of Hit Numbers from Search Result Pages.
- Author
Yu, Jeffrey Xu, Kitsuregawa, Masaru, Leong, Hong Va, Ling, Yanyan, Meng, Xiaofeng, and Meng, Weiyi
- Abstract
When a query is submitted to a search engine, the search engine returns a dynamically generated result page that contains the number of hits (i.e., the number of matching results) for the query. Hit number is a very useful piece of information in many important applications such as obtaining document frequencies of terms, estimating the sizes of search engines and generating search engine summaries. In this paper, we propose a novel technique for automatically identifying the hit number for any search engine and any query. This technique consists of three steps: first segment each result page into a set of blocks, then identify the block(s) that contain the hit number using a machine learning approach, and finally extract the hit number from the identified block(s) by comparing the patterns in multiple blocks from the same search engine. Experimental results indicate that this technique is highly accurate. [ABSTRACT FROM AUTHOR]
- Published
- 2006
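As a contrast to the learning-based extractor in entry 44, a fixed-pattern baseline makes clear why the paper's approach is needed: hand-written regexes like the hypothetical ones below only cover a few page layouts, whereas the paper segments each page into blocks and learns which block carries the hit number.

```python
import re

HIT_PATTERNS = [                      # illustrative patterns only
    re.compile(r"of\s+(?:about\s+)?([\d,]+)\s+results", re.I),
    re.compile(r"([\d,]+)\s+(?:results|hits|matches)\s+found", re.I),
]

def extract_hit_number(page_text):
    for pat in HIT_PATTERNS:
        m = pat.search(page_text)
        if m:
            return int(m.group(1).replace(",", ""))
    return None                        # layouts these regexes miss

print(extract_hit_number("Results 1-10 of about 1,234,000 results (0.21 s)"))
```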
45. Efficient Evaluation of Multiple Queries on Streamed XML Fragments.
- Author
Yu, Jeffrey Xu, Kitsuregawa, Masaru, Leong, Hong Va, Huo, Huan, Zhou, Rui, Wang, Guoren, Hui, Xiaoyun, Xiao, Chuan, and Yu, Yongqian
- Abstract
With the prevalence of Web applications, expediting multiple queries over streaming XML has become a core challenge due to one-pass processing and limited resources. The recently proposed Hole-Filler model has low overhead for XML fragment transmission and evaluation; however, existing work addresses the multiple-query problem over XML tuple streams instead of XML fragment streams. By taking advantage of schema information for XML, this paper proposes a tid+ tree model to construct multiple queries over XML fragments and to prune off duplicate and dependent operations. Based on the tid+ tree, it then proposes the notion of an FQ-Index as the core of M-XFPro to index both multiple queries and XML fragments for processing multiple XPath queries involving both simple path and twig path patterns. We illustrate the effectiveness of the techniques developed with a detailed set of experiments. [ABSTRACT FROM AUTHOR]
- Published
- 2006
46. A New Structure for Accelerating XPath Location Steps.
- Author
Yu, Jeffrey Xu, Kitsuregawa, Masaru, Leong, Hong Va, Feng, Yaokai, and Makinouchi, Akifumi
- Abstract
Multidimensional indices have been successfully introduced to the field of querying XML data. Using the R*-tree, T. Grust proposed an interesting method to support all XPath axes. In that method, each node of an XML document is labeled with a five-dimensional descriptor, and all the nodes of the XML document are mapped to a point set in a five-dimensional space. Grust showed that each of the XPath axes can be implemented by a range query in this five-dimensional space; thus, the R*-tree can be used to improve the query performance of the XPath axes. However, according to our investigations, most of the range queries for the XPath axes are partially-dimensional range queries: the number of query dimensions in each range query is less than five, although the R*-tree is built in the five-dimensional space. If existing multidimensional indices are used for such range queries, a great deal of information that is irrelevant to the queries also has to be read from disk. Based on this observation, a new multidimensional index structure (called the Adaptive R*-tree) is proposed in this paper to support the XPath axes more efficiently. [ABSTRACT FROM AUTHOR]
- Published
- 2006
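The encoding behind entry 46 (due to Grust) is worth seeing concretely: with pre/post-order ranks per node, every XPath axis becomes a range predicate, which is what makes R-tree-style indexing applicable. A minimal two-dimensional sketch follows (the paper works with five-dimensional descriptors):

```python
def number(tree):
    """tree: (tag, [children]); returns (tag, pre, post) for every node."""
    out, ctr = [], {"pre": 0, "post": 0}
    def walk(node):
        tag, children = node
        pre = ctr["pre"]; ctr["pre"] += 1
        for c in children:
            walk(c)
        out.append((tag, pre, ctr["post"])); ctr["post"] += 1
    walk(tree)
    return out

def descendants(nodes, ctx):
    """Descendant axis as a range query: pre > pre(ctx) and post < post(ctx);
    the other axes correspond to the remaining regions of the plane."""
    _, cp, cq = ctx
    return [n for n in nodes if n[1] > cp and n[2] < cq]

doc = ("a", [("b", [("c", [])]), ("d", [])])
nodes = number(doc)
root = next(n for n in nodes if n[0] == "a")
print(descendants(nodes, root))        # b, c and d
```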
47. KCAM: Concentrating on Structural Similarity for XML Fragments.
- Author
Yu, Jeffrey Xu, Kitsuregawa, Masaru, Leong, Hong Va, Kong, Lingbo, Tang, Shiwei, Yang, Dongqing, Wang, Tengjiao, and Gao, Jun
- Abstract
This paper proposes a new method, KCAM, to measure the structural similarity of XML fragments satisfying given keywords. Its name derives directly from the key structure in the method, the Keyword Common Ancestor Matrix. The KCAM of an XML fragment is a k × k upper triangular matrix, in which each element a_{i,j} stores the level of the SLCA (Smallest Lowest Common Ancestor) node corresponding to the keywords k_i and k_j. The matrix distance between KCAMs, denoted KDist(·,·), can be used as an approximate structural similarity. KCAM is independent of the label information in fragments, and it can effectively distinguish structural differences between XML fragments. [ABSTRACT FROM AUTHOR]
- Published
- 2006
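Entry 47's matrix is straightforward to build in the simplified case of one match per keyword, representing each match by its root-to-node path; the SLCA level of a pair is then the length of the common path prefix. The handling of multiple matches per keyword follows the paper and is omitted here.

```python
import math

def kcam(paths):
    """paths[i]: root-to-match path (list of node ids) for keyword i.
    Element (i, j) stores the level of the common ancestor of matches i, j."""
    def level(p, q):
        lvl = 0
        while lvl < min(len(p), len(q)) and p[lvl] == q[lvl]:
            lvl += 1
        return lvl
    k = len(paths)
    return [[level(paths[i], paths[j]) if j >= i else 0 for j in range(k)]
            for i in range(k)]

def kdist(A, B):
    """Matrix distance between KCAMs, used as structural (dis)similarity."""
    return math.sqrt(sum((A[i][j] - B[i][j]) ** 2
                         for i in range(len(A)) for j in range(len(A))))

A = kcam([["r", "s", "t1"], ["r", "s", "t2"]])  # matches under one subtree
B = kcam([["r", "u1"], ["r", "u2"]])            # matches meeting only at the root
print(kdist(A, B))
```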
48. Spatial Index Compression for Location-Based Services Based on a MBR Semi-approximation Scheme.
- Author
Yu, Jeffrey Xu, Kitsuregawa, Masaru, Leong, Hong Va, Kim, Jongwan, Im, SeokJin, Kang, Sang-Won, and Hwang, Chong-Sun
- Abstract
The increased need for spatial data for location-based services or geographical information systems (GISs) in mobile computing has led to more research on spatial indexing, such as R-tree. The R-tree variants approximate spatial data to a minimal bounding rectangle (MBR). Most studies are based on adding or changing various options in R-tree, while a few studies have focused on increasing search performance via MBR compression. This study proposes a novel MBR compression scheme that uses semi-approximation (SA) MBRs and SAR-tree. Since SA decreases the size of MBR keys, halves QMBR enlargement, and increases node utilization, it improves the overall search performance. This scheme decreases quantized space more than existing quantization schemes do, and increases the utilization of each disk allocation unit. This study mathematically analyzes the number of node accesses and evaluates the performance of SAR-tree using real location data. The results show that the proposed index performs better than existing MBR compression schemes. [ABSTRACT FROM AUTHOR]
- Published
- 2006
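For entry 48, the classic quantized-MBR (QMBR) compression that the SA scheme improves on can be sketched as follows: each coordinate is stored as a small offset within the parent node's MBR, rounded outward so the quantized rectangle still contains the original. The semi-approximation details that halve this enlargement are the paper's contribution and are not shown.

```python
import math

def quantize_mbr(mbr, parent, bits=8):
    """(xl, yl, xh, yh) -> bits-bit offsets inside the parent MBR; lower
    corners round down and upper corners round up, so containment is kept."""
    levels = (1 << bits) - 1
    pxl, pyl, pxh, pyh = parent
    def down(v, lo, hi):
        return int((v - lo) * levels / (hi - lo))
    def up(v, lo, hi):
        return min(levels, math.ceil((v - lo) * levels / (hi - lo)))
    xl, yl, xh, yh = mbr
    return (down(xl, pxl, pxh), down(yl, pyl, pyh),
            up(xh, pxl, pxh), up(yh, pyl, pyh))

print(quantize_mbr((2.3, 4.1, 5.9, 7.2), (0.0, 0.0, 10.0, 10.0), bits=4))
```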
49. An Efficient Indexing Scheme for Moving Objects' Trajectories on Road Networks.
- Author
Yu, Jeffrey Xu, Kitsuregawa, Masaru, Leong, Hong Va, Chang, Jae-Woo, and Um, Jung-Ho
- Abstract
Even though moving objects usually move on spatial networks, there has been little research on trajectory indexing schemes for spatial networks, like road networks. In this paper, we propose an efficient indexing scheme for moving objects' trajectories on road networks. For this, we design a signature-based indexing scheme for efficiently dealing with the trajectories of current moving objects as well as for maintaining those of past moving objects. In addition, we provide both an insertion algorithm to store the initial information of moving objects' trajectories and one to store their segment information. We also provide a retrieval algorithm to find a set of moving objects whose trajectories match the segments of a query trajectory. Finally, we show that our indexing scheme achieves much better performance on trajectory retrieval than the leading trajectory indexing schemes, such as TB-tree and FNR-tree. [ABSTRACT FROM AUTHOR]
- Published
- 2006
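The signature idea in entry 49 can be illustrated with superimposed coding: hash each road segment of a trajectory to one bit and OR them together; a query's signature must then be covered by a trajectory's signature for the trajectory to be a candidate. This generic filter admits false positives but no false negatives; the paper's concrete signature design and its insertion and retrieval algorithms are not reproduced.

```python
def signature(segments, bits=64):
    """Superimposed-coding signature of a trajectory's road segments."""
    s = 0
    for seg in segments:
        s |= 1 << (hash(seg) % bits)
    return s

def may_contain(traj_sig, query_sig):
    """Filter step: candidate iff the trajectory signature covers the query's."""
    return traj_sig & query_sig == query_sig

trips = {"t1": ["r1", "r2", "r3"], "t2": ["r4", "r5"]}
sigs = {tid: signature(segs) for tid, segs in trips.items()}
q = signature(["r2", "r3"])
print([tid for tid, s in sigs.items() if may_contain(s, q)])  # -> ['t1']
```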
50. On-Demand Index for Efficient Structural Joins.
- Author
Yu, Jeffrey Xu, Kitsuregawa, Masaru, Leong, Hong Va, Wu, Kun-Lung, Chen, Shyh-Kwei, and Yu, Philip S.
- Abstract
A structural join finds all occurrences of a structural, or containment, relationship between two sets of XML node elements: ancestors and descendants. Prior approaches to structural joins mostly focus on maintaining offline indexes on disk or require the elements in both sets to be sorted; however, either can be expensive. More importantly, not all node elements are indexed or sorted beforehand. We present an on-demand, in-memory indexing approach to performing structural joins, with no need to sort the elements. We discover that there are similarities between the problems of structural joins and stabbing queries. However, previous work on stabbing queries, although efficient in search time, is not directly applicable to structural joins because of high storage costs. We develop two storage reduction techniques to alleviate the problem of high storage costs. Simulations show that our new method outperforms prior approaches. [ABSTRACT FROM AUTHOR]
- Published
- 2006
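To ground entry 50: with (start, end) region labels, an ancestor contains a descendant iff start(a) < start(d) and end(d) < end(a). The classic sort-based stack join below shows the operation itself; the paper's point is precisely to avoid the sorting step via an on-demand in-memory index, which this sketch does not attempt.

```python
def structural_join(ancestors, descendants):
    """ancestors, descendants: lists of (start, end) region labels."""
    anc, des = sorted(ancestors), sorted(descendants)
    out, stack, i = [], [], 0
    for ds, de in des:
        while i < len(anc) and anc[i][0] < ds:  # ancestors starting before d
            stack.append(anc[i]); i += 1
        while stack and stack[-1][1] < ds:      # drop those ending before d
            stack.pop()
        out.extend(((s, e), (ds, de)) for s, e in stack if de < e)
    return out

anc = [(1, 10), (2, 5)]
des = [(3, 4), (6, 7), (11, 12)]
print(structural_join(anc, des))
# [((1, 10), (3, 4)), ((2, 5), (3, 4)), ((1, 10), (6, 7))]
```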