Author: "Guha, Sudipto" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Guha, Sudipto"' showing total 339 results

1. Correlation Clustering in Data Streams

Author: Ahn, Kook Jin, Cormode, Graham, Guha, Sudipto, McGregor, Andrew, and Wirth, Anthony
Subjects: Computer Science - Data Structures and Algorithms
Abstract: Clustering is a fundamental tool for analyzing large data sets. A rich body of work has been devoted to designing data-stream algorithms for the relevant optimization problems such as $k$-center, $k$-median, and $k$-means. Such algorithms need to be both time and space efficient. In this paper, we address the problem of correlation clustering in the dynamic data stream model. The stream consists of updates to the edge weights of a graph on $n$ nodes and the goal is to find a node-partition such that the end-points of negative-weight edges are typically in different clusters whereas the end-points of positive-weight edges are typically in the same cluster. We present polynomial-time, $O(n\cdot \mbox{polylog}~n)$-space approximation algorithms for natural problems that arise. We first develop data structures based on linear sketches that allow the "quality" of a given node-partition to be measured. We then combine these data structures with convex programming and sampling techniques to solve the relevant approximation problem. Unfortunately, the standard LP and SDP formulations are not obviously solvable in $O(n\cdot \mbox{polylog}~n)$-space. Our work presents space-efficient algorithms for the convex programming required, as well as approaches to reduce the adaptivity of the sampling.
Published: 2018

2. Distributed Partial Clustering

Author: Guha, Sudipto, Li, Yi, and Zhang, Qin
Subjects: Computer Science - Data Structures and Algorithms
Abstract: Recent years have witnessed an increasing popularity of algorithm design for distributed data, largely due to the fact that massive datasets are often collected and stored in different locations. In the distributed setting communication typically dominates the query processing time. Thus it becomes crucial to design communication-efficient algorithms for queries on distributed data. Simultaneously, it has been widely recognized that partial optimizations, where we are allowed to disregard a small part of the data, provide us significantly better solutions. The motivation for disregarded points often arises from noise and other phenomena that are pervasive in large data scenarios. In this paper we focus on partial clustering problems, $k$-center, $k$-median and $k$-means, in the distributed model, and provide algorithms with communication sublinear in the input size. As a consequence we develop the first algorithms for the partial $k$-median and means objectives that run in subquadratic running time. We also initiate the study of distributed algorithms for clustering uncertain data, where each data point can possibly fall into multiple locations under a certain probability distribution. Comment: A preliminary version is to appear in the Proceedings of SPAA 2017
Published: 2017

3. Behavioral Intervention and Non-Uniform Bootstrap Percolation

Author: Ballen, Peter and Guha, Sudipto
Subjects: Mathematics - Probability, Computer Science - Social and Information Networks
Abstract: Bootstrap percolation is an often used model to study the spread of diseases, rumors, and information on sparse random graphs. The percolation process demonstrates a critical value such that the graph is either almost completely affected or almost completely unaffected based on the initial seed being larger or smaller than the critical value. To analyze intervention strategies we provide the first analytic determination of the critical value for basic bootstrap percolation in random graphs when the vertex thresholds are nonuniform and provide an efficient algorithm. This result also helps solve the problem of "Percolation with Coinflips" when the infection process is not deterministic, which has been a criticism about the model. We also extend the results to clustered random graphs thereby extending the classes of graphs considered. In these graphs the vertices are grouped in a small number of clusters; the clusters model a fixed communication network, and the edge probability depends on whether the vertices are in close or far clusters. We present simulations for both basic percolation and interventions that support our theoretical results.
Published: 2015
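
Illustrative note: the abstract above describes bootstrap percolation with nonuniform vertex thresholds. The following is a minimal sketch of the basic deterministic process only, under the usual assumption that a vertex becomes infected once the number of its infected neighbours reaches its own threshold. The function name `bootstrap_percolation` and its argument layout are hypothetical and are not taken from the paper; the critical-value analysis and the "Coinflips" variant are not reproduced here.

```python
from collections import deque


def bootstrap_percolation(adjacency, thresholds, seeds):
    """Run deterministic bootstrap percolation and return the final infected set.

    adjacency  -- dict mapping each vertex to an iterable of its neighbours
                  (undirected graph, so each edge appears in both lists)
    thresholds -- dict mapping each vertex to the number of infected
                  neighbours it needs before it becomes infected (nonuniform)
    seeds      -- iterable of initially infected vertices
    """
    infected = set(seeds)
    # Number of already-processed infected neighbours seen by each vertex.
    infected_neighbours = {v: 0 for v in adjacency}
    queue = deque(infected)

    while queue:
        u = queue.popleft()
        # Each infected vertex is processed exactly once, so every edge
        # contributes at most once to a healthy neighbour's counter.
        for v in adjacency[u]:
            if v in infected:
                continue
            infected_neighbours[v] += 1
            if infected_neighbours[v] >= thresholds[v]:
                infected.add(v)
                queue.append(v)
    return infected


# Path graph 0 - 1 - 2 - 3 where vertex 2 is harder to infect than the rest.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
need = {0: 1, 1: 1, 2: 2, 3: 1}
small_seed = bootstrap_percolation(path, need, [0])
large_seed = bootstrap_percolation(path, need, [0, 3])
```

In this toy instance the single seed stalls at vertex 2, while seeding both ends infects the whole path, which mirrors the seed-size sensitivity the abstract mentions.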

4. Correlation Clustering in Data Streams

Author: Ahn, Kook Jin, Cormode, Graham, Guha, Sudipto, McGregor, Andrew, and Wirth, Anthony
Published: 2021

5. Access to Data and Number of Iterations: Dual Primal Algorithms for Maximum Matching under Resource Constraints

Author: Ahn, Kook Jin and Guha, Sudipto
Subjects: Computer Science - Data Structures and Algorithms
Abstract: In this paper we consider graph algorithms in models of computation where the space usage (random accessible storage, in addition to the read only input) is sublinear in the number of edges $m$ and the access to input data is constrained. These questions arise in many natural settings, and in particular in the analysis of MapReduce or similar algorithms that model constrained parallelism with sublinear central processing. In SPAA 2011, Lattanzi et al. provided an $O(1)$ approximation of maximum matching using $O(p)$ rounds of iterative filtering via MapReduce and $O(n^{1+1/p})$ space of central processing for a graph with $n$ nodes and $m$ edges. We focus on weighted nonbipartite maximum matching in this paper. For any constant $p>1$, we provide an iterative sampling based algorithm for computing a $(1-\epsilon)$-approximation of the weighted nonbipartite maximum matching that uses $O(p/\epsilon)$ rounds of sampling, and $O(n^{1+1/p})$ space. The results extend to $b$-Matching with small changes. This paper combines adaptive sketching literature and fast primal-dual algorithms based on relaxed Dantzig-Wolfe decision procedures. Each round of sampling is implemented through linear sketches and executed in a single round of MapReduce. The paper also proves that nonstandard linear relaxations of a problem, in particular penalty based formulations, are helpful in MapReduce and similar settings in reducing the adaptive dependence of the iterations.
Published: 2013

6. Near Linear Time Approximation Schemes for Uncapacitated and Capacitated b--Matching Problems in Nonbipartite Graphs

Author: Ahn, Kook Jin and Guha, Sudipto
Subjects: Computer Science - Data Structures and Algorithms
Abstract: We present the first near optimal approximation schemes for the maximum weighted (uncapacitated or capacitated) $b$--matching problems for non-bipartite graphs that run in time (near) linear in the number of edges. For any $\delta>3/\sqrt{n}$ the algorithm produces a $(1-\delta)$ approximation in $O(m\,\mathrm{poly}(\delta^{-1},\log n))$ time. We provide fractional solutions for the standard linear programming formulations for these problems and subsequently also provide (near) linear time approximation schemes for rounding the fractional solutions. Through these problems as a vehicle, we also present several ideas in the context of solving linear programs approximately using fast primal-dual algorithms. First, even though the duals of these problems have exponentially many variables and an efficient exact computation of dual weights is infeasible, we show that we can efficiently compute and use a sparse approximation of the dual weights using a combination of (i) adding perturbation to the constraints of the polytope and (ii) amplification followed by thresholding of the dual weights. Second, we show that approximation algorithms can be used to reduce the width of the formulation and achieve faster convergence.
Published: 2013

7. Approximation Algorithms for Bayesian Multi-Armed Bandit Problems

Author: Guha, Sudipto and Munagala, Kamesh
Subjects: Computer Science - Data Structures and Algorithms, Computer Science - Learning
Abstract: In this paper, we consider several finite-horizon Bayesian multi-armed bandit problems with side constraints which are computationally intractable (NP-Hard) and for which no optimal (or near optimal) algorithms are known to exist with sub-exponential running time. All of these problems violate the standard exchange property, which assumes that the reward from the play of an arm is not contingent upon when the arm is played. Not only are index policies suboptimal in these contexts, there has been little analysis of such policies in these problem settings. We show that if we consider near-optimal policies, in the sense of approximation algorithms, then there exist (near) index policies. Conceptually, if we can find policies that satisfy an approximate version of the exchange property, namely, that the reward from the play of an arm depends on when the arm is played to within a constant factor, then we have an avenue towards solving these problems. However such an approximate version of the idling bandit property does not hold on a per-play basis and is shown to hold in a global sense. Clearly, such a property is not necessarily true of arbitrary single arm policies and finding such single arm policies is nontrivial. We show that by restricting the state spaces of arms we can find single arm policies and that these single arm policies can be combined into global (near) index policies where the approximate version of the exchange property is true in expectation. The number of different bandit problems that can be addressed by this technique already demonstrates its wide applicability. Comment: arXiv admin note: text overlap with arXiv:1011.1161
Published: 2013

8. REX: Recursive, Delta-Based Data-Centric Computation

Author: Mihaylov, Svilen R., Ives, Zachary G., and Guha, Sudipto
Subjects: Computer Science - Databases
Abstract: In today's Web and social network environments, query workloads include ad hoc and OLAP queries, as well as iterative algorithms that analyze data relationships (e.g., link analysis, clustering, learning). Modern DBMSs support ad hoc and OLAP queries, but most are not robust enough to scale to large clusters. Conversely, "cloud" platforms like MapReduce execute chains of batch tasks across clusters in a fault tolerant way, but have too much overhead to support ad hoc queries. Moreover, both classes of platform incur significant overhead in executing iterative data analysis algorithms. Most such iterative algorithms repeatedly refine portions of their answers, until some convergence criterion is reached. However, general cloud platforms typically must reprocess all data in each step. DBMSs that support recursive SQL are more efficient in that they propagate only the changes in each step -- but they still accumulate each iteration's state, even if it is no longer useful. User-defined functions are also typically harder to write for DBMSs than for cloud platforms. We seek to unify the strengths of both styles of platforms, with a focus on supporting iterative computations in which changes, in the form of deltas, are propagated from iteration to iteration, and state is efficiently updated in an extensible way. We present a programming model oriented around deltas, describe how we execute and optimize such programs in our REX runtime system, and validate that our platform also handles failures gracefully. We experimentally validate our techniques, and show speedups over the competing methods ranging from 2.5 to nearly 100 times. Comment: VLDB2012
Published: 2012

9. Laminar Families and Metric Embeddings: Non-bipartite Maximum Matching Problem in the Semi-Streaming Model

Author: Ahn, Kook Jin and Guha, Sudipto
Subjects: Computer Science - Data Structures and Algorithms
Abstract: In this paper, we study the non-bipartite maximum matching problem in the semi-streaming model. The maximum matching problem in the semi-streaming model has received a significant amount of attention lately. While the problem has been somewhat well solved for bipartite graphs, the known algorithms for non-bipartite graphs use $2^{\frac1\epsilon}$ passes or $n^{\frac1\epsilon}$ time to compute a $(1-\epsilon)$ approximation. In this paper we provide the first FPTAS (polynomial in $n,\frac1\epsilon$) for the problem which is efficient in both the running time and the number of passes. We also show that we can estimate the size of the matching in $O(\frac1\epsilon)$ passes using slightly superlinear space. To achieve both results, we use the structural properties of the matching polytope such as the laminarity of the tight sets and total dual integrality. The algorithms are iterative, and are based on the fractional packing and covering framework. However the formulations herein require exponentially many variables or constraints. We use laminarity, metric embeddings and graph sparsification to reduce the space required by the algorithms in between and across the iterations. This is the first use of these ideas in the semi-streaming model to solve a combinatorial optimization problem.
Published: 2011

10. Linear Programming in the Semi-streaming Model with Application to the Maximum Matching Problem

Author: Ahn, Kook Jin and Guha, Sudipto
Subjects: Computer Science - Data Structures and Algorithms
Abstract: In this paper, we study linear programming based approaches to the maximum matching problem in the semi-streaming model. The semi-streaming model has gained attention as a model for processing massive graphs as the importance of such graphs has increased. This is a model where edges are streamed in in an adversarial order and we are allowed a space proportional to the number of vertices in a graph. In recent years, there have been several new results in this semi-streaming model. However broad techniques such as linear programming have not been adapted to this model. We present several techniques to adapt and optimize linear programming based approaches in the semi-streaming model with an application to the maximum matching problem. As a consequence, we improve (almost) all previous results on this problem, and also prove new results on interesting variants.
Published: 2011

11. Multiarmed Bandit Problems with Delayed Feedback

Author: Guha, Sudipto, Munagala, Kamesh, and Pal, Martin
Subjects: Computer Science - Data Structures and Algorithms, Computer Science - Learning
Abstract: In this paper we initiate the study of optimization of bandit type problems in scenarios where the feedback of a play is not immediately known. This arises naturally in allocation problems which have been studied extensively in the literature, albeit in the absence of delays in the feedback. We study this problem in the Bayesian setting. In the presence of delays, no solution with provable guarantees is known to exist with sub-exponential running time. We show that bandit problems with delayed feedback that arise in allocation settings can be forced to have significant structure, with a slight loss in optimality. This structure gives us the ability to reason about the relationship of single arm policies to the entangled optimum policy, and eventually leads to an O(1) approximation for a significantly general class of priors. The structural insights we develop are of key interest and carry over to the setting where the feedback of an action is available instantaneously, and we improve all previous results in this setting as well. Comment: The results and presentation in this paper are subsumed by the article "Approximation algorithms for Bayesian multi-armed bandit problems" arXiv:1306.3525
Published: 2010

12. Approximation Schemes for Sequential Posted Pricing in Multi-Unit Auctions

Author: Chakraborty, Tanmoy, Even-Dar, Eyal, Guha, Sudipto, Mansour, Yishay, and Muthukrishnan, S.
Subjects: Computer Science - Computer Science and Game Theory, Computer Science - Data Structures and Algorithms, F.2.2, J.4
Abstract: We design algorithms for computing approximately revenue-maximizing {\em sequential posted-pricing mechanisms (SPM)} in $K$-unit auctions, in a standard Bayesian model. A seller has $K$ copies of an item to sell, and there are $n$ buyers, each interested in only one copy, who have some value for the item. The seller must post a price for each buyer, the buyers arrive in a sequence enforced by the seller, and a buyer buys the item if its value exceeds the price posted to it. The seller does not know the values of the buyers, but has Bayesian information about them. An SPM specifies the ordering of buyers and the posted prices, and may be {\em adaptive} or {\em non-adaptive} in its behavior. The goal is to design SPM in polynomial time to maximize expected revenue. We compare against the expected revenue of optimal SPM, and provide a polynomial time approximation scheme (PTAS) for both non-adaptive and adaptive SPMs. This is achieved by two algorithms: an efficient algorithm that gives a $(1-\frac{1}{\sqrt{2\pi K}})$-approximation (and hence a PTAS for sufficiently large $K$), and another that is a PTAS for constant $K$. The first algorithm yields a non-adaptive SPM that yields its approximation guarantees against an optimal adaptive SPM -- this implies that the {\em adaptivity gap} in SPMs vanishes as $K$ becomes larger. Comment: 16 pages
Published: 2010

13. Selective Call Out and Real Time Bidding

Author: Chakraborty, Tanmoy, Even-Dar, Eyal, Guha, Sudipto, Mansour, Yishay, and Muthukrishnan, S.
Subjects: Computer Science - Computer Science and Game Theory, Computer Science - Data Structures and Algorithms, F.2.2, J.4
Abstract: Ads on the Internet are increasingly sold via ad exchanges such as RightMedia, AdECN and Doubleclick Ad Exchange. These exchanges allow real-time bidding, that is, each time the publisher contacts the exchange, the exchange "calls out" to solicit bids from ad networks. This aspect of soliciting bids introduces a novel aspect, in contrast to existing literature. This suggests developing a joint optimization framework which optimizes over the allocation as well as solicitation. We model this selective call out as an online recurrent Bayesian decision framework with bandwidth type constraints. We obtain natural algorithms with bounded performance guarantees for several natural optimization criteria. We show that these results hold under different call out constraint models, and different arrival processes. Interestingly, the paper shows that under MHR assumptions, the expected revenue of generalized second price auction with reserve is a constant factor of the expected welfare. Also the analysis herein allows us to prove adaptivity gap type results for the adwords problem. Comment: 24 pages, 10 figures
Published: 2010

14. Graph Sparsification in the Semi-streaming Model

Author: Ahn, Kook Jin and Guha, Sudipto
Subjects: Computer Science - Data Structures and Algorithms
Abstract: Analyzing massive data sets has been one of the key motivations for studying streaming algorithms. In recent years, there has been significant progress in analysing distributions in a streaming setting, but the progress on graph problems has been limited. A main reason for this has been the existence of linear space lower bounds for even simple problems such as determining the connectedness of a graph. However, in many new scenarios that arise from social and other interaction networks, the number of vertices is significantly less than the number of edges. This has led to the formulation of the semi-streaming model where we assume that the space is (near) linear in the number of vertices (but not necessarily the edges), and the edges appear in an arbitrary (and possibly adversarial) order. In this paper we focus on graph sparsification, which is one of the major building blocks in a variety of graph algorithms. There has been a long history of (non-streaming) sampling algorithms that provide sparse graph approximations and it is a natural question to ask if the sparsification can be achieved using a small space, and in addition using a single pass over the data. The question is interesting from the standpoint of both theory and practice and we answer the question in the affirmative, by providing a one pass $\tilde{O}(n/\epsilon^{2})$ space algorithm that produces a sparsification that approximates each cut to a $(1+\epsilon)$ factor. We also show that $\Omega(n \log \frac1\epsilon)$ space is necessary for a one pass streaming algorithm to approximate the min-cut, improving upon the $\Omega(n)$ lower bound that arises from lower bounds for testing connectivity.
Published: 2009

15. Adaptive Uncertainty Resolution in Bayesian Combinatorial Optimization Problems

Author: Guha, Sudipto and Munagala, Kamesh
Subjects: Computer Science - Data Structures and Algorithms, F.2
Abstract: In several applications such as databases, planning, and sensor networks, parameters such as selectivity, load, or sensed values are known only with some associated uncertainty. The performance of such a system (as captured by some objective function over the parameters) is significantly improved if some of these parameters can be probed or observed. In a resource constrained situation, deciding which parameters to observe in order to optimize system performance itself becomes an interesting and important optimization problem. This general problem is the focus of this paper. One of the most important considerations in this framework is whether adaptivity is required for the observations. Adaptive observations introduce blocking or sequential operations in the system whereas non-adaptive observations can be performed in parallel. One of the important questions in this regard is to characterize the benefit of adaptivity for probes and observation. We present general techniques for designing constant factor approximations to the optimal observation schemes for several widely used scheduling and metric objective functions. We show a unifying technique that relates this optimization problem to the outlier version of the corresponding deterministic optimization. By making this connection, our technique shows constant factor upper bounds for the benefit of adaptivity of the observation schemes. We show that while probing yields significant improvement in the objective function, being adaptive about the probing is not beneficial beyond constant factors. Comment: Journal version of the paper "Model-driven Optimization using Adaptive Probes" that appeared in the ACM-SIAM Symposium on Discrete Algorithms (SODA), 2007
Published: 2008

16. Sequential Design of Experiments via Linear Programming

Author: Guha, Sudipto and Munagala, Kamesh
Subjects: Computer Science - Data Structures and Algorithms
Abstract: The celebrated multi-armed bandit problem in decision theory models the basic trade-off between exploration, or learning about the state of a system, and exploitation, or utilizing the system. In this paper we study the variant of the multi-armed bandit problem where the exploration phase involves costly experiments and occurs before the exploitation phase; and where each play of an arm during the exploration phase updates a prior belief about the arm. The problem of finding an inexpensive exploration strategy to optimize a certain exploitation objective is NP-Hard even when a single play reveals all information about an arm, and all exploration steps cost the same. We provide the first polynomial time constant-factor approximation algorithm for this class of problems. We show that this framework also generalizes several problems of interest studied in the context of data acquisition in sensor networks. Our analysis also extends to switching and setup costs, and to concave utility objectives. Our solution approach is via a novel linear program rounding technique based on stochastic packing. In addition to yielding exploration policies whose performance is within a small constant factor of the adaptive optimal policy, a nice feature of this approach is that the resulting policies explore the arms sequentially without revisiting any arm. Sequentiality is a well-studied concept in decision theory, and is very desirable in domains where multiple explorations can be conducted in parallel, for instance, in the sensor network context. Comment: The results and presentation in this paper are subsumed by the article "Approximation algorithms for Bayesian multi-armed bandit problems" http://arxiv.org/abs/1306.3525
Published: 2008

17. Information Acquisition and Exploitation in Multichannel Wireless Networks

Author: Guha, Sudipto, Munagala, Kamesh, and Sarkar, Saswati
Subjects: Computer Science - Data Structures and Algorithms, Computer Science - Networking and Internet Architecture, F.2
Abstract: A wireless system with multiple channels is considered, where each channel has several transmission states. A user learns about the instantaneous state of an available channel by transmitting a control packet in it. Since probing all channels consumes significant energy and time, a user needs to determine what and how much information it needs to acquire about the instantaneous states of the available channels so that it can maximize its transmission rate. This motivates the study of the trade-off between the cost of information acquisition and its value towards improving the transmission rate. A simple model is presented for studying this information acquisition and exploitation trade-off when the channels are multi-state, with different distributions and information acquisition costs. The objective is to maximize a utility function which depends on both the cost and value of information. Solution techniques are presented for computing near-optimal policies with succinct representation in polynomial time. These policies provably achieve at least a fixed constant factor of the optimal utility on any problem instance, and in addition, have natural characterizations. The techniques are based on exploiting the structure of the optimal policy, and use of Lagrangean relaxations which simplify the space of approximately optimal solutions. Comment: 29 pages
Published: 2008

18. Approximation Algorithms for Restless Bandit Problems

Author: Guha, Sudipto, Munagala, Kamesh, and Shi, Peng
Subjects: Computer Science - Data Structures and Algorithms, F.2.2, G.3
Abstract: The restless bandit problem is one of the most well-studied generalizations of the celebrated stochastic multi-armed bandit problem in decision theory. In its ultimate generality, the restless bandit problem is known to be PSPACE-Hard to approximate to any non-trivial factor, and little progress has been made despite its importance in modeling activity allocation under uncertainty. We consider a special case that we call Feedback MAB, where the reward obtained by playing each of n independent arms varies according to an underlying on/off Markov process whose exact state is only revealed when the arm is played. The goal is to design a policy for playing the arms in order to maximize the infinite horizon time average expected reward. This problem is also an instance of a Partially Observable Markov Decision Process (POMDP), and is widely studied in wireless scheduling and unmanned aerial vehicle (UAV) routing. Unlike the stochastic MAB problem, the Feedback MAB problem does not admit greedy index-based optimal policies. We develop a novel and general duality-based algorithmic technique that yields a surprisingly simple and intuitive 2+epsilon-approximate greedy policy to this problem. We then define a general sub-class of restless bandit problems that we term Monotone bandits, for which our policy is a 2-approximation. Our technique is robust enough to handle generalizations of these problems to incorporate various side-constraints such as blocking plays and switching costs. This technique is also of independent interest for other restless bandit problems. By presenting the first (and efficient) O(1) approximations for non-trivial instances of restless bandits as well as of POMDPs, our work initiates the study of approximation algorithms in both these contexts. Comment: Merges two papers appearing in the FOCS '07 and SODA '09 conferences. This final version has been submitted for journal publication
Published: 2007

19. Approximation algorithms for wavelet transform coding of data streams

Author: Guha, Sudipto and Harb, Boulos
Subjects: Computer Science - Data Structures and Algorithms, G.1.2
Abstract: This paper addresses the problem of finding a B-term wavelet representation of a given discrete function $f \in \mathbb{R}^n$ whose distance from f is minimized. The problem is well understood when we seek to minimize the Euclidean distance between f and its representation. The first known algorithms for finding provably approximate representations minimizing general $\ell_p$ distances (including $\ell_\infty$) under a wide variety of compactly supported wavelet bases are presented in this paper. For the Haar basis, a polynomial time approximation scheme is demonstrated. These algorithms are applicable in the one-pass sublinear-space data stream model of computation. They generalize naturally to multiple dimensions and weighted norms. A universal representation that provides a provable approximation guarantee under all p-norms simultaneously, and the first approximation algorithms for bit-budget versions of the problem, known as adaptive quantization, are also presented. Further, it is shown that the algorithms presented here can be used to select a basis from a tree-structured dictionary of bases and find a B-term representation of the given function that provably approximates its best dictionary-basis representation. Comment: Added a universal representation that provides a provable approximation guarantee under all p-norms simultaneously
Published: 2006
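
Illustrative note: the abstract above remarks that the B-term problem is well understood for the Euclidean distance. A minimal sketch of that classical case follows: with an orthonormal Haar basis, keeping the B coefficients of largest magnitude minimizes the Euclidean error, by Parseval's identity. This is not the paper's $\ell_p$ or streaming algorithm. The function names are hypothetical, and the input length is assumed to be a power of two.

```python
import math

SQRT2 = math.sqrt(2.0)


def haar_forward(values):
    """Orthonormal Haar transform of a sequence whose length is a power of two."""
    coeffs = [float(v) for v in values]
    n = len(coeffs)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a positive power of two")
    length = n
    while length > 1:
        half = length // 2
        # Scaled pairwise sums (coarse part) and differences (detail part).
        sums = [(coeffs[2 * i] + coeffs[2 * i + 1]) / SQRT2 for i in range(half)]
        diffs = [(coeffs[2 * i] - coeffs[2 * i + 1]) / SQRT2 for i in range(half)]
        coeffs[:length] = sums + diffs
        length = half
    return coeffs


def haar_inverse(coeffs):
    """Inverse of haar_forward."""
    out = list(coeffs)
    n = len(out)
    length = 1
    while length < n:
        merged = []
        for s, d in zip(out[:length], out[length:2 * length]):
            merged.append((s + d) / SQRT2)
            merged.append((s - d) / SQRT2)
        out[:2 * length] = merged
        length *= 2
    return out


def best_b_term_l2(values, b):
    """Keep the b largest-magnitude Haar coefficients and reconstruct."""
    coeffs = haar_forward(values)
    order = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)
    kept = set(order[:b])
    sparse = [c if i in kept else 0.0 for i, c in enumerate(coeffs)]
    return haar_inverse(sparse)


# A constant signal needs only the single coarsest coefficient.
flat = best_b_term_l2([4, 4, 4, 4], 1)
```

For other $\ell_p$ norms this greedy choice is no longer optimal in general, which is the gap the paper addresses.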

20. Streaming and Sublinear Approximation of Entropy and Information Distances

Author: Guha, Sudipto, McGregor, Andrew, and Venkatasubramanian, Suresh
Subjects: Computer Science - Data Structures and Algorithms, Computer Science - Information Theory
Abstract: In many problems in data mining and machine learning, data items that need to be clustered or classified are not points in a high-dimensional space, but are distributions (points on a high dimensional simplex). For distributions, natural measures of distance are not the $\ell_p$ norms and variants, but information-theoretic measures like the Kullback-Leibler distance, the Hellinger distance, and others. Efficient estimation of these distances is a key component in algorithms for manipulating distributions. Thus, sublinear resource constraints, either in time (property testing) or space (streaming) are crucial. We start by resolving two open questions regarding property testing of distributions. Firstly, we show a tight bound for estimating bounded, symmetric f-divergences between distributions in a general property testing (sublinear time) framework (the so-called combined oracle model). This yields optimal algorithms for estimating such well known distances as the Jensen-Shannon divergence and the Hellinger distance. Secondly, we close a $(\log n)/H$ gap between upper and lower bounds for estimating entropy $H$ in this model. In a stream setting (sublinear space), we give the first algorithm for estimating the entropy of a distribution. Our algorithm runs in polylogarithmic space and yields an asymptotic constant factor approximation scheme. We also provide other results along the space/time/approximation tradeoff curve. Comment: 18 pages
Published: 2005

21. How far will you walk to find your shortcut: Space Efficient Synopsis Construction Algorithms

Author: Guha, Sudipto
Subjects: Computer Science - Data Structures and Algorithms, Computer Science - Databases
Abstract: In this paper we consider the wavelet synopsis construction problem without the restriction that we only choose a subset of coefficients of the original data. We provide the first near optimal algorithm. We arrive at the above algorithm by considering space efficient algorithms for the restricted version of the problem. In this context we improve previous algorithms by almost a linear factor and reduce the required space to almost linear. Our techniques also extend to histogram construction, and improve the space-running time tradeoffs for V-Opt and range query histograms. We believe the idea applies to a broad range of dynamic programs and demonstrate it by showing improvements in a knapsack-like setting seen in construction of Extended Wavelets.
Published: 2005

22. Clustering Data Streams

Author: Guha, Sudipto, Mishra, Nina, Carey, Michael J., Series editor, Ceri, Stefano, Series editor, Garofalakis, Minos, editor, Gehrke, Johannes, editor, and Rastogi, Rajeev, editor
Published: 2016
Full Text: View/download PDF

23. Improving the performance of list intersection

Author: Tsirogiannis, Dimitris, Guha, Sudipto, and Koudas, Nick
Abstract: List intersection is a central operation, utilized excessively for query processing on text and databases. We present list intersection algorithms for an arbitrary number of sorted and unsorted lists tailored to the characteristics of modern hardware architectures. Two new list intersection algorithms are presented for sorted lists. The first algorithm, termed Dynamic Probes, dynamically decides the probing order on the lists exploiting information from previous probes at runtime. This information is utilized as a cache-resident microindex. The second algorithm, termed Quantile-based, deduces in advance a good probing order, thus avoiding the overhead of adaptivity and is based on detecting lists with non-uniform distribution of document identifiers. For unsorted lists, we present a novel hash-based algorithm that avoids the overhead of sorting. A detailed experimental evaluation is presented based on real and synthetic data using existing chip multiprocessor architectures with eight cores, validating the efficiency and efficacy of the proposed algorithms.
Published: 2024
Full Text: View/download PDF

24. Learning to create data-integrating queries

Author: Talukdar, Partha Pratim, Jacob, Marie, Mehmood, Muhammad Salman, Crammer, Koby, Ives, Zachary G., Pereira, Fernando, and Guha, Sudipto
Abstract: The number of potentially-related data resources available for querying --- databases, data warehouses, virtual integrated schemas --- continues to grow rapidly. Perhaps no area has seen this problem as acutely as the life sciences, where hundreds of large, complex, interlinked data resources are available on fields like proteomics, genomics, disease studies, and pharmacology. The schemas of individual databases are often large on their own, but users also need to pose queries across multiple sources, exploiting foreign keys and schema mappings. Since the users are not experts, they typically rely on the existence of pre-defined Web forms and associated query templates, developed by programmers to meet the particular scientists' needs. Unfortunately, such forms are scarce commodities, often limited to a single database, and mismatched with biologists' information needs that are often context-sensitive and span multiple databases. We present a system with which a non-expert user can author new query templates and Web forms, to be reused by anyone with related information needs. The user poses keyword queries that are matched against source relations and their attributes; the system uses sequences of associations (e.g., foreign keys, links, schema mappings, synonyms, and taxonomies) to create multiple ranked queries linking the matches to keywords; the set of queries is attached to a Web query form. Now the user and his or her associates may pose specific queries by filling in parameters in the form. Importantly, the answers to this query are ranked and annotated with data provenance, and the user provides feedback on the utility of the answers, from which the system ultimately learns to assign costs to sources and associations according to the user's specific information need, as a result changing the ranking of the queries used to generate results. We evaluate the effectiveness of our method against "gold standard" costs from domain experts and demonstrate the method's scalability.
Published: 2024
Full Text: View/download PDF

25. Approximate Indexability and Bandit Problems with Concave Rewards and Delayed Feedback

Author: Guha, Sudipto, Munagala, Kamesh, Hutchison, David, editor, Kanade, Takeo, editor, Kittler, Josef, editor, Kleinberg, Jon M., editor, Mattern, Friedemann, editor, Mitchell, John C., editor, Naor, Moni, editor, Nierstrasz, Oscar, editor, Pandu Rangan, C., editor, Steffen, Bernhard, editor, Sudan, Madhu, editor, Terzopoulos, Demetri, editor, Tygar, Doug, editor, Vardi, Moshe Y., editor, Weikum, Gerhard, editor, Raghavendra, Prasad, editor, Raskhodnikova, Sofya, editor, Jansen, Klaus, editor, and Rolim, José D. P., editor
Published: 2013
Full Text: View/download PDF

26. Spectral Sparsification in Dynamic Graph Streams

Author: Ahn, Kook Jin, Guha, Sudipto, McGregor, Andrew, Hutchison, David, editor, Kanade, Takeo, editor, Kittler, Josef, editor, Kleinberg, Jon M., editor, Mattern, Friedemann, editor, Mitchell, John C., editor, Naor, Moni, editor, Nierstrasz, Oscar, editor, Pandu Rangan, C., editor, Steffen, Bernhard, editor, Sudan, Madhu, editor, Terzopoulos, Demetri, editor, Tygar, Doug, editor, Vardi, Moshe Y., editor, Weikum, Gerhard, editor, Raghavendra, Prasad, editor, Raskhodnikova, Sofya, editor, Jansen, Klaus, editor, and Rolim, José D. P., editor
Published: 2013
Full Text: View/download PDF

27. Multi-armed Bandits with Metric Switching Costs

Author: Guha, Sudipto, Munagala, Kamesh, Hutchison, David, Series editor, Kanade, Takeo, Series editor, Kittler, Josef, Series editor, Kleinberg, Jon M., Series editor, Mattern, Friedemann, Series editor, Mitchell, John C., Series editor, Naor, Moni, Series editor, Nierstrasz, Oscar, Series editor, Pandu Rangan, C., Series editor, Steffen, Bernhard, Series editor, Sudan, Madhu, Series editor, Terzopoulos, Demetri, Series editor, Tygar, Doug, Series editor, Vardi, Moshe Y., Series editor, Weikum, Gerhard, Series editor, Albers, Susanne, editor, Marchetti-Spaccamela, Alberto, editor, Matias, Yossi, editor, Nikoletseas, Sotiris, editor, and Thomas, Wolfgang, editor
Published: 2009
Full Text: View/download PDF

28. Revisiting the Direct Sum Theorem and Space Lower Bounds in Random Order Streams

Author: Guha, Sudipto, Huang, Zhiyi, Hutchison, David, Series editor, Kanade, Takeo, Series editor, Kittler, Josef, Series editor, Kleinberg, Jon M., Series editor, Mattern, Friedemann, Series editor, Mitchell, John C., Series editor, Naor, Moni, Series editor, Nierstrasz, Oscar, Series editor, Pandu Rangan, C., Series editor, Steffen, Bernhard, Series editor, Sudan, Madhu, Series editor, Terzopoulos, Demetri, Series editor, Tygar, Doug, Series editor, Vardi, Moshe Y., Series editor, Weikum, Gerhard, Series editor, Albers, Susanne, editor, Marchetti-Spaccamela, Alberto, editor, Matias, Yossi, editor, Nikoletseas, Sotiris, editor, and Thomas, Wolfgang, editor
Published: 2009
Full Text: View/download PDF

29. Tight Lower Bounds for Multi-pass Stream Computation Via Pass Elimination

Author: Guha, Sudipto, McGregor, Andrew, Hutchison, David, Series editor, Kanade, Takeo, Series editor, Kittler, Josef, Series editor, Kleinberg, Jon M., Series editor, Mattern, Friedemann, Series editor, Mitchell, John C., Series editor, Naor, Moni, Series editor, Nierstrasz, Oscar, Series editor, Pandu Rangan, C., Series editor, Steffen, Bernhard, Series editor, Sudan, Madhu, Series editor, Terzopoulos, Demetri, Series editor, Tygar, Doug, Series editor, Vardi, Moshe Y., Series editor, Weikum, Gerhard, Series editor, Aceto, Luca, editor, Damgård, Ivan, editor, Goldberg, Leslie Ann, editor, Halldórsson, Magnús M., editor, Ingólfsdóttir, Anna, editor, and Walukiewicz, Igor, editor
Published: 2008
Full Text: View/download PDF

30. Lower Bounds for Quantile Estimation in Random-Order and Multi-pass Streaming

Author: Guha, Sudipto, McGregor, Andrew, Hutchison, David, editor, Kanade, Takeo, editor, Kittler, Josef, editor, Kleinberg, Jon M., editor, Mattern, Friedemann, editor, Mitchell, John C., editor, Naor, Moni, editor, Nierstrasz, Oscar, editor, Pandu Rangan, C., editor, Steffen, Bernhard, editor, Sudan, Madhu, editor, Terzopoulos, Demetri, editor, Tygar, Doug, editor, Vardi, Moshe Y., editor, Weikum, Gerhard, editor, Arge, Lars, editor, Cachin, Christian, editor, Jurdziński, Tomasz, editor, and Tarlecki, Andrzej, editor
Published: 2007
Full Text: View/download PDF

31. Sketching Information Divergences

Author: Guha, Sudipto, Indyk, Piotr, McGregor, Andrew, Carbonell, Jaime G., editor, Siekmann, Jörg, editor, Bshouty, Nader H., editor, and Gentile, Claudio, editor
Published: 2007
Full Text: View/download PDF

32. Techniques for Clustering Massive Data Sets

Author: Guha, Sudipto, Rastogi, Rajeev, Shim, Kyuseok, Du, Ding-Zhu, editor, Raghavendra, Cauligi, editor, Wu, Weili, Xiong, Hui, and Shekhar, Shashi
Published: 2004
Full Text: View/download PDF

33. Inferring Mixtures of Markov Chains

Author: Batu, Tuğkan, Guha, Sudipto, Kannan, Sampath, Hutchison, David, editor, Kanade, Takeo, editor, Kittler, Josef, editor, Kleinberg, Jon M., editor, Mattern, Friedemann, editor, Mitchell, John C., editor, Naor, Moni, editor, Nierstrasz, Oscar, editor, Pandu Rangan, C., editor, Steffen, Bernhard, editor, Sudan, Madhu, editor, Terzopoulos, Demetri, editor, Tygar, Doug, editor, Vardi, Moshe Y., editor, Weikum, Gerhard, editor, Carbonell, Jaime G., editor, Siekmann, Jörg, editor, Shawe-Taylor, John, editor, and Singer, Yoram, editor
Published: 2004
Full Text: View/download PDF

34. Linear programming in the semi-streaming model with application to the maximum matching problem

Author: Ahn, Kook Jin and Guha, Sudipto
Published: 2013
Full Text: View/download PDF

35. Compression of Partially Ordered Strings

Author: Alur, Rajeev, Chaudhuri, Swarat, Etessami, Kousha, Guha, Sudipto, Yannakakis, Mihalis, Goos, Gerhard, editor, Hartmanis, Juris, editor, van Leeuwen, Jan, editor, Amadio, Roberto, editor, and Lugiez, Denis, editor
Published: 2003
Full Text: View/download PDF

36. Approximating Steiner k-Cuts

Author: Chekuri, Chandra, Guha, Sudipto, Naor, Joseph Seffi, Goos, G., editor, Hartmanis, J., editor, van Leeuwen, J., editor, Baeten, Jos C. M., editor, Lenstra, Jan Karel, editor, Parrow, Joachim, editor, and Woeginger, Gerhard J., editor
Published: 2003
Full Text: View/download PDF

37. Histogramming Data Streams with Fast Per-Item Processing

Author: Guha, Sudipto, Indyk, Piotr, Muthukrishnan, S., Strauss, Martin J., Goos, Gerhard, editor, Hartmanis, Juris, editor, van Leeuwen, Jan, editor, Widmayer, Peter, editor, Eidenbenz, Stephan, editor, Triguero, Francisco, editor, Morales, Rafael, editor, Conejo, Ricardo, editor, and Hennessy, Matthew, editor
Published: 2002
Full Text: View/download PDF

38. Facility location with dynamic distance functions : Extended abstract

Author: Bhatia, Randeep, Guha, Sudipto, Khuller, Samir, Sussmann, Yoram J., Goos, Gerhard, editor, Hartmanis, Juris, editor, van Leeuwen, Jan, editor, Arnborg, Stefan, editor, and Ivansson, Lars, editor
Published: 1998
Full Text: View/download PDF

39. Improved Methods for Approximating Node Weighted Steiner Trees and Connected Dominating Sets

Author: Guha, Sudipto, Khuller, Samir, Goos, Gerhard, editor, Hartmanis, Juris, editor, van Leeuwen, Jan, editor, Arvind, Vikraman, editor, and Ramanujam, Sundar, editor
Published: 1998
Full Text: View/download PDF

40. Approximation algorithms for connected dominating sets

Author: Guha, Sudipto, Khuller, Samir, Goos, Gerhard, editor, Hartmanis, Juris, editor, van Leeuwen, Jan, editor, Diaz, Josep, editor, and Serna, Maria, editor
Published: 1996
Full Text: View/download PDF

41. On the space–time of optimal, approximate and streaming algorithms for synopsis construction problems

Author: Guha, Sudipto
Published: 2008
Full Text: View/download PDF

42. Sketching information divergences

Author: Guha, Sudipto, Indyk, Piotr, and McGregor, Andrew
Published: 2008
Full Text: View/download PDF

43. Wavelet synopsis for hierarchical range queries with workloads

Author: Guha, Sudipto, Park, Hyoungmin, and Shim, Kyuseok
Published: 2008
Full Text: View/download PDF

44. Approximation algorithms for wavelet transform coding of data streams

Author: Guha, Sudipto and Harb, Boulos
Subjects: Algorithm, Approximation theory -- Methods, Coding theory -- Methods, Wavelet transforms -- Properties, Algorithms -- Methods
Abstract: This paper addresses the problem of finding B-term wavelet representation of a given discrete function $f \in \mathbb{R}^n$ whose distance from $f$ is minimized. The problem is well understood when we seek to minimize the Euclidean distance between $f$ and its representation. The first-known algorithms for finding provably approximate representations minimizing general $\ell_p$ distances (including $\ell_\infty$) under a wide variety of compactly supported wavelet bases are presented in this paper. For the Haar basis, a polynomial time approximation scheme is demonstrated. These algorithms are applicable in the one-pass sublinear-space data stream model of computation. They generalize naturally to multiple dimensions and weighted norms. A universal representation that provides a provable approximation guarantee under all p-norms simultaneously; and the first approximation algorithms for bit-budget versions of the problem, known as adaptive quantization, are also presented. Further, it is shown that the algorithms presented here can be used to select a basis from a tree-structured dictionary of bases and find a B-term representation of the given function that provably approximates its best dictionary-basis representation. Index Terms--Adaptive quantization, best basis selection, compactly supported wavelets, nonlinear approximation, sparse representation, streaming algorithms, transform coding, universal representation.
Published: 2008

45. A note on linear time algorithms for maximum error histograms

Author: Guha, Sudipto and Shim, Kyuseok
Subjects: Algorithms -- Usage, Query processing -- Methods, Algorithm, Business, Computers, Electronics, Electronics and electrical industries
Abstract: Histograms and Wavelet synopses provide useful tools in query optimization and approximate query answering. Traditional histogram construction algorithms, e.g., V-Optimal, use error measures which are the sums of a suitable function, e.g., square, of the error at each point. Although the best-known algorithms for solving these problems run in quadratic time, a sequence of results have given us a linear time approximation scheme for these algorithms. In recent years, there have been many emerging applications where we are interested in measuring the maximum (absolute or relative) error at a point. We show that this problem is fundamentally different from the other traditional non-$\ell_\infty$ error measures and provide an optimal algorithm that runs in linear time for a small number of buckets. We also present results which work for arbitrary weighted maximum error measures. Index Terms--Histograms, algorithms.
Published: 2007

46. Spectral Sparsification in Dynamic Graph Streams

Author: Ahn, Kook Jin, primary, Guha, Sudipto, additional, and McGregor, Andrew, additional
Published: 2013
Full Text: View/download PDF

47. Approximate Indexability and Bandit Problems with Concave Rewards and Delayed Feedback

Author: Guha, Sudipto, primary and Munagala, Kamesh, additional
Published: 2013
Full Text: View/download PDF

48. Approximation and streaming algorithms for histogram construction problems

Author: Guha, Sudipto, Koudas, Nick, and Shim, Kyuseok
Subjects: Algorithm, Algorithms -- Usage, Algorithms -- Methods, Data structures -- Measurement
Abstract: Histograms and related synopsis structures are popular techniques for approximating data distributions. These have been successful in query optimization and a variety of applications, including approximate querying, similarity searching, and data mining, to name a few. Histograms were a few of the earliest synopsis structures proposed and continue to be used widely. The histogram construction problem is to construct the best histogram restricted to a space bound that reflects the data distribution most accurately under a given error measure. The histograms are used as quick and easy estimates. Thus, a slight loss of accuracy, compared to the optimal histogram under the given error measure, can be offset by fast histogram construction algorithms. A natural question arises in this context: Can we find a fast near optimal approximation algorithm for the histogram construction problem? In this article, we give the first linear time $(1 + \epsilon)$-factor approximation algorithms (for any $\epsilon > 0$) for a large number of histogram construction problems including the use of piecewise small degree polynomials to approximate data, workloads, etc. Several of our algorithms extend to data streams. Using synthetic and real-life data sets, we demonstrate that in many scenarios the approximate histograms are almost identical to optimal histograms in quality and are significantly faster to construct. Categories and Subject Descriptors: F.2 [Theory of Computation]: Analysis of Algorithms; G.1.2 [Numerical Analysis]: Approximation; H.2 [Information Systems]: Database Management. General Terms: Algorithms, Performance, Theory. Additional Key Words and Phrases: Data Streams, histograms, approximation algorithm
Published: 2006

49. Integrating XML data sources using approximate joins

Author: Guha, Sudipto, Jagadish, H.V., Koudas, Nick, Srivastava, Divesh, and Yu, Ting
Subjects: Algorithm, Data integrity, XML, Algorithms -- Usage, Algorithms -- Methods, Data integrity -- Research, XML (Document markup language) -- Research
Abstract: XML is widely recognized as the data interchange standard of tomorrow because of its ability to represent data from a variety of sources. Hence, XML is likely to be the format through which data from multiple sources is integrated. In this article, we study the problem of integrating XML data sources through correlations realized as join operations. A challenging aspect of this operation is the XML document structure. Two documents might convey approximately or exactly the same information but may be quite different in structure. Consequently, an approximate match in structure, in addition to content, has to be folded into the join operation. We quantify an approximate match in structure and content for pairs of XML documents using well defined notions of distance. We show how notions of distance that have metric properties can be incorporated in a framework for joins between XML data sources and introduce the idea of reference sets to facilitate this operation. Intuitively, a reference set consists of data elements used to project the data space. We characterize what constitutes a good choice of a reference set, and we propose sampling-based algorithms to identify them. We then instantiate our join framework using the tree edit distance between a pair of trees. We next turn our attention to utilizing well known index structures to improve the performance of approximate XML join operations. We present a methodology enabling adaptation of index structures for this problem, and we instantiate it in terms of the R-tree. We demonstrate the practical utility of our solutions using large collections of real and synthetic XML data sets, varying parameters of interest, and highlighting the performance benefits of our approach. 
Categories and Subject Descriptors: H.2.4 [Database Management]: Systems--Query processing General Terms: Algorithms, Experimentation, Performance, Theory Additional Key Words and Phrases: Data integration, tree edit distance, XML, joins, approximate joins
Published: 2006

50. Linear Programming in the Semi-streaming Model with Application to the Maximum Matching Problem

Author: Ahn, Kook Jin, primary and Guha, Sudipto, additional
Published: 2011
Full Text: View/download PDF
