Author: "Bader, David" / Journal: high performance computing - hipc 2005 - Searchworks@Jio Institute Digital Library Search Results

1. Design and Implementation of the HPCS Graph Analysis Benchmark on Symmetric Multiprocessors

Author: Bader, David A., primary and Madduri, Kamesh, additional
Published: 2005

2. A Clustering and Traffic-Redistribution Scheme for High-Performance IPsec VPNs.

Author: Bader, David A., Parashar, Manish, Sridhar, Varadarajan, Prasanna, Viktor K., Pan-Lung Tsai, Chun-Ying Huang, Yun-Yin Huang, Chia-Chang Hsu, and Chin-Laung Lei
Abstract: CPE-based IPsec VPNs have been widely used to provide secure private communication across the Internet. As the bandwidth of WAN links keeps growing, the bottleneck in a typical deployment of CPE-based IPsec VPNs has moved from the last-mile connections to the customer-edge security gateways. In this paper, we propose a clustering scheme to scale the throughput as required by CPE-based IPsec VPNs. The proposed scheme groups multiple security gateways into a cluster using a transparent self-dispatching technique and allows as many gateways to be added as necessary until the resulting throughput is again limited by the bandwidth of the last-mile connections. It also includes a flow-migration mechanism to keep the load of the gateways balanced. The results of the performance evaluation confirm that the clustering technique and the traffic-redistribution mechanism together create a transparent, adaptive, and highly scalable solution for building high-performance IPsec VPNs. [ABSTRACT FROM AUTHOR]
Published: 2005

3. Scheduling Multiple Flows on Parallel Disks.

Author: Bader, David A., Parashar, Manish, Sridhar, Varadarajan, Prasanna, Viktor K., Gulati, Ajay, and Varman, Peter
Abstract: We examine the problem of scheduling concurrent independent flows on multiple-disk I/O storage systems. Two models are considered: in the shared buffer model the memory buffer is shared among all the flows, while in the partitioned buffer model each flow has a private buffer. For the parallel disk model with d > 1 disks it is shown that the problem of minimizing the schedule length of n > 2 concurrent flows is NP-complete for both buffer models. A randomized scheduling algorithm for the partitioned buffer model is analyzed and probabilistic bounds on the schedule length are presented. Finally a heuristic based on static buffer allocation for the shared buffer model is discussed. [ABSTRACT FROM AUTHOR]
Published: 2005

4. Performance and Power Evaluation of an Intelligently Adaptive Data Cache.

Author: Bader, David A., Parashar, Manish, Sridhar, Varadarajan, Prasanna, Viktor K., Benítez, Domingo, Moure, Juan Carlos, Rexachs, Dolores Isabel, and Luque, Emilio
Abstract: We describe the analysis of an on-line pattern-recognition algorithm to dynamically control the configuration of the L1 data cache of a high-performance processor. The microarchitecture achieves higher performance and energy saving due to the accommodation of operating frequency, capacity, set-associativity, line size, hit latency, energy per access, and chip area to program workload and ILP. We show that for the operating frequency 4.5 GHz, the execution time is always reduced with an average measure of 12.1% when compared to a non-adaptive high-performance processor. Additionally, the energy saving is 2.7% on average, and the product time-energy is reduced on average by 14.9%. We also consider a profile-based reconfiguration of data cache, which allows picking different cache configurations but only one can be chosen for each program. Experimental results indicate that this approach yields a high percentage of the performance improvement and energy saving achieved by the on-line algorithm. [ABSTRACT FROM AUTHOR]
Published: 2005

5. WDM Multistage Interconnection Networks Architectures for Enhancing Supernetworks Switching Infrastructure.

Author: Bader, David A., Parashar, Manish, Sridhar, Varadarajan, Prasanna, Viktor K., Hamza, Haitham S., and Deogun, Jitender S.
Abstract: Multistage Interconnection Networks (MINs) provide the required switching infrastructure for many shared-memory multiprocessor systems and telecommunication networks. The concept of Supernetworks is evolving in response to emerging computation and communication intensive applications. Supernetworks exploit parallelism in both computing resources and communication infrastructures by interconnecting several computing clusters via high-bandwidth communication links. Wavelength Division Multiplexing (WDM) technology provides the communication infrastructure for Supernetworks by dividing the bandwidth of a single fiber into numerous channels that can be used independently. In this paper, we investigate several architectures for WDM MINs that enhance the Supernetworks switching infrastructure. Our objective is to propose a new architecture and to evaluate its hardware complexity by comparing it to other WDM MINs architectures. [ABSTRACT FROM AUTHOR]
Published: 2005

6. Performance Analysis of User-Level PIM Communication in the Data IntensiVe Architecture (DIVA) System.

Author: Bader, David A., Parashar, Manish, Sridhar, Varadarajan, Prasanna, Viktor K., Mediratta, Sumit Dharampal, and Draper, Jeffrey
Abstract: The performance of user-level messaging in PIM (Processing-In-Memory) to PIM communication is modeled and analyzed for the DIVA (Data IntensiVe Architecture) system. Six benchmarks have been used for this purpose, two from each category, namely single message transfer, parallel transfer and collective communication, as described for the PMB (Pallas MPI Benchmarks). The benchmarks used are PingPong, PingPing, SendReceive, Exchange, Barrier synchronization and AllToAll personalized exchange. The main significance of this work lies in the evaluation of an implementation of system-wide support for memory-to-memory and memory-to-host communication via a parcel buffer (used as a network interface). Another remarkable feature of this evaluation lies in presenting an optimal algorithm for Barrier synchronization and an optimal algorithm, with full channel utilization, for AllToAll personalized exchange for the bi-directional ring configuration of up to 8 DIVA PIMs in the memory system of a Hewlett-Packard's zx6000 server. The algorithms presented can be scaled for higher number of PIM chips with a little degradation in performance over the optimal solution. Our analysis shows that the currently employed communication mechanism can be used very efficiently for collective communication operations, and it also exposes the bottlenecks in the current design for future improvements. [ABSTRACT FROM AUTHOR]
Published: 2005

7. iSCSI Analysis System and Performance Improvement of iSCSI Sequential Access in High Latency Networks.

Author: Bader, David A., Parashar, Manish, Sridhar, Varadarajan, Prasanna, Viktor K., Yamaguchi, Saneyasu, Oguchi, Masato, and Kitsuregawa, Masaru
Abstract: IP-SAN and iSCSI are expected to remedy the problems of FC-based SAN. iSCSI has a structure of multilayer protocols. A typical configuration of the protocols to realize this system is as follows: SCSI over iSCSI over TCP/IP over Ethernet. Thus, in order to improve the performance of the system, it is necessary to precisely analyze the complicated behavior of each layer. In this paper, we present an IP-SAN analysis tool that monitors each of these layers from different viewpoints. By using this analysis tool, we experimentally demonstrate that the performance of iSCSI storage access can be significantly improved by more than 60 times. [ABSTRACT FROM AUTHOR]
Published: 2005

8. Cooperative Instruction Scheduling with Linear Scan Register Allocation.

Author: Bader, David A., Parashar, Manish, Sridhar, Varadarajan, Prasanna, Viktor K., Win, Khaing Khaing Kyi, and Wong, Weng-Fai
Abstract: Linear scan register allocation is an attractive register allocation algorithm because of its simplicity and fast running time. However, it is generally felt that linear scan register allocation yields poorer code than allocation schemes based on graph coloring. In this paper, we propose a pre-pass instruction scheduling algorithm that improves on the code quality of linear scan allocators. Our implementation in the Trimaran compiler-simulator infrastructure shows that our scheduler can reduce the number of active live ranges that the linear scan allocator has to deal with. As a result, fewer spills are needed and the quality of the generated code is improved. Furthermore, compared to the default scheduling and graph-coloring allocator schemes found in the IMPACT and Elcor components of Trimaran, our implementation with our pre-pass scheduler and linear scan register allocator significantly reduced compilation times. [ABSTRACT FROM AUTHOR]
Published: 2005

9. Integrating a New Cluster Assignment and Scheduling Algorithm into an Experimental Retargetable Code Generation Framework.

Author: Bader, David A., Parashar, Manish, Sridhar, Varadarajan, Prasanna, Viktor K., Lakshmi, K. Vasanta, Sreedhar, Deepak, Raman, Easwaran, and Shankar, Priti
Abstract: This paper presents a new unified algorithm for cluster assignment and region scheduling, and its integration into an experimental retargetable code generation framework. The components of the framework are an instruction selector generator based on a recent technique, the IMPACT front end, a machine description module which uses a modification of the HMDES machine description language to include cluster information, a combined cluster allocator and an acyclic region scheduler, and a register allocator. Experiments have been carried out on the targeting of the tool to the Texas Instruments TMS320c62x architecture. We report preliminary results on a set of TI benchmarks. [ABSTRACT FROM AUTHOR]
Published: 2005

10. A Grid Authentication System with Revocation Guarantees.

Author: Bader, David A., Parashar, Manish, Sridhar, Varadarajan, Prasanna, Viktor K., Sundaram, Babu, and Chapman, Barbara M.
Abstract: Credential revocation is a critical problem in grid environments and remains unaddressed in existing grid security solutions. We present a novel grid authentication system that solves the revocation problem. It guarantees instantaneous revocation of both long-term digital identities of hosts/users and short-lived identities of user proxies. With our approach, revocation information is guaranteed to be fresh with high time-granularity. Our system employs mediated RSA (mRSA), adapts Boneh's notion of semi-trusted mediators to suit security in virtual organizations and propagates proxy revocation information as in Micali's NOVOMODO system. Our approach's added benefits include a configuration-free security model for end-users of the grid and fine-grained management of users' delegation capabilities. [ABSTRACT FROM AUTHOR]
Published: 2005

11. Scheduling Divisible Loads with Return Messages on Heterogeneous Master-Worker Platforms.

Author: Bader, David A., Parashar, Manish, Sridhar, Varadarajan, Prasanna, Viktor K., Beaumont, Olivier, Marchal, Loris, and Robert, Yves
Abstract: In this paper, we consider the problem of scheduling divisible loads onto an heterogeneous star platform, with both heterogeneous computing and communication resources. We consider the case where the workers, after processing the tasks, send back some results to the master processor. This corresponds to a more general framework than the one used in many divisible load papers, where only forward communications are taken into account. To the best of our knowledge, this paper constitutes the first attempt to derive optimality results under this general framework (forward and backward communications, heterogeneous processing and communication resources). We prove that it is possible to derive the optimal solution both for LIFO and FIFO distribution schemes. Nevertheless, the complexity of the general problem remains open: we also show in the paper that the optimal distribution scheme may be neither LIFO nor FIFO. [ABSTRACT FROM AUTHOR]
Published: 2005

12. Snap-Stabilizing Detection of Cutsets.

Author: Bader, David A., Parashar, Manish, Sridhar, Varadarajan, Prasanna, Viktor K., Cournier, Alain, Devismes, Stéphane, and Villain, Vincent
Abstract: A snap-stabilizing protocol, starting from any configuration, always behaves according to its specification. Here, we present the first snap-stabilizing protocol for arbitrary rooted networks which detects if a set of nodes is a cutset. This protocol is based on the depth-first search (DFS) traversal and its properties. One of the most interesting properties of our protocol is that, despite the initial configuration, as soon as the protocol is initiated by the root, the result obtained from the computations will be right. So, after the first execution of the protocol, the root is able to take a decision: "the input set is a cutset or not", and this decision is right. [ABSTRACT FROM AUTHOR]
Published: 2005

13. Design and Implementation of the HPCS Graph Analysis Benchmark on Symmetric Multiprocessors.

Author: Parashar, Manish, Sridhar, Varadarajan, Prasanna, Viktor K., Bader, David A., and Madduri, Kamesh
Abstract: Graph theoretic problems are representative of fundamental computations in traditional and emerging scientific disciplines like scientific computing and computational biology, as well as applications in national security. We present our design and implementation of a graph theory application that supports the kernels from the Scalable Synthetic Compact Applications (SSCA) benchmark suite, developed under the DARPA High Productivity Computing Systems (HPCS) program. This synthetic benchmark consists of four kernels that require irregular access to a large, directed, weighted multi-graph. We have developed a parallel implementation of this benchmark in C using the POSIX thread library for commodity symmetric multiprocessors (SMPs). In this paper, we primarily discuss the data layout choices and algorithmic design issues for each kernel, and also present execution time and benchmark validation results. [ABSTRACT FROM AUTHOR]
Published: 2005

14. Learning-TCP: A Novel Learning Automata Based Congestion Window Updating Mechanism for Ad hoc Wireless Networks.

Author: Bader, David A., Parashar, Manish, Sridhar, Varadarajan, Prasanna, Viktor K., Ramana, B. Venkata, and Murthy, C. Siva Ram
Abstract: The use of traditional TCP, in its present form, for reliable transport over Ad hoc Wireless Networks (AWNs) leads to a significant degradation in the network performance. This is primarily due to the congestion window (cwnd) updation and congestion control mechanisms employed by TCP and its inability to distinguish congestion losses from wireless losses. In order to provide an efficient reliable transport over AWNs, we propose Learning-TCP, a novel learning automata based reliable transport protocol, which efficiently adjusts the cwnd size and thus reduces the packet losses. The key idea behind Learning-TCP is that, it dynamically adapts to the changing network conditions and appropriately updates the cwnd size by observing the arrival of acknowledgment (ACK) and duplicate ACK (DUPACK) packets. Learning-TCP, unlike other existing proposals for reliable transport over AWNs, does not require any explicit feedback, such as congestion and link failure notifications, from the network. We provide extensive simulation studies of Learning-TCP under varying network conditions, that show increased throughput (9-18%) and reduced packet loss (42-55%) compared to that of TCP. [ABSTRACT FROM AUTHOR]
Published: 2005

15. A Low-Complexity Issue Queue Design with Speculative Pre-execution.

Author: Bader, David A., Parashar, Manish, Sridhar, Varadarajan, Prasanna, Viktor K., Ro, Won W., and Gaudiot, Jean-Luc
Abstract: Current superscalar architectures inherently depend on an instruction issue queue to achieve multiple instruction issue and out-of-order execution. However, the issue queue is implemented as a centralized structure and mainly causes globally broadcasting operations to wake up and select the instructions. Therefore, a large issue queue ultimately results in a low clock rate along with a high circuit complexity. This paper proposes Speculative Pre-Execution Assisted by compileR (SPEAR), a low-complexity issue queue design. SPEAR is designed to manage the small issue queue more efficiently without increasing the queue size. To this end, we have first recognized that the long memory latency is one of the factors which demand a large queue, and we aim at achieving early execution of the miss-causing load instructions using another hierarchy of an issue queue. We speculatively pre-execute those miss-causing instructions as an additional prefetching thread. [ABSTRACT FROM AUTHOR]
Published: 2005

16. Improved Point-to-Point and Collective Communication Performance with Output-Queued High-Radix Routers.

Author: Bader, David A., Parashar, Manish, Sridhar, Varadarajan, Prasanna, Viktor K., Kumar, Sameer, Stunkel, Craig, and Kalé, Laxmikant V.
Abstract: We present an output-queued switch architecture with cross-point buffering that has improved performance for both point-to-point communication and hardware accelerated collective communication. In the past, output queuing architectures have been less popular as they require more internal speedup and buffering. However, with current technology it is possible to build output-queued switches with a relatively large number of ports. We demonstrate that our output-queued architecture performs well for point-to-point messages, specially in a fat-tree topology. We also show that output-queued architectures facilitate efficient implementations of multicasts and reductions. We present performance of multicasts and reductions on individual switches and a network of switches interconnected in a fat-tree topology. We also present simulation results based on synthetic workloads that emulate a molecular dynamics application. [ABSTRACT FROM AUTHOR]
Published: 2005

17. Neural Confidence Estimation for More Accurate Value Prediction.

Author: Bader, David A., Parashar, Manish, Sridhar, Varadarajan, Prasanna, Viktor K., Black, Michael, and Franklin, Manoj
Abstract: Data dependencies between instructions have traditionally limited the ability of processors to execute instructions in parallel. Data value predictors are used to overcome these dependencies by guessing the outcomes of instructions. Because mispredictions can result in a significant performance decrease, most data value predictors include a confidence estimator that indicates whether a prediction should be used. This paper presents a global approach to confidence estimation in which the prediction accuracy of previous instructions is used to estimate the confidence of the current prediction. Perceptrons are used to identify which past instructions affect the accuracy of a prediction and to decide whether the prediction is likely to be correct. Simulation studies compare this global confidence estimator to the more conventional local confidence estimator. Results show that predictors using this global confidence estimator tend to predict significantly more instructions and incur fewer mispredictions than predictors using existing local confidence estimation approaches. [ABSTRACT FROM AUTHOR]
Published: 2005

18. The Potential of On-Chip Multiprocessing for QCD Machines.

Author: Bader, David A., Parashar, Manish, Sridhar, Varadarajan, Prasanna, Viktor K., Bilardi, Gianfranco, Pietracaprina, Andrea, Pucci, Geppino, Schifano, Fabio, and Tripiccione, Raffaele
Abstract: We explore the opportunities offered by current and forthcoming VLSI technologies to on-chip multiprocessing for Quantum Chromo Dynamics (QCD), a computational grand challenge for which over half a dozen specialized machines have been developed over the last two decades. Based on a careful study of the information exchange requirements of QCD both across the network and within the memory system, we derive the optimal partition of die area between storage and functional units. We show that a scalable chip organization holds the promise to deliver from hundreds to thousands flop per cycle as VLSI feature size scales down from 90 nm to 20 nm, over the next dozen years. [ABSTRACT FROM AUTHOR]
Published: 2005

19. Low-Power 32bit×32bit Multiplier Design with Pipelined Block-Wise Shutdown.

Author: Bader, David A., Parashar, Manish, Sridhar, Varadarajan, Prasanna, Viktor K., Yong-Ju Jang, Yoan Shin, Min-Cheol Hong, Jae-Kyung Wee, and Seongsoo Lee
Abstract: This paper proposes a novel low-power 32bit×32bit multiplier with pipelined block-wise shutdown scheme. When it idles, it turns off supply voltage to reduce both dynamic and static power. It shutdowns and wakes up sequentially along with pipeline stage to avoid power line noise. In idle mode, the proposed multiplier consumes 0.013mW and 0.006mW in 0.13μm and 0.09μm technologies, respectively, and it reduces power consumption to 0.07%~0.08% of active mode. As fabrication technology becomes small, power efficiency degrades in the conventional clock gating scheme, but the proposed multiplier does not. The low-power design methodology in this paper can be easily adopted in most functional blocks with pipeline architecture. [ABSTRACT FROM AUTHOR]
Published: 2005

20. An Efficient Distributed Algorithm for Finding Virtual Backbones in Wireless Ad-Hoc Networks.

Author: Bader, David A., Parashar, Manish, Sridhar, Varadarajan, Prasanna, Viktor K., Paul, B., Rao, S.V., and Nandi, S.
Abstract: A minimum connected dominating set is an efficient approach to form a virtual backbone for ad-hoc networks. We propose a tree based distributed time/message efficient approximation algorithm to compute a small connected dominating set without using geographic positions. The algorithm has O(n) time, O(n log n) message complexity, and has an approximation factor of eight. The algorithm is implemented using dominating set simulation program, which shows that our method gives smaller connected dominating set than the existing methods. Keywords: independent set, connected dominating set, MANET. [ABSTRACT FROM AUTHOR]
Published: 2005

21. A Novel Battery Aware MAC Protocol for Minimizing Energy × Latency in Wireless Sensor Networks.

Author: Bader, David A., Parashar, Manish, Sridhar, Varadarajan, Prasanna, Viktor K., Dhanaraj, M., Jayashree, S., and Murthy, C. Siva Ram
Abstract: Wireless Sensor Networks (WSNs) possess highly constrained energy resources. The existing Medium Access Control (MAC) protocols for WSNs try to either minimize the energy consumption or the latency, which are conflicting objectives, or find a trade-off between them. They fail to achieve the minimum energy × latency, which ensures that transmission should occur such that both the energy consumption and latency are minimized. We propose a novel Battery-aware Energy-efficient MAC protocol to minimize the Latency (BEL-MAC) that exploits the chemical properties of the batteries of the sensor nodes, in order to increase their lifetime. Our protocol reduces the latency of the packets in an efficient manner without compromising on the lifetime of the network. We compare our protocol with the SMAC, DSMAC, TMAC, and IEEE 802.11 MAC, in terms of throughput and latency and show that our protocol outperforms these existing protocols, in terms of energy × latency. [ABSTRACT FROM AUTHOR]
Published: 2005

22. Extensible Parallel Architectural Skeletons.

Author: Bader, David A., Parashar, Manish, Sridhar, Varadarajan, Prasanna, Viktor K., Akon, Mohammad Mursalin, Singh, Ajit, Goswami, Dhrubajyoti, and Hon Fung Li
Abstract: Complexity of parallel application development has been one of the major obstacles towards the mainstream adoption of parallel programming. In order to hide some of these complexities, researchers have been actively investigating the pattern-based approaches to parallel programming. As reusable components, patterns are intended to ease the design and development phases of parallel applications. Parallel Architectural Skeleton (PAS) is one such pattern-based parallel programming model which describes the architectural aspects of parallel patterns. Like many other pattern-based parallel programming models and tools, the benefits of PAS were offset by the difficulties in extending PAS. EPAS is an extension of PAS that addresses this issue. Using EPAS, a skeleton designer can design new skeletons and add them to the skeleton repository (i.e., extensibility). EPAS also makes the PAS model more flexible by defining composition of skeletons. In this paper, we describe the model of EPAS and also discuss some of the recent usability and performance studies. The studies demonstrate that EPAS is a practical and usable parallel programming model and tool. [ABSTRACT FROM AUTHOR]
Published: 2005

23. Providing Full QoS Support in Clusters Using Only Two VCs at the Switches.

Author: Bader, David A., Parashar, Manish, Sridhar, Varadarajan, Prasanna, Viktor K., Martínez, A., Alfaro, F. J., Sánchez, J. L., and Duato, J.
Abstract: Current interconnect standards providing hardware support for quality of service (QoS) consider up to 16 virtual channels (VCs) for this purpose. However, most implementations do not offer so many VCs because they increase the complexity of the switch and the scheduling delays. In this paper, we show that this number of VCs can be significantly reduced. Some of the scheduling decisions made at network interfaces can be easily reused at switches without significantly altering the global behavior. Specifically, we show that it is enough to use two VCs for QoS purposes at each switch port, thereby simplifying the design and reducing its cost. [ABSTRACT FROM AUTHOR]
Published: 2005

24. An Algorithm for Boundary Discovery in Wireless Sensor Networks.

Author: Bader, David A., Parashar, Manish, Sridhar, Varadarajan, Prasanna, Viktor K., Deogun, Jitender S., Das, Saket, Hamza, Haitham S., and Goddard, Steve
Abstract: Wireless Sensor Networks (WSNs) consist of a large number of nodes networked via wireless links. In many WSN settings, sensor nodes are deployed in an ad hoc manner. One important issue in this context is to detect the boundary of the deployed network to ensure that the sensor nodes cover the target area. In this paper, we propose a new algorithm that can be used to discover the boundary of a randomly deployed WSN. The algorithm does not require the sensor nodes to be equipped with positioning devices and is scalable for large number of nodes. Simulation experiments are developed to evaluate the performance of the proposed algorithms for different network topologies. The simulation results show that the algorithm detects the boundary nodes of the network with high accuracy. [ABSTRACT FROM AUTHOR]
Published: 2005

25. A Novel Solution for Time Synchronization in Wireless Ad Hoc and Sensor Networks.

Author: Bader, David A., Parashar, Manish, Sridhar, Varadarajan, Prasanna, Viktor K., Sekhar, Archana, Manoj, B. S., and Murthy, C. Siva Ram
Abstract: Time synchronization is an important aspect of distributed computer systems and networks. Nodes must be synchronized to a common clock to determine slot durations for a TDMA based transmission scheme. Most efficient slot-assignment algorithms apportion the TDMA slots with the underlying assumption of a reasonably accurate global synchronization of the network. In this paper, we propose a novel synchronization protocol for ad hoc, sensor, and other dense multi-hop infrastructure-less wireless networks. The protocol performs a random leader election to achieve global network synchronization. We have analyzed the variation of synchronization time and error with different node densities and mobility speeds, by simulating the protocol. Expressions have been derived reflecting the worst case synchronization error, and the maximum synchronization time, for a network with uniform distribution of nodes. Simulation results show that out-of-band and piggybacked signaling have good accuracy of synchronization, and that a considerable bandwidth saving occurs with piggybacking on data or acknowledgment packets. [ABSTRACT FROM AUTHOR]
Published: 2005

26. On the Power Optimization and Throughput Performance of Multihop Wireless Network Architectures.

Author: Bader, David A., Parashar, Manish, Sridhar, Varadarajan, Prasanna, Viktor K., Bhaya, G., Manoj, B. S., and Murthy, C. Siva Ram
Abstract: With the emergence of powerful processors and complex applications, wireless communication devices are increasingly power hungry. While there exist several solutions to provide transmission power management in cellular wireless networks and ad hoc wireless networks, it remains an open problem in recently proposed hybrid wireless networks. The Multihop Cellular Network (MCN) and Multi Power Architecture for Cellular network (MuPAC) are instances of hybrid wireless networks, which are proposed to increase the system throughput and spectrum reuse by infusing multihop radio relaying mechanism into the infrastructure-based wireless networks. This paper proposes a novel variable power optimization scheme for the hybrid wireless network architectures such as MCN and MuPAC in order to optimize the power consumption at a mobile node without losing the throughput advantage gained by the multihop scheme. Extensive simulation results show 10% to 15% improvement in power consumption and system throughput which is significant in case of power constrained mobile nodes. [ABSTRACT FROM AUTHOR]
Published: 2005

27. Sensor Selection Heuristic in Sensor Networks.

Author: Bader, David A., Parashar, Manish, Sridhar, Varadarajan, Prasanna, Viktor K., Sadaphal, Vaishali P., and Jain, Bijendra N.
Abstract: We consider the problem of estimating the location of a moving target in a 2-D plane. In this paper, we focus attention on selecting an appropriate 3rd sensor, given two sensors, with a view to minimize the estimation error. Only the selected sensors need to measure distance to the target and communicate the same to the central "tracker". This minimizes bandwidth and energy consumed in measurement and communication while achieving near minimum estimation error. In this paper, we have proposed that the 3rd sensor be selected based on three measures viz. (a) collinearity, (b) deviation from the ideal direction in which the sensor should be selected, and (c) proximity of the sensor from the target. We assume that the measurements are subject to multiplicative error. Further, we use least square error estimation technique to estimate the target location. Simulation results show that using the proposed algorithm it is possible to achieve near minimum error in target location. [ABSTRACT FROM AUTHOR]
Published: 2005

28. A High-Speed VLSI Array Architecture for Euclidean Metric-Based Hausdorff Distance Measures Between Images.

Author: Bader, David A., Parashar, Manish, Sridhar, Varadarajan, Prasanna, Viktor K., Sudha, N., and Vivek, E. P.
Abstract: A new parallel algorithm to compute Euclidean metric-based Hausdorff distance measures between binary images (typically edge maps) is proposed in this paper. The algorithm has a running time of O(n) for images of size n × n. Further, the algorithm has the following features: (i) simple arithmetic (ii) identical computations at each pixel and (iii) computations using a small neighborhood for each pixel. An efficient cellular architecture for implementing the proposed algorithm is presented. Results of implementation using field-programmable gate arrays show that the measures can be computed for approximately 88000 image pairs of size 128×128 in a second. This result is valuable for real-time applications like object tracking and video surveillance. [ABSTRACT FROM AUTHOR]
Published: 2005
Full Text: View/download PDF

29. The Impact of Noise on the Scaling of Collectives: A Theoretical Approach.

Author: Bader, David A., Parashar, Manish, Sridhar, Varadarajan, Prasanna, Viktor K., Agarwal, Saurabh, Garg, Rahul, and Vishnoi, Nisheeth K.
Abstract: The performance of parallel applications running on large clusters is known to degrade due to the interference of kernel and daemon activities on individual nodes, often referred to as noise. In this paper, we focus on an important class of parallel applications, which repeatedly perform computation, followed by a collective operation such as a barrier. We model this theoretically and demonstrate, in a rigorous way, the effect of noise on the scalability of such applications. We study three natural and important classes of noise distributions: The exponential distribution, the heavy-tailed distribution, and the Bernoulli distribution. We show that the systems scale well in the presence of exponential noise, but the performance goes down drastically in the presence of heavy-tailed or Bernoulli noise. [ABSTRACT FROM AUTHOR]
Published: 2005
Full Text: View/download PDF

30. XCAT-C++: Design and Performance of a Distributed CCA Framework.

Author: Bader, David A., Parashar, Manish, Sridhar, Varadarajan, Prasanna, Viktor K., Govindaraju, Madhusudhan, Head, Michael R., and Chiu, Kenneth
Abstract: In this paper we describe the design and implementation of a C++ based Common Component Architecture (CCA) framework, XCAT-C++. It can efficiently marshal and unmarshal large data sets, and provides the necessary modules and hooks in the framework to meet the requirements of distributed scientific applications. XCAT-C++ uses a high-performance multi-protocol library so that the appropriate communication protocol is employed for each pair of interacting components. Scientific applications can dynamically switch to a suitable communication protocol to maximize effective throughput. XCAT-C++ component layering imposes minimal overhead and application components can achieve highly efficient throughput for large data sets commonly used in scientific computing. It has a suite of tools to aid application developers including a flexible code generation toolkit and a python scripting interface. XCAT-C++ provides the means for application developers to leverage the efficacy of the CCA component model to manage the complexity of their distributed scientific simulations. Keywords: CCA, XCAT-C++, component, performance, multi-protocol. [ABSTRACT FROM AUTHOR]
Published: 2005
Full Text: View/download PDF

31. Data and Computation Abstractions for Dynamic and Irregular Computations.

Author: Bader, David A., Parashar, Manish, Sridhar, Varadarajan, Prasanna, Viktor K., Krishnamoorthy, Sriram, Nieplocha, Jarek, and Sadayappan, P.
Abstract: Effective data distribution and parallelization of computations involving irregular data structures is a challenging task. We address the twin-problems in the context of computations involving block-sparse matrices. The programming model provides a global view of a distributed block-sparse matrix. Abstractions are provided for the user to express the parallel tasks in the computation. The tasks are mapped onto processors to ensure load balance and locality. The abstractions are based on the Aggregate Remote Memory Copy Interface, and are interoperable with the Global Arrays programming suite and MPI. Results are presented that demonstrate the utility of the approach. [ABSTRACT FROM AUTHOR]
Published: 2005
Full Text: View/download PDF

32. Using Clustering to Address Heterogeneity and Dynamism in Parallel Scientific Applications.

Author: Bader, David A., Sridhar, Varadarajan, Prasanna, Viktor K., Xiaolin Li, and Parashar, Manish
Abstract: The dynamism and space-time heterogeneity exhibited by structured adaptive mesh refinement (SAMR) applications make their scalable parallel implementation a significant challenge. This paper investigates an adaptive hierarchical multi-partitioner (AHMP) framework that dynamically applies multiple partitioners to different regions of the domain, in a hierarchical manner, to match the local requirements of these regions. Key components of the AHMP framework include a segmentation-based clustering algorithm (SBC) for identifying regions in the domain with relatively homogeneous partitioning requirements, mechanisms for characterizing the partitioning requirements, and a runtime system for selecting, configuring and applying the most appropriate partitioner to each region. The AHMP framework has been implemented and experimentally evaluated on up to 1280 processors of the IBM SP4 cluster at San Diego Supercomputer Center. Keywords: Parallel Computing, Adaptive Mesh Refinement, Dynamic Load Balancing, Hierarchical Multi-Partitioner. [ABSTRACT FROM AUTHOR]
Published: 2005
Full Text: View/download PDF

33. Advanced Resource Management and Scheduling of Workflow Applications in JavaSymphony.

Author: Bader, David A., Parashar, Manish, Sridhar, Varadarajan, Prasanna, Viktor K., Jugravu, Alexandru, and Fahringer, Thomas
Abstract: JavaSymphony is a high-level programming model for performance oriented distributed and parallel Java applications, which allows the programmer to control parallelism, load balancing, and locality at a high level of abstraction. In this paper, we describe an extension of JavaSymphony that deals with distributed workflow applications as graphs of software components, which can be executed on a distributed set of computers. Workflows are not limited to DAGs, but also cover complex control flow including loops. Furthermore, we introduce a novel approach for workflow scheduling based on the HEFT algorithm and resource brokerage for a heterogeneous set of computers. We demonstrate the effectiveness of our approach with two real-world applications and compare our techniques against the widely known DAGMan Condor scheduler. [ABSTRACT FROM AUTHOR]
Published: 2005
Full Text: View/download PDF

34. An Intra-task DVS Algorithm Exploiting Program Path Locality for Real-Time Embedded Systems.

Author: Bader, David A., Parashar, Manish, Sridhar, Varadarajan, Prasanna, Viktor K., Kumar, G. Sudha Anil, and Manimaran, G.
Abstract: In this paper, we present a novel intra-task Dynamic Voltage Scheduling (DVS) algorithm based on the knowledge of frequently executed paths in the control flow graph for real-time embedded systems. The basic idea is to construct a common path composing all the frequently executed paths (hot-paths) and perform DVS scheduling based on this common path, rather than the most probable path. We compare the performance (energy consumption) of our algorithm with a recently proposed algorithm. Our simulation results show that the proposed algorithm performs better than the existing algorithm for most of the simulated conditions. We also identify interesting research problems in this context. [ABSTRACT FROM AUTHOR]
Published: 2005
Full Text: View/download PDF

35. Distributed Point Rendering.

Author: Bader, David A., Parashar, Manish, Sridhar, Varadarajan, Prasanna, Viktor K., Rajagopalan, Ramgopal, Bhakar, Sushil, Goswami, Dhrubajyoti, and Mudur, Sudhir P.
Abstract: Traditionally graphics clusters have been employed in real-time visualization of large geometric models (many millions of 3D points). Data parallel approaches have been the obvious choices when it comes to breaking up the computations over multiple processors. In recent years, programmable graphics hardware has gained widespread acceptance. Today, every processing node in a graphics cluster has two powerful and fully programmable processors - a CPU (Central Processing Unit) and a GPU (Graphics Processing Unit). It enables distribution of graphics computations targeting an application's needs in more flexible ways. In this paper we discuss and analyze our implementation of a functionality-distributed point-based rendering pipeline with impressive performance improvements. To the best of our knowledge, it is the first attempt to devise a functionality distribution scheme for a large data and compute-intensive application. We discuss the merits and limitations of such a distribution scheme by comparing it against traditional data parallel and single node schemes. [ABSTRACT FROM AUTHOR]
Published: 2005
Full Text: View/download PDF

36. Mobile Pipelines: Parallelizing Left-Looking Algorithms Using Navigational Programming.

Author: Bader, David A., Parashar, Manish, Sridhar, Varadarajan, Prasanna, Viktor K., Pan, Lei, Lai, Ming Kin, Dillencourt, Michael B., and Bic, Lubomir F.
Abstract: We consider the class of "left-looking" sequential matrix algorithms: consumer-driven algorithms that are characterized by "lazy" propagation of data. Left-looking algorithms are difficult to parallelize using the message-passing or distributed shared memory models because they only possess pipeline parallelism. We show that these algorithms can be directly parallelized using mobile pipelines provided by the Navigational Programming methodology. We present performance data demonstrating the effectiveness of our approach. [ABSTRACT FROM AUTHOR]
Published: 2005
Full Text: View/download PDF

37. Orthogonal Decision Trees for Resource-Constrained Physiological Data Stream Monitoring Using Mobile Devices.

Author: Bader, David A., Parashar, Manish, Sridhar, Varadarajan, Prasanna, Viktor K., Dutta, Haimonti, Kargupta, Hillol, and Joshi, Anupam
Abstract: This paper considers the problem of monitoring physiological data streams obtained from resource-constrained wearable sensing devices for pervasive health-care management. It considers Orthogonal decision trees (ODTs) that offer an effective way to construct a redundancy-free, accurate, and meaningful representation of large decision-tree-ensembles often created by popular techniques such as Bagging, Boosting, Random Forests and many distributed and data stream mining algorithms. ODTs are functionally orthogonal to each other and they correspond to the principal components of the underlying function space. This paper offers experimental results to document the performance of ODTs on grounds of accuracy, model complexity, and resource consumption. [ABSTRACT FROM AUTHOR]
Published: 2005
Full Text: View/download PDF

38. Offloading Bloom Filter Operations to Network Processor for Parallel Query Processing in Cluster of Workstations.

Author: Bader, David A., Parashar, Manish, Sridhar, Varadarajan, Prasanna, Viktor K., Kumar, V. Santhosh, Thazhuthaveetil, M. J., and Govindarajan, R.
Abstract: Workstation clusters have high performance interconnects with programmable network processors, which facilitate interesting opportunities to offload certain application specific computation on them and hence enhance the performance of the parallel application. Our earlier work in this direction achieves enhanced performance and balanced utilization of resources by exploiting the programmable features of the network interface in parallel database query execution. In this paper, we extend our earlier work for studying parallel query execution with Bloom filters. We propose and evaluate a scheme to offload the Bloom filter operations to the network processor. Further we explore offloading certain tuple processing activities on to the network processor by adopting a network interface attached disk scheme. The above schemes yield a speedup of up to 1.13 over the base scheme with Bloom filter where all processing is done by the host processor and achieve balanced utilization of resources. In the presence of a disk buffer cache, which reduces both the disk and I/O traffic, offloading schemes improve the speedup to 1.24. [ABSTRACT FROM AUTHOR]
Published: 2005
Full Text: View/download PDF

39. Performance Study of LU Decomposition on the Programmable GPU.

Author: Bader, David A., Parashar, Manish, Sridhar, Varadarajan, Prasanna, Viktor K., Ino, Fumihiko, Matsui, Manabu, Goda, Keigo, and Hagihara, Kenichi
Abstract: With the increasing programmability of graphics processing units (GPUs), these units are emerging as an attractive computing platform not only for traditional graphics computation but also for general-purpose computation. In this paper, to study the performance of programmable GPUs, we describe the design and implementation of LU decomposition as an example of numerical computation. To achieve this, we have developed and evaluated some methods with different implementation approaches in terms of (a) loop processing, (b) branch processing, and (c) vector processing. The experimental results give four important points: (1) dependent loops must be implemented through the use of a render texture in order to avoid copies in the video random access memory (VRAM); (2) in most cases, branch processing can be efficiently handled by the CPU rather than the GPU; (3) as Fatahalian et al. state for matrix multiplication, we find that GPUs require higher VRAM cache bandwidth in order to provide full performance for LU decomposition; and (4) decomposition results obtained by GPUs usually differ from those by CPUs, mainly due to the floating-point division error that increases the numerical error with the progress of decomposition. [ABSTRACT FROM AUTHOR]
Published: 2005
Full Text: View/download PDF

40. High Performance RDMA Based All-to-All Broadcast for InfiniBand Clusters.

Author: Bader, David A., Parashar, Manish, Sridhar, Varadarajan, Prasanna, Viktor K., Sur, S., Bondhugula, U. K. R., Mamidala, A., Jin, H.-W., and Panda, D. K.
Abstract: The All-to-all broadcast collective operation is essential for many parallel scientific applications. This collective operation is called MPI_Allgather in the context of MPI. Contemporary MPI software stacks implement this collective on top of MPI point-to-point calls leading to several performance overheads. In this paper, we propose a design of All-to-All broadcast using the Remote Direct Memory Access (RDMA) feature offered by InfiniBand, an emerging high performance interconnect. Our RDMA based design eliminates the overheads associated with existing designs. Our results indicate that latency of the All-to-all Broadcast operation can be reduced by 30% for 32 processes and a message size of 32 KB. In addition, our design can improve the latency by a factor of 4.75 under no buffer reuse conditions for the same process count and message size. Further, our design can improve performance of a parallel matrix multiplication algorithm by 37% on eight processes, while multiplying a 256x256 matrix. [ABSTRACT FROM AUTHOR]
Published: 2005
Full Text: View/download PDF

41. Throughput Computing with Chip MultiThreading and Clusters.

Author: Bader, David A., Parashar, Manish, Sridhar, Varadarajan, Prasanna, Viktor K., Buddhikot, Mukund, and Goil, Sanjay
Abstract: Chip MultiThreading (CMT) based systems are being introduced in the market by several computer platform vendors. At the same time, cluster computing platforms are becoming prevalent in market segments which tend to be highly price/performance driven. This paper analyzes the architectural space of these two prominent computing paradigms in technical computing markets. This analysis is carried out in terms of the application turnaround time, throughput, and scalability across multiple threads of execution. Additionally, we introduce various subscription models to optimize application throughput and turnaround time. [ABSTRACT FROM AUTHOR]
Published: 2005
Full Text: View/download PDF

42. Supporting MPI-2 One Sided Communication on Multi-rail InfiniBand Clusters: Design Challenges and Performance Benefits.

Author: Bader, David A., Parashar, Manish, Sridhar, Varadarajan, Prasanna, Viktor K., Vishnu, Abhinav, Santhanaraman, Gopal, Huang, Wei, Jin, Hyun-Wook, and Panda, Dhabaleswar K.
Abstract: In cluster computing, InfiniBand has emerged as a popular high performance interconnect with MPI as the de facto programming model. However, even with InfiniBand, bandwidth can become a bottleneck for clusters executing communication intensive applications. Multi-rail cluster configurations with MPI-1 are being proposed to alleviate this problem. Recently, MPI-2 with support for one-sided communication is gaining significance. In this paper, we take the challenge of designing high performance MPI-2 one-sided communication on multi-rail InfiniBand clusters. We propose a unified MPI-2 design for different configurations of multi-rail networks (multiple ports, multiple HCAs and combinations). We present various issues associated with one-sided communication such as multiple synchronization messages, scheduling of RDMA (Read, Write) operations, ordering relaxation and discuss their implications on our design. Our performance results show that multi-rail networks can significantly improve MPI-2 one-sided communication performance. Using PCI-Express with two ports, we can achieve a peak MPI_Put bidirectional bandwidth of 2620 million bytes/s (MB/s), compared to 1910 MB/s for a single-rail implementation. For PCI-X with two HCAs, we can almost double the throughput and reduce the latency to half for large messages. [ABSTRACT FROM AUTHOR]
Published: 2005
Full Text: View/download PDF

43. A Symmetric Localization Algorithm for MANETs Based on Collapsing Coordinate Systems.

Author: Bader, David A., Parashar, Manish, Sridhar, Varadarajan, Prasanna, Viktor K., Srinivasa, Srinath, and Patil, Sanket
Abstract: Localization in mobile ad hoc networks (MANETs) is the process of fixing the position of a node according to some real or virtual coordinate system. In many cases, solutions like Global Positioning System (GPS) are not feasible. As a result, several algorithms have been developed for localization based purely on local communication. However, many of these suffer from one of the following: flooding of the network, requirement for global knowledge, or the requirement of "beacon" nodes, which know their absolute position according to GPS. At the very least, localization algorithms require parts of the system to be either static or relatively stable. In this paper, we propose a symmetric localization algorithm that performs fairly accurate localization. No special elements like beacons and other static elements are required; however, they are not excluded. [ABSTRACT FROM AUTHOR]
Published: 2005
Full Text: View/download PDF

44. Application of Reduced Order Modeling to Time Parallelization.

Author: Bader, David A., Parashar, Manish, Sridhar, Varadarajan, Prasanna, Viktor K., Srinivasan, Ashok, Yu, Yanan, and Chandra, Namas
Abstract: We recently proposed a new approach to parallelization, by decomposing the time domain, instead of the conventional space domain. This improves latency tolerance, and we demonstrated its effectiveness in a practical application, where it scaled to much larger numbers of processors than conventional parallelization. This approach is fundamentally based on dynamically predicting the state of a system from data of related simulations. In earlier work, we used knowledge of the science of the problem to perform the prediction. In complicated simulations, this is not feasible. In this work, we show how reduced order modeling can be used for prediction, without requiring much knowledge of the science. We demonstrate its effectiveness in an important nano-materials application. The significance of this work lies in proposing a novel approach, based on established mathematical theory, that permits effective parallelization of time. This has important applications in multi-scale simulations, especially in dealing with long time-scales. [ABSTRACT FROM AUTHOR]
Published: 2005
Full Text: View/download PDF

45. PENCAPS: A Parallel Application for Electrode Encased Grounding Systems Project.

Author: Bader, David A., Parashar, Manish, Sridhar, Varadarajan, Prasanna, Viktor K., Birchal, Marco Aurélio S., Vale, Maria Helena M., and Visacro, Silvério
Abstract: The design of concrete encased electrode grounding systems by conventional computation procedures is a time-consuming task, because the electromagnetic representation of the physical system requires the calculation of large full matrices. Recently, the possibility of parallelizing the procedures involved in such calculations led the authors to implement a C language parallel application, based on MPI (Message Passing Interface). This article presents the engineering problem associated with this development and the fundamental aspects regarding this application, including the evaluation of its efficiency for the solution of large grounding systems. [ABSTRACT FROM AUTHOR]
Published: 2005
Full Text: View/download PDF

46. Search-Optimized Suffix-Tree Storage for Biological Applications.

Author: Bader, David A., Parashar, Manish, Sridhar, Varadarajan, Prasanna, Viktor K., Bedathur, Srikanta J., and Haritsa, Jayant R.
Abstract: Suffix-trees are popular indexing structures for various sequence processing problems in biological data management. We investigate here the possibility of enhancing the search efficiency of disk-resident suffix-trees through customized layouts of tree-nodes to disk-pages. Specifically, we propose a new layout strategy, called Stellar, that provides significantly improved search performance on a representative set of real genomic sequences. Further, Stellar supports both the standard root-to-leaf lookup queries as well as sophisticated sequence-search algorithms that exploit the suffix-links of suffix-trees. Our results are encouraging with regard to the ultimate objective of seamlessly integrating sequence processing in database engines. [ABSTRACT FROM AUTHOR]
Published: 2005
Full Text: View/download PDF

47. Cost-Optimal Job Allocation Schemes for Bandwidth-Constrained Distributed Computing Systems.

Author: Bader, David A., Parashar, Manish, Sridhar, Varadarajan, Prasanna, Viktor K., Ghosh, Preetam, Basu, Kalyan, and Das, Sajal K.
Abstract: This paper formulates the job allocation problem in distributed systems with bandwidth-constrained nodes. The bandwidth limitations of the nodes play an important role in the design of cost-optimal job allocation schemes. In this paper, we present a pricing strategy for generalized distributed systems by formulating an incomplete information bargaining game on two variables (price and percentage of bandwidth allocated for distributed computing jobs at each node). Next, we present a cost-optimal job allocation scheme for single class jobs that involve the communication delay and hence link bandwidth. We show that our algorithms are comparable to existing job allocation algorithms in minimizing the expected system response time. [ABSTRACT FROM AUTHOR]
Published: 2005
Full Text: View/download PDF

48. A Fault Recovery Scheme for P2P Metacomputers.

Author: Bader, David A., Parashar, Manish, Sridhar, Varadarajan, Prasanna, Viktor K., Power, Keith, and Morrison, John P.
Abstract: Despite the leaps and bounds made by the P2P research field in the last few years, the benefit of this innovation has been constrained to a few areas, such as search, file-sharing, and storage. In particular, this innovation has had little significant impact in the field of distributed computing. There are several obstacles to be overcome in the development of any distributed computer, most notably: scalability, fault tolerance, security and load balancing. The difficulty of these is compounded in the dynamic, decentralized environment which characterizes the P2P arena. This paper presents a method of recovering from faults which exploits the distributed hash table functionality provided by modern overlay networks. Its effectiveness is evaluated experimentally using a proof of concept P2P distributed computer. It is hoped that by providing a solution to one of the obstacles, global, decentralized, dependable distributed computers will be one step closer to reality. Keywords: peer-to-peer, fault tolerance, recovery, decentralized, distributed computing, condensed graphs. [ABSTRACT FROM AUTHOR]
Published: 2005
Full Text: View/download PDF

49. A Distributed Location Identification Algorithm for Ad hoc Networks Using Computational Geometric Methods.

Author: Bader, David A., Parashar, Manish, Sridhar, Varadarajan, Prasanna, Viktor K., Sinha, Koushik, and DattaChowdhury, Atish
Abstract: We present here a novel approach where we identify a region within which a node is guaranteed to be found, in contrast to the existing approaches where no such confining region for a node can be guaranteed, but only the location could be estimated either with no definitive error bound or only with some probabilistic error. The location identification algorithm presented here minimizes the size of this region, using computational geometric methods. The proposed technique iteratively improves the region of residence of all the nodes in the network through the exchange of region information among neighbors in O(nD) time, where n and D are the number of nodes and diameter of the network respectively. Simulation results also show encouraging results with this approach. [ABSTRACT FROM AUTHOR]
Published: 2005
Full Text: View/download PDF

50. Criticality Driven Energy Aware Speculation for Speculative Multithreaded Processors.

Author: Bader, David A., Parashar, Manish, Sridhar, Varadarajan, Prasanna, Viktor K., Nagpal, Rahul, and Bhowmik, Anasua
Abstract: Speculative multithreaded architecture (SpMT) philosophy relies on aggressive speculative execution for improved performance. Aggressive speculative execution results in a significant wastage of dynamic energy due to useless computation in the event of mis-speculation. As energy consumption is becoming an important constraint in microprocessor design, it is extremely important to reduce such wastage of dynamic energy in SpMT processors in order to achieve a better performance to power ratio. Dynamic instruction criticality information can be effectively applied to control aggressive speculation in SpMT processors. In this paper, we present a model of micro-execution for SpMT processors to determine dynamic instruction criticality. We also present two novel techniques utilizing criticality information, namely delaying non-critical loads and criticality based thread-prediction for reducing useless computation and energy consumption. Our experiments show 17.71% and 11.63% reduction in dynamic energy for criticality based thread prediction and criticality based delayed load scheme respectively while the corresponding improvements in dynamic energy delay products are 13.93% and 5.54%. [ABSTRACT FROM AUTHOR]
Published: 2005
Full Text: View/download PDF
