Journal: ieee transactions on parallel & distributed systems / Topic: computer architecture - Searchworks@Jio Institute Digital Library Search Results

1. Critique of “MemXCT: Memory-Centric X-Ray CT Reconstruction With Massive Parallelization” by SCC Team From the University of Texas at Austin.

Author: Davis, Brock, Paez, Juan, Gaither, Jack, and Garcia, Joe A.
Subjects: COMPUTED tomography, VIRTUAL machine systems, X-rays, GRAPHICS processing units, MICROSOFT Azure (Computing platform), COMPUTER workstation clusters
Abstract: This report describes The University of Texas Student Cluster Competition team’s effort to reproduce the results of “MemXCT: memory-centric X-ray CT reconstruction with massive parallelization” (Hidayetoğlu et al., 2019). The article details a new memory-centric approach that reconstructs X-ray computed tomography (XCT) from noisy raw data. In our reproduction experiments, we utilized Microsoft Azure’s CycleCloud tool to provision, orchestrate, and manage our computing cluster in the cloud. In particular, we scheduled and benchmarked reconstruction workloads using Azure’s CPU-based HC44rs and GPU-based NC12s v2 virtual machine (VM) types to evaluate the scalability properties of the reconstruction approach and the performance differences between architectures. The HC44rs VMs contained 44 Intel Xeon Platinum cores, while the NC12s v2 VM was equipped with two NVIDIA P100 GPUs. We used a recent version of Intel’s compiler stack with the MKL library for our CPU code along with CUDA 11.1 on GPUs. Overall, our results confirm the findings of the original article, demonstrating similar acceleration on GPUs and scalability properties on CPUs. Digital artifacts from these experiments are available at: 10.5281/zenodo.5598108 [ABSTRACT FROM AUTHOR]
Published: 2022
Full Text: View/download PDF

2. A Survey of Desktop Grid Scheduling.

Author: Ivashko, Evgeny, Chernov, Ilya, and Nikitina, Natalia
Subjects: GRID computing, COMPUTER scheduling, PARALLEL algorithms, PEER-to-peer architecture (Computer networks), MIDDLEWARE
Abstract: The paper surveys the state of the art of task scheduling in Desktop Grid computing systems. We describe the general architecture of a Desktop Grid system and the computing model adopted by the BOINC middleware. We summarize research papers to bring together and examine the optimization criteria and methods proposed by researchers so far for improving Desktop Grid task scheduling (assigning tasks to computing nodes by a server). In addition, we review related papers, which address non-regular architectures, like hierarchical and peer-to-peer Desktop Grids, as well as Desktop Grids designed for solving interdependent tasks. [ABSTRACT FROM AUTHOR]
Published: 2018
Full Text: View/download PDF

3. A Novel Compute-Efficient Tridiagonal Solver for Many-Core Architectures.

Author: Liu, Kan and Xue, Wei
Subjects: COMPUTER architecture, GRAPHICS processing units
Abstract: The tridiagonal solver is an important kernel and is widely supported in mainstream numerical libraries. While parallel algorithms have been studied for many-core architectures, the performance of current algorithms and implementations is still hindered by input size sensitivity and cross-platform portability. In this paper, we propose a novel algorithm WM-pGE for the batched solution of diagonally dominant tridiagonal systems. The algorithm balances the key design objectives, including computation complexity, memory complexity, parallelism, and input size sensitivity, better than existing algorithms. Moreover, an elegant formulation is presented to show the implementation and cross-platform optimization without loss of efficiency and generality, by extracting the platform-dependent works into only four vector operators. The results from our batched tridiagonal experiments show that the proposed algorithm outperforms the prior work PCR-pThomas by 25% and 12% on NVIDIA Tesla V100 in single and double precision, respectively. On Intel KNL, our method achieves a 10% improvement in performance over PCR-pThomas in double precision. [ABSTRACT FROM AUTHOR]
Published: 2023
Full Text: View/download PDF

4. A Survey of Techniques for Architecting and Managing GPU Register File.

Author: Mittal, Sparsh
Subjects: COMPUTER architecture, GRAPHICS processing units, NONVOLATILE memory, EMBEDDED computer systems, REGISTERS (Computers)
Abstract: To support their massively-multithreaded architecture, GPUs use very large register file (RF) which has a capacity higher than even L1 and L2 caches. In total contrast, traditional CPUs use tiny RF and much larger caches to optimize latency. Due to these differences, along with the crucial impact of RF in determining GPU performance, novel and intelligent techniques are required for managing GPU RF. In this paper, we survey the techniques for designing and managing GPU RF. We discuss techniques related to performance, energy and reliability aspects of RF. To emphasize the similarities and differences between the techniques, we classify them along several parameters. The aim of this paper is to synthesize the state-of-art developments in RF management and also stimulate further research in this area. [ABSTRACT FROM PUBLISHER]
Published: 2017
Full Text: View/download PDF

5. Survey on Real-Time Networks-on-Chip.

Author: Hesham, Salma, Rettkowski, Jens, Goehringer, Diana, and Abd El Ghany, Mohamed A.
Subjects: MULTIPROCESSORS, SYSTEMS on a chip, PARALLEL programs (Computer programs), NETWORKS on a chip, QUALITY of service
Abstract: Multi-Processor Systems-on-Chip (MPSoCs) have emerged as an evolution trend to meet the growing complexity of embedded applications with increasing computation parallelism. Particularly, real-time applications make out a significant portion of the embedded field. Networks-on-Chip (NoCs) are the backbone of communications in an MPSoC platform. However, the use of NoCs in real-time systems imposes complex constraints on the overall design. This paper discusses the challenges faced, when designing NoCs for real-time applications. Contributions in this area are surveyed on the level of guaranteed Quality-of-Service (QoS) support, adaptivity, and energy efficient techniques. Furthermore, the evaluation methodologies and experimental performance measurements of real-time NoCs are examined. This survey provides a comprehensive overview of existing endeavors in real-time NoCs and gives an insight towards future promising research points in this field. [ABSTRACT FROM AUTHOR]
Published: 2017
Full Text: View/download PDF

6. Dissecting Tensor Cores via Microbenchmarks: Latency, Throughput and Numeric Behaviors.

Author: Sun, Wei, Li, Ang, Geng, Tong, Stuijk, Sander, and Corporaal, Henk
Subjects: MATRIX multiplications, SPARSE matrices, APPLICATION program interfaces, GRAPHICS processing units
Abstract: Tensor Cores have been an important unit to accelerate Fused Matrix Multiplication Accumulation (MMA) in all NVIDIA GPUs since Volta Architecture. To program Tensor Cores, users have to use either legacy wmma APIs or current mma APIs. Legacy wmma APIs are more easy-to-use but can only exploit limited features and power of Tensor Cores. Specifically, wmma APIs support fewer operand shapes and can not leverage the new sparse matrix multiplication feature of the newest Ampere Tensor Cores. However, the performance of current programming interface has not been well explored. Furthermore, the computation numeric behaviors of low-precision floating points (TF32, BF16, and FP16) supported by the newest Ampere Tensor Cores are also mysterious. In this paper, we explore the throughput and latency of current programming APIs. We also intuitively study the numeric behaviors of Tensor Cores MMA and profile the intermediate operations including multiplication, addition of inner product, and accumulation. All codes used in this work can be found in https://github.com/sunlex0717/DissectingTensorCores. [ABSTRACT FROM AUTHOR]
Published: 2023
Full Text: View/download PDF

7. Predicting Throughput of Distributed Stochastic Gradient Descent.

Author: Li, Zhuojin, Paolieri, Marco, Golubchik, Leana, Lin, Sung-Han, and Yan, Wumo
Subjects: OCCUPATIONAL training, MULTICASTING (Computer networks), ASYNCHRONOUS learning, EMPLOYEE training, FORECASTING, COMPUTER architecture
Abstract: Training jobs of deep neural networks (DNNs) can be accelerated through distributed variants of stochastic gradient descent (SGD), where multiple nodes process training examples and exchange updates. The total throughput of the nodes depends not only on their computing power, but also on their networking speeds and coordination mechanism (synchronous or asynchronous, centralized or decentralized), since communication bottlenecks and stragglers can result in sublinear scaling when additional nodes are provisioned. In this paper, we propose two classes of performance models to predict throughput of distributed SGD: fine-grained models, representing many elementary computation/communication operations and their dependencies; and coarse-grained models, where SGD steps at each node are represented as a sequence of high-level phases without parallelism between computation and communication. Using a PyTorch implementation, real-world DNN models and different cloud environments, our experimental evaluation illustrates that, while fine-grained models are more accurate and can be easily adapted to new variants of distributed SGD, coarse-grained models can provide similarly accurate predictions when augmented with ad hoc heuristics, and their parameters can be estimated with profiling information that is easier to collect. [ABSTRACT FROM AUTHOR]
Published: 2022
Full Text: View/download PDF

8. TC-Stream: Large-Scale Graph Triangle Counting on a Single Machine Using GPUs.

Author: Huang, Jianqiang, Wang, Haojie, Fei, Xiang, Wang, Xiaoying, and Chen, Wenguang
Subjects: GRAPH algorithms, SOLID state drives, GRAPHICS processing units, PARALLEL algorithms, TRIANGLES, ON-demand computing, COUNTING, SOCIAL network analysis
Abstract: In this paper, we build a TC-Stream, a high-performance graph processing system specific for a triangle counting algorithm on graph data with up to tens of billions of edges, which significantly exceeds the device memory capacity of Graphics Processing Units (GPUs). The triangle counting problem is a broad research topic in data mining and social network analysis in the graph processing field. As the scale of the graph data grows, a portion of the graph data must be loaded iteratively. In the existing literature, graphs with billions of edges need to be done distributively, which is cost-intensive. Also, many disk-based triangle counting systems are proposed for CPU architectures, but their tackling performances are inefficient. To solve the above problem, we propose TC-Stream, and it focuses on three issues: 1) For power-law graphs, because the amount of tasks of each vertex or edge is inconsistent, it is bound to cause different demands of computing and memory resources for different task types. We propose a parallel vertex approach and the reordering of vertices for graph data that can be placed in the GPU device memory to ensure the maximum workload balancing; 2) A binary-search-based set intersection method is designed to achieve the maximum parallelism in GPU; 3) For the graph data that exceeds the GPU device memory capacity, we develop a novel vertical partition algorithm to guarantee the independent computing on each partition so that the three computation processes, i.e., the computation on GPU, the data transmission between main memory of CPU and SSD, and the communication between the CPU and the GPU can be perfectly overlapped. Moreover, the TC-Stream optimizes edge-iterator models and benefits from multi-thread parallelism. Extensive experiments conducted on large-scale datasets showed that the TC-Stream running on a single Tesla V100 GPU performs 2.4-6× and 1.8-4.4× faster than the state-of-the-art single-machine in-memory triangle counting system and GPU-based triangle counting system, respectively, and achieves 2.4× faster than the state-of-the-art out-of-core distributed system PDTL running on an 8-node cluster when processing the graph data with 42.5 billion edges, which demonstrates the high performance and cost-effectiveness of the TC-Stream. [ABSTRACT FROM AUTHOR]
Published: 2022
Full Text: View/download PDF

9. Guest Editor's Introduction: Special Section on Power-Aware Parallel and Distributed Computing (PAPADS).

Author: Ahmad, Ishfaq, Cameron, Kirk W., and Melhem, Rami
Subjects: ENERGY consumption, COMPUTER architecture
Abstract: The article discusses various topics published within the issue including one on design power-efficient architectures, one on prioritizing power saving among various computer components, and one on developing means of saving system-wide energy.
Published: 2008
Full Text: View/download PDF

10. Solving Computation Slicing Using Predicate Detection.

Author: Mittal, Neeraj, Sen, Alper, and Garg, Vijay K.
Subjects: COMPUTER software testing, DEBUGGING, ELECTRONIC data processing, COMPUTER programming, COMPUTER architecture, COMPUTER algorithms, COMPUTER interfaces, COMPUTER engineering, COMPUTER science
Abstract: Given a distributed computation and a global predicate, predicate detection involves determining whether there exists at least one consistent cut (or global state) of the computation that satisfies the predicate. On the other hand, computation slicing is concerned with computing the smallest subcomputation (with the least number of consistent cuts) that contains all consistent cuts of the computation satisfying the predicate. In this paper, we investigate the relationship between predicate detection and computation slicing and show that the two problems are actually equivalent. Specifically, given an algorithm to detect a predicate b in a computation C, we derive an algorithm to compute the slice of C with respect to b. The time complexity of the (derived) slicing algorithm is O(n|E|T), where n is the number of processes, E is the set of events, and O(T) is the time complexity of the detection algorithm. We discuss how the "equivalence" result of this paper can be utilized to derive a faster algorithm for solving the general predicate detection problem in many cases. Slicing algorithms described in our earlier papers are all offline in nature. In this paper, we also present two online algorithms for computing the slice. The first algorithm can be used to compute the slice for a general predicate. Its amortized time complexity is O(n(c+n)T) per event, where c is the average concurrency in the computation and O(T) is the time complexity of the detection algorithm. The second algorithm can be used to compute the slice for a regular predicate. Its amortized time complexity is only O(n^2) per event. [ABSTRACT FROM AUTHOR]
Published: 2007
Full Text: View/download PDF

11. Symmetric Indefinite Linear Solver Using OpenMP Task on Multicore Architectures.

Author: Yamazaki, Ichitaro, Kurzak, Jakub, Wu, Panruo, Zounon, Mawussi, and Dongarra, Jack
Subjects: MULTIPROCESSORS, LINEAR algebra, PERFORMANCE of distributed shared memory, LINEAR systems, KERNEL operating systems
Abstract: Recently, the Open Multi-Processing (OpenMP) standard has incorporated task-based programming, where a function call with input and output data is treated as a task. At run time, OpenMP’s superscalar scheduler tracks the data dependencies among the tasks and executes the tasks as their dependencies are resolved. On a shared-memory architecture with multiple cores, the independent tasks are executed on different cores in parallel, thereby enabling parallel execution of a seemingly sequential code. With the emergence of many-core architectures, this type of programming paradigm is gaining attention—not only because of its simplicity, but also because it breaks the artificial synchronization points of the program and improves its thread-level parallelization. In this paper, we use these new OpenMP features to develop a portable high-performance implementation of a dense symmetric indefinite linear solver. Obtaining high performance from this kind of solver is a challenge because the symmetric pivoting, which is required to maintain numerical stability, leads to data dependencies that prevent us from using some common performance-improving techniques. To fully utilize a large number of cores through tasking, while conforming to the OpenMP standard, we describe several techniques. Our performance results on current many-core architectures—including Intel’s Broadwell, Intel’s Knights Landing, IBM’s Power8, and Arm’s ARMv8—demonstrate the portable and superior performance of our implementation compared with the Linear Algebra PACKage (LAPACK). The resulting solver is now available as a part of the PLASMA software package. [ABSTRACT FROM AUTHOR]
Published: 2018
Full Text: View/download PDF

12. Exploring Data Analytics Without Decompression on Embedded GPU Systems.

Author: Pan, Zaifeng, Zhang, Feng, Zhou, Yanliang, Zhai, Jidong, Shen, Xipeng, Mutlu, Onur, and Du, Xiaoyong
Subjects: GRAPHICS processing units, COMPUTER architecture, ENERGY consumption, RANDOM access memory
Abstract: With the development of computer architecture, even for embedded systems, GPU devices can be integrated, providing outstanding performance and energy efficiency to meet the requirements of different industries, applications, and deployment environments. Data analytics is an important application scenario for embedded systems. Unfortunately, due to the limitation of the capacity of the embedded device, the scale of problems handled by the embedded system is limited. In this paper, we propose a novel data analytics method, called G-TADOC, for efficient text analytics directly on compression on embedded GPU systems. A large amount of data can be compressed and stored in embedded systems, and can be processed directly in the compressed state, which greatly enhances the processing capabilities of the systems. Particularly, G-TADOC has three innovations. First, a novel fine-grained thread-level workload scheduling strategy for GPU threads has been developed, which partitions heavily-dependent loads adaptively in a fine-grained manner. Second, a GPU thread-safe memory pool has been developed to handle inconsistency with low synchronization overheads. Third, a sequence-support strategy is provided to maintain high GPU parallelism while ensuring sequence information for lossless compression. Moreover, G-TADOC involves special optimizations for embedded GPUs, such as utilizing the CPU-GPU shared unified memory. Experiments show that G-TADOC provides 13.2× average speedup compared to the state-of-the-art TADOC. G-TADOC also improves performance-per-cost by 2.6× and energy efficiency by 32.5× over TADOC. [ABSTRACT FROM AUTHOR]
Published: 2022
Full Text: View/download PDF

13. BFS-4K: An Efficient Implementation of BFS for Kepler GPU Architectures.

Author: Busato, Federico and Bombieri, Nicola
Subjects: SEARCH algorithms, GRAPHICS processing units, COMPUTER architecture, GRAPH theory, PERFORMANCE evaluation, PARALLEL processing, EDGE detection (Image processing)
Abstract: Breadth-first search (BFS) is one of the most common graph traversal algorithms and the building block for a wide range of graph applications. With the advent of graphics processing units (GPUs), several works have been proposed to accelerate graph algorithms and, in particular, BFS on such many-core architectures. Nevertheless, BFS has proven to be an algorithm for which it is hard to obtain better performance from parallelization. Indeed, the proposed solutions take advantage of the massively parallelism of GPUs but they are often asymptotically less efficient than the fastest CPU implementations. This paper presents BFS-4K, a parallel implementation of BFS for GPUs that exploits the more advanced features of GPU-based platforms (i.e., NVIDIA Kepler) and that achieves an asymptotically optimal work complexity. The paper presents different strategies implemented in BFS-4K to deal with the potential workload imbalance and thread divergence caused by any actual graph non-homogeneity. The paper presents the experimental results conducted on several graphs of different size and characteristics to understand how the proposed techniques are applied and combined to obtain the best performance from the parallel BFS visits. Finally, an analysis of the most representative BFS implementations for GPUs at the state of the art and their comparison with BFS-4K are reported to underline the efficiency of the proposed solution. [ABSTRACT FROM PUBLISHER]
Published: 2015
Full Text: View/download PDF

14. Exploring New Opportunities to Defeat Low-Rate DDoS Attack in Container-Based Cloud Environment.

Author: Li, Zhi, Jin, Hai, Zou, Deqing, and Yuan, Bin
Subjects: DENIAL of service attacks, CLOUDS & the environment, QUEUING theory, QUALITY of service, COMPUTER crimes
Abstract: DDoS attacks are rampant in cloud environments and continually evolve into more sophisticated and intelligent modalities, such as low-rate DDoS attacks. But meanwhile, the cloud environment is also developing in constant. Now container technology and microservice architecture are widely applied in cloud environment and compose container-based cloud environment. Comparing with traditional cloud environments, the container-based cloud environment is more lightweight in virtualization and more flexible in scaling service. Naturally, a question that arises is whether these new features of container-based cloud environment will bring new possibilities to defeat DDoS attacks. In this paper, we establish a mathematical model based on queueing theory to analyze the strengths and weaknesses of the container-based cloud environment in defeating low-rate DDoS attack. Based on this, we propose a dynamic DDoS mitigation strategy, which can dynamically regulate the number of container instances serving for different users and coordinate the resource allocation for these instances to maximize the quality of service. And extensive simulations and testbed-based experiments demonstrate our strategy can make the limited system resources be utilized sufficiently to maintain the quality of service acceptable and defeat DDoS attack effectively in the container-based cloud environment. [ABSTRACT FROM AUTHOR]
Published: 2020
Full Text: View/download PDF

15. Optimizing Finite Volume Method Solvers on Nvidia GPUs.

Author: Xu, Jingheng, Yang, Guangwen, Fu, Haohuan, Luk, Wayne, Gan, Lin, Shi, Wen, Xue, Wei, Yang, Chao, Jiang, Yong, and He, Conghui
Subjects: FINITE volume method, CACHE memory, GRAPHICS processing units
Abstract: As scientific applications are increasingly ported to GPUs to benefit from both the powerful computing capacity and high throughput, accelerating explicit solvers for GPU-based finite volume methods is gaining more and more attention. In this paper, based on the detailed analysis of the FVM algorithm, we present a set of novel optimization methods, including the explicit data cache mechanism, optimal global memory loading strategy, as well as the inner-thread rescheduling method, which derives a suitable mapping from the solver algorithm to the underlying GPU hardware architecture, so as to remarkably improve the solving performance of structured mesh based FVM. We demonstrate the impact of our tuning techniques on two widely-used atmospheric dynamic kernels (3-D Euler and 2-D SWE) on five kinds of mainstream GPU platforms, and make a detailed analysis of the different tuning methodologies so as to demonstrate how to select the proper tuning strategy to different applications on various GPU platforms. Specifically, 93.9x speedup is achieved for the 3D Euler solver on Nvidia V100 over one 12-core Intel E5-2697 (v2) CPU, which is a 77 percent improvement compared with the original speedup without adopting the tuning techniques presented in this work. [ABSTRACT FROM AUTHOR]
Published: 2019
Full Text: View/download PDF

16. GPU Implementation of Bitplane Coding with Parallel Coefficient Processing for High Performance Image Compression.

Author: Enfedaque, Pablo, Auli-Llinas, Francesc, and Moure, Juan Carlos
Subjects: GRAPHICS processing units, CODING theory, IMAGE compression, COMPUTER algorithms, PARALLEL processing
Abstract: The fast compression of images is a requisite in many applications like TV production, teleconferencing, or digital cinema. Many of the algorithms employed in current image compression standards are inherently sequential. High performance implementations of such algorithms often require specialized hardware like field integrated gate arrays. Graphics Processing Units (GPUs) do not commonly achieve high performance on these algorithms because they do not exhibit fine-grain parallelism. Our previous work introduced a new core algorithm for wavelet-based image coding systems. It is tailored for massive parallel architectures. It is called bitplane coding with parallel coefficient processing (BPC-PaCo). This paper introduces the first high performance, GPU-based implementation of BPC-PaCo. A detailed analysis of the algorithm aids its implementation in the GPU. The main insights behind the proposed codec are an efficient thread-to-data mapping, a smart memory management, and the use of efficient cooperation mechanisms to enable inter-thread communication. Experimental results indicate that the proposed implementation matches the requirements for high resolution (4K) digital cinema in real time, yielding speedups of 30× with respect to the fastest implementations of current compression standards. Also, a power consumption evaluation shows that our implementation consumes 40× less energy for equivalent performance than state-of-the-art methods. [ABSTRACT FROM AUTHOR]
Published: 2017
Full Text: View/download PDF

17. Task Scheduling Techniques for Asymmetric Multi-Core Systems.

Author: Chronaki, Kallia, Rico, Alejandro, Casas, Marc, Moreto, Miquel, Badia, Rosa M., Ayguade, Eduard, Labarta, Jesus, and Valero, Mateo
Subjects: MULTICORE processors, PRODUCTION scheduling, HIGH performance computing, PARALLEL programming, COMPUTER architecture
Abstract: As performance and energy efficiency have become the main challenges for next-generation high-performance computing, asymmetric multi-core architectures can provide solutions to tackle these issues. Parallel programming models need to be able to suit the needs of such systems and keep on increasing the application’s portability and efficiency. This paper proposes two task scheduling approaches that target asymmetric systems. These dynamic scheduling policies reduce total execution time either by detecting the longest or the critical path of the dynamic task dependency graph of the application, or by finding the earliest executor of a task. They use dynamic scheduling and information discoverable during execution, a fact that makes them implementable and functional without the need of off-line profiling. In our evaluation we compare these scheduling approaches with two existing state-of-the-art heterogeneous schedulers and we track their improvement over a FIFO baseline scheduler. We show that the heterogeneous schedulers improve the baseline by up to 1.45× in a real 8-core asymmetric system and up to 2.1× in a simulated 32-core asymmetric chip. [ABSTRACT FROM AUTHOR]
Published: 2017
Full Text: View/download PDF

18. Trajectory Pattern Mining for Urban Computing in the Cloud.

Author: Altomare, Albino, Cesario, Eugenio, Comito, Carmela, Marozzo, Fabrizio, and Talia, Domenico
Subjects: CLOUD computing, DATA mining, COMPUTER architecture, DETECTORS, RADIO frequency identification systems
Abstract: The increasing pervasiveness of mobile devices along with the use of technologies like GPS, Wifi networks, RFID, and sensors, allows for the collections of large amounts of movement data. This amount of data can be analyzed to extract descriptive and predictive models that can be properly exploited to improve urban life. From a technological viewpoint, Cloud computing can play an essential role by helping city administrators to quickly acquire new capabilities and reducing initial capital costs by means of a comprehensive pay-as-you-go solution. This paper presents a workflow-based parallel approach for discovering patterns and rules from trajectory data, in a Cloud-based framework. Experimental evaluation has been carried out on both real-world and synthetic trajectory data, up to one million of trajectories. The results show that, due to the high complexity and large volumes of data involved in the application scenario, the trajectory pattern mining process takes advantage from the scalable execution environment offered by a Cloud architecture in terms of both execution time, speed-up and scale-up. [ABSTRACT FROM PUBLISHER]
Published: 2017
Full Text: View/download PDF

19. Repurposing GPU Microarchitectures with Light-Weight Out-Of-Order Execution.

Author: Iliakis, Konstantinos, Xydis, Sotirios, and Soudris, Dimitrios
Subjects: GRAPHICS processing units, PARALLEL processing, VERNACULAR architecture, COMPUTER architecture
Abstract: GPU is the dominant platform for accelerating general-purpose workloads due to its computing capacity and cost-efficiency. GPU applications cover an ever-growing range of domains. To achieve high throughput, GPUs rely on massive multi-threading and fast context switching to overlap computations with memory operations. We observe that among the diverse GPU workloads, there exists a significant class of kernels that fail to maintain a sufficient number of active warps to hide the latency of memory operations, and thus suffer from frequent stalling. We argue that the dominant Thread-Level Parallelism model is not enough to efficiently accommodate the variability of modern GPU applications. To address this inherent inefficiency, we propose a novel micro-architecture with lightweight Out-Of-Order execution capability enabling Instruction-Level Parallelism to complement the conventional Thread-Level Parallelism model. To minimize the hardware overhead, we carefully design our extension to highly re-use the existing micro-architectural structures and study various design trade-offs to contain the overall area and power overhead, while providing improved performance. We show that the proposed architecture outperforms traditional platforms by 23 percent on average for low-occupancy kernels, with an area and power overhead of 1.29 and 10.05 percent, respectively. Finally, we establish the potential of our proposal as a micro-architecture alternative by providing 16 percent speedup over a wide collection of 60 general-purpose kernels. [ABSTRACT FROM AUTHOR]
Published: 2022
Full Text: View/download PDF

20. Performance-Aware Model for Sparse Matrix-Matrix Multiplication on the Sunway TaihuLight Supercomputer.

Author: Chen, Yuedan, Li, Kenli, Yang, Wangdong, Xiao, Guoqing, Xie, Xianghui, and Li, Tao
Subjects: SPARSE matrices, COMPUTER architecture, PARALLEL processing, SUPERCOMPUTERS, KERNEL (Mathematics)
Abstract: General sparse matrix-sparse matrix multiplication (SpGEMM) is one of the fundamental linear operations in a wide variety of scientific applications. To implement efficient SpGEMM for many large-scale applications, this paper proposes scalable and optimized SpGEMM kernels based on COO, CSR, ELL, and CSC formats on the Sunway TaihuLight supercomputer. First, a multi-level parallelism design for SpGEMM is proposed to exploit the parallelism of over 10 millions cores and better control memory based on the special Sunway architecture. Optimization strategies, such as load balance, coalesced DMA transmission, data reuse, vectorized computation, and parallel pipeline processing, are applied to further optimize performance of SpGEMM kernels. Second, we thoroughly analyze the performance of the proposed kernels. Third, a performance-aware model for SpGEMM is proposed to select the most appropriate compressed storage formats for the sparse matrices that can achieve the optimal performance of SpGEMM on the Sunway. The experimental results show the SpGEMM kernels have good scalability and meet the challenge of the high-speed computing of large-scale data sets on the Sunway. In addition, the performance-aware model for SpGEMM achieves an absolute value of relative error rate of 8.31 percent on average when the kernels are executed in one single process and achieves 8.59 percent on average when the kernels are executed in multiple processes. It is proved that the proposed performance-aware model can perform at high accuracy and satisfies the precision of selecting the best formats for SpGEMM on the Sunway TaihuLight supercomputer. [ABSTRACT FROM AUTHOR]
Published: 2019
Full Text: View/download PDF

21. Parana: A Parallel Neural Architecture Considering Thermal Problem of 3D Stacked Memory.

Author: Yin, Shouyi, Tang, Shibin, Lin, Xinhan, Ouyang, Peng, Tu, Fengbin, Liu, Leibo, Zhao, Jishen, Xu, Cong, Li, Shuangcheng, Xie, Yuan, and Wei, Shaojun
Subjects: DEEP learning, NEURAL circuitry, ARTIFICIAL intelligence, VIDEO surveillance, MACHINE learning
Abstract: Recent advances in deep learning (DL) have stimulated increasing interests in neural networks (NN). From the perspective of operation type and network architecture, deep neural networks can be categorized into full convolution-based neural network (ConvNet), recurrent neural network (RNN), and fully-connected neural network (FCNet). Different types of neural networks are usually cascaded and combined as a hybrid neural network (Hybrid-NN) to complete real-life cognitive tasks. Such hybrid-NN implementation is memory-intensive with large number of memory accesses, hence the performance of hybrid-NN is often limited by the insufficient memory bandwidth. A “3D + 2.5D” integration system, which integrates a high-bandwidth 3D stacked DRAM side-by-side with a highly-parallel neural processing unit (NPU) on a silicon interposer, overcomes the bandwidth bottleneck in hybrid-NN acceleration. However, intensive concurrent 3D DRAM accesses produced by the NPU lead to a serious thermal problem in 3D DRAM. In this paper, we propose a neural processor called Parana for hybrid-NN acceleration in consideration of thermal problem of 3D DRAM. Parana solves the thermal problem of 3D memory by optimizing both the total number of memory accesses and memory accessing behaviors. For memory accessing behaviors, Parana balances the memory bandwidth by spatial division mapping hybrid-NN onto computing resources, which efficiently avoids that masses of memory accesses are issued in a short time period. To reduce the total number of memory accesses, we design a new NPU architecture and propose a memory-oriented tiling and scheduling mechanism to exploit the maximum utilization of on-chip buffer. Experimental results show that Parana reduces the peak temperature by up to 54.72 °C and the steady temperature by up to 32.27 °C over state-of-the-art accelerators with 3D memory without performance degradation. [ABSTRACT FROM AUTHOR]
Published: 2019
Full Text: View/download PDF

22. A Self-Adaptive Network for HPC Clouds: Architecture, Framework, and Implementation.

Author: Zahid, Feroz, Taherkordi, Amir, Gran, Ernst Gunnar, Skeie, Tor, and Johnsen, Bjorn Dag
Subjects: CLOUD computing, HIGH performance computing, ADAPTIVE control systems, COMPUTER architecture, ROUTING (Computer network management)
Abstract: Clouds offer flexible and economically attractive compute and storage solutions for enterprises. However, the effectiveness of cloud computing for high-performance computing (HPC) systems still remains questionable. When clouds are deployed on lossless interconnection networks, like InfiniBand (IB), challenges related to load-balancing, low-overhead virtualization, and performance isolation hinder full potential utilization of the underlying interconnect. Moreover, cloud data centers incorporate a highly dynamic environment rendering static network reconfigurations, typically used in IB systems, infeasible. In this paper, we present a framework for a self-adaptive network architecture for HPC clouds based on lossless interconnection networks, demonstrated by means of our implemented IB prototype. Our solution, based on a feedback control and optimization loop, enables the lossless HPC network to dynamically adapt to the varying traffic patterns, current resource availability, workload distributions, and also in accordance with the service provider-defined policies. Furthermore, we present IBAdapt, a simplified ruled-based language for the service providers to specify adaptation strategies used by the framework. Our developed self-adaptive IB network prototype is demonstrated using state-of-the-art industry software. The results obtained on a test cluster demonstrate the feasibility and effectiveness of the framework when it comes to improving Quality-of-Service compliance in HPC clouds. [ABSTRACT FROM AUTHOR]
Published: 2018
Full Text: View/download PDF

23. Power/Performance/Thermal Design-Space Exploration for Multicore Architectures.

Author: Monchiero, Matteo, Canal, Ramon, and Gonzalez, Antonio
Subjects: COMPUTER architecture, MICROPROCESSOR design & construction, MEMORY hierarchy (Computer science), INTEGRATED circuit interconnections, DISTRIBUTED shared memory, COMPUTATIONAL complexity, DISTRIBUTED computing
Abstract: Multicore architectures have been ruling the recent microprocessor design trend. This is due to different reasons: better performance, thread-level parallelism bounds in modern applications, ILP diminishing returns, better thermal/power scaling (many small cores dissipate less than a large and complex one), and the ease and reuse of design. This paper presents a thorough evaluation of multicore architectures. The architecture that we target is composed of a configurable number of cores, a memory hierarchy consisting of private L1, shared/private L2, and a shared bus interconnect. We consider a benchmark set composed of several parallel shared memory applications. We explore the design space related to the number of cores, L2 cache size, and processor complexity, showing the behavior of the different configurations/applications with respect to performance, energy consumption, and temperature. Design trade-offs are analyzed, stressing the interdependency of the metrics and design factors. In particular, we evaluate several chip floorplans. Their power/thermal characteristics are analyzed, showing the importance of considering thermal effects at the architectural level to achieve the best design choice. [ABSTRACT FROM AUTHOR]
Published: 2008
Full Text: View/download PDF

24. Towards Exploring Data-Intensive Scientific Applications at Extreme Scales through Systems and Simulations.

Author: Zhao, Dongfang, Liu, Ning, Kimpe, Dries, Ross, Robert, Sun, Xian-He, and Raicu, Ioan
Subjects: INFORMATION storage & retrieval systems, HIGH performance computing research, SUPERCOMPUTERS, CYBERINFRASTRUCTURE, ELECTRONIC file management
Abstract: The state-of-the-art storage architecture of high-performance computing systems was designed decades ago, and with today's scale and level of concurrency, it is showing significant limitations. Our recent work proposed a new architecture to address the I/O bottleneck of the conventional wisdom, and the system prototype (FusionFS) demonstrated its effectiveness on up to 16 K nodes—the scale on par with today's largest supercomputers. The main objective of this paper is to investigate FusionFS's scalability towards exascale. Exascale computers are predicted to emerge by 2018, comprising millions of cores and billions of threads. We built an event-driven simulator (FusionSim) according to the FusionFS architecture, and validated it with FusionFS's traces. FusionSim introduced less than 4 percent error between its simulation results and FusionFS traces. With FusionSim we simulated workloads on up to two million nodes and found almost linear scalability of I/O performance; results justified FusionFS's viability for exascale systems. In addition to the simulation work, this paper extends the FusionFS system prototype in the following perspectives: (1) the fault tolerance of file metadata is supported, (2) the limitations of the current system design are discussed, and (3) a more thorough performance evaluation is conducted, such as N-to-1 metadata write, system efficiency, and more platforms such as Amazon Cloud. [ABSTRACT FROM PUBLISHER]
Published: 2016
Full Text: View/download PDF

25. An Efficient Privacy-Preserving Ranked Keyword Search Method.

Author: Chen, Chi, Zhu, Xiaojie, Shen, Peisong, Hu, Jiankun, Guo, Song, Tari, Zahir, and Zomaya, Albert Y.
Subjects: CLOUD computing, CIPHERS, ELECTRONIC information resource searching, INTERNET searching, HIERARCHICAL clustering (Cluster analysis)
Abstract: Cloud data owners prefer to outsource documents in an encrypted form for the purpose of privacy preserving. Therefore it is essential to develop efficient and reliable ciphertext search techniques. One challenge is that the relationship between documents will be normally concealed in the process of encryption, which will lead to significant search accuracy performance degradation. Also the volume of data in data centers has experienced a dramatic growth. This will make it even more challenging to design ciphertext search schemes that can provide efficient and reliable online information retrieval on large volume of encrypted data. In this paper, a hierarchical clustering method is proposed to support more search semantics and also to meet the demand for fast ciphertext search within a big data environment. The proposed hierarchical approach clusters the documents based on the minimum relevance threshold, and then partitions the resulting clusters into sub-clusters until the constraint on the maximum size of cluster is reached. In the search phase, this approach can reach a linear computational complexity against an exponential size increase of document collection. In order to verify the authenticity of search results, a structure called minimum hash sub-tree is designed in this paper. Experiments have been conducted using the collection set built from the IEEE Xplore. The results show that with a sharp increase of documents in the dataset the search time of the proposed method increases linearly whereas the search time of the traditional method increases exponentially. Furthermore, the proposed method has an advantage over the traditional method in the rank privacy and relevance of retrieved documents. [ABSTRACT FROM PUBLISHER]
Published: 2016
Full Text: View/download PDF

26. Reproducibility: Performance Evaluation of MemXCT on Azure CycleCloud Platform.

Author: Liu, Yuchen, Meng, Yixuan, Xu, Kaiyuan, Xu, Zijun, Wu, Tianyuan, Yang, Yiwei, and Yin, Shu
Subjects: IMAGE reconstruction algorithms, GRAPHICS processing units, COMPUTER architecture
Abstract: Memory-Centric X-ray Computational Tomography (CT) is an iterative reconstruction technique that trades compute simplifications with higher memory accesses. MemXCT implements a sparse matrix-vector multiplication (SpMV) with multi-stage buffering and two-level pseudo-Hilbert ordering for optimization. Motivated by the need to validate conclusions from previous work, we reproduce the numerical results, the algorithm’s performance, and the scaling behavior of the algorithms as the number of MPI processes increases on Azure. Digital artifacts from these experiments are available at: 10.5281/zenodo.5598108 [ABSTRACT FROM AUTHOR]
Published: 2022
Full Text: View/download PDF

27. GPU Acceleration for Simulating Massively Parallel Many-Core Platforms.

Author: Raghav, Shivani, Ruggiero, Martino, Marongiu, Andrea, Pinto, Christian, Atienza, David, and Benini, Luca
Subjects: GRAPHICS processing units, PARALLEL programs (Computer programs), COMPUTER architecture, NETWORKS on a chip, DISTRIBUTED shared memory
Abstract: Emerging massively parallel architectures such as a general-purpose processor plus many-core programmable accelerators are creating an increasing demand for novel methods to perform their architectural simulation. Most state-of-the-art simulation technologies are exceedingly slow and the need to model full system many-core architectures adds further to the complexity issues. This paper presents a fast, scalable and parallel simulator, which uses a novel methodology to accelerate the simulation of a many-core coprocessor using GPU platforms. Simulation of many target nodes is mapped to the many hardware-threads available on highly parallel GPU platforms. We demonstrate the challenges, feasibility and benefits of our idea to use a heterogeneous system (CPU and GPU) to simulate future architecture of many-core heterogeneous platforms. The target architecture selected to evaluate our methodology consists of an ARM general purpose CPU coupled with a many-core coprocessor with thousands of simple in-order cores connected in a tile network. This work presents optimization techniques used to parallelize the simulation specifically for acceleration on GPUs. We partition the full system simulation between CPU and GPU, where the target general purpose CPU is simulated on the host CPU, whereas the many-core coprocessor is simulated on the NVIDIA Tesla 2070 GPU platform. Our experiments show performance of up to 50 MIPS when simulating the entire heterogeneous chip, and high scalability with increasing cores on coprocessor. [ABSTRACT FROM AUTHOR]
Published: 2015
Full Text: View/download PDF

28. On All-to-All Broadcast in Dense Gaussian Network On-Chip.

Author: Touzene, Abderezak
Subjects: NETWORKS on a chip, COMPUTER architecture, DELAY lines, ELECTRIC network topology, INTEGRATED circuit interconnections, GAUSSIAN processes
Abstract: Gaussian networks are gaining popularity as good Network-on-Chip (NoC) candidates for interconnecting Multiprocessor System-on-Chips (MPSoCs). They showed better topological properties compared to the 2D torus networks with the same number of nodes $N$ and the same degree 4. All-to-all broadcast is a collective communication algorithm used frequently in many parallel applications. Recently, Z. Zhang et al. [1] have proposed an all-to-all broadcast algorithm for Gaussian on-chip networks that achieves the minimum delay time but requires 4$k$ extra buffers per router, where $k$ is the network diameter. In this paper, we propose a new all-to-all broadcast algorithm for dense Gaussian on-chip networks that achieves the minimum delay time without requiring any extra buffers per router. Along with low latency, reducing the amount of buffer space and power consumption are very important issues in NoC architectures. [ABSTRACT FROM PUBLISHER]
Published: 2015
Full Text: View/download PDF

29. Power-Aware Job Scheduling on Heterogeneous Multicore Architectures.

Author: Chiesi, Matteo, Vanzolini, Luca, Mucci, Claudio, Franchi Scarselli, Eleonora, and Guerrieri, Roberto
Subjects: POWER aware computing, COMPUTER scheduling, COMPUTER algorithms, WORKLOAD of computer networks, CENTRAL processing units, GRAPHICS processing units, COMPUTER architecture, COST effectiveness
Abstract: This paper presents a power-aware scheduling algorithm based on efficient distribution of the computing workload to the resources on heterogeneous CPU-GPU architectures. The scheduler manages the resources of several computing nodes with a view to reducing the peak power. The algorithm can be used in concert with adjustable power state software services in order to further reduce the computing cost during high demand periods. Although our study relies on GPU workloads, the approach can be extended to other heterogeneous computer architectures. The algorithm has been implemented in a real CPU-GPU heterogeneous system. Experiments prove that the approach presented reduces peak power by 10 percent compared to a system without any power-aware policy and by up to 24 percent with respect to the worst case scenario with an execution time increase in the range of 2 percent. This leads to a reduction in the system and service costs. [ABSTRACT FROM AUTHOR]
Published: 2015
Full Text: View/download PDF

30. Distributed Randomized $k$-Clustering Based PCID Assignment for Ultra-Dense Femtocellular Networks.

Author: Pratap, Ajay, Singhal, Rishabh, Misra, Rajiv, and Das, Sajal K.
Subjects: FEMTOCELLS, CELL phone systems, DISTRIBUTED computing, ELECTRONIC data processing, DATA transmission systems
Abstract: Next-generation wireless networks are going to have a highly dense, small cell structure with a large number of femtocells. The dense deployment of the femtocell network architecture is expected to meet the growing data demand by leveraging millimeter-wave structure of 5G wireless networks. However, arbitrary deployment of a large number of femtocells underlying a macrocell will pose a challenge for collision and confusion-free Physical Cell ID (PCID) assignments as the total number of available PCIDs is limited to 504. In this paper we propose a distributed, randomized $k$-clustering algorithm for the collision and confusion-free PCID assignment problem, which is known to be NP-complete. To reduce the total control message flow, we create overlapping clusters in ultra-dense femtocellular networks, where each cluster head runs the distributed randomized PCID allocation algorithm and locally monitors the conflicts to avoid the collision and confusion constraints. We prove the correctness of our proposed algorithm and analyze its time and message complexity. Through simulation experiments, we also show the effect of different parameters on the PCID allocation objectives. [ABSTRACT FROM AUTHOR]
Published: 2018
Full Text: View/download PDF

31. A Hardware Architecture for Radial Basis Function Neural Network Classifier.

Author: Mohammadi, Mahnaz, Krishna, Akhil, S, Nalesh, and Nandy, S. K.
Subjects: HARDWARE, RADIAL basis functions, ARTIFICIAL neural networks, BIG data, PARALLEL algorithms
Abstract: In this paper we present the design and analysis of scalable hardware architectures for training the learning parameters of RBFNN to classify large data sets. We design scalable hardware architectures for the K-means clustering algorithm to train the positions of hidden nodes at the hidden layer of RBFNN and for the pseudoinverse algorithm for weight adjustments at the output layer. These scalable parallel pipelined architectures are capable of implementing data sets with no restriction on their dimensions. This paper also presents a flexible and scalable hardware accelerator for classification using RBFNN, which puts no limitation on the dimension of the input data. We report FPGA synthesis results of our implementations. We compare results of our hardware accelerator with CPU and GPU implementations of the same algorithms and with other existing algorithms. Analysis of these results shows that the scalability of our hardware architecture makes it a favorable solution for classification of very large data sets. [ABSTRACT FROM AUTHOR]
Published: 2018
Full Text: View/download PDF

32. Architectural Synthesis of Multi-SIMD Dataflow Accelerators for FPGA.

Author: Wu, Yun and McAllister, John
Subjects: FIELD programmable gate arrays, HIGH performance processors, COMPUTER software, DATA, COMPUTER architecture
Abstract: Field Programmable Gate Arrays (FPGAs) boast abundant resources with which to realise high-performance accelerators for computationally demanding operations. Highly efficient accelerators may be automatically derived from Signal Flow Graph (SFG) models by using architectural synthesis techniques, but in practical design scenarios, these currently operate under two important limitations: they cannot efficiently harness the programmable datapath components which make up an increasing proportion of the computational capacity of modern FPGAs, and they are unable to automatically derive accelerators to meet a prescribed throughput or latency requirement. This paper addresses these limitations. SFG synthesis is enabled which derives software-programmable multicore single-instruction, multiple-data (SIMD) accelerators which, via combined offline characterisation of multicore performance and compile-time program analysis, meet prescribed throughput requirements. The effectiveness of these techniques is demonstrated on tree-search and linear algebraic accelerators for 802.11n WiFi transceivers, an application for which satisfying real-time performance requirements has, to this point, proven challenging for even manually-derived architectures. [ABSTRACT FROM PUBLISHER]
Published: 2018
Full Text: View/download PDF

33. A GPU-Architecture Optimized Hierarchical Decomposition Algorithm for Support Vector Machine Training.

Author: Vanek, Jan, Michalek, Josef, and Psutka, Josef
Subjects: SUPPORT vector machines, GRAPHICS processing units, CUDA (Computer architecture), PROGRAM transformation, MOTHERBOARDS
Abstract: In the last decade, several GPU implementations of Support Vector Machine (SVM) training with nonlinear kernels were published. Some of them even with source codes. The most effective ones are based on Sequential Minimal Optimization (SMO). They decompose the restricted quadratic problem into a series of smallest possible subproblems, which are then solved analytically. For large datasets, the majority of elapsed time is spent by a large amount of matrix-vector multiplications that cannot be computed efficiently on current GPUs because of limited memory bandwidth. In this paper, we introduce a novel GPU approach to the SVM training that we call Optimized Hierarchical Decomposition SVM (OHD-SVM). It uses a hierarchical decomposition iterative algorithm that fits better to actual GPU architecture. The low decomposition level uses a single GPU multiprocessor to efficiently solve a local subproblem. Nowadays a single GPU multiprocessor can run thousand or more threads that are able to synchronize quickly. It is an ideal platform for a single kernel SMO-based local solver with fast local iterations. The high decomposition level updates gradients of entire training set and selects a new local working set. The gradient update requires many kernel values that are costly to compute. However, solving a large local subproblem offers an efficient kernel values computation via a matrix-matrix multiplication that is much more efficient than the matrix-vector multiplication used in already published implementations. Along with a description of our implementation, the paper includes an exact comparison of five publicly available C++ SVM training GPU implementations. In this paper, the binary classification task and RBF kernel function are taken into account as it is usual in most of the recent papers. According to the measured results on a wide set of publicly available datasets, our proposed approach excelled significantly over the other methods in all datasets. 
The biggest difference was on the largest dataset where we achieved speed-up up to 12 times in comparison with the fastest already published GPU implementation. Moreover, our OHD-SVM is the only one that can handle dense as well as sparse datasets. Along with this paper, we published the source-codes at https://github.com/OrcusCZ/OHD-SVM. [ABSTRACT FROM AUTHOR]
Published: 2017
Full Text: View/download PDF

34. A General-Purpose Architecture for Replicated Metadata Services in Distributed File Systems.

Author: Stamatakis, Dimokritos, Tsikoudis, Nikos, Micheli, Eirini, and Magoutis, Kostas
Subjects: ELECTRONIC file management, METADATA, COMPUTER architecture, COMPUTING platforms, COMPUTER systems
Abstract: A large class of modern distributed file systems treat metadata services as an independent system component, separately from data servers. The availability of the metadata service is key to the availability of the overall system. Given the high rates of failures observed in large-scale data centers, distributed file systems usually incorporate high-availability (HA) features. A typical approach in the development of distributed file systems is to design and develop metadata services from the ground up, at significant cost in terms of complexity and time, often leading to functional shortcomings. Our motivation in this paper was to improve on this state of things by defining a general-purpose architecture for HA metadata services (which we call RMS) that can be easily incorporated and reused in new or existing file systems, reducing development time. Taking two prominent distributed file systems as case studies, PVFS and HDFS, we developed RMS variants that improve on functional shortcomings of the original HA solutions, while being easy to build and test. Our extensive evaluation of the RMS variant of HDFS shows that it does not incur an overall performance or availability penalty compared to the original implementation. [ABSTRACT FROM PUBLISHER]
Published: 2017
Full Text: View/download PDF

35. Multi-Core Embedded Wireless Sensor Networks: Architecture and Applications.

Author: Munir, Arslan, Gordon-Ross, Ann, and Ranka, Sanjay
Subjects: MULTICORE processors, WIRELESS sensor networks, COMPUTER architecture, SILICON industry, MOORE'S law, TRANSISTORS, STREAMING technology
Abstract: Technological advancements in the silicon industry, as predicted by Moore's law, have enabled integration of billions of transistors on a single chip. To exploit this high transistor density for high performance, embedded systems are undergoing a transition from single-core to multi-core. Although a majority of embedded wireless sensor networks (EWSNs) consist of single-core embedded sensor nodes, multi-core embedded sensor nodes are envisioned to burgeon in selected application domains that require complex in-network processing of the sensed data. In this paper, we propose an architecture for heterogeneous hierarchical multi-core embedded wireless sensor networks (MCEWSNs) as well as an architecture for multi-core embedded sensor nodes used in MCEWSNs. We elaborate several compute-intensive tasks performed by sensor networks and application domains that would especially benefit from multi-core embedded sensor nodes. This paper also investigates the feasibility of two multi-core architectural paradigms—symmetric multiprocessors (SMPs) and tiled many-core architectures (TMAs)—for MCEWSNs. We compare and analyze the performance of an SMP (an Intel-based SMP) and a TMA (Tilera's TILEPro64) based on a parallelized information fusion application for various performance metrics (e.g., runtime, speedup, efficiency, cost, and performance per watt). Results reveal that TMAs exploit data locality effectively and are more suitable for MCEWSN applications that require integer manipulation of sensor data, such as information fusion, and have little or no communication between the parallelized tasks. To demonstrate the practical relevance of MCEWSNs, this paper also discusses several state-of-the-art multi-core embedded sensor node prototypes developed in academia and industry. We further discuss research challenges and future research directions for MCEWSNs. [ABSTRACT FROM AUTHOR]
Published: 2014
Full Text: View/download PDF

36. Hybrid Dataflow/von-Neumann Architectures.

Author: Yazdanpanah, Fahimeh, Alvarez-Martinez, Carlos, Jimenez-Gonzalez, Daniel, and Etsion, Yoav
Subjects: DATA flow computing, COMPUTER architecture, HYBRID systems, SYNCHRONIZATION, PARALLEL processing
Abstract: General purpose hybrid dataflow/von-Neumann architectures are gaining attraction as effective parallel platforms. Although different implementations differ in the way they merge the conceptually different computational models, they all follow similar principles: they harness the parallelism and data synchronization inherent to the dataflow model, yet maintain the programmability of the von-Neumann model. In this paper, we classify hybrid dataflow/von-Neumann models according to two different taxonomies: one based on the execution model used for inter- and intrablock execution, and the other based on the integration level of both control and dataflow execution models. The paper reviews the basic concepts of von-Neumann and dataflow computing models, highlights their inherent advantages and limitations, and motivates the exploration of a synergistic hybrid computing model. Finally, we compare a representative set of recent general purpose hybrid dataflow/von-Neumann architectures, discuss their different approaches, and explore the evolution of these hybrid processors. [ABSTRACT FROM AUTHOR]
Published: 2014
Full Text: View/download PDF

37. Optimization of Duplication-Based Schedules on Network-on-Chip Based Multi-Processor System-on-Chips.

Author: Tang, Qi, Wu, Shang-Feng, Shi, Jun-Wu, and Wei, Ji-Bo
Subjects: SYSTEMS on a chip, COMPUTER scheduling, MULTIPROCESSORS, COMPUTER architecture, APPLICATION software, BANDWIDTH allocation, SCALABILITY
Abstract: Many applications such as streaming applications are both computation and communication intensive. The Multi-Processor System-on-Chip (MPSoC) based on Network-on-Chip (NoC) outperforms the multiprocessors with bus-based networking architecture in communication bandwidth and scalability, making it a better choice for implementing systems running these applications. It's important to schedule both the computation and communication onto processors and the networking architecture so as to satisfy the stringent timing requirements. To reduce or avoid inter-processor communication, task duplication has been employed in scheduling. Most of the available techniques for the duplication-based scheduling problem use heuristics to solve the problem, and seldom has any work studied further improving the schedule performance, despite the fact that the heuristic cannot provide quality guarantee. To fill in this gap, this paper introduces a duplication and mapping constrained task-communication co-scheduling problem that assumes the duplication strategy and task-to-processor mapping are known a priori, and proposes two Integer Linear Programming (ILP) formulations, i.e., CF-ILP and CA-ILP, to solve two editions of this problem, i.e., the contention-free problem and the contention-aware problem. The proposed ILP formulations optimize the ordering and timing of the communication and computation, thus improving the performance. Both synthesized and real applications are tested on a set of platforms to evaluate the performance of the proposed methods. The experimental results demonstrate the effectiveness of the proposed methods. [ABSTRACT FROM PUBLISHER]
Published: 2017
Full Text: View/download PDF

38. Automated Synthesis of Distributed Network Access Controls: A Formal Framework with Refinement.

Author: Rahman, Mohammad Ashiqur and Al-Shaer, Ehab
Subjects: DISTRIBUTED computing, NETWORK PC (Computer), COMPUTER architecture, ELECTRIC network topology, MATHEMATICAL models
Abstract: Due to the extensive use of network services and emerging security threats, enterprise networks deploy varieties of security devices for controlling resource access based on organizational security requirements. These requirements need fine-grained access control rules based on heterogeneous isolation patterns like access denial, trusted communication, and payload inspection. Organizations are also seeking usable and optimal security configurations that can harden the network security within enterprise budget constraints. In order to design a security architecture, i.e., the distribution of security devices along with their security policies, that satisfies the organizational security requirements as well as the business constraints, it is required to analyze various alternative security architectures considering placements of network security devices in the network and the corresponding access controls. In this paper, we present an automated formal framework for synthesizing network security configurations. The main design alternatives include different kinds of isolation patterns for network traffic flows. The framework takes security requirements and business constraints along with the network topology as inputs. Then, it synthesizes cost-effective security configurations satisfying the constraints and provides placements of different security devices, optimally distributed in the network, according to the given network topology. In addition, we provide a hypothesis testing-based security architecture refinement mechanism that explores various security design alternatives using ConfigSynth and improves the security architecture by systematically increasing the security requirements. We demonstrate the execution of ConfigSynth and the refinement mechanism using case studies. Finally, we evaluate their scalability using simulated experiments. [ABSTRACT FROM PUBLISHER]
Published: 2017
Full Text: View/download PDF

39. Workflow Scheduling in Multi-Tenant Cloud Computing Environments.

Author: Rimal, Bhaskar Prasad and Maier, Martin
Subjects: WORKFLOW management systems, CLOUD computing, SCALABILITY, END users (Information technology), PROOF of concept
Abstract: Multi-tenancy is one of the key features of cloud computing, which provides scalability and economic benefits to the end-users and service providers by sharing the same cloud platform and its underlying infrastructure with the isolation of shared network and compute resources. However, resource management in the context of multi-tenant cloud computing is becoming one of the most complex tasks due to the inherent heterogeneity and resource isolation. This paper proposes a novel cloud-based workflow scheduling (CWSA) policy for compute-intensive workflow applications in multi-tenant cloud computing environments, which helps minimize the overall workflow completion time, tardiness, cost of execution of the workflows, and utilize idle resources of cloud effectively. The proposed algorithm is compared with the state-of-the-art algorithms, i.e., First Come First Served (FCFS), EASY Backfilling, and Minimum Completion Time (MCT) scheduling policies to evaluate the performance. Further, a proof-of-concept experiment of real-world scientific workflow applications is performed to demonstrate the scalability of the CWSA, which verifies the effectiveness of the proposed solution. The simulation results show that the proposed scheduling policy improves the workflow performance and outperforms the aforementioned alternative scheduling policies under typical deployment scenarios. [ABSTRACT FROM PUBLISHER]
Published: 2017
Full Text: View/download PDF

40. Fast Consensus Using Bounded Staleness for Scalable Read-Mostly Synchronization.

Author: Chen, Haibo, Zhang, Heng, Liu, Ran, Zang, Binyu, and Guan, Haibing
Subjects: SYNCHRONIZATION software, COMPUTER architecture, VIRTUAL storage (Computer science), SCALABILITY, SEMANTICS
Abstract: Reader-mostly synchronization schemes, such as rwlocks and RCU, aim to maximize parallelism among readers, but many existing designs either cause readers to contend, or significantly extend writer latency, or both. This paper attributes such a problem to the lack of a fast consensus protocol between readers and writers, by which the two parts cooperate to obey the semantics of a synchronization construct. This paper describes FCP, a fast consensus protocol among readers and writers that provides scalable read-side performance as well as small writer latency for TSO architectures. The heart of FCP is a version-based consensus protocol between multiple non-communicating readers and a pending writer. FCP leverages bounded staleness of memory consistency to avoid atomic instructions and memory barriers in readers’ common paths, and uses message-passing (e.g., IPI) for straggling readers so that the writer latency can be bounded. To demonstrate the effectiveness of FCP, this paper applies FCP to construct a scalable reader-writers lock (rwlock) and a scalable RCU implementation. Evaluation on a 64-core machine shows that FCP significantly boosts the performance of the Linux virtual memory subsystem, a concurrent hashtable and an in-memory database. Micro-benchmarks show that FCP achieves smaller reader-side latency and lower writer-side latency when compared to state-of-the-art rwlocks and RCU implementation. [ABSTRACT FROM AUTHOR]
Published: 2016
Full Text: View/download PDF

41. Shield: A Reliable Network-on-Chip Router Architecture for Chip Multiprocessors.

Author: Poluri, Pavan and Louri, Ahmed
Subjects: NETWORKS on a chip, COMPUTER algorithms, COMPUTER programming, MULTIPROCESSORS, NETWORK routers, COMPUTER networks, INTERNETWORKING devices
Abstract: The increasing number of cores on a chip has made the network on chip (NoC) concept the standard communication paradigm for chip multiprocessors. A fault in an NoC leads to undesirable ramifications that can severely impact the performance of a chip. Therefore, it is vital to design fault tolerant NoCs. In this paper, we present Shield, a reliable NoC router architecture that has the unique ability to tolerate both hard and soft errors in the routing pipeline using techniques such as spatial redundancy, exploitation of idle cycles, bypassing of faulty resources and selective hardening. Using Mean Time to Failure and Silicon Protection Factor metrics, we show that Shield is six times more reliable than the baseline-unprotected router and is at least 1.5 times more reliable than existing fault tolerant router architectures. We introduce a new metric called Soft Error Improvement Factor and show that the soft error tolerance of Shield has improved by three times in comparison to the baseline-unprotected router. This reliability improvement is accomplished by incurring an area and power overhead of 34 and 31 percent respectively. Latency analysis using SPLASH-2 and PARSEC reveals that in the presence of faults, latency increases by a modest 13 and 10 percent respectively. [ABSTRACT FROM AUTHOR]
Published: 2016
Full Text: View/download PDF

42. Fast and Accurate Simulation of the Cray XMT Multithreaded Supercomputer.

Author: Villa, Oreste, Tumeo, Antonino, Secchi, Simone, and Manzano, Joseph B.
Subjects: COMPUTER simulation, THREADS (Computer programs), SUPERCOMPUTERS, DATA mining, GRAPH theory, COMPUTER storage devices, COMPUTER architecture
Abstract: Irregular applications, such as data mining or graph-based computations, show unpredictable memory/network access patterns and control structures. Massively multithreaded architectures with large processor counts, like the Cray MTA-1, MTA-2, and XMT, appear to address irregular application requirements better than commodity clusters. However, the research on massively multithreaded systems is currently limited by the lack of adequate architectural simulation infrastructures due to issues such as size of the machines, memory footprint, simulation speed, accuracy, and customization. At the same time, Shared Memory MultiProcessors (SMPs) with multicore processors have become an attractive platform to simulate large-scale systems. This paper introduces a cycle-level simulator of the massively multithreaded Cray XMT supercomputer. The simulator runs unmodified XMT applications. We discuss how we tackled the challenges posed by its development, detailing the techniques implemented to obtain high-simulation speed while maintaining a high accuracy. By mapping XMT processors (ThreadStorm with 128 hardware threads) to host computing cores, the simulation speed remains constant as the number of simulated processors increases, up to the number of available host cores. The simulator supports zero-overhead switching among different accuracy levels at runtime and includes a parametric network and memory model that takes into account contention and hot spotting. On a modern 48-core SMP host, the proposed infrastructure simulates a large set of irregular applications 500 to 2,000 times slower than real time when compared to a 128-processor XMT, with an accuracy error under 10 percent. Emulation is only from 25 to 200 times slower than real time. 
The paper also presents a case study, where the simulation infrastructure is used to identify bottlenecks in the current XMT architecture and to estimate the performance scaling of a possible multicore design with next generation memory and network interconnect. [ABSTRACT FROM PUBLISHER]
Published: 2012
Full Text: View/download PDF

43. Floating Point Calculation of the Cube Function on FPGAs.

Author: Osorio, Roberto R.
Subjects: MATHEMATICAL functions, FIELD programmable gate arrays, DIGITAL integrated circuits, CUBES, ARITHMETIC
Abstract: Specialized arithmetic units allow fast and efficient computation of lesser used mathematical functions. The overall impact of those units would be negligible in a general purpose processor, as added circuitry makes chips more complex despite most software would seldom make use of it. On the opposite side, custom computing machines are built for a specific task, and they can always benefit from specialized units if they are available. In this work, floating point architectures are proposed for computing the cube on Intel and Xilinx FPGAs. Those implementations reduce the cost and latency compared to using simple floating point multiplications and squarers. [ABSTRACT FROM AUTHOR]
Published: 2023
Full Text: View/download PDF

44. Coupling-Based Internal Clock Synchronization for Large-Scale Dynamic Distributed Systems.

Author: Baldoni, Roberto, Corsaro, Angelo, Querzoni, Leonardo, Scipioni, Sirio, and Piergiovanni, Sara Tucci
Subjects: PEER-to-peer architecture (Computer networks), COMPUTER network architectures, SYNCHRONIZATION, COMPUTER architecture, COMPUTER networks
Abstract: This paper studies the problem of realizing a common software clock among a large set of nodes without an external time reference (i.e., internal clock synchronization), any centralized control, and where nodes can join and leave the distributed system at their will. The paper proposes an internal clock synchronization algorithm which combines the gossip-based paradigm with a nature-inspired approach, coming from the coupled oscillators phenomenon, to cope with scale and churn. The algorithm works on the top of an overlay network and uses a uniform peer sampling service to fulfill each node's local view. Therefore, differently from clock synchronization protocols for small scale and static distributed systems, here, each node synchronizes regularly with only the neighbors in its local view and not with the whole system. An evaluation of the convergence speed and the synchronization error of the coupled-based internal clock synchronization algorithm has been carried out, showing how convergence time and the synchronization error depends on the coupling factor and the local view size. Moreover, the variation of the synchronization error with respect to churn and the impact of a sudden variation of the number of nodes have been analyzed to show the stability of the algorithm. In all these contexts, the algorithm shows nice performance and very good self-organizing properties. Finally, we showed how the assumption on the existence of a uniform peer-sampling service is instrumental for the good behavior of the algorithm and how, in system models where network delays are unbounded, a mean-based convergence function reaches a lower synchronization error than median-based convergence functions exploiting the number of averaged clock values. [ABSTRACT FROM AUTHOR]
Published: 2010
Full Text: View/download PDF

45. gMig: Efficient vGPU Live Migration with Overlapped Software-Based Dirty Page Verification.

Author: Lu, Qiumin, Zheng, Xiao, Ma, Jiacheng, Dong, Yaozu, Qi, Zhengwei, Yao, Jianguo, He, Bingsheng, and Guan, Haibing
Subjects: GRAPHICS processing units, COMPUTER architecture
Abstract: This paper introduces gMig, an open-source and practical vGPU live migration solution for full virtualization. Taking the advantage of the dirty pattern of GPU workloads, gMig presents the One-Shot Pre-Copy mechanism combined with the hashing based Software Dirty Page technique to achieve efficient vGPU live migration. Particularly, we propose three core techniques for gMig: 1) Dynamic Graphics Address Remapping, which parses and manipulates GPU commands to adjust the address mapping and adapt to a different environment after migration, 2) Software Dirty Page, which utilizes a hashing based approach with sampling pre-filtering to detect page modification, overcomes the commodity GPU's hardware limitation, and speeds up the migration by only sending the dirtied pages, 3) Overlapped Migration Process, which significantly compresses the hanging overhead by overlapping the dirty page verification and transmission concurrently. Our evaluation shows that gMig achieves GPU live migration with an average downtime of 302 ms on Windows and 119 ms on Linux. With the help of Software Dirty Page, the number of GPU pages transferred during the downtime is effectively reduced by up to 80.0 percent. The design of sampling filter and overlapped processing can bring about further 30.0 and 10.0 percent improvements in page processing. [ABSTRACT FROM AUTHOR]
Published: 2020
Full Text: View/download PDF

46. cuTensor-Tubal: Efficient Primitives for Tubal-Rank Tensor Learning Operations on GPUs.

Author: Zhang, Tao, Liu, Xiao-Yang, Wang, Xiaodong, and Walid, Anwar
Subjects: GRAPHICS processing units, VIDEO compression, PARALLEL processing, DATA structures, INFORMATION storage & retrieval systems, LINEAR algebra
Abstract: Tensors are the cornerstone data structures in high-performance computing, big data analysis and machine learning. However, tensor computations are compute-intensive and the running time increases rapidly with the tensor size. Therefore, designing high-performance primitives on parallel architectures such as GPUs is critical for the efficiency of ever growing data processing demands. Existing GPU basic linear algebra subroutines (BLAS) libraries (e.g., NVIDIA cuBLAS) do not provide tensor primitives. Researchers have to implement and optimize their own tensor algorithms in a case-by-case manner, which is inefficient and error-prone. In this paper, we develop the cuTensor-tubal library of seven key primitives for the tubal-rank tensor model on GPUs: t-FFT, inverse t-FFT, t-product, t-SVD, t-QR, t-inverse, and t-normalization. cuTensor-tubal adopts a frequency domain computation scheme to expose the separability in the frequency domain, then maps the tube-wise and slice-wise parallelisms onto the single instruction multiple thread (SIMT) GPU architecture. To achieve good performance, we optimize the data transfer, memory accesses, and design the batched and streamed parallelization schemes for tensor operations with data-independent and data-dependent computation patterns, respectively. In the evaluations of t-product, t-SVD, t-QR, t-inverse and t-normalization, cuTensor-tubal achieves maximum 16.91×, 27.03×, 38.97×, 22.36×, 15.43× speedups respectively over the CPU implementations running on dual 10-core Xeon CPUs. Two applications, namely, t-SVD-based video compression and low-tubal-rank tensor completion, are tested using our library and achieve maximum 9.80× and 269.26× speedups over multi-core CPU implementations. [ABSTRACT FROM AUTHOR]
Published: 2020
Full Text: View/download PDF

47. JSensor: A Parallel Simulator for Huge Wireless Sensor Networks Applications.

Author: Silva, Matheus Leonidas, Junior, Lincoln N. Santos, Aquino, Andre L. L., and Lima, Joubert de Castro
Subjects: WIRELESS sensor networks, APPLICATION program interfaces, COMPUTER architecture
Abstract: This paper presents JSensor, a parallel general purpose simulator which enables huge simulations of Wireless Sensor Networks applications. Its main advantages are: i) to have a simple API with few classes to be extended, allowing easy prototyping and validation of WSNs applications and protocols; ii) to enable transparent and reproducible simulations, regardless of the number of threads of the parallel kernel; and iii) to scale over multi-core computer architectures, allowing simulations of more realistic applications. JSensor is a parallel event-driven simulator which executes according to event timers. The simulation elements, nodes, application, and events, can send messages, process task or move around the simulated environment. The mentioned environment follows a grid structure of extensible spatial cells. The results demonstrated that JSensor scales well, precisely it achieved a speedup of 7.45 with 16 threads in a machine with 16 cores (eight physical and eight virtual cores), and comparative evaluations versus OMNeT++ showed that the presented solution could be 43 times faster. [ABSTRACT FROM AUTHOR]
Published: 2019
Full Text: View/download PDF

48. Optimizing Dual-Core Execution for Power Efficiency and Transient-Fault Recovery.

Author: Yi Ma, Hongliang Gao, Dimitrov, Martin, and Huiyang Zhou
Subjects: FAULT tolerance (Engineering), ELECTRIC power consumption, THREADS (Computer programs), MULTIPROCESSORS, SYSTEMS design, COMPUTER architecture
Abstract: Dual-core execution (DCE) is an execution paradigm proposed to utilize chip multiprocessors to improve the performance of single-threaded applications. Previous research has shown that DCE provides a complexity-effective approach to building a highly scalable instruction window and achieves significant latency-hiding capabilities. In this paper, we propose to optimize DCE for power efficiency and/or transient-fault recovery. In DCE, a program is first processed (speculatively) in the front processor and then reexecuted by the back processor. Such reexecution is the key to eliminating the centralized structures that are normally associated with very large instruction windows. In this paper, we exploit the computational redundancy in DCE to improve its reliability and its power efficiency. The main contributions include: 1) DCE-based redundancy checking for transient-fault tolerance and a complexity-effective approach to achieving full redundancy coverage and 2) novel techniques to improve the power/energy efficiency of DCE-based execution paradigms. Our experimental results demonstrate that, with the proposed simple techniques, the optimized DCE can effectively achieve transient-fault tolerance or significant performance enhancement in a power/energy-efficient way. Compared to the original DCE, the optimized DCE has similar speedups (34 percent on average) over single-core processors while reducing the energy overhead from 93 percent to 31 percent. [ABSTRACT FROM AUTHOR]
Published: 2007
Full Text: View/download PDF

49. Throughput Region of Finite-Buffered Networks.

Author: Giaccone, Paolo, Leonardi, Emilio, and Shah, Devavrat
Subjects: QUEUING theory, PACKET switching, DATA transmission systems, ELECTRONIC data processing, COMPUTER architecture, COMPUTER networks
Abstract: Most of the current communication networks, including the Internet, are packet switched networks. One of the main reasons behind the success of packet switched networks is the possibility of performance gain due to multiplexing of network bandwidth. The multiplexing gain crucially depends on the size of the buffers available at the nodes of the network to store packets at the congested links. However, most of the previous work assumes the availability of infinite buffer-size. In this paper, we study the effect of finite buffer-size on the performance of networks of interacting queues. In particular, we study the throughput of flow-controlled loss-less networks with finite buffers. The main result of this paper is the characterization of a dynamic scheduling policy that achieves the maximal throughput with a minimal finite buffer at the internal nodes of the network under memory-less (e.g., Bernoulli IID) exogenous arrival process. However, this ideal performance policy is rather complex and, hence, difficult to implement. This leads us to the design of a simpler and possibly implementable policy. We obtain a natural trade-off between throughput and buffer-size for such implementable policy. Finally, we apply our results to packet switches with buffered crossbar architecture. [ABSTRACT FROM AUTHOR]
Published: 2007
Full Text: View/download PDF

50. A Class of Multistage Conference Switching Networks for Group Communication.

Author: Yuanyuan Yang and Jianchao Wang
Subjects: COMPUTER conferencing, COMPUTER networks, NETWORK routers, COMPUTER input-output equipment, SWITCHING circuits, COMPUTER architecture
Abstract: There is a growing demand for network support for group applications, in which messages from one or more sender(s) are delivered to a large number of receivers. In this paper, we propose a network architecture for supporting a fundamental type of group communication, conferencing. A conference refers to a group of members in a network who communicate with each other within the group. We consider adopting a class of multistage networks, such as a baseline, an omega, or an indirect binary cube network, composed of switch modules with fan-in and fan-out capability for a conference network which supports multiple disjoint conferences. The key issue in designing a conference network is to determine the multiplicity of routing conflicts, which is the maximum number of conflict parties competing for a single interstage link when multiple disjoint conferences are simultaneously present in the network. Our results in this paper show that, for a network of size n × n, the multiplicities of routing conflicts are small constants (between 2 and 4) for an omega network or an indirect binary cube network; while it can be as large as √n/q + 1 for a baseline network, where q is the minimum allowable conference size. Thus, our design for conference networks is based on an omega network or an indirect binary cube network. We also develop fast self-routing algorithms for setting up routing paths in the newly designed conference networks. As can be seen, such an n × n conference network has O(n log n) routing time and communication delay and O(n log n) hardware cost. The conference networks are superior to existing designs in terms of routing complexity, communication delay and hardware cost. The conference network proposed is rearrangeably nonblocking in general, and is strictly nonblocking under some conference service policy. It can be used in applications that require efficient or real-time group communication. [ABSTRACT FROM AUTHOR]
Published: 2004
Full Text: View/download PDF

487 results
