Descriptor: "PARALLEL processing" / Journal: ieee transactions on parallel & distributed systems - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"PARALLEL processing"' showing total 620 results

1. TC-Stream: Large-Scale Graph Triangle Counting on a Single Machine Using GPUs.

Author: Huang, Jianqiang, Wang, Haojie, Fei, Xiang, Wang, Xiaoying, and Chen, Wenguang
Subjects: *GRAPH algorithms, *SOLID state drives, *GRAPHICS processing units, *PARALLEL algorithms, *TRIANGLES, *ON-demand computing, *COUNTING, *SOCIAL network analysis
Abstract: In this paper, we build a TC-Stream, a high-performance graph processing system specific for a triangle counting algorithm on graph data with up to tens of billions of edges, which significantly exceeds the device memory capacity of Graphics Processing Units (GPUs). The triangle counting problem is a broad research topic in data mining and social network analysis in the graph processing field. As the scale of the graph data grows, a portion of the graph data must be loaded iteratively. In the existing literature, graphs with billions of edges need to be done distributively, which is cost-intensive. Also, many disk-based triangle counting systems are proposed for CPU architectures, but their tackling performances are inefficient. To solve the above problem, we propose TC-Stream, and it focuses on three issues: 1) For power-law graphs, because the amount of tasks of each vertex or edge is inconsistent, it is bound to cause different demands of computing and memory resources for different task types. We propose a parallel vertex approach and the reordering of vertices for graph data that can be placed in the GPU device memory to ensure the maximum workload balancing; 2) A binary-search-based set intersection method is designed to achieve the maximum parallelism in GPU; 3) For the graph data that exceeds the GPU device memory capacity, we develop a novel vertical partition algorithm to guarantee the independent computing on each partition so that the three computation processes, i.e., the computation on GPU, the data transmission between main memory of CPU and SSD, and the communication between the CPU and the GPU can be perfectly overlapped. Moreover, the TC-Stream optimizes edge-iterator models and benefits from multi-thread parallelism.
Extensive experiments conducted on large-scale datasets showed that the TC-Stream running on a single Tesla V100 GPU performs 2.4-6× and 1.8-4.4× faster than the state-of-the-art single-machine in-memory triangle counting system and GPU-based triangle counting system, respectively, and achieves 2.4× faster than the state-of-the-art out-of-core distributed system PDTL running on an 8-node cluster when processing the graph data with 42.5 billion edges, which demonstrates the high performance and cost-effectiveness of the TC-Stream. [ABSTRACT FROM AUTHOR]
Published: 2022
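Illustrative note (not part of the catalog record): the abstract above mentions a binary-search-based set intersection for triangle counting. The following is a minimal single-threaded Python sketch of that general idea, not the authors' GPU implementation. The function names (`build_oriented_adjacency`, `contains`, `count_triangles`) and the orientation of edges from lower to higher vertex id are assumptions made for this example.

```python
from bisect import bisect_left


def build_oriented_adjacency(edges):
    """Orient each undirected edge from the lower to the higher vertex id
    and return sorted out-neighbor lists (an assumption of this sketch)."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        lo, hi = (u, v) if u < v else (v, u)
        adj.setdefault(lo, set()).add(hi)
        adj.setdefault(hi, set())
    return {v: sorted(ns) for v, ns in adj.items()}


def contains(sorted_list, x):
    """Binary search: True if x occurs in sorted_list."""
    i = bisect_left(sorted_list, x)
    return i < len(sorted_list) and sorted_list[i] == x


def count_triangles(edges):
    """Edge-iterator triangle counting: for every oriented edge (u, v),
    intersect the out-neighbor lists of u and v by probing the longer
    list with binary search for each element of the shorter one."""
    adj = build_oriented_adjacency(edges)
    total = 0
    for u, nbrs_u in adj.items():
        for v in nbrs_u:
            nbrs_v = adj[v]
            if len(nbrs_u) <= len(nbrs_v):
                short, long_ = nbrs_u, nbrs_v
            else:
                short, long_ = nbrs_v, nbrs_u
            total += sum(1 for w in short if contains(long_, w))
    return total
```

Because every edge points from the lower to the higher id, each triangle a < b < c is found exactly once, at edge (a, b) with common neighbor c. On a GPU, each binary-search probe is an independent unit of work, which is presumably why this formulation parallelizes well.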

2. Guest Editorial.

Author: Zhai, Jidong, Si, Min, and Pena, Antonio J.
Subjects: *ARTIFICIAL intelligence, *DEEP learning, *PARALLEL programming, *LEARNING ability, *MACHINE learning
Abstract: This special section focuses on the state-of-the-art technologies on parallel and distributed computing techniques for artificial intelligence (AI), machine learning (ML), and deep learning (DL). AI, ML, and DL can give computers the ability to learn from a large amount of data and use the learned model to optimize a complex problem or discover rules in a complicated system. AI, ML, and DL can be applied to push forward the boundaries for many domains and significantly influence our daily life. [ABSTRACT FROM AUTHOR]
Published: 2022

3. Auto-GNAS: A Parallel Graph Neural Architecture Search Framework.

Author: Chen, Jiamin, Gao, Jianliang, Chen, Yibo, Oloulade, Babatounde Moctard, Lyu, Tengfei, and Li, Zhao
Subjects: *GRAPH algorithms, *SEARCH algorithms, *GENETIC algorithms, *LINEAR acceleration, *PARALLEL programming, *COMPUTER architecture
Abstract: Graph neural networks (GNNs) have received much attention as GNNs have recently been successfully applied on non-Euclidean data. However, artificially designed graph neural networks often fail to get satisfactory model performance for a given graph data. Graph neural architecture search effectively constructs the GNNs that achieve the expected model performance with the rise of automatic machine learning. The challenge is efficiently and automatically getting the optimal GNN architecture in a vast search space. Existing search methods serially evaluate the GNN architectures, severely limiting system efficiency. To solve these problems, we develop an Automatic Graph Neural Architecture Search framework (Auto-GNAS) with parallel estimation to implement an automatic graph neural search process that requires almost no manual intervention. In Auto-GNAS, we design the search algorithm with multiple genetic searchers. Each searcher can simultaneously use evaluation feedback information, information entropy, and search results from other searchers based on a sharing mechanism to improve the search efficiency. As far as we know, this is the first work using parallel computing to improve the system efficiency of graph neural architecture search. According to the experiment on the real datasets, Auto-GNAS obtains competitive model performance and better search efficiency than other search algorithms. Since the parallel estimation ability of Auto-GNAS is independent of search algorithms, we expand different search algorithms based on Auto-GNAS for scalability experiments. The results show that Auto-GNAS with varying search algorithms can achieve nearly linear acceleration with the increase of computing resources. [ABSTRACT FROM AUTHOR]
Published: 2022

4. Mixing Activations and Labels in Distributed Training for Split Learning.

Author: Xiao, Danyang, Yang, Chengang, and Wu, Weigang
Subjects: *ARTIFICIAL neural networks, *BIPARTITE graphs
Abstract: Split Learning (SL) is a distributed machine learning setting that allows several nodes to train neural networks based on model parallelism. Since SL avoids sharing raw data among training nodes, it can protect data privacy by nature. However, recent studies show that raw data may be reconstructed from activations in training, which may cause data privacy leakage. Besides raw data, label sharing in SL may also cause privacy problems. In order to address these issues, we propose a novel mechanism called multiple activations and labels mix (MALM). By taking advantage of the diversity of sample categories, MALM generates mixed activations that preserve a low distance correlation with the raw data so as to reduce the risk of reconstruction attacks. To protect label information, MALM creates obfuscated labels associated with the raw data so as to prevent adversaries from inferring ground-truth labels. Since clients with few sample categories may not effectively generate mixed activations and obfuscated labels, we propose a bipartite graph based assistant client match technique for MALM, which lets clients with a large number of categories provide mixed activations and obfuscated labels for clients with few categories. Those clients with few categories can mix the obtained mixed activations and obfuscated labels with their own activations and labels. Experimental results show that, compared with baselines, MALM can reduce the risk of raw data and label information leakage with lower cost, while achieving comparable or even better model performance. [ABSTRACT FROM AUTHOR]
Published: 2022

5. Accelerating Large Sparse Neural Network Inference Using GPU Task Graph Parallelism.

Author: Lin, Dian-Lun and Huang, Tsung-Wei
Subjects: *GRAPH algorithms, *GRAPHICS processing units
Abstract: The ever-increasing size of modern deep neural network (DNN) architectures has put increasing strain on the hardware needed to implement them. Sparsified DNNs can greatly reduce memory costs and increase throughput over standard DNNs, if the loss of accuracy can be adequately controlled. However, sparse DNNs present unique computational challenges. Efficient model or data parallelism algorithms are extremely hard to design and implement. The recent effort MIT/IEEE/Amazon HPEC Graph Challenge has drawn attention to high-performance inference methods for large sparse DNNs. In this article, we introduce SNIG, an efficient inference engine for large sparse DNNs. SNIG develops highly optimized inference kernels and leverages the power of CUDA Graphs to enable efficient decomposition of model and data parallelisms. Our decomposition strategy is flexible and scalable to different partitions of data volumes, model sizes, and GPU numbers. We have evaluated SNIG on the official benchmarks of HPEC Sparse DNN Challenge and demonstrated its promising performance scalable from a single GPU to multiple GPUs. Compared to the champion of the 2019 HPEC Sparse DNN Challenge, SNIG can finish all inference workloads using only a single GPU. At the largest DNN, which has more than 4 billion parameters across 1920 layers each of 65536 neurons, SNIG is up to 2.3× faster than a state-of-the-art baseline under a machine of 4 GPUs. SNIG receives the Champion Award in 2020 HPEC Sparse DNN Challenge. [ABSTRACT FROM AUTHOR]
Published: 2022

6. ReHy: A ReRAM-Based Digital/Analog Hybrid PIM Architecture for Accelerating CNN Training.

Author: Jin, Hai, Liu, Cong, Liu, Haikun, Luo, Ruikun, Xu, Jiahong, Mao, Fubing, and Liao, Xiaofei
Subjects: *NONVOLATILE random-access memory, *CONVOLUTIONAL neural networks, *BLOOD pressure testing machines, *MATRIX multiplications, *LOGIC design
Abstract: Processing-In-Memory (PIM) has emerged as a high-performance and energy-efficient computing paradigm for accelerating convolutional neural network (CNN) applications. Resistive random access memory (ReRAM) has been widely used in PIM architectures due to its extremely high efficiency for accelerating matrix-vector multiplications through analog computing. However, because CNN training usually requires high-precision computation in the backward propagation (BP) stage, the limited precision of analog PIM accelerators impedes their adoption in CNN training. In this article, we propose ReHy, a hybrid PIM accelerator to support CNN training in ReRAM arrays. It is composed of Analog PIM (APIM) and Digital PIM (DPIM) modules. ReHy uses APIM to accelerate the feed-forward propagation (FP) stage for high performance, and DPIM to process the BP stage for high accuracy. We exploit the capability of ReRAM for Boolean logic operations to design the DPIM architecture. Particularly, we design floating-point multiplication and addition operators to support matrix multiplications in ReRAM arrays. We also propose a performance model to offload high-precision matrix multiplications to DPIM according to the data parallelism. Experimental results show that ReHy can speed up CNN training by 48.8× and 2.4×, and reduce energy consumption by 35.1× and 2.33×, compared with CPU/GPU architectures (baseline) and the state-of-the-art FloatPIM, respectively. [ABSTRACT FROM AUTHOR]
Published: 2022

7. BLB-gcForest: A High-Performance Distributed Deep Forest With Adaptive Sub-Forest Splitting.

Author: Chen, Zexi, Wang, Ting, Cai, Haibin, Mondal, Subrota Kumar, and Sahoo, Jyoti Prakash
Subjects: *DISTRIBUTED computing, *DISTRIBUTED algorithms, *PARALLEL programming, *PARALLEL algorithms, *SCALABILITY, *PROBLEM solving, *ALGORITHMS
Abstract: As an emulous alternative to deep neural networks, Deep Forest emerges with features like low complexity, fewer hyper-parameters, and good robustness, which are predominantly desired in distributed computing applications and ecosystems. Recently, an efficient distributed Deep Forest system, named ForestLayer, was proposed, designing a fine-grained sub-Forest-based task-parallel algorithm to improve the parallel computing efficiency of Deep Forest. However, the sub-Forest splitting of ForestLayer is static and one-off without adaptability to the computing environment, yet the size of splitting granularity has a significant impact on the system performance. To further improve the computing efficiency and scalability of the distributed Deep Forest, in this paper, we propose a novel distributed Deep Forest algorithm, named BLB-gcForest (Bag of Little Bootstraps-gcForest), which augments the gcForest (multi-Grained Cascade Forest) approach for constructing Deep Forest. BLB-gcForest carries out parallel computation for each tree in sub-Forests at a finer parallel granularity and integrates with the Bag of Little Bootstraps (BLB) mechanism to reduce massive transmitted feature instances for Cascade Forest Layers, utterly improving both computation efficiency and communication efficiency. Moreover, to solve the problem of the forest splitting granularity, we further design an adaptive sub-Forest splitting algorithm to ensure the maximum resource utilization for parallel computation of each sub-Forest. Experimental results on four well-known large-scale datasets, namely YEAST, LETTER, MNIST, CIFAR10, show that the training efficiency of BLB-gcForest achieves up to 20.3× and 1.64× speedups compared with the state-of-the-art gcForest and ForestLayer, respectively, while guaranteeing higher accuracy and better robustness. [ABSTRACT FROM AUTHOR]
Published: 2022

8. Exploring Data Analytics Without Decompression on Embedded GPU Systems.

Author: Pan, Zaifeng, Zhang, Feng, Zhou, Yanliang, Zhai, Jidong, Shen, Xipeng, Mutlu, Onur, and Du, Xiaoyong
Subjects: *GRAPHICS processing units, *COMPUTER architecture, *ENERGY consumption, *RANDOM access memory
Abstract: With the development of computer architecture, even for embedded systems, GPU devices can be integrated, providing outstanding performance and energy efficiency to meet the requirements of different industries, applications, and deployment environments. Data analytics is an important application scenario for embedded systems. Unfortunately, due to the limitation of the capacity of the embedded device, the scale of problems handled by the embedded system is limited. In this paper, we propose a novel data analytics method, called G-TADOC, for efficient text analytics directly on compression on embedded GPU systems. A large amount of data can be compressed and stored in embedded systems, and can be processed directly in the compressed state, which greatly enhances the processing capabilities of the systems. Particularly, G-TADOC has three innovations. First, a novel fine-grained thread-level workload scheduling strategy for GPU threads has been developed, which partitions heavily-dependent loads adaptively in a fine-grained manner. Second, a GPU thread-safe memory pool has been developed to handle inconsistency with low synchronization overheads. Third, a sequence-support strategy is provided to maintain high GPU parallelism while ensuring sequence information for lossless compression. Moreover, G-TADOC involves special optimizations for embedded GPUs, such as utilizing the CPU-GPU shared unified memory. Experiments show that G-TADOC provides 13.2× average speedup compared to the state-of-the-art TADOC. G-TADOC also improves performance-per-cost by 2.6× and energy efficiency by 32.5× over TADOC. [ABSTRACT FROM AUTHOR]
Published: 2022

9. Workload Balancing via Graph Reordering on Multicore Systems.

Author: Chen, YuAng and Chung, Yeh-Ching
Subjects: *DATA structures, *PARALLEL processing, *SCALABILITY, *STATISTICS
Abstract: In a shared-memory multicore system, the intrinsic irregular data structure of graphs leads to poor cache utilization, and therefore deteriorates the performance of graph analytics. To address the problem, prior works have proposed a variety of lightweight reordering methods with focus on the optimization of cache locality. However, there is a compromise between cache locality and workload balance. Little insight has been devoted into the issue of workload imbalance for the underlying multicore system, which degrades the effectiveness of parallel graph processing. In this work, a measurement approach is proposed to quantify the imbalance incurred by the concentration of vertices. Inspired by it, we present Cache-aware Reorder (Corder), a lightweight reordering method exploiting the cache hierarchy of multicore systems. At the shared-memory level, Corder promotes even distribution of computation loads amongst multicores. At the private-cache level, Corder facilitates cache efficiency by applying further refinement to local vertex order. Comprehensive performance evaluation of Corder is conducted on various graph applications and datasets. Experimental results show that Corder yields speedup of up to 2.59× and on average 1.45×, which significantly outperforms existing lightweight reordering methods. To identify the root causes of performance boost delivered by Corder, multicore activities are investigated in terms of thread behavior, cache efficiency, and memory utilization. Statistical analysis demonstrates that the issue of imbalanced thread execution time dominates other factors in determining the overall graph processing time. Moreover, Corder achieves remarkable advantages in cross-platform scalability and reordering overhead. [ABSTRACT FROM AUTHOR]
Published: 2022

10. An Automated Tool for Analysis and Tuning of GPU-Accelerated Code in HPC Applications.

Author: Zhou, Keren, Meng, Xiaozhu, Sai, Ryuichi, Grubisic, Dejan, and Mellor-Crummey, John
Subjects: *HIGH performance computing, *GRAPHICS processing units, *ACCELERATED life testing, *SUPERCOMPUTERS, *PARALLEL processing, *PARALLEL programming
Abstract: The US Department of Energy’s fastest supercomputers and forthcoming exascale systems employ Graphics Processing Units (GPUs) to increase the computational performance of compute nodes. However, the complexity of GPU architectures makes tailoring sophisticated applications to achieve high performance on GPU-accelerated systems a major challenge. At best, prior performance tools for GPU code only provide coarse-grained tuning advice at the kernel level. In this article, we describe GPA, a performance advisor that suggests potential code optimizations at a hierarchy of levels, including individual lines, loops, and functions. To gather the fine-grained measurements needed to produce such insights, GPA uses instruction sampling and binary instrumentation to monitor execution of GPU code. At the time of this writing, GPU instruction sampling is only available on NVIDIA GPUs. To understand performance losses, GPA uses data flow analysis to approximately attribute measured instruction stalls back to their causes. GPA then analyzes patterns of stalls using information about a program’s structure and the GPU architecture to identify optimization strategies that address inefficiencies observed. GPA then employs detailed performance models to estimate the potential speedup that each optimization might provide. Experiments with benchmarks and applications show that GPA provides useful advice for tuning GPU code. We applied GPA to analyze and tune a collection of codes on NVIDIA V100 and A100 GPUs. GPA suggested optimizations that it estimates will accelerate performance across the set of codes by a geometric mean of 1.21×. Applying these optimizations suggested by GPA accelerated these codes by a geometric mean of 1.19×. [ABSTRACT FROM AUTHOR]
Published: 2022

11. Design and Performance Characterization of RADICAL-Pilot on Leadership-Class Platforms.

Author: Merzky, Andre, Turilli, Matteo, Titov, Mikhail, Al-Saadi, Aymen, and Jha, Shantenu
Subjects: *WORKFLOW, *RESOURCE management, *HIGH performance computing, *GRAPHICS processing units
Abstract: Many extreme scale scientific applications have workloads comprised of a large number of individual high-performance tasks. The Pilot abstraction decouples workload specification, resource management, and task execution via job placeholders and late-binding. As such, suitable implementations of the Pilot abstraction can support the collective execution of a large number of tasks on supercomputers. We introduce RADICAL-Pilot (RP) as a portable, modular and extensible pilot-enabled runtime system. We describe RP's design, architecture and implementation. We characterize its performance and show its ability to scalably execute workloads comprised of tens of thousands of heterogeneous tasks on DOE and NSF leadership-class HPC platforms. Specifically, we investigate RP's weak/strong scaling with CPU/GPU, single/multi core, (non)MPI tasks and Python functions when using most of ORNL Summit and TACC Frontera. RADICAL-Pilot can be used stand-alone, as well as the runtime for third-party workflow systems. [ABSTRACT FROM AUTHOR]
Published: 2022

12. LB4OMP: A Dynamic Load Balancing Library for Multithreaded Applications.

Author: Korndorfer, Jonas H. Muller, Eleliemy, Ahmed, Mohammed, Ali, and Ciorba, Florina M.
Subjects: *DYNAMIC loads, *DYNAMIC balance (Mechanics), *COMPUTER systems
Abstract: Exascale computing systems will exhibit high degrees of hierarchical parallelism, with thousands of computing nodes and hundreds of cores per node. Efficiently exploiting hierarchical parallelism is challenging due to load imbalance that arises at multiple levels. OpenMP is the most widely-used standard for expressing and exploiting the ever-increasing node-level parallelism. The scheduling options in OpenMP are insufficient to address the load imbalance that arises during the execution of multithreaded applications. The limited scheduling options in OpenMP hinder research on novel scheduling techniques which require comparison with others from the literature. This work introduces LB4OMP, an open-source dynamic load balancing library that implements successful scheduling algorithms from the literature. LB4OMP is a research infrastructure designed to spur and support present and future scheduling research, for the benefit of multithreaded applications performance. Through an extensive performance analysis campaign, we assess the effectiveness and demystify the performance of all loop scheduling techniques in the library. We show that, for numerous applications-systems pairs, the scheduling techniques in LB4OMP outperform the scheduling options in OpenMP. Node-level load balancing using LB4OMP leads to reduced cross-node load imbalance and to improved MPI+OpenMP applications performance, which is critical for Exascale computing. [ABSTRACT FROM AUTHOR]
Published: 2022

13. Evaluating Spatial Accelerator Architectures with Tiled Matrix-Matrix Multiplication.

Author: Moon, Gordon Euhyun, Kwon, Hyoukjun, Jeong, Geonhwa, Chatarasi, Prasanth, Rajamanickam, Sivasankaran, and Krishna, Tushar
Subjects: *MULTIPLICATION, *ARRAY processing, *MACHINE learning, *SPARSE matrices, *PARALLEL processing
Abstract: There is a growing interest in custom spatial accelerators for machine learning applications. These accelerators employ a spatial array of processing elements (PEs) interacting via custom buffer hierarchies and networks-on-chip. The efficiency of these accelerators comes from employing optimized dataflow (i.e., spatial/temporal partitioning of data across the PEs and fine-grained scheduling) strategies to optimize data reuse. The focus of this work is to evaluate these accelerator architectures using a tiled general matrix-matrix multiplication (GEMM) kernel. To do so, we develop a framework that finds optimized mappings (dataflow and tile sizes) for a tiled GEMM for a given spatial accelerator and workload combination, leveraging an analytical cost model for runtime and energy. Our evaluations over five spatial accelerators demonstrate that the tiled GEMM mappings systematically generated by our framework achieve high performance on various GEMM workloads and accelerators. [ABSTRACT FROM AUTHOR]
Published: 2022

14. Compiler-Assisted Compaction/Restoration of SIMD Instructions.

Author: Cebrian, Juan M., Balem, Thibaud, Barredo, Adrian, Casas, Marc, Moreto, Miquel, Ros, Alberto, and Jimborean, Alexandra
Subjects: *HIGH performance computing, *COMPACTING, *SUPERCOMPUTERS, *COMPUTER systems, *ENERGY consumption, *COMPUTER architecture
Abstract: Vector processors (e.g., SIMD or GPUs) are ubiquitous in high performance systems. All the supercomputers in the world exploit data-level parallelism (DLP), for example by using single instructions to operate over several data elements. Improving vector processing is therefore key for exascale computing. However, despite its potential, vector code generation and execution have significant challenges. Among these challenges, control flow divergence is one of the main performance limiting factors. Most modern vector instruction sets, including SIMD, rely on predication to support divergence control. Nevertheless, the performance and energy consumption in predicated codes is usually insensitive to the number of active elements in a predicated mask. Since the trend is that vector register size increases, the energy efficiency of exascale computing systems will become sub-optimal. This article proposes a novel approach to improve execution efficiency in predicated vector codes, the Compiler-Assisted Compaction/Restoration (CACR) technique. Baseline CR delays predicated SIMD instructions with inactive elements, compacting active elements from instances of the same instruction of consecutive loop iterations. Compacted elements form an equivalent dense vector instruction. After executing the dense instructions, their results are restored to the original instructions. However, CR has a significant performance and energy penalty when it fails to find active elements, either due to lack of resources when unrolling or because of inter-loop dependencies. In CACR, the compiler analyzes the code looking for key information required to configure CR. Then, it passes this information to the processor via new instructions inserted in the code. This prevents CR from waiting for active elements on scenarios when it would fail to form dense instructions. 
Simulated results (gem5) show that CACR improves performance by up to 29 percent and reduces dynamic energy by up to 24.2 percent on average, for a set of applications with predicated execution. The baseline CR only achieves 18.6 percent performance and 14 percent energy improvements for the same configuration and applications. [ABSTRACT FROM AUTHOR]
Published: 2022

15. gSoFa: Scalable Sparse Symbolic LU Factorization on GPUs.

Author: Gaihre, Anil, Li, Xiaoye Sherry, and Liu, Hang
Subjects: *GRAPHICS processing units, *NUMERICAL solutions for linear algebra, *FACTORIZATION, *SPARSE matrices, *MATRIX decomposition
Abstract: Decomposing a matrix A into a lower matrix L and an upper matrix U, which is also known as LU decomposition, is an essential operation in numerical linear algebra. For a sparse matrix, LU decomposition often introduces more nonzero entries in the L and U factors than in the original matrix. A symbolic factorization step is needed to identify the nonzero structures of L and U matrices. Attracted by the enormous potentials of the Graphics Processing Units (GPUs), an array of efforts have surged to deploy various LU factorization steps except for the symbolic factorization, to the best of our knowledge, on GPUs. This article introduces gSoFa, the first GPU-based symbolic factorization design with the following three optimizations to enable scalable LU symbolic factorization for nonsymmetric pattern sparse matrices on GPUs. First, we introduce a novel fine-grained parallel symbolic factorization algorithm that is well suited for the Single Instruction Multiple Thread (SIMT) architecture of GPUs. Second, we tailor supernode detection into a SIMT friendly process and strive to balance the workload, minimize the communication and saturate the GPU computing resources during supernode detection. Third, we introduce a three-pronged optimization to reduce the excessive space consumption problem faced by multi-source concurrent symbolic factorization. Taken together, gSoFa achieves up to 31× speedup from 1 to 44 Summit nodes (6 to 264 GPUs) and outperforms the state-of-the-art CPU project, on average, by 5×. Notably, gSoFa also achieves up to 47 percent of the peak memory throughput of a V100 GPU in the Summit Supercomputer. [ABSTRACT FROM AUTHOR]
Published: 2022
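Illustrative note (not part of the catalog record): the abstract above says that LU decomposition of a sparse matrix introduces extra nonzeros (fill-in) and that symbolic factorization identifies the nonzero structure of L and U. The sketch below shows the textbook elimination-based way to compute that structure from the pattern of A alone. It is not gSoFa's GPU algorithm, and it assumes no pivoting and no numerical cancellation. The name `symbolic_lu` and the set-of-pairs pattern representation are choices made for this example.

```python
def symbolic_lu(pattern, n):
    """Return the nonzero pattern of L + U for an n x n matrix.

    `pattern` is an iterable of (row, col) positions of nonzeros in A.
    Assumes no pivoting and no numerical cancellation (sketch only).
    """
    filled = set(pattern) | {(i, i) for i in range(n)}
    for k in range(n):
        # Rows below the pivot with a nonzero in column k, and
        # columns right of the pivot with a nonzero in row k.
        rows = [i for i in range(k + 1, n) if (i, k) in filled]
        cols = [j for j in range(k + 1, n) if (k, j) in filled]
        # Eliminating column k updates every (i, j) in rows x cols,
        # so each such position becomes structurally nonzero.
        for i in rows:
            for j in cols:
                filled.add((i, j))
    return filled


# "Arrow" matrices show how ordering changes fill-in.
n = 4
diag = {(i, i) for i in range(n)}
arrow_first = diag | {(0, j) for j in range(n)} | {(i, 0) for i in range(n)}
arrow_last = diag | {(n - 1, j) for j in range(n)} | {(i, n - 1) for i in range(n)}

fill_first = len(symbolic_lu(arrow_first, n)) - len(arrow_first)
fill_last = len(symbolic_lu(arrow_last, n)) - len(arrow_last)
print(fill_first, fill_last)
```

With the dense row and column placed first, eliminating the first pivot fills the whole trailing block; placed last, the same structure produces no fill at all.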

16. Repurposing GPU Microarchitectures with Light-Weight Out-Of-Order Execution.

Author: Iliakis, Konstantinos, Xydis, Sotirios, and Soudris, Dimitrios
Subjects: *GRAPHICS processing units, *PARALLEL processing, *VERNACULAR architecture, *COMPUTER architecture
Abstract: GPU is the dominant platform for accelerating general-purpose workloads due to its computing capacity and cost-efficiency. GPU applications cover an ever-growing range of domains. To achieve high throughput, GPUs rely on massive multi-threading and fast context switching to overlap computations with memory operations. We observe that among the diverse GPU workloads, there exists a significant class of kernels that fail to maintain a sufficient number of active warps to hide the latency of memory operations, and thus suffer from frequent stalling. We argue that the dominant Thread-Level Parallelism model is not enough to efficiently accommodate the variability of modern GPU applications. To address this inherent inefficiency, we propose a novel micro-architecture with lightweight Out-Of-Order execution capability enabling Instruction-Level Parallelism to complement the conventional Thread-Level Parallelism model. To minimize the hardware overhead, we carefully design our extension to highly re-use the existing micro-architectural structures and study various design trade-offs to contain the overall area and power overhead, while providing improved performance. We show that the proposed architecture outperforms traditional platforms by 23 percent on average for low-occupancy kernels, with an area and power overhead of 1.29 and 10.05 percent, respectively. Finally, we establish the potential of our proposal as a micro-architecture alternative by providing 16 percent speedup over a wide collection of 60 general-purpose kernels. [ABSTRACT FROM AUTHOR]
Published: 2022

17. Tardiness Bounds for Sporadic Gang Tasks Under Preemptive Global EDF Scheduling.

Author: Dong, Zheng, Yang, Kecheng, Fisher, Nathan, and Liu, Cong
Subjects: *TARDINESS, *SCHEDULING, *CYBER physical systems, *PARALLEL processing, *GANGS
Abstract: Following the trend of increasing autonomy in cyber-physical systems, parallel embedded architectures have enabled devices to better handle the large streams of data and intensive computation required by such autonomous systems. However, while the explosion of highly-parallel platforms has seen a proportional growth in the number of applications/devices that utilize these platforms, the embedded systems community’s understanding of how to build time-predictable, safety-critical systems with parallel platforms has not kept pace. As a well-motivated but challenging parallel scheduling model, gang scheduling requires all parallel threads of each parallel task to simultaneously execute in unison, which is in contrast to traditional, multi-threaded parallel scheduling, where a parallel task may spawn multiple threads, and each thread will be scheduled independently of other threads of the same task. While increasing research efforts on hard real-time (HRT) gang scheduling have recently been seen, the problem of gang scheduling in the context of soft real-time (SRT) systems, where provably bounded deadline tardiness can be tolerated, has hardly been studied yet. In this article, we derive and prove the first tardiness bounds for sporadic gang task systems under preemptive GEDF scheduling. A total utilization bound for SRT-schedulability is required for ensuring such tardiness bounds but it is shown to be tight with respect to the platform capacity and maximum parallelism-induced idleness. Furthermore, we also empirically evaluate the effects of different degrees of task parallelism upon the SRT-schedulability. [ABSTRACT FROM AUTHOR]
Published: 2021
Full Text: View/download PDF

18. Trust: Triangle Counting Reloaded on GPUs.

Author: Pandey, Santosh, Wang, Zhibin, Zhong, Sheng, Tian, Chen, Zheng, Bolong, Li, Xiaoye, Li, Lingda, Hoisie, Adolfy, Ding, Caiwen, Li, Dong, and Liu, Hang
Subjects: *TRIANGLES, *INTERSECTION graph theory, *COUNTING, *GRAPHICS processing units, *CONSTRUCTION costs
Abstract: Triangle counting is a building block for a wide range of graph applications. Traditional wisdom suggests that i) hashing is not suitable for triangle counting, ii) edge-centric triangle counting beats vertex-centric design, and iii) communication-free and workload balanced graph partitioning is a grand challenge for triangle counting. On the contrary, we advocate that i) hashing can help the key operations for scalable triangle counting on Graphics Processing Units (GPUs), i.e., list intersection and graph partitioning, ii) vertex-centric design reduces both hash table construction cost and memory consumption, which is limited on GPUs. In addition, iii) we exploit graph and workload collaborative, and hashing-based 2D partitioning to scale vertex-centric triangle counting over 1000 GPUs with sustained scalability. In this article, we present Trust which performs triangle counting with the hash operation and vertex-centric mechanism at the core. To the best of our knowledge, Trust is the first work that achieves over one trillion Traversed Edges Per Second (TEPS) rate for triangle counting. [ABSTRACT FROM AUTHOR]
Published: 2021
Full Text: View/download PDF

19. gIM: GPU Accelerated RIS-Based Influence Maximization Algorithm.

Author: Shahrouz, Soheil, Salehkaleybar, Saber, and Hashemi, Matin
Subjects: *GRAPHICS processing units, *GREEDY algorithms, *SOCIAL networks, *ALGORITHMS, *PROBLEM solving, *WEIGHTED graphs, *SOURCE code
Abstract: Given a social network modeled as a weighted graph $G$, the influence maximization problem seeks $k$ vertices to become initially influenced, to maximize the expected number of influenced nodes under a particular diffusion model. The influence maximization problem has been proven to be NP-hard, and most proposed solutions to the problem are approximate greedy algorithms, which can guarantee a tunable approximation ratio for their results with respect to the optimal solution. The state-of-the-art algorithms are based on Reverse Influence Sampling (RIS) technique, which can offer both computational efficiency and non-trivial $(1 - 1/e - \epsilon)$-approximation ratio guarantee for any $\epsilon > 0$. RIS-based algorithms, despite their lower computational cost compared to other methods, still require long running times to solve the problem in large-scale graphs with low values of $\epsilon$. In this article, we present a novel and efficient parallel implementation of a RIS-based algorithm, namely IMM, on GPU. The proposed GPU-accelerated influence maximization algorithm, named gIM, can significantly reduce the running time on large-scale graphs with low values of $\epsilon$. Furthermore, we show that gIM algorithm can solve other variations of the IM problem, only by applying minor modifications. Experimental results show that the proposed solution reduces the runtime by a factor up to 220×. The source code of gIM is publicly available online. [ABSTRACT FROM AUTHOR]
Published: 2021
Full Text: View/download PDF

20. The Case for Strong Scaling in Deep Learning: Training Large 3D CNNs With Hybrid Parallelism.

Author: Oyama, Yosuke, Maruyama, Naoya, Dryden, Nikoli, McCarthy, Erin, Harrington, Peter, Balewski, Jan, Matsuoka, Satoshi, Nugent, Peter, and Van Essen, Brian
Subjects: *CONVOLUTIONAL neural networks, *DEEP learning, *PERFORMANCE theory
Abstract: We present scalable hybrid-parallel algorithms for training large-scale 3D convolutional neural networks. Deep learning-based emerging scientific workflows often require model training with large, high-dimensional samples, which can make training much more costly and even infeasible due to excessive memory usage. We solve these challenges by extensively applying hybrid parallelism throughout the end-to-end training pipeline, including both computations and I/O. Our hybrid-parallel algorithm extends the standard data parallelism with spatial parallelism, which partitions a single sample in the spatial domain, realizing strong scaling beyond the mini-batch dimension with a larger aggregated memory capacity. We evaluate our proposed training algorithms with two challenging 3D CNNs, CosmoFlow and 3D U-Net. Our comprehensive performance studies show that good weak and strong scaling can be achieved for both networks using up to 2K GPUs. More importantly, we enable training of CosmoFlow with much larger samples than previously possible, realizing an order-of-magnitude improvement in prediction accuracy. [ABSTRACT FROM AUTHOR]
Published: 2021
Full Text: View/download PDF

21. Model Parallelism Optimization for Distributed Inference Via Decoupled CNN Structure.

Author: Du, Jiangsu, Zhu, Xin, Shen, Minghua, Du, Yunfei, Lu, Yutong, Xiao, Nong, and Liao, Xiangke
Subjects: *SIGNAL convolution, *TASK analysis
Abstract: It is promising to deploy CNN inference on local end-user devices for high-accuracy and time-sensitive applications. Model parallelism has the potential to provide high throughput and low latency in distributed CNN inference. However, it is non-trivial to use model parallelism because the original CNN model is an inherently tightly-coupled structure. In this article, we propose DeCNN, a more effective inference approach that uses decoupled CNN structure to optimize model parallelism for distributed inference on end-user devices. DeCNN's novelty lies in three schemes. Scheme-1 is structure-level optimization. It exploits group convolution and channel shuffle to decouple the original CNN structure for model parallelism. Scheme-2 is partition-level optimization. It partitions the convolutional layers by channel group, and then leverages an input-based method to partition the fully connected layers, further exposing a high degree of parallelism. Scheme-3 is communication-level optimization. It uses inter-sample parallelism to hide communications for better performance and robustness, especially over weak network connections. We use the ImageNet classification task to evaluate the effectiveness of DeCNN on a distributed multi-ARM platform. Notably, when the number of devices increases from 1 to 4, DeCNN can accelerate the inference of large-scale ResNet-50 by 3.21×, and reduce the memory footprint by 65.3 percent, with 1.29 percent accuracy improvement. [ABSTRACT FROM AUTHOR]
Published: 2021
Full Text: View/download PDF

22. Analysis of Global and Local Synchronization in Parallel Computing.

Author: Cicirelli, Franco, Giordano, Andrea, and Mastroianni, Carlo
Subjects: *PARALLEL programming, *SYNCHRONIZATION, *ASYMPTOTIC efficiencies, *KEY performance indicators (Management)
Abstract: In a parallel computing scenario, the synchronization overhead, needed to coordinate the execution on the parallel computing nodes, can significantly impair the overall execution performance. Typically, synchronization is achieved by adopting a global synchronization schema involving all the nodes. In many application domains, though, a looser synchronization schema, namely, local synchronization, can be exploited, in which each node needs to synchronize only with a subset of the other nodes. In this work, we compare the performance of global and local synchronization using the efficiency, i.e., the ratio between the useful computing time and the total computing time, including the synchronization overhead, as a key performance indicator. We present an analytical study of the asymptotic behavior of the efficiency when the number of nodes increases. As an original contribution, we prove, using the Max-Plus algebra, that there is a non-zero lower bound on the efficiency in the case of local synchronization and we present a statistical procedure to find a value of this bound. This outcome marks a significant advantage of local synchronization with respect to global synchronization, for which the efficiency tends to zero when increasing the number of nodes. [ABSTRACT FROM AUTHOR]
Published: 2021
Full Text: View/download PDF

23. High Performance Simulation of Spiking Neural Network on GPGPUs.

Author: Qu, Peng, Zhang, Youhui, Fei, Xiang, and Zheng, Weimin
Subjects: *COMPUTATIONAL neuroscience, *GRAPHICS processing units, *BIOLOGICAL neural networks, *ENERGY consumption, *BIOLOGICAL systems
Abstract: Spiking neural network (SNN) is the most commonly used computational model for neuroscience and neuromorphic computing communities. It provides more biological reality and possesses the potential to achieve high computational power and energy efficiency. Because existing SNN simulation frameworks on general-purpose graphics processing units (GPGPUs) do not fully consider the biological oriented properties of SNNs, like spike-driven, activity sparsity, etc., they suffer from insufficient parallelism exploration, irregular memory access, and load imbalance. In this article, we propose specific optimization methods to speed up the SNN simulation on GPGPU. First, we propose a fine-grained network representation as a flexible and compact intermediate representation (IR) for SNNs. Second, we propose the cross-population/-projection parallelism exploration to make full use of GPGPU resources. Third, sparsity aware load balance is proposed to deal with the activity sparsity. Finally, we further provide dedicated optimization to support multiple GPGPUs. Accordingly, BSim, a code generation framework for high-performance simulation of SNN on GPGPUs is also proposed. Tests show that, compared to a state-of-the-art GPU-based SNN simulator GeNN, BSim achieves $1.41\times \sim 9.33\times$ speedup for SNNs with different configurations; it outperforms other simulators much more. [ABSTRACT FROM AUTHOR]
Published: 2020
Full Text: View/download PDF

24. Boosting the Performance of SSDs via Fully Exploiting the Plane Level Parallelism.

Author: Gao, Congming, Shi, Liang, Liu, Kai, Xue, Chun Jason, Yang, Jun, and Zhang, Youtao
Subjects: *SOLID state drives, *STATIC random access memory, *REFUSE collection, *PARALLEL processing, *RANDOM access memory
Abstract: Solid state drives (SSDs) are constructed with multiple level parallel organization, including channels, chips, dies, and planes. Among these parallel levels, plane level parallelism, which is the last level parallelism of SSDs, has the most strict restrictions. Only the same type of operations that access the same address in different planes can be processed in parallel. In order to maximize the access performance, several previous works have been proposed to exploit the plane level parallelism for host accesses and internal operations of SSDs. However, our preliminary studies show that the plane level parallelism is far from well utilized and should be further improved. The reason is that the strict restrictions of plane level parallelism are hard to satisfy. In this article, a from plane to die parallel optimization framework is proposed to exploit the plane level parallelism through smartly satisfying the strict restrictions all the time. In order to achieve the objective, there are at least two challenges. First, because host access patterns are always complex, receiving multiple same-type requests to different planes at the same time is uncommon. Second, there are many internal activities, such as garbage collection (GC), which may destroy the restrictions. In order to solve the above challenges, two schemes are proposed in the SSD controller: First, a die level write construction scheme is designed to make sure there are always $N$ pages of data written by each write operation. Second, in a further step, a die level GC scheme is proposed to activate GC in the unit of all planes in the same die. Combining the die level write and die level GC, write accesses from both host write operations and GC induced valid page movements can be processed in parallel at all times. To further improve the performance of SSDs, host write operations blocked by GCs are suggested to be processed in parallel with GC induced valid page movements, reducing the waiting time of host write operations. As a result, the GC cost and average write latency can be significantly reduced. Experiment results show that the proposed framework is able to significantly improve the write performance without read performance impact. [ABSTRACT FROM AUTHOR]
Published: 2020
Full Text: View/download PDF

25. Optimizing Streaming Parallelism on Heterogeneous Many-Core Architectures.

Author: Zhang, Peng, Fang, Jianbin, Yang, Canqun, Huang, Chun, Tang, Tao, and Wang, Zheng
Subjects: *PARALLEL programming, *GRAPHICS processing units, *MACHINE learning, *PREDICTION models, *HETEROGENEOUS computing
Abstract: As many-core accelerators keep integrating more processing units, it becomes increasingly more difficult for a parallel application to make effective use of all available resources. An effective way of improving hardware utilization is to exploit spatial and temporal sharing of the heterogeneous processing units by multiplexing computation and communication tasks – a strategy known as heterogeneous streaming. Achieving effective heterogeneous streaming requires carefully partitioning hardware among tasks, and matching the granularity of task parallelism to the resource partition. However, finding the right resource partitioning and task granularity is extremely challenging, because there is a large number of possible solutions and the optimal solution varies across programs and datasets. This article presents an automatic approach to quickly derive a good solution for hardware resource partition and task granularity for task-based parallel applications on heterogeneous many-core architectures. Our approach employs a performance model to estimate the resulting performance of the target application under a given resource partition and task granularity configuration. The model is used as a utility to quickly search for a good configuration at runtime. Instead of hand-crafting an analytical model that requires expert insights into low-level hardware details, we employ machine learning techniques to automatically learn it. We achieve this by first learning a predictive model offline using training programs. The learned model can then be used to predict the performance of any unseen program at runtime. We apply our approach to 39 representative parallel applications and evaluate it on two representative heterogeneous many-core platforms: a CPU-XeonPhi platform and a CPU-GPU platform. Compared to the single-stream version, our approach achieves, on average, a 1.6x and 1.1x speedup on the XeonPhi and the GPU platform, respectively. 
These results translate to over 93 percent of the performance delivered by a theoretically perfect predictor. [ABSTRACT FROM AUTHOR]
Published: 2020
Full Text: View/download PDF

26. Partitioning Tree-Shaped Task Graphs for Distributed Platforms With Limited Memory.

Author: Gou, Changjiang, Benoit, Anne, and Marchal, Loris
Subjects: *NP-complete problems, *MULTIPROCESSORS, *DIRECTED acyclic graphs, *SPANNING trees, *TREE graphs, *PARALLEL processing, *TERRITORIAL partition, *MEMORY
Abstract: Scientific applications are commonly modeled as the processing of directed acyclic graphs of tasks, and for some of them, the graph takes the special form of a rooted tree. This tree expresses both the computational dependencies between tasks and their storage requirements. The problem of scheduling/traversing such a tree on a single processor to minimize its memory footprint has already been widely studied. The present article considers the parallel processing of such a tree and studies how to partition it for a homogeneous multiprocessor platform, where each processor is equipped with its own memory. We formally state the problem of partitioning the tree into subtrees, such that each subtree can be processed on a single processor (i.e., it must fit in memory), and the goal is to minimize the total resulting processing time. We prove that this problem is NP-complete, and we design polynomial-time heuristics to address it. An extensive set of simulations demonstrates the usefulness of these heuristics. [ABSTRACT FROM AUTHOR]
Published: 2020
Full Text: View/download PDF

27. Approximate NoC and Memory Controller Architectures for GPGPU Accelerators.

Author: Raparti, Venkata Yaswanth and Pasricha, Sudeep
Subjects: *DYNAMIC random access memory, *ARCHITECTURE, *MEMORY, *RANDOM access memory, *GRAPHICS processing units
Abstract: High interconnect bandwidth is crucial for achieving better performance in many-core GPGPU architectures that execute highly data parallel applications. The parallel warps of threads running on shader cores generate a high volume of read requests to the main memory due to the limited size of data caches at the shader cores. This leads to scenarios with rapid arrival of an even larger volume of reply data from the DRAM, which creates a bottleneck at memory controllers (MCs) that send reply packets back to the requesting cores over the network-on-chip (NoC). Coping with such high volumes of data requires intelligent memory scheduling and innovative NoC architectures. To mitigate memory bottlenecks in GPGPUs, we first propose a novel approximate memory controller architecture (AMC) that reduces the DRAM latency by opportunistically exploiting row buffer locality and bank level parallelism in memory request scheduling, and leverages approximability of the reply data from DRAM, to reduce the number of reply packets injected into the NoC. To further realize high throughput and low energy communication in GPGPUs, we propose a low power, approximate NoC architecture (Dapper) that increases the utilization of the available network bandwidth by using single cycle overlay circuits for the reply traffic between MCs and shader cores. Experimental results show that Dapper and AMC together increase NoC throughput by up to 21 percent; and reduce NoC latency by up to 45.5 percent and energy consumed by the NoC and MC by up to 38.3 percent, with minimal impact on output accuracy, compared to state-of-the-art approximate NoC/MC architectures. [ABSTRACT FROM AUTHOR]
Published: 2020
Full Text: View/download PDF

28. REACT: Scalable and High-Performance Regular Expression Pattern Matching Accelerator for In-Storage Processing.

Author: Jeong, Won Seob, Lee, Changmin, Kim, Keunsoo, Yoon, Myung Kuk, Jeon, Won, Jung, Myoungsoo, and Ro, Won Woo
Subjects: *PATTERN matching, *SOLID state drives, *PARALLEL processing
Abstract: This article proposes REACT, a regular expression matching accelerator, which can be embedded in a modern Solid-State Drive (SSD) and a novel data access scheduling algorithm for high matching throughput. Specifically, REACT, including our data access scheduling algorithm, increases the utilization of SSD and the degree of internal memory parallelism for pattern matching processes. While the low-level flash exhibits long latency, modern SSDs in practice achieve high I/O performance by utilizing the massive internal parallelism at the system-level. However, exploiting the parallelism is limited for pattern matching since the sub-blocks, which constitute an input data and can be placed in multiple flash pages, should be tested in a sequence to process the input correctly. This limitation can induce low utilization of the accelerator. To address this challenge, the proposed REACT simultaneously processes multiple input streams with a parallel processing architecture to maximize matching throughput by hiding the long and irregular latency. The scheduling algorithm finds a data stream which requires a sub-block in closest time and prioritizes the access request to reduce the data stall of REACT. REACT achieves maximum 22.6 percent of matching throughput improvement on a 16-channel high-performance SSD compared to the accelerator without the proposed scheduling algorithm. [ABSTRACT FROM AUTHOR]
Published: 2020
Full Text: View/download PDF

29. cuTensor-Tubal: Efficient Primitives for Tubal-Rank Tensor Learning Operations on GPUs.

Author: Zhang, Tao, Liu, Xiao-Yang, Wang, Xiaodong, and Walid, Anwar
Subjects: *GRAPHICS processing units, *VIDEO compression, *PARALLEL processing, *DATA structures, *INFORMATION storage & retrieval systems, *LINEAR algebra
Abstract: Tensors are the cornerstone data structures in high-performance computing, big data analysis and machine learning. However, tensor computations are compute-intensive and the running time increases rapidly with the tensor size. Therefore, designing high-performance primitives on parallel architectures such as GPUs is critical for the efficiency of ever growing data processing demands. Existing GPU basic linear algebra subroutines (BLAS) libraries (e.g., NVIDIA cuBLAS) do not provide tensor primitives. Researchers have to implement and optimize their own tensor algorithms in a case-by-case manner, which is inefficient and error-prone. In this paper, we develop the cuTensor-tubal library of seven key primitives for the tubal-rank tensor model on GPUs: t-FFT, inverse t-FFT, t-product, t-SVD, t-QR, t-inverse, and t-normalization. cuTensor-tubal adopts a frequency domain computation scheme to expose the separability in the frequency domain, then maps the tube-wise and slice-wise parallelisms onto the single instruction multiple thread (SIMT) GPU architecture. To achieve good performance, we optimize the data transfer, memory accesses, and design the batched and streamed parallelization schemes for tensor operations with data-independent and data-dependent computation patterns, respectively. In the evaluations of t-product, t-SVD, t-QR, t-inverse and t-normalization, cuTensor-tubal achieves maximum $16.91\times$, $27.03\times$, $38.97\times$, $22.36\times$, and $15.43\times$ speedups respectively over the CPU implementations running on dual 10-core Xeon CPUs. Two applications, namely, t-SVD-based video compression and low-tubal-rank tensor completion, are tested using our library and achieve maximum $9.80\times$ and $269.26\times$ speedups over multi-core CPU implementations. [ABSTRACT FROM AUTHOR]
Published: 2020
Full Text: View/download PDF

30. cuPC: CUDA-Based Parallel PC Algorithm for Causal Structure Learning on GPU.

Author: Zarebavani, Behrooz, Jafarinejad, Foad, Hashemi, Matin, and Salehkaleybar, Saber
Subjects: *PARALLEL algorithms, *MULTIVARIATE analysis, *GAUSSIAN distribution, *GRAPHICS processing units
Abstract: The main goal in many fields in the empirical sciences is to discover causal relationships among a set of variables from observational data. PC algorithm is one of the promising solutions to learn underlying causal structure by performing a number of conditional independence tests. In this paper, we propose a novel GPU-based parallel algorithm, called cuPC, to execute an order-independent version of PC. The proposed solution has two variants, cuPC-E and cuPC-S, which parallelize PC in two different ways for multivariate normal distribution. Experimental results show the scalability of the proposed algorithms with respect to the number of variables, the number of samples, and different graph densities. For instance, in one of the most challenging datasets, the runtime is reduced from more than 11 hours to about 4 seconds. On average, cuPC-E and cuPC-S achieve 500X and 1300X speedup, respectively, compared to serial implementation on CPU. [ABSTRACT FROM AUTHOR]
Published: 2020
Full Text: View/download PDF

31. Toward Designing Cost-Optimal Policies to Utilize IaaS Clouds with Online Learning.

Author: Wu, Xiaohu, Loiseau, Patrick, and Hyytia, Esa
Subjects: *ONLINE education, *COST control, *ON-demand computing, *UNITS of time, *COMPUTER simulation, *COMPUTER graphics
Abstract: Many businesses possess a small infrastructure that they can use for their computing tasks, but also often buy extra computing resources from clouds. Cloud vendors such as Amazon EC2 offer two types of purchase options: on-demand and spot instances. As tenants have limited budgets to satisfy their computing needs, it is crucial for them to determine how to purchase different options and utilize them (in addition to possible self-owned instances) in a cost-effective manner while respecting their response-time targets. In this paper, we propose a framework to design policies to allocate self-owned, on-demand and spot instances to arriving jobs. In particular, we propose a near-optimal policy to determine the number of self-owned instances and an optimal policy to determine the number of on-demand instances to buy and the number of spot instances to bid for at each time unit. Our policies rely on a small number of parameters and we use an online learning technique to infer their optimal values. Through numerical simulations, we show the effectiveness of our proposed policies, in particular that they achieve a cost reduction of up to 64.51 percent when spot and on-demand instances are considered and of up to 43.74 percent when self-owned instances are considered, compared to previously proposed or intuitive policies. [ABSTRACT FROM AUTHOR]
Published: 2020
Full Text: View/download PDF

32. Single Restart with Time Stamps for Parallel Task Processing with Known and Unknown Processors.

Author: Champati, Jaya Prakash and Liang, Ben
Subjects: *TIMESTAMPS, *PARALLEL processing, *HETEROGENEOUS computing, *HEURISTIC algorithms, *COMPUTER systems, *ALGORITHMS, *PARALLEL programming
Abstract: We study the problem of scheduling $n$ tasks on $m+m^{\prime}$ parallel processors, where the processing times on $m$ processors are known while those on the remaining $m^{\prime}$ processors are not known a priori. This semi-online model is an abstraction of certain heterogeneous computing systems, e.g., with the $m$ known processors representing local CPU cores and the unknown processors representing remote servers with uncertain availability of computing cycles. Our objective is to minimize the makespan of all tasks. We initially focus on the case $m^{\prime}=1$ and propose a semi-online algorithm termed Single Restart with Time Stamps (SRTS), which has time complexity $O(n \log n)$. We derive its competitive ratio in comparison with the optimal offline solution. If the unknown processing times are deterministic, the competitive ratio of SRTS is shown to be either always constant or asymptotically constant in practice, respectively in cases where the processing times are independent and dependent on $m$. A similar result is obtained when the unknown processing times are random. Furthermore, extending the ideas of SRTS, we propose a heuristic algorithm termed SRTS-Multiple (SRTS-M) for the case $m^{\prime}>1$. Finally, where tasks arrive dynamically with unknown arrival times, we extend SRTS to Dynamic SRTS (DSRTS) and find its competitive ratio. Besides the proven competitive ratios, simulation results further suggest that SRTS and SRTS-M give superior performance on average over randomly generated task processing times, substantially reducing the makespan over the best known alternatives. Interestingly, the performance gain is more significant for task processing times sampled from heavy-tailed distributions. [ABSTRACT FROM AUTHOR]
Published: 2020
Full Text: View/download PDF

33. Data-Parallel Hashing Techniques for GPU Architectures.

Author: Lessley, Brenton and Childs, Hank
Subjects: *HASHING, *DATA structures, *COMPUTER graphics, *MACHINE learning, *GRAPHICS processing units
Abstract: Hash tables are a fundamental data structure for effectively storing and accessing sparse data, with widespread usage in domains ranging from computer graphics to machine learning. This study surveys the state-of-the-art research on data-parallel hashing techniques for emerging massively-parallel, many-core GPU architectures. This survey identifies key factors affecting the performance of different techniques and suggests directions for further research. [ABSTRACT FROM AUTHOR]
Published: 2020
Full Text: View/download PDF

34. Exploiting Parallelism and Vectorisation in Breadth-First Search for the Intel Xeon Phi.

Author: Paredes, Mireya, Riley, Graham, and Lujan, Mikel
Subjects: *ALGORITHMS, *GRAPH algorithms, *PARALLEL processing, *PARALLEL computers, *MODERN architecture, *SEARCH algorithms, *PARALLEL programming
Abstract: Modern applications generate massive amounts of data that is challenging to process or analyse. Graph algorithms have emerged as a solution for the analysis of such data because they can represent the entities participating in the generation of large-scale datasets in terms of vertices and their relationships in terms of edges. Graph analysis algorithms are used for finding patterns within these relationships, aiming to extract information to be further analysed. The breadth-first search (BFS) is one of the main graph search algorithms used for graph analysis and its optimisation has been widely researched using different parallel computers. However, the parallelisation of BFS has been shown to be challenging because of its inherent characteristics, including irregular memory access patterns, data dependencies and workload imbalance, that limit its scalability. This paper investigates the optimisation of the BFS on the Xeon Phi (Knights Corner), a modern parallel architecture provided with an advanced vector processor supporting the AVX-512 instruction set, using a bespoke development framework integrated with the Graph 500 benchmark. In addition, to demonstrate portability, we show results for a direct port of the algorithms to a more recent version of the Xeon Phi (Knights Landing) and to a Skylake CPU which supports most of the AVX-512 instruction set. Optimised parallel versions of two high-level algorithms for BFS were created using vectorisation, starting with the conventional top-down BFS algorithm and, building on this, a hybrid BFS algorithm. On the KNC our best implementations result in speedups of 1.37x (top-down) and 1.37x (hybrid), for a one million vertices graph, compared to the state-of-the-art. On the KNL and Skylake, the performance is higher than on KNC. In addition, we show results of our best hybrid algorithm on real-world graphs from the SNAP datasets with speedups up to 1.3x on KNC. 
Performance on KNL and Skylake is again higher, demonstrating the robustness and portability of our algorithm. The hybrid BFS algorithm can be further used to speed up other graph analysis algorithms and the lessons learned from vectorisation can be applied to other algorithms targeting existing and future models of the Xeon Phi and other advanced vector architectures. [ABSTRACT FROM AUTHOR]
Published: 2020
Full Text: View/download PDF

35. Resource-Efficient Index Shard Replication in Large Scale Search Engines.

Author: Li, Yusen, Tang, Xueyan, Cai, Wentong, Tong, Jiancong, Liu, Xiaoguang, and Wang, Gang
Subjects: *SEARCH engines, *HEURISTIC algorithms, *INTERNET content, *INTEGER programming, *COMPUTER science, *PARALLEL processing
Abstract: With the rapid growth of the Web scale, large scale search engines have to set up a huge number of machines to place the index files of the Web contents. The index files are normally divided into smaller index shards which are often replicated so that queries can be processed in parallel. We observe from real systems that the index shard replication strategy could have a significant impact on the resource usage. In this paper, we investigate the index shard replication problem with the goal of minimizing the resource usage in search engine datacenters. We consider both the offline version and online version of the problem, and formulate the problems as non-linear integer programming problems. We propose several heuristic algorithms to approximate the optimal solution. The proposed algorithms are evaluated by extensive experiments using both synthetic data and real data from commercial search engines. The results demonstrate the effectiveness of the proposed algorithms. Our work also yields many insights about the impact of different input properties on the performance of each algorithm. We believe that this paper will provide valuable guidance to the design of the index shard replication strategy in practice. [ABSTRACT FROM AUTHOR]
Published: 2019
Full Text: View/download PDF

36. Fast Deep Neural Network Training on Distributed Systems and Cloud TPUs.

Author: You, Yang, Zhang, Zhao, Hsieh, Cho-Jui, Demmel, James, and Keutzer, Kurt
Subjects: *SUPERCOMPUTERS, *CONFERENCE papers, *PARALLEL processing
Abstract: Since its creation, the ImageNet-1k benchmark set has played a significant role as a benchmark for ascertaining the accuracy of different deep neural net (DNN) models on the image classification problem. Moreover, in recent years it has also served as the principal benchmark for assessing different approaches to DNN training. Finishing a 90-epoch ImageNet-1k training with ResNet-50 on an NVIDIA M40 GPU takes 14 days. This training requires $10^{18}$ single precision operations in total. On the other hand, the world's current fastest supercomputer can finish $3 \times 10^{17}$ single precision operations per second (according to the Nov 2018 Top 500 results). If we can make full use of the computing capability of the fastest supercomputer, we should be able to finish the training in several seconds. Over the last two years, researchers have focused on closing this significant performance gap through scaling DNN training to larger numbers of processors. Most successful approaches to scaling ImageNet training have used the synchronous mini-batch stochastic gradient descent (SGD). However, to scale synchronous SGD one must also increase the batch size used in each iteration. Thus, for many researchers, the focus on scaling DNN training has translated into a focus on developing training algorithms that enable increasing the batch size in data-parallel synchronous SGD without losing accuracy over a fixed number of epochs. In this paper, we investigate supercomputers' capability of speeding up DNN training. Our approach is to use a large batch size, powered by the Layer-wise Adaptive Rate Scaling (LARS) algorithm, for efficient usage of massive computing resources. Our approach is generic, as we empirically evaluate the effectiveness on five neural networks: AlexNet, AlexNet-BN, GNMT, ResNet-50, and ResNet-50-v2 trained with large datasets while preserving the state-of-the-art test accuracy. Compared to the baseline of a previous study from Goyal et al. [1], our approach shows higher test accuracy on batch sizes that are larger than 16K. When we use the same baseline, our results are better than Goyal et al. for all the batch sizes (Fig. 20). Using 2,048 Intel Xeon Platinum 8160 processors, we reduce the 100-epoch AlexNet training time from hours to 11 minutes. With 2,048 Intel Xeon Phi 7250 Processors, we reduce the 90-epoch ResNet-50 training time from hours to 20 minutes. Our implementation is open source and has been released in the Intel distribution of Caffe, Facebook's PyTorch, and Google's TensorFlow. The differences between this paper and the conference version of our work [2] include: (1) we implement our approach on Google's cloud Tensor Processing Unit (TPU) platform, which verifies our previous success on CPUs and GPUs; (2) we scale the batch size of ResNet-50-v2 to 32K and achieve 76.3 percent accuracy, which is better than the 75.3 percent accuracy achieved in our conference paper; (3) we apply our approach to Google's Neural Machine Translation (GNMT) application, which helps us achieve a 4× speedup on the cloud TPUs. [ABSTRACT FROM AUTHOR]
Published: 2019
Full Text: View/download PDF
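
The "several seconds" estimate in the abstract above follows from dividing the quoted operation count by the quoted peak rate. A minimal sketch of that arithmetic, using only the figures stated in the abstract; the function and constant names are illustrative, and the result is an idealized lower bound that assumes the machine sustains its peak rate:

```python
# Figures quoted in the abstract above; names below are illustrative only.
TOTAL_OPS = 1e18         # single-precision operations for the 90-epoch ResNet-50 run
PEAK_OPS_PER_SEC = 3e17  # fastest supercomputer, Nov 2018 Top 500 (as quoted)
M40_DAYS = 14            # reported training time on one NVIDIA M40 GPU


def ideal_seconds(total_ops: float, ops_per_sec: float) -> float:
    """Lower bound on run time if the machine sustained its peak rate."""
    return total_ops / ops_per_sec


def sustained_rate(total_ops: float, days: float) -> float:
    """Average operations per second implied by a run of the given length."""
    return total_ops / (days * 24 * 3600)


ideal = ideal_seconds(TOTAL_OPS, PEAK_OPS_PER_SEC)
m40_rate = sustained_rate(TOTAL_OPS, M40_DAYS)
print(f"ideal time at peak rate: {ideal:.2f} s")
print(f"rate implied by 14 days on one M40: {m40_rate:.2e} ops/s")
```

The same division also gives the average throughput implied by the 14-day single-GPU run, which makes the size of the gap between the two platforms explicit.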

37. Massively Parallel Tree Search for High-Dimensional Sphere Decoders.

Author: Nikitopoulos, Konstantinos, Georgis, Georgios, Jayawardena, Chathura, Chatzipanagiotis, Daniil, and Tafazolli, Rahim
Subjects: *SPHERES, *VERY large scale circuit integration, *MIMO systems, *MAGNITUDE (Mathematics), *ERROR rates
Abstract: The recent paradigm shift towards the transmission of large numbers of mutually interfering information streams, as in the case of aggressive spatial multiplexing, combined with requirements towards very low processing latency despite the frequency plateauing of traditional processors, initiates a need to revisit the fundamental maximum-likelihood (ML) and, consequently, the sphere-decoding (SD) detection problem. This work presents the design and VLSI architecture of MultiSphere; the first method to massively parallelize the tree search of large sphere decoders in a nearly-concurrent manner, without compromising their maximum-likelihood performance, and by keeping the overall processing complexity comparable to that of highly-optimized sequential sphere decoders. For a $10 \times 10$ MIMO spatially multiplexed system with 16-QAM modulation and 32 processing elements, our MultiSphere architecture can reduce latency by $29\times$ against well-known sequential SDs, approaching the processing latency of linear detection methods, without compromising ML optimality. In MIMO multicarrier systems targeting exact ML decoding, MultiSphere achieves processing latency and hardware efficiency that are orders of magnitude improved compared to approaches employing one SD per subcarrier. In addition, for $16 \times 16$ both "hard"- and "soft"-output MIMO systems, approximate MultiSphere versions are shown to achieve similar error rate performance with state-of-the-art approximate SDs having akin parallelization properties, by using only one tenth of the processing elements, and to achieve up to approximately $9\times$ increased energy efficiency. [ABSTRACT FROM AUTHOR]
Published: 2019
Full Text: View/download PDF

38. A Compiler for Agnostic Programming and Deployment of Big Data Analytics on Multiple Platforms.

Author: Di Martino, Beniamino, Esposito, Antonio, D'Angelo, Salvatore, Maisto, Salvatore Augusto, and Nacchia, Stefania
Subjects: *BIG data, *COMPILERS (Computer programs), *DISTRIBUTED algorithms
Abstract: To run proper Big Data Analytics, small and medium enterprises (SMEs) need to acquire expertise, hardware and software, which often translates to relevant initial investments for activities not directly connected to the company's business. To reduce such investments, the TOREADOR project proposes a Big Data Analytics framework which supports users in devising their own Big Data solutions by keeping the inherent costs at a minimum, and leveraging pre-existent knowledge and expertise. Among the objectives of the TOREADOR framework is supporting developers in parallelizing and deploying their Big Data algorithms, in order to develop their own analytics solutions. This paper describes the Code-Based approach, adopted within the TOREADOR framework to parallelize users' algorithms and deploy them on distributed platforms, via the annotation of parallelizable code portions with parallelization primitives. The approach, which relies on the guidance of Parallel Patterns to implement the parallelization, and on Skeletons to automatically build execution and deployment templates, is realized through a source-to-source Compiler, also described in the present paper. [ABSTRACT FROM AUTHOR]
Published: 2019
Full Text: View/download PDF

39. Privacy Regulation Aware Process Mapping in Geo-Distributed Cloud Data Centers.

Author: Zhou, Amelie Chi, Xiao, Yao, Gong, Yifan, He, Bingsheng, Zhai, Jidong, and Mao, Rui
Subjects: *SERVER farms (Computer network management), *NETWORK performance, *PRIVACY, *PARALLEL processing, *CLOUDS & the environment, *MACHINE learning
Abstract: Recently, various applications including data analytics and machine learning have been developed for geo-distributed cloud data centers. For those applications, the ways of mapping parallel processes to physical nodes (i.e., "process mapping") could significantly impact the performance of the applications because of non-uniform communication cost in geo-distributed environments. What's more, the different data privacy requirements in geo-distributed data centers pose additional constraints on process mapping solutions. While process mapping has been widely studied in grid/cluster environments, few of the existing studies have considered the problem in geo-distributed cloud environment, which is a challenging task due to the multi-level data privacy constraints, heterogeneous network performance and process failures. In this paper, we introduce the special privacy requirements in geo-distributed data centers and formulate the geo-distributed process mapping problem as an optimization problem with multiple constraints. We develop a new method to efficiently find good process mapping solutions to the problem. Experimental results on real clouds (including Amazon EC2 and Windows Azure) and simulations demonstrate that our proposed approach can achieve significant performance improvement compared to the state-of-the-art algorithms. [ABSTRACT FROM AUTHOR]
Published: 2019
Full Text: View/download PDF

40. Exploring GPU-Accelerated Routing for FPGAs.

Author: Shen, Minghua, Luo, Guojie, and Xiao, Nong
Subjects: *FIELD programmable gate arrays, *MOTHERBOARDS, *SUBGRAPHS, *PARALLEL algorithms, *GRAPHICS processing units
Abstract: Field Programmable Gate Arrays (FPGAs) are reconfigurable architectures able to provide a good balance between energy efficiency and flexibility with respect to CPUs and ASICs. The main drawback in using FPGAs, however, is their time-consuming routing process, significantly hindering the designer productivity. An emerging solution to this problem is to accelerate the routing by parallelization. Existing attempts of parallelizing the FPGA routing either do not fully exploit the parallelism or suffer from an excessive quality loss. Massive parallelism using GPUs has the potential to solve this issue but faces non-trivial challenges. To cope with these challenges, this paper explores a GPU-accelerated routing approach for FPGAs. We leverage the idea of problem size reduction by limiting the single-net routing in a small subgraph rather than in an entire graph, further enabling the GPU-friendly shortest path algorithm to be used in FPGA routing. We maintain the convergence after problem size reduction by using the dynamic expansion of the routing resource subgraph, where the routing region of subgraph will be progressively expanded to find a feasible solution to each net. In addition, we use a GPU platform to explore the fine-grained single-net parallel routing in three ways and propose a hybrid approach to combine the static and dynamic parallelization for better speedup in FPGA routing. To explore the coarse-grained multi-net parallelization, we propose a dynamic programming-based partitioning algorithm to parallelize the routing of multiple nets while generating the equivalent routing results as the original single-net routing. Experimental results show that our proposed approach can provide an average of about 21.53× speedup on a single GPU with a tolerable loss in the routing quality and maintain a scalable speedup on large-scale routing resource graphs. To our knowledge, this is the first work to demonstrate the effectiveness of GPU-accelerated routing for FPGAs. [ABSTRACT FROM AUTHOR]
Published: 2019
Full Text: View/download PDF

41. A Bi-layered Parallel Training Architecture for Large-Scale Convolutional Neural Networks.

Author: Chen, Jianguo, Li, Kenli, Bilal, Kashif, Zhou, Xu, Li, Keqin, and Yu, Philip S.
Subjects: *NEURAL circuitry, *PARALLEL computers, *DATA transmission systems, *SYNCHRONIZATION, *DEEP learning
Abstract: Benefitting from large-scale training datasets and the complex training network, Convolutional Neural Networks (CNNs) are widely applied in various fields with high accuracy. However, the training process of CNNs is very time-consuming, where large amounts of training samples and iterative operations are required to obtain high-quality weight parameters. In this paper, we focus on the time-consuming training process of large-scale CNNs and propose a Bi-layered Parallel Training (BPT-CNN) architecture in distributed computing environments. BPT-CNN consists of two main components: (a) an outer-layer parallel training for multiple CNN subnetworks on separate data subsets, and (b) an inner-layer parallel training for each subnetwork. In the outer-layer parallelism, we address critical issues of distributed and parallel computing, including data communication, synchronization, and workload balance. A heterogeneous-aware Incremental Data Partitioning and Allocation (IDPA) strategy is proposed, where large-scale training datasets are partitioned and allocated to the computing nodes in batches according to their computing power. To minimize the synchronization waiting during the global weight update process, an Asynchronous Global Weight Update (AGWU) strategy is proposed. In the inner-layer parallelism, we further accelerate the training process for each CNN subnetwork on each computer, where computation steps of convolutional layer and the local weight training are parallelized based on task-parallelism. We introduce task decomposition and scheduling strategies with the objectives of thread-level load balancing and minimum waiting time for critical paths. Extensive experimental results indicate that the proposed BPT-CNN effectively improves the training performance of CNNs while maintaining the accuracy. [ABSTRACT FROM AUTHOR]
Published: 2019
Full Text: View/download PDF

42. Solving Linear Diophantine Systems on Parallel Architectures.

Author: Zaitsev, Dmitry, Tomov, Stanimire, and Dongarra, Jack
Subjects: *DIOPHANTINE analysis, *PARALLEL processing, *LINEAR systems, *TASK analysis, *MATHEMATICAL models
Abstract: Solving linear Diophantine systems of equations is applied in discrete-event systems, model checking, formal languages and automata, logic programming, cryptography, networking, signal processing, and chemistry. For modeling discrete systems with Petri nets, a solution in non-negative integer numbers is required, which represents an intractable problem. For this reason, solving such kinds of tasks with significant speedup is highly appreciated. In this paper we design a new solver of linear Diophantine systems based on the parallel-sequential composition of the system clans. The solver is studied and implemented to run on parallel architectures using a two-level parallelization concept based on MPI and OpenMP. A decomposable system is usually represented by a sparse matrix; a minimal clan size of the decomposition restricts the granulation of the technique. MPI is applied for solving systems for clans using a parallel-sequential composition on distributed-memory computing nodes, while OpenMP is applied in solving a single indecomposable system on a single node using multiple cores. A dynamic task-dispatching subsystem is developed for distributing systems on nodes in the process of compositional solution. Computational speedups are obtained on a series of test examples, e.g., illustrating that the best value constitutes up to 45 times speedup obtained on 5 nodes with 20 cores each. [ABSTRACT FROM AUTHOR]
Published: 2019
Full Text: View/download PDF

43. Performance-Aware Model for Sparse Matrix-Matrix Multiplication on the Sunway TaihuLight Supercomputer.

Author: Chen, Yuedan, Li, Kenli, Yang, Wangdong, Xiao, Guoqing, Xie, Xianghui, and Li, Tao
Subjects: *SPARSE matrices, *COMPUTER architecture, *PARALLEL processing, *SUPERCOMPUTERS, *KERNEL (Mathematics)
Abstract: General sparse matrix-sparse matrix multiplication (SpGEMM) is one of the fundamental linear operations in a wide variety of scientific applications. To implement efficient SpGEMM for many large-scale applications, this paper proposes scalable and optimized SpGEMM kernels based on COO, CSR, ELL, and CSC formats on the Sunway TaihuLight supercomputer. First, a multi-level parallelism design for SpGEMM is proposed to exploit the parallelism of over 10 million cores and better control memory based on the special Sunway architecture. Optimization strategies, such as load balance, coalesced DMA transmission, data reuse, vectorized computation, and parallel pipeline processing, are applied to further optimize performance of SpGEMM kernels. Second, we thoroughly analyze the performance of the proposed kernels. Third, a performance-aware model for SpGEMM is proposed to select the most appropriate compressed storage formats for the sparse matrices that can achieve the optimal performance of SpGEMM on the Sunway. The experimental results show the SpGEMM kernels have good scalability and meet the challenge of the high-speed computing of large-scale data sets on the Sunway. In addition, the performance-aware model for SpGEMM achieves an absolute value of relative error rate of 8.31 percent on average when the kernels are executed in one single process and achieves 8.59 percent on average when the kernels are executed in multiple processes. The results show that the proposed performance-aware model is accurate enough to select the best formats for SpGEMM on the Sunway TaihuLight supercomputer. [ABSTRACT FROM AUTHOR]
Published: 2019
Full Text: View/download PDF
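
For readers unfamiliar with SpGEMM on the CSR format named in the abstract above, the following is a minimal sequential sketch of the standard row-by-row (Gustavson-style) formulation. It is not the paper's Sunway kernel and includes none of its optimizations; the `CSR` tuple layout and the function name `spgemm_csr` are assumptions made for illustration:

```python
from typing import Dict, List, Tuple

# Assumed CSR layout for this sketch: (row pointers, column indices, values).
CSR = Tuple[List[int], List[int], List[float]]


def spgemm_csr(a: CSR, b: CSR) -> CSR:
    """Multiply two CSR matrices row by row, accumulating each output row."""
    a_ptr, a_idx, a_val = a
    b_ptr, b_idx, b_val = b
    c_ptr: List[int] = [0]
    c_idx: List[int] = []
    c_val: List[float] = []
    for i in range(len(a_ptr) - 1):
        acc: Dict[int, float] = {}  # sparse accumulator for row i of C
        for p in range(a_ptr[i], a_ptr[i + 1]):
            k, a_ik = a_idx[p], a_val[p]
            # Scale row k of B by A[i, k] and add it into the accumulator.
            for q in range(b_ptr[k], b_ptr[k + 1]):
                j = b_idx[q]
                acc[j] = acc.get(j, 0.0) + a_ik * b_val[q]
        for j in sorted(acc):  # emit the row with sorted column indices
            c_idx.append(j)
            c_val.append(acc[j])
        c_ptr.append(len(c_idx))
    return c_ptr, c_idx, c_val


# A = [[1, 0], [0, 2]], B = [[0, 3], [4, 0]]
A: CSR = ([0, 1, 2], [0, 1], [1.0, 2.0])
B: CSR = ([0, 1, 2], [1, 0], [3.0, 4.0])
C = spgemm_csr(A, B)
print(C)
```

Each output row depends only on one row of A and the rows of B it references, which is why the rows can be computed independently; how the paper distributes that work across Sunway cores is not shown here.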

44. Integrating Concurrency Control in n-Tier Application Scaling Management in the Cloud.

Author: Wang, Qingyang, Chen, Hui, Zhang, Shungeng, Hu, Liting, and Palanisamy, Balaji
Subjects: *ELECTRONIC commerce, *RESOURCE management, *DATABASES, *WEB-based user interfaces, *CLOUD computing
Abstract: Scaling complex distributed systems such as e-commerce is an important practice to simultaneously achieve high performance and high resource efficiency in the cloud. Most previous research focuses on hardware resource scaling to handle runtime workload variation. Through extensive experiments using a representative n-tier web application benchmark (RUBBoS), we demonstrate that scaling an n-tier system by adding or removing VMs without appropriately re-allocating soft resources (e.g., server threads and connections) may lead to significant performance degradation resulting from implicit change of request processing concurrency in the system, causing either over- or under-utilization of the critical hardware resource in the system. We build a concurrency-aware model that determines a near optimal soft resource allocation of each tier by combining some operational queuing laws and the fine-grained online measurement data of the system. We then develop a dynamic concurrency management (DCM) framework that integrates the concurrency-aware model to intelligently reallocate soft resources in the system during the system scaling process. We compare DCM with Amazon EC2-AutoScale, the state-of-the-art hardware-only scaling management solution, using six real-world bursty workload traces. The experimental results show that DCM achieves significantly shorter tail latency and higher throughput compared to Amazon EC2-AutoScale under all the workload traces. [ABSTRACT FROM AUTHOR]
Published: 2019
Full Text: View/download PDF

45. Ultra-Fast Bloom Filters using SIMD Techniques.

Author: Lu, Jianyuan, Wan, Ying, Li, Yang, Zhang, Chuwen, Dai, Huichen, Wang, Yi, Zhang, Gong, and Liu, Bin
Subjects: *INFORMATION filtering systems, *ENCODING, *NETWORK routers, *COMPUTER network resources, *MEMBERSHIP
Abstract: The network link speed is growing at an ever-increasing rate, which requires all network functions on routers/switches to keep pace. The Bloom filter is a widely used membership check data structure in networking applications. Correspondingly, it also faces the urgent demand of improving the performance in membership check speed. To this end, this paper proposes a new Bloom filter variant called Ultra-Fast Bloom Filters (UFBF), by leveraging the Single Instruction Multiple Data (SIMD) techniques. We make three improvements for UFBF to accelerate the membership check speed. First, we develop a novel hash computation algorithm which can compute multiple hash functions in parallel with the use of SIMD instructions. Second, we change a Bloom filter’s bit-test process from sequential to parallel, enabling more bit-tests per unit time. Third, we improve the cache efficiency of membership check by encoding an element’s information to a small block so that it can fit into a cache-line. We further generalize UFBF into a variant called c-UFBF, so that UFBF supports a large number of hash functions. Both theoretical analysis and extensive evaluations show that the UFBF greatly outperforms the state-of-the-art Bloom filter variants on membership check speed. [ABSTRACT FROM AUTHOR]
Published: 2019
Full Text: View/download PDF
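
The idea of confining an element's bits to one cache-line-sized block and testing them together can be illustrated with a plain blocked Bloom filter. This is a scalar sketch, not the UFBF algorithm or its SIMD hashing; the class name, the 512-bit block size, and the use of BLAKE2b to derive bit positions are all assumptions made for illustration:

```python
import hashlib
from typing import List, Tuple


class BlockedBloomFilter:
    """Bloom filter whose k bits for an element all fall in one block."""

    BLOCK_BITS = 512  # assumed block size: one 64-byte cache line

    def __init__(self, num_blocks: int, num_hashes: int) -> None:
        # A 64-byte digest supplies 8 bytes for the block index
        # and 2 bytes per bit position, so at most 28 positions.
        if not 1 <= num_hashes <= 28:
            raise ValueError("num_hashes must be between 1 and 28")
        if num_blocks < 1:
            raise ValueError("num_blocks must be positive")
        self.num_blocks = num_blocks
        self.num_hashes = num_hashes
        self.blocks: List[int] = [0] * num_blocks  # one int per 512-bit block

    def _block_and_mask(self, item: bytes) -> Tuple[int, int]:
        digest = hashlib.blake2b(item, digest_size=64).digest()
        block = int.from_bytes(digest[:8], "little") % self.num_blocks
        mask = 0
        for i in range(self.num_hashes):
            start = 8 + 2 * i
            bit = int.from_bytes(digest[start:start + 2], "little") % self.BLOCK_BITS
            mask |= 1 << bit
        return block, mask

    def add(self, item: bytes) -> None:
        block, mask = self._block_and_mask(item)
        self.blocks[block] |= mask

    def might_contain(self, item: bytes) -> bool:
        block, mask = self._block_and_mask(item)
        # All bits are checked with a single AND and compare on one block.
        return self.blocks[block] & mask == mask


bf = BlockedBloomFilter(num_blocks=64, num_hashes=4)
for word in (b"alpha", b"beta", b"gamma"):
    bf.add(word)
print(all(bf.might_contain(w) for w in (b"alpha", b"beta", b"gamma")))
```

Building one mask per element and comparing it against a single block is the scalar analogue of testing several bits at once; a SIMD implementation would perform the hashing and the comparison with vector instructions instead.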

46. Exploiting Parallelism for CNN Applications on 3D Stacked Processing-In-Memory Architecture.

Author: Wang, Yi, Chen, Weixuan, Yang, Jing, and Li, Tao
Subjects: *ARTIFICIAL neural networks, *ARTIFICIAL intelligence, *DATA transmission systems, *CACHE memory, *DYNAMIC programming
Abstract: Deep convolutional neural networks (CNNs) are widely adopted in intelligent systems with unprecedented accuracy but at the cost of a substantial amount of data movement. Although the emerging processing-in-memory (PIM) architecture seeks to minimize data movement by placing memory near processing elements, memory is still the major bottleneck in the entire system. The selection of hyper-parameters in the training of CNN applications requires hundreds of kilobytes of cache capacity for concurrent processing of convolutions. How to jointly explore the computation capability of the PIM architecture and the highly parallel property of neural networks remains a critical issue. This paper presents Para-Net, which exploits Parallelism for deterministic convolutional neural Networks on the PIM architecture. Para-Net achieves data-level parallelism for convolutions by fully utilizing the on-chip processing engine (PE) in PIM. The objective is to capture the characteristics of neural networks and present a hardware-independent design to jointly optimize the scheduling of both intermediate results and computation tasks. We formulate this data allocation problem as a dynamic programming model and obtain an optimal solution. To demonstrate the viability of the proposed Para-Net, we conduct a set of experiments using a variety of realistic CNN applications. The graph abstractions are obtained from the deep learning framework Caffe. Experimental results show that Para-Net can significantly reduce processing time and improve cache efficiency compared to representative schemes. [ABSTRACT FROM AUTHOR]
Published: 2019
Full Text: View/download PDF

47. A Virtual Multi-Channel GPU Fair Scheduling Method for Virtual Machines.

Author: Tan, Huailiang, Tan, Yanjie, He, Xiaofei, Li, Kenli, and Li, Keqin
Subjects: *GRAPHICS processing units, *VIRTUAL machine systems, *MACHINE learning, *DEEP learning, *COMPUTER architecture
Abstract: In modern virtual computing environments, the 2D/3D rendering performance and parallel computing potential of the GPU (graphics processing unit) must be fully exploited for multiple virtual machines (VMs). Existing GPU virtualization techniques are unable to take full advantage of a GPU's powerful 2D/3D hardware-accelerated graphics rendering performance or parallel computing potential, or they do not consider whether the internal resources of a GPU domain are fairly allocated between VMs with different performance requirements. Therefore, we propose a multi-channel GPU virtualization architecture (VMCG), model the corresponding credit allocating and transferring mechanisms, and redesign the virtual multi-channel GPU fair-scheduling algorithm. VMCG provides a separate V-Channel for each guest VM (DomU) that competes with other VMs for the same physical GPU resources, and each DomU submits command request blocks to its respective V-Channel according to the corresponding DomU ID. Through the virtual multi-channel GPU fair-scheduling algorithm, not only do multiple DomUs make full use of native GPU hardware acceleration, but the fairness of GPU resource allocation is significantly improved during GPU-intensive workloads from multiple DomUs running on the same host. Experimental results show that, for 2D/3D graphics applications, performance is close to 96 percent of that of the native GPU, performance is improved by approximately 500 percent for parallel computing applications, and GPU resource-allocation fairness is improved by approximately 60-80 percent. [ABSTRACT FROM AUTHOR]
Published: 2019
Full Text: View/download PDF

48. Comparative Analysis of Intra-Algorithm Parallel Multiobjective Evolutionary Algorithms: Taxonomy Implications on Bioinformatics Scenarios.

Author: Santander-Jimenez, Sergio and Vega-Rodriguez, Miguel A.
Subjects: *COMPARATIVE studies, *EVOLUTIONARY algorithms, *TAXONOMY, *BIOINFORMATICS, *PARETO distribution
Abstract: Parallelism has become a recurrent tool to support computational intelligence and, particularly, evolutionary algorithms in the solution of very complex optimization problems, especially in the multiobjective case. However, the selection of parallel evolutionary designs often represents a difficult question due to the multiple variables that must be considered to attain an accurate exploitation of hardware resources, along with their influence in solution quality. This work looks into this issue by conducting a comparative performance analysis of intra-algorithm parallel multiobjective evolutionary algorithms running on shared-memory configurations. We consider different design trends including A) generational approaches based on measurements of solution quality plus diversity, B) generational approaches based on measurements of solution quality exclusively, and C) non-generational approaches. Following these trends, a total of six representative algorithms are applied to tackle a challenging bioinformatics problem as a case study, phylogenetic reconstruction. Experimentation on real-world scenarios points out the main advantages and weaknesses of each design, outlining guidelines for the selection of methods according to the characteristics of the employed hardware, evolutionary properties, and the parallelism exploitation capabilities of the evaluated approaches. [ABSTRACT FROM AUTHOR]
Published: 2019
Full Text: View/download PDF

49. Adaptive Scheduling Parallel Jobs with Dynamic Batching in Spark Streaming.

Author: Cheng, Dazhao, Zhou, Xiaobo, Wang, Yu, and Jiang, Changjun
Subjects: *COMPUTER scheduling, *RESOURCE allocation, *PARALLEL processing, *STREAMING technology, *DATA analytics
Abstract: Today enterprises have massive stream data that must be processed in real time due to the data explosion in recent years. Spark Streaming, an emerging system, was developed to process real-time stream data analytics using a micro-batch approach. The unified programming model of Spark Streaming leads to some unique benefits over other traditional streaming systems, such as fast recovery from failures, better load balancing and resource usage. It treats the continuous stream as a series of micro-batches of data and continuously processes these micro-batch jobs. However, efficient scheduling of micro-batch jobs to achieve high throughput and low latency is very challenging due to the complex data dependency and dynamism inherent in streaming workloads. In this paper, we propose A-scheduler, an adaptive scheduling approach that dynamically schedules parallel micro-batch jobs in Spark Streaming and automatically adjusts scheduling parameters to improve performance and resource efficiency. Specifically, A-scheduler dynamically schedules multiple jobs concurrently using different policies based on their data dependencies and automatically adjusts the level of job parallelism and resource shares among jobs based on workload properties. Furthermore, we integrate a dynamic batching technique with A-scheduler to further improve the overall performance of the customized Spark Streaming system. It relies on an expert fuzzy control mechanism to dynamically adjust the length of each batch interval in response to time-varying streaming workload and system processing rate. We implemented A-scheduler and evaluated it with a real-time security event processing workload. Our experimental results show that A-scheduler with dynamic batching can reduce end-to-end latency by 38 percent and meanwhile improve workload throughput and energy efficiency by 23 and 15 percent, respectively, compared to the default Spark Streaming scheduler. [ABSTRACT FROM AUTHOR]
Published: 2018
Full Text: View/download PDF

50. M-Oscillating: Performance Maximization on Temperature-Constrained Multi-Core Processors.

Author: Sha, Shi, Wen, Wujie, Ren, Shaolei, and Quan, Gang
Subjects: *OSCILLATIONS, *ELECTRONIC equipment, *MULTICORE processors, *HETEROGENEITY, *PARALLEL processing
Abstract: The ever-increasing computational demand drives modern electronic devices to integrate more processing elements for pursuing higher computing performance. However, the resulting soaring power density and potential thermal crisis constrain the system performance under a maximally allowed temperature. This paper analytically studies the throughput maximization problem of multi-core platforms under the peak temperature constraints. To take advantage of thermal heterogeneity of different cores for performance improvement, we propose to run each core with multiple speed levels and develop a schedule based on two novel concepts, i.e., the step-up schedule and the m-Oscillating schedule, for multi-core platforms. The proposed methodology can guarantee the peak temperature constraint while improving computing throughput by up to 89 percent, with an average improvement of 11 percent. Meanwhile, the computational time is reduced by orders of magnitude compared to the traditional exhaustive search-based approach. [ABSTRACT FROM AUTHOR]
Published: 2018
Full Text: View/download PDF
