Descriptor: "computer architecture" / Journal: ieee transactions on parallel & distributed systems - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"computer architecture"' showing total 658 results


1. Floating Point Calculation of the Cube Function on FPGAs.

Author: Osorio, Roberto R.
Subjects: *MATHEMATICAL functions, *FIELD programmable gate arrays, *DIGITAL integrated circuits, *CUBES, *ARITHMETIC
Abstract: Specialized arithmetic units allow fast and efficient computation of lesser-used mathematical functions. The overall impact of those units would be negligible in a general-purpose processor, as the added circuitry makes chips more complex even though most software would seldom use it. In contrast, custom computing machines are built for a specific task, and they can always benefit from specialized units when they are available. In this work, floating point architectures are proposed for computing the cube on Intel and Xilinx FPGAs. Those implementations reduce the cost and latency compared to using simple floating point multiplications and squarers. [ABSTRACT FROM AUTHOR]
Published: 2023
Full Text: View/download PDF

2. Dissecting Tensor Cores via Microbenchmarks: Latency, Throughput and Numeric Behaviors.

Author: Sun, Wei, Li, Ang, Geng, Tong, Stuijk, Sander, and Corporaal, Henk
Subjects: *MATRIX multiplications, *SPARSE matrices, *APPLICATION program interfaces, *GRAPHICS processing units
Abstract: Tensor Cores have been an important unit to accelerate Fused Matrix Multiplication Accumulation (MMA) in all NVIDIA GPUs since the Volta architecture. To program Tensor Cores, users have to use either the legacy wmma APIs or the current mma APIs. The legacy wmma APIs are easier to use but can exploit only limited features and power of Tensor Cores. Specifically, the wmma APIs support fewer operand shapes and cannot leverage the new sparse matrix multiplication feature of the newest Ampere Tensor Cores. However, the performance of the current programming interface has not been well explored. Furthermore, the numeric behaviors of the low-precision floating-point formats (TF32, BF16, and FP16) supported by the newest Ampere Tensor Cores are also poorly understood. In this paper, we explore the throughput and latency of the current programming APIs. We also study the numeric behaviors of Tensor Core MMA and profile the intermediate operations, including multiplication, addition of the inner product, and accumulation. All codes used in this work can be found in https://github.com/sunlex0717/DissectingTensorCores. [ABSTRACT FROM AUTHOR]
Published: 2023
Full Text: View/download PDF

3. swMPAS-A: Scaling MPAS-A to 39 Million Heterogeneous Cores on the New Generation Sunway Supercomputer.

Author: Hao, Xiaoyu, Fang, Tao, Chen, Junshi, Gu, Jun, Feng, Jiawang, An, Hong, and Zhao, Chun
Subjects: *HETEROGENEOUS computing, *SUPERCOMPUTERS, *SCIENTIFIC computing, *COMPUTER architecture, *SCALABILITY, *ATMOSPHERIC models, *PREDICTION models
Abstract: With the computing power of High-Performance Computing (HPC) systems having stepped into the exascale era, more complex problems can be solved with scientific applications on a large scale. However, due to the significant performance gap between computing nodes and storage subsystems, suboptimal design for the Input/Output (I/O) module will significantly impede the efficiency of scientific applications, especially for the ubiquitous atmosphere applications. Two-phase I/O implemented in N-to-1 mode creates a serious bottleneck that hinders the scalability for the Model for Prediction Across Scales-Atmosphere (MPAS-A) on the new generation Sunway supercomputer. To address the I/O problem, we apply a custom data reorganization method to enable N-to-M I/O mode to exploit the parallel file system's performance and limit the data transfer among MPI ranks to a restricted scope to alleviate communication overhead. Moreover, we have conducted several methods to accelerate the computations, including the redesign for tracer transport, a hybrid buffering scheme, and a three-level parallelization scheme, which allows MPAS-A to use all heterogeneous computing resources efficiently. Experimental results show admirable scalability and efficiency of our I/O method, which achieves speedups of 41× and 58.9× for input and output compared with the raw I/O method on 30,000 MPI ranks. By scaling MPAS-A to 39 million heterogeneous cores, we demonstrate the necessity of a well-constructed I/O module for a real-world atmosphere application. Speed tests show that our optimization methods obtain good results for computations, and MPAS-A achieves a speed of 0.82 Simulated Day per Hour (SDPH) and 0.76 parallel efficiency of strong scaling with 600,000 MPI ranks. [ABSTRACT FROM AUTHOR]
Published: 2023
Full Text: View/download PDF

4. A Novel Compute-Efficient Tridiagonal Solver for Many-Core Architectures.

Author: Liu, Kan and Xue, Wei
Subjects: *COMPUTER architecture, *GRAPHICS processing units
Abstract: The tridiagonal solver is an important kernel and is widely supported in mainstream numerical libraries. While parallel algorithms have been studied for many-core architectures, the performance of current algorithms and implementations is still hindered by input size sensitivity and cross-platform portability. In this paper, we propose a novel algorithm WM-pGE for the batched solution of diagonally dominant tridiagonal systems. The algorithm balances the key design objectives, including computation complexity, memory complexity, parallelism, and input size sensitivity, better than existing algorithms. Moreover, an elegant formulation is presented to show the implementation and cross-platform optimization without loss of efficiency and generality, by extracting the platform-dependent works into only four vector operators. The results from our batched tridiagonal experiments show that the proposed algorithm outperforms the prior work PCR-pThomas by 25% and 12% on NVIDIA Tesla V100 in single and double precision, respectively. On Intel KNL, our method achieves a 10% improvement in performance over PCR-pThomas in double precision. [ABSTRACT FROM AUTHOR]
Published: 2023
Full Text: View/download PDF
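
For context on entry 4: the abstract does not give the details of WM-pGE, so the sketch below shows only the classic sequential Thomas algorithm for a single diagonally dominant tridiagonal system, the textbook baseline that batched parallel solvers build on. The function name `thomas_solve` and the array layout (`lower[0]` and `upper[-1]` unused) are assumptions made for illustration, not anything taken from the paper.

```python
def thomas_solve(lower, diag, upper, rhs):
    """Solve one tridiagonal system with the classic Thomas algorithm.

    lower[i] is the sub-diagonal entry of row i (lower[0] is ignored),
    diag[i] the main diagonal, upper[i] the super-diagonal entry of row i
    (upper[-1] is ignored). Assumes the matrix is diagonally dominant,
    so no pivoting is needed.
    """
    n = len(diag)
    c = [0.0] * n  # modified super-diagonal
    d = [0.0] * n  # modified right-hand side

    # Forward elimination.
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom if i < n - 1 else 0.0
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom

    # Back substitution.
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


# Usage: a 3x3 system whose exact solution is [1, 2, 3].
solution = thomas_solve([0.0, 1.0, 1.0], [4.0, 4.0, 4.0], [1.0, 1.0, 0.0],
                        [6.0, 12.0, 14.0])
```

The forward sweep is inherently sequential, which is why GPU-oriented solvers such as those named in the abstract restructure the computation to expose parallelism across and within systems.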

5. TC-Stream: Large-Scale Graph Triangle Counting on a Single Machine Using GPUs.

Author: Huang, Jianqiang, Wang, Haojie, Fei, Xiang, Wang, Xiaoying, and Chen, Wenguang
Subjects: *GRAPH algorithms, *SOLID state drives, *GRAPHICS processing units, *PARALLEL algorithms, *TRIANGLES, *ON-demand computing, *COUNTING, *SOCIAL network analysis
Abstract: In this paper, we build TC-Stream, a high-performance graph processing system specific for a triangle counting algorithm on graph data with up to tens of billions of edges, which significantly exceeds the device memory capacity of Graphics Processing Units (GPUs). The triangle counting problem is a broad research topic in data mining and social network analysis in the graph processing field. As the scale of the graph data grows, a portion of the graph data must be loaded iteratively. In the existing literature, graphs with billions of edges need to be done distributively, which is cost-intensive. Also, many disk-based triangle counting systems are proposed for CPU architectures, but their tackling performances are inefficient. To solve the above problem, we propose TC-Stream, and it focuses on three issues: 1) For power-law graphs, because the amount of tasks of each vertex or edge is inconsistent, it is bound to cause different demands of computing and memory resources for different task types. We propose a parallel vertex approach and the reordering of vertices for graph data that can be placed in the GPU device memory to ensure the maximum workload balancing; 2) A binary-search-based set intersection method is designed to achieve the maximum parallelism in GPU; 3) For the graph data that exceeds the GPU device memory capacity, we develop a novel vertical partition algorithm to guarantee the independent computing on each partition so that the three computation processes, i.e., the computation on GPU, the data transmission between main memory of CPU and SSD, and the communication between the CPU and the GPU can be perfectly overlapped. Moreover, TC-Stream optimizes edge-iterator models and benefits from multi-thread parallelism.
Extensive experiments conducted on large-scale datasets showed that TC-Stream running on a single Tesla V100 GPU performs 2.4-6× and 1.8-4.4× faster than the state-of-the-art single-machine in-memory triangle counting system and GPU-based triangle counting system, respectively, and achieves 2.4× faster than the state-of-the-art out-of-core distributed system PDTL running on an 8-node cluster when processing the graph data with 42.5 billion edges, which demonstrates the high performance and cost-effectiveness of TC-Stream. [ABSTRACT FROM AUTHOR]
Published: 2022
Full Text: View/download PDF
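
To illustrate the binary-search-based set intersection that entry 5 mentions, here is a minimal single-threaded sketch of edge-iterator triangle counting. It is not TC-Stream's GPU implementation, which the abstract does not describe in detail; the names `contains` and `count_triangles` and the low-to-high edge orientation are assumptions chosen for illustration.

```python
from bisect import bisect_left


def contains(sorted_list, x):
    """Binary search: True if x occurs in the sorted list."""
    i = bisect_left(sorted_list, x)
    return i < len(sorted_list) and sorted_list[i] == x


def count_triangles(edges):
    """Count triangles in an undirected graph given as (u, v) pairs."""
    # Orient every edge from the smaller to the larger vertex id so that
    # each triangle a < b < c is found exactly once, via the edge (a, b).
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        a, b = (u, v) if u < v else (v, u)
        adj.setdefault(a, set()).add(b)
    adj = {u: sorted(vs) for u, vs in adj.items()}

    total = 0
    for u, nbrs in adj.items():
        for v in nbrs:
            v_nbrs = adj.get(v, [])
            # Intersect the two sorted neighbor lists by binary-searching
            # each element of the shorter list in the longer one.
            if len(nbrs) <= len(v_nbrs):
                short, long_ = nbrs, v_nbrs
            else:
                short, long_ = v_nbrs, nbrs
            total += sum(1 for w in short if contains(long_, w))
    return total


# Usage: one triangle (0, 1, 2) plus a dangling edge (2, 3).
triangles = count_triangles([(0, 1), (1, 2), (0, 2), (2, 3)])
```

Each lookup is independent of the others, which is what makes this style of intersection a natural fit for assigning one search per GPU thread.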

6. Auto-GNAS: A Parallel Graph Neural Architecture Search Framework.

Author: Chen, Jiamin, Gao, Jianliang, Chen, Yibo, Oloulade, Babatounde Moctard, Lyu, Tengfei, and Li, Zhao
Subjects: *GRAPH algorithms, *SEARCH algorithms, *GENETIC algorithms, *LINEAR acceleration, *PARALLEL programming, *COMPUTER architecture
Abstract: Graph neural networks (GNNs) have received much attention as GNNs have recently been successfully applied to non-Euclidean data. However, manually designed graph neural networks often fail to achieve satisfactory model performance for a given graph dataset. With the rise of automatic machine learning, graph neural architecture search effectively constructs GNNs that achieve the expected model performance. The challenge is efficiently and automatically finding the optimal GNN architecture in a vast search space. Existing search methods evaluate the GNN architectures serially, severely limiting system efficiency. To solve these problems, we develop an Automatic Graph Neural Architecture Search framework (Auto-GNAS) with parallel estimation to implement an automatic graph neural search process that requires almost no manual intervention. In Auto-GNAS, we design the search algorithm with multiple genetic searchers. Each searcher can simultaneously use evaluation feedback information, information entropy, and search results from other searchers based on a sharing mechanism to improve the search efficiency. As far as we know, this is the first work using parallel computing to improve the system efficiency of graph neural architecture search. According to experiments on real datasets, Auto-GNAS obtains competitive model performance and better search efficiency than other search algorithms. Since the parallel estimation ability of Auto-GNAS is independent of search algorithms, we extend different search algorithms based on Auto-GNAS for scalability experiments. The results show that Auto-GNAS with varying search algorithms can achieve nearly linear acceleration with the increase of computing resources. [ABSTRACT FROM AUTHOR]
Published: 2022
Full Text: View/download PDF

7. Predicting Throughput of Distributed Stochastic Gradient Descent.

Author: Li, Zhuojin, Paolieri, Marco, Golubchik, Leana, Lin, Sung-Han, and Yan, Wumo
Subjects: *OCCUPATIONAL training, *MULTICASTING (Computer networks), *ASYNCHRONOUS learning, *EMPLOYEE training, *FORECASTING, *COMPUTER architecture
Abstract: Training jobs of deep neural networks (DNNs) can be accelerated through distributed variants of stochastic gradient descent (SGD), where multiple nodes process training examples and exchange updates. The total throughput of the nodes depends not only on their computing power, but also on their networking speeds and coordination mechanism (synchronous or asynchronous, centralized or decentralized), since communication bottlenecks and stragglers can result in sublinear scaling when additional nodes are provisioned. In this paper, we propose two classes of performance models to predict throughput of distributed SGD: fine-grained models, representing many elementary computation/communication operations and their dependencies; and coarse-grained models, where SGD steps at each node are represented as a sequence of high-level phases without parallelism between computation and communication. Using a PyTorch implementation, real-world DNN models and different cloud environments, our experimental evaluation illustrates that, while fine-grained models are more accurate and can be easily adapted to new variants of distributed SGD, coarse-grained models can provide similarly accurate predictions when augmented with ad hoc heuristics, and their parameters can be estimated with profiling information that is easier to collect. [ABSTRACT FROM AUTHOR]
Published: 2022
Full Text: View/download PDF

8. ReHy: A ReRAM-Based Digital/Analog Hybrid PIM Architecture for Accelerating CNN Training.

Author: Jin, Hai, Liu, Cong, Liu, Haikun, Luo, Ruikun, Xu, Jiahong, Mao, Fubing, and Liao, Xiaofei
Subjects: *NONVOLATILE random-access memory, *CONVOLUTIONAL neural networks, *BLOOD pressure testing machines, *MATRIX multiplications, *LOGIC design
Abstract: Processing-In-Memory (PIM) has emerged as a high-performance and energy-efficient computing paradigm for accelerating convolutional neural network (CNN) applications. Resistive random access memory (ReRAM) has been widely used in PIM architectures due to its extremely high efficiency for accelerating matrix-vector multiplications through analog computing. However, because CNN training usually requires high-precision computation in the backward propagation (BP) stage, the limited precision of analog PIM accelerators impedes their adoption in CNN training. In this article, we propose ReHy, a hybrid PIM accelerator to support CNN training in ReRAM arrays. It is composed of Analog PIM (APIM) and Digital PIM (DPIM) modules. ReHy uses APIM to accelerate the feed-forward propagation (FP) stage for high performance, and DPIM to process the BP stage for high accuracy. We exploit the capability of ReRAM for Boolean logic operations to design the DPIM architecture. Particularly, we design floating-point multiplication and addition operators to support matrix multiplications in ReRAM arrays. We also propose a performance model to offload high-precision matrix multiplications to DPIM according to the data parallelism. Experimental results show that ReHy can speed up CNN training by 48.8× and 2.4×, and reduce energy consumption by 35.1× and 2.33×, compared with CPU/GPU architectures (baseline) and the state-of-the-art FloatPIM, respectively. [ABSTRACT FROM AUTHOR]
Published: 2022
Full Text: View/download PDF

9. Heterogeneous Systolic Array Architecture for Compact CNNs Hardware Accelerators.

Author: Xu, Rui, Ma, Sheng, Wang, Yaohua, Guo, Yang, Li, Dongsheng, and Qiao, Yuran
Subjects: *CONVOLUTIONAL neural networks, *FLEXIBLE structures
Abstract: Compact convolutional neural networks have become a hot research topic. However, we find that systolic array accelerators are extremely inefficient in dealing with compact models, especially when processing depthwise convolutional layers in the neural networks. To make systolic arrays more efficient for compact convolutional neural networks, we propose the heterogeneous systolic array (HeSA) architecture. It introduces heterogeneous processing elements that support multiple dataflows, which can further exploit the data reuse opportunities of depthwise convolutional layers without changing the structure of the naïve systolic array. By increasing the utilization rate of processing elements in the array, the HeSA improves the performance, throughput, and energy efficiency compared to the standard baseline. In addition, we design a flexible buffer structure for the HeSA. By configuring it, the HeSA can allocate bandwidth flexibly to maintain high performance and low communication cost. Based on our evaluation with typical workloads, the HeSA improves the utilization rate of the computing resource in depthwise convolutional layers by 4.5× - 11.2× and acquires 1.6 - 3.1× total performance speedup compared to the standard systolic array architecture. In the large-scale array design, the HeSA can reduce the data traffic by 40% while maintaining the same performance as the scaling-out method. By improving the on-chip data reuse opportunities and reducing data traffic, the HeSA saves over 20% in energy consumption. Meanwhile, the area of the HeSA is basically unchanged compared to the baseline due to its simple design. [ABSTRACT FROM AUTHOR]
Published: 2022
Full Text: View/download PDF

10. Critique of “MemXCT: Memory-Centric X-Ray CT Reconstruction With Massive Parallelization” by SCC Team From the University of Texas at Austin.

Author: Davis, Brock, Paez, Juan, Gaither, Jack, and Garcia, Joe A.
Subjects: *COMPUTED tomography, *VIRTUAL machine systems, *X-rays, *GRAPHICS processing units, *MICROSOFT Azure (Computing platform), *COMPUTER workstation clusters
Abstract: This report describes The University of Texas Student Cluster Competition team’s effort to reproduce the results of “MemXCT: memory-centric X-ray CT reconstruction with massive parallelization” (Hidayetoğlu et al., 2019). The article details a new memory-centric approach that reconstructs X-ray computed tomography (XCT) from noisy raw data. In our reproduction experiments, we utilized Microsoft Azure’s CycleCloud tool to provision, orchestrate, and manage our computing cluster in the cloud. In particular, we scheduled and benchmarked reconstruction workloads using Azure’s CPU-based HC44rs and GPU-based NC12s v2 virtual machine (VM) types to evaluate the scalability properties of the reconstruction approach and the performance differences between architectures. The HC44rs VMs contained 44 Intel Xeon Platinum cores, while the NC12s v2 VM was equipped with two NVIDIA P100 GPUs. We used a recent version of Intel’s compiler stack with the MKL library for our CPU code along with CUDA 11.1 on GPUs. Overall, our results confirm the findings of the original article, demonstrating similar acceleration on GPUs and scalability properties on CPUs. Digital artifacts from these experiments are available at: 10.5281/zenodo.5598108 [ABSTRACT FROM AUTHOR]
Published: 2022
Full Text: View/download PDF

11. Adaptive Resource Efficient Microservice Deployment in Cloud-Edge Continuum.

Author: Fu, Kaihua, Zhang, Wei, Chen, Quan, Zeng, Deze, and Guo, Minyi
Subjects: *REINFORCEMENT learning, *RESOURCE allocation, *COMPUTER architecture, *FUNCTIONAL analysis, *RESOURCE management, *BANDWIDTHS
Abstract: User-facing services are now evolving towards the microservice architecture where a service is built by connecting multiple microservice stages. Since the entire service is heavy, the microservice architecture offers the opportunity to offload only some microservice stages to the edge devices that are close to the end users. However, emerging techniques often result in the violation of Quality-of-Service (QoS) of microservice-based services in cloud-edge continuum, as they do not consider the communication overhead or the resource contention between microservices and external co-located tasks. We propose Nautilus, a runtime system that effectively deploys microservice-based user-facing services in cloud-edge continuum. Nautilus ensures the QoS of microservice-based user-facing services while minimizing the required computational resources. It comprises a communication-aware microservice mapper, a contention-aware resource manager, and an IO-sensitive and load-aware microservice migration scheduler. The mapper divides the microservice graph into multiple partitions based on the communication overhead and maps the partitions to appropriate nodes. On each node, the resource manager determines the optimal resource allocation for its microservices based on reinforcement learning that may capture the complex contention behaviors. Once microservices suffer from external IO pressure, the IO-sensitive microservice scheduler migrates the critical one to idle nodes. Furthermore, when the load of microservices changes dynamically, the load-aware microservice scheduler migrates microservices from busy nodes to idle ones to ensure the QoS goal of the entire service. Our experimental results show that Nautilus can guarantee the required QoS target under external shared resources contention while the state-of-the-art suffers from QoS violations.
Meanwhile, Nautilus reduces the computational resource usage by 23.9% and the network bandwidth usage by 53.4%, while achieving the required 99%-ile latency. [ABSTRACT FROM AUTHOR]
Published: 2022
Full Text: View/download PDF

12. Efficient and Automated Deployment Architecture for OpenStack in TianHe SuperComputing Environment.

Author: Jiang, Bingting, Tang, Zhuo, Xiao, Xiong, Yao, Jing, Cao, Ronghui, and Li, Kenli
Subjects: *ON-demand computing, *GLOBAL Financial Crisis, 2008-2009, *COMPUTER workstation clusters, *VACCINE development, *COMPUTER architecture
Abstract: Recently, with the large-scale outbreak of the global financial crisis and public safety incidents (such as COVID-19), high-performance computing has been widely applied to risk prediction, vaccine development, and other fields. In scenarios where high-performance computing infrastructure responds to the instantaneous explosion of computing demands, a crucial issue is to provide large-scale flexible allocation and adjustment of computing capability by rapidly constructing computing clusters. Existing large-scale computing cluster deployment solutions usually utilize source code deployment or other deployment tools. The great challenge of existing deployment methods is to reduce excessive image distribution time and refrain from configuration defects. In this article, we design an intelligent distributed registry deployment (IDRD) architecture based on the OpenStack cloud platform, which adaptively places distributed image repositories using the containerized deployment of multiple registries. We propose a server load priority algorithm to solve multiple registries placement problems in IDRD. Furthermore, we devise a clustering algorithm based on demand density that can optimize the global performance of IDRD and improve large-scale cluster load balancing capabilities, which has been implemented in the TianHe Supercomputing environment. Extensive experimental results demonstrate that IDRD can effectively reduce 30% − 50% of the distribution time of component images and significantly improve the efficiency of large-scale cluster deployment. [ABSTRACT FROM AUTHOR]
Published: 2022
Full Text: View/download PDF

13. Exploring Data Analytics Without Decompression on Embedded GPU Systems.

Author: Pan, Zaifeng, Zhang, Feng, Zhou, Yanliang, Zhai, Jidong, Shen, Xipeng, Mutlu, Onur, and Du, Xiaoyong
Subjects: *GRAPHICS processing units, *COMPUTER architecture, *ENERGY consumption, *RANDOM access memory
Abstract: With the development of computer architecture, even for embedded systems, GPU devices can be integrated, providing outstanding performance and energy efficiency to meet the requirements of different industries, applications, and deployment environments. Data analytics is an important application scenario for embedded systems. Unfortunately, due to the limitation of the capacity of the embedded device, the scale of problems handled by the embedded system is limited. In this paper, we propose a novel data analytics method, called G-TADOC, for efficient text analytics directly on compression on embedded GPU systems. A large amount of data can be compressed and stored in embedded systems, and can be processed directly in the compressed state, which greatly enhances the processing capabilities of the systems. Particularly, G-TADOC has three innovations. First, a novel fine-grained thread-level workload scheduling strategy for GPU threads has been developed, which partitions heavily-dependent loads adaptively in a fine-grained manner. Second, a GPU thread-safe memory pool has been developed to handle inconsistency with low synchronization overheads. Third, a sequence-support strategy is provided to maintain high GPU parallelism while ensuring sequence information for lossless compression. Moreover, G-TADOC involves special optimizations for embedded GPUs, such as utilizing the CPU-GPU shared unified memory. Experiments show that G-TADOC provides 13.2× average speedup compared to the state-of-the-art TADOC. G-TADOC also improves performance-per-cost by 2.6× and energy efficiency by 32.5× over TADOC. [ABSTRACT FROM AUTHOR]
Published: 2022
Full Text: View/download PDF

14. SaPus: Self-Adaptive Parameter Update Strategy for DNN Training on Multi-GPU Clusters.

Author: Zhang, Zhaorui and Wang, Choli
Subjects: *GRAPHICS processing units, *BOTTLENECKS (Manufacturing), *COMMUNICATION strategies
Abstract: Parameter server architecture has been identified as an efficient framework for scaling DNN training on clusters. For large-scale deployment, communication becomes the bottleneck, and the parameter updating strategy strongly impacts the training performance and accuracy. Recent state-of-the-art solutions have adopted the local SGD approach, which enables workers to update their local version of models and only aggregate them to update the global parameters after finishing a number of iterations, to alleviate heavy communication pressure on the parameter server and improve the training performance. We identify three limitations of these works. First, these works do not provide an approach for determining when the worker is to update the parameter with the server under asynchronous communication strategies that can guarantee the training performance. Second, local SGD suffers from the problem of unbounded gradient delay. Previous work performs well for a short delay but cannot guarantee the performance as the gradient delay increases. Third, they do not consider the system performance, including the CPU, memory, and network, when determining the update interval of the local SGD, even though it strongly affects the training performance. We provide a self-adaptive parameter updating strategy called SaPus, which allows each worker to detect their training results through quantification of the accumulated gradient updates and determine when to update the parameter with the server adaptively and individually. Theoretical lower and upper bounds of the update interval are also provided. We also propose a weighted aggregation algorithm based on a global-loss window, which is used to collect the most recent loss value of other workers to calculate a weight for the accumulated gradients of each worker to solve the unbounded delay problem in asynchronous local SGD.
To increase the robustness of our parameter updating strategy, a performance model is built to provide a resource-aware lower bound for the update interval. Extensive experimental results generated on a GPU cluster indicate that our model improves the training performance of DNNs, achieving up to 66.67% speedup as compared with state-of-the-art solutions. Further, results show the CPU utilization of the server dropped by up to 81.1% and network bandwidth usage reduced to less than 1 Gbps on average during the training. [ABSTRACT FROM AUTHOR]
Published: 2022
Full Text: View/download PDF

15. Compiler-Assisted Compaction/Restoration of SIMD Instructions.

Author: Cebrian, Juan M., Balem, Thibaud, Barredo, Adrian, Casas, Marc, Moreto, Miquel, Ros, Alberto, and Jimborean, Alexandra
Subjects: *HIGH performance computing, *COMPACTING, *SUPERCOMPUTERS, *COMPUTER systems, *ENERGY consumption, *COMPUTER architecture
Abstract: Vector processors (e.g., SIMD or GPUs) are ubiquitous in high performance systems. All the supercomputers in the world exploit data-level parallelism (DLP), for example by using single instructions to operate over several data elements. Improving vector processing is therefore key for exascale computing. However, despite its potential, vector code generation and execution have significant challenges. Among these challenges, control flow divergence is one of the main performance limiting factors. Most modern vector instruction sets, including SIMD, rely on predication to support divergence control. Nevertheless, the performance and energy consumption in predicated codes is usually insensitive to the number of active elements in a predicated mask. Since the trend is that vector register size increases, the energy efficiency of exascale computing systems will become sub-optimal. This article proposes a novel approach to improve execution efficiency in predicated vector codes, the Compiler-Assisted Compaction/Restoration (CACR) technique. Baseline CR delays predicated SIMD instructions with inactive elements, compacting active elements from instances of the same instruction of consecutive loop iterations. Compacted elements form an equivalent dense vector instruction. After executing the dense instructions, their results are restored to the original instructions. However, CR has a significant performance and energy penalty when it fails to find active elements, either due to lack of resources when unrolling or because of inter-loop dependencies. In CACR, the compiler analyzes the code looking for key information required to configure CR. Then, it passes this information to the processor via new instructions inserted in the code. This prevents CR from waiting for active elements on scenarios when it would fail to form dense instructions. 
Simulated results (gem5) show that CACR improves performance by up to 29 percent and reduces dynamic energy by up to 24.2 percent on average, for a set of applications with predicated execution. The baseline CR only achieves 18.6 percent performance and 14 percent energy improvements for the same configuration and applications. [ABSTRACT FROM AUTHOR]
Published: 2022
Full Text: View/download PDF

16. Repurposing GPU Microarchitectures with Light-Weight Out-Of-Order Execution.

Author: Iliakis, Konstantinos, Xydis, Sotirios, and Soudris, Dimitrios
Subjects: *GRAPHICS processing units, *PARALLEL processing, *VERNACULAR architecture, *COMPUTER architecture
Abstract: GPU is the dominant platform for accelerating general-purpose workloads due to its computing capacity and cost-efficiency. GPU applications cover an ever-growing range of domains. To achieve high throughput, GPUs rely on massive multi-threading and fast context switching to overlap computations with memory operations. We observe that among the diverse GPU workloads, there exists a significant class of kernels that fail to maintain a sufficient number of active warps to hide the latency of memory operations, and thus suffer from frequent stalling. We argue that the dominant Thread-Level Parallelism model is not enough to efficiently accommodate the variability of modern GPU applications. To address this inherent inefficiency, we propose a novel micro-architecture with lightweight Out-Of-Order execution capability enabling Instruction-Level Parallelism to complement the conventional Thread-Level Parallelism model. To minimize the hardware overhead, we carefully design our extension to highly re-use the existing micro-architectural structures and study various design trade-offs to contain the overall area and power overhead, while providing improved performance. We show that the proposed architecture outperforms traditional platforms by 23 percent on average for low-occupancy kernels, with an area and power overhead of 1.29 and 10.05 percent, respectively. Finally, we establish the potential of our proposal as a micro-architecture alternative by providing 16 percent speedup over a wide collection of 60 general-purpose kernels. [ABSTRACT FROM AUTHOR]
Published: 2022
Full Text: View/download PDF

17. Critique of “Planetary Normal Mode Computation: Parallel Algorithms, Performance, and Reproducibility” by SCC Team From National Tsing Hua University.

Author: Sun, Wei-Fang, Chen, Hung-Hsin, Lin, Shao-Fu, Lin, Yuan-Ching, Wu, Jing-Wei, Lin, En-Te, and Chou, Jerry
Subjects: *PLANETARY interiors, *SCHOOL contests, *SCALABILITY, *STUDENT activities, *PARALLEL algorithms, *COMPUTER architecture
Abstract: As a special activity of the Student Cluster Competition at the SC19 conference, we attempted to reproduce the scalability evaluations of a highly parallel polynomial filtering eigensolver for computing planetary interior normal modes. Our experiments were conducted on a Mars dataset using a small-scale 4-node cluster with Intel Skylake CPU architecture, while those of the original article were conducted on a Moon dataset using a large-scale 256-node supercomputer with Intel Skylake CPU and KNL architectures. This article shares our experiences and observations from our reproducibility activity and discusses our findings in three main sections: the weak scalability, the strong scalability, and the relationships between variables. The results of weak scalability and strong scalability were successfully reproduced. However, due to the differences in problem scale, input dataset, and system architecture, different behaviors regarding the polynomial degree were observed. [ABSTRACT FROM AUTHOR]
Published: 2021
Full Text: View/download PDF

18. Overlapping Communication With Computation in Parameter Server for Scalable DL Training.

Author: Wang, Shaoqi, Pi, Aidi, Zhou, Xiaobo, Wang, Jun, and Xu, Cheng-Zhong
Subjects: *DEEP learning, *SCALABILITY, *GREEDY algorithms, *STATISTICAL decision making, *COMPUTER architecture, *PROBLEM solving
Abstract: Scalability of distributed deep learning (DL) training with parameter server (PS) architecture is often communication constrained in large clusters. There are recent efforts that use a layer by layer strategy to overlap gradient communication with backward computation so as to reduce the impact of communication constraint on the scalability. However, the approaches could bring significant overhead in gradient communication. Meanwhile, they cannot be effectively applied to the overlap between parameter communication and forward computation. In this article, we propose and develop iPart, a novel approach that partitions communication and computation in various partition sizes to overlap gradient communication with backward computation and parameter communication with forward computation. iPart formulates the partitioning decision as an optimization problem and solves it based on a greedy algorithm to derive communication and computation partitions. We implement iPart in the open-source DL framework BigDL and perform evaluations with various DL workloads. Experimental results show that iPart improves the scalability of a cluster of 72 nodes by up to 94 percent over the default PS and 52 percent over the layer by layer strategy. [ABSTRACT FROM AUTHOR]
Published: 2021
Full Text: View/download PDF

19. Hardware Accelerator Integration Tradeoffs for High-Performance Computing: A Case Study of GEMM Acceleration in N-Body Methods.

Author: Asri, Mochamad, Malhotra, Dhairya, Wang, Jiajun, Biros, George, John, Lizy K., and Gerstlauer, Andreas
Subjects: *APPLICATION-specific integrated circuits, *FAST multipole method, *DYNAMIC random access memory, *HARDWARE, *HIGH performance computing
Abstract: In this article, we study performance and energy saving benefits of hardware acceleration under different hardware configurations and usage scenarios for a state-of-the-art Fast Multipole Method (FMM), which is a popular N-body method. We use a dedicated Application Specific Integrated Circuit (ASIC) to accelerate General Matrix-Matrix Multiply (GEMM) operations. FMM is widely used in applications and is a representative example of the workload for many HPC applications. We compare architectures that integrate the GEMM ASIC next to, in or near main memory with an on-chip coupling aimed at minimizing or avoiding repeated round-trip transfers through DRAM for communication between accelerator and CPU. We study tradeoffs using detailed and accurately calibrated x86 CPU, accelerator and DRAM simulations. Our results show that simply moving accelerators closer to the chip does not necessarily lead to performance/energy gains. We demonstrate that, while careful software blocking and on-chip placement optimizations can reduce DRAM accesses by 2X over a naive on-chip integration, these dramatic savings in DRAM traffic do not automatically translate into significant total energy or runtime savings. This is chiefly due to the application characteristics, the high idle power and effective hiding of memory latencies in modern systems. Only when more aggressive co-optimizations such as software pipelining and overlapping are applied can additional performance and energy savings of 37 and 35 percent, respectively, be unlocked over baseline acceleration. When similar optimizations (pipelining and overlapping) are applied with an off-chip integration, on-chip integration delivers up to 20 percent better performance and 17 percent less total energy consumption than off-chip integration. [ABSTRACT FROM AUTHOR]
Published: 2021
Full Text: View/download PDF

20. A Hybrid Fuzzy Convolutional Neural Network Based Mechanism for Photovoltaic Cell Defect Detection With Electroluminescence Images.

Author: Ge, Chunpeng, Liu, Zhe, Fang, Liming, Ling, Huading, Zhang, Aiping, and Yin, Changchun
Subjects: *FUZZY neural networks, *CONVOLUTIONAL neural networks, *BUILDING-integrated photovoltaic systems, *PHOTOVOLTAIC cells, *SOLAR cells, *MANUFACTURING processes, *FUZZY logic, *CAMERAS
Abstract: In the intelligent manufacturing process of solar photovoltaic (PV) cells, the automatic defect detection system using the Industrial Internet of Things (IIoT) smart cameras and sensors cooperated in IIoT has become a promising solution. Many works have been devoted to defect detection of PV cells in a data-driven way. However, because of the subjectivity and fuzziness of human annotation, the data contains a high quantity of noise and unpredictable uncertainties, which creates great difficulties in automatic defect detection. To address this problem, we propose a novel architecture named fuzzy convolution, which integrates fuzzy logic and convolution operations at microscopic level. Combining the proposed fuzzy convolution with the regular convolution, we build a network called Hybrid Fuzzy Convolutional Neural Network (HFCNN). Compared with convolutional neural networks (CNNs), HFCNN can address the uncertainties of PV cell data to improve the accuracy with fewer parameters, making it possible to apply our method in smart cameras. Experimental results on a public dataset show the superiority of our proposed method compared with CNNs. [ABSTRACT FROM AUTHOR]
Published: 2021
Full Text: View/download PDF

21. A Distributed Framework for EA-Based NAS.

Author: Ye, Qing, Sun, Yanan, Zhang, Jixin, and Lv, Jiancheng
Subjects: *EVOLUTIONARY algorithms, *INDIVIDUAL needs, *TRAINING needs, *BOTTLENECKS (Manufacturing), *COMPUTER architecture, *SOCIAL networks
Abstract: Evolutionary Algorithms (EA) are widely applied in Neural Architecture Search (NAS) and have achieved appealing results. Different EA-based NAS algorithms may utilize different encoding schemes for network representation, while they have the same workflow. Specifically, the first step is the initialization of the population with different encoding schemes, and the second step is the evaluation of the individuals by the fitness function. Then, the EA-based NAS algorithm executes evolution operations, e.g., selection, mutation, and crossover, to eliminate weak individuals and generate more competitive ones. Lastly, evolution continues until the max generation and the best neural architectures will be chosen. Because each individual needs complete training and validation on the target dataset, the EA-based NAS always consumes significant computation and time inevitably, which results in the bottleneck of this approach. To ameliorate this issue, this article proposes a distributed framework to boost the computation of the EA-based NAS algorithm. This framework is a server/worker model where the server distributes individuals requested by the computing nodes and collects the validated individuals and hosts the evolution operations. Meanwhile, the most time-consuming phase (i.e., individual evaluation) of the EA-based NAS is allocated to the computing nodes, which send requests asynchronously to the server and evaluate the fitness values of the individuals. Additionally, a new packet structure of the message delivered in the cluster is designed to encapsulate various network representations and support different EA-based NAS algorithms. We design an EA-based NAS algorithm as a case to investigate the efficiency of the proposed framework. 
Extensive experiments are performed on an illustrative cluster with different scales, and the results reveal that the framework can achieve a nearly linear reduction of the search time with the increase of the computational nodes. Furthermore, the length of the exchanged messages among the cluster is tiny, which benefits the framework expansion. [ABSTRACT FROM AUTHOR]
Published: 2021
Full Text: View/download PDF

22. iMLBench: A Machine Learning Benchmark Suite for CPU-GPU Integrated Architectures.

Author: Zhang, Chenyang, Zhang, Feng, Guo, Xiaoguang, He, Bingsheng, Zhang, Xiao, and Du, Xiaoyong
Subjects: *MACHINE learning, *GRAPHICS processing units, *CENTRAL processing units, *ENERGY consumption, *MACHINE performance
Abstract: Utilizing heterogeneous accelerators, especially GPUs, to accelerate machine learning tasks has proven to be a great success in recent years. GPUs bring huge performance improvements to machine learning and greatly promote the widespread adoption of machine learning. However, the discrete CPU-GPU architecture design with high PCIe transmission overhead decreases the GPU computing benefits in machine learning training tasks. To overcome such limitations, hardware vendors release CPU-GPU integrated architectures with shared unified memory. In this article, we design a benchmark suite for machine learning training on CPU-GPU integrated architectures, called iMLBench, covering a wide range of machine learning applications and kernels. We mainly explore two features on integrated architectures: 1) zero-copy, which means that the PCIe overhead has been eliminated for machine learning tasks and 2) co-running, which means that the CPU and the GPU co-run together to process a single machine learning task. Our experimental results on iMLBench show that the integrated architecture brings an average 7.1× performance improvement over the original implementations. Specifically, the zero-copy design brings 4.65× performance improvement, and co-running brings 1.78× improvement. Moreover, integrated architectures exhibit promising results from both performance-per-dollar and energy perspectives, achieving a 6.50× performance-price ratio and 4.06× energy efficiency over discrete GPUs. The benchmark is open-sourced at https://github.com/ChenyangZhang-cs/iMLBench. [ABSTRACT FROM AUTHOR]
Published: 2021
Full Text: View/download PDF

23. Accelerating Federated Learning Over Reliability-Agnostic Clients in Mobile Edge Computing Systems.

Author: Wu, Wentai, He, Ligang, Lin, Weiwei, and Mao, Rui
Subjects: *MOBILE computing, *EDGE computing, *COMPUTER systems, *ELECTRONIC data processing, *COMPUTER architecture
Abstract: Mobile Edge Computing (MEC), which incorporates the Cloud, edge nodes, and end devices, has shown great potential in bringing data processing closer to the data sources. Meanwhile, Federated learning (FL) has emerged as a promising privacy-preserving approach to facilitating AI applications. However, it remains a big challenge to optimize the efficiency and effectiveness of FL when it is integrated with the MEC architecture. Moreover, the unreliable nature (e.g., stragglers and intermittent drop-out) of end devices significantly slows down the FL process and affects the global model’s quality in such circumstances. In this article, a multi-layer federated learning protocol called HybridFL is designed for the MEC architecture. HybridFL adopts two levels (the edge level and the cloud level) of model aggregation enacting different aggregation strategies. Moreover, in order to mitigate stragglers and end device drop-out, we introduce regional slack factors into the stage of client selection performed at the edge nodes using a probabilistic approach without identifying or probing the state of end devices (whose reliability is agnostic). We demonstrate the effectiveness of our method in modulating the proportion of clients selected and present the convergence analysis for our protocol. We have conducted extensive experiments with machine learning tasks in different scales of MEC system. The results show that HybridFL improves the FL training process significantly in terms of shortening the federated round length, speeding up the global model’s convergence (by up to 12×) and reducing end device energy consumption (by up to 58 percent). [ABSTRACT FROM AUTHOR]
Published: 2021
Full Text: View/download PDF

24. Reproducibility: Performance Evaluation of MemXCT on Azure CycleCloud Platform.

Author: Liu, Yuchen, Meng, Yixuan, Xu, Kaiyuan, Xu, Zijun, Wu, Tianyuan, Yang, Yiwei, and Yin, Shu
Subjects: *IMAGE reconstruction algorithms, *GRAPHICS processing units, *COMPUTER architecture
Abstract: Memory-Centric X-ray Computational Tomography (CT) is an iterative reconstruction technique that trades compute simplifications with higher memory accesses. MemXCT implements a sparse matrix-vector multiplication (SpMV) with multi-stage buffering and two-level pseudo-Hilbert ordering for optimization. Motivated by the need to validate conclusions from previous work, we reproduce the numerical results, the algorithm’s performance, and the scaling behavior of the algorithms as the number of MPI processes increases on Azure. Digital artifacts from these experiments are available at: 10.5281/zenodo.5598108 [ABSTRACT FROM AUTHOR]
Published: 2022
Full Text: View/download PDF

25. Middleware to Manage Fault Tolerance Using Semi-Coordinated Checkpoints.

Author: Wong, Alvaro, Heymann, Elisa, Rexachs, Dolores, and Luque, Emilio
Subjects: *FAULT-tolerant computing, *COMPUTER architecture, *MIDDLEWARE
Abstract: Compute node failures are becoming a normal event for many long-running and scalable MPI applications. Keeping within the MPI standards and applying some of the methods developed so far in terms of fault tolerance, we developed a methodology that allows applications to tolerate failures through the creation of semi-coordinated checkpoints within the RADIC architecture. To do this, we developed the ULSC$^{2}$-RADIC middleware that divides the application into independent MPI worlds where each MPI world would correspond to a compute node and make use of the DMTCP checkpoint library in a semi-coordinated environment. We performed experiments using scientific applications and the NAS Parallel Benchmarks to assess the overhead and also the functionality in case of a node failure. We evaluated the computational cost of the semi-coordinated checkpoints compared with the coordinated checkpoints. [ABSTRACT FROM AUTHOR]
Published: 2021
Full Text: View/download PDF

26. Towards Higher Performance and Robust Compilation for CGRA Modulo Scheduling.

Author: Zhao, Zhongyuan, Sheng, Weiguang, Wang, Qin, Yin, Wenzhi, Ye, Pengfei, Li, Jinchao, and Mao, Zhigang
Subjects: *CONSTRAINT algorithms, *TIME management, *ALGORITHMS, *ENERGY consumption, *SCHEDULING, *SOLAR stills
Abstract: Coarse-Grained Reconfigurable Architectures (CGRAs) are a promising solution for accelerating computation-intensive tasks due to their good trade-off between energy efficiency and flexibility. One of the challenging research topics is how to effectively deploy loops onto CGRAs within acceptable compilation time. Modulo scheduling (MS) has been shown to be efficient at deploying loops onto CGRAs. Existing CGRA MS algorithms still suffer from the challenge of mapping loops with higher performance under acceptable compilation time, especially mapping large and irregular loops onto CGRAs with limited computational and routing resources. This is mainly due to the underutilization of the available buffer resources on CGRAs, unawareness of critical mapping constraints, and time-consuming methods of solving temporal and spatial mapping. This article focuses on improving the performance and compilation robustness of the modulo scheduling mapping algorithm for CGRAs. We decompose the CGRA MS problem into the temporal and spatial mapping problems and reorganize the processes inside these two problems. For the temporal mapping problem, we provide a comprehensive and systematic mapping flow that includes a powerful buffer allocation algorithm and efficient interconnection and computational constraint solving algorithms. For the spatial mapping problem, we develop a fast and stable spatial mapping algorithm with backtracking and reordering mechanisms. Our MS mapping algorithm is able to map loops onto CGRAs with higher performance and faster compilation time. Experimental results show that, given the same compilation time budget, our mapping algorithm achieves a higher compilation success rate. Among the successfully compiled loops, our approach improves performance by 5.4 to 14.2 percent and takes 24× to 1099× less compilation time on average compared with state-of-the-art CGRA mapping algorithms. [ABSTRACT FROM AUTHOR]
Published: 2020
Full Text: View/download PDF

27. CURE: A High-Performance, Low-Power, and Reliable Network-on-Chip Design Using Reinforcement Learning.

Author: Wang, Ke and Louri, Ahmed
Subjects: *NETWORK routers, *ERROR correction (Information theory), *REINFORCEMENT learning, *CURING, *DEEP learning, *ENERGY consumption, *COMPUTER architecture
Abstract: We propose CURE, a deep reinforcement learning (DRL)-based NoC design framework that simultaneously reduces network latency, improves energy-efficiency, and tolerates transient errors and permanent faults. CURE has several architectural innovations and a DRL-based hardware controller to manage design complexity and optimize trade-offs. First, in CURE, we propose reversible multi-function adaptive channels (RMCs) to reduce NoC power consumption and network latency. Second, we implement a new fault-secure adaptive error correction hardware in each router to enhance reliability for both transient errors and permanent faults. Third, we propose a router power-gating and bypass design that powers off NoC components to reduce power and extend chip lifespan. Further, for the complex dynamic interactions of these techniques, we propose using DRL to train a proactive control policy to provide improved fault-tolerance, reduced power consumption, and improved performance. Simulation using the PARSEC benchmark shows that CURE reduces end-to-end packet latency by 39 percent, improves energy efficiency by 92 percent, and lowers static and dynamic power consumption by 24 and 38 percent, respectively, over conventional solutions. Using mean-time-to-failure, we show that CURE is 7.7× more reliable than the conventional NoC design. [ABSTRACT FROM AUTHOR]
Published: 2020
Full Text: View/download PDF

28. Approximate NoC and Memory Controller Architectures for GPGPU Accelerators.

Author: Raparti, Venkata Yaswanth and Pasricha, Sudeep
Subjects: *DYNAMIC random access memory, *ARCHITECTURE, *MEMORY, *RANDOM access memory, *GRAPHICS processing units
Abstract: High interconnect bandwidth is crucial for achieving better performance in many-core GPGPU architectures that execute highly data parallel applications. The parallel warps of threads running on shader cores generate a high volume of read requests to the main memory due to the limited size of data caches at the shader cores. This leads to scenarios with rapid arrival of an even larger volume of reply data from the DRAM, which creates a bottleneck at memory controllers (MCs) that send reply packets back to the requesting cores over the network-on-chip (NoC). Coping with such high volumes of data requires intelligent memory scheduling and innovative NoC architectures. To mitigate memory bottlenecks in GPGPUs, we first propose a novel approximate memory controller architecture (AMC) that reduces the DRAM latency by opportunistically exploiting row buffer locality and bank level parallelism in memory request scheduling, and leverages approximability of the reply data from DRAM, to reduce the number of reply packets injected into the NoC. To further realize high throughput and low energy communication in GPGPUs, we propose a low power, approximate NoC architecture (Dapper) that increases the utilization of the available network bandwidth by using single cycle overlay circuits for the reply traffic between MCs and shader cores. Experimental results show that Dapper and AMC together increase NoC throughput by up to 21 percent; and reduce NoC latency by up to 45.5 percent and energy consumed by the NoC and MC by up to 38.3 percent, with minimal impact on output accuracy, compared to state-of-the-art approximate NoC/MC architectures. [ABSTRACT FROM AUTHOR]
Published: 2020
Full Text: View/download PDF

29. Thread-Level Locking for SIMT Architectures.

Author: Gao, Lan, Xu, Yunlong, Wang, Rui, Luan, Zhongzhi, Yu, Zhibin, and Qian, Depei
Subjects: *GRAPHICS processing units, *ARCHITECTURE
Abstract: As more emerging applications are moving to GPUs, thread-level synchronization has become a requirement. However, GPUs only provide warp-level and thread-block-level rather than thread-level synchronization. Moreover, it is highly possible to cause live-locks by using CPU synchronization mechanisms to implement thread-level synchronization for GPUs. In this article, we first propose a software-based thread-level synchronization mechanism called lock stealing for GPUs to avoid live-locks. We then describe how to implement our lock stealing algorithm in mutual exclusion locks and readers-writer locks with high performance. Finally, by putting it all together, we develop a thread-level locking library (TLLL) for commercial GPUs. To evaluate TLLL and show its general applicability, we use it to implement six widely used programs. We compare TLLL against the state-of-the-art ad-hoc GPU synchronization, GPU software transactional memory (STM), and CPU hardware transactional memory (HTM), respectively. The results show that, compared with the ad-hoc GPU synchronization for Delaunay mesh refinement (DMR), TLLL improves the performance by 22 percent on average on a GTX970 GPU, and shows up to 11 percent of performance improvement on a Volta V100 GPU. Moreover, it significantly reduces the required memory size. Such low memory consumption enables DMR to successfully run on the GTX970 GPU with the 10-million mesh size, and the V100 GPU with the 40-million mesh size, with which the ad-hoc synchronization cannot run successfully. In addition, TLLL outperforms the GPU STM by 65 percent, and the CPU HTM (running on a Xeon E5-2620 v4 CPU with 16 hardware threads) by 43 percent on average. [ABSTRACT FROM AUTHOR]
Published: 2020
Full Text: View/download PDF

30. gMig: Efficient vGPU Live Migration with Overlapped Software-Based Dirty Page Verification.

Author: Lu, Qiumin, Zheng, Xiao, Ma, Jiacheng, Dong, Yaozu, Qi, Zhengwei, Yao, Jianguo, He, Bingsheng, and Guan, Haibing
Subjects: *GRAPHICS processing units, *COMPUTER architecture
Abstract: This paper introduces gMig, an open-source and practical vGPU live migration solution for full virtualization. Taking advantage of the dirty pattern of GPU workloads, gMig presents the One-Shot Pre-Copy mechanism combined with the hashing based Software Dirty Page technique to achieve efficient vGPU live migration. Particularly, we propose three core techniques for gMig: 1) Dynamic Graphics Address Remapping, which parses and manipulates GPU commands to adjust the address mapping and adapt to a different environment after migration, 2) Software Dirty Page, which utilizes a hashing based approach with sampling pre-filtering to detect page modification, overcomes the commodity GPU's hardware limitation, and speeds up the migration by only sending the dirtied pages, 3) Overlapped Migration Process, which significantly compresses the hanging overhead by overlapping the dirty page verification and transmission. Our evaluation shows that gMig achieves GPU live migration with an average downtime of 302 ms on Windows and 119 ms on Linux. With the help of Software Dirty Page, the number of GPU pages transferred during the downtime is effectively reduced by up to 80.0 percent. The design of the sampling filter and overlapped processing brings further improvements of 30.0 and 10.0 percent in page processing. [ABSTRACT FROM AUTHOR]
Published: 2020
Full Text: View/download PDF

31. cuTensor-Tubal: Efficient Primitives for Tubal-Rank Tensor Learning Operations on GPUs.

Author: Zhang, Tao, Liu, Xiao-Yang, Wang, Xiaodong, and Walid, Anwar
Subjects: *GRAPHICS processing units, *VIDEO compression, *PARALLEL processing, *DATA structures, *INFORMATION storage & retrieval systems, *LINEAR algebra
Abstract: Tensors are the cornerstone data structures in high-performance computing, big data analysis and machine learning. However, tensor computations are compute-intensive and the running time increases rapidly with the tensor size. Therefore, designing high-performance primitives on parallel architectures such as GPUs is critical for the efficiency of ever growing data processing demands. Existing GPU basic linear algebra subroutines (BLAS) libraries (e.g., NVIDIA cuBLAS) do not provide tensor primitives. Researchers have to implement and optimize their own tensor algorithms in a case-by-case manner, which is inefficient and error-prone. In this paper, we develop the cuTensor-tubal library of seven key primitives for the tubal-rank tensor model on GPUs: t-FFT, inverse t-FFT, t-product, t-SVD, t-QR, t-inverse, and t-normalization. cuTensor-tubal adopts a frequency domain computation scheme to expose the separability in the frequency domain, then maps the tube-wise and slice-wise parallelisms onto the single instruction multiple thread (SIMT) GPU architecture. To achieve good performance, we optimize the data transfer, memory accesses, and design the batched and streamed parallelization schemes for tensor operations with data-independent and data-dependent computation patterns, respectively. In the evaluations of t-product, t-SVD, t-QR, t-inverse and t-normalization, cuTensor-tubal achieves maximum 16.91×, 27.03×, 38.97×, 22.36×, and 15.43× speedups respectively over the CPU implementations running on dual 10-core Xeon CPUs. Two applications, namely, t-SVD-based video compression and low-tubal-rank tensor completion, are tested using our library and achieve maximum 9.80× and 269.26× speedups over multi-core CPU implementations. [ABSTRACT FROM AUTHOR]
Published: 2020
Full Text: View/download PDF

32. Exploring New Opportunities to Defeat Low-Rate DDoS Attack in Container-Based Cloud Environment.

Author: Li, Zhi, Jin, Hai, Zou, Deqing, and Yuan, Bin
Subjects: *DENIAL of service attacks, *CLOUDS & the environment, *QUEUING theory, *QUALITY of service, *COMPUTER crimes
Abstract: DDoS attacks are rampant in cloud environments and continually evolve into more sophisticated and intelligent modalities, such as low-rate DDoS attacks. Meanwhile, the cloud environment is also constantly developing. Container technology and microservice architecture are now widely applied in cloud environments and together compose the container-based cloud environment. Compared with traditional cloud environments, the container-based cloud environment is more lightweight in virtualization and more flexible in scaling services. Naturally, a question that arises is whether these new features of the container-based cloud environment will bring new possibilities to defeat DDoS attacks. In this paper, we establish a mathematical model based on queueing theory to analyze the strengths and weaknesses of the container-based cloud environment in defeating low-rate DDoS attacks. Based on this, we propose a dynamic DDoS mitigation strategy, which can dynamically regulate the number of container instances serving different users and coordinate the resource allocation for these instances to maximize the quality of service. Extensive simulations and testbed-based experiments demonstrate that our strategy can make sufficient use of the limited system resources to maintain an acceptable quality of service and defeat DDoS attacks effectively in the container-based cloud environment. [ABSTRACT FROM AUTHOR]
Published: 2020
Full Text: View/download PDF

33. FeatherCNN: Fast Inference Computation with TensorGEMM on ARM Architectures.

Author: Lan, Haidong, Meng, Jintao, Hundt, Christian, Schmidt, Bertil, Deng, Minwen, Wang, Xiaoning, Liu, Weiguo, Qiao, Yu, and Feng, Shengzhong
Subjects: *ARTIFICIAL neural networks, *ARCHITECTURE, *MATRIX multiplications, *ARM, *DEEP learning
Abstract: Deep Learning is ubiquitous in a wide field of applications ranging from research to industry. In comparison to time-consuming iterative training of convolutional neural networks (CNNs), inference is a relatively lightweight operation making it amenable to execution on mobile devices. Nevertheless, lower latency and higher computation efficiency are crucial to allow for complex models and prolonged battery life. Addressing the aforementioned challenges, we propose FeatherCNN, a fast inference library for ARM CPUs, targeting the performance ceiling of mobile devices. FeatherCNN employs three key techniques: 1) A highly efficient TensorGEMM (generalized matrix multiplication) routine is applied to accelerate Winograd convolution on ARM CPUs. 2) General layer optimization based on custom high performance kernels improves both the computational efficiency and locality of memory access patterns for non-Winograd layers. 3) The framework design emphasizes joint layer-wise optimization using layer fusion to remove redundant calculations and memory movements. Performance evaluation reveals that FeatherCNN significantly outperforms state-of-the-art libraries. A forward propagation pass of VGG-16 on a 64-core ARM server is 48, 14, and 12 times faster than Caffe using OpenBLAS, Caffe2 using Eigen, and NNPACK, respectively. In addition, FeatherCNN is 3.19 times faster than the recently released TensorFlow Lite library on an iPhone 7 Plus. In terms of GEMM performance, FeatherCNN achieves 14.8 and 39.0 percent higher performance than Apple's Accelerate framework on an iPhone 7 Plus and Eigen on a Samsung Galaxy S8, respectively. The source code of FeatherCNN library is publicly available at https://github.com/tencent/feathercnn. [ABSTRACT FROM AUTHOR]
Published: 2020
Full Text: View/download PDF

34. Optimizing Finite Volume Method Solvers on Nvidia GPUs.

Author: Xu, Jingheng, Yang, Guangwen, Fu, Haohuan, Luk, Wayne, Gan, Lin, Shi, Wen, Xue, Wei, Yang, Chao, Jiang, Yong, and He, Conghui
Subjects: *FINITE volume method, *CACHE memory, *GRAPHICS processing units
Abstract: As scientific applications are increasingly ported to GPUs to benefit from both the powerful computing capacity and high throughput, accelerating explicit solvers for GPU-based finite volume methods is gaining more and more attention. In this paper, based on the detailed analysis of the FVM algorithm, we present a set of novel optimization methods, including the explicit data cache mechanism, optimal global memory loading strategy, as well as the inner-thread rescheduling method, which derives a suitable mapping from the solver algorithm to the underlying GPU hardware architecture, so as to remarkably improve the solving performance of structured mesh based FVM. We demonstrate the impact of our tuning techniques on two widely-used atmospheric dynamic kernels (3-D Euler and 2-D SWE) on five kinds of mainstream GPU platforms, and make a detailed analysis of the different tuning methodologies so as to demonstrate how to select the proper tuning strategy to different applications on various GPU platforms. Specifically, 93.9x speedup is achieved for the 3D Euler solver on Nvidia V100 over one 12-core Intel E5-2697 (v2) CPU, which is a 77 percent improvement compared with the original speedup without adopting the tuning techniques presented in this work. [ABSTRACT FROM AUTHOR]
Published: 2019
Full Text: View/download PDF

35. ISEE: An Intelligent Scene Exploration and Evaluation Platform for Large-Scale Visual Surveillance.

Author: Li, Da, Zhang, Zhang, Yu, Kai, Huang, Kaiqi, and Tan, Tieniu
Subjects: *HETEROGENEOUS computing, *VIDEO surveillance, *ANALYTIC functions, *DISTRIBUTED computing, *MIDDLEWARE, *VIDEO processing, *BIG data
Abstract: Intelligent video surveillance (IVS) is always an interesting research topic to utilize visual analysis algorithms for exploring richly structured information from big surveillance data. However, existing IVS systems either struggle to utilize computing resources adequately to improve the efficiency of large-scale video analysis, or present a customized system for specific video analytic functions. A comprehensive computing architecture that enhances the efficiency, extensibility, and flexibility of IVS systems is still lacking. Moreover, it is also an open problem to study the effect of the combinations of multiple vision modules on the final performance of end applications of IVS systems. Motivated by these challenges, we develop an Intelligent Scene Exploration and Evaluation (ISEE) platform based on a heterogeneous CPU-GPU cluster and some distributed computing tools, where Spark Streaming serves as the computing engine for efficient large-scale video processing and Kafka is adopted as a middle-ware message center to decouple different analysis modules flexibly. To validate the efficiency of the ISEE and study the evaluation problem on composable systems, we instantiate the ISEE for an end application on person retrieval with three visual analysis modules, including pedestrian detection with tracking, attribute recognition and re-identification. Extensive experiments are performed on a large-scale surveillance video dataset involving 25 camera scenes, totaling 587 hours of synchronous 720p video, where a two-stage question-answering procedure is proposed to measure the performance of execution pipelines composed of multiple visual analysis algorithms based on millions of attribute-based and relationship-based queries. The case study of system-level evaluations may inspire researchers to improve visual analysis algorithms and combining strategies from the view of a scalable and composable system in the future. [ABSTRACT FROM AUTHOR]
Published: 2019
Full Text: View/download PDF

36. SEALDB: An Efficient LSM-tree Based KV Store on SMR Drives with Sets and Dynamic Bands.

Author: Yao, Ting, Tan, Zhihu, Wan, Jiguang, Huang, Ping, Zhang, Yiwen, Xie, Changsheng, and He, Xubin
Subjects: *REFUSE collection, *RETAIL stores, *COMPUTER architecture, *SERVER farms (Computer network management)
Abstract: Key-value (KV) stores play an increasingly critical role in supporting diverse large-scale applications in modern data centers hosting terabytes of KV items which even might reside on a single server due to virtualization purposes. The combination of the ever-growing volume of KV items and storage/application consolidation is driving a trend of high storage density for KV stores. Shingled Magnetic Recording (SMR) represents a promising technology for increasing disk capacity, which however comes with the increased complexity of handling random writes. To take the best advantages of SMR drives, applications are expected to work in an SMR-friendly way. In this work, we present SEALDB, a Log-Structured Merge tree (LSM-tree) based key-value store that is specifically optimized for SMR drives via avoiding random writes and the corresponding write amplification on SMR drives. First, for LSM-trees, SEALDB collects and groups participating data of each compaction into sets. Using a set as the basic unit for compactions, SEALDB improves compaction efficiency by reducing random I/Os. Second, SEALDB creates variable sized bands on original HM-SMR drives, named dynamic bands. Dynamic bands store sets in an SMR-friendly way to eliminate the auxiliary write amplification from SMR drives. Third, SEALDB employs two light-weight garbage collection (GC) policies to further improve the space efficiency. We demonstrate the advantages of SEALDB via extensive experiments with various workloads. Overall, SEALDB delivers impressive performance compared with LevelDB, e.g., 3.42×/2.65× faster for random writes (without or with GCs), and 3.96× faster for sequential reads. [ABSTRACT FROM AUTHOR]
Published: 2019
Full Text: View/download PDF

37. Coordinated DMA: Improving the DRAM Access Efficiency for Matrix Multiplication.

Author: Ma, Sheng, Liu, Zhong, Chen, Shenggang, Huang, Libo, Guo, Yang, Wang, Zhiying, and Zhang, Meidi
Subjects: *MATRIX multiplications, *DYNAMIC random access memory, *GRAPHICS processing units, *RANDOM access memory, *ENERGY consumption
Abstract: High performance implementation of matrix multiplication is essential for scientific computing. Memory access is likely to be the bottleneck of matrix multiplication. The widely used GotoBLAS GEMM implementation divides the integral matrix into several partitions to be assigned to different cores for parallelization. Traditionally, each core deploys a DMA transfer to access its own partition in the DRAM memory. However, deploying an independent DMA transfer for each core cannot efficiently exploit the inter-core locality. Also, multiple concurrent DMA transfers interfere with each other, further reducing the DRAM access efficiency. We observe that the same row of neighboring partitions is in the same DRAM page, which means that there is significant locality inherent in the address layout. We propose the coordinated DMA to efficiently exploit the locality. It invokes one transfer to serve all cores and moves data in a row-major manner to improve the DRAM access efficiency. Compared with a baseline design, the coordinated DMA improves the bandwidth by 84.8 percent and reduces DRAM energy consumption by 43.1 percent for micro-benchmarks. It achieves higher performance for the GEMM and Linpack benchmark. With much lower hardware cost, the coordinated DMA significantly outperforms an out-of-order memory controller. [ABSTRACT FROM AUTHOR]
Published: 2019
Full Text: View/download PDF

38. JSensor: A Parallel Simulator for Huge Wireless Sensor Networks Applications.

Author: Silva, Matheus Leonidas, Junior, Lincoln N. Santos, Aquino, Andre L. L., and Lima, Joubert de Castro
Subjects: *WIRELESS sensor networks, *APPLICATION program interfaces, *COMPUTER architecture
Abstract: This paper presents JSensor, a parallel general purpose simulator which enables huge simulations of Wireless Sensor Networks applications. Its main advantages are: i) to have a simple API with few classes to be extended, allowing easy prototyping and validation of WSNs applications and protocols; ii) to enable transparent and reproducible simulations, regardless of the number of threads of the parallel kernel; and iii) to scale over multi-core computer architectures, allowing simulations of more realistic applications. JSensor is a parallel event-driven simulator which executes according to event timers. The simulation elements, nodes, application, and events, can send messages, process task or move around the simulated environment. The mentioned environment follows a grid structure of extensible spatial cells. The results demonstrated that JSensor scales well, precisely it achieved a speedup of 7.45 with 16 threads in a machine with 16 cores (eight physical and eight virtual cores), and comparative evaluations versus OMNeT++ showed that the presented solution could be 43 times faster. [ABSTRACT FROM AUTHOR]
Published: 2019
Full Text: View/download PDF

39. Parallelizing Word2Vec in Shared and Distributed Memory.

Author: Ji, Shihao, Satish, Nadathur, Li, Sheng, and Dubey, Pradeep K.
Subjects: *MODERN architecture, *DATA structures, *SOURCE code, *COMPUTER workstation clusters, *REPRODUCIBLE research, *COMPUTER architecture, *MULTICORE processors
Abstract: Word2vec is a widely used algorithm for extracting low-dimensional vector representations of words. State-of-the-art algorithms including those by Mikolov et al. [1], [2] have been parallelized for multi-core CPU architectures, but are based on vector-vector operations with "Hogwild" updates that are memory-bandwidth intensive and do not efficiently use computational resources. In this paper, we propose "HogBatch" by improving reuse of various data structures in the algorithm through the use of minibatching and negative sample sharing, hence allowing us to express the problem using matrix multiply operations. We also explore different techniques to distribute word2vec computation across nodes in a computer cluster, and demonstrate good strong scalability up to 32 nodes. The new algorithm is particularly suitable for modern multi-core/many-core architectures, especially Intel's latest Knights Landing processors, and allows us to scale up the computation near linearly across cores and nodes, and process hundreds of millions of words per second, which is the fastest word2vec implementation to the best of our knowledge. We released the source code for reproducible research and general usage. [ABSTRACT FROM AUTHOR]
Published: 2019
Full Text: View/download PDF

40. A Performance Model for GPU Architectures that Considers On-Chip Resources: Application to Medical Image Registration.

Author: Wu, Junhao, Yang, Xuan, Zhang, Zhengrui, Chen, Guoliang, and Mao, Rui
Subjects: *GRAPHICS processing units, *IMAGE registration, *DIAGNOSTIC imaging, *ARCHITECTURE
Abstract: Graphics processing units (GPUs) have become extremely important devices for accelerating computing performance in many applications. However, there have been few accurate models to estimate the performance of such applications running on modern GPUs. In this paper, we propose a performance model to estimate the execution times for massively parallel programs running on NVIDIA GPUs, one that takes on-chip resources and cost of data transfer between CPU and GPU into consideration. Four different GPUs with different architectures were used to evaluate our model. We demonstrated the effectiveness of the proposed model by applying it to various tasks in medical image registration. Experiments have demonstrated that by capturing on-chip GPU resources and data transfer time with our model, we were able to obtain a more accurate prediction of the actual running time, compared to the traditional model. Moreover, by using the optimal value of the block size parameter, estimated by our model, to accelerate the landmark tracking task on GPU devices, speedups of approximately 80×, 100×, 200× and 800×, on the C2050, K20c, M5000 and P100 can be achieved, making it possible to track massive numbers of landmarks and thereby improving the registration accuracy. [ABSTRACT FROM AUTHOR]
Published: 2019
Full Text: View/download PDF

41. An Efficient Hybrid I/O Caching Architecture Using Heterogeneous SSDs.

Author: Salkhordeh, Reza, Hadizadeh, Mostafa, and Asadi, Hossein
Subjects: *HARD disks, *COMPUTER systems, *WORKLOAD of computer networks, *ENERGY consumption, *COMPUTER storage devices
Abstract: Storage subsystem is considered as the performance bottleneck of computer systems in data-intensive applications. Solid-State Drives (SSDs) are emerging storage devices which unlike Hard Disk Drives (HDDs), do not have mechanical parts and therefore, have superior performance compared to HDDs. Due to the high cost of SSDs, entirely replacing HDDs with SSDs is not economically justified. Additionally, SSDs can endure a limited number of writes before failing. To mitigate the shortcomings of SSDs while taking advantage of their high performance, SSD caching is practiced in both academia and industry. Previously proposed caching architectures have only focused on either performance or endurance and neglected to address both parameters in suggested architectures. Moreover, the cost, reliability, and power consumption of such architectures is not evaluated. This paper proposes a hybrid I/O caching architecture that while offers higher performance than previous studies, it also improves power consumption with a similar budget. The proposed architecture uses DRAM, Read-Optimized SSD (RO-SSD), and Write-Optimized SSD (WO-SSD) in a three-level cache hierarchy and tries to efficiently redirect read requests to either DRAM or RO-SSD while sending writes to WO-SSD. To provide high reliability, dirty pages are written to at least two devices which removes any single point of failure. The power consumption is also managed by reducing the number of accesses issued to SSDs. The proposed architecture reconfigures itself between performance- and endurance-optimized policies based on the workload characteristics to maintain an effective tradeoff between performance and endurance. We have implemented the proposed architecture on a server equipped with industrial SSDs and HDDs. The experimental results show that as compared to state-of-the-art studies, the proposed architecture improves performance and power consumption by an average of 8 and 28 percent, respectively, and reduces the cost by 5 percent while increasing the endurance cost by 4.7 percent and negligible reliability penalty. [ABSTRACT FROM AUTHOR]
Published: 2019
Full Text: View/download PDF

42. A Bi-layered Parallel Training Architecture for Large-Scale Convolutional Neural Networks.

Author: Chen, Jianguo, Li, Kenli, Bilal, Kashif, Zhou, Xu, Li, Keqin, and Yu, Philip S.
Subjects: *NEURAL circuitry, *PARALLEL computers, *DATA transmission systems, *SYNCHRONIZATION, *DEEP learning
Abstract: Benefitting from large-scale training datasets and the complex training network, Convolutional Neural Networks (CNNs) are widely applied in various fields with high accuracy. However, the training process of CNNs is very time-consuming, where large amounts of training samples and iterative operations are required to obtain high-quality weight parameters. In this paper, we focus on the time-consuming training process of large-scale CNNs and propose a Bi-layered Parallel Training (BPT-CNN) architecture in distributed computing environments. BPT-CNN consists of two main components: (a) an outer-layer parallel training for multiple CNN subnetworks on separate data subsets, and (b) an inner-layer parallel training for each subnetwork. In the outer-layer parallelism, we address critical issues of distributed and parallel computing, including data communication, synchronization, and workload balance. A heterogeneous-aware Incremental Data Partitioning and Allocation (IDPA) strategy is proposed, where large-scale training datasets are partitioned and allocated to the computing nodes in batches according to their computing power. To minimize the synchronization waiting during the global weight update process, an Asynchronous Global Weight Update (AGWU) strategy is proposed. In the inner-layer parallelism, we further accelerate the training process for each CNN subnetwork on each computer, where computation steps of convolutional layer and the local weight training are parallelized based on task-parallelism. We introduce task decomposition and scheduling strategies with the objectives of thread-level load balancing and minimum waiting time for critical paths. Extensive experimental results indicate that the proposed BPT-CNN effectively improves the training performance of CNNs while maintaining the accuracy. [ABSTRACT FROM AUTHOR]
Published: 2019
Full Text: View/download PDF

43. Performance-Aware Model for Sparse Matrix-Matrix Multiplication on the Sunway TaihuLight Supercomputer.

Author: Chen, Yuedan, Li, Kenli, Yang, Wangdong, Xiao, Guoqing, Xie, Xianghui, and Li, Tao
Subjects: *SPARSE matrices, *COMPUTER architecture, *PARALLEL processing, *SUPERCOMPUTERS, *KERNEL (Mathematics)
Abstract: General sparse matrix-sparse matrix multiplication (SpGEMM) is one of the fundamental linear operations in a wide variety of scientific applications. To implement efficient SpGEMM for many large-scale applications, this paper proposes scalable and optimized SpGEMM kernels based on COO, CSR, ELL, and CSC formats on the Sunway TaihuLight supercomputer. First, a multi-level parallelism design for SpGEMM is proposed to exploit the parallelism of over 10 million cores and better control memory based on the special Sunway architecture. Optimization strategies, such as load balance, coalesced DMA transmission, data reuse, vectorized computation, and parallel pipeline processing, are applied to further optimize performance of SpGEMM kernels. Second, we thoroughly analyze the performance of the proposed kernels. Third, a performance-aware model for SpGEMM is proposed to select the most appropriate compressed storage formats for the sparse matrices that can achieve the optimal performance of SpGEMM on the Sunway. The experimental results show the SpGEMM kernels have good scalability and meet the challenge of the high-speed computing of large-scale data sets on the Sunway. In addition, the performance-aware model for SpGEMM achieves an absolute value of relative error rate of 8.31 percent on average when the kernels are executed in one single process and achieves 8.59 percent on average when the kernels are executed in multiple processes. It is proved that the proposed performance-aware model can perform at high accuracy and satisfies the precision of selecting the best formats for SpGEMM on the Sunway TaihuLight supercomputer. [ABSTRACT FROM AUTHOR]
Published: 2019
Full Text: View/download PDF

44. Portable Programming with RAPID.

Author: Angstadt, Kevin, Wadden, Jack, Weimer, Westley, and Skadron, Kevin
Subjects: *PROGRAMMING languages, *FIELD programmable analog arrays, *COMPUTER architecture, *LOGIC circuits, *MACHINE theory
Abstract: As the hardware found within data centers becomes more heterogeneous, it is important to allow for efficient execution of algorithms across architectures. We present RAPID, a high-level programming language and combined imperative and declarative model for functionally- and performance-portable execution of sequential pattern-matching applications across CPUs, GPUs, Field-Programmable Gate Arrays (FPGAs), and Micron’s D480 AP. RAPID is clear, maintainable, concise, and efficient both at compile and run time. Language features, such as code abstraction and parallel control structures, map well to pattern-matching problems, providing clarity and maintainability. For generation of efficient runtime code, we present algorithms to convert RAPID programs into finite automata. Our empirical evaluation of applications in the ANMLZoo benchmark suite demonstrates that the automata processing paradigm provides an abstraction that is portable across architectures. We evaluate RAPID programs against custom, baseline implementations previously demonstrated to be significantly accelerated. We also find that RAPID programs are much shorter in length, are expressible at a higher level of abstraction than their handcrafted counterparts, and yield generated code that is often more compact. [ABSTRACT FROM AUTHOR]
Published: 2019
Full Text: View/download PDF

45. Exploiting Parallelism for CNN Applications on 3D Stacked Processing-In-Memory Architecture.

Author: Wang, Yi, Chen, Weixuan, Yang, Jing, and Li, Tao
Subjects: *ARTIFICIAL neural networks, *ARTIFICIAL intelligence, *DATA transmission systems, *CACHE memory, *DYNAMIC programming
Abstract: Deep convolutional neural networks (CNNs) are widely adopted in intelligent systems with unprecedented accuracy but at the cost of a substantial amount of data movement. Although the emerging processing-in-memory (PIM) architecture seeks to minimize data movement by placing memory near processing elements, memory is still the major bottleneck in the entire system. The selection of hyper-parameters in the training of CNN applications requires over hundreds of kilobytes cache capacity for concurrent processing of convolutions. How to jointly explore the computation capability of the PIM architecture and the highly parallel property of neural networks remains a critical issue. This paper presents Para-Net, that exploits Parallelism for deterministic convolutional neural Networks on the PIM architecture. Para-Net achieves data-level parallelism for convolutions by fully utilizing the on-chip processing engine (PE) in PIM. The objective is to capture the characteristics of neural networks and present a hardware-independent design to jointly optimize the scheduling of both intermediate results and computation tasks. We formulate this data allocation problem as a dynamic programming model and obtain an optimal solution. To demonstrate the viability of the proposed Para-Net, we conduct a set of experiments using a variety of realistic CNN applications. The graph abstractions are obtained from deep learning framework Caffe. Experimental results show that Para-Net can significantly reduce processing time and improve cache efficiency compared to representative schemes. [ABSTRACT FROM AUTHOR]
Published: 2019
Full Text: View/download PDF

46. Improving the Performance and Energy Efficiency of GPGPU Computing through Integrated Adaptive Cache Management.

Author: Kim, Kyu Yeun, Park, Jinsu, and Baek, Woongki
Subjects: *GRAPHICS processing units, *CACHE memory, *ENERGY consumption, *COMPUTER memory management, *COMPUTER storage capacity
Abstract: Hardware caches are widely employed in GPGPUs to achieve higher performance and energy efficiency. Incorporating hardware caches in GPGPUs, however, does not immediately guarantee enhanced performance and energy efficiency due to high cache contention and thrashing. To address the inefficiency of GPGPU caches, various adaptive techniques (e.g., warp limiting) have been proposed. However, relatively little work has been done in the context of creating an architectural framework that tightly integrates adaptive cache management techniques and investigating their effectiveness and interaction. To bridge this gap, we propose IACM, integrated adaptive cache management for high-performance and energy-efficient GPGPU computing. IACM integrates the state-of-the-art adaptive cache management techniques (i.e., cache indexing, bypassing, and warp limiting) in a unified architectural framework. Our quantitative evaluation demonstrates that IACM significantly improves the performance and energy efficiency of various GPGPU workloads over the baseline architecture (i.e., 98.1 and 61.9 percent on average, respectively), achieves considerably higher performance than the state-of-the-art technique (i.e., 361.4 percent at maximum and 7.7 percent on average), and delivers significant performance and energy-efficiency gains over the baseline GPGPU architecture enhanced with advanced architectural technologies. [ABSTRACT FROM AUTHOR]
Published: 2019
Full Text: View/download PDF

47. Hardware Accelerated Semantic Declarative Memory Systems through CUDA and MapReduce.

Author: Edmonds, Mark, Atahary, Tanvir, Douglass, Scott, and Taha, Tarek
Subjects: *EXPLICIT memory, *CUDA (Computer architecture), *INFORMATION retrieval, *SEMANTIC networks (Information theory), *SERVICE-oriented architecture (Computer science)
Abstract: Declarative memory enables cognitive agents to effectively store and retrieve factual memory in real-time. Increasing the capacity of a real-time agent's declarative memory increases an agent's ability to interact intelligently with its environment but requires a scalable retrieval system. This work represents an extension of the Accelerated Declarative Memory (ADM) system, referred to as Hardware Accelerated Declarative Memory (HADM), to execute retrievals on a GPU. HADM also presents improvements over ADM's CPU execution and considers critical behavior for indefinitely running declarative memories. The negative effects of a constant maximum associative strength are considered, and mitigating solutions are proposed. HADM utilizes a GPU to process the entire semantic network in parallel during retrievals, yielding significantly faster declarative retrievals. The resulting GPU-accelerated retrievals show an average speedup of approximately 70 times over the previous Service Oriented Architecture Declarative Memory (soaDM) implementation and an average speedup of approximately 5 times over ADM. HADM is the first GPU-accelerated declarative memory system in existence. [ABSTRACT FROM AUTHOR]
Published: 2019
Full Text: View/download PDF

48. A Virtual Multi-Channel GPU Fair Scheduling Method for Virtual Machines.

Author: Tan, Huailiang, Tan, Yanjie, He, Xiaofei, Li, Kenli, and Li, Keqin
Subjects: *GRAPHICS processing units, *VIRTUAL machine systems, *MACHINE learning, *DEEP learning, *COMPUTER architecture
Abstract: In modern virtual computing environment, the 2D/3D rendering performance and parallel computing potential of GPU (graphics processing unit) must be fully exploited for multiple virtual machines (VMs). Existing GPU virtualization techniques are unable to take full advantage of a GPU's powerful 2D/3D hardware-accelerated graphics rendering performance or parallel computing potential, or it has not been considered that the internal resources of a GPU domain are fairly allocated between VMs with different performance requirements. Therefore, we propose a multi-channel GPU virtualization architecture (VMCG), model the corresponding credit allocating and transferring mechanisms, and redesign the virtual multi-channel GPU fair-scheduling algorithm. VMCG provides a separate V-Channel for each guest VM (DomU) that competes with other VMs for the same physical GPU resources, and each DomU submits command request blocks to its respective V-Channel according to the corresponding DomU ID. Through the virtual multi-channel GPU fair-scheduling algorithm, not only do multiple DomUs make full use of native GPU hardware acceleration, but the fairness of GPU resource allocation is significantly improved during GPU-intensive workloads from multiple DomUs running on the same host. Experimental results show that, for 2D/3D graphics applications, performance is close to 96 percent of that of the native GPU, performance is improved by approximately 500 percent for parallel computing applications, and GPU resource-allocation fairness is improved by approximately 60-80 percent. [ABSTRACT FROM AUTHOR]
Published: 2019
Full Text: View/download PDF

49. Parana: A Parallel Neural Architecture Considering Thermal Problem of 3D Stacked Memory.

Author: Yin, Shouyi, Tang, Shibin, Lin, Xinhan, Ouyang, Peng, Tu, Fengbin, Liu, Leibo, Zhao, Jishen, Xu, Cong, Li, Shuangcheng, Xie, Yuan, and Wei, Shaojun
Subjects: *DEEP learning, *NEURAL circuitry, *ARTIFICIAL intelligence, *VIDEO surveillance, *MACHINE learning
Abstract: Recent advances in deep learning (DL) have stimulated increasing interests in neural networks (NN). From the perspective of operation type and network architecture, deep neural networks can be categorized into full convolution-based neural network (ConvNet), recurrent neural network (RNN), and fully-connected neural network (FCNet). Different types of neural networks are usually cascaded and combined as a hybrid neural network (Hybrid-NN) to complete real-life cognitive tasks. Such hybrid-NN implementation is memory-intensive with large number of memory accesses, hence the performance of hybrid-NN is often limited by the insufficient memory bandwidth. A “3D + 2.5D” integration system, which integrates a high-bandwidth 3D stacked DRAM side-by-side with a highly-parallel neural processing unit (NPU) on a silicon interposer, overcomes the bandwidth bottleneck in hybrid-NN acceleration. However, intensive concurrent 3D DRAM accesses produced by the NPU lead to a serious thermal problem in 3D DRAM. In this paper, we propose a neural processor called Parana for hybrid-NN acceleration in consideration of thermal problem of 3D DRAM. Parana solves the thermal problem of 3D memory by optimizing both the total number of memory accesses and memory accessing behaviors. For memory accessing behaviors, Parana balances the memory bandwidth by spatial division mapping hybrid-NN onto computing resources, which efficiently avoids that masses of memory accesses are issued in a short time period. To reduce the total number of memory accesses, we design a new NPU architecture and propose a memory-oriented tiling and scheduling mechanism to exploit the maximum utilization of on-chip buffer. Experimental results show that Parana reduces the peak temperature by up to 54.72 °C and the steady temperature by up to 32.27 °C over state-of-the-art accelerators with 3D memory without performance degradation. [ABSTRACT FROM AUTHOR]
Published: 2019
Full Text: View/download PDF

50. PHAST - A Portable High-Level Modern C++ Programming Library for GPUs and Multi-Cores.

Author: Peccerillo, Biagio and Bartolini, Sandro
Subjects: *GRAPHICS processing units, *MICROCOMPUTER workstations (Computers), *SOFTWARE compatibility, *C++, *COMPLEXITY (Philosophy)
Abstract: A decade after the beginning of the many-core era, multi-core CPU and GPU architectures are everywhere, from mobile devices up to high-performance workstations and servers. To this day, programmers willing to harness their power need to express their code via languages and frameworks that often lack expressivity and high-level abstractions. These solutions, despite allowing users to reach unprecedented performance, can still be a hampering factor for productivity and portability. In this paper we propose PHAST, a modern C++, STL-like, single-source programming library and approach based on multi-dimensional dynamic containers and multi-layered functors that can be targeted on NVIDIA GPUs and multi-core CPUs. Its main purpose is to let programmers write code once for different architectures at a high level of abstraction, to reach high-performance while allowing fine parameter tuning and not shielding code from low-level target-specific optimizations. To assess the value of our proposal, we consider benchmarks from different application domains, and we evaluate their PHAST implementations against CUDA, OpenCL, Kokkos, and SYCL ones from both performance and productivity points of view. We show that PHAST can significantly reduce code complexity metrics while reaching very good performance. [ABSTRACT FROM AUTHOR]
Published: 2019
Full Text: View/download PDF
