5,379 results for "dataflow"
Search Results
202. Heterogeneous Design in Functional DIF
- Author
-
Plishker, William, Sane, Nimish, Kiemb, Mary, Bhattacharyya, Shuvra S., Hutchison, David, editor, Kanade, Takeo, editor, Kittler, Josef, editor, Kleinberg, Jon M., editor, Mattern, Friedemann, editor, Mitchell, John C., editor, Naor, Moni, editor, Nierstrasz, Oscar, editor, Pandu Rangan, C., editor, Steffen, Bernhard, editor, Sudan, Madhu, editor, Terzopoulos, Demetri, editor, Tygar, Doug, editor, Vardi, Moshe Y., editor, Weikum, Gerhard, editor, Bereković, Mladen, editor, Dimopoulos, Nikitas, editor, and Wong, Stephan, editor
- Published
- 2008
- Full Text
- View/download PDF
203. Memory-Centric Hardware Synthesis from Dataflow Models
- Author
-
Fischaber, Scott, McAllister, John, Woods, Roger, Hutchison, David, editor, Kanade, Takeo, editor, Kittler, Josef, editor, Kleinberg, Jon M., editor, Mattern, Friedemann, editor, Mitchell, John C., editor, Naor, Moni, editor, Nierstrasz, Oscar, editor, Pandu Rangan, C., editor, Steffen, Bernhard, editor, Sudan, Madhu, editor, Terzopoulos, Demetri, editor, Tygar, Doug, editor, Vardi, Moshe Y., editor, Weikum, Gerhard, editor, Bereković, Mladen, editor, Dimopoulos, Nikitas, editor, and Wong, Stephan, editor
- Published
- 2008
- Full Text
- View/download PDF
204. Streaming Systems in FPGAs
- Author
-
Neuendorffer, Stephen, Vissers, Kees, Hutchison, David, editor, Kanade, Takeo, editor, Kittler, Josef, editor, Kleinberg, Jon M., editor, Mattern, Friedemann, editor, Mitchell, John C., editor, Naor, Moni, editor, Nierstrasz, Oscar, editor, Pandu Rangan, C., editor, Steffen, Bernhard, editor, Sudan, Madhu, editor, Terzopoulos, Demetri, editor, Tygar, Doug, editor, Vardi, Moshe Y., editor, Weikum, Gerhard, editor, Bereković, Mladen, editor, Dimopoulos, Nikitas, editor, and Wong, Stephan, editor
- Published
- 2008
- Full Text
- View/download PDF
205. Verified Lustre Normalization with Node Subsampling
- Author
-
Basile Pesin, Paul Jeanmaire, Marc Pouzet, Timothy Bourke, Université Paris sciences et lettres (PSL), Département d'informatique de l'École normale supérieure (DI-ENS), École normale supérieure - Paris (ENS Paris), Université Paris sciences et lettres (PSL)-Université Paris sciences et lettres (PSL)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS), Parallélisme de Kahn Synchrone ( Parkas), Université Paris sciences et lettres (PSL)-Université Paris sciences et lettres (PSL)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-École normale supérieure - Paris (ENS Paris), Université Paris sciences et lettres (PSL)-Université Paris sciences et lettres (PSL)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-Centre National de la Recherche Scientifique (CNRS)-Inria de Paris, Institut National de Recherche en Informatique et en Automatique (Inria), BPI 'Programme d'Investissements d'Avenir' - ES3CAP, ANR-19-CE25-0014,FidelR,FidelR(2019), Département d'informatique - ENS Paris (DI-ENS), Centre National de la Recherche Scientifique (CNRS)-Institut National de Recherche en Informatique et en Automatique (Inria)-École normale supérieure - Paris (ENS Paris), Université Paris sciences et lettres (PSL)-Université Paris sciences et lettres (PSL), Inria de Paris, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Département d'informatique - ENS Paris (DI-ENS), Université Paris sciences et lettres (PSL)-Université Paris sciences et lettres (PSL)-Centre National de la Recherche Scientifique (CNRS)-École normale supérieure - Paris (ENS Paris), Université Paris sciences et lettres (PSL)-Université Paris sciences et lettres (PSL)-Centre National de la Recherche Scientifique (CNRS), École normale supérieure - Paris (ENS-PSL), and Université Paris sciences et lettres (PSL)-Université Paris sciences et lettres (PSL)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-École normale supérieure - Paris (ENS-PSL)
- Subjects
CCS Concepts: • Software and its engineering → Formal language definitions ,Normalization (statistics) ,interactive theorem proving ,Correctness ,Computer science ,Semantics (computer science) ,Dataflow ,• Computer systems organization → Embedded software stream languages ,02 engineering and technology ,computer.software_genre ,Software verification ,020204 information systems ,0202 electrical engineering, electronic engineering, information engineering ,computer.programming_language ,[INFO.INFO-PL]Computer Science [cs]/Programming Languages [cs.PL] ,Assembly language ,Programming language ,Lustre (programming language) ,Proof assistant ,[INFO.INFO-LO]Computer Science [cs]/Logic in Computer Science [cs.LO] ,020207 software engineering ,verified compilation ,Compilers ,Hardware and Architecture ,TheoryofComputation_LOGICSANDMEANINGSOFPROGRAMS ,[INFO.INFO-ES]Computer Science [cs]/Embedded Systems ,Compiler ,computer ,Software - Abstract
Dataflow languages allow the specification of reactive systems by mutually recursive stream equations, functions, and boolean activation conditions called clocks. Lustre and Scade are dataflow languages for programming embedded systems. Dataflow programs are compiled by a succession of passes. This article focuses on the normalization pass, which rewrites programs into the simpler form required for code generation. Vélus is a compiler from a normalized form of Lustre to CompCert’s Clight language. Its specification in the Coq interactive theorem prover includes an end-to-end correctness proof that the values prescribed by the dataflow semantics of source programs are produced by executions of generated assembly code. We describe how to extend Vélus with a normalization pass and allow subsampled node inputs and outputs. We propose semantic definitions for the unrestricted language, divide normalization into three steps to facilitate proofs, adapt the clock type system to handle richer node definitions, and extend the end-to-end correctness theorem to incorporate the new features. The proofs require reasoning about the relation between static clock annotations and the presence and absence of values in the dynamic semantics. The generalization of node inputs requires adding a compiler pass to ensure the initialization of variables passed in function calls. (A small sketch of the normalization idea follows this entry.)
- Published
- 2021
- Full Text
- View/download PDF
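The normalization pass summarized above rewrites dataflow programs into the simpler form required for code generation. Vélus itself is specified and proven in Coq; purely to illustrate the flavour of such a pass, here is a minimal Python sketch (invented representation and names, no clocks, no proofs) that flattens nested stream expressions into a sequence of simple equations by introducing fresh variables.

```python
# Minimal, illustrative sketch of expression normalization: nested expressions
# are flattened into simple equations by introducing fresh variables.
# This is NOT the Velus algorithm; names and representation are hypothetical.
import itertools

_fresh = itertools.count()

def fresh_var():
    return f"t{next(_fresh)}"

def normalize(expr, equations):
    """Flatten `expr` (a nested tuple like ("+", a, b), or a variable/constant)
    into a simple operand, appending intermediate equations to `equations`."""
    if not isinstance(expr, tuple):          # variable or constant: already simple
        return expr
    op, *args = expr
    simple_args = [normalize(a, equations) for a in args]
    v = fresh_var()
    equations.append((v, (op, *simple_args)))  # v = op(simple_args)
    return v

# Example: y = (a + b) * (a + 1) becomes a list of simple equations.
eqs = []
result = normalize(("*", ("+", "a", "b"), ("+", "a", 1)), eqs)
eqs.append(("y", result))
for lhs, rhs in eqs:
    print(lhs, "=", rhs)
```

The real pass must also handle clocks and operators such as fby, when, and merge, and its correctness is established end-to-end in the Coq development; none of that is attempted here.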
206. Enhancing the Utilization of Processing Elements in Spatial Deep Neural Network Accelerators
- Author
-
Seok-Bum Ko and Mohammadreza Asadikouhanjani
- Subjects
050210 logistics & transportation ,0209 industrial biotechnology ,Speedup ,Dataflow ,Least slack time scheduling ,Computer science ,business.industry ,Deep learning ,Reference design ,05 social sciences ,02 engineering and technology ,Computer Graphics and Computer-Aided Design ,020901 industrial engineering & automation ,Network on a chip ,Computer engineering ,0502 economics and business ,Bandwidth (computing) ,Overhead (computing) ,Artificial intelligence ,Electrical and Electronic Engineering ,business ,Software - Abstract
Equipping mobile platforms with deep learning applications is very valuable: providing healthcare services in remote areas, improving privacy, and lowering the required communication bandwidth are advantages of such platforms. Designing an efficient computation engine enhances the performance of these platforms while running deep neural networks (DNNs). Energy-efficient DNN accelerators use sparsity skipping and early negative output feature detection to prune computations. Spatial DNN accelerators, unlike other common architectures such as systolic arrays, can in principle support such computation-pruning techniques. To run these techniques efficiently and avoid network-on-chip (NoC) stalls, they need a separate, high-bandwidth data distribution fabric such as buses or trees. Spatial designs also suffer from divergence and unequal work distribution. Therefore, applying computation-pruning techniques to a spatial design still causes stalls inside the computation engine, even when it is equipped with an NoC that supplies high bandwidth to the processing elements (PEs). In a spatial architecture, the PEs that finish their tasks earlier have slack time compared to the others. In this article, we propose an architecture with negligible area overhead that shares the scratchpads between the PEs in a novel way to exploit the slack time caused by computation-pruning techniques or the NoC format used. With our dataflow, a spatial engine can benefit from computation-pruning and data-reuse techniques more efficiently. Compared to the reference design, the proposed method achieves a speedup of 1.24× and an energy efficiency improvement of 1.18× per inference.
- Published
- 2021
- Full Text
- View/download PDF
207. EnGN: A High-Throughput and Energy-Efficient Accelerator for Large Graph Neural Networks
- Author
-
Dawen Xu, Ying Wang, Lei He, Huawei Li, Xiaowei Li, Shengwen Liang, and Cheng Liu
- Subjects
Speedup ,Artificial neural network ,Computer science ,Dataflow ,Graph theory ,02 engineering and technology ,Parallel computing ,Data structure ,020202 computer hardware & architecture ,Theoretical Computer Science ,Memory management ,Computational Theory and Mathematics ,Hardware and Architecture ,0202 electrical engineering, electronic engineering, information engineering ,Hardware acceleration ,Overhead (computing) ,Software - Abstract
Graph neural networks (GNNs) have emerged as a powerful approach to processing non-Euclidean data structures and have proven effective in various application domains such as social networks and e-commerce. The graph data maintained in real-world systems can be extremely large and sparse, so employing GNNs to process them requires substantial computation and memory, which induces considerable energy and resource costs on CPUs and GPUs. In this article, we present a specialized accelerator architecture, EnGN, to enable high-throughput and energy-efficient processing of large-scale GNNs. EnGN is designed to accelerate the three key stages of GNN propagation, which are abstracted as common computing patterns shared by typical GNNs. To support these stages simultaneously, we propose the ring-edge-reduce (RER) dataflow, which tames the poor locality of sparsely and randomly connected vertices, and the RER PE array that implements it. In addition, we utilize a graph tiling strategy to fit large graphs into EnGN and make good use of the hierarchical on-chip buffers through adaptive computation reordering and tile scheduling. Overall, EnGN achieves performance speedups of 1802.9X, 19.75X, and 2.97X and energy-efficiency gains of 1326.35X, 304.43X, and 6.2X on average compared to a CPU, a GPU, and the state-of-the-art GCN accelerator HyGCN, respectively. (A small software sketch of the abstracted propagation stages follows this entry.)
- Published
- 2021
- Full Text
- View/download PDF
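EnGN accelerates three key stages of GNN propagation that the authors abstract as computing patterns common to typical GNNs. As a point of reference only, the sketch below shows how such a propagation step is commonly written in software: neighbour features are aggregated and then passed through a dense update. The stage split and the ReLU update are assumptions for illustration; this is not EnGN's RER dataflow or hardware.

```python
import numpy as np

def gnn_layer(features, adj, weight):
    """One GNN propagation layer split into the stages typically abstracted
    by GNN accelerators: gather/aggregate neighbour features, then update.
    features: (N, F) vertex features, adj: (N, N) 0/1 adjacency,
    weight: (F, F') dense layer weight. Illustrative only."""
    # Aggregate: sum features of neighbouring vertices. With a sparse,
    # randomly connected adjacency this step has the poor locality the
    # ring-edge-reduce dataflow is designed to tame.
    aggregated = adj @ features
    # Update: dense transformation plus a non-linearity (ReLU assumed here).
    return np.maximum(aggregated @ weight, 0.0)

rng = np.random.default_rng(0)
n, f_in, f_out = 6, 4, 3
adj = (rng.random((n, n)) < 0.3).astype(np.float64)
x = rng.random((n, f_in))
w = rng.random((f_in, f_out))
print(gnn_layer(x, adj, w).shape)   # (6, 3)
```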
208. Analytical Performance Estimation for Large-Scale Reconfigurable Dataflow Platforms
- Author
-
Ce Guo, Ryota Yasudo, Jose G. F. Coutinho, Ana Lucia Varbanescu, Wayne Luk, Hideharu Amano, and Tobias Becker
- Subjects
General Computer Science ,Scale (ratio) ,Computer engineering ,Computer science ,Dataflow ,Performance estimation ,Path (graph theory) ,Field-programmable gate array - Abstract
Next-generation high-performance computing platforms will handle extreme data- and compute-intensive problems that are intractable with today’s technology. A promising path towards the next leap in high-performance computing is to embrace heterogeneity and specialised computing in the form of reconfigurable accelerators such as FPGAs, which have been shown to speed up compute-intensive tasks with reduced power consumption. However, assessing the feasibility of large-scale heterogeneous systems requires fast and accurate performance prediction. This article proposes Performance Estimation for Reconfigurable Kernels and Systems (PERKS), a novel performance estimation framework for reconfigurable dataflow platforms. PERKS uses an analytical model with machine and application parameters to predict the performance of multi-accelerator systems and detect their bottlenecks. Model calibration is automatic, making the model flexible and usable for different machine configurations and applications, including hypothetical ones. Our experimental results show that PERKS can predict the performance of current workloads on reconfigurable dataflow platforms with an accuracy above 91%. The results also illustrate how the modelling scales to large workloads and how the performance impact of architectural features can be estimated in seconds. (See the sketch after this entry.)
- Published
- 2021
- Full Text
- View/download PDF
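PERKS predicts multi-accelerator performance from an analytical model with machine and application parameters, but the abstract does not spell out the model. The sketch below is a hedged, roofline-style stand-in with invented parameter names: each kernel is bounded by either compute throughput or memory bandwidth, and accelerators share an interconnect. It only illustrates what an analytical estimate of this kind can look like, not PERKS' actual equations or calibration.

```python
def kernel_time(flops, bytes_moved, peak_flops, mem_bw):
    """Roofline-style time estimate for one dataflow kernel on one accelerator:
    the kernel is limited either by compute or by data movement.
    Illustrative only; PERKS' actual model and calibration are more detailed."""
    return max(flops / peak_flops, bytes_moved / mem_bw)

def system_time(kernels, n_accel, peak_flops, mem_bw, link_bw, bytes_exchanged):
    """Naive multi-accelerator estimate: kernels are spread evenly over the
    accelerators and inter-accelerator traffic goes over a shared link."""
    per_accel = [kernel_time(f, b, peak_flops, mem_bw) for f, b in kernels]
    compute = sum(per_accel) / n_accel           # ideal load balance (assumption)
    communication = bytes_exchanged / link_bw    # shared interconnect (assumption)
    return compute + communication

# Example: two kernels on 4 FPGAs with hypothetical machine parameters.
kernels = [(2e12, 4e10), (8e11, 1.2e11)]         # (flops, bytes) per kernel
t = system_time(kernels, n_accel=4,
                peak_flops=5e12, mem_bw=6e10, link_bw=1e10, bytes_exchanged=2e10)
print(f"estimated time: {t:.3f} s")
```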
209. Dynamic Dataflow Scheduling and Computation Mapping Techniques for Efficient Depthwise Separable Convolution Acceleration
- Author
-
Longjun Liu, Xuchong Zhang, Hang Wang, Nanning Zheng, Jie Ren, Hongbin Sun, and Baoting Li
- Subjects
Hardware architecture ,Dataflow ,Computer science ,Reference design ,System on a chip ,Parallel computing ,Electrical and Electronic Engineering ,Field-programmable gate array ,Convolutional neural network ,Convolution ,Scheduling (computing) - Abstract
Depthwise separable convolution (DSC) has become one of the essential structures for lightweight convolutional neural networks. Nevertheless, its hardware architecture has not received much attention. Several previous hardware designs incur either high off-chip memory traffic or large on-chip memory usage, and hence are deficient in terms of hardware efficiency as well as performance. This paper proposes two efficient dynamic design techniques, adaptive row-based dataflow scheduling and adaptive computation mapping, to achieve a much better trade-off between hardware efficiency and performance for a DSC-based lightweight CNN accelerator. The effectiveness and efficiency of the proposed dynamic design techniques have been extensively evaluated using six DSC-based lightweight CNNs. Compared with the reference architectures, the simulation results show that the proposed architectural techniques reduce the on-chip buffer size by at least 50.4% and improve the performance of the convolution calculation by 1.18× while maintaining the minimum off-chip memory traffic. MobileNetV2 is implemented on a Zynq UltraScale+ ZCU102 SoC FPGA, and the results show that the proposed accelerator achieves 381.7 frames per second (fps), which is 1.43× that of the reference design, while saving about 36.3% of the on-chip buffer size and maintaining the same off-chip memory traffic.
- Published
- 2021
- Full Text
- View/download PDF
210. Evolutionary Multi-Objective Model Compression for Deep Neural Networks
- Author
-
Rick Siow Mong Goh, Liangli Zhen, Joey Tianyi Zhou, Tao Luo, Miqing Li, and Zhehui Wang
- Subjects
education.field_of_study ,Dataflow ,business.industry ,Deep learning ,Population ,Energy consumption ,Theoretical Computer Science ,Computer engineering ,Artificial Intelligence ,Pruning (decision trees) ,Artificial intelligence ,Language translation ,education ,Quantization (image processing) ,business ,Efficient energy use - Abstract
While deep neural networks (DNNs) deliver state-of-the-art accuracy on various applications from face recognition to language translation, this comes at the cost of high computational and space complexity, hindering their deployment on edge devices. To enable efficient processing of DNNs in inference, a novel approach, called Evolutionary Multi-Objective Model Compression (EMOMC), is proposed to optimize energy efficiency (or model size) and accuracy simultaneously. Specifically, the network pruning and quantization space is explored and exploited through the evolution of an architecture population. Furthermore, by taking advantage of the orthogonality between pruning and quantization, a two-stage pruning and quantization co-optimization strategy is developed, which considerably reduces the time cost of the architecture search. Lastly, different dataflow designs and parameter coding schemes are considered in the optimization process, since they have a significant impact on energy consumption and model size. Owing to the cooperative evolution of different architectures in the population, a set of compact DNNs that offer trade-offs on different objectives (e.g., accuracy, energy efficiency, and model size) can be obtained in a single run. Unlike most existing approaches, which are designed to reduce the size of weight parameters with no significant loss of accuracy, the proposed method aims to achieve a trade-off between desirable objectives in order to meet the different requirements of various edge devices. Experimental results demonstrate that the proposed approach can obtain a diverse population of compact DNNs that are suitable for a broad range of memory usage and energy consumption requirements. Under negligible accuracy loss, EMOMC improves the energy efficiency and model compression rate of VGG-16 on CIFAR-10 by factors of more than 8.9X and 2.4X, respectively. (See the Pareto-selection sketch after this entry.)
- Published
- 2021
- Full Text
- View/download PDF
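EMOMC evolves a population of compressed networks and returns a set of candidates that trade accuracy against energy efficiency or model size. The sketch below illustrates only the Pareto-dominance filtering that underlies any such multi-objective selection; the candidate values are invented and EMOMC's pruning/quantization co-optimization is not modelled.

```python
def dominates(a, b):
    """a dominates b if it is no worse in both objectives and strictly better
    in at least one. Objectives: (accuracy to maximise, energy to minimise)."""
    acc_a, en_a = a
    acc_b, en_b = b
    return (acc_a >= acc_b and en_a <= en_b) and (acc_a > acc_b or en_a < en_b)

def pareto_front(population):
    """Keep only non-dominated candidates (the trade-off set that a
    multi-objective search returns from a single run)."""
    return [p for p in population
            if not any(dominates(q, p) for q in population if q is not p)]

# Hypothetical compressed-model candidates: (accuracy, energy per inference).
candidates = [(0.93, 1.00), (0.92, 0.55), (0.90, 0.40), (0.89, 0.60), (0.85, 0.38)]
print(pareto_front(candidates))
# -> [(0.93, 1.0), (0.92, 0.55), (0.9, 0.4), (0.85, 0.38)]
```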
211. nZESPA: A Near-3D-Memory Zero Skipping Parallel Accelerator for CNNs
- Author
-
Palash Das and Hemangee K. Kapoor
- Subjects
Hybrid Memory Cube ,Exploit ,Computer science ,Dataflow ,Feature extraction ,02 engineering and technology ,Energy consumption ,Parallel computing ,Computer Graphics and Computer-Aided Design ,Convolutional neural network ,020202 computer hardware & architecture ,Parallel processing (DSP implementation) ,Encoding (memory) ,0202 electrical engineering, electronic engineering, information engineering ,Electrical and Electronic Engineering ,Software - Abstract
Convolutional neural networks (CNNs) are among the most popular machine learning tools for computer vision. Their ubiquitous use across applications, combined with their high computation cost, makes them attractive targets for optimization through accelerator architectures. The state of the art has exploited the parallelism of CNNs, eliminated computations through sparsity, or used near-memory processing (NMP) to accelerate CNNs. We introduce an NMP, fully sparse architecture that combines all three capabilities. The proposed architecture is parallel and hence processes independent CNN tasks concurrently. To exploit sparsity, the proposed system employs a dataflow, namely the Near-3D-Memory Zero Skipping Parallel (nZESPA) dataflow. This dataflow maintains a compressed-sparse encoding of data that skips all ineffectual zero-valued computations of CNNs. We design a custom accelerator that employs the nZESPA dataflow. Grids of nZESPA modules are integrated into the logic layer of the hybrid memory cube. This integration saves a significant amount of off-chip communication while implementing the concept of NMP. We compare the proposed architecture with three other architectures that either do not exploit sparsity (NMP-dense), do not employ NMP (traditional-fully sparse), or do neither (traditional-dense). The proposed system outperforms the baselines in terms of performance and energy consumption while executing CNN inference. (A zero-skipping sketch follows this entry.)
- Published
- 2021
- Full Text
- View/download PDF
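The nZESPA dataflow keeps data in a compressed-sparse encoding so that zero-valued (ineffectual) computations are skipped. As a purely software illustration of that idea, and not the accelerator's actual encoding or near-memory dataflow, the sketch below stores only non-zero entries with their indices and multiplies-and-accumulates only where both operands are present.

```python
def to_sparse(vec):
    """Compressed representation: {index: nonzero value}."""
    return {i: v for i, v in enumerate(vec) if v != 0}

def sparse_dot(ifmap, weights):
    """Multiply-accumulate only where both the activation and the weight are
    non-zero; every skipped pair is an 'ineffectual' computation a dense
    engine would still execute. Illustrative sketch, not nZESPA's dataflow."""
    a, w = to_sparse(ifmap), to_sparse(weights)
    # Iterate over the smaller operand for fewer lookups.
    small, large = (a, w) if len(a) <= len(w) else (w, a)
    return sum(v * large[i] for i, v in small.items() if i in large)

ifmap   = [0, 3, 0, 0, 2, 0, 1, 0]
weights = [1, 0, 0, 4, 5, 0, 2, 0]
print(sparse_dot(ifmap, weights))   # 2*5 + 1*2 = 12
```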
212. Fine Grain Distributed Implementation of a Dataflow Language with Provable Performances
- Author
-
Gautier, Thierry, Roch, Jean-Louis, Wagner, Frédéric, Hutchison, David, editor, Kanade, Takeo, editor, Kittler, Josef, editor, Kleinberg, Jon M., editor, Mattern, Friedemann, editor, Mitchell, John C., editor, Naor, Moni, editor, Nierstrasz, Oscar, editor, Rangan, C. Pandu, editor, Steffen, Bernhard, editor, Sudan, Madhu, editor, Terzopoulos, Demetri, editor, Tygar, Doug, editor, Vardi, Moshe Y., editor, Weikum, Gerhard, editor, Shi, Yong, editor, van Albada, Geert Dick, editor, Dongarra, Jack, editor, and Sloot, Peter M. A., editor
- Published
- 2007
- Full Text
- View/download PDF
213. Performance Analysis of GEMM Mappings on CPU Architectures
- Author
-
Gallego Jené, Martina, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Georgia Institute of Technology, Abadal Cavallé, Sergi, and Krishna, Tushar
- Subjects
Neural networks (Computer science) ,Enginyeria de la telecomunicació::Telemàtica i xarxes d'ordinadors [Àrees temàtiques de la UPC] ,GEMM ,flujo de datos ,tiling ,Xarxes neuronals (Informàtica) ,dataflow ,DNN - Abstract
Deep Neural Networks (DNNs) have revolutionized the Machine Learning (ML) scene with their ability to process large amounts of data, which allows them to model complex relationships and make predictions across multiple and diverse fields such as e-commerce, medicine, and entertainment. These great advantages come with a great computational cost. One of the operations that represents a large fraction of the computational workload of most types of DNNs, including Convolutional Neural Networks (CNNs) and Graph Neural Networks (GNNs), is General Matrix-Matrix Multiplication (GEMM). Nowadays, most of the research on improving the speed and efficiency of GEMMs focuses on accelerators and GPUs, but CPUs are still extensively present, and sometimes underutilized, in modern computing systems, especially in data centers. As a result, this thesis focuses on the performance analysis of GEMMs on CPU systems. In particular, this study explores the impact of different tiling, dataflow, and partitioning techniques on data reuse and data movement across a set of CPU architectures. The results obtained from these simulations show that multi-core architectures are faster than single-core ones due to the high degree of parallelism in GEMMs. We have also observed that the impact of tiling on memory accesses and runtime is not uniform across the different matrix dimensions. Finally, we have observed that the input-stationary dataflow generally performs better than the output-stationary one, although there are some exceptions. (A tiled-GEMM sketch follows this entry.)
- Published
- 2022
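The thesis compares tiling and dataflow choices (input-stationary versus output-stationary) for GEMM on CPUs. To make those terms concrete, here is a minimal tiled GEMM sketch in which each output tile stays resident while input tiles stream past it, i.e. an output-stationary loop order; the tile size and the mapping of terminology to this loop nest are illustrative assumptions, not the simulator configurations used in the thesis.

```python
import numpy as np

def tiled_gemm(A, B, tile=4):
    """C = A @ B computed tile by tile. For each (i, j) output tile the partial
    sums stay resident (output-stationary) while tiles of A and B stream through;
    reordering the loops so a tile of A stays put instead would make the
    dataflow input-stationary. Illustrative only."""
    m, k = A.shape
    k2, n = B.shape
    assert k == k2
    C = np.zeros((m, n))
    for i0 in range(0, m, tile):
        for j0 in range(0, n, tile):
            acc = np.zeros((min(tile, m - i0), min(tile, n - j0)))
            for k0 in range(0, k, tile):   # reduction dimension streams through
                acc += A[i0:i0 + tile, k0:k0 + tile] @ B[k0:k0 + tile, j0:j0 + tile]
            C[i0:i0 + tile, j0:j0 + tile] = acc
    return C

A = np.random.rand(6, 5)
B = np.random.rand(5, 7)
print(np.allclose(tiled_gemm(A, B), A @ B))   # True
```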
214. Towards a Message Broker Free FaaS for Distributed Dataflow Applications
- Author
-
Fortier, Patrik, Mouel, Frederic, Ponge, Julien, Dynamic Software and Distributed Systems (DYNAMID), CITI Centre of Innovation in Telecommunications and Integration of services (CITI), Institut National des Sciences Appliquées de Lyon (INSA Lyon), Université de Lyon-Institut National des Sciences Appliquées (INSA)-Université de Lyon-Institut National des Sciences Appliquées (INSA)-Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National des Sciences Appliquées de Lyon (INSA Lyon), Université de Lyon-Institut National des Sciences Appliquées (INSA)-Université de Lyon-Institut National des Sciences Appliquées (INSA)-Institut National de Recherche en Informatique et en Automatique (Inria), and Red Hat Inc.
- Subjects
FaaS ,Multi-tier Programming ,Macro Programming ,Distributed Systems ,Dataflow ,[INFO]Computer Science [cs] - Abstract
We present an extended implementation of Dyninka, a framework for prototyping FaaS-based distributed dataflow applications. Its programming model gathers the definition and composition of services within a single file using the multi-tier programming paradigm, and compiles them into multiple services to be deployed on a cloud computing infrastructure. Our framework is built without a gateway or a messaging platform: services communicate directly with each other within the abstracted cloud infrastructure. As a result, we emancipate ourselves from message brokers and reduce the network and computation overheads introduced by other FaaS frameworks such as OpenFaaS. We validated our approach on a Fog computing scenario with limited resources and several load profiles. Our framework shows better stability, higher throughput, and reduced overhead compared to OpenFaaS.
- Published
- 2022
- Full Text
- View/download PDF
215. Automated Generation and Evaluation of Dataflow-Based Test Data for Object-Oriented Software
- Author
-
Oster, Norbert, Hutchison, David, editor, Kanade, Takeo, editor, Kittler, Josef, editor, Kleinberg, Jon M., editor, Mattern, Friedemann, editor, Mitchell, John C., editor, Naor, Moni, editor, Nierstrasz, Oscar, editor, Pandu Rangan, C., editor, Steffen, Bernhard, editor, Sudan, Madhu, editor, Terzopoulos, Demetri, editor, Tygar, Dough, editor, Vardi, Moshe Y., editor, Weikum, Gerhard, editor, Reussner, Ralf, editor, Mayer, Johannes, editor, Stafford, Judith A., editor, Overhage, Sven, editor, Becker, Steffen, editor, and Schroeder, Patrick J., editor
- Published
- 2005
- Full Text
- View/download PDF
216. The ontology-based modeling and evolution of digital twin for assembly workshop
- Author
-
Wei Wang, Yong Yu, Qiangwei Bao, Gang Zhao, and Sheng Dai
- Subjects
Process (engineering) ,Computer science ,business.industry ,Dataflow ,Mechanical Engineering ,Tracing ,Ontology (information science) ,Field (computer science) ,Industrial and Manufacturing Engineering ,Task (project management) ,Computer Science Applications ,Resource (project management) ,Control and Systems Engineering ,Software engineering ,business ,Software ,Production system - Abstract
Digital twin (DT) technology has been entrusted with the tasks of modeling and monitoring the product, the process, and the production system. Moreover, the development of semantic modeling and digital perception makes the application of DTs in the manufacturing industry feasible. However, the application of DT technology to assembly workshop modeling and management remains immature owing to the discreteness of the assembly process, the diversity of assembly resources, and the complexity of the dataflow in assembly task execution. A method for ontology-based modeling and evolution of a DT for the assembly workshop is proposed to deal with this situation. Firstly, an ontology-based modeling method is given for assembly resources and processes; by instantiating them in the ontology, resources and processes can be involved in the modeling and evolution of the DT workshop. Secondly, the DT assembly workshop framework is introduced, with detailed discussions of dataflow mapping, DT evolution, and the storage and tracing of historical data generated during the operation of the workshop. In addition, a case study on an experimental field illustrates the entire process of constructing and evolving the DT, indicating the feasibility and validity of the proposed method.
- Published
- 2021
- Full Text
- View/download PDF
217. In‐Memory Computation
- Author
-
Albert Chun-Chen Liu and Oscar Ming Kin Law
- Subjects
Hardware_MEMORYSTRUCTURES ,Hybrid Memory Cube ,Artificial neural network ,Dataflow ,business.industry ,Computer science ,Computation ,Deep learning ,Parallel computing ,Bottleneck ,Artificial intelligence ,Cache ,business ,Data transmission - Abstract
To overcome the deep learning memory challenge, in-memory computation is proposed. This chapter introduces several memory-centric Processor-in-Memory architectures: the Neurocube, Tetris, and NeuroStream accelerators. The Georgia Institute of Technology's Neurocube accelerator integrates a parallel neural processing unit with the high-density 3D memory package Hybrid Memory Cube (HMC) to resolve the memory bottleneck; it applies a memory-centric neural computing approach for data-driven computation. Stanford University's Tetris accelerator adapts the MIT Eyeriss Row Stationary dataflow with additional 3D memory (HMC) to optimize memory access for in-memory computation, and implements in-memory accumulation to eliminate half of the ofmap memory accesses and TSV data transfers. The University of Bologna's NeuroStream accelerator is derived from the Processor-in-Memory architecture; its key features include the NeuroCluster frequency, NeuroStreams per cluster, instruction cache per core, and scratchpad per cluster. A roofline plot is used to illustrate NeuroStream processor performance.
- Published
- 2021
- Full Text
- View/download PDF
218. Configurable Multi-directional Systolic Array Architecture for Convolutional Neural Networks
- Author
-
Yang Guo, Yaohua Wang, Rui Xu, Sheng Ma, and Xinhai Chen
- Subjects
010302 applied physics ,Speedup ,business.industry ,Dataflow ,Computer science ,Systolic array ,02 engineering and technology ,Energy consumption ,01 natural sciences ,Convolutional neural network ,020202 computer hardware & architecture ,Convolution ,Transmission (telecommunications) ,Hardware and Architecture ,0103 physical sciences ,0202 electrical engineering, electronic engineering, information engineering ,Hardware acceleration ,business ,Software ,Computer hardware ,Information Systems - Abstract
The systolic array architecture is one of the most popular choices for convolutional neural network hardware accelerators. Its biggest advantage is its simple and efficient design principle: without complicated control and dataflow, hardware accelerators based on systolic arrays can compute traditional convolution very efficiently. However, this advantage also brings new challenges. When computing special types of convolution, such as small-scale convolution or depthwise convolution, the processing element (PE) utilization rate of the array drops sharply, mainly because the simple architecture limits the flexibility of the systolic array. In this article, we design a configurable multi-directional systolic array (CMSA) to address these issues. First, we add a data path to the systolic array that allows users to split the array through configuration to speed up small-scale convolution. Second, we redesign the PE unit so that the array has multiple data transmission modes and dataflow strategies, allowing users to switch the dataflow of the PE array to speed up depthwise convolution. In addition, unlike other works, we make only a few changes to the existing systolic array architecture, which avoids additional hardware overhead and can be easily deployed in application scenarios that require small systolic arrays, such as mobile terminals. Based on our evaluation, CMSA can increase the PE utilization rate by up to 1.6 times compared to a typical systolic array when running the last layers of ResNet-18, and by up to 14.8 times when running depthwise convolution in MobileNet. At the same time, CMSA and traditional systolic arrays are similar in area and energy consumption.
- Published
- 2021
- Full Text
- View/download PDF
219. Compliant geo-distributed data processing in action
- Author
-
Kaustubh Beedkar, David Brekardin, Volker Markl, and Jorge-Arnulfo Quiané-Ruiz
- Subjects
Data processing ,Action (philosophy) ,Work (electrical) ,Computer science ,Human–computer interaction ,Dataflow ,Movement (music) ,General Engineering ,Dimension (data warehouse) - Abstract
In this paper we present our work on compliant geo-distributed data processing. Our work focuses on the new dimension of dataflow constraints that regulate the movement of data across geographical or institutional borders. For example, European directives may permit transferring only certain information fields (such as non-personal information) or only aggregated data. Thus, it is crucial for distributed data processing frameworks to consider compliance with dataflow constraints derived from these regulations. We have developed a compliance-based data processing framework, which (i) allows for the declarative specification of dataflow constraints, (ii) determines whether a query can be translated into a compliant distributed query execution plan, and (iii) executes the compliant plan over distributed SQL databases. We demonstrate our framework using a geo-distributed adaptation of the TPC-H benchmark data. The framework provides an interactive dashboard that lets users specify dataflow constraints and analyze and execute compliant distributed query execution plans. (A simplified constraint-check sketch follows this entry.)
- Published
- 2021
- Full Text
- View/download PDF
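The framework lets users declare dataflow constraints and then checks whether a distributed query plan only ships data that may legally cross a border. The sketch below is an invented, heavily simplified stand-in for such a check (a per-route allow-list of fields); it is not the authors' constraint language or planner.

```python
# Hypothetical, simplified dataflow-constraint check: each rule says which
# fields of a relation may be shipped over a given route between regions.
CONSTRAINTS = {
    ("orders", "EU->US"): {"order_id", "total_price"},   # non-personal fields only
    ("orders", "EU->EU"): {"order_id", "total_price", "customer_name"},
}

def shipment_is_compliant(relation, route, fields):
    """True if every field the plan wants to move over `route` is allowed."""
    allowed = CONSTRAINTS.get((relation, route), set())
    return set(fields) <= allowed

# A plan fragment that moves customer names from an EU site to a US site
# would be rejected, so the planner must aggregate or drop that column first.
print(shipment_is_compliant("orders", "EU->US", ["order_id", "total_price"]))    # True
print(shipment_is_compliant("orders", "EU->US", ["order_id", "customer_name"]))  # False
```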
220. An Efficient and Flexible Accelerator Design for Sparse Convolutional Neural Networks
- Author
-
Jun Lin, Zhongfeng Wang, Xiaoru Xie, and Jinghe Wei
- Subjects
Computational complexity theory ,Computer science ,Dataflow ,business.industry ,020208 electrical & electronic engineering ,Bandwidth (signal processing) ,02 engineering and technology ,Parallel computing ,Convolutional neural network ,0202 electrical engineering, electronic engineering, information engineering ,System on a chip ,Electrical and Electronic Engineering ,Field-programmable gate array ,business ,Electrical efficiency ,Digital signal processing - Abstract
Designing hardware accelerators for convolutional neural networks (CNNs) has recently attracted tremendous attention. Many existing accelerators are built for dense CNNs or structured sparse CNNs. By contrast, unstructured sparse CNNs can achieve a higher compression ratio with equivalent accuracy. However, their hardware implementations generally suffer from load imbalance and conflicting accesses to on-chip buffers, which results in underutilization of the processing elements (PEs). To tackle these issues, we propose a hardware- and power-efficient, highly flexible architecture that supports both unstructured and structured sparse CNNs with various configurations. Firstly, we propose an efficient weight reordering algorithm to preprocess compressed weights and balance the workload of the PEs. Secondly, an adaptive on-chip dataflow, namely the hybrid parallel (HP) dataflow, is introduced to promote weight reuse. Thirdly, the partial fusion scheme, first introduced in one of our prior works, is incorporated as the off-chip dataflow. Thanks to these dataflow optimizations, the repetitive data exchanges between on-chip buffers and external memories are significantly reduced. We implement the design on the Intel Arria 10 SX660 platform and evaluate it with MobileNet-v2, ResNet-50, and ResNet-18 on the ImageNet dataset. Compared to existing sparse accelerators on FPGAs, the proposed accelerator achieves a 1.35–1.81× improvement in power efficiency at the same sparsity. Compared to prior dense accelerators, it achieves an improvement of 1.92–5.84× in DSP efficiency.
- Published
- 2021
- Full Text
- View/download PDF
221. A Binary Translation Framework for Automated Hardware Generation
- Author
-
Joao M. P. Cardoso, João Bispo, João Canas Ferreira, and Nuno Paulino
- Subjects
MicroBlaze ,Dataflow ,Computer science ,business.industry ,Binary translation ,Symmetric multiprocessor system ,Energy consumption ,Reconfigurable computing ,Hardware and Architecture ,Retargeting ,Electrical and Electronic Engineering ,Field-programmable gate array ,business ,Software ,Computer hardware - Abstract
As applications move to the edge, efficiency in both computing power and power/energy consumption is required. Heterogeneous computing promises to meet these requirements through application-specific hardware accelerators. Runtime adaptivity might be of paramount importance to realize the potential of hardware specialization, but further study is required on workload retargeting and offloading to reconfigurable hardware. This article presents our framework for the exploration of both offloading and hardware generation techniques. The framework is currently able to process instruction sequences from MicroBlaze, ARMv8, and riscv32imaf binaries, and to represent them as Control and Dataflow Graphs for transformation into hardware module implementations. We illustrate the framework’s capabilities for identifying binary sequences suitable for hardware translation with a set of 13 benchmarks.
- Published
- 2021
- Full Text
- View/download PDF
222. Watermarks in stream processing systems
- Author
-
Slava Chernyak, Daniel Mills, Fabian Hueske, Kenneth Knowles, Kathryn Knight, Dan Sotolongo, Tyler Akidau, and Edmon Begoli
- Subjects
Stream processing ,Semantics (computer science) ,Dataflow ,Computer science ,Programming language ,business.industry ,General Engineering ,Cloud computing ,computer.software_genre ,business ,computer - Abstract
Streaming data processing is an exercise in taming disorder: from oftentimes huge torrents of information, we hope to extract powerful and timely analyses. But when dealing with streaming data, the unbounded and temporally disordered nature of real-world streams introduces a critical challenge: how does one reason about the completeness of a stream that never ends? In this paper, we present a comprehensive definition and analysis of watermarks, a key tool for reasoning about temporal completeness in infinite streams. First, we describe what watermarks are and why they are important, highlighting how they address a suite of stream processing needs that are poorly served by eventually-consistent approaches:
• Computing a single correct answer, as in notifications.
• Reasoning about a lack of data, as in dip detection.
• Performing non-incremental processing over temporal subsets of an infinite stream, as in statistical anomaly detection with cubic spline models.
• Safely and punctually garbage collecting obsolete inputs and intermediate state.
• Surfacing a reliable signal of overall pipeline health.
Second, we describe, evaluate, and compare the semantically equivalent, but starkly different, watermark implementations in two modern stream processing engines: Apache Flink and Google Cloud Dataflow. (A toy watermark sketch follows this entry.)
- Published
- 2021
- Full Text
- View/download PDF
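A watermark is the engine's estimate that no events with older timestamps will still arrive, which is what allows a window to be declared complete and a single correct answer to be emitted. The sketch below is a deliberately simplified, single-source illustration of that mechanic with an assumed fixed bound on out-of-orderness; watermark generation and propagation in Flink or Cloud Dataflow are considerably more involved.

```python
# Simplified illustration of watermark-driven window triggering for one
# unordered event stream. Assumed: a fixed bound on out-of-orderness.
from collections import defaultdict

WINDOW = 10          # window length in time units
MAX_LATENESS = 3     # assumed bound on how out-of-order events can be

def process(events):
    windows = defaultdict(list)   # window start -> values
    emitted = set()
    watermark = float("-inf")
    for timestamp, value in events:
        windows[(timestamp // WINDOW) * WINDOW].append(value)
        # Watermark: no event older than this is expected any more.
        watermark = max(watermark, timestamp - MAX_LATENESS)
        # A window [start, start+WINDOW) is complete once the watermark passes its end.
        for start in sorted(windows):
            if start + WINDOW <= watermark and start not in emitted:
                print(f"window [{start},{start + WINDOW}) sum = {sum(windows[start])}")
                emitted.add(start)

process([(1, 5), (12, 2), (3, 7), (14, 1), (25, 9)])
# window [0,10) is emitted once the watermark (timestamp 14 - 3 = 11) passes 10.
```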
223. A Robust Deep-Learning-Enabled Trust-Boundary Protection for Adversarial Industrial IoT Environment
- Author
-
Shamsul Huda, Victor Hugo C. de Albuquerque, Mohammad Mehedi Hassan, and Md. Rafiul Hassan
- Subjects
Trust boundary ,Artificial neural network ,Computer Networks and Communications ,Computer science ,business.industry ,Dataflow ,Distributed computing ,Deep learning ,020206 networking & telecommunications ,02 engineering and technology ,Attack surface ,Computer Science Applications ,Attack model ,Hardware and Architecture ,Robustness (computer science) ,Signal Processing ,0202 electrical engineering, electronic engineering, information engineering ,020201 artificial intelligence & image processing ,Artificial intelligence ,business ,Information Systems - Abstract
In recent years, trust-boundary protection has become a challenging problem in Industrial Internet of Things (IIoT) environments. Trust boundaries separate IIoT processes and data stores into different groups based on user access privilege. Points where dataflow intersects the trust boundary are becoming entry points for attackers. Attackers use various model-skewing and intelligent techniques to generate adversarial/noisy examples that are indistinguishable from natural data. Many existing machine-learning (ML)-based approaches attempt to circumvent this problem. However, owing to the extremely large attack surface of the IIoT network, capturing the true distribution during training is difficult. The standard generative adversarial network (GAN) commonly generates adversarial examples for training using randomly sampled noise. However, the distribution of the GAN's noisy inputs differs greatly from the actual distribution of data in IIoT networks, and such models show less robustness against adversarial attacks. Therefore, in this article, we propose a downsampler-encoder-based cooperative data generator that is trained using an algorithm designed to better capture the actual distribution of attack models over the large IIoT attack surface. The proposed downsampler-based data generator is alternately updated and verified during training using a deep neural network discriminator to ensure robustness. This guarantees the performance of the generator against input sets with a high noise level at training and testing time. Various experiments are conducted on a real IIoT testbed dataset. The experimental results show that the proposed approach outperforms conventional deep learning and other ML techniques in terms of robustness against adversarial/noisy examples in the IIoT environment.
- Published
- 2021
- Full Text
- View/download PDF
224. System-Technology Codesign of 3-D NAND Flash-Based Compute-in-Memory Inference Engine
- Author
-
Wonbo Shim and Shimeng Yu
- Subjects
Computer engineering. Computer hardware ,Hardware_MEMORYSTRUCTURES ,Computer science ,business.industry ,Dataflow ,Inference ,NAND gate ,Chip ,Electronic, Optical and Magnetic Materials ,Resistive random-access memory ,deep neural network (DNN) ,TK7885-7895 ,Hardware and Architecture ,Static random-access memory ,Electrical and Electronic Engineering ,Inference engine ,compute-in-memory (CIM) ,business ,hardware accelerator ,Throughput (business) ,3-D NAND ,Computer hardware - Abstract
Due to its ultrahigh density and commercially mature fabrication technology, 3-D NAND flash memory has been proposed as an attractive candidate for an inference engine for deep neural network (DNN) workloads. However, the peripheral circuits of conventional 3-D NAND flash need to be modified to enable compute-in-memory (CIM), and the chip architecture needs to be redesigned for an optimized dataflow. In this work, we present a design of a 3-D NAND-CIM accelerator based on the macro parameters of an industry-grade prototype chip. The DNN inference performance is evaluated using the DNN+NeuroSim framework. To exploit the ultrahigh density of 3-D NAND flash, both input and weight mapping strategies are introduced to improve throughput. Benchmarking on the VGG network was performed across the technological candidates for CIM, including SRAM, resistive random access memory (RRAM), and 3-D NAND. Compared to similar designs with SRAM or RRAM, the results show that the 3-D NAND-based CIM design requires only 17%–24% of the chip area while achieving 1.9–2.7 times better energy efficiency for 8-bit precision inference. The inference accuracy drop induced by 3-D NAND string current drift and variation is also investigated. No accuracy degradation due to current variation was observed with the proposed input mapping scheme, while accuracy is sensitive to current drift, which implies that compensation schemes are needed to maintain the inference accuracy.
- Published
- 2021
- Full Text
- View/download PDF
225. The System for Transforming the Code of Dataflow Programs into Imperative
- Author
-
Vladimir S. Vasilev, Alexander I. Legalov, and Sergey V. Zykov
- Subjects
0209 industrial biotechnology ,Source code ,Computer science ,Dataflow ,program analysis ,media_common.quotation_subject ,transformation of programs ,02 engineering and technology ,Information technology ,computer.software_genre ,Set (abstract data type) ,symbols.namesake ,020901 industrial engineering & automation ,Program analysis ,0202 electrical engineering, electronic engineering, information engineering ,media_common ,Programming language ,intermediate program representations ,typing ,Dataflow programming ,T58.5-58.64 ,Imperative programming ,Transformation (function) ,symbols ,020201 artificial intelligence & image processing ,computer ,Von Neumann architecture ,dataflow parallel programming - Abstract
Functional dataflow programming languages are designed for creating parallel, portable programs. The source code of such programs is translated into a set of graphs that reflect information and control dependencies. The main way to execute them is interpretation, which does not allow calculations to be performed efficiently on real parallel computing systems and leads to poor performance. To run programs directly on existing computing systems, one needs specific optimization and transformation methods that take into account the features of both the programming language and the architecture of the system. Currently, the Von Neumann architecture is the most common, and parallel programming for it is in most cases carried out using imperative languages with static type systems. For different architectures of parallel computing systems, there are various approaches to writing parallel programs. Transforming dataflow parallel programs into imperative programs makes it possible to form a framework of imperative code fragments that directly express sequential calculations; this framework can later be adapted to a specific parallel architecture. The paper considers an approach to performing this type of transformation, which consists of identifying fragments of dataflow parallel programs as templates that are subsequently replaced by equivalent fragments in imperative languages. The proposed transformation methods make it possible to generate program code to which various optimizing transformations can later be applied, including parallelization for the target architecture. (A toy serialization sketch follows this entry.)
- Published
- 2021
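The transformation described above identifies fragments of the dataflow graph as templates and replaces them with equivalent imperative code. As a toy illustration of the underlying idea of serializing a data-dependency graph into sequential statements, the sketch below emits one assignment per node in topological order; the graph format and the emitted pseudo-code are invented for illustration and do not reproduce the paper's template system.

```python
# Toy illustration: serialize a dataflow graph into imperative assignments
# in dependency (topological) order. Requires Python 3.9+ for graphlib.
from graphlib import TopologicalSorter

# name -> (operator, [input names]); inputs with no entry are external inputs.
graph = {
    "t1": ("+", ["a", "b"]),
    "t2": ("*", ["t1", "c"]),
    "out": ("-", ["t2", "a"]),
}

# Predecessors of each node are the inputs that are themselves graph nodes.
deps = {node: [i for i in inputs if i in graph] for node, (_, inputs) in graph.items()}
for node in TopologicalSorter(deps).static_order():
    op, inputs = graph[node]
    rhs = f" {op} ".join(inputs)
    print(f"{node} = {rhs};")
# t1 = a + b;
# t2 = t1 * c;
# out = t2 - a;
```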
226. Distributed temporal graph analytics with GRADOOP
- Author
-
Martin Junghanns, Erhard Rahm, Matthias Täschner, Lucas Schons, Timo Adameit, Gomez Kevin A, Philip Fritzsche, Christopher Rost, and Lukas Christ
- Subjects
Power graph analysis ,Structure (mathematical logic) ,Theoretical computer science ,Computer science ,business.industry ,Dataflow ,Data model ,Hardware and Architecture ,Analytics ,Scalability ,Pattern matching ,business ,Bitemporal Modeling ,MathematicsofComputing_DISCRETEMATHEMATICS ,Information Systems - Abstract
Temporal property graphs are graphs whose structure and properties change over time. Temporal graph datasets tend to be large due to the stored historical information, calling for scalable analysis capabilities. We give a complete overview of Gradoop, a graph dataflow system for scalable, distributed analytics of temporal property graphs which has been continuously developed since 2005. Its graph model TPGM allows bitemporal modeling not only of vertices and edges but also of graph collections. A declarative analytical language called GrALa allows analysts to flexibly define analytical graph workflows by composing different operators that support temporal graph analysis. Built on a distributed dataflow system, large temporal graphs can be processed on a shared-nothing cluster. We present the system architecture of Gradoop, its data model TPGM with composable temporal graph operators, such as snapshot, difference, pattern matching, and graph grouping, and several implementation details. We evaluate the performance and scalability of selected operators and a composed workflow for synthetic and real-world temporal graphs with up to 283 M vertices and 1.8 B edges, and a graph lifetime of about 8 years with up to 20 M new edges per year. We also reflect on lessons learned from the Gradoop effort. (A logical snapshot-operator sketch follows this entry.)
- Published
- 2021
- Full Text
- View/download PDF
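GrALa composes operators such as snapshot, difference, and pattern matching over temporal property graphs whose elements carry validity intervals. The snippet below only illustrates the logical behaviour of a snapshot operator (select the vertices and edges valid at a point in time) on invented tuples; it is not Gradoop's distributed TPGM implementation.

```python
# Minimal logical illustration of a temporal-graph snapshot operator.
# Each element carries a validity interval [valid_from, valid_to).
vertices = [("v1", 2014, 2025), ("v2", 2016, 2019), ("v3", 2018, 2025)]
edges = [("v1", "v2", 2016, 2019), ("v1", "v3", 2018, 2025)]

def snapshot(vertices, edges, t):
    """Return the vertices and edges that were valid at time t."""
    vs = sorted(vid for vid, start, end in vertices if start <= t < end)
    es = [(src, dst) for src, dst, start, end in edges
          if start <= t < end and src in vs and dst in vs]
    return vs, es

print(snapshot(vertices, edges, 2017))  # (['v1', 'v2'], [('v1', 'v2')])
print(snapshot(vertices, edges, 2020))  # (['v1', 'v3'], [('v1', 'v3')])
```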
227. Theoretical and applied aspects of library automation: Analyzing Russian Science Citation Index
- Subjects
Bibliometric analysis ,Impact factor ,Computer science ,Dataflow ,05 social sciences ,Science Citation Index ,Subject (documents) ,General Medicine ,050905 science studies ,World Wide Web ,Information and Communications Technology ,Information system ,Library automation ,0509 other social sciences ,050904 information & library sciences - Abstract
The 20-year document flow on library automation is analyzed based on the Russian Science Citation Index. The goal of the study was to evaluate the state of library automation as reflected in the document flow in library and information studies. The stages of library automation are specified, and the selection principles for retrieving and analyzing related publications are characterized. The dynamics and statistical data of the dataflow are presented, revealing sustained interest in the problem. Based on the 2- and 5-year impact factors, the 10 most productive journals on the subject are identified, among them the journals “Scientific and technical libraries” and “Bibliotekovedenie” [Russian Journal of Library Science]. Organizations leading in library automation are introduced. The sectoral and content structures of the array of interest are characterized, and the current trends in library automation, related to expanding ALIS functionality and implementing information and communication technologies, are identified. A ranked list of publications on individual automated library and information systems is included.
- Published
- 2021
- Full Text
- View/download PDF
228. A Real-Time Architecture for Pruning the Effectual Computations in Deep Neural Networks
- Author
-
Hyuk-Jae Lee, Lakshminarayanan Gopalakrishnan, Mohammadreza Asadikouhanjani, Hao Zhang, and Seok-Bum Ko
- Subjects
Speedup ,Artificial neural network ,Computer science ,Dataflow ,Reference design ,020208 electrical & electronic engineering ,Sorting ,02 engineering and technology ,Parallel computing ,Filter (video) ,0202 electrical engineering, electronic engineering, information engineering ,Benchmark (computing) ,Pruning (decision trees) ,Electrical and Electronic Engineering - Abstract
Integrating Deep Neural Networks (DNNs) into Internet of Things (IoT) devices could enable complex sensing and recognition tasks that support a new era of human interaction with surrounding environments. However, DNNs are power-hungry, performing billions of computations per inference. Spatial DNN accelerators, unlike other common architectures such as systolic arrays, can in principle support computation-pruning techniques. Energy-efficient DNN accelerators exploit bit-wise or word-wise sparsity in the input feature maps (ifmaps) and filter weights, skipping ineffectual computations. However, there is still room to prune effectual computations without reducing the accuracy of DNNs. In this paper, we propose a novel real-time architecture and dataflow that decomposes multiplications down to the bit level and prunes identical computations in spatial designs while running benchmark networks. The proposed architecture prunes identical computations by identifying identical bit values present in both the ifmaps and the filter weights, without changing the accuracy of the benchmark networks. Compared to the reference design, our proposed design achieves an average per-layer speedup of 1.4× and an energy efficiency improvement of 1.21× per inference while maintaining the accuracy of the benchmark networks.
- Published
- 2021
- Full Text
- View/download PDF
229. Dataflow-Aware Macro Placement Based on Simulated Evolution Algorithm for Mixed-Size Designs
- Author
-
You-Lun Deng, Jai-Ming Lin, Jia-Jian Chen, Po-Chen Lu, and Ya-Chu Yang
- Subjects
Very-large-scale integration ,Computer science ,Dataflow ,02 engineering and technology ,Data structure ,020202 computer hardware & architecture ,Constraint (information theory) ,Image stitching ,Hardware and Architecture ,Simulated annealing ,Hardware_INTEGRATEDCIRCUITS ,0202 electrical engineering, electronic engineering, information engineering ,Electrical and Electronic Engineering ,Routing (electronic design automation) ,Macro ,Algorithm ,Software - Abstract
This article proposes a novel approach to handle macro placement. Previous works usually apply the simulated annealing (SA) algorithm to handle this problem. However, the SA-based approaches usually have difficulty in handling preplaced macros and require longer runtime. To resolve these problems, we propose a macro placement procedure based on the corner stitching data structure and then apply an efficient and effective simulated evolution algorithm to further refine placement results. In order to relieve local routing congestion, we propose to expand areas of movable macros according to the design hierarchy before applying the macro placement algorithm. Finally, we extend our macro placement methodology to consider dataflow constraint so that dataflow-related macros can be placed at close locations. The experimental results show that our approach obtains a better solution than a previous macro placement algorithm and a tool. Besides, placement quality can be further improved when the dataflow constraint is considered.
- Published
- 2021
- Full Text
- View/download PDF
230. A Dataflow Language (AVON) as an Architecture Description Language (ADL)
- Author
-
Deb, Ashoke, Kleinjohann, Bernd, editor, Gao, Guang R., editor, Kopetz, Hermann, editor, Kleinjohann, Lisa, editor, and Rettberg, Achim, editor
- Published
- 2004
- Full Text
- View/download PDF
231. A Network Scheduling Method Based on Segmented Constraints for Convergence of Time-Sensitive Networking and Industrial Wireless Networks
- Author
-
Min Wei, Chang Liu, Jin Wang, and Shujie Yang
- Subjects
industrial wireless network ,time-sensitive networking ,scheduling method ,greedy algorithm ,dataflow ,Computer Networks and Communications ,Hardware and Architecture ,Control and Systems Engineering ,Signal Processing ,Electrical and Electronic Engineering - Abstract
In industrial applications, it is necessary to select different types of networks according to different communication requirements. To meet this requirement, a converged network of wired and wireless networks is frequently employed. Notably, fulfilling the end-to-end transmission requirements of converged networks is challenging. As a solution, converged-network scheduling methods have proved valuable. In this paper, a network scheduling method for the convergence of industrial wireless networks and time-sensitive networks is proposed. Additionally, the proposed method is tested and verified. The results show that the end-to-end average transmission delay is reduced and the jitter is acceptable.
- Published
- 2023
- Full Text
- View/download PDF
232. Teaching Programming Broadly and Deeply: The Kernel Language Approach
- Author
-
Van Roy, Peter, Haridi, Seif, Cassel, Lillian, editor, and Reis, Ricardo A., editor
- Published
- 2003
- Full Text
- View/download PDF
233. Understanding the design-space of sparse/dense multiphase GNN dataflows on spatial accelerators
- Author
-
Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Universitat Politècnica de Catalunya. Departament d'Enginyeria Electrònica, Universitat Politècnica de Catalunya. IDEAI-UPC - Intelligent Data sciEnce and Artificial Intelligence Research Group, Garg, Raveesh, Qin, Eric, Muñoz Martínez, Francisco, Guirado Liñan, Robert, Jain, Akshay, Abadal Cavallé, Sergi, Abellán Miguel, José Luis, Acacio Sánchez, Manuel E., Alarcón Cot, Eduardo José, Rajamanickam, Sivasankaran, and Krishna, Tushar
- Abstract
Graph Neural Networks (GNNs) have garnered a lot of recent interest because of their success in learning representations from graph-structured data across several critical applications in cloud and HPC. Owing to their unique compute and memory characteristics, which come from an interplay between dense and sparse phases of computation, the emergence of reconfigurable dataflow (aka spatial) accelerators offers promise for acceleration by mapping optimized dataflows (i.e., computation order and parallelism) for both phases. The goal of this work is to characterize and understand the design space of dataflow choices for running GNNs on spatial accelerators so that mappers or design-space exploration tools can optimize the dataflow based on the workload. Specifically, we propose a taxonomy that describes all possible choices for mapping the dense and sparse phases of GNN inference, spatially and temporally, over a spatial accelerator, capturing both the intra-phase dataflow and the inter-phase (pipelined) dataflow. Using this taxonomy, we do deep dives into the costs and benefits of several dataflows and perform case studies on the implications of hardware parameters for dataflows and on the value of flexibility to support pipelined execution. Parts of this work were supported through a fellowship by NEC Laboratories Europe, project grant PID2020-112827GB-I00 funded by MCIN/AEI/10.13039/501100011033, RTI2018-098156-B-C53 (MCIU/AEI/FEDER, UE), and grant 20749/FPI/18 from Fundación Séneca.
- Published
- 2022
234. The Digital Earth SMART monitoring concept and tools
- Author
-
Bouwer, L.M., Dransch, D., Ruhnke, R., Rechid, D., Frickenhaus, S., Greinert, J., Koedel, Uta, Dietrich, Peter, Fischer, P., Bundke, U., Burwicz-Galerne, E., Haas, A., Herrarte, I., Haroon, A., Jegen, M., Kalbacher, Thomas, Kennert, M., Korf, T., Kunkel, R., Kwok, Ching Yin, Mahnke, C., Nixdorf, Erik, Paasche, Hendrik, González Ávalos, E., Petzold, A., Rohs, S., Wagner, R., and Walter, A.
- Abstract
Reliable data are the basis of all scientific analyses, interpretations and conclusions. Evaluating data in a smart way speeds up interpretation and conclusion and highlights where, when and how additional data acquired in the field will support knowledge gain. An extended SMART monitoring concept is introduced which includes SMART sensors, DataFlows, MetaData and Sampling approaches and tools. In the course of the Digital Earth project, the meaning of SMART monitoring has evolved significantly. It stands for a combination of hardware and software tools that turn the traditional monitoring approach, in which a monitoring DataFlow is processed and analyzed sequentially on its way from the sensor to a repository, into an integrated analysis approach. The measured values themselves, their metadata, the status of the sensor, and additional auxiliary data can be made available in real time and analyzed to enhance the sensor output in terms of accuracy and precision. Although several parts of the four tools are known, technically feasible and sometimes applied in Earth science studies, there is still a large discrepancy between these ambitions and what is feasible and commonly done in practice in the field.
- Published
- 2022
235. Performance Analysis of GEMM Mappings in CPU Architectures
- Author
-
Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Georgia Institute of Technology, Abadal Cavallé, Sergi, Krishna, Tushar, and Gallego Jené, Martina
- Abstract
Deep Neural Networks (DNNs) have revolutionized the scene of Machine Learning (ML) with their ability to process large amounts of data, which allows them to model complex relationships and make predictions across multiple and diverse fields such as e-commerce, medicine and entertainment. With such great advantages comes a great computational cost. One of the operations that represents a large fraction of the computational workload of most types of DNNs, including Convolutional Neural Networks (CNNs) and Graph Neural Networks (GNNs), is General Matrix-Matrix Multiplication (GEMM). Nowadays, most of the research on improving the speed and efficiency of GEMMs focuses on accelerators and GPUs, but CPUs are still extensively present, and sometimes underutilized, in modern computing systems, especially in data centers. As a result, this thesis focuses on the performance analysis of GEMMs in CPU systems. In particular, the study explores the impact of different tiling, dataflow, and partitioning techniques on data reuse and data movement across a set of CPU architectures. The results of these simulations show that multi-core architectures are faster than single-core ones due to the high degree of parallelism in GEMMs. We have also observed that the impact of tiling on memory accesses and runtime is not uniform across the different matrix dimensions. Finally, we have observed that the input-stationary dataflow generally performs better than the output-stationary one, although there are some exceptions.
- Published
- 2022
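As a rough illustration of the dataflow comparison described in the entry above (not the thesis's simulator), the sketch below contrasts an output-stationary and an input-stationary loop ordering for a small, untiled GEMM; the matrix sizes are arbitrary assumptions.

```python
import numpy as np

def gemm_output_stationary(A, B):
    """Each output element C[i, j] stays 'stationary': its partial sum is
    accumulated to completion before moving to the next output."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N))
    for i in range(M):
        for j in range(N):
            acc = 0.0
            for k in range(K):
                acc += A[i, k] * B[k, j]
            C[i, j] = acc
    return C

def gemm_input_stationary(A, B):
    """Each input element A[i, k] stays 'stationary': it is loaded once and
    reused across a whole row of partial outputs before moving on."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N))
    for i in range(M):
        for k in range(K):
            a = A[i, k]           # reused N times before the next load
            for j in range(N):
                C[i, j] += a * B[k, j]
    return C

A = np.random.rand(8, 6)
B = np.random.rand(6, 4)
assert np.allclose(gemm_output_stationary(A, B), gemm_input_stationary(A, B))
```

Both orderings compute the same result; they differ only in which operand is kept in place and reused, which is exactly the kind of data-movement trade-off the thesis measures.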
236. Multithread Accelerators on FPGAs: A Dataflow-Based Approach
- Author
-
Ratto, Francesco, Esposito, Stefano, Sau, Carlo, Raffo, Luigi, and Palumbo, Francesca
- Abstract
Multithreading is a well-known technique in general-purpose systems for delivering a substantial performance gain, raising resource efficiency by exploiting periods of underutilization. With the spread of specialized hardware, resource efficiency has become fundamental to mastering the overhead that such devices introduce. In this work, we propose a model-based approach for designing specialized multithread hardware accelerators. This novel approach exploits dataflow models of applications and tagged tokens to let the resulting hardware support concurrent threads without replicating the whole accelerator. The assessment is carried out on different versions of an accelerator for a compute-intensive step of modern video coding algorithms, under several feeding configurations. The results highlight that the proposed multithread accelerators achieve a valuable tradeoff: they save computational resources with respect to replicated parallel single-thread accelerators, while guaranteeing shorter waiting, response, and processing times than a single time-multiplexed single-thread accelerator.
- Published
- 2022
- Full Text
- View/download PDF
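A toy sketch of the tagged-token idea from the entry above, in Python rather than hardware: tokens carry a thread tag, and an actor only fires when it holds a matching token on every input for the same tag, so several threads can share one datapath without replicating it. All names and the example are illustrative, not the authors' design.

```python
from collections import defaultdict

class TaggedTokenActor:
    """Fires when every input port holds a token with the same thread tag."""
    def __init__(self, n_inputs, compute):
        self.queues = [defaultdict(list) for _ in range(n_inputs)]  # tag -> values
        self.compute = compute

    def push(self, port, tag, value):
        self.queues[port][tag].append(value)
        return self._try_fire(tag)

    def _try_fire(self, tag):
        if all(q[tag] for q in self.queues):
            args = [q[tag].pop(0) for q in self.queues]
            return tag, self.compute(*args)
        return None

# Two interleaved "threads" (tags 0 and 1) share the same adder actor.
adder = TaggedTokenActor(2, lambda a, b: a + b)
print(adder.push(0, tag=0, value=3))   # None: waiting for the other operand of tag 0
print(adder.push(0, tag=1, value=10))  # None: tag 1 still incomplete
print(adder.push(1, tag=1, value=20))  # (1, 30): tag 1 fires first
print(adder.push(1, tag=0, value=4))   # (0, 7): tag 0 fires once its pair is complete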
237. Hierarchical Dataflow Modeling of Iterative Applications.
- Author
-
Hyesun Hong, Hyunok Oh, and Soonhoi Ha
- Subjects
DATA flow computing ,ELECTRONIC data processing ,ITERATIVE methods (Mathematics) ,NUMERICAL analysis ,NEURAL circuitry - Abstract
Even though dataflow models are good at exploiting the task-level parallelism of an application, it is difficult to exploit the parallelism of loop structures, since these are not explicitly specified in existing dataflow models. To overcome this drawback, we propose a novel extension of the SDF model, called the SDF/L graph, which specifies loop structures explicitly in a hierarchical fashion. Given an SDF/L graph specification together with mapping and scheduling information, an application can be automatically parallelized on a multicore system. The enhanced expressive capability of the proposed extension is verified with two applications: k-means clustering and a deep neural network application.
- Published
- 2017
- Full Text
- View/download PDF
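The SDF/L extension itself is defined in the paper above; purely as a loose illustration, the snippet below represents a dataflow graph in which a subgraph is wrapped in an explicit loop node with an iteration count, the kind of hierarchical structure a scheduler could unroll or parallelize. The data structures and the k-means-like example are hypothetical, not the authors' format.

```python
from dataclasses import dataclass, field
from typing import List, Union

@dataclass
class Actor:
    name: str
    produce: int = 1   # tokens produced per firing
    consume: int = 1   # tokens consumed per firing

@dataclass
class LoopNode:
    """Hierarchical node: an explicit loop wrapping a sub-dataflow graph."""
    iterations: int
    body: List[Union["Actor", "LoopNode"]] = field(default_factory=list)

def flatten(node, reps=1):
    """Expand the loop hierarchy into (actor, total repetitions) pairs,
    e.g. as input to a mapping/scheduling step."""
    if isinstance(node, Actor):
        return [(node.name, reps)]
    out = []
    for child in node.body:
        out += flatten(child, reps * node.iterations)
    return out

# Toy application: 'assign' and 'update' repeated 10 times inside one run.
app = LoopNode(iterations=1, body=[
    Actor("load"),
    LoopNode(iterations=10, body=[Actor("assign"), Actor("update")]),
    Actor("store"),
])
print(flatten(app))  # [('load', 1), ('assign', 10), ('update', 10), ('store', 1)]
```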
238. StaccatoLab: A Programming and Execution Model for Large‐scale Dataflow Computing
- Author
-
Kees van Berkel
- Subjects
Scale (ratio), Computer science, Dataflow, Dataflow programming, Parallel computing, Execution model
- Published
- 2021
- Full Text
- View/download PDF
239. Evaluation of the Exact Throughput of a Synchronous DataFlow Graph
- Author
-
Alix Munier Kordon and Bruno Bodin
- Subjects
Schedule, Computer science, Dataflow, Iterative method, 020206 networking & telecommunications, 02 engineering and technology, Parallel computing, Theoretical Computer Science, Set (abstract data type), Task (computing), Hardware and Architecture, Control and Systems Engineering, Modeling and Simulation, Signal Processing, 0202 electrical engineering, electronic engineering, information engineering, Graph (abstract data type), 020201 artificial intelligence & image processing, Compiler, Throughput (business), Information Systems
- Abstract
A Synchronous DataFlow Graph (SDFG for short) is a formalism frequently used in electronic design and software compilers to model communications between components with different rates. The development of efficient algorithms to evaluate the maximum throughput of SDFGs is a challenging question. This paper presents a mathematical framework to perform schedulability analysis and to compute the maximum throughput of SDFGs. The work focuses on strictly K-periodic schedules, for which a fixed set of execution times coupled with a period is associated with each task and defines a schedule of all of its executions. This class of schedules can always reach maximal throughput; we present an algorithm that computes the exact maximum throughput by iteratively generating K-periodic schedules until optimality is reached. The complexity of this iterative algorithm is studied using the well-established benchmarking suite SDF3 and compared against the most common throughput analysis techniques. We show improvements of several orders of magnitude over the state of the art, both in computation time and in the size of the final schedules.
- Published
- 2021
- Full Text
- View/download PDF
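Exact throughput evaluation as in the paper above requires the full K-periodic machinery; the sketch below only computes the repetition vector of a small SDF graph (how many times each actor fires per graph iteration), which is the usual starting point for such an analysis. The graph, rates, and helper names are illustrative, and a consistent, connected graph is assumed.

```python
from fractions import Fraction
from math import lcm

def repetition_vector(actors, edges):
    """Solve the SDF balance equations q[src]*prod == q[dst]*cons and return
    the smallest positive integer firing counts (assumes a connected, consistent graph)."""
    q = {actors[0]: Fraction(1)}
    changed = True
    while changed:                     # propagate rates along edges
        changed = False
        for src, dst, prod, cons in edges:
            if src in q and dst not in q:
                q[dst] = q[src] * prod / cons
                changed = True
            elif dst in q and src not in q:
                q[src] = q[dst] * cons / prod
                changed = True
    scale = lcm(*(f.denominator for f in q.values()))
    return {a: int(f * scale) for a, f in q.items()}

# Toy graph: A produces 2 tokens per firing, B consumes 3; B produces 1, C consumes 2.
actors = ["A", "B", "C"]
edges = [("A", "B", 2, 3), ("B", "C", 1, 2)]
print(repetition_vector(actors, edges))  # {'A': 3, 'B': 2, 'C': 1}
```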
240. I-Scheduler: Iterative scheduling for distributed stream processing systems
- Author
-
Leila Eskandari, David Eyers, Jason Mair, and Zhiyi Huang
- Subjects
Stream processing, Computer Networks and Communications, Hardware and Architecture, Computer science, Dataflow, Graph partition, Throughput, Parallel computing, Directed acyclic graph, Software, Scheduling (computing)
- Abstract
Task allocation in Data Stream Processing Systems (DSPSs) has a significant impact on performance metrics such as data processing latency and system throughput. An application processed by DSPSs can be represented as a Directed Acyclic Graph (DAG), where each vertex represents a task and the edges show the dataflow between the tasks. Task allocation can be defined as the assignment of the vertices in the DAG to the physical compute nodes such that the data movement between the nodes is minimised. Finding an optimal task placement for DSPSs is NP-hard. Thus, approximate scheduling approaches are required to improve the performance of DSPSs. In this paper, we propose a heuristic scheduling algorithm which reliably and efficiently finds highly communicating tasks by exploiting graph partitioning algorithms and a mathematical optimisation software package. We evaluate the communication cost of our method using three micro-benchmarks, showing that we can achieve results that are close to optimal. We further compare our scheduler with two popular existing schedulers, R-Storm and Aniello et al.’s ‘Online scheduler’ using two real-world applications. Our experimental results show that our proposed scheduler outperforms R-Storm, increasing throughput by up to 30%, and improves on the Online scheduler by 20%–86% as a result of finding a more efficient schedule.
- Published
- 2021
- Full Text
- View/download PDF
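I-Scheduler itself combines graph partitioning with a mathematical optimiser; as a much simpler stand-in, the sketch below greedily assigns the vertices of a task DAG to compute nodes, placing each task on the node that minimises traffic on cut edges subject to a capacity limit. The graph, capacities, and function names are made up for illustration.

```python
def cut_traffic(task, node, placement, edges):
    """Traffic on edges between `task` and already-placed neighbours on other nodes."""
    total = 0
    for (u, v), w in edges.items():
        if task == u:
            other = v
        elif task == v:
            other = u
        else:
            continue
        if other in placement and placement[other] != node:
            total += w
    return total

def greedy_placement(tasks, edges, n_nodes, capacity):
    """Assign each task to the compute node that minimises cut traffic,
    subject to a per-node capacity limit."""
    placement, load = {}, [0] * n_nodes
    for t in tasks:
        candidates = [n for n in range(n_nodes) if load[n] < capacity]
        best = min(candidates, key=lambda n: cut_traffic(t, n, placement, edges))
        placement[t] = best
        load[best] += 1
    return placement

# Toy streaming DAG: spout -> split -> {count1, count2} -> sink, heavier path via count1.
edges = {("spout", "split"): 10, ("split", "count1"): 8,
         ("split", "count2"): 2, ("count1", "sink"): 8, ("count2", "sink"): 2}
tasks = ["spout", "split", "count1", "count2", "sink"]
print(greedy_placement(tasks, edges, n_nodes=2, capacity=3))
# e.g. {'spout': 0, 'split': 0, 'count1': 0, 'count2': 1, 'sink': 1}
```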
241. SEIZE: Runtime Inspection for Parallel Dataflow Systems
- Author
-
Carlo Zaniolo, Matteo Interlandi, Youfu Li, Wei Wang, and Fotis Psallidas
- Subjects
Computer science, Dataflow, Programming language, Query optimization, Scheduling (computing), Computational Theory and Mathematics, Debugging, Hardware and Architecture, Computer cluster, Signal Processing, Spark (mathematics), Programming paradigm, Overhead (computing)
- Abstract
Many Data-Intensive Scalable Computing (DISC) systems provide easy-to-use functional APIs and efficient scheduling and execution strategies, allowing users to build concise data-parallel programs. In these systems, data transformations are concealed behind the exposed APIs, and intermediate execution states are masked under dataflow transitions. Consequently, many crucial features and optimizations (e.g., debugging, data provenance, runtime skew detection) that require runtime dataflow states are not well supported. Inspired by our experience in implementing features and optimizations over DISC systems, we present SEIZE, a unified framework that enables dataflow inspection (wiretapping the data path with listening logic) in the MapReduce-style programming model. We generalize our lessons learned by providing a set of primitives defining dataflow inspection, orchestration options for different inspection granularities, and an operator decomposition and dataflow punctuation strategy for dataflow intervention. We demonstrate the generality and flexibility of the approach by deploying SEIZE in both Apache Spark and Apache Flink, and by implementing a prototype runtime query optimizer for Spark. Our experiments show that the overhead introduced by the inspection logic is most of the time negligible (less than 5 percent in Spark and 10 percent in Flink).
- Published
- 2021
- Full Text
- View/download PDF
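To make the "wiretapping the data path with listening logic" idea from the entry above concrete outside of Spark or Flink, here is a framework-free sketch: a transformation is wrapped so an inspection listener observes records flowing through it without changing the result. The wrapper, listener, and skew probe are hypothetical, not SEIZE's actual primitives.

```python
def with_inspection(transform, listener):
    """Wrap a dataflow operator so a listener 'wiretaps' its input and output
    records; the dataflow result itself is unchanged."""
    def wrapped(records):
        for rec in records:
            listener("in", rec)
            out = transform(rec)
            listener("out", out)
            yield out
    return wrapped

skew_counter = {}
def skew_listener(direction, record):
    # Toy runtime-skew probe: count how often each key is seen on the way in.
    if direction == "in":
        key = record[0]
        skew_counter[key] = skew_counter.get(key, 0) + 1

double_values = with_inspection(lambda kv: (kv[0], kv[1] * 2), skew_listener)
data = [("a", 1), ("a", 2), ("b", 3), ("a", 4)]
print(list(double_values(data)))  # [('a', 2), ('a', 4), ('b', 6), ('a', 8)]
print(skew_counter)               # {'a': 3, 'b': 1}: key 'a' is skewed
```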
242. Probeware for the Modern Era: IoT Dataflow System Design for Secondary Classrooms
- Author
-
Sherry Hsi, Seth Van Doren, and Leslie G. Bondaryk
- Subjects
Multimedia, Distributed database, Computer science, Dataflow, Interface (computing), 05 social sciences, General Engineering, 050301 education, Dataflow programming, Cloud computing, Computer Science Applications, Education, Variety (cybernetics), Systems design, 0501 psychology and cognitive sciences, The Internet, 0503 education, 050107 human factors
- Abstract
Sensor systems have the potential to make abstract science phenomena concrete for K–12 students. Internet of Things (IoT) sensor systems provide a variety of benefits for modern classrooms, creating the opportunity for global data production, orienting learners to the opportunities and drawbacks of distributed sensor and control systems, and reducing classroom hardware burden by allowing many students to "listen" to the same data stream. To date, few robust IoT classroom systems have emerged, partially due to a lack of appropriate curricula and student-accessible interfaces, and partially due to a lack of classroom-compliant server technology. In this article, we present an architecture and sensor kit system that addresses issues of sensor ubiquity, acquisition clarity, data transparency, reliability, and security. The system has a dataflow programming interface to support both science practices and computational data practices, exposing the movement of data through programs and data files. The IoT Dataflow System supports authentic uses of computational tools for data production through this distributed cloud-based system, overcoming a variety of implementation challenges specific to making programs run for arbitrary durations on a variety of sensors. In practice, this system provides a number of unique yet unexplored educational opportunities. Early results from research conducted in a high school classroom show promise for Dataflow as a valuable learning technology.
- Published
- 2021
- Full Text
- View/download PDF
243. SymPas: Symbolic Program Slicing
- Author
-
Ying-Zhou Zhang
- Subjects
Computer science, Programming language, Dataflow, Context (language use), Slicing, Computer Science Applications, Theoretical Computer Science, Computational Theory and Mathematics, Hardware and Architecture, Virtual machine, Program Dependence Graph, Benchmark (computing), Program slicing, Graph (abstract data type), Software
- Abstract
Program slicing is a technique for simplifying programs by focusing on selected aspects of their behavior. Current mainstream static slicing methods operate on the program dependence graph (PDG) or the system dependence graph (SDG), but these convenient graph representations may be too expensive for some users. In this paper we study a lightweight approach to static program slicing, called Symbolic Program Slicing (SymPas), which works as a dataflow analysis on LLVM (low-level virtual machine). In our SymPas approach, slices are stored in symbolic form rather than obtained by re-analyzing procedures (cf. procedure summaries). Instead of re-analyzing a procedure multiple times to find its slices for each calling context, we calculate a single symbolic slice which can be instantiated at call sites, avoiding re-analysis; SymPas is implemented with LLVM to perform slicing on LLVM intermediate representation (IR). For comparison, we systematically adapt IFDS (interprocedural finite distributive subset) analysis and the SDG-based slicing method (SDG-IFDS) to statically slice IR programs. Evaluated on open-source and benchmark programs, our backward SymPas shows a factor-of-6 reduction in time cost and a factor-of-4 reduction in space cost compared with backward SDG-IFDS, and is thus more efficient. In addition, the results show that, after studying slices from 66 programs ranging up to 336,800 IR instructions in size, SymPas is highly size-scalable.
- Published
- 2021
- Full Text
- View/download PDF
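SymPas itself works interprocedurally on LLVM IR; purely as a toy to illustrate what a backward slice is, the sketch below slices a straight-line program by following def-use chains back from a criterion variable. It is not the SymPas algorithm, and the statement tuples are made up.

```python
def backward_slice(stmts, criterion):
    """stmts: list of (defined_var, used_vars) in program order.
    Return the indices of statements that can affect `criterion`."""
    relevant, in_slice = {criterion}, []
    for idx in range(len(stmts) - 1, -1, -1):     # walk backwards
        defined, used = stmts[idx]
        if defined in relevant:
            in_slice.append(idx)
            relevant.discard(defined)
            relevant |= set(used)
    return sorted(in_slice)

# 0: a = input(); 1: b = input(); 2: c = a + 1; 3: d = b * 2; 4: e = c + a
program = [("a", []), ("b", []), ("c", ["a"]), ("d", ["b"]), ("e", ["c", "a"])]
print(backward_slice(program, "e"))  # [0, 2, 4]: the statements about b and d are irrelevant
```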
244. Code-size-aware Scheduling of Synchronous Dataflow Graphs on Multicore Systems
- Author
-
Rizos Sakellariou and Mingze Ma
- Subjects
Dataflow, Heuristic (computer science), Computer science, 02 engineering and technology, Parallel computing, Code size, 020202 computer hardware & architecture, Scheduling (computing), Reduction (complexity), Hardware and Architecture, 0202 electrical engineering, electronic engineering, information engineering, Multicore systems, 020201 artificial intelligence & image processing, Throughput (business), Software, Digital signal processing
- Abstract
Synchronous dataflow graphs are widely used to model digital signal processing and multimedia applications. Self-timed execution is an efficient methodology for the analysis and scheduling of synchronous dataflow graphs. In this article, we propose a communication-aware self-timed execution approach to the problem of scheduling synchronous dataflow graphs on multicore systems with communication delays. Based on this approach, four communication-aware scheduling algorithms are proposed, using different allocation rules. Furthermore, a code-size-aware mapping heuristic is proposed and used jointly with the proposed scheduling algorithms to reduce the code size of SDFGs on multicore systems. The proposed scheduling algorithms are experimentally evaluated and found to perform better than existing algorithms in terms of throughput and runtime for several applications. The experiments also show that the proposed code-size-aware mapping approach can achieve significant code size reduction with limited throughput degradation in most cases.
- Published
- 2021
- Full Text
- View/download PDF
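Self-timed execution, as referenced in the entry above, simply fires an actor as soon as all of its input edges hold enough tokens; the minimal simulator below captures that rule only (no communication delays, mapping, or code-size model). The graph and names are illustrative.

```python
def self_timed_run(edges, firings, max_steps=100):
    """edges: list of [src, dst, prod, cons, tokens] (token count is mutable).
    Fire any enabled actor until each actor meets its firing budget."""
    fired = {a: 0 for a in firings}
    for _ in range(max_steps):
        progressed = False
        for actor, budget in firings.items():
            if fired[actor] >= budget:
                continue
            inputs = [e for e in edges if e[1] == actor]
            if all(e[4] >= e[3] for e in inputs):   # enough tokens on every input
                for e in inputs:
                    e[4] -= e[3]                    # consume
                for e in edges:
                    if e[0] == actor:
                        e[4] += e[2]                # produce
                fired[actor] += 1
                progressed = True
        if not progressed:
            break
    return fired

# A produces 2 tokens per firing, B consumes 3; repetition vector {A: 3, B: 2}.
edges = [["A", "B", 2, 3, 0]]
print(self_timed_run(edges, firings={"A": 3, "B": 2}))  # {'A': 3, 'B': 2}
```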
245. Cooperative Coevolution-based Design Space Exploration for Multi-mode Dataflow Mapping
- Author
-
Ke Tang, Xiaofen Lu, Bo Yuan, and Xin Yao
- Subjects
Theoretical computer science, Cooperative coevolution, Fitness approximation, Design space exploration, Dataflow, Computer science, Population, 02 engineering and technology, 020202 computer hardware & architecture, Hardware and Architecture, Genetic algorithm, 0202 electrical engineering, electronic engineering, information engineering, Graph (abstract data type), 020201 artificial intelligence & image processing, Electronic design automation, Software
- Abstract
Some signal processing and multimedia applications can be specified by synchronous dataflow (SDF) models. The problem of mapping an SDF onto a given set of heterogeneous processors is known to be NP-hard and has been widely studied in the design automation field. However, modern embedded applications are becoming increasingly complex, with dynamic behavior that changes over time. As a significant extension of the SDF, the multi-mode dataflow (MMDF) model has been proposed to specify such an application with a finite number of behaviors (or modes), where each mode is represented by an SDF graph. The multiprocessor mapping of an MMDF is far more challenging, as the design space grows with the number of modes. Instead of using a traditional genetic algorithm (GA)-based design space exploration (DSE) method that encodes the design space as a whole, this article proposes a novel cooperative co-evolutionary genetic algorithm (CCGA)-based framework that explores the design space efficiently through a problem-specific decomposition strategy in which the node-mapping solutions for each individual mode are assigned to a separate population. In addition, a problem-specific local search operator is introduced as a supplement to the global search of the CCGA to further improve the search efficiency of the whole framework. Furthermore, a fitness approximation method and a hybrid fitness evaluation strategy are applied to significantly reduce the time spent on fitness evaluation. The experimental studies demonstrate the advantage of the proposed DSE method over the previous GA-based method: it obtains results of 2x to 3x better quality in one-half to one-third of the optimization time.
- Published
- 2021
- Full Text
- View/download PDF
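As a rough sketch of the decomposition idea described in the entry above (one sub-population per mode, evaluated jointly), the code below co-evolves per-mode processor mappings with random mutation against a toy cost function. It omits crossover, the local search operator, and fitness approximation, and every number in it is invented.

```python
import random

random.seed(0)

def evaluate(mappings, mode_costs):
    """Joint fitness of one mapping per mode: here, the worst per-mode cost."""
    return max(mode_costs[m](mp) for m, mp in enumerate(mappings))

def ccga(n_modes, n_tasks, n_procs, mode_costs, pop_size=8, gens=50):
    # One sub-population per mode; an individual is a task -> processor mapping.
    pops = [[[random.randrange(n_procs) for _ in range(n_tasks)]
             for _ in range(pop_size)] for _ in range(n_modes)]
    best = [p[0] for p in pops]        # collaborators: current best of each mode
    for _ in range(gens):
        for m in range(n_modes):
            for ind in pops[m]:
                trial = ind[:]
                trial[random.randrange(n_tasks)] = random.randrange(n_procs)
                # Evaluate the trial in the context of the other modes' best mappings.
                ctx_old = best[:m] + [ind] + best[m + 1:]
                ctx_new = best[:m] + [trial] + best[m + 1:]
                if evaluate(ctx_new, mode_costs) <= evaluate(ctx_old, mode_costs):
                    ind[:] = trial
            best[m] = min(pops[m],
                          key=lambda i: evaluate(best[:m] + [i] + best[m + 1:], mode_costs))
    return best, evaluate(best, mode_costs)

# Toy per-mode cost: load imbalance of the mapping across processors.
def imbalance(mapping, n_procs=3):
    loads = [mapping.count(p) for p in range(n_procs)]
    return max(loads) - min(loads)

print(ccga(n_modes=2, n_tasks=9, n_procs=3, mode_costs=[imbalance, imbalance]))
```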
246. BOND
- Author
-
Zhichao Cao, Xiaolong Zheng, Wei Gong, and Qiang Ma
- Subjects
Computer Networks and Communications, Computer science, Dataflow, Node (networking), 020206 networking & telecommunications, 02 engineering and technology, Data loss, Bottleneck, 020204 information systems, 0202 electrical engineering, electronic engineering, information engineering, Network performance, Routing (electronic design automation), Hidden Markov model, Wireless sensor network, Computer network
- Abstract
In a large-scale wireless sensor network, hundreds or thousands of sensors sample and forward data back to the sink periodically. In two real outdoor deployments, GreenOrbs and CitySee, we observed that some bottleneck nodes strongly impact other nodes' data collection and thus degrade the performance of the whole network. To assess the importance of a node in the data collection process, the system manager needs to understand the interactive behaviors between parent and child nodes. We therefore present a management tool, BOND (BOttleneck Node Detector), which introduces the concept of Node Dependence to characterize how much a node relies on each of its parent nodes, models the routing process as a Hidden Markov Model, and uses a machine learning approach to learn the state transition probabilities of this model. Moreover, BOND can predict the network dataflow when nodes are added or removed, helping to avoid data loss and flow congestion during network redeployment. We implement BOND on real hardware and deploy it in an outdoor network system. Extensive experiments show that Node Dependence indeed helps to uncover hidden bottleneck nodes in the network, and that BOND infers Node Dependence with an average accuracy of more than 85%.
- Published
- 2021
- Full Text
- View/download PDF
247. High-Level Tools for Translation of C-Applications into Applications in Dataflow Language COLAMO
- Author
-
A.A. Gulenok, Ilya I. Levin, S.A. Dudko, V.A. Gudkov, Alexey I. Dordopulo, and A. V. Bovkun
- Subjects
Computer science, Dataflow, Programming language, Translation (geometry)
- Published
- 2021
- Full Text
- View/download PDF
248. SNAP: An Efficient Sparse Neural Acceleration Processor for Unstructured Sparse Deep Neural Network Inference
- Author
-
Ching-En Lee, Chester Liu, Jie-Fang Zhang, Zhengya Zhang, Yakun Sophia Shao, and Stephen W. Keckler
- Subjects
Pointwise, Artificial neural network, Dataflow, Computer science, Deep learning, 020208 electrical & electronic engineering, 02 engineering and technology, Chip, Computational science, Convolution, Reduction (complexity), CMOS, 0202 electrical engineering, electronic engineering, information engineering, Artificial intelligence, Electrical and Electronic Engineering, Throughput (business)
- Abstract
Recent developments in deep neural network (DNN) pruning introduce data sparsity that enables deep learning applications to run more efficiently on resource- and energy-constrained hardware platforms. However, these sparse models require specialized hardware structures to exploit the sparsity to the full extent for storage, latency, and efficiency improvements. In this work, we present the sparse neural acceleration processor (SNAP) to exploit unstructured sparsity in DNNs. SNAP uses parallel associative search to discover valid weight (W) and input activation (IA) pairs from compressed, unstructured, sparse W and IA data arrays. The associative search allows SNAP to maintain a 75% average compute utilization. SNAP follows a channel-first dataflow and uses a two-level partial sum (psum) reduction dataflow to eliminate access contention at the output buffer and cut the psum writeback traffic by 22x compared with state-of-the-art DNN accelerator designs. SNAP's psum reduction dataflow can be configured in two modes to support general convolution (CONV) layers, pointwise CONV, and fully connected layers. A prototype SNAP chip is implemented in a 16-nm CMOS technology. The 2.3-mm² test chip is measured to achieve a peak effectual efficiency of 21.55 TOPS/W (16 b) at 0.55 V and 260 MHz for CONV layers with 10% weight and activation densities. Operating on a pruned ResNet-50 network, the test chip achieves a peak throughput of 90.98 frames/s at 0.80 V and 480 MHz, dissipating 348 mW.
- Published
- 2021
- Full Text
- View/download PDF
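The chip described above performs this matching with parallel associative search in hardware; as a purely software analogue, the sketch below finds valid weight/activation pairs from two compressed sparse vectors by intersecting their index lists and accumulating a partial sum. The formats and names are illustrative, not SNAP's.

```python
def sparse_dot(w_idx, w_val, a_idx, a_val):
    """Dot product of two compressed sparse vectors (sorted index lists):
    only 'valid pairs' with matching indices contribute, as in sparse DNN inference."""
    i = j = 0
    psum = 0.0
    pairs = []
    while i < len(w_idx) and j < len(a_idx):
        if w_idx[i] == a_idx[j]:
            pairs.append((w_idx[i], w_val[i], a_val[j]))
            psum += w_val[i] * a_val[j]
            i += 1
            j += 1
        elif w_idx[i] < a_idx[j]:
            i += 1
        else:
            j += 1
    return pairs, psum

# Toy sparse vectors: nonzero weights at channels 2, 7, 9; activations at 2, 9, 15.
pairs, psum = sparse_dot([2, 7, 9], [0.5, -1.0, 2.0], [2, 9, 15], [4.0, 3.0, 1.0])
print(pairs)  # [(2, 0.5, 4.0), (9, 2.0, 3.0)]
print(psum)   # 8.0
```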
249. An efficient scheduling algorithm for dataflow architecture using loop-pipelining
- Author
-
Li Wenming, Meng Wu, Xiaochun Ye, Da Wang, Li Yi, Hao Zhang, Dongrui Fan, and Rui Xue
- Subjects
Loop optimization, Information Systems and Management, Multicast, Dataflow, Computer science, Node (networking), Instruction scheduling, Supercomputer, Computer Science Applications, Theoretical Computer Science, Scheduling (computing), Network congestion, Network on a chip, Software, Computer architecture, Artificial Intelligence, Control and Systems Engineering, Dataflow architecture
- Abstract
Dataflow architectures have native advantages in achieving high instruction parallelism and power efficiency for today's emerging applications such as high-performance computing and deep neural networks. In dataflow computing, the execution of instructions is driven by data, so the data transfer efficiency of the network on chip (NoC) is a key factor affecting performance. However, NoC performance degrades with the increasing use of multicast communication in many applications. Existing instruction scheduling algorithms for dataflow architectures do not optimize the multicast communication between an instruction and its successor instructions, so the routing paths of many multicast packets have forks, which waste bandwidth and can cause network congestion. We propose a sharing path awareness (SPA) algorithm to optimize multicast communication in dataflow architectures. Through the instruction scheduler, the algorithm shares the routing paths from an instruction to its child nodes to reduce wasted NoC bandwidth. For applications using software iteration, we further add loop optimization to the SPA algorithm to fully exploit instruction-level parallelism. Compared with the state-of-the-art algorithm, the SPA algorithm achieves a 20.21% average performance improvement and a 15.11% reduction in energy consumption on our experimental workloads.
- Published
- 2021
- Full Text
- View/download PDF
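A simplified illustration of the path-sharing idea in the entry above (not the SPA scheduler): on a mesh NoC with deterministic XY routing, a multicast from one instruction to several successors only pays once for the hops that the destination routes share, instead of sending separate copies along the common prefix. The coordinates and the routing function are assumptions.

```python
def xy_route(src, dst):
    """Deterministic XY routing on a mesh: list of hops (links) from src to dst."""
    hops, (x, y) = [], src
    while x != dst[0]:
        nx = x + (1 if dst[0] > x else -1)
        hops.append(((x, y), (nx, y)))
        x = nx
    while y != dst[1]:
        ny = y + (1 if dst[1] > y else -1)
        hops.append(((x, y), (x, ny)))
        y = ny
    return hops

def multicast_links(src, dsts, share_paths):
    """Count link traversals for a multicast; with path sharing, a link used by
    several destination routes is traversed only once."""
    all_hops = [h for d in dsts for h in xy_route(src, d)]
    return len(set(all_hops)) if share_paths else len(all_hops)

src, dsts = (0, 0), [(2, 1), (1, 3)]
print(multicast_links(src, dsts, share_paths=False))  # 7 hops as separate unicasts
print(multicast_links(src, dsts, share_paths=True))   # 6 hops once the common prefix is shared
```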
250. A Technology-Scalable Architecture for Fast Clocks and High ILP
- Author
-
Sankaralingam, Karthikeyan, Nagarajan, Ramadass, Burger, Doug, Keckler, Stephen W., Lee, Gyungho, editor, and Yew, Pen-Chung, editor
- Published
- 2001
- Full Text
- View/download PDF