5,379 results for "dataflow"
Search Results
202. Heterogeneous Design in Functional DIF
- Author
-
Plishker, William, Sane, Nimish, Kiemb, Mary, Bhattacharyya, Shuvra S., Hutchison, David, editor, Kanade, Takeo, editor, Kittler, Josef, editor, Kleinberg, Jon M., editor, Mattern, Friedemann, editor, Mitchell, John C., editor, Naor, Moni, editor, Nierstrasz, Oscar, editor, Pandu Rangan, C., editor, Steffen, Bernhard, editor, Sudan, Madhu, editor, Terzopoulos, Demetri, editor, Tygar, Doug, editor, Vardi, Moshe Y., editor, Weikum, Gerhard, editor, Bereković, Mladen, editor, Dimopoulos, Nikitas, editor, and Wong, Stephan, editor
- Published
- 2008
- Full Text
- View/download PDF
203. Memory-Centric Hardware Synthesis from Dataflow Models
- Author
-
Fischaber, Scott, McAllister, John, Woods, Roger, Hutchison, David, editor, Kanade, Takeo, editor, Kittler, Josef, editor, Kleinberg, Jon M., editor, Mattern, Friedemann, editor, Mitchell, John C., editor, Naor, Moni, editor, Nierstrasz, Oscar, editor, Pandu Rangan, C., editor, Steffen, Bernhard, editor, Sudan, Madhu, editor, Terzopoulos, Demetri, editor, Tygar, Doug, editor, Vardi, Moshe Y., editor, Weikum, Gerhard, editor, Bereković, Mladen, editor, Dimopoulos, Nikitas, editor, and Wong, Stephan, editor
- Published
- 2008
- Full Text
- View/download PDF
204. Streaming Systems in FPGAs
- Author
-
Neuendorffer, Stephen, Vissers, Kees, Hutchison, David, editor, Kanade, Takeo, editor, Kittler, Josef, editor, Kleinberg, Jon M., editor, Mattern, Friedemann, editor, Mitchell, John C., editor, Naor, Moni, editor, Nierstrasz, Oscar, editor, Pandu Rangan, C., editor, Steffen, Bernhard, editor, Sudan, Madhu, editor, Terzopoulos, Demetri, editor, Tygar, Doug, editor, Vardi, Moshe Y., editor, Weikum, Gerhard, editor, Bereković, Mladen, editor, Dimopoulos, Nikitas, editor, and Wong, Stephan, editor
- Published
- 2008
- Full Text
- View/download PDF
205. Verified Lustre Normalization with Node Subsampling
- Author
-
Basile Pesin, Paul Jeanmaire, Marc Pouzet, Timothy Bourke, Université Paris sciences et lettres (PSL), Département d'informatique de l'École normale supérieure (DI-ENS), École normale supérieure - Paris (ENS Paris), Université Paris sciences et lettres (PSL)-Université Paris sciences et lettres (PSL)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS), Parallélisme de Kahn Synchrone ( Parkas), Université Paris sciences et lettres (PSL)-Université Paris sciences et lettres (PSL)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-École normale supérieure - Paris (ENS Paris), Université Paris sciences et lettres (PSL)-Université Paris sciences et lettres (PSL)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-Centre National de la Recherche Scientifique (CNRS)-Inria de Paris, Institut National de Recherche en Informatique et en Automatique (Inria), BPI 'Programme d'Investissements d'Avenir' - ES3CAP, ANR-19-CE25-0014,FidelR,FidelR(2019), Département d'informatique - ENS Paris (DI-ENS), Centre National de la Recherche Scientifique (CNRS)-Institut National de Recherche en Informatique et en Automatique (Inria)-École normale supérieure - Paris (ENS Paris), Université Paris sciences et lettres (PSL)-Université Paris sciences et lettres (PSL), Inria de Paris, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Département d'informatique - ENS Paris (DI-ENS), Université Paris sciences et lettres (PSL)-Université Paris sciences et lettres (PSL)-Centre National de la Recherche Scientifique (CNRS)-École normale supérieure - Paris (ENS Paris), Université Paris sciences et lettres (PSL)-Université Paris sciences et lettres (PSL)-Centre National de la Recherche Scientifique (CNRS), École normale supérieure - Paris (ENS-PSL), and Université Paris sciences et lettres (PSL)-Université Paris sciences et lettres (PSL)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-École normale supérieure - Paris (ENS-PSL)
- Subjects
CCS Concepts: • Software and its engineering → Formal language definitions ,Normalization (statistics) ,interactive theorem proving ,Correctness ,Computer science ,Semantics (computer science) ,Dataflow ,• Computer systems organization → Embedded software stream languages ,02 engineering and technology ,computer.software_genre ,Software verification ,020204 information systems ,0202 electrical engineering, electronic engineering, information engineering ,computer.programming_language ,[INFO.INFO-PL]Computer Science [cs]/Programming Languages [cs.PL] ,Assembly language ,Programming language ,Lustre (programming language) ,Proof assistant ,[INFO.INFO-LO]Computer Science [cs]/Logic in Computer Science [cs.LO] ,020207 software engineering ,verified compilation ,Compilers ,Hardware and Architecture ,TheoryofComputation_LOGICSANDMEANINGSOFPROGRAMS ,[INFO.INFO-ES]Computer Science [cs]/Embedded Systems ,Compiler ,computer ,Software - Abstract
Dataflow languages allow the specification of reactive systems by mutually recursive stream equations, functions, and boolean activation conditions called clocks. Lustre and Scade are dataflow languages for programming embedded systems. Dataflow programs are compiled by a succession of passes. This article focuses on the normalization pass, which rewrites programs into the simpler form required for code generation. Vélus is a compiler from a normalized form of Lustre to CompCert’s Clight language. Its specification in the Coq interactive theorem prover includes an end-to-end correctness proof that the values prescribed by the dataflow semantics of source programs are produced by executions of generated assembly code. We describe how to extend Vélus with a normalization pass and allow subsampled node inputs and outputs. We propose semantic definitions for the unrestricted language, divide normalization into three steps to facilitate proofs, adapt the clock type system to handle richer node definitions, and extend the end-to-end correctness theorem to incorporate the new features. The proofs require reasoning about the relation between static clock annotations and the presence and absence of values in the dynamic semantics. The generalization of node inputs requires adding a compiler pass to ensure the initialization of variables passed in function calls. (A small sketch of the normalization idea follows this entry.)
- Published
- 2021
- Full Text
- View/download PDF
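The normalization pass summarized above rewrites dataflow programs into the simpler form required for code generation. Vélus itself is specified and proven in Coq; purely to illustrate the flavour of such a pass, here is a minimal Python sketch (invented representation and names, no clocks, no proofs) that flattens nested stream expressions into a sequence of simple equations by introducing fresh variables.

```python
# Minimal, illustrative sketch of expression normalization: nested expressions
# are flattened into simple equations by introducing fresh variables.
# This is NOT the Velus algorithm; names and representation are hypothetical.
import itertools

_fresh = itertools.count()

def fresh_var():
    return f"t{next(_fresh)}"

def normalize(expr, equations):
    """Flatten `expr` (a nested tuple like ("+", a, b), or a variable/constant)
    into a simple operand, appending intermediate equations to `equations`."""
    if not isinstance(expr, tuple):          # variable or constant: already simple
        return expr
    op, *args = expr
    simple_args = [normalize(a, equations) for a in args]
    v = fresh_var()
    equations.append((v, (op, *simple_args)))  # v = op(simple_args)
    return v

# Example: y = (a + b) * (a + 1) becomes a list of simple equations.
eqs = []
result = normalize(("*", ("+", "a", "b"), ("+", "a", 1)), eqs)
eqs.append(("y", result))
for lhs, rhs in eqs:
    print(lhs, "=", rhs)
```

The real pass must also handle clocks and operators such as fby, when, and merge, and its correctness is established end-to-end in the Coq development; none of that is attempted here.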
206. Enhancing the Utilization of Processing Elements in Spatial Deep Neural Network Accelerators
- Author
-
Seok-Bum Ko and Mohammadreza Asadikouhanjani
- Subjects
050210 logistics & transportation ,0209 industrial biotechnology ,Speedup ,Dataflow ,Least slack time scheduling ,Computer science ,business.industry ,Deep learning ,Reference design ,05 social sciences ,02 engineering and technology ,Computer Graphics and Computer-Aided Design ,020901 industrial engineering & automation ,Network on a chip ,Computer engineering ,0502 economics and business ,Bandwidth (computing) ,Overhead (computing) ,Artificial intelligence ,Electrical and Electronic Engineering ,business ,Software - Abstract
Equipping mobile platforms with deep learning applications is very valuable: providing healthcare services in remote areas, improving privacy, and lowering the required communication bandwidth are advantages of such platforms. Designing an efficient computation engine enhances the performance of these platforms while running deep neural networks (DNNs). Energy-efficient DNN accelerators use sparsity skipping and early negative output feature detection to prune computations. Spatial DNN accelerators, unlike other common architectures such as systolic arrays, can in principle support such computation-pruning techniques. To run these techniques efficiently and avoid network-on-chip (NoC) stalls, they need a separate, high-bandwidth data distribution fabric such as buses or trees. Spatial designs also suffer from divergence and unequal work distribution. Therefore, applying computation-pruning techniques to a spatial design still causes stalls inside the computation engine, even when it is equipped with an NoC that supplies high bandwidth to the processing elements (PEs). In a spatial architecture, the PEs that finish their tasks earlier have slack time compared to the others. In this article, we propose an architecture with negligible area overhead that shares the scratchpads between the PEs in a novel way to exploit the slack time caused by computation-pruning techniques or the NoC format used. With our dataflow, a spatial engine can benefit from computation-pruning and data-reuse techniques more efficiently. Compared to the reference design, the proposed method achieves a speedup of 1.24× and an energy efficiency improvement of 1.18× per inference.
- Published
- 2021
- Full Text
- View/download PDF
207. EnGN: A High-Throughput and Energy-Efficient Accelerator for Large Graph Neural Networks
- Author
-
Dawen Xu, Ying Wang, Lei He, Huawei Li, Xiaowei Li, Shengwen Liang, and Cheng Liu
- Subjects
Speedup ,Artificial neural network ,Computer science ,Dataflow ,Graph theory ,02 engineering and technology ,Parallel computing ,Data structure ,020202 computer hardware & architecture ,Theoretical Computer Science ,Memory management ,Computational Theory and Mathematics ,Hardware and Architecture ,0202 electrical engineering, electronic engineering, information engineering ,Hardware acceleration ,Overhead (computing) ,Software - Abstract
Graph neural networks (GNNs) have emerged as a powerful approach to processing non-Euclidean data structures and have proven effective in various application domains such as social networks and e-commerce. The graph data maintained in real-world systems can be extremely large and sparse, so employing GNNs to process them requires substantial computation and memory, which induces considerable energy and resource costs on CPUs and GPUs. In this article, we present a specialized accelerator architecture, EnGN, to enable high-throughput and energy-efficient processing of large-scale GNNs. EnGN is designed to accelerate the three key stages of GNN propagation, which are abstracted as common computing patterns shared by typical GNNs. To support these stages simultaneously, we propose the ring-edge-reduce (RER) dataflow, which tames the poor locality of sparsely and randomly connected vertices, and the RER PE array that implements it. In addition, we utilize a graph tiling strategy to fit large graphs into EnGN and make good use of the hierarchical on-chip buffers through adaptive computation reordering and tile scheduling. Overall, EnGN achieves performance speedups of 1802.9X, 19.75X, and 2.97X and energy-efficiency gains of 1326.35X, 304.43X, and 6.2X on average compared to a CPU, a GPU, and the state-of-the-art GCN accelerator HyGCN, respectively. (A small software sketch of the abstracted propagation stages follows this entry.)
- Published
- 2021
- Full Text
- View/download PDF
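EnGN accelerates three key stages of GNN propagation that the authors abstract as computing patterns common to typical GNNs. As a point of reference only, the sketch below shows how such a propagation step is commonly written in software: neighbour features are aggregated and then passed through a dense update. The stage split and the ReLU update are assumptions for illustration; this is not EnGN's RER dataflow or hardware.

```python
import numpy as np

def gnn_layer(features, adj, weight):
    """One GNN propagation layer split into the stages typically abstracted
    by GNN accelerators: gather/aggregate neighbour features, then update.
    features: (N, F) vertex features, adj: (N, N) 0/1 adjacency,
    weight: (F, F') dense layer weight. Illustrative only."""
    # Aggregate: sum features of neighbouring vertices. With a sparse,
    # randomly connected adjacency this step has the poor locality the
    # ring-edge-reduce dataflow is designed to tame.
    aggregated = adj @ features
    # Update: dense transformation plus a non-linearity (ReLU assumed here).
    return np.maximum(aggregated @ weight, 0.0)

rng = np.random.default_rng(0)
n, f_in, f_out = 6, 4, 3
adj = (rng.random((n, n)) < 0.3).astype(np.float64)
x = rng.random((n, f_in))
w = rng.random((f_in, f_out))
print(gnn_layer(x, adj, w).shape)   # (6, 3)
```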
208. Analytical Performance Estimation for Large-Scale Reconfigurable Dataflow Platforms
- Author
-
Ce Guo, Ryota Yasudo, Jose G. F. Coutinho, Ana Lucia Varbanescu, Wayne Luk, Hideharu Amano, and Tobias Becker
- Subjects
General Computer Science ,Scale (ratio) ,Computer engineering ,Computer science ,Dataflow ,Performance estimation ,Path (graph theory) ,Field-programmable gate array - Abstract
Next-generation high-performance computing platforms will handle extreme data- and compute-intensive problems that are intractable with today’s technology. A promising path towards the next leap in high-performance computing is to embrace heterogeneity and specialised computing in the form of reconfigurable accelerators such as FPGAs, which have been shown to speed up compute-intensive tasks with reduced power consumption. However, assessing the feasibility of large-scale heterogeneous systems requires fast and accurate performance prediction. This article proposes Performance Estimation for Reconfigurable Kernels and Systems (PERKS), a novel performance estimation framework for reconfigurable dataflow platforms. PERKS uses an analytical model with machine and application parameters to predict the performance of multi-accelerator systems and detect their bottlenecks. Model calibration is automatic, making the model flexible and usable for different machine configurations and applications, including hypothetical ones. Our experimental results show that PERKS can predict the performance of current workloads on reconfigurable dataflow platforms with an accuracy above 91%. The results also illustrate how the modelling scales to large workloads and how the performance impact of architectural features can be estimated in seconds. (See the sketch after this entry.)
- Published
- 2021
- Full Text
- View/download PDF
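PERKS predicts multi-accelerator performance from an analytical model with machine and application parameters, but the abstract does not spell out the model. The sketch below is a hedged, roofline-style stand-in with invented parameter names: each kernel is bounded by either compute throughput or memory bandwidth, and accelerators share an interconnect. It only illustrates what an analytical estimate of this kind can look like, not PERKS' actual equations or calibration.

```python
def kernel_time(flops, bytes_moved, peak_flops, mem_bw):
    """Roofline-style time estimate for one dataflow kernel on one accelerator:
    the kernel is limited either by compute or by data movement.
    Illustrative only; PERKS' actual model and calibration are more detailed."""
    return max(flops / peak_flops, bytes_moved / mem_bw)

def system_time(kernels, n_accel, peak_flops, mem_bw, link_bw, bytes_exchanged):
    """Naive multi-accelerator estimate: kernels are spread evenly over the
    accelerators and inter-accelerator traffic goes over a shared link."""
    per_accel = [kernel_time(f, b, peak_flops, mem_bw) for f, b in kernels]
    compute = sum(per_accel) / n_accel           # ideal load balance (assumption)
    communication = bytes_exchanged / link_bw    # shared interconnect (assumption)
    return compute + communication

# Example: two kernels on 4 FPGAs with hypothetical machine parameters.
kernels = [(2e12, 4e10), (8e11, 1.2e11)]         # (flops, bytes) per kernel
t = system_time(kernels, n_accel=4,
                peak_flops=5e12, mem_bw=6e10, link_bw=1e10, bytes_exchanged=2e10)
print(f"estimated time: {t:.3f} s")
```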
209. Dynamic Dataflow Scheduling and Computation Mapping Techniques for Efficient Depthwise Separable Convolution Acceleration
- Author
-
Longjun Liu, Xuchong Zhang, Hang Wang, Nanning Zheng, Jie Ren, Hongbin Sun, and Baoting Li
- Subjects
Hardware architecture ,Dataflow ,Computer science ,Reference design ,System on a chip ,Parallel computing ,Electrical and Electronic Engineering ,Field-programmable gate array ,Convolutional neural network ,Convolution ,Scheduling (computing) - Abstract
Depthwise separable convolution (DSC) has become one of the essential structures for lightweight convolutional neural networks. Nevertheless, its hardware architecture has not received much attention. Several previous hardware designs incur either high off-chip memory traffic or large on-chip memory usage, and hence are deficient in terms of hardware efficiency as well as performance. This paper proposes two efficient dynamic design techniques, adaptive row-based dataflow scheduling and adaptive computation mapping, to achieve a much better trade-off between hardware efficiency and performance for a DSC-based lightweight CNN accelerator. The effectiveness and efficiency of the proposed dynamic design techniques have been extensively evaluated using six DSC-based lightweight CNNs. Compared with the reference architectures, the simulation results show that the proposed architectural techniques reduce the on-chip buffer size by at least 50.4% and improve the performance of the convolution calculation by 1.18× while maintaining the minimum off-chip memory traffic. MobileNetV2 is implemented on a Zynq UltraScale+ ZCU102 SoC FPGA, and the results show that the proposed accelerator achieves 381.7 frames per second (fps), which is 1.43× that of the reference design, while saving about 36.3% of the on-chip buffer size and maintaining the same off-chip memory traffic.
- Published
- 2021
- Full Text
- View/download PDF
210. Evolutionary Multi-Objective Model Compression for Deep Neural Networks
- Author
-
Rick Siow Mong Goh, Liangli Zhen, Joey Tianyi Zhou, Tao Luo, Miqing Li, and Zhehui Wang
- Subjects
education.field_of_study ,Dataflow ,business.industry ,Deep learning ,Population ,Energy consumption ,Theoretical Computer Science ,Computer engineering ,Artificial Intelligence ,Pruning (decision trees) ,Artificial intelligence ,Language translation ,education ,Quantization (image processing) ,business ,Efficient energy use - Abstract
While deep neural networks (DNNs) deliver state-of-the-art accuracy on various applications from face recognition to language translation, this comes at the cost of high computational and space complexity, hindering their deployment on edge devices. To enable efficient processing of DNNs in inference, a novel approach, called Evolutionary Multi-Objective Model Compression (EMOMC), is proposed to optimize energy efficiency (or model size) and accuracy simultaneously. Specifically, the network pruning and quantization space is explored and exploited through the evolution of an architecture population. Furthermore, by taking advantage of the orthogonality between pruning and quantization, a two-stage pruning and quantization co-optimization strategy is developed, which considerably reduces the time cost of the architecture search. Lastly, different dataflow designs and parameter coding schemes are considered in the optimization process, since they have a significant impact on energy consumption and model size. Owing to the cooperative evolution of different architectures in the population, a set of compact DNNs that offer trade-offs on different objectives (e.g., accuracy, energy efficiency, and model size) can be obtained in a single run. Unlike most existing approaches, which are designed to reduce the size of weight parameters with no significant loss of accuracy, the proposed method aims to achieve a trade-off between desirable objectives in order to meet the different requirements of various edge devices. Experimental results demonstrate that the proposed approach can obtain a diverse population of compact DNNs that are suitable for a broad range of memory usage and energy consumption requirements. Under negligible accuracy loss, EMOMC improves the energy efficiency and model compression rate of VGG-16 on CIFAR-10 by factors of more than 8.9X and 2.4X, respectively. (See the Pareto-selection sketch after this entry.)
- Published
- 2021
- Full Text
- View/download PDF
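EMOMC evolves a population of compressed networks and returns a set of candidates that trade accuracy against energy efficiency or model size. The sketch below illustrates only the Pareto-dominance filtering that underlies any such multi-objective selection; the candidate values are invented and EMOMC's pruning/quantization co-optimization is not modelled.

```python
def dominates(a, b):
    """a dominates b if it is no worse in both objectives and strictly better
    in at least one. Objectives: (accuracy to maximise, energy to minimise)."""
    acc_a, en_a = a
    acc_b, en_b = b
    return (acc_a >= acc_b and en_a <= en_b) and (acc_a > acc_b or en_a < en_b)

def pareto_front(population):
    """Keep only non-dominated candidates (the trade-off set that a
    multi-objective search returns from a single run)."""
    return [p for p in population
            if not any(dominates(q, p) for q in population if q is not p)]

# Hypothetical compressed-model candidates: (accuracy, energy per inference).
candidates = [(0.93, 1.00), (0.92, 0.55), (0.90, 0.40), (0.89, 0.60), (0.85, 0.38)]
print(pareto_front(candidates))
# -> [(0.93, 1.0), (0.92, 0.55), (0.9, 0.4), (0.85, 0.38)]
```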
211. nZESPA: A Near-3D-Memory Zero Skipping Parallel Accelerator for CNNs
- Author
-
Palash Das and Hemangee K. Kapoor
- Subjects
Hybrid Memory Cube ,Exploit ,Computer science ,Dataflow ,Feature extraction ,02 engineering and technology ,Energy consumption ,Parallel computing ,Computer Graphics and Computer-Aided Design ,Convolutional neural network ,020202 computer hardware & architecture ,Parallel processing (DSP implementation) ,Encoding (memory) ,0202 electrical engineering, electronic engineering, information engineering ,Electrical and Electronic Engineering ,Software - Abstract
Convolutional neural networks (CNNs) are among the most popular machine learning tools for computer vision. Their ubiquitous use across applications, combined with their high computation cost, makes them attractive targets for optimization through accelerator architectures. The state of the art has exploited the parallelism of CNNs, eliminated computations through sparsity, or used near-memory processing (NMP) to accelerate CNNs. We introduce an NMP, fully sparse architecture that combines all three capabilities. The proposed architecture is parallel and hence processes independent CNN tasks concurrently. To exploit sparsity, the proposed system employs a dataflow, namely the Near-3D-Memory Zero Skipping Parallel (nZESPA) dataflow. This dataflow maintains a compressed-sparse encoding of data that skips all ineffectual zero-valued computations of CNNs. We design a custom accelerator that employs the nZESPA dataflow. Grids of nZESPA modules are integrated into the logic layer of the hybrid memory cube. This integration saves a significant amount of off-chip communication while implementing the concept of NMP. We compare the proposed architecture with three other architectures that either do not exploit sparsity (NMP-dense), do not employ NMP (traditional-fully sparse), or do neither (traditional-dense). The proposed system outperforms the baselines in terms of performance and energy consumption while executing CNN inference. (A zero-skipping sketch follows this entry.)
- Published
- 2021
- Full Text
- View/download PDF
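The nZESPA dataflow keeps data in a compressed-sparse encoding so that zero-valued (ineffectual) computations are skipped. As a purely software illustration of that idea, and not the accelerator's actual encoding or near-memory dataflow, the sketch below stores only non-zero entries with their indices and multiplies-and-accumulates only where both operands are present.

```python
def to_sparse(vec):
    """Compressed representation: {index: nonzero value}."""
    return {i: v for i, v in enumerate(vec) if v != 0}

def sparse_dot(ifmap, weights):
    """Multiply-accumulate only where both the activation and the weight are
    non-zero; every skipped pair is an 'ineffectual' computation a dense
    engine would still execute. Illustrative sketch, not nZESPA's dataflow."""
    a, w = to_sparse(ifmap), to_sparse(weights)
    # Iterate over the smaller operand for fewer lookups.
    small, large = (a, w) if len(a) <= len(w) else (w, a)
    return sum(v * large[i] for i, v in small.items() if i in large)

ifmap   = [0, 3, 0, 0, 2, 0, 1, 0]
weights = [1, 0, 0, 4, 5, 0, 2, 0]
print(sparse_dot(ifmap, weights))   # 2*5 + 1*2 = 12
```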
212. Fine Grain Distributed Implementation of a Dataflow Language with Provable Performances
- Author
-
Gautier, Thierry, Roch, Jean-Louis, Wagner, Frédéric, Hutchison, David, editor, Kanade, Takeo, editor, Kittler, Josef, editor, Kleinberg, Jon M., editor, Mattern, Friedemann, editor, Mitchell, John C., editor, Naor, Moni, editor, Nierstrasz, Oscar, editor, Rangan, C. Pandu, editor, Steffen, Bernhard, editor, Sudan, Madhu, editor, Terzopoulos, Demetri, editor, Tygar, Doug, editor, Vardi, Moshe Y., editor, Weikum, Gerhard, editor, Shi, Yong, editor, van Albada, Geert Dick, editor, Dongarra, Jack, editor, and Sloot, Peter M. A., editor
- Published
- 2007
- Full Text
- View/download PDF
213. Performance Analysis of GEMM Mappings on CPU Architectures
- Author
-
Gallego Jené, Martina, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Georgia Institute of Technology, Abadal Cavallé, Sergi, and Krishna, Tushar
- Subjects
Neural networks (Computer science) ,Enginyeria de la telecomunicació::Telemàtica i xarxes d'ordinadors [Àrees temàtiques de la UPC] ,GEMM ,flujo de datos ,tiling ,Xarxes neuronals (Informàtica) ,dataflow ,DNN - Abstract
Deep Neural Networks (DNNs) have revolutionized the Machine Learning (ML) scene with their ability to process large amounts of data, which allows them to model complex relationships and make predictions across multiple and diverse fields such as e-commerce, medicine, and entertainment. These great advantages come with a great computational cost. One of the operations that represents a large fraction of the computational workload of most types of DNNs, including Convolutional Neural Networks (CNNs) and Graph Neural Networks (GNNs), is General Matrix-Matrix Multiplication (GEMM). Nowadays, most of the research on improving the speed and efficiency of GEMMs focuses on accelerators and GPUs, but CPUs are still extensively present, and sometimes underutilized, in modern computing systems, especially in data centers. As a result, this thesis focuses on the performance analysis of GEMMs on CPU systems. In particular, this study explores the impact of different tiling, dataflow, and partitioning techniques on data reuse and data movement across a set of CPU architectures. The results obtained from these simulations show that multi-core architectures are faster than single-core ones due to the high degree of parallelism in GEMMs. We have also observed that the impact of tiling on memory accesses and runtime is not uniform across the different matrix dimensions. Finally, we have observed that the input-stationary dataflow generally performs better than the output-stationary one, although there are some exceptions. (A tiled-GEMM sketch follows this entry.)
- Published
- 2022
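The thesis compares tiling and dataflow choices (input-stationary versus output-stationary) for GEMM on CPUs. To make those terms concrete, here is a minimal tiled GEMM sketch in which each output tile stays resident while input tiles stream past it, i.e. an output-stationary loop order; the tile size and the mapping of terminology to this loop nest are illustrative assumptions, not the simulator configurations used in the thesis.

```python
import numpy as np

def tiled_gemm(A, B, tile=4):
    """C = A @ B computed tile by tile. For each (i, j) output tile the partial
    sums stay resident (output-stationary) while tiles of A and B stream through;
    reordering the loops so a tile of A stays put instead would make the
    dataflow input-stationary. Illustrative only."""
    m, k = A.shape
    k2, n = B.shape
    assert k == k2
    C = np.zeros((m, n))
    for i0 in range(0, m, tile):
        for j0 in range(0, n, tile):
            acc = np.zeros((min(tile, m - i0), min(tile, n - j0)))
            for k0 in range(0, k, tile):   # reduction dimension streams through
                acc += A[i0:i0 + tile, k0:k0 + tile] @ B[k0:k0 + tile, j0:j0 + tile]
            C[i0:i0 + tile, j0:j0 + tile] = acc
    return C

A = np.random.rand(6, 5)
B = np.random.rand(5, 7)
print(np.allclose(tiled_gemm(A, B), A @ B))   # True
```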
214. Towards a Message Broker Free FaaS for Distributed Dataflow Applications
- Author
-
Fortier, Patrik, Mouel, Frederic, Ponge, Julien, Dynamic Software and Distributed Systems (DYNAMID), CITI Centre of Innovation in Telecommunications and Integration of services (CITI), Institut National des Sciences Appliquées de Lyon (INSA Lyon), Université de Lyon-Institut National des Sciences Appliquées (INSA)-Université de Lyon-Institut National des Sciences Appliquées (INSA)-Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National des Sciences Appliquées de Lyon (INSA Lyon), Université de Lyon-Institut National des Sciences Appliquées (INSA)-Université de Lyon-Institut National des Sciences Appliquées (INSA)-Institut National de Recherche en Informatique et en Automatique (Inria), and Red Hat Inc.
- Subjects
FaaS ,Multi-tier Programming ,Macro Programming ,Distributed Systems ,Dataflow ,[INFO]Computer Science [cs] - Abstract
We present an extended implementation of Dyninka, a framework for prototyping FaaS-based distributed dataflow applications. Its programming model gathers the definition and composition of services within a single file using the multi-tier programming paradigm, and compiles them into multiple services to be deployed on a cloud computing infrastructure. Our framework is built without a gateway or a messaging platform: services communicate directly with each other within the abstracted cloud infrastructure. As a result, we emancipate ourselves from message brokers and reduce the network and computation overheads introduced by other FaaS frameworks such as OpenFaaS. We validated our approach on a Fog computing scenario with limited resources and several load profiles. Our framework shows better stability, higher throughput, and reduced overhead compared to OpenFaaS.
- Published
- 2022
- Full Text
- View/download PDF
215. Automated Generation and Evaluation of Dataflow-Based Test Data for Object-Oriented Software
- Author
-
Oster, Norbert, Hutchison, David, editor, Kanade, Takeo, editor, Kittler, Josef, editor, Kleinberg, Jon M., editor, Mattern, Friedemann, editor, Mitchell, John C., editor, Naor, Moni, editor, Nierstrasz, Oscar, editor, Pandu Rangan, C., editor, Steffen, Bernhard, editor, Sudan, Madhu, editor, Terzopoulos, Demetri, editor, Tygar, Dough, editor, Vardi, Moshe Y., editor, Weikum, Gerhard, editor, Reussner, Ralf, editor, Mayer, Johannes, editor, Stafford, Judith A., editor, Overhage, Sven, editor, Becker, Steffen, editor, and Schroeder, Patrick J., editor
- Published
- 2005
- Full Text
- View/download PDF
216. The ontology-based modeling and evolution of digital twin for assembly workshop
- Author
-
Wei Wang, Yong Yu, Qiangwei Bao, Gang Zhao, and Sheng Dai
- Subjects
Process (engineering) ,Computer science ,business.industry ,Dataflow ,Mechanical Engineering ,Tracing ,Ontology (information science) ,Field (computer science) ,Industrial and Manufacturing Engineering ,Task (project management) ,Computer Science Applications ,Resource (project management) ,Control and Systems Engineering ,Software engineering ,business ,Software ,Production system - Abstract
Digital twin (DT) technology has been entrusted with the tasks of modeling and monitoring the product, the process, and the production system. Moreover, the development of semantic modeling and digital perception makes the application of DTs in the manufacturing industry feasible. However, the application of DT technology to assembly workshop modeling and management remains immature owing to the discreteness of the assembly process, the diversity of assembly resources, and the complexity of the dataflow in assembly task execution. A method for ontology-based modeling and evolution of a DT for the assembly workshop is proposed to deal with this situation. Firstly, an ontology-based modeling method is given for assembly resources and processes; by instantiating them in the ontology, resources and processes can be involved in the modeling and evolution of the DT workshop. Secondly, the DT assembly workshop framework is introduced, with detailed discussions of dataflow mapping, DT evolution, and the storage and tracing of historical data generated during the operation of the workshop. In addition, a case study on an experimental field illustrates the entire process of constructing and evolving the DT, indicating the feasibility and validity of the proposed method.
- Published
- 2021
- Full Text
- View/download PDF
217. In‐Memory Computation
- Author
-
Albert Chun-Chen Liu and Oscar Ming Kin Law
- Subjects
Hardware_MEMORYSTRUCTURES ,Hybrid Memory Cube ,Artificial neural network ,Dataflow ,business.industry ,Computer science ,Computation ,Deep learning ,Parallel computing ,Bottleneck ,Artificial intelligence ,Cache ,business ,Data transmission - Abstract
To overcome the deep learning memory challenge, in-memory computation is proposed. This chapter introduces several memory-centric Processor-in-Memory architectures: the Neurocube, Tetris, and NeuroStream accelerators. The Georgia Institute of Technology's Neurocube accelerator integrates a parallel neural processing unit with the high-density 3D memory package Hybrid Memory Cube (HMC) to resolve the memory bottleneck; it applies a memory-centric neural computing approach for data-driven computation. Stanford University's Tetris accelerator adapts the MIT Eyeriss Row Stationary dataflow with additional 3D memory (HMC) to optimize memory access for in-memory computation, and implements in-memory accumulation to eliminate half of the ofmap memory accesses and TSV data transfers. The University of Bologna's NeuroStream accelerator is derived from the Processor-in-Memory architecture; its key features include the NeuroCluster frequency, NeuroStreams per cluster, instruction cache per core, and scratchpad per cluster. A roofline plot is used to illustrate NeuroStream processor performance.
- Published
- 2021
- Full Text
- View/download PDF
218. Configurable Multi-directional Systolic Array Architecture for Convolutional Neural Networks
- Author
-
Yang Guo, Yaohua Wang, Rui Xu, Sheng Ma, and Xinhai Chen
- Subjects
010302 applied physics ,Speedup ,business.industry ,Dataflow ,Computer science ,Systolic array ,02 engineering and technology ,Energy consumption ,01 natural sciences ,Convolutional neural network ,020202 computer hardware & architecture ,Convolution ,Transmission (telecommunications) ,Hardware and Architecture ,0103 physical sciences ,0202 electrical engineering, electronic engineering, information engineering ,Hardware acceleration ,business ,Software ,Computer hardware ,Information Systems - Abstract
The systolic array architecture is one of the most popular choices for convolutional neural network hardware accelerators. Its biggest advantage is its simple and efficient design principle: without complicated control and dataflow, hardware accelerators based on systolic arrays can compute traditional convolution very efficiently. However, this advantage also brings new challenges. When computing special types of convolution, such as small-scale convolution or depthwise convolution, the processing element (PE) utilization rate of the array drops sharply, mainly because the simple architecture limits the flexibility of the systolic array. In this article, we design a configurable multi-directional systolic array (CMSA) to address these issues. First, we add a data path to the systolic array that allows users to split the array through configuration to speed up small-scale convolution. Second, we redesign the PE unit so that the array has multiple data transmission modes and dataflow strategies, allowing users to switch the dataflow of the PE array to speed up depthwise convolution. In addition, unlike other works, we make only a few changes to the existing systolic array architecture, which avoids additional hardware overhead and can be easily deployed in application scenarios that require small systolic arrays, such as mobile terminals. Based on our evaluation, CMSA can increase the PE utilization rate by up to 1.6 times compared to a typical systolic array when running the last layers of ResNet-18, and by up to 14.8 times when running depthwise convolution in MobileNet. At the same time, CMSA and traditional systolic arrays are similar in area and energy consumption.
- Published
- 2021
- Full Text
- View/download PDF
219. Compliant geo-distributed data processing in action
- Author
-
Kaustubh Beedkar, David Brekardin, Volker Markl, and Jorge-Arnulfo Quiané-Ruiz
- Subjects
Data processing ,Action (philosophy) ,Work (electrical) ,Computer science ,Human–computer interaction ,Dataflow ,Movement (music) ,General Engineering ,Dimension (data warehouse) - Abstract
In this paper we present our work on compliant geo-distributed data processing. Our work focuses on the new dimension of dataflow constraints that regulate the movement of data across geographical or institutional borders. For example, European directives may permit transferring only certain information fields (such as non-personal information) or only aggregated data. Thus, it is crucial for distributed data processing frameworks to consider compliance with dataflow constraints derived from these regulations. We have developed a compliance-based data processing framework, which (i) allows for the declarative specification of dataflow constraints, (ii) determines whether a query can be translated into a compliant distributed query execution plan, and (iii) executes the compliant plan over distributed SQL databases. We demonstrate our framework using a geo-distributed adaptation of the TPC-H benchmark data. The framework provides an interactive dashboard that lets users specify dataflow constraints and analyze and execute compliant distributed query execution plans. (A simplified constraint-check sketch follows this entry.)
- Published
- 2021
- Full Text
- View/download PDF
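The framework lets users declare dataflow constraints and then checks whether a distributed query plan only ships data that may legally cross a border. The sketch below is an invented, heavily simplified stand-in for such a check (a per-route allow-list of fields); it is not the authors' constraint language or planner.

```python
# Hypothetical, simplified dataflow-constraint check: each rule says which
# fields of a relation may be shipped over a given route between regions.
CONSTRAINTS = {
    ("orders", "EU->US"): {"order_id", "total_price"},   # non-personal fields only
    ("orders", "EU->EU"): {"order_id", "total_price", "customer_name"},
}

def shipment_is_compliant(relation, route, fields):
    """True if every field the plan wants to move over `route` is allowed."""
    allowed = CONSTRAINTS.get((relation, route), set())
    return set(fields) <= allowed

# A plan fragment that moves customer names from an EU site to a US site
# would be rejected, so the planner must aggregate or drop that column first.
print(shipment_is_compliant("orders", "EU->US", ["order_id", "total_price"]))    # True
print(shipment_is_compliant("orders", "EU->US", ["order_id", "customer_name"]))  # False
```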
220. An Efficient and Flexible Accelerator Design for Sparse Convolutional Neural Networks
- Author
-
Jun Lin, Zhongfeng Wang, Xiaoru Xie, and Jinghe Wei
- Subjects
Computational complexity theory ,Computer science ,Dataflow ,business.industry ,020208 electrical & electronic engineering ,Bandwidth (signal processing) ,02 engineering and technology ,Parallel computing ,Convolutional neural network ,0202 electrical engineering, electronic engineering, information engineering ,System on a chip ,Electrical and Electronic Engineering ,Field-programmable gate array ,business ,Electrical efficiency ,Digital signal processing - Abstract
Designing hardware accelerators for convolutional neural networks (CNNs) has recently attracted tremendous attention. Many existing accelerators are built for dense CNNs or structured sparse CNNs. By contrast, unstructured sparse CNNs can achieve a higher compression ratio with equivalent accuracy. However, their hardware implementations generally suffer from load imbalance and conflicting accesses to on-chip buffers, which results in underutilization of the processing elements (PEs). To tackle these issues, we propose a hardware- and power-efficient, highly flexible architecture that supports both unstructured and structured sparse CNNs with various configurations. Firstly, we propose an efficient weight reordering algorithm to preprocess compressed weights and balance the workload of the PEs. Secondly, an adaptive on-chip dataflow, namely the hybrid parallel (HP) dataflow, is introduced to promote weight reuse. Thirdly, the partial fusion scheme, first introduced in one of our prior works, is incorporated as the off-chip dataflow. Thanks to these dataflow optimizations, the repetitive data exchanges between on-chip buffers and external memories are significantly reduced. We implement the design on the Intel Arria 10 SX660 platform and evaluate it with MobileNet-v2, ResNet-50, and ResNet-18 on the ImageNet dataset. Compared to existing sparse accelerators on FPGAs, the proposed accelerator achieves a 1.35–1.81× improvement in power efficiency at the same sparsity. Compared to prior dense accelerators, it achieves an improvement of 1.92–5.84× in DSP efficiency.
- Published
- 2021
- Full Text
- View/download PDF
221. A Binary Translation Framework for Automated Hardware Generation
- Author
-
Joao M. P. Cardoso, João Bispo, João Canas Ferreira, and Nuno Paulino
- Subjects
MicroBlaze ,Dataflow ,Computer science ,business.industry ,Binary translation ,Symmetric multiprocessor system ,Energy consumption ,Reconfigurable computing ,Hardware and Architecture ,Retargeting ,Electrical and Electronic Engineering ,Field-programmable gate array ,business ,Software ,Computer hardware - Abstract
As applications move to the edge, efficiency in both computing power and power/energy consumption is required. Heterogeneous computing promises to meet these requirements through application-specific hardware accelerators. Runtime adaptivity might be of paramount importance to realize the potential of hardware specialization, but further study is required on workload retargeting and offloading to reconfigurable hardware. This article presents our framework for the exploration of both offloading and hardware generation techniques. The framework is currently able to process instruction sequences from MicroBlaze, ARMv8, and riscv32imaf binaries, and to represent them as Control and Dataflow Graphs for transformation into hardware module implementations. We illustrate the framework’s capabilities for identifying binary sequences suitable for hardware translation with a set of 13 benchmarks.
- Published
- 2021
- Full Text
- View/download PDF
222. Watermarks in stream processing systems
- Author
-
Slava Chernyak, Daniel Mills, Fabian Hueske, Kenneth Knowles, Kathryn Knight, Dan Sotolongo, Tyler Akidau, and Edmon Begoli
- Subjects
Stream processing ,Semantics (computer science) ,Dataflow ,Computer science ,Programming language ,business.industry ,General Engineering ,Cloud computing ,computer.software_genre ,business ,computer - Abstract
Streaming data processing is an exercise in taming disorder: from oftentimes huge torrents of information, we hope to extract powerful and timely analyses. But when dealing with streaming data, the unbounded and temporally disordered nature of real-world streams introduces a critical challenge: how does one reason about the completeness of a stream that never ends? In this paper, we present a comprehensive definition and analysis of watermarks, a key tool for reasoning about temporal completeness in infinite streams. First, we describe what watermarks are and why they are important, highlighting how they address a suite of stream processing needs that are poorly served by eventually-consistent approaches:
• Computing a single correct answer, as in notifications.
• Reasoning about a lack of data, as in dip detection.
• Performing non-incremental processing over temporal subsets of an infinite stream, as in statistical anomaly detection with cubic spline models.
• Safely and punctually garbage collecting obsolete inputs and intermediate state.
• Surfacing a reliable signal of overall pipeline health.
Second, we describe, evaluate, and compare the semantically equivalent, but starkly different, watermark implementations in two modern stream processing engines: Apache Flink and Google Cloud Dataflow. (A toy watermark sketch follows this entry.)
- Published
- 2021
- Full Text
- View/download PDF
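A watermark is the engine's estimate that no events with older timestamps will still arrive, which is what allows a window to be declared complete and a single correct answer to be emitted. The sketch below is a deliberately simplified, single-source illustration of that mechanic with an assumed fixed bound on out-of-orderness; watermark generation and propagation in Flink or Cloud Dataflow are considerably more involved.

```python
# Simplified illustration of watermark-driven window triggering for one
# unordered event stream. Assumed: a fixed bound on out-of-orderness.
from collections import defaultdict

WINDOW = 10          # window length in time units
MAX_LATENESS = 3     # assumed bound on how out-of-order events can be

def process(events):
    windows = defaultdict(list)   # window start -> values
    emitted = set()
    watermark = float("-inf")
    for timestamp, value in events:
        windows[(timestamp // WINDOW) * WINDOW].append(value)
        # Watermark: no event older than this is expected any more.
        watermark = max(watermark, timestamp - MAX_LATENESS)
        # A window [start, start+WINDOW) is complete once the watermark passes its end.
        for start in sorted(windows):
            if start + WINDOW <= watermark and start not in emitted:
                print(f"window [{start},{start + WINDOW}) sum = {sum(windows[start])}")
                emitted.add(start)

process([(1, 5), (12, 2), (3, 7), (14, 1), (25, 9)])
# window [0,10) is emitted once the watermark (timestamp 14 - 3 = 11) passes 10.
```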
223. A Robust Deep-Learning-Enabled Trust-Boundary Protection for Adversarial Industrial IoT Environment
- Author
-
Shamsul Huda, Victor Hugo C. de Albuquerque, Mohammad Mehedi Hassan, and Md. Rafiul Hassan
- Subjects
Trust boundary ,Artificial neural network ,Computer Networks and Communications ,Computer science ,business.industry ,Dataflow ,Distributed computing ,Deep learning ,020206 networking & telecommunications ,02 engineering and technology ,Attack surface ,Computer Science Applications ,Attack model ,Hardware and Architecture ,Robustness (computer science) ,Signal Processing ,0202 electrical engineering, electronic engineering, information engineering ,020201 artificial intelligence & image processing ,Artificial intelligence ,business ,Information Systems - Abstract
In recent years, trust-boundary protection has become a challenging problem in Industrial Internet of Things (IIoT) environments. Trust boundaries separate IIoT processes and data stores into different groups based on user access privilege. Points where dataflow intersects the trust boundary are becoming entry points for attackers. Attackers use various model-skewing and intelligent techniques to generate adversarial/noisy examples that are indistinguishable from natural data. Many existing machine-learning (ML)-based approaches attempt to circumvent this problem. However, owing to the extremely large attack surface of the IIoT network, capturing the true distribution during training is difficult. The standard generative adversarial network (GAN) commonly generates adversarial examples for training using randomly sampled noise. However, the distribution of the GAN's noisy inputs differs greatly from the actual distribution of data in IIoT networks, and such models show less robustness against adversarial attacks. Therefore, in this article, we propose a downsampler-encoder-based cooperative data generator that is trained using an algorithm designed to better capture the actual distribution of attack models over the large IIoT attack surface. The proposed downsampler-based data generator is alternately updated and verified during training using a deep neural network discriminator to ensure robustness. This guarantees the performance of the generator against input sets with a high noise level at training and testing time. Various experiments are conducted on a real IIoT testbed dataset. The experimental results show that the proposed approach outperforms conventional deep learning and other ML techniques in terms of robustness against adversarial/noisy examples in the IIoT environment.
- Published
- 2021
- Full Text
- View/download PDF
224. System-Technology Codesign of 3-D NAND Flash-Based Compute-in-Memory Inference Engine
- Author
-
Wonbo Shim and Shimeng Yu
- Subjects
Computer engineering. Computer hardware ,Hardware_MEMORYSTRUCTURES ,Computer science ,business.industry ,Dataflow ,Inference ,NAND gate ,Chip ,Electronic, Optical and Magnetic Materials ,Resistive random-access memory ,deep neural network (DNN) ,TK7885-7895 ,Hardware and Architecture ,Static random-access memory ,Electrical and Electronic Engineering ,Inference engine ,compute-in-memory (CIM) ,business ,hardware accelerator ,Throughput (business) ,3-D NAND ,Computer hardware - Abstract
Due to its ultrahigh density and commercially mature fabrication technology, 3-D NAND flash memory has been proposed as an attractive candidate for an inference engine for deep neural network (DNN) workloads. However, the peripheral circuits of conventional 3-D NAND flash need to be modified to enable compute-in-memory (CIM), and the chip architecture needs to be redesigned for an optimized dataflow. In this work, we present a design of a 3-D NAND-CIM accelerator based on the macro parameters of an industry-grade prototype chip. The DNN inference performance is evaluated using the DNN+NeuroSim framework. To exploit the ultrahigh density of 3-D NAND flash, both input and weight mapping strategies are introduced to improve throughput. Benchmarking on the VGG network was performed across the technological candidates for CIM, including SRAM, resistive random access memory (RRAM), and 3-D NAND. Compared to similar designs with SRAM or RRAM, the results show that the 3-D NAND-based CIM design requires only 17%–24% of the chip area while achieving 1.9–2.7 times better energy efficiency for 8-bit precision inference. The inference accuracy drop induced by 3-D NAND string current drift and variation is also investigated. No accuracy degradation due to current variation was observed with the proposed input mapping scheme, while accuracy is sensitive to current drift, which implies that compensation schemes are needed to maintain the inference accuracy.
- Published
- 2021
- Full Text
- View/download PDF
225. The System for Transforming the Code of Dataflow Programs into Imperative
- Author
-
Vladimir S. Vasilev, Alexander I. Legalov, and Sergey V. Zykov
- Subjects
0209 industrial biotechnology ,Source code ,Computer science ,Dataflow ,program analysis ,media_common.quotation_subject ,transformation of programs ,02 engineering and technology ,Information technology ,computer.software_genre ,Set (abstract data type) ,symbols.namesake ,020901 industrial engineering & automation ,Program analysis ,0202 electrical engineering, electronic engineering, information engineering ,media_common ,Programming language ,intermediate program representations ,typing ,Dataflow programming ,T58.5-58.64 ,Imperative programming ,Transformation (function) ,symbols ,020201 artificial intelligence & image processing ,computer ,Von Neumann architecture ,dataflow parallel programming - Abstract
Functional dataflow programming languages are designed for creating parallel, portable programs. The source code of such programs is translated into a set of graphs that reflect information and control dependencies. The main way to execute them is interpretation, which does not allow calculations to be performed efficiently on real parallel computing systems and leads to poor performance. To run programs directly on existing computing systems, one needs specific optimization and transformation methods that take into account the features of both the programming language and the architecture of the system. Currently, the Von Neumann architecture is the most common, and parallel programming for it is in most cases carried out using imperative languages with static type systems. For different architectures of parallel computing systems, there are various approaches to writing parallel programs. Transforming dataflow parallel programs into imperative programs makes it possible to form a framework of imperative code fragments that directly express sequential calculations; this framework can later be adapted to a specific parallel architecture. The paper considers an approach to performing this type of transformation, which consists of identifying fragments of dataflow parallel programs as templates that are subsequently replaced by equivalent fragments in imperative languages. The proposed transformation methods make it possible to generate program code to which various optimizing transformations can later be applied, including parallelization for the target architecture. (A toy serialization sketch follows this entry.)
- Published
- 2021
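The transformation described above identifies fragments of the dataflow graph as templates and replaces them with equivalent imperative code. As a toy illustration of the underlying idea of serializing a data-dependency graph into sequential statements, the sketch below emits one assignment per node in topological order; the graph format and the emitted pseudo-code are invented for illustration and do not reproduce the paper's template system.

```python
# Toy illustration: serialize a dataflow graph into imperative assignments
# in dependency (topological) order. Requires Python 3.9+ for graphlib.
from graphlib import TopologicalSorter

# name -> (operator, [input names]); inputs with no entry are external inputs.
graph = {
    "t1": ("+", ["a", "b"]),
    "t2": ("*", ["t1", "c"]),
    "out": ("-", ["t2", "a"]),
}

# Predecessors of each node are the inputs that are themselves graph nodes.
deps = {node: [i for i in inputs if i in graph] for node, (_, inputs) in graph.items()}
for node in TopologicalSorter(deps).static_order():
    op, inputs = graph[node]
    rhs = f" {op} ".join(inputs)
    print(f"{node} = {rhs};")
# t1 = a + b;
# t2 = t1 * c;
# out = t2 - a;
```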
226. Distributed temporal graph analytics with GRADOOP
- Author
-
Martin Junghanns, Erhard Rahm, Matthias Täschner, Lucas Schons, Timo Adameit, Gomez Kevin A, Philip Fritzsche, Christopher Rost, and Lukas Christ
- Subjects
Power graph analysis ,Structure (mathematical logic) ,Theoretical computer science ,Computer science ,business.industry ,Dataflow ,Data model ,Hardware and Architecture ,Analytics ,Scalability ,Pattern matching ,business ,Bitemporal Modeling ,MathematicsofComputing_DISCRETEMATHEMATICS ,Information Systems - Abstract
Temporal property graphs are graphs whose structure and properties change over time. Temporal graph datasets tend to be large due to the stored historical information, calling for scalable analysis capabilities. We give a complete overview of Gradoop, a graph dataflow system for scalable, distributed analytics of temporal property graphs which has been continuously developed since 2005. Its graph model TPGM allows bitemporal modeling not only of vertices and edges but also of graph collections. A declarative analytical language called GrALa allows analysts to flexibly define analytical graph workflows by composing different operators that support temporal graph analysis. Built on a distributed dataflow system, large temporal graphs can be processed on a shared-nothing cluster. We present the system architecture of Gradoop, its data model TPGM with composable temporal graph operators, such as snapshot, difference, pattern matching, and graph grouping, and several implementation details. We evaluate the performance and scalability of selected operators and a composed workflow for synthetic and real-world temporal graphs with up to 283 M vertices and 1.8 B edges, and a graph lifetime of about 8 years with up to 20 M new edges per year. We also reflect on lessons learned from the Gradoop effort. (A logical snapshot-operator sketch follows this entry.)
- Published
- 2021
- Full Text
- View/download PDF
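GrALa composes operators such as snapshot, difference, and pattern matching over temporal property graphs whose elements carry validity intervals. The snippet below only illustrates the logical behaviour of a snapshot operator (select the vertices and edges valid at a point in time) on invented tuples; it is not Gradoop's distributed TPGM implementation.

```python
# Minimal logical illustration of a temporal-graph snapshot operator.
# Each element carries a validity interval [valid_from, valid_to).
vertices = [("v1", 2014, 2025), ("v2", 2016, 2019), ("v3", 2018, 2025)]
edges = [("v1", "v2", 2016, 2019), ("v1", "v3", 2018, 2025)]

def snapshot(vertices, edges, t):
    """Return the vertices and edges that were valid at time t."""
    vs = sorted(vid for vid, start, end in vertices if start <= t < end)
    es = [(src, dst) for src, dst, start, end in edges
          if start <= t < end and src in vs and dst in vs]
    return vs, es

print(snapshot(vertices, edges, 2017))  # (['v1', 'v2'], [('v1', 'v2')])
print(snapshot(vertices, edges, 2020))  # (['v1', 'v3'], [('v1', 'v3')])
```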
227. Theoretical and applied aspects of library automation: Analyzing Russian Science Citation Index
- Subjects
Bibliometric analysis ,Impact factor ,Computer science ,Dataflow ,05 social sciences ,Science Citation Index ,Subject (documents) ,General Medicine ,050905 science studies ,World Wide Web ,Information and Communications Technology ,Information system ,Library automation ,0509 other social sciences ,050904 information & library sciences - Abstract
The 20-year document flow on library automation is analyzed based on the Russian Science Citation Index. The goal of the study was to evaluate the state of library automation as reflected in the document flow in library and information studies. The stages of library automation are specified, and the selection principles for retrieving and analyzing related publications are characterized. The dynamics and statistical data of the dataflow are presented, revealing sustained interest in the problem. Based on the 2- and 5-year impact factors, the 10 most productive journals on the subject are identified, among them the journals “Scientific and technical libraries” and “Bibliotekovedenie” [Russian Journal of Library Science]. Organizations leading in library automation are introduced. The sectoral and content structures of the array of interest are characterized, and the current trends in library automation, related to expanding ALIS functionality and implementing information and communication technologies, are identified. A ranked list of publications on individual automated library and information systems is included.
- Published
- 2021
- Full Text
- View/download PDF
228. A Real-Time Architecture for Pruning the Effectual Computations in Deep Neural Networks
- Author
-
Hyuk-Jae Lee, Lakshminarayanan Gopalakrishnan, Mohammadreza Asadikouhanjani, Hao Zhang, and Seok-Bum Ko
- Subjects
Speedup ,Artificial neural network ,Computer science ,Dataflow ,Reference design ,020208 electrical & electronic engineering ,Sorting ,02 engineering and technology ,Parallel computing ,Filter (video) ,0202 electrical engineering, electronic engineering, information engineering ,Benchmark (computing) ,Pruning (decision trees) ,Electrical and Electronic Engineering - Abstract
Integrating Deep Neural Networks (DNNs) into Internet of Things (IoT) devices could enable complex sensing and recognition tasks that support a new era of human interaction with surrounding environments. However, DNNs are power-hungry, performing billions of computations per inference. Spatial DNN accelerators, unlike other common architectures such as systolic arrays, can in principle support computation-pruning techniques. Energy-efficient DNN accelerators exploit bit-wise or word-wise sparsity in the input feature maps (ifmaps) and filter weights, skipping ineffectual computations. However, there is still room to prune effectual computations without reducing the accuracy of DNNs. In this paper, we propose a novel real-time architecture and dataflow that decomposes multiplications down to the bit level and prunes identical computations in spatial designs while running benchmark networks. The proposed architecture prunes identical computations by identifying identical bit values present in both the ifmaps and the filter weights, without changing the accuracy of the benchmark networks. Compared to the reference design, our proposed design achieves an average per-layer speedup of 1.4× and an energy efficiency improvement of 1.21× per inference while maintaining the accuracy of the benchmark networks.
- Published
- 2021
- Full Text
- View/download PDF
229. Dataflow-Aware Macro Placement Based on Simulated Evolution Algorithm for Mixed-Size Designs
- Author
-
You-Lun Deng, Jai-Ming Lin, Jia-Jian Chen, Po-Chen Lu, and Ya-Chu Yang
- Subjects
Very-large-scale integration ,Computer science ,Dataflow ,02 engineering and technology ,Data structure ,020202 computer hardware & architecture ,Constraint (information theory) ,Image stitching ,Hardware and Architecture ,Simulated annealing ,Hardware_INTEGRATEDCIRCUITS ,0202 electrical engineering, electronic engineering, information engineering ,Electrical and Electronic Engineering ,Routing (electronic design automation) ,Macro ,Algorithm ,Software - Abstract
This article proposes a novel approach to handle macro placement. Previous works usually apply the simulated annealing (SA) algorithm to handle this problem. However, the SA-based approaches usually have difficulty in handling preplaced macros and require longer runtime. To resolve these problems, we propose a macro placement procedure based on the corner stitching data structure and then apply an efficient and effective simulated evolution algorithm to further refine placement results. In order to relieve local routing congestion, we propose to expand areas of movable macros according to the design hierarchy before applying the macro placement algorithm. Finally, we extend our macro placement methodology to consider dataflow constraint so that dataflow-related macros can be placed at close locations. The experimental results show that our approach obtains a better solution than a previous macro placement algorithm and a tool. Besides, placement quality can be further improved when the dataflow constraint is considered.
- Published
- 2021
- Full Text
- View/download PDF
230. A Dataflow Language (AVON) as an Architecture Description Language (ADL)
- Author
-
Deb, Ashoke, Kleinjohann, Bernd, editor, Gao, Guang R., editor, Kopetz, Hermann, editor, Kleinjohann, Lisa, editor, and Rettberg, Achim, editor
- Published
- 2004
- Full Text
- View/download PDF
231. A Network Scheduling Method Based on Segmented Constraints for Convergence of Time-Sensitive Networking and Industrial Wireless Networks
- Author
-
Min Wei, Chang Liu, Jin Wang, and Shujie Yang
- Subjects
industrial wireless network ,time-sensitive networking ,scheduling method ,greedy algorithm ,dataflow ,Computer Networks and Communications ,Hardware and Architecture ,Control and Systems Engineering ,Signal Processing ,Electrical and Electronic Engineering - Abstract
In industrial applications, it is necessary to select different types of networks according to different communication requirements. To meet this requirement, a converged network of wired and wireless networks is frequently employed. Notably, fulfilling the end-to-end transmission requirements of converged networks is challenging. As a solution, converged-network scheduling methods have proved valuable. In this paper, a network scheduling method for the convergence of industrial wireless networks and time-sensitive networks is proposed. Additionally, the proposed method is tested and verified. The results show that the end-to-end average transmission delay is reduced and the jitter is acceptable.
- Published
- 2023
- Full Text
- View/download PDF
232. Teaching Programming Broadly and Deeply: The Kernel Language Approach
- Author
-
Van Roy, Peter, Haridi, Seif, Cassel, Lillian, editor, and Reis, Ricardo A., editor
- Published
- 2003
- Full Text
- View/download PDF
233. Understanding the design-space of sparse/dense multiphase GNN dataflows on spatial accelerators
- Author
-
Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Universitat Politècnica de Catalunya. Departament d'Enginyeria Electrònica, Universitat Politècnica de Catalunya. IDEAI-UPC - Intelligent Data sciEnce and Artificial Intelligence Research Group, Garg, Raveesh, Qin, Eric, Muñoz Martínez, Francisco, Guirado Liñan, Robert, Jain, Akshay, Abadal Cavallé, Sergi, Abellán Miguel, José Luis, Acacio Sánchez, Manuel E., Alarcón Cot, Eduardo José, Rajamanickam, Sivasankaran, and Krishna, Tushar
- Abstract
Graph Neural Networks (GNNs) have garnered a lot of recent interest because of their success in learning representations from graph-structured data across several critical applications in cloud and HPC. Owing to their unique compute and memory characteristics, which come from an interplay between dense and sparse phases of computation, the emergence of reconfigurable dataflow (aka spatial) accelerators offers promise for acceleration by mapping optimized dataflows (i.e., computation order and parallelism) for both phases. The goal of this work is to characterize and understand the design space of dataflow choices for running GNNs on spatial accelerators so that mappers or design-space exploration tools can optimize the dataflow based on the workload. Specifically, we propose a taxonomy that describes all possible choices for mapping the dense and sparse phases of GNN inference, spatially and temporally, over a spatial accelerator, capturing both the intra-phase dataflow and the inter-phase (pipelined) dataflow. Using this taxonomy, we do deep dives into the costs and benefits of several dataflows and perform case studies on the implications of hardware parameters for dataflows and on the value of flexibility to support pipelined execution. Parts of this work were supported through a fellowship by NEC Laboratories Europe, project grant PID2020-112827GB-I00 funded by MCIN/AEI/10.13039/501100011033, RTI2018-098156-B-C53 (MCIU/AEI/FEDER, UE), and grant 20749/FPI/18 from Fundación Séneca.
- Published
- 2022
234. The Digital Earth SMART monitoring concept and tools
- Author
-
Bouwer, L.M., Dransch, D., Ruhnke, R., Rechid, D., Frickenhaus, S., Greinert, J., Koedel, Uta, Dietrich, Peter, Fischer, P., Bundke, U., Burwicz-Galerne, E., Haas, A., Herrarte, I., Haroon, A., Jegen, M., Kalbacher, Thomas, Kennert, M., Korf, T., Kunkel, R., Kwok, Ching Yin, Mahnke, C., Nixdorf, Erik, Paasche, Hendrik, González Ávalos, E., Petzold, A., Rohs, S., Wagner, R., and Walter, A.
- Abstract
Reliable data are the basis of all scientific analyses, interpretations and conclusions. Evaluating data in a smart way speeds up interpretation and conclusion and highlights where, when and how additional data acquired in the field will support knowledge gain. An extended SMART monitoring concept is introduced which includes SMART sensors, DataFlows, MetaData and Sampling approaches and tools. In the course of the Digital Earth project, the meaning of SMART monitoring has evolved significantly. It stands for a combination of hardware and software tools that turn the traditional monitoring approach, in which a monitoring DataFlow is processed and analyzed sequentially on its way from the sensor to a repository, into an integrated analysis approach. The measured values themselves, their metadata, the status of the sensor, and additional auxiliary data can be made available in real time and analyzed to enhance the sensor output in terms of accuracy and precision. Although several parts of the four tools are known, technically feasible and sometimes applied in Earth science studies, there is still a large discrepancy between these ambitions and what is feasible and commonly done in practice in the field.
- Published
- 2022
235. Performance Analysis of GEMM Mappings in CPU Architectures
- Author
-
Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Georgia Institute of Technology, Abadal Cavallé, Sergi, Krishna, Tushar, and Gallego Jené, Martina
- Abstract
Deep Neural Networks (DNNs) have revolutionized the scene of Machine Learning (ML) with their ability to process large amounts of data, which allows them to model complex relationships and make predictions across multiple and diverse fields such as e-commerce, medicine and entertainment. With such great advantages comes a great computational cost. One of the operations that represents a large fraction of the computational workload of most types of DNNs, including Convolutional Neural Networks (CNNs) and Graph Neural Networks (GNNs), is General Matrix-Matrix Multiplication (GEMM). Nowadays, most of the research on improving the speed and efficiency of GEMMs focuses on accelerators and GPUs, but CPUs are still extensively present, and sometimes underutilized, in modern computing systems, especially in data centers. As a result, this thesis focuses on the performance analysis of GEMMs in CPU systems. In particular, the study explores the impact of different tiling, dataflow, and partitioning techniques on data reuse and data movement across a set of CPU architectures. The results of these simulations show that multi-core architectures are faster than single-core ones due to the high degree of parallelism in GEMMs. We have also observed that the impact of tiling on memory accesses and runtime is not uniform across the different matrix dimensions. Finally, we have observed that the input-stationary dataflow generally performs better than the output-stationary one, although there are some exceptions.
- Published
- 2022
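As a rough illustration of the dataflow comparison described in the entry above (not the thesis's simulator), the sketch below contrasts an output-stationary and an input-stationary loop ordering for a small, untiled GEMM; the matrix sizes are arbitrary assumptions.

```python
import numpy as np

def gemm_output_stationary(A, B):
    """Each output element C[i, j] stays 'stationary': its partial sum is
    accumulated to completion before moving to the next output."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N))
    for i in range(M):
        for j in range(N):
            acc = 0.0
            for k in range(K):
                acc += A[i, k] * B[k, j]
            C[i, j] = acc
    return C

def gemm_input_stationary(A, B):
    """Each input element A[i, k] stays 'stationary': it is loaded once and
    reused across a whole row of partial outputs before moving on."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N))
    for i in range(M):
        for k in range(K):
            a = A[i, k]           # reused N times before the next load
            for j in range(N):
                C[i, j] += a * B[k, j]
    return C

A = np.random.rand(8, 6)
B = np.random.rand(6, 4)
assert np.allclose(gemm_output_stationary(A, B), gemm_input_stationary(A, B))
```

Both orderings compute the same result; they differ only in which operand is kept in place and reused, which is exactly the kind of data-movement trade-off the thesis measures.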
236. Multithread Accelerators on FPGAs: A Dataflow-Based Approach
- Author
-
Ratto, Francesco, Esposito, Stefano, Sau, Carlo, Raffo, Luigi, and Palumbo, Francesca
- Abstract
Multithreading is a well-known technique in general-purpose systems for delivering a substantial performance gain, raising resource efficiency by exploiting periods of underutilization. With the spread of specialized hardware, resource efficiency has become fundamental to mastering the overhead that such devices introduce. In this work, we propose a model-based approach for designing specialized multithread hardware accelerators. This novel approach exploits dataflow models of applications and tagged tokens to let the resulting hardware support concurrent threads without replicating the whole accelerator. The assessment is carried out on different versions of an accelerator for a compute-intensive step of modern video coding algorithms, under several feeding configurations. The results highlight that the proposed multithread accelerators achieve a valuable tradeoff: they save computational resources with respect to replicated parallel single-thread accelerators, while guaranteeing shorter waiting, response, and processing times than a single time-multiplexed single-thread accelerator.
- Published
- 2022
- Full Text
- View/download PDF
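A toy sketch of the tagged-token idea from the entry above, in Python rather than hardware: tokens carry a thread tag, and an actor only fires when it holds a matching token on every input for the same tag, so several threads can share one datapath without replicating it. All names and the example are illustrative, not the authors' design.

```python
from collections import defaultdict

class TaggedTokenActor:
    """Fires when every input port holds a token with the same thread tag."""
    def __init__(self, n_inputs, compute):
        self.queues = [defaultdict(list) for _ in range(n_inputs)]  # tag -> values
        self.compute = compute

    def push(self, port, tag, value):
        self.queues[port][tag].append(value)
        return self._try_fire(tag)

    def _try_fire(self, tag):
        if all(q[tag] for q in self.queues):
            args = [q[tag].pop(0) for q in self.queues]
            return tag, self.compute(*args)
        return None

# Two interleaved "threads" (tags 0 and 1) share the same adder actor.
adder = TaggedTokenActor(2, lambda a, b: a + b)
print(adder.push(0, tag=0, value=3))   # None: waiting for the other operand of tag 0
print(adder.push(0, tag=1, value=10))  # None: tag 1 still incomplete
print(adder.push(1, tag=1, value=20))  # (1, 30): tag 1 fires first
print(adder.push(1, tag=0, value=4))   # (0, 7): tag 0 fires once its pair is complete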
237. Hierarchical Dataflow Modeling of Iterative Applications.
- Author
-
Hyesun Hong, Hyunok Oh, and Soonhoi Ha
- Subjects
DATA flow computing ,ELECTRONIC data processing ,ITERATIVE methods (Mathematics) ,NUMERICAL analysis ,NEURAL circuitry - Abstract
Even though dataflow models are good at exploiting the task-level parallelism of an application, it is difficult to exploit the parallelism of loop structures, since these are not explicitly specified in existing dataflow models. To overcome this drawback, we propose a novel extension of the SDF model, called the SDF/L graph, which specifies loop structures explicitly in a hierarchical fashion. Given an SDF/L graph specification together with mapping and scheduling information, an application can be automatically parallelized on a multicore system. The enhanced expressive capability of the proposed extension is verified with two applications: k-means clustering and a deep neural network application.
- Published
- 2017
- Full Text
- View/download PDF
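The SDF/L extension itself is defined in the paper above; purely as a loose illustration, the snippet below represents a dataflow graph in which a subgraph is wrapped in an explicit loop node with an iteration count, the kind of hierarchical structure a scheduler could unroll or parallelize. The data structures and the k-means-like example are hypothetical, not the authors' format.

```python
from dataclasses import dataclass, field
from typing import List, Union

@dataclass
class Actor:
    name: str
    produce: int = 1   # tokens produced per firing
    consume: int = 1   # tokens consumed per firing

@dataclass
class LoopNode:
    """Hierarchical node: an explicit loop wrapping a sub-dataflow graph."""
    iterations: int
    body: List[Union["Actor", "LoopNode"]] = field(default_factory=list)

def flatten(node, reps=1):
    """Expand the loop hierarchy into (actor, total repetitions) pairs,
    e.g. as input to a mapping/scheduling step."""
    if isinstance(node, Actor):
        return [(node.name, reps)]
    out = []
    for child in node.body:
        out += flatten(child, reps * node.iterations)
    return out

# Toy application: 'assign' and 'update' repeated 10 times inside one run.
app = LoopNode(iterations=1, body=[
    Actor("load"),
    LoopNode(iterations=10, body=[Actor("assign"), Actor("update")]),
    Actor("store"),
])
print(flatten(app))  # [('load', 1), ('assign', 10), ('update', 10), ('store', 1)]
```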
238. StaccatoLab: A Programming and Execution Model for Large‐scale Dataflow Computing
- Author
-
Kees van Berkel
- Subjects
Scale (ratio), Computer science, Dataflow, Dataflow programming, Parallel computing, Execution model
- Published
- 2021
- Full Text
- View/download PDF
239. Evaluation of the Exact Throughput of a Synchronous DataFlow Graph
- Author
-
Alix Munier Kordon and Bruno Bodin
- Subjects
Schedule, Computer science, Dataflow, Iterative method, 020206 networking & telecommunications, 02 engineering and technology, Parallel computing, Theoretical Computer Science, Set (abstract data type), Task (computing), Hardware and Architecture, Control and Systems Engineering, Modeling and Simulation, Signal Processing, 0202 electrical engineering, electronic engineering, information engineering, Graph (abstract data type), 020201 artificial intelligence & image processing, Compiler, Throughput (business), Information Systems
- Abstract
A Synchronous DataFlow Graph (SDFG for short) is a formalism frequently used in electronic design and software compilers to model communications between components with different rates. The development of efficient algorithms to evaluate the maximum throughput of SDFGs is a challenging question. This paper presents a mathematical framework to perform schedulability analysis and to compute the maximum throughput of SDFGs. The work focuses on strictly K-periodic schedules, for which a fixed set of execution times coupled with a period is associated with each task and defines a schedule of all of its executions. This class of schedules can always reach maximal throughput; we present an algorithm that computes the exact maximum throughput by iteratively generating K-periodic schedules until optimality is reached. The complexity of this iterative algorithm is studied using the well-established benchmarking suite SDF3 and compared against the most common throughput analysis techniques. We show improvements of several orders of magnitude over the state of the art, both in computation time and in the size of the final schedules.
- Published
- 2021
- Full Text
- View/download PDF
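Exact throughput evaluation as in the paper above requires the full K-periodic machinery; the sketch below only computes the repetition vector of a small SDF graph (how many times each actor fires per graph iteration), which is the usual starting point for such an analysis. The graph, rates, and helper names are illustrative, and a consistent, connected graph is assumed.

```python
from fractions import Fraction
from math import lcm

def repetition_vector(actors, edges):
    """Solve the SDF balance equations q[src]*prod == q[dst]*cons and return
    the smallest positive integer firing counts (assumes a connected, consistent graph)."""
    q = {actors[0]: Fraction(1)}
    changed = True
    while changed:                     # propagate rates along edges
        changed = False
        for src, dst, prod, cons in edges:
            if src in q and dst not in q:
                q[dst] = q[src] * prod / cons
                changed = True
            elif dst in q and src not in q:
                q[src] = q[dst] * cons / prod
                changed = True
    scale = lcm(*(f.denominator for f in q.values()))
    return {a: int(f * scale) for a, f in q.items()}

# Toy graph: A produces 2 tokens per firing, B consumes 3; B produces 1, C consumes 2.
actors = ["A", "B", "C"]
edges = [("A", "B", 2, 3), ("B", "C", 1, 2)]
print(repetition_vector(actors, edges))  # {'A': 3, 'B': 2, 'C': 1}
```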
240. I-Scheduler: Iterative scheduling for distributed stream processing systems
- Author
-
Leila Eskandari, David Eyers, Jason Mair, and Zhiyi Huang
- Subjects
Stream processing, Computer Networks and Communications, Hardware and Architecture, Computer science, Dataflow, Graph partition, Throughput, Parallel computing, Directed acyclic graph, Software, Scheduling (computing)
- Abstract
Task allocation in Data Stream Processing Systems (DSPSs) has a significant impact on performance metrics such as data processing latency and system throughput. An application processed by DSPSs can be represented as a Directed Acyclic Graph (DAG), where each vertex represents a task and the edges show the dataflow between the tasks. Task allocation can be defined as the assignment of the vertices in the DAG to the physical compute nodes such that the data movement between the nodes is minimised. Finding an optimal task placement for DSPSs is NP-hard. Thus, approximate scheduling approaches are required to improve the performance of DSPSs. In this paper, we propose a heuristic scheduling algorithm which reliably and efficiently finds highly communicating tasks by exploiting graph partitioning algorithms and a mathematical optimisation software package. We evaluate the communication cost of our method using three micro-benchmarks, showing that we can achieve results that are close to optimal. We further compare our scheduler with two popular existing schedulers, R-Storm and Aniello et al.’s ‘Online scheduler’ using two real-world applications. Our experimental results show that our proposed scheduler outperforms R-Storm, increasing throughput by up to 30%, and improves on the Online scheduler by 20%–86% as a result of finding a more efficient schedule.
- Published
- 2021
- Full Text
- View/download PDF
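I-Scheduler itself combines graph partitioning with a mathematical optimiser; as a much simpler stand-in, the sketch below greedily assigns the vertices of a task DAG to compute nodes, placing each task on the node that minimises traffic on cut edges subject to a capacity limit. The graph, capacities, and function names are made up for illustration.

```python
def cut_traffic(task, node, placement, edges):
    """Traffic on edges between `task` and already-placed neighbours on other nodes."""
    total = 0
    for (u, v), w in edges.items():
        if task == u:
            other = v
        elif task == v:
            other = u
        else:
            continue
        if other in placement and placement[other] != node:
            total += w
    return total

def greedy_placement(tasks, edges, n_nodes, capacity):
    """Assign each task to the compute node that minimises cut traffic,
    subject to a per-node capacity limit."""
    placement, load = {}, [0] * n_nodes
    for t in tasks:
        candidates = [n for n in range(n_nodes) if load[n] < capacity]
        best = min(candidates, key=lambda n: cut_traffic(t, n, placement, edges))
        placement[t] = best
        load[best] += 1
    return placement

# Toy streaming DAG: spout -> split -> {count1, count2} -> sink, heavier path via count1.
edges = {("spout", "split"): 10, ("split", "count1"): 8,
         ("split", "count2"): 2, ("count1", "sink"): 8, ("count2", "sink"): 2}
tasks = ["spout", "split", "count1", "count2", "sink"]
print(greedy_placement(tasks, edges, n_nodes=2, capacity=3))
# e.g. {'spout': 0, 'split': 0, 'count1': 0, 'count2': 1, 'sink': 1}
```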
241. SEIZE: Runtime Inspection for Parallel Dataflow Systems
- Author
-
Carlo Zaniolo, Matteo Interlandi, Youfu Li, Wei Wang, and Fotis Psallidas
- Subjects
Computer science, Dataflow, Programming language, Query optimization, Scheduling (computing), Computational Theory and Mathematics, Debugging, Hardware and Architecture, Computer cluster, Signal Processing, Spark (mathematics), Programming paradigm, Overhead (computing)
- Abstract
Many Data-Intensive Scalable Computing (DISC) systems provide easy-to-use functional APIs and efficient scheduling and execution strategies, allowing users to build concise data-parallel programs. In these systems, data transformations are concealed behind the exposed APIs, and intermediate execution states are masked under dataflow transitions. Consequently, many crucial features and optimizations (e.g., debugging, data provenance, runtime skew detection) that require runtime dataflow states are not well supported. Inspired by our experience in implementing features and optimizations over DISC systems, we present SEIZE, a unified framework that enables dataflow inspection (wiretapping the data path with listening logic) in the MapReduce-style programming model. We generalize our lessons learned by providing a set of primitives defining dataflow inspection, orchestration options for different inspection granularities, and an operator decomposition and dataflow punctuation strategy for dataflow intervention. We demonstrate the generality and flexibility of the approach by deploying SEIZE in both Apache Spark and Apache Flink, and by implementing a prototype runtime query optimizer for Spark. Our experiments show that the overhead introduced by the inspection logic is most of the time negligible (less than 5 percent in Spark and 10 percent in Flink).
- Published
- 2021
- Full Text
- View/download PDF
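To make the "wiretapping the data path with listening logic" idea from the entry above concrete outside of Spark or Flink, here is a framework-free sketch: a transformation is wrapped so an inspection listener observes records flowing through it without changing the result. The wrapper, listener, and skew probe are hypothetical, not SEIZE's actual primitives.

```python
def with_inspection(transform, listener):
    """Wrap a dataflow operator so a listener 'wiretaps' its input and output
    records; the dataflow result itself is unchanged."""
    def wrapped(records):
        for rec in records:
            listener("in", rec)
            out = transform(rec)
            listener("out", out)
            yield out
    return wrapped

skew_counter = {}
def skew_listener(direction, record):
    # Toy runtime-skew probe: count how often each key is seen on the way in.
    if direction == "in":
        key = record[0]
        skew_counter[key] = skew_counter.get(key, 0) + 1

double_values = with_inspection(lambda kv: (kv[0], kv[1] * 2), skew_listener)
data = [("a", 1), ("a", 2), ("b", 3), ("a", 4)]
print(list(double_values(data)))  # [('a', 2), ('a', 4), ('b', 6), ('a', 8)]
print(skew_counter)               # {'a': 3, 'b': 1}: key 'a' is skewed
```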
242. Probeware for the Modern Era: IoT Dataflow System Design for Secondary Classrooms
- Author
-
Sherry Hsi, Seth Van Doren, and Leslie G. Bondaryk
- Subjects
Multimedia, Distributed database, Computer science, Dataflow, Interface (computing), 05 social sciences, General Engineering, 050301 education, Dataflow programming, Cloud computing, Computer Science Applications, Education, Variety (cybernetics), Systems design, 0501 psychology and cognitive sciences, The Internet, 0503 education, 050107 human factors
- Abstract
Sensor systems have the potential to make abstract science phenomena concrete for K–12 students. Internet of Things (IoT) sensor systems provide a variety of benefits for modern classrooms, creating the opportunity for global data production, orienting learners to the opportunities and drawbacks of distributed sensor and control systems, and reducing classroom hardware burden by allowing many students to "listen" to the same data stream. To date, few robust IoT classroom systems have emerged, partially due to a lack of appropriate curricula and student-accessible interfaces, and partially due to a lack of classroom-compliant server technology. In this article, we present an architecture and sensor kit system that addresses issues of sensor ubiquity, acquisition clarity, data transparency, reliability, and security. The system has a dataflow programming interface to support both science practices and computational data practices, exposing the movement of data through programs and data files. The IoT Dataflow System supports authentic uses of computational tools for data production through this distributed cloud-based system, overcoming a variety of implementation challenges specific to making programs run for arbitrary durations on a variety of sensors. In practice, this system provides a number of unique yet unexplored educational opportunities. Early results from research conducted in a high school classroom show promise for Dataflow as a valuable learning technology.
- Published
- 2021
- Full Text
- View/download PDF
243. SymPas: Symbolic Program Slicing
- Author
-
Ying-Zhou Zhang
- Subjects
Computer science, Programming language, Dataflow, Context (language use), Slicing, Computer Science Applications, Theoretical Computer Science, Computational Theory and Mathematics, Hardware and Architecture, Virtual machine, Program Dependence Graph, Benchmark (computing), Program slicing, Graph (abstract data type), Software
- Abstract
Program slicing is a technique for simplifying programs by focusing on selected aspects of their behavior. Current mainstream static slicing methods operate on the program dependence graph (PDG) or the system dependence graph (SDG), but these convenient graph representations may be too expensive for some users. In this paper we study a lightweight approach to static program slicing, called Symbolic Program Slicing (SymPas), which works as a dataflow analysis on LLVM (low-level virtual machine). In our SymPas approach, slices are stored in symbolic form rather than obtained by re-analyzing procedures (cf. procedure summaries). Instead of re-analyzing a procedure multiple times to find its slices for each calling context, we calculate a single symbolic slice which can be instantiated at call sites, avoiding re-analysis; SymPas is implemented with LLVM to perform slicing on LLVM intermediate representation (IR). For comparison, we systematically adapt IFDS (interprocedural finite distributive subset) analysis and the SDG-based slicing method (SDG-IFDS) to statically slice IR programs. Evaluated on open-source and benchmark programs, our backward SymPas shows a factor-of-6 reduction in time cost and a factor-of-4 reduction in space cost compared with backward SDG-IFDS, and is thus more efficient. In addition, the results show that, after studying slices from 66 programs ranging up to 336,800 IR instructions in size, SymPas is highly size-scalable.
- Published
- 2021
- Full Text
- View/download PDF
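SymPas itself works interprocedurally on LLVM IR; purely as a toy to illustrate what a backward slice is, the sketch below slices a straight-line program by following def-use chains back from a criterion variable. It is not the SymPas algorithm, and the statement tuples are made up.

```python
def backward_slice(stmts, criterion):
    """stmts: list of (defined_var, used_vars) in program order.
    Return the indices of statements that can affect `criterion`."""
    relevant, in_slice = {criterion}, []
    for idx in range(len(stmts) - 1, -1, -1):     # walk backwards
        defined, used = stmts[idx]
        if defined in relevant:
            in_slice.append(idx)
            relevant.discard(defined)
            relevant |= set(used)
    return sorted(in_slice)

# 0: a = input(); 1: b = input(); 2: c = a + 1; 3: d = b * 2; 4: e = c + a
program = [("a", []), ("b", []), ("c", ["a"]), ("d", ["b"]), ("e", ["c", "a"])]
print(backward_slice(program, "e"))  # [0, 2, 4]: the statements about b and d are irrelevant
```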
244. Code-size-aware Scheduling of Synchronous Dataflow Graphs on Multicore Systems
- Author
-
Rizos Sakellariou and Mingze Ma
- Subjects
Dataflow, Heuristic (computer science), Computer science, 02 engineering and technology, Parallel computing, Code size, 020202 computer hardware & architecture, Scheduling (computing), Reduction (complexity), Hardware and Architecture, 0202 electrical engineering, electronic engineering, information engineering, Multicore systems, 020201 artificial intelligence & image processing, Throughput (business), Software, Digital signal processing
- Abstract
Synchronous dataflow graphs are widely used to model digital signal processing and multimedia applications. Self-timed execution is an efficient methodology for the analysis and scheduling of synchronous dataflow graphs. In this article, we propose a communication-aware self-timed execution approach to the problem of scheduling synchronous dataflow graphs on multicore systems with communication delays. Based on this approach, four communication-aware scheduling algorithms are proposed, using different allocation rules. Furthermore, a code-size-aware mapping heuristic is proposed and used jointly with the proposed scheduling algorithms to reduce the code size of SDFGs on multicore systems. The proposed scheduling algorithms are experimentally evaluated and found to perform better than existing algorithms in terms of throughput and runtime for several applications. The experiments also show that the proposed code-size-aware mapping approach can achieve significant code size reduction with limited throughput degradation in most cases.
- Published
- 2021
- Full Text
- View/download PDF
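Self-timed execution, as referenced in the entry above, simply fires an actor as soon as all of its input edges hold enough tokens; the minimal simulator below captures that rule only (no communication delays, mapping, or code-size model). The graph and names are illustrative.

```python
def self_timed_run(edges, firings, max_steps=100):
    """edges: list of [src, dst, prod, cons, tokens] (token count is mutable).
    Fire any enabled actor until each actor meets its firing budget."""
    fired = {a: 0 for a in firings}
    for _ in range(max_steps):
        progressed = False
        for actor, budget in firings.items():
            if fired[actor] >= budget:
                continue
            inputs = [e for e in edges if e[1] == actor]
            if all(e[4] >= e[3] for e in inputs):   # enough tokens on every input
                for e in inputs:
                    e[4] -= e[3]                    # consume
                for e in edges:
                    if e[0] == actor:
                        e[4] += e[2]                # produce
                fired[actor] += 1
                progressed = True
        if not progressed:
            break
    return fired

# A produces 2 tokens per firing, B consumes 3; repetition vector {A: 3, B: 2}.
edges = [["A", "B", 2, 3, 0]]
print(self_timed_run(edges, firings={"A": 3, "B": 2}))  # {'A': 3, 'B': 2}
```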
245. Cooperative Coevolution-based Design Space Exploration for Multi-mode Dataflow Mapping
- Author
-
Ke Tang, Xiaofen Lu, Bo Yuan, and Xin Yao
- Subjects
Theoretical computer science, Cooperative coevolution, Fitness approximation, Design space exploration, Dataflow, Computer science, Population, 02 engineering and technology, 020202 computer hardware & architecture, Hardware and Architecture, Genetic algorithm, 0202 electrical engineering, electronic engineering, information engineering, Graph (abstract data type), 020201 artificial intelligence & image processing, Electronic design automation, Software
- Abstract
Some signal processing and multimedia applications can be specified by synchronous dataflow (SDF) models. The problem of mapping an SDF onto a given set of heterogeneous processors is known to be NP-hard and has been widely studied in the design automation field. However, modern embedded applications are becoming increasingly complex, with dynamic behavior that changes over time. As a significant extension of the SDF, the multi-mode dataflow (MMDF) model has been proposed to specify such an application with a finite number of behaviors (or modes), where each mode is represented by an SDF graph. The multiprocessor mapping of an MMDF is far more challenging, as the design space grows with the number of modes. Instead of using a traditional genetic algorithm (GA)-based design space exploration (DSE) method that encodes the design space as a whole, this article proposes a novel cooperative co-evolutionary genetic algorithm (CCGA)-based framework that explores the design space efficiently through a problem-specific decomposition strategy in which the node-mapping solutions for each individual mode are assigned to a separate population. In addition, a problem-specific local search operator is introduced as a supplement to the global search of the CCGA to further improve the search efficiency of the whole framework. Furthermore, a fitness approximation method and a hybrid fitness evaluation strategy are applied to significantly reduce the time spent on fitness evaluation. The experimental studies demonstrate the advantage of the proposed DSE method over the previous GA-based method: it obtains results of 2x to 3x better quality in one-half to one-third of the optimization time.
- Published
- 2021
- Full Text
- View/download PDF
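As a rough sketch of the decomposition idea described in the entry above (one sub-population per mode, evaluated jointly), the code below co-evolves per-mode processor mappings with random mutation against a toy cost function. It omits crossover, the local search operator, and fitness approximation, and every number in it is invented.

```python
import random

random.seed(0)

def evaluate(mappings, mode_costs):
    """Joint fitness of one mapping per mode: here, the worst per-mode cost."""
    return max(mode_costs[m](mp) for m, mp in enumerate(mappings))

def ccga(n_modes, n_tasks, n_procs, mode_costs, pop_size=8, gens=50):
    # One sub-population per mode; an individual is a task -> processor mapping.
    pops = [[[random.randrange(n_procs) for _ in range(n_tasks)]
             for _ in range(pop_size)] for _ in range(n_modes)]
    best = [p[0] for p in pops]        # collaborators: current best of each mode
    for _ in range(gens):
        for m in range(n_modes):
            for ind in pops[m]:
                trial = ind[:]
                trial[random.randrange(n_tasks)] = random.randrange(n_procs)
                # Evaluate the trial in the context of the other modes' best mappings.
                ctx_old = best[:m] + [ind] + best[m + 1:]
                ctx_new = best[:m] + [trial] + best[m + 1:]
                if evaluate(ctx_new, mode_costs) <= evaluate(ctx_old, mode_costs):
                    ind[:] = trial
            best[m] = min(pops[m],
                          key=lambda i: evaluate(best[:m] + [i] + best[m + 1:], mode_costs))
    return best, evaluate(best, mode_costs)

# Toy per-mode cost: load imbalance of the mapping across processors.
def imbalance(mapping, n_procs=3):
    loads = [mapping.count(p) for p in range(n_procs)]
    return max(loads) - min(loads)

print(ccga(n_modes=2, n_tasks=9, n_procs=3, mode_costs=[imbalance, imbalance]))
```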
246. BOND
- Author
-
Zhichao Cao, Xiaolong Zheng, Wei Gong, and Qiang Ma
- Subjects
Computer Networks and Communications, Computer science, Dataflow, Node (networking), 020206 networking & telecommunications, 02 engineering and technology, Data loss, Bottleneck, 020204 information systems, 0202 electrical engineering, electronic engineering, information engineering, Network performance, Routing (electronic design automation), Hidden Markov model, Wireless sensor network, Computer network
- Abstract
In a large-scale wireless sensor network, hundreds or thousands of sensors sample and forward data back to the sink periodically. In two real outdoor deployments, GreenOrbs and CitySee, we observed that some bottleneck nodes strongly impact other nodes' data collection and thus degrade the performance of the whole network. To assess the importance of a node in the data collection process, the system manager needs to understand the interactive behaviors between parent and child nodes. We therefore present a management tool, BOND (BOttleneck Node Detector), which introduces the concept of Node Dependence to characterize how much a node relies on each of its parent nodes, models the routing process as a Hidden Markov Model, and uses a machine learning approach to learn the state transition probabilities of this model. Moreover, BOND can predict the network dataflow when nodes are added or removed, helping to avoid data loss and flow congestion during network redeployment. We implement BOND on real hardware and deploy it in an outdoor network system. Extensive experiments show that Node Dependence indeed helps to uncover hidden bottleneck nodes in the network, and that BOND infers Node Dependence with an average accuracy of more than 85%.
- Published
- 2021
- Full Text
- View/download PDF
247. High-Level Tools for Translation of C-Applications into Applications in Dataflow Language COLAMO
- Author
-
A.A. Gulenok, Ilya I. Levin, S.A. Dudko, V.A. Gudkov, Alexey I. Dordopulo, and A. V. Bovkun
- Subjects
Computer science, Dataflow, Programming language, Translation (geometry)
- Published
- 2021
- Full Text
- View/download PDF
248. SNAP: An Efficient Sparse Neural Acceleration Processor for Unstructured Sparse Deep Neural Network Inference
- Author
-
Ching-En Lee, Chester Liu, Jie-Fang Zhang, Zhengya Zhang, Yakun Sophia Shao, and Stephen W. Keckler
- Subjects
Pointwise, Artificial neural network, Dataflow, Computer science, Deep learning, 020208 electrical & electronic engineering, 02 engineering and technology, Chip, Computational science, Convolution, Reduction (complexity), CMOS, 0202 electrical engineering, electronic engineering, information engineering, Artificial intelligence, Electrical and Electronic Engineering, Throughput (business)
- Abstract
Recent developments in deep neural network (DNN) pruning introduce data sparsity that enables deep learning applications to run more efficiently on resource- and energy-constrained hardware platforms. However, these sparse models require specialized hardware structures to exploit the sparsity to the full extent for storage, latency, and efficiency improvements. In this work, we present the sparse neural acceleration processor (SNAP) to exploit unstructured sparsity in DNNs. SNAP uses parallel associative search to discover valid weight (W) and input activation (IA) pairs from compressed, unstructured, sparse W and IA data arrays. The associative search allows SNAP to maintain a 75% average compute utilization. SNAP follows a channel-first dataflow and uses a two-level partial sum (psum) reduction dataflow to eliminate access contention at the output buffer and cut the psum writeback traffic by 22x compared with state-of-the-art DNN accelerator designs. SNAP's psum reduction dataflow can be configured in two modes to support general convolution (CONV) layers, pointwise CONV, and fully connected layers. A prototype SNAP chip is implemented in a 16-nm CMOS technology. The 2.3-mm² test chip is measured to achieve a peak effectual efficiency of 21.55 TOPS/W (16 b) at 0.55 V and 260 MHz for CONV layers with 10% weight and activation densities. Operating on a pruned ResNet-50 network, the test chip achieves a peak throughput of 90.98 frames/s at 0.80 V and 480 MHz, dissipating 348 mW.
- Published
- 2021
- Full Text
- View/download PDF
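The chip described above performs this matching with parallel associative search in hardware; as a purely software analogue, the sketch below finds valid weight/activation pairs from two compressed sparse vectors by intersecting their index lists and accumulating a partial sum. The formats and names are illustrative, not SNAP's.

```python
def sparse_dot(w_idx, w_val, a_idx, a_val):
    """Dot product of two compressed sparse vectors (sorted index lists):
    only 'valid pairs' with matching indices contribute, as in sparse DNN inference."""
    i = j = 0
    psum = 0.0
    pairs = []
    while i < len(w_idx) and j < len(a_idx):
        if w_idx[i] == a_idx[j]:
            pairs.append((w_idx[i], w_val[i], a_val[j]))
            psum += w_val[i] * a_val[j]
            i += 1
            j += 1
        elif w_idx[i] < a_idx[j]:
            i += 1
        else:
            j += 1
    return pairs, psum

# Toy sparse vectors: nonzero weights at channels 2, 7, 9; activations at 2, 9, 15.
pairs, psum = sparse_dot([2, 7, 9], [0.5, -1.0, 2.0], [2, 9, 15], [4.0, 3.0, 1.0])
print(pairs)  # [(2, 0.5, 4.0), (9, 2.0, 3.0)]
print(psum)   # 8.0
```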
249. An efficient scheduling algorithm for dataflow architecture using loop-pipelining
- Author
-
Li Wenming, Meng Wu, Xiaochun Ye, Da Wang, Li Yi, Hao Zhang, Dongrui Fan, and Rui Xue
- Subjects
Loop optimization, Information Systems and Management, Multicast, Dataflow, Computer science, Node (networking), Instruction scheduling, Supercomputer, Computer Science Applications, Theoretical Computer Science, Scheduling (computing), Network congestion, Network on a chip, Software, Computer architecture, Artificial Intelligence, Control and Systems Engineering, Dataflow architecture
- Abstract
Dataflow architectures have native advantages in achieving high instruction parallelism and power efficiency for today's emerging applications such as high-performance computing and deep neural networks. In dataflow computing, the execution of instructions is driven by data, so the data transfer efficiency of the network on chip (NoC) is a key factor affecting performance. However, NoC performance degrades with the increasing use of multicast communication in many applications. Existing instruction scheduling algorithms for dataflow architectures do not optimize the multicast communication between an instruction and its successor instructions, so the routing paths of many multicast packets have forks, which waste bandwidth and can cause network congestion. We propose a sharing path awareness (SPA) algorithm to optimize multicast communication in dataflow architectures. Through the instruction scheduler, the algorithm shares the routing paths from an instruction to its child nodes to reduce wasted NoC bandwidth. For applications using software iteration, we further add loop optimization to the SPA algorithm to fully exploit instruction-level parallelism. Compared with the state-of-the-art algorithm, the SPA algorithm achieves a 20.21% average performance improvement and a 15.11% reduction in energy consumption on our experimental workloads.
- Published
- 2021
- Full Text
- View/download PDF
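A simplified illustration of the path-sharing idea in the entry above (not the SPA scheduler): on a mesh NoC with deterministic XY routing, a multicast from one instruction to several successors only pays once for the hops that the destination routes share, instead of sending separate copies along the common prefix. The coordinates and the routing function are assumptions.

```python
def xy_route(src, dst):
    """Deterministic XY routing on a mesh: list of hops (links) from src to dst."""
    hops, (x, y) = [], src
    while x != dst[0]:
        nx = x + (1 if dst[0] > x else -1)
        hops.append(((x, y), (nx, y)))
        x = nx
    while y != dst[1]:
        ny = y + (1 if dst[1] > y else -1)
        hops.append(((x, y), (x, ny)))
        y = ny
    return hops

def multicast_links(src, dsts, share_paths):
    """Count link traversals for a multicast; with path sharing, a link used by
    several destination routes is traversed only once."""
    all_hops = [h for d in dsts for h in xy_route(src, d)]
    return len(set(all_hops)) if share_paths else len(all_hops)

src, dsts = (0, 0), [(2, 1), (1, 3)]
print(multicast_links(src, dsts, share_paths=False))  # 7 hops as separate unicasts
print(multicast_links(src, dsts, share_paths=True))   # 6 hops once the common prefix is shared
```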
250. A Technology-Scalable Architecture for Fast Clocks and High ILP
- Author
-
Sankaralingam, Karthikeyan, Nagarajan, Ramadass, Burger, Doug, Keckler, Stephen W., Lee, Gyungho, editor, and Yew, Pen-Chung, editor
- Published
- 2001
- Full Text
- View/download PDF