9,780 results for "Central processing unit"
Search Results
2. Power Optimization in Wireless Sensor Network Using VLSI Technique on FPGA Platform.
- Author
-
Leelakrishnan, Saranya and Chakrapani, Arvind
- Abstract
The demand for high-performance wireless sensor networks (WSN) is increasing, and their power requirements threaten network lifetime. Routing methods alone cannot optimize power consumption, so a VLSI-based power optimization technique is proposed in this article. Different elements of a WSN, such as sensor nodes, modulation schemes, and packet data transmission, influence energy usage; a WSN power study shows that lowering the energy usage of the sensor nodes is critical. This manuscript proposes a power optimization model for wireless sensor networks (POM-WSN). The proposed system shows how to build and execute a power-saving strategy for WSNs using a customized collaborative unit with parallel processing capabilities on an FPGA (Field Programmable Gate Array) and a smart power component. The customizable collaboration unit applies specialized hardware to adapt operating system speed and offloads it to a soft Intel core, reducing the OS (Operating System) and central processing unit (CPU) overhead associated with deploying processor-based IoT (Internet of Things) devices. The smart power unit controls the soft CPU’s clock and physical peripherals, putting them in the right state depending on the hardware requirements of the tasks being executed; it also adjusts the supply amplitude and current according to the command signal from the collaborative custom unit. The efficiency and energy usage of the FPGA-based energy-saver approach for sensor nodes are compared with those of processor-based WSN node implementations. Using the programmable FPGA architecture, the research seeks to build effective power-saving approaches for WSNs. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
3. Performance Analysis of Multicore Processor Using FOFO-Based Approximate Compatible ALU.
- Author
-
Senthilmurugan, S. and Gunaseelan, K.
- Subjects
- *
CENTRAL processing units , *MICROPROCESSORS , *MULTICORE processors , *APPROXIMATION error , *ELECTRONIC equipment - Abstract
Modern electronic devices are supported by multicore processors, which have become an unavoidable part of recent technology. The Arithmetic Logic Unit (ALU) is one of the indispensable parts of the Central Processing Unit (CPU) and performs most of its operations. Operations such as multiplication and exponentiation consume more power than other common operations, and power management is one of the major challenges in processor design. To maximize speed and reduce power consumption, a new Multiplier and ALU technique incorporated into a multicore processor is proposed. The proposed Approximate Compatible ALU (ACALU) is designed to perform the specified power-hungry operations using an advanced multiplication technique called the First One First Operand (FOFO) method. The FOFO method selects eight bits, after which error detection and error approximation are performed to obtain an accurate output. The ACALU performs these operations in an Approximate Computing Mode that uses only eight bits, thus reducing power consumption. Results of several analyses between multipliers, ALUs, and microprocessors are provided. As the multiplier and ALU are optimized, the multicore processor is also enhanced. The proposed multicore processor is compared with an AMD processor and a recent Intel processor, and the comparison results show that the proposed technique is 95.2% and 97.8% more efficient than the existing processors. Thus, power consumption is minimized and performance speed is increased in the proposed multicore processor. [ABSTRACT FROM AUTHOR] (A minimal sketch of the leading-one truncation idea follows this entry.)
- Published
- 2024
- Full Text
- View/download PDF
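The FOFO multiplier described above selects eight bits per operand starting from the leading one. As a rough illustration of that general class of technique (a leading-one dynamic-truncation multiply, not necessarily the authors' exact selection or error-approximation logic), here is a minimal Python sketch:

```python
def approx_mul(a: int, b: int, k: int = 8) -> int:
    """Approximate unsigned multiply keeping only the k bits that start at
    each operand's leading one (a generic dynamic-truncation scheme; the
    exact FOFO selection and error-approximation rules may differ)."""
    def truncate(x: int):
        if x == 0:
            return 0, 0
        msb = x.bit_length() - 1          # position of the first (leading) one
        shift = max(0, msb - (k - 1))     # drop everything below the top k bits
        return x >> shift, shift

    ta, sa = truncate(a)
    tb, sb = truncate(b)
    return (ta * tb) << (sa + sb)         # rescale the truncated product

if __name__ == "__main__":
    a, b = 53817, 40962
    exact, approx = a * b, approx_mul(a, b)
    print(exact, approx, abs(exact - approx) / exact)   # small relative error
```

Because only the top eight significant bits of each operand enter the multiplier, the datapath width (and hence switching activity) shrinks while the relative error stays bounded.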
4. An original thermoelectric CPU cooling system controlled by artificial intelligence (Yapay zekâ tarafından kontrol edilen özgün bir termoelektrik CPU soğutma sistemi).
- Author
-
Umut, İlhan and Akal, Dinçer
- Subjects
- *
CENTRAL processing units , *ARTIFICIAL intelligence , *HEAT convection , *HEAT conduction , *RANDOM forest algorithms , *K-nearest neighbor classification - Abstract
Due to excessive temperature rise in the Central Processing Unit (CPU), computers shut down and system damage occurs over time. In this study, a new thermoelectric cooling system is designed to reduce the CPU temperature. The cooling system is built around a thermoelectric module and removes excess heat by conduction and convection, exploiting the temperature difference between the thermoelectric cooler added to the system and the CPU. Since the temperature of the thermoelectric cooler is always lower than the CPU temperature, effective cooling is provided. A dedicated electronic circuit and software were developed to control the cooling unit. Three artificial intelligence models (artificial neural network, random forest, and k-nearest neighbour) were created to dynamically control the additional cooling system, and their performance was compared. The artificial intelligence determines the power and fan speed of the thermoelectric cooling system by evaluating all parameters (such as CPU frequency, voltage, and the number of processes) rather than a single CPU load or temperature value. While the CPU temperature was 41°C at maximum load, the designed thermoelectric cooling system reduced it to 31°C. All methods achieved high classification success in training: the artificial neural network reached 97.973% classification accuracy, while the random forest and k-nearest neighbour models reached 97.297% and 96.306%, respectively. [ABSTRACT FROM AUTHOR] (A toy sketch of the classification-based control idea follows this entry.)
- Published
- 2024
- Full Text
- View/download PDF
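The controller described above classifies CPU telemetry into a cooling-effort level. A minimal sketch of that idea with a k-nearest-neighbour classifier; the feature names, the synthetic labelling rule, and all numbers here are illustrative assumptions, not the authors' dataset or model:

```python
# Toy sketch: classify CPU telemetry into a TEC power / fan-speed level.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# features: [CPU load %, frequency GHz, core voltage V, process count] (assumed)
X = rng.uniform([0, 0.8, 0.7, 50], [100, 4.5, 1.4, 400], size=(1000, 4))
# label 0/1/2 = low/medium/high cooling effort (toy rule, for illustration only)
heat = 0.6 * X[:, 0] / 100 + 0.3 * (X[:, 1] / 4.5) + 0.1 * (X[:, 2] / 1.4)
y = np.digitize(heat, [0.4, 0.7])

clf = KNeighborsClassifier(n_neighbors=5).fit(X[:800], y[:800])
print("hold-out accuracy:", clf.score(X[800:], y[800:]))
print("suggested cooling level:", clf.predict([[95, 4.2, 1.3, 310]])[0])
```

The paper's controller additionally sets the thermoelectric module power and fan speed from the predicted level; that actuation side is not modelled here.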
5. gem5-accel: A Pre-RTL Simulation Toolchain for Accelerator Architecture Validation.
- Author
-
Vieira, Joao, Roma, Nuno, Falcao, Gabriel, and Tomas, Pedro
- Abstract
Attaining the performance and efficiency levels required by modern applications often requires the use of application-specific accelerators. However, writing synthesizable Register-Transfer Level code for such accelerators is a complex, expensive, and time-consuming process, which is cumbersome during early architecture development phases. To tackle this issue, a pre-synthesis simulation toolchain is herein proposed that facilitates the early architectural evaluation of complex accelerators integrated with multi-level memory hierarchies. To demonstrate its usefulness, the proposed gem5-accel is used to model a tensor accelerator based on Gemmini, showing that it can successfully anticipate the results of complex hardware accelerators executing deep neural networks. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
6. Design of Intelligent Window Dwelling System Based on Multi Sensor Fusion
- Author
-
Ding, Simin, Wang, Gang, Sun, Lihui, Angrisani, Leopoldo, Series Editor, Arteaga, Marco, Series Editor, Chakraborty, Samarjit, Series Editor, Chen, Jiming, Series Editor, Chen, Shanben, Series Editor, Chen, Tan Kay, Series Editor, Dillmann, Rüdiger, Series Editor, Duan, Haibin, Series Editor, Ferrari, Gianluigi, Series Editor, Ferre, Manuel, Series Editor, Jabbari, Faryar, Series Editor, Jia, Limin, Series Editor, Kacprzyk, Janusz, Series Editor, Khamis, Alaa, Series Editor, Kroeger, Torsten, Series Editor, Li, Yong, Series Editor, Liang, Qilian, Series Editor, Martín, Ferran, Series Editor, Ming, Tan Cher, Series Editor, Minker, Wolfgang, Series Editor, Misra, Pradeep, Series Editor, Mukhopadhyay, Subhas, Series Editor, Ning, Cun-Zheng, Series Editor, Nishida, Toyoaki, Series Editor, Oneto, Luca, Series Editor, Panigrahi, Bijaya Ketan, Series Editor, Pascucci, Federica, Series Editor, Qin, Yong, Series Editor, Seng, Gan Woon, Series Editor, Speidel, Joachim, Series Editor, Veiga, Germano, Series Editor, Wu, Haitao, Series Editor, Zamboni, Walter, Series Editor, Zhang, Junjie James, Series Editor, Tan, Kay Chen, Series Editor, and Deng, Zhidong, editor
- Published
- 2023
- Full Text
- View/download PDF
7. Gender Classification Using CNN Transfer Learning and Fine-Tuning
- Author
-
Mustapha, Muhammad Firdaus, Mohamad, Nur Maisarah, Ab Hamid, Siti Haslini, Xhafa, Fatos, Series Editor, Wah, Yap Bee, editor, Berry, Michael W., editor, Mohamed, Azlinah, editor, and Al-Jumeily, Dhiya, editor
- Published
- 2023
- Full Text
- View/download PDF
8. Computer Systems
- Author
-
LaMeres, Brock J.
- Published
- 2023
- Full Text
- View/download PDF
9. Efficient AI Reasoning Model Based on FPGA.
- Author
-
Ran, Junyi
- Subjects
FIELD programmable gate arrays ,ARTIFICIAL intelligence ,ADDITION (Mathematics) - Abstract
Artificial intelligence inference models based on artificial neural networks are currently among the most effective mathematical models, but they are built at large scale: a single inference requires many multiplication and addition operations, and establishing an effective inference model can cost tens of times the computation of traditional algorithms. This article focuses on heterogeneous acceleration with FPGAs (Field Programmable Gate Arrays), looking for a method that minimizes system changes and migration effort while preserving the flexibility of FPGA programmability. On this foundation, a fast interface for FPGA-based computation was designed. A distinguishing feature of this work is that the FPGA acceleration platform can be combined with ordinary computers or servers while still making effective use of traditional computer software and toolchains. The article compares the computation time of the inference model with that of a CPU: in the scaled computation-logic scenario designed here, the accelerator reaches the same computing power as the CPU at about 70,000 multiplications. The experimental results indicate that the CPU's computational cost grows faster than the inference model's, so the larger the amount of data to be processed, the more significant the acceleration effect of the inference model. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
10. Energy-Efficient Federated Learning With Resource Allocation for Green IoT Edge Intelligence in B5G
- Author
-
Adeb Salh, Razali Ngah, Lukman Audah, Kwang Soon Kim, Qazwan Abdullah, Yahya M. Al-Moliki, Khaled A. Aljaloud, and Hairul Nizam Talib
- Subjects
Internet-of-things ,federated learning ,energy consumption ,edge nodes ,central processing unit ,Electrical engineering. Electronics. Nuclear engineering ,TK1-9971 - Abstract
An edge intelligence-aided Internet-of-Things (IoT) network has been proposed to accelerate the response of IoT services by deploying edge intelligence near IoT devices. The transmission of data from IoT devices to the edge nodes leads to heavy traffic on the wireless connections. Federated Learning (FL) is proposed to reduce this high computational and communication burden by training the model locally on IoT devices and sharing only the model parameters with the edge nodes. This paper focuses on efficiently integrating joint edge intelligence nodes by investigating energy-efficient bandwidth allocation, Central Processing Unit (CPU) frequency for computation, transmission power optimization, and the desired level of learning accuracy, so as to minimize energy consumption while satisfying the FL time requirement for all IoT devices. The proposal optimizes the computation frequency allocation and reduces energy consumption in IoT devices by solving the bandwidth optimization problem in closed form. The remaining computation frequency allocation, transmission power allocation, and learning loss are resolved with an Alternative Direction Algorithm (ADA) to reduce energy consumption and complexity at every iteration of FL time from IoT devices to edge intelligence nodes. The simulation results indicate that the proposed ADA can adapt the central processing unit frequency and transmit power control to reduce energy consumption at the cost of a small growth in FL time. (A generic per-round energy model is sketched after this entry.)
- Published
- 2023
- Full Text
- View/download PDF
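The optimization above trades CPU frequency and transmit power against energy and FL round time. A small sketch of the kind of per-device, per-round energy/latency model such formulations typically start from; the constants and expressions are generic textbook assumptions, not the paper's exact problem:

```python
import math

def fl_round_energy(f_hz, p_tx_w, cycles_per_bit, data_bits, model_bits,
                    bandwidth_hz, chan_gain, noise_w, kappa=1e-28):
    """Per-device energy for one FL round under a generic model:
    computation energy kappa*C*f^2 plus transmission energy p*t_tx.
    All parameter values here are illustrative assumptions."""
    cycles = cycles_per_bit * data_bits
    t_cmp = cycles / f_hz
    e_cmp = kappa * cycles * f_hz ** 2            # CMOS dynamic-energy model
    rate = bandwidth_hz * math.log2(1 + p_tx_w * chan_gain / noise_w)
    t_tx = model_bits / rate                      # time to upload model parameters
    e_tx = p_tx_w * t_tx
    return e_cmp + e_tx, t_cmp + t_tx

energy, latency = fl_round_energy(f_hz=1.0e9, p_tx_w=0.2, cycles_per_bit=30,
                                  data_bits=8e6, model_bits=1e6,
                                  bandwidth_hz=1e6, chan_gain=1e-7, noise_w=1e-10)
print(f"energy ≈ {energy:.3f} J, latency ≈ {latency:.3f} s")
```

Lowering the CPU frequency cuts the quadratic computation energy but lengthens the round, which is exactly the trade-off the joint optimization balances against the FL time requirement.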
11. An Energy Consumption Benchmark for a Low-Power RISC-V Core Aimed at Implantable Medical Devices.
- Author
-
Molina-Robles, Roberto, Arnaud, Alfredo, Miguez, Matias, Gak, Joel, Chacon-Rodriguez, Alfonso, and Garcia-Ramirez, Ronny
- Abstract
In this work, Siwa, a micropower 32-bit RISC-V core aimed at implantable medical SoCs, is presented. The core was fabricated in a 180-nm CMOS-HV technology to directly drive biological stimuli circuits within the same ASIC. A complete set of power consumption measurements is presented; the core operated properly up to 30 MHz with a current consumption of 52 µA/MHz at a 1.8-V supply voltage, and less than 20 nA of leakage at room temperature. Since existing benchmarks are not fully adequate for comparing Siwa's performance with other microcontrollers used in implantable medical devices (IMDs), a simple, specific benchmark inspired by pacemaker operation was developed. The new benchmark considers both the CPU's current consumption and performance in sleep and run states, and allows broadly different CPUs and operating conditions to be compared for specific IMD applications. Siwa's performance was compared against 8- and 16-bit MCUs using this benchmark. [ABSTRACT FROM AUTHOR] (A duty-cycle current model in the spirit of such a benchmark follows this entry.)
- Published
- 2023
- Full Text
- View/download PDF
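A benchmark inspired by pacemaker operation weighs a short run burst per cardiac event against long sleep intervals. A minimal duty-cycle model of the average supply current in that spirit; the event rate, cycle count, and formula are illustrative assumptions, not the paper's benchmark definition:

```python
def average_current_ua(f_mhz, ua_per_mhz, sleep_na, run_cycles_per_event,
                       events_per_second):
    """Average supply current for a duty-cycled IMD-style workload:
    run briefly on each cardiac event, sleep otherwise (generic model)."""
    run_time_per_event = run_cycles_per_event / (f_mhz * 1e6)   # seconds
    duty = min(1.0, run_time_per_event * events_per_second)     # fraction of time awake
    i_run_ua = ua_per_mhz * f_mhz
    i_sleep_ua = sleep_na * 1e-3
    return duty * i_run_ua + (1.0 - duty) * i_sleep_ua

# Siwa-like figures from the abstract: 52 uA/MHz active, <20 nA leakage;
# 5000 cycles per event at 1 MHz and 1.2 events/s are assumed values.
print(f"{average_current_ua(1.0, 52, 20, 5000, 1.2):.3f} uA average")
```

Such a model makes the point the abstract hints at: for heavily duty-cycled implants, sleep leakage can matter as much as the headline µA/MHz figure.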
12. Heterogeneous Acceleration of HAR Applications
- Author
-
Rodriguez-Borbon, Jose M, Ma, Xiaoyin, Roy-Chowdhury, Amit K, and Najjar, Walid A
- Subjects
Affordable and Clean Energy ,Field programmable gate arrays ,Feature extraction ,Throughput ,Graphics processing units ,Histograms ,Central Processing Unit ,Energy consumption ,HAR ,HOG3D ,accelerators ,FPGAs ,GPUs ,Artificial Intelligence and Image Processing ,Electrical and Electronic Engineering ,Artificial Intelligence & Image Processing - Abstract
Human action recognition (HAR) is an important field of research that intersects with areas such as image processing, computer vision, and the design of fast algorithms, among others. HAR has several important applications, including healthcare monitoring, security and surveillance, assisted living, smart homes, and video search and indexing. Despite recent developments in the field, major challenges remain. For instance, HAR is computationally expensive: tasks such as video preprocessing, feature extraction, feature quantization, and feature classification require millions of arithmetic operations for a video sequence lasting only a few seconds. To address these problems, we propose a heterogeneous approach based on an extensive algorithmic and experimental analysis of the histogram-of-gradients application. We divide the application into four stages and evaluate each on CPU, GPU, and FPGA platforms. Our heterogeneous design combines the strengths of the FPGA and GPU platforms, achieving a .3X speedup compared with a state-of-the-art GPU while being .5X more energy efficient than other homogeneous solutions, including FPGA-based designs. Moreover, our heterogeneous HAR design using fixed-point arithmetic has accuracy comparable to HAR algorithms using single-precision floating-point arithmetic.
- Published
- 2020
13. CAMDNN: Content-Aware Mapping of a Network of Deep Neural Networks on Edge MPSoCs.
- Author
-
Heidari, Soroush, Ghasemi, Mehdi, Kim, Young Geun, Wu, Carole-Jean, and Vrudhula, Sarma
- Subjects
- *
ARTIFICIAL neural networks , *SYSTEMS on a chip , *HETEROGENEOUS computing , *LINEAR programming , *INTEGER programming , *CENTRAL processing units , *DEEP learning - Abstract
Machine Learning (ML) workloads are increasingly deployed at the edge. Enabling efficient inference execution while considering model and system heterogeneity remains challenging, especially for ML tasks built from a network of deep neural networks (DNNs). The challenge is to maximize the utilization of all available resources on the multiprocessor system on a chip (MPSoC) at the same time. This becomes even more complicated because the optimal mapping for the network of DNNs can vary with input batch sizes and scene complexity. In this paper, a holistic hierarchical scheduling framework is presented to optimize the execution time for a network of DNN models on an edge MPSoC at runtime, considering varying input characteristics. The framework consists of a local and a global scheduler. The local scheduler maps individual DNNs in the inference pipeline to the best-performing hardware unit, while the global scheduler customizes an Integer Linear Programming (ILP) solution to instantiate DNN remapping. To minimize scheduler runtime overhead, an imitation learning (IL) based scheduler is used that approximates the ILP solutions. The proposed scheduling framework (CAMDNN) was implemented on a Qualcomm Robotic RB5 platform. CAMDNN reduced execution time by up to 32% compared with heterogeneous earliest finish time, and by factors of 6.67X, 5.6X, and 2.17X compared with the CPU-only, GPU-only, and Central Queue schedulers, respectively. [ABSTRACT FROM AUTHOR] (A toy exhaustive-search mapper follows this entry.)
- Published
- 2022
- Full Text
- View/download PDF
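The local/global schedulers above assign each DNN in a pipeline to a hardware unit. A toy stand-in for that mapping step, exhaustively searching stage-to-unit assignments to minimize the bottleneck load; the latency table and unit names are made up, and CAMDNN itself uses ILP plus imitation learning rather than brute force:

```python
from itertools import product

# Hypothetical per-stage latencies (ms) on each hardware unit of an edge MPSoC.
latency = {
    "detector":   {"cpu": 42.0, "gpu": 11.0, "dsp": 19.0},
    "tracker":    {"cpu":  8.0, "gpu":  9.5, "dsp":  7.0},
    "classifier": {"cpu": 25.0, "gpu":  6.0, "dsp": 14.0},
}

def best_mapping(stages, units=("cpu", "gpu", "dsp")):
    """Exhaustively search stage->unit assignments and pick the one that
    minimises the bottleneck (max per-unit load) of the pipeline."""
    best = None
    for assignment in product(units, repeat=len(stages)):
        load = {u: 0.0 for u in units}
        for stage, unit in zip(stages, assignment):
            load[unit] += latency[stage][unit]
        cost = max(load.values())            # pipeline throughput bottleneck
        if best is None or cost < best[0]:
            best = (cost, dict(zip(stages, assignment)))
    return best

print(best_mapping(list(latency)))
```

The content-aware part of CAMDNN comes from re-running this kind of decision as batch size and scene complexity change the latency table at runtime.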
14. ParaX : Bandwidth-Efficient Instance Assignment for DL on Multi-NUMA Many-Core CPUs.
- Author
-
Zhang, Yiming, Yin, Lujia, Li, Dongsheng, Peng, Yuxing, and Lu, Kai
- Subjects
- *
NATURAL language processing , *IMAGE recognition (Computer vision) , *DEEP learning , *CENTRAL processing units - Abstract
Commercial clouds now heavily use CPUs in DL (deep learning) because there are large numbers of CPUs which would otherwise sit idle during off-peak periods. Following the trend, CPU vendors have not only released high-performance many-core CPUs but also developed efficient math kernel libraries. However, current DL platforms cannot scale well to a large number of CPU cores, making many-core CPUs inefficient in DL computation. We analyze the memory access patterns of various layers and identify the root cause of the low scalability, i.e., the per-layer barriers that are implicitly imposed by current platforms which assign one single instance (i.e., one batch of input data) to a CPU. The barriers cause severe memory bandwidth contention and CPU starvation in the access-intensive layers (like activation and BN). This paper presents a novel approach called ParaX, which boosts the performance of DL on multi-NUMA (non-uniform memory access) many-core CPUs by effectively alleviating bandwidth contention and CPU starvation. Our key idea is to assign one instance to each CPU core instead of to the entire CPU, so as to remove the per-layer barriers on the executions of the many cores. ParaX designs an ultralight scheduling policy which sufficiently overlaps the access-intensive layers with the compute-intensive ones to avoid contention, and proposes a NUMA-aware gradient server mechanism for training which leverages shared memory to substantially reduce the overhead of per-iteration parameter synchronization. We have implemented ParaX on MXNet. Extensive evaluation on a two-NUMA Intel 8280 CPU shows that ParaX significantly improves the training/inference throughput for all tested models (for image recognition and natural language processing) by $1.73\times$ to $2.93\times$. [ABSTRACT FROM AUTHOR] (A minimal per-core instance-assignment sketch follows this entry.)
- Published
- 2022
- Full Text
- View/download PDF
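The core idea stated in the abstract is to hand each CPU core its own input instance rather than synchronizing all cores layer-by-layer on one batch. A minimal Python sketch of that assignment with a stand-in dense model; this is not MXNet or the paper's scheduler, just the one-instance-per-core idea:

```python
# Each worker runs one full forward pass end-to-end; there is no per-layer barrier.
import numpy as np
from multiprocessing import Pool, cpu_count

LAYERS = [np.random.rand(256, 256).astype(np.float32) for _ in range(8)]

def run_instance(seed: int) -> float:
    """Process one instance through all layers independently of other cores."""
    x = np.random.default_rng(seed).random((1, 256), dtype=np.float32)
    for w in LAYERS:
        x = np.maximum(x @ w, 0.0)          # dense layer + ReLU
    return float(x.sum())

if __name__ == "__main__":
    with Pool(processes=cpu_count()) as pool:   # one instance per core
        outputs = pool.map(run_instance, range(cpu_count()))
    print(len(outputs), "instances processed independently")
```

Because the cores never wait on each other between layers, access-intensive layers on one core can overlap with compute-intensive layers on another, which is the contention-avoidance effect ParaX exploits.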
15. Architectural Supports for Block Ciphers in a RISC CPU Core by Instruction Overloading.
- Author
-
Choi, P., Kong, W., Kim, J.-H., Lee, M.-K., and Kim, Dong Kyue
- Subjects
- *
BLOCK ciphers , *CENTRAL processing units , *REDUCED instruction set computers , *DATA encryption - Abstract
We propose a novel computer architectural concept of instruction overloading to support block ciphers. Instead of adding new instructions, we extend only the execution of some existing instructions. The proposed method allows a central processing unit core to execute different operations for the same instructions, depending on the address of the data, similar to operator overloading in object-oriented languages. We first present an extension for the AES algorithm, then we demonstrate its enhanced applicability with two further extensions supporting multiple block ciphers and hardware masking. The first extension for AES is also applicable to add/AND-rotate-XOR-based block ciphers such as SIMON. The AES and SIMON encryption speed, on this extended core, is at least doubled and is significantly less affected by memory latency. In addition, the AES encryption code requires only 18% of the memory of the previous software implementation. The second extension can further support various block ciphers defined over GF(2^8), and the SM4 encryption speed is increased by at least 182%. The third extension provides correlation power analysis (CPA) resistance with a 66.6% area overhead but almost no speed overhead, whereas a typical software anti-CPA AES implementation requires at least hundreds of times the execution time. [ABSTRACT FROM AUTHOR] (A toy software model of address-dependent overloading follows this entry.)
- Published
- 2022
- Full Text
- View/download PDF
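A toy software model of the address-dependent overloading idea: the same opcode performs its ordinary operation, except when its operand lies in a dedicated memory window, where it performs an extended cipher-oriented step. The memory map and the extended behaviour below are illustrative inventions, not the paper's ISA extension:

```python
# Toy model: one 'xor' opcode, two behaviours selected by operand address.
CRYPTO_BASE, CRYPTO_SIZE = 0x1000, 0x100
memory = {}

def rotl32(x, r):
    return ((x << r) | (x >> (32 - r))) & 0xFFFFFFFF

def exec_xor(addr: int, reg: int) -> int:
    """One overloaded 'xor' instruction."""
    word = memory.get(addr, 0)
    if CRYPTO_BASE <= addr < CRYPTO_BASE + CRYPTO_SIZE:
        # extended semantics inside the crypto window: an ARX-style
        # rotate-xor round step instead of a plain xor (illustrative only)
        return reg ^ rotl32(word, 7)
    return reg ^ word                        # ordinary semantics elsewhere

memory[0x2000] = 0x0000FFFF                  # normal data
memory[0x1008] = 0x0000FFFF                  # same value, crypto region
print(hex(exec_xor(0x2000, 0x12345678)))     # plain xor
print(hex(exec_xor(0x1008, 0x12345678)))     # overloaded behaviour
```

The point of the technique is that no new opcodes are introduced; the data placement alone selects the extended datapath, which keeps existing software toolchains usable.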
16. Exploring Synchronous Page Fault Handling.
- Author
-
Chen, Yin-Chiuan, Wu, Chun-Feng, Chang, Yuan-Hao, and Kuo, Tei-Wei
- Subjects
- *
NONVOLATILE memory , *CENTRAL processing units - Abstract
The advance of nonvolatile memory in storage technology has presented challenges in redefining the way the main memory and the storage are handled. This work is motivated by the strong demand for effective handling of page faults over ultralow-latency storage devices. In particular, we propose synchronous and asynchronous prefetching strategies to satisfy process executions with different memory demands in support of synchronous page fault handling. An adaptive CPU scheduling strategy is also proposed to cope with the needs of processes in maintaining their working sets in the main memory. Six representative benchmarks and applications were evaluated. It was shown that our strategy can save 12.33% of the total execution time and reduce page faults by 13.33%, compared with the conventional demand paging strategy, with nearly no sacrifice of process fairness. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
17. SENTunnel: Fast Path for Sensor Data Access on Automotive Embedded Systems.
- Author
-
Zheng, Rongwei, Chen, Xianzhang, Liu, Duo, Feng, Junjie, Wang, Jiapin, Ren, Ao, Wang, Chengliang, and Tan, Yujuan
- Subjects
- *
DETECTORS , *OPTICAL radar - Abstract
Emerging autonomous vehicles are equipped with multiple high-throughput sensors to enable automated driving, such as multiline lidars and high-definition cameras. Existing automotive embedded systems usually employ software stacks to receive and preprocess high-throughput sensor data, which brings high latency and CPU consumption. Most research is devoted to accelerators for data processing but ignores the latency overhead caused by sensor data access. Therefore, this article proposes SENTunnel to build a fast path from sensors to the corresponding processing units by offloading redundant software stacks into hardware. Specifically, SENTunnel builds the fast path for sensor data access to processors/accelerators through two hardware modules. First, the unified access module receives, parses, and transmits raw sensor data. Second, SENTunnel performs the necessary preprocessing of different sensor data with the preprocessor module. Based on the SENTunnel design, we implement a prototype on FPGA for delivering multiline lidar data to the processor and to a dedicated accelerator. Experimental results indicate that SENTunnel reduces the latency of the data path to processors by 55.5% and reduces the CPU usage caused by the preprocessing driver by 45.9% on average. Compared with the original and partially offloaded data paths to accelerators, SENTunnel reduces latency by 93.8% and 93%, respectively, and eliminates the CPU costs. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
18. Enabling Cell-Free Massive MIMO Systems With Wireless Millimeter Wave Fronthaul.
- Author
-
Demirhan, Umut and Alkhateeb, Ahmed
- Abstract
Cell-free massive MIMO systems have promising data rate and coverage gains. These systems, however, typically rely on a fiber-based fronthaul for the communication between the central processing unit and the distributed access points (APs), which increases the infrastructure cost and installation complexity. To address these challenges, this paper proposes two architectures for cell-free massive MIMO systems based on a wireless fronthaul operating at a higher band than the access links. These dual-band architectures ensure a high-data-rate fronthaul while reducing the infrastructure cost and enhancing deployment flexibility and adaptability. To investigate the achievable data rates with the proposed architectures, we formulate the end-to-end data rate optimization problem accounting for the various practical aspects of the fronthaul and access links. Then, we develop a low-complexity yet efficient joint beamforming and resource allocation solution for the proposed architectures based on user-centric AP grouping. With this solution, we show that the proposed architectures can achieve data rates comparable to those obtained with an optical fiber-based fronthaul under realistic assumptions on the fronthaul bandwidth, hardware constraints, and deployment scenarios. This highlights a promising path for realizing cell-free massive MIMO gains in practice while reducing the infrastructure and deployment overhead. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
19. Regularity-Based Virtualization Under the ARINC 653 Standard for Embedded Systems.
- Author
-
Dai, Guangli, Paluri, Pavan Kumar, Cheng, Albert Mo Kim, and Liu, Bozheng
- Subjects
- *
VIRTUAL machine systems , *CENTRAL processing units , *TASK performance - Abstract
In embedded real-time virtualized systems (ERTVS), the ARINC 653 standard specifies a cyclic scheduling policy to guarantee the real-time performance of tasks in multiple Virtual Machines (VMs) residing on shared hardware. Based on this policy, the Regularity-based Resource Partitioning (RRP) model defines an efficient interface specification to hierarchically partition and assign resource slices among VMs. Although this model has received plenty of attention recently, three major pieces remain missing for applying this model in ERTVS. (1) Embedded systems are more sensitive to resource utilization efficiency since this may drastically affect their deployment cost for including additional cores. Therefore, this paper proposes an optimal and an approximate RRP resource scheduler for multi-core platforms. (2) A resource reconfiguration is required when an embedded system has to switch between operating modes, resulting in the current cyclic schedule being replaced by another pre-configured and verified cyclic schedule. This paper formalizes a new One-Hop Reconfiguration (OHR) problem tailored for mode-switch-capable embedded systems and introduces a corresponding optimal solution. (3) No RRP-based toolset is currently available for embedded systems. This paper thus presents an optimized RRP toolset tailored for embedded systems. Numerous experiments are conducted to evaluate the efficacy of this toolset. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
20. Tapping into NFV Environment for Opportunistic Serverless Edge Function Deployment.
- Author
-
Zhang, Lu, Feng, Weiqi, Li, Chao, Hou, Xiaofeng, Wang, Pengyu, Wang, Jing, and Guo, Minyi
- Subjects
- *
CONFLICT management , *CENTRAL processing units - Abstract
Even with Network Function Virtualization (NFV), many commodity network servers have spare cycles. Although they are small and occur irregularly, spare cycles are well suited to deploying short-lived serverless computing functions at the network edge. In this work, we perform detailed analyses of the benefits and limitations of co-locating serverless functions on NFV-ready servers. We propose NEMO, a novel platform that enables efficient serverless edge function deployment in the NFV environment. NEMO can intelligently harvest spare cycles of network functions to warm up serverless functions and speed up function invocation in an agile manner. Besides, NEMO can judiciously manage thread conflicts in a resource-limited environment. We build a prototype of NEMO. Our thorough evaluations show that NEMO can harvest up to 41% of spare cycles and achieve about 12.5X to 25X performance improvement compared with straightforward co-location. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
21. Soft Error Effects on Arm Microprocessors: Early Estimations versus Chip Measurements.
- Author
-
Bodmann, Pablo R., Papadimitriou, George, Junior, Rubens L. Rech, Gizopoulos, Dimitris, and Rech, Paolo
- Subjects
- *
ARM microprocessors , *SOFT errors , *DATA corruption , *CENTRAL processing units , *ERROR rates , *AUTOMATIC timers - Abstract
Extensive research efforts are being carried out to evaluate and improve the reliability of computing devices, either through beam experiments or simulation-based fault injection. Unfortunately, it is still largely unclear to what extent fault injection can provide an accurate error rate estimation at early stages and whether beam experiments can be used to identify the weakest resources in a device. The importance and challenges associated with a timely yet realistic reliability evaluation grow with the increase of complexity in both the hardware domain, with the integration of different types of cores in an SoC (System-on-Chip), and the software domain, with the OS (operating system) required to take full advantage of the available resources. In this paper, we combine and analyze data gathered from extensive beam experiments (on the final physical CPU hardware) and microarchitectural fault injections (on early microarchitectural CPU models). We target a standalone Arm Cortex-A5 CPU and an Arm Cortex-A9 CPU integrated into an SoC and evaluate their reliability in bare-metal and Linux-based configurations. Combining experimental data that covers more than 18 million years of device time with the results of more than 176,000 injections, we find that both the SoC integration and the presence of the OS increase the system DUE (Detected Unrecoverable Errors) rate (for different reasons) but do not significantly impact the SDC (Silent Data Corruptions) rate, which is solely attributed to the CPU core. Our reliability analysis demonstrates that, even considering SoC integration and OS inclusion, early, pre-silicon microarchitecture-level fault injection delivers accurate SDC rate estimations and lower bounds for the DUE rates. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
22. Reduced-Precision Acceleration of Radio-Astronomical Imaging on Reconfigurable Hardware
- Author
-
Stefano Corda, Bram Veenboer, Ahsan Javed Awan, John W. Romein, Roel Jordans, Akash Kumar, Albert-Jan Boonstra, and Henk Corporaal
- Subjects
Accelerator architectures ,approximation methods ,astronomy ,central processing unit ,field programmable gate arrays ,graphics processing units ,Electrical engineering. Electronics. Nuclear engineering ,TK1-9971 - Abstract
Radio telescopes produce large volumes of data that need to be processed to obtain high-resolution sky images. This is a complex task that requires computing systems providing both high performance and high energy efficiency. Hardware accelerators such as GPUs (Graphics Processing Units) and FPGAs (Field Programmable Gate Arrays) can provide these two features and are thus an appealing option for this application. Most HPC (High-Performance Computing) systems operate in double precision (64-bit) or single precision (32-bit), and radio-astronomical imaging is no exception. With reduced-precision computing, smaller data types (e.g., 16-bit) are used to improve energy efficiency and throughput in noise-tolerant applications. We demonstrate that reduced precision can also be used to produce high-quality sky images. To this end, we analyze the gridding component (Image-Domain Gridding) of the widely-used WSClean imaging application. Gridding is typically one of the most time-consuming steps in the imaging process and, therefore, an excellent candidate for acceleration. We identify the minimum required exponent and mantissa bits for a custom floating-point data type. Then, we propose the first custom floating-point accelerator on a Xilinx Alveo U50 FPGA using High-Level Synthesis. Our reduced-precision implementation improves throughput and energy efficiency by $1.84\times$ and $2.03\times$, respectively, compared with the single-precision floating-point baseline on the same FPGA. Our solution is also $2.12\times$ faster and $3.46\times$ more energy-efficient than an Intel i9 9900k CPU (Central Processing Unit) and manages to keep up in throughput with an AMD RX 550 GPU. (A simple software emulation of such a custom format follows this entry.)
- Published
- 2022
- Full Text
- View/download PDF
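Identifying the minimum required exponent and mantissa bits can be explored in software before committing to hardware. A simple NumPy emulation that rounds values to a custom floating-point format and measures the resulting error; the bit widths and toy data are assumptions, and this is not the paper's FPGA datapath:

```python
import numpy as np

def quantize(x, exp_bits=8, man_bits=10):
    """Round values to a custom floating-point format with the given
    exponent/mantissa widths (round-to-nearest mantissa, clipped exponent).
    A crude emulation for error studies only."""
    m, e = np.frexp(x)                         # x = m * 2**e, 0.5 <= |m| < 1
    m = np.round(m * 2.0 ** man_bits) / 2.0 ** man_bits
    bias = 2 ** (exp_bits - 1) - 1
    e = np.clip(e, -bias + 1, bias)            # crude range handling
    return np.ldexp(m, e)

rng = np.random.default_rng(1)
vis = rng.normal(size=100_000) + 1j * rng.normal(size=100_000)  # toy "visibilities"
q = quantize(vis.real, 8, 10) + 1j * quantize(vis.imag, 8, 10)
err = np.sqrt(np.mean(np.abs(vis - q) ** 2)) / np.sqrt(np.mean(np.abs(vis) ** 2))
print(f"relative RMS quantisation error: {err:.2e}")
```

Sweeping `man_bits` and `exp_bits` in such an emulation is the usual way to find the smallest format whose error stays below the application's noise floor before synthesizing the corresponding arithmetic.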
23. Wide-Area Monitoring of Large Power Systems Based on Simultaneous Processing of Spatio-Temporal Data
- Author
-
Barocio, Emilio, Romero, Josue, Betancourt, Ramon, Korba, Petr, Sevilla, Felix Rafael Segundo, Haes Alhelou, Hassan, editor, Abdelaziz, Almoataz Y., editor, and Siano, Pierluigi, editor
- Published
- 2021
- Full Text
- View/download PDF
24. PySchedCL: Leveraging Concurrency in Heterogeneous Data-Parallel Systems.
- Author
-
Ghose, Anirban, Singh, Siddharth, Kulaharia, Vivek, Dokara, Lokesh, Maity, Srijeeta, and Dey, Soumyajit
- Subjects
- *
HETEROGENEOUS computing , *HIGH performance computing , *PARALLEL programming , *DEEP learning , *PROGRAMMING languages , *CENTRAL processing units - Abstract
In the past decade, the high-performance compute capabilities exhibited by heterogeneous GPGPU platforms have led to the popularity of data-parallel programming languages such as CUDA and OpenCL. Developing high-performance parallel programming solutions using such languages involves a steep learning curve due to the complexity of the underlying heterogeneous compute devices and their impact on performance. This has led to the emergence of several High Performance Computing frameworks which provide high-level abstractions for easing the development of data-parallel applications on heterogeneous platforms. However, the scheduling decisions undertaken by such frameworks only exploit coarse-grained concurrency in data-parallel applications. In this paper, we propose PySchedCL, a framework which explores fine-grained concurrency-aware scheduling decisions that harness the power of heterogeneous CPU/GPU architectures efficiently. We showcase the efficacy of such scheduling mechanisms over existing coarse-grained dynamic scheduling schemes by conducting extensive experimental evaluations for a diverse set of popular Deep Learning benchmarks. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
25. Uplink Performance of Cell-Free Massive MIMO With Multi-Antenna Users Over Jointly-Correlated Rayleigh Fading Channels.
- Author
-
Wang, Zhe, Zhang, Jiayi, Ai, Bo, Yuen, Chau, and Debbah, Merouane
- Abstract
In this paper, we investigate a cell-free massive MIMO system with both access points (APs) and user equipments (UEs) equipped with multiple antennas over jointly-correlated Rayleigh fading channels. We study four uplink implementations, from fully centralized processing to fully distributed processing, and derive their achievable spectral efficiency (SE) expressions with minimum mean-squared error successive interference cancellation (MMSE-SIC) detectors and arbitrary combining schemes. Furthermore, the global and local MMSE combining schemes are derived based on full and local channel state information (CSI) obtained under pilot contamination, which can maximize the achievable SE for the fully centralized and distributed implementation, respectively. We study a two-layer decoding implementation with an arbitrary combining scheme in the first layer and optimal large-scale fading decoding (LSFD) in the second layer. Besides, we compute novel closed-form SE expressions for the two-layer decoding implementation with maximum ratio (MR) combining. In the numerical results, we compare the SE performance for different implementation levels, combining schemes, and channel models. It is important to note that increasing the number of antennas per UE may degrade the SE performance. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
26. AIOC: An All-in-One-Card Hardware Design for Financial Market Trading System.
- Author
-
Huang, Boming, Huan, Yuxiang, Jia, Hao, Ding, Chen, Yan, Yulong, Huang, Bin, Zheng, Li-Rong, and Zou, Zhuo
- Abstract
Latency of a trading system is crucial for gaining profitability in the financial market. A traditional software-based trading system suffers from high latency in market data parsing, network communication, and operating system scheduling. The latency is even worse if algorithmic processing is used for market data analysis, especially when Deep Neural Networks (DNN) are involved. This brief presents a hardware-based trading system which achieves all key functionalities, including network communication, financial protocol parsing, and DNN processing, in one FPGA card. Running at 300 MHz, the proposed hardware design achieves more than $98\times$ acceleration over the software-based system in terms of overall latency. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
27. FPGA Acceleration of Matrix-Assembly Phase of RWG-Based MoM.
- Author
-
Topa, Tomasz, Noga, Artur, and Stefanski, Tomasz P.
- Abstract
In this letter, the field-programmable gate array (FPGA) accelerated implementation of matrix-assembly phase of the method of moments (MoM) is presented. The solution is based on a discretization of the frequency-domain mixed potential integral equation using the Rao–Wilton–Glisson basis functions and their extension to wire-to-surface junctions. To take advantage of the given hardware resources (i.e., Xilinx Alveo U200 accelerator card), nine independent processing paths/runtime efficient compute units are developed and synthesized. Numerical results provided for a quadrifilar spiral antenna mounted on a conductive handset box show that the proposed parallelization scheme performs 9.53× faster than a traditional (i.e., serial) central processing unit (CPU) MoM implementation, and about 1.67× faster than a parallel six-core CPU MoM implementation. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
28. A Real-Time Ultra-High Definition Video Decoder of AVS3 on Heterogeneous Systems.
- Author
-
Han, Xu, Pan, Xiaofei, Wang, Shiqi, Wang, Shanshe, and Gao, Wen
- Subjects
- *
VIDEO coding , *VIDEO compression , *DATA structures , *DATA transmission systems , *CENTRAL processing units - Abstract
Recent years have witnessed an exponential increase in the demand for ultra high definition (UHD) video compression. The third generation of the Audio Video Coding Standard (AVS3), also known as IEEE Standard 1857.10, is the latest audio and video coding standard developed by the China AVS working group. In AVS3, targeting UHD videos, a series of efficient coding tools have been introduced, leading to a dramatic increase in computational burden. In this scenario, real-time decoding of UHD videos becomes extremely challenging. This paper presents an improved hybrid CPU + GPU accelerated framework for AVS3 decoding. In particular, the motion vector (MV) derivation process is extracted from the entropy decoding threads on the CPU. Therefore, the dependency between threads is removed and entropy decoding can be performed efficiently by multiple threads. Regarding the GPU, we design compact data structures for transform, prediction, and in-loop filtering to reduce the burden of data transmission. A flexible information buffer supporting multi-threaded random writing is further created to coordinate the computation between CPU and GPU. Through asynchronous operations on the buffer, the computation of different computing units and the data transmission between them can be performed in parallel. With an NVIDIA GeForce RTX 2080Ti GPU and an Intel Core i7 8700K CPU, the proposed decoder achieves 151 frames per second (fps) for 4K videos and 55 fps for 8K videos in the all-intra configuration. In the random access configuration, 218 fps and 74 fps are obtained for 4K and 8K videos, respectively. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
29. A Data Layout With Good Data Locality for Single-Machine Based Graph Engines.
- Author
-
Jo, Yong-Yeon, Jang, Myung-Hwan, Kim, Sang-Wook, and Park, Sunju
- Subjects
- *
GRAPH algorithms , *ENGINES - Abstract
Graph engines have been used in many applications to handle big graphs efficiently. The majority of research on improving their performance has focused primarily on the design of efficient graph processing. This paper argues, however, that attention should also be given to graph storage design, because good storage design can improve both the CPU performance and the I/O performance of graph engines. In this paper, we propose an efficient data layout for single-machine based graph engines. We identify the common node access pattern of the graph algorithms running on single-machine based graph engines. Based on this finding, we propose the breadth-first (BF) data layout, which places nodes processed together in the same or adjacent storage space so that they can be accessed together as much as possible. The experimental results show that the BF data layout improves both CPU and I/O performance significantly in all single-machine based graph engines. [ABSTRACT FROM AUTHOR] (A small relabelling sketch of the idea follows this entry.)
- Published
- 2022
- Full Text
- View/download PDF
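The BF layout places nodes that are processed together in adjacent storage slots. A small in-memory sketch of the relabelling idea, using breadth-first visit order as the new node numbering; this illustrates the principle only, not the paper's on-disk engine format:

```python
from collections import deque

def bfs_layout(adj, root=0):
    """Relabel nodes in breadth-first visit order so that nodes processed
    together land in adjacent storage slots (assumes a connected graph)."""
    order, seen, q = [], {root}, deque([root])
    while q:
        u = q.popleft()
        order.append(u)
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                q.append(v)
    new_id = {old: new for new, old in enumerate(order)}
    # rebuild the adjacency list under the new, locality-friendly labels
    return [sorted(new_id[v] for v in adj[old]) for old in order]

graph = [[3, 4], [4], [4], [0], [0, 1, 2]]   # arbitrary original numbering
print(bfs_layout(graph))                      # neighbours cluster near each node's id
```

After relabelling, a BFS-like traversal touches mostly consecutive node ids, so both cache lines and storage pages are reused instead of being fetched repeatedly.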
30. DPU: DAG Processing Unit for Irregular Graphs With Precision-Scalable Posit Arithmetic in 28 nm.
- Author
-
Shah, Nimish, Olascoaga, Laura Isabel Galindez, Zhao, Shirui, Meert, Wannes, and Verhelst, Marian
- Subjects
ARITHMETIC ,DIRECTED acyclic graphs ,LINEAR algebra ,MACHINE learning ,OPERATING budgets ,GRAPHICS processing units ,PROBABILISTIC number theory - Abstract
Computation in several real-world applications such as probabilistic machine learning, sparse linear algebra, and robotic navigation can be modeled as irregular directed acyclic graphs (DAGs). The irregular data dependencies in DAGs pose challenges to parallel execution on general-purpose CPUs and GPUs, resulting in severe under-utilization of the hardware. This article proposes the DAG Processing Unit (DPU), a specialized processor designed for the efficient execution of irregular DAGs. The DPU is equipped with parallel compute units (CUs) that execute different subgraphs of a DAG independently. The CUs can synchronize within a cycle using a hardware-supported synchronization primitive and communicate via an efficient interconnect to a global banked scratchpad. Furthermore, a precision-scalable posit arithmetic unit is developed to enable application-dependent precision. The DPU is taped out in 28-nm CMOS, achieving speedups of $5.1\times$ and $20.6\times$ over state-of-the-art CPU and GPU implementations on DAGs of sparse linear algebra and probabilistic machine learning workloads. This performance is achieved while operating at a power budget of 0.23 W, as opposed to 55 and 98 W for the CPU and GPU, resulting in a peak efficiency of 538 GOPS/W with the DPU, which is $1350\times$ and $9000\times$ higher than the CPU and GPU, respectively. Thus, with its specialized architecture, the DPU enables low-power execution of irregular DAG workloads. [ABSTRACT FROM AUTHOR] (A toy DAG-evaluation example follows this entry.)
- Published
- 2022
- Full Text
- View/download PDF
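As a point of reference for the kind of workload the DPU targets, here is a toy irregular DAG of scalar operations evaluated in topological order. The real DPU adds subgraph partitioning across compute units, single-cycle synchronization, and posit arithmetic; none of that is modelled here:

```python
from graphlib import TopologicalSorter
import operator

# node -> (operation, operands); inputs carry a literal value instead of operands
ops = {
    "a": (None, 2.0), "b": (None, 3.0), "c": (None, 4.0),
    "d": (operator.mul, ("a", "b")),
    "e": (operator.add, ("b", "c")),
    "f": (operator.add, ("d", "e")),
}
deps = {k: (() if fn is None else args) for k, (fn, args) in ops.items()}

vals = {}
for node in TopologicalSorter(deps).static_order():   # dependencies first
    fn, args = ops[node]
    vals[node] = args if fn is None else fn(*(vals[a] for a in args))
print(vals["f"])   # (2*3) + (3+4) = 13
```

On a CPU or GPU, the scattered dependencies of such graphs leave most lanes idle; the DPU's independent compute units and fast synchronization are aimed at exactly this irregularity.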
31. High Throughput Hardware/Software Heterogeneous System for RRPN-Based Scene Text Detection.
- Author
-
Xin, Yao, Chen, Donglong, Zeng, Chongyang, Zhang, Weichen, Wang, Yi, and Cheung, Ray C. C.
- Abstract
Rotation Region Proposal Networks (RRPN) are used to generate rotated proposals with text-angle information for arbitrarily oriented scene text detection (STD). However, the computational complexity of RRPN inference is relatively high compared with other methods, which makes it difficult to deploy at scale. In this paper, the first full-stack FPGA-CPU heterogeneous system design of an RRPN-based STD algorithm is proposed. A hardware/software partitioning method is presented to analyze and split the tasks to enhance the computation efficiency of the hardware. The fast 2D Winograd algorithm and block floating point are utilized to reduce computational complexity while maintaining relatively high precision. The implementation results show that the peak performance of the MAC arrays in the proposed architecture reaches 655.4 GOPS and the energy efficiency achieves 64.9 GOPS/W. By fully exploiting the parallel and pipelined merits of the algorithms, the first hardware architectures for the skew non-maximum suppression (S-NMS) layer and the rotation region-of-interest (RRoI) pooling layer are proposed. The throughput of the proposed hardware/software heterogeneous system achieves 40 times and 1.4 times improvement compared with CPU and GPU, respectively. Moreover, the comprehensive operating expense ratio of pure CPU, GPU, and the proposed system is 80.7:2.5:1, which indicates that it is suitable for massive deployment. [ABSTRACT FROM AUTHOR] (A one-dimensional Winograd example follows this entry.)
- Published
- 2022
- Full Text
- View/download PDF
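The fast Winograd algorithm mentioned above trades multiplications for additions. The one-dimensional F(2,3) form below computes two outputs of a 3-tap filter with four multiplications instead of six; the paper applies the 2-D variant inside convolutions, so this only shows the arithmetic identity:

```python
import numpy as np

def winograd_f23(d, g):
    """Winograd minimal filtering F(2,3): two outputs of a 3-tap correlation
    from a 4-sample tile using 4 multiplications instead of 6."""
    d0, d1, d2, d3 = d
    g0, g1, g2 = g
    m1 = (d0 - d2) * g0
    m2 = (d1 + d2) * (g0 + g1 + g2) / 2
    m3 = (d2 - d1) * (g0 - g1 + g2) / 2
    m4 = (d1 - d3) * g2
    return np.array([m1 + m2 + m3, m2 - m3 - m4])

d = np.array([1.0, 2.0, 3.0, 4.0])
g = np.array([0.5, -1.0, 0.25])
direct = np.array([d[0:3] @ g, d[1:4] @ g])   # ordinary correlation
print(direct, winograd_f23(d, g))             # the two results match
```

Since multipliers dominate FPGA DSP usage and power, cutting the multiplication count per output tile is what makes the Winograd transform attractive for the MAC arrays described in the abstract.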
32. A Universal RRAM-Based DNN Accelerator With Programmable Crossbars Beyond MVM Operator.
- Author
-
Zhang, Zihan, Jiang, Jianfei, Zhu, Yongxin, Wang, Qin, Mao, Zhigang, and Jing, Naifeng
- Subjects
- *
ALGORITHMS , *GRAPHICS processing units , *CENTRAL processing units - Abstract
Resistive-RAM (RRAM)-based deep neural network (DNN) accelerators have shown great potential because they excel at the matrix–vector multiplication (MVM) operator. However, they do not benefit non-MVM operators, such as transcendental activations or elementwise operations, which often require customized CMOS circuits in conventional DNN accelerator designs. In this article, we propose a new RRAM-based DNN inference accelerator, which leverages the proposed RRAM-CORDIC and RRAM-MLP algorithms to make transcendental and elementwise operators calculable in the RRAM crossbar just like MVM. Both algorithms can exploit the higher multiply-and-accumulate (MAC) parallelism that is traditionally expensive in CMOS but efficient in the RRAM crossbar. We further propose an inter-crossbar pipelining scheme, which balances the number of crossbars for MVM and non-MVM operations and orchestrates them to pursue higher DNN computing throughput. The experimental results show that both algorithms sustain high arithmetic accuracy and deliver less than 1% DNN accuracy loss on typical inference workloads. The elimination of expensive CMOS circuits, in turn, can trade more crossbar resources in the same area to speed up performance by $1.16\times$ to $2.33\times$. With the extended operators, the RRAM-based DNN accelerator can switch crossbar functions at will and serve a diverse set of DNN models in a unified in-memory accelerator architecture. [ABSTRACT FROM AUTHOR] (A plain-software CORDIC sketch follows this entry.)
- Published
- 2022
- Full Text
- View/download PDF
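CORDIC is the shift-and-add scheme that RRAM-CORDIC maps onto crossbars to evaluate transcendental functions without multipliers. A plain-software rotation-mode sketch of the textbook algorithm; floating-point Python stands in for the fixed-point/RRAM datapath, and the crossbar mapping itself is not modelled:

```python
import math

def cordic_sin_cos(theta, iterations=24):
    """Rotation-mode CORDIC: returns (cos(theta), sin(theta)) for an angle
    in roughly [-pi/2, pi/2] using only shifts, adds, and a small table."""
    angles = [math.atan(2.0 ** -i) for i in range(iterations)]
    K = 1.0
    for i in range(iterations):
        K *= 1.0 / math.sqrt(1.0 + 2.0 ** (-2 * i))   # cumulative rotation gain
    x, y, z = K, 0.0, theta                           # pre-scale by 1/gain
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0                   # rotate toward zero residual
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * angles[i]
    return x, y

print(cordic_sin_cos(0.6), (math.cos(0.6), math.sin(0.6)))
```

Each iteration is just two scaled additions, which is why the same primitive that serves MVM can, with the right operand routing, also produce activations such as sigmoid or tanh via CORDIC-style recurrences.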
33. iCOS: A Deep Reinforcement Learning Scheme for Wireless-Charged MEC Networks.
- Author
-
Wan, Changwei, Guo, Songtao, He, Jing, Liu, Guiyan, and Zhou, Pengzhan
- Subjects
- *
REINFORCEMENT learning , *DEEP learning , *WIRELESS power transmission , *CENTRAL processing units , *MOBILE computing , *ENERGY consumption - Abstract
Computation offloading is an effective method in mobile edge computing (MEC) to relieve user equipment (UE) of its limited computation resources and battery capacity. Meanwhile, simultaneous wireless information and power transmission (SWIPT) can be applied to MEC to extend the operating time of the equipment. However, in a multi-user network environment, diverse computation task requirements and changing network channel states make it challenging to obtain offloading strategies timely and accurately. To address the issue, we propose an intelligent computation offloading scheme (iCOS) based on an enhanced priority deep deterministic policy gradient (EPDDPG) algorithm to minimize the energy consumption of all UEs by jointly optimizing the offloading decision, the central processing unit (CPU) frequency, and the power split ratio in a dynamic SWIPT-MEC network. In particular, we improve the traditional fully-connected network structure to obtain both discrete and continuous action outputs, and accelerate neural network parameter updates by using prioritized experience tuples. Furthermore, we use dynamic voltage and frequency scaling (DVFS) technology to dynamically adjust the CPU frequency of local computing, and employ SWIPT technology to balance charging and communication according to the obtained strategy. Simulation results show that the algorithm proposed in this paper can effectively reduce the energy cost of UEs and complete more computation tasks within the delay limit. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
34. Can Dynamic TDD Enabled Half-Duplex Cell-Free Massive MIMO Outperform Full-Duplex Cellular Massive MIMO?
- Author
-
Chowdhury, Anubhab, Chopra, Ribhu, and Murthy, Chandra R.
- Subjects
- *
CHANNEL estimation , *CENTRAL processing units , *GREEDY algorithms , *SIGNAL processing - Abstract
We consider a dynamic time division duplex (DTDD) enabled cell-free massive multiple-input multiple-output (CF-mMIMO) system, where each half-duplex (HD) access point (AP) is scheduled to operate in the uplink (UL) or downlink (DL) mode based on the data demands of the user equipments (UEs), with the goal of maximizing the sum UL-DL spectral efficiency (SE). We develop a new, low-complexity, greedy algorithm for the combinatorial AP scheduling problem, with an optimality guarantee theoretically established by showing that a lower bound on the sum UL-DL SE is sub-modular. We also consider pilot sequence reuse among the UEs to limit the channel estimation overhead. In CF systems, all the APs estimate the channel from every UE, making the pilot allocation problem different from the cellular case. We develop a novel algorithm that iteratively minimizes the maximum pilot contamination across the UEs. We compare the performance of our solutions, both theoretically and via simulations, against a full duplex (FD) multi-cell mMIMO system. Our results show that, due to the joint processing of the signals at the central processing unit, CF-mMIMO with dynamic HD AP scheduling significantly outperforms cellular FD-mMIMO in terms of the sum SE and the 90% likely SE. Thus, DTDD enabled HD CF-mMIMO is a promising alternative to cellular FD-mMIMO, without the cost of hardware for self-interference suppression. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
35. A Survey of Deep Learning on CPUs: Opportunities and Co-Optimizations.
- Author
-
Mittal, Sparsh, Rajput, Poonam, and Subramoney, Sreenivas
- Subjects
- *
DEEP learning , *COMPUTER architecture , *ARTIFICIAL intelligence , *CENTRAL processing units , *GRAPHICS processing units , *PARALLEL programming - Abstract
CPU is a powerful, pervasive, and indispensable platform for running deep learning (DL) workloads in systems ranging from mobile to extreme-end servers. In this article, we present a survey of techniques for optimizing DL applications on CPUs. We include the methods proposed for both inference and training and those offered in the context of mobile, desktop/server, and distributed systems. We identify the areas of strength and weakness of CPUs in the field of DL. This article will interest practitioners and researchers in the areas of artificial intelligence, computer architecture, mobile systems, and parallel computing. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
36. Numerical Assessment of Turbulence-Cascade Noise Reduction and Aerodynamic Penalties from Serrations.
- Author
-
Buszyk, Martin, Polacsek, Cyril, Le Garrec, Thomas, Barrier, Raphaël, and Bailly, Christophe
- Abstract
This work investigates innovative stator designs aiming to reduce the dominant interaction noise in aeroengines. The definition of turbulent structures is crucial for the accurate prediction of broadband noise radiation from passive treatments such as the leading-edge serrations studied here. A modified Fourier-modes-based methodology is proposed to obtain a fully three-dimensional incompressible turbulence field while taking into account periodic and wall-boundary conditions. A low-noise geometry is examined along with the reference profile on a rectilinear seven-vane cascade rig using a hybrid computational fluid dynamics/computational aeroacoustics method. Numerically assessed noise reductions from the serrated airfoils compare favorably with an analytical solution and a semi-empirical law. An overall sound power-level reduction of around 4 to 6 dB is obtained at three acoustic certification points. Finally, the aerodynamic performance is also evaluated through Reynolds-averaged Navier-Stokes computations, and an improved variant of the initial treatment is proposed, allowing for acceptable penalties at the aerodynamic design point. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
37. A Tensor Processing Framework for CPU-Manycore Heterogeneous Systems.
- Author
-
Cheng, Lin, Pan, Peitian, Zhao, Zhongyuan, Ranjan, Krithik, Weber, Jack, Veluri, Bandhav, Ehsani, Seyed Borna, Ruttenberg, Max, Jung, Dai Cheol, Ivanov, Preslav, Richmond, Dustin, Taylor, Michael B., Zhang, Zhiru, and Batten, Christopher
- Subjects
- *
OPEN source software , *CENTRAL processing units , *COMPUTER architecture , *COPROCESSORS , *ENERGY consumption - Abstract
Future CPU-manycore heterogeneous systems can provide high peak throughput by integrating thousands of simple, independent, energy-efficient cores in a single die. However, there are two key challenges to translating this high peak throughput into improved end-to-end workload performance: 1) manycore co-processors rely on simple hardware putting significant demands on the software programmer and 2) manycore co-processors use in-order cores that struggle to tolerate long memory latencies. To address the manycore programmability challenge, this article presents a dense and sparse tensor processing framework based on PyTorch that enables domain experts to easily accelerate off-the-shelf workloads on CPU-manycore heterogeneous systems. To address the manycore memory latency challenge, we use our extended PyTorch framework to explore the potential for decoupled access/execute (DAE) software and hardware mechanisms. More specifically, we propose two software-only techniques, naïve-software DAE and systolic-software DAE, along with a lightweight hardware access accelerator to further improve area-normalized throughput. We evaluate our techniques using a combination of PyTorch operator microbenchmarking and real-world PyTorch workloads running on a detailed register-transfer-level model of a 128-core manycore architecture. Our evaluation on three real-world dense and sparse tensor workloads suggests these workloads can achieve approximately $2\times$ to $6\times$ performance improvement when scaled to a future 2000-core CPU-manycore heterogeneous system compared to an 18-core out-of-order CPU baseline, while potentially achieving higher area-normalized throughput and improved energy efficiency compared to general-purpose graphics processing units. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
38. Exploring HW/SW Co-Design for Video Analysis on CPU-FPGA Heterogeneous Systems.
- Author
-
Zhang, Xiaofan, Ma, Yuan, Xiong, Jinjun, Hwu, Wen-Mei W., Kindratenko, Volodymyr, and Chen, Deming
- Subjects
- *
PYTHON programming language , *PARTICIPATORY design , *STREAMING video & television , *FIELD programmable gate arrays - Abstract
Deep neural network (DNN)-based video analysis has become one of the most essential and challenging tasks to capture implicit information from video streams. Although DNNs significantly improve the analysis quality, they introduce intensive compute and memory demands and require dedicated hardware for efficient processing. The customized heterogeneous system is one of the promising solutions with general-purpose processors (CPUs) and specialized processors (DNN Accelerators). Among various heterogeneous systems, the combination of CPU and FPGA has been intensively studied for DNN inference with improved latency and energy consumption compared to CPU + GPU schemes and with increased flexibility and reduced time-to-market cost compared to CPU + ASIC designs. However, deploying DNN-based video analysis on CPU + FPGA systems still presents challenges from the tedious RTL programming, the intricate design verification, and the time-consuming design space exploration. To address these challenges, we present a novel framework, called EcoSys, to explore co-design and optimization opportunities on CPU-FPGA heterogeneous systems for accelerating video analysis. Novel technologies include 1) a coherent memory space shared by the host and the customized accelerator to enable efficient task partitioning and online DNN model refinement with reduced data transfer latency; 2) an end-to-end design flow that supports high-level design abstraction and allows rapid development of customized hardware accelerators from Python-based DNN descriptions; 3) a design space exploration (DSE) engine that determines the design space and explores the optimized solutions by considering the targeted heterogeneous system and user-specific constraints; and 4) a complete set of co-optimization solutions, including a layer-based pipeline, a feature map partition scheme, and an efficient memory hierarchical design for the accelerator and multithreading programming for the CPU. In this article, we demonstrate our design framework to accelerate the long-term recurrent convolution network (LRCN), which analyzes the input video and output one semantic caption for each frame. EcoSys can deliver 314.7 and 58.1 frames/s by targeting the LRCN model with AlexNet and VGG-16 backbone, respectively. Compared to the multithreaded CPU and pure FPGA design, EcoSys achieves $20.6\times $ and $5.3\times $ higher throughput performance. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
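The design space exploration step described above can be pictured as a search over candidate accelerator configurations under resource constraints. The sketch below is a hypothetical, greatly simplified stand-in for such a DSE engine; the resource and throughput models are placeholders rather than EcoSys's actual cost functions.

```python
# Toy DSE loop: enumerate candidate accelerator configurations and keep the
# highest-throughput point that fits the FPGA resource budget. The resource
# and latency models are deliberately crude placeholders.
def explore(design_points, dsp_budget, bram_budget):
    best = None
    for p in design_points:
        dsp = p["pe_rows"] * p["pe_cols"]                  # toy resource model
        bram = p["tile_size"] * p["pe_rows"] // 2
        if dsp > dsp_budget or bram > bram_budget:
            continue                                       # violates constraints
        fps = p["freq_mhz"] * 1e6 * dsp / p["macs_per_frame"]
        if best is None or fps > best[0]:
            best = (fps, p)
    return best

candidates = [{"pe_rows": r, "pe_cols": c, "tile_size": t,
               "freq_mhz": 200, "macs_per_frame": 7e8}
              for r in (8, 16, 32) for c in (8, 16, 32) for t in (16, 32)]
print(explore(candidates, dsp_budget=900, bram_budget=1000))
```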
39. Clock-Gated Variable Frequency Signaling to Alleviate Power Supply Noise in a Packaged IC.
- Author
-
Bhattacharjee, Pritam, Rana, Prerna, Bhattacharyya, Bidyut K., and Majumder, Alak
- Subjects
- *
POWER resources , *CENTRAL processing units , *INTEGRATED circuits , *ELECTRIC lines , *NOISE , *MULTICORE processors - Abstract
Power supply noise (PSN) in central processing unit (CPU) cores or any integrated circuit (IC) chip is a problem of power line voltage degradation induced by the parasitic components (viz., resistance, capacitance, and inductance) across the packaging. It lowers the operating frequency of the silicon chip during high-volume manufacturing and thereby affects the overall performance. After numerous attempts to minimize the parasitic impacts of power delivery networks (PDNs) in IC packages, researchers also started to look for circuit configurations that can curb the coefficients of PSN (i.e., the instantaneous current i(t) and the current ramp di(t)/dt during OFF-to-ON switching of a packaged CPU/IC). Though the variable frequency clock (VFC) has emerged as a potential solution, its design is complex and power hungry. Hence, a novel and simple configuration of VFC embedded with leakage control transistor-based clock gating (LCT-CG) is presented in this article, where the pertaining circuits and integrated system have been designed using UMC 65 nm CMOS with a supply of 1.1 V. The performance of the design is tested on a master-slave flip-flop (MSFF) and a few prominent benchmark circuits (viz., ISCAS'89 s27, s820, s832, s1196, and s9234, ITC'99 b02, and an IEEE 754 compliant single-precision FPU) considering the PDN of an existing CPU OLGA package. System-level validation of the proposed IC on an A-Z80 CPU chip shows that average power and PSN are reduced by 30.06% and 42.63%, respectively, compared with conventional clocking. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
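For readers unfamiliar with the "coefficients of PSN" mentioned above, a common first-order way to express the supply droop across the package parasitics is shown below; the lumped R and L values are generic assumptions, not figures from the article.

```latex
% First-order model of supply droop across lumped package parasitics R and L:
v_{\mathrm{noise}}(t) \;=\; R\, i(t) \;+\; L\,\frac{di(t)}{dt}
```

Under this model, limiting the instantaneous current i(t) and its ramp di(t)/dt, for example by gating and varying the clock, directly limits the droop.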
40. Performance Analysis of Cell-Free Massive MIMO Systems in LoS/NLoS Channels.
- Author
-
Mukherjee, Sudarshan and Chopra, Ribhu
- Subjects
- *
CENTRAL processing units , *MIMO systems , *TELECOMMUNICATION systems , *WIRELESS channels - Abstract
In cellular communication systems, it is conventional to assume the absence of a line of sight (LoS) path between the users and their associated access points (APs). This assumption, however, becomes questionable in the context of recent developments in the direction of cell-free (CF) massive MIMO systems. In CF massive MIMO, the AP density is assumed to be comparable to the user density, which increases the probability that an LoS path exists between the users and their associated APs. In this paper, we analyze the performance of an uplink CF massive MIMO system with a probabilistic LoS channel model. Here, we first derive the effective statistics of this channel model, and argue that their behaviour is fundamentally different from that of conventional rich scattering channels. Utilizing these statistics, we next compare the rates achievable by CF massive MIMO systems under both stream-wise and joint decoding at the central processing unit. Following this, we also discuss centralized MMSE-based data detection to obtain a complexity/performance trade-off. Finally, using detailed Monte-Carlo simulations, we validate our analytical results and evaluate the performance of the three data detection schemes. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
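A probabilistic LoS channel model of the kind analyzed above is often written as a mixture of a Rician-type LoS state and a Rayleigh NLoS state; the notation below is one common formulation from the literature and is not taken from the article itself.

```latex
% Probabilistic LoS/NLoS channel between AP m and user k (illustrative notation):
g_{mk} \;=\;
\begin{cases}
\sqrt{\dfrac{\beta_{mk}\,\kappa_{mk}}{\kappa_{mk}+1}}\;\bar{h}_{mk}
\;+\;\sqrt{\dfrac{\beta_{mk}}{\kappa_{mk}+1}}\;h_{mk}, & \text{with probability } p_{mk}\ \text{(LoS)},\\[2mm]
\sqrt{\beta_{mk}}\;h_{mk}, & \text{with probability } 1-p_{mk}\ \text{(NLoS)},
\end{cases}
\qquad h_{mk}\sim\mathcal{CN}(0,1)
```

Here β_mk is the large-scale fading gain, κ_mk the Rician factor, and p_mk the (typically distance-dependent) LoS probability between AP m and user k.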
41. X-MAN: A Non-Intrusive Power Manager for Energy-Adaptive Cloud-Native Network Functions.
- Author
-
Xiang, Zuo, Howeler, Malte, You, Dongho, Reisslein, Martin, and Fitzek, Frank H. P.
- Abstract
Emerging microservices demand flexible low-latency processing of network functions in virtualized environments, e.g., as containerized network functions (CNFs). While ensuring highly responsive low-latency CNF processing, the computing environments should conserve energy to reduce costs. In this systems integration study, we develop and evaluate the novel XDP-Monitoring Energy-Adaptive Network Functions (X-MAN) framework for managing the CPU operational states (P-states) so as to reduce power consumption while prioritizing low-latency service. Architecturally, X-MAN consists of lightweight traffic monitors that are attached to the virtual network interfaces in kernel space for per-CNF traffic monitoring, and a power manager in user space with a global view of the CNFs on a CPU core. Algorithmically, X-MAN predicts the CPU core utilization via a hybrid of simple and weighted moving averages fed by the traffic monitors, and manages power through step-based CPU core frequency (P-state) adjustments. We evaluate X-MAN through extensive measurements in a real physical testbed operating at up to 10 Gbps. We find that X-MAN incurs significantly shorter and more consistent monitoring latencies for the CPU utilization than a state-of-the-art CPU hardware counter approach. Also, X-MAN achieves more responsive CPU core frequency adjustments and more pronounced reductions of the CPU power consumption than a state-of-the-art code instrumentation approach. We make the X-MAN source code publicly available. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
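The prediction-plus-adjustment loop summarized above can be sketched in a few lines of Python: blend a simple and a weighted moving average of recent utilization samples, then move the core frequency one P-state step at a time. The thresholds, step table, and blend weight below are illustrative assumptions, not X-MAN's tuned parameters.

```python
# Sketch of hybrid SMA/WMA utilization prediction plus step-based P-state
# adjustment. All constants are illustrative placeholders.
FREQ_STEPS_KHZ = [800_000, 1_200_000, 1_600_000, 2_000_000, 2_400_000]

def predict_utilization(samples, alpha=0.5):
    sma = sum(samples) / len(samples)                    # simple moving average
    weights = range(1, len(samples) + 1)                 # newer samples weigh more
    wma = sum(w * s for w, s in zip(weights, samples)) / sum(weights)
    return alpha * sma + (1 - alpha) * wma

def next_pstate(current_idx, samples, low=0.3, high=0.7):
    util = predict_utilization(samples)
    if util > high and current_idx < len(FREQ_STEPS_KHZ) - 1:
        return current_idx + 1                           # step up to absorb load
    if util < low and current_idx > 0:
        return current_idx - 1                           # step down to save power
    return current_idx

print(FREQ_STEPS_KHZ[next_pstate(2, [0.8, 0.9, 0.85, 0.9])])
```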
42. Advanced Parallelism of DGTD Method With Local Time Stepping Based on Novel MPI + MPI Unified Parallel Algorithm.
- Author
-
Ban, Zhen Guo, Shi, Yan, and Wang, Peng
- Subjects
- *
COMPUTER workstation clusters , *MESSAGE passing (Computer science) , *DATA transmission systems , *PARALLEL processing , *PARALLEL algorithms , *PARALLEL programming , *CENTRAL processing units - Abstract
In this communication, a novel message passing interface (MPI) parallel algorithm for the nodal discontinuous Galerkin time-domain (NDGTD) method has been developed. A unified MPI + MPI technique is introduced for extreme parallelism on a large-scale computer cluster. Through data transmission between CPU nodes using MPI persistent nonblocking two-sided communication, and direct data connection between processors in the same node via MPI shared-memory windows, a two-layered parallel architecture is implemented to minimize communication. To further accelerate the solution of multiscale problems, the local time stepping (LTS) technique is employed in the NDGTD method. A fast time step estimation method is also presented in this communication. With high overlap between information transmission and data calculation, the proposed MPI + MPI scheme overcomes the degradation of parallel efficiency of the pure MPI technique in the scenario of the LTS technique and large-scale CPU core counts. Up to 94% parallel efficiency on 6400 CPU cores is achieved for an average single-core loading of about 1700 finite elements, and an 18-times acceleration of time step estimation is obtained with the fourth-order basis function. Three practical complex examples are given to demonstrate the good performance of the proposed method. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
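The two-layer "MPI + MPI" structure described above can be approximated with mpi4py as follows: ranks on a node share data through an MPI shared-memory window, while one leader rank per node exchanges halo data via persistent nonblocking requests. The buffer sizes, ring pattern, and time-stepping loop are placeholders for the DGTD solver's actual data movement.

```python
# Rough mpi4py sketch of a two-layer MPI + MPI scheme (run with mpiexec).
from mpi4py import MPI
import numpy as np

world = MPI.COMM_WORLD
node = world.Split_type(MPI.COMM_TYPE_SHARED)            # intra-node communicator

# Layer 1: shared-memory window; every rank on the node reads/writes it
# directly, so no explicit messages are needed inside a node.
nbytes = 1024 * MPI.DOUBLE.Get_size() if node.rank == 0 else 0
win = MPI.Win.Allocate_shared(nbytes, MPI.DOUBLE.Get_size(), comm=node)
buf, _ = win.Shared_query(0)
fields = np.ndarray(buffer=buf, dtype='d', shape=(1024,))

# Layer 2: node leaders exchange halo data with persistent nonblocking requests.
leaders = world.Split(color=0 if node.rank == 0 else MPI.UNDEFINED, key=world.rank)
if leaders != MPI.COMM_NULL and leaders.size > 1:
    send_halo = np.zeros(128, dtype='d')
    recv_halo = np.zeros(128, dtype='d')
    nxt = (leaders.rank + 1) % leaders.size              # illustrative ring pattern
    prv = (leaders.rank - 1) % leaders.size
    reqs = [leaders.Send_init([send_halo, MPI.DOUBLE], dest=nxt, tag=7),
            leaders.Recv_init([recv_halo, MPI.DOUBLE], source=prv, tag=7)]
    for _ in range(10):                                  # toy time-stepping loop
        MPI.Prequest.Startall(reqs)
        # ... local DG element updates would overlap with communication here ...
        MPI.Request.Waitall(reqs)
```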
43. Enabling Homomorphically Encrypted Inference for Large DNN Models.
- Author
-
Lloret-Talavera, Guillermo, Jorda, Marc, Servat, Harald, Boemer, Fabian, Chauhan, Chetan, Tomishima, Shigeki, Shah, Nilesh N., and Pena, Antonio J.
- Subjects
- *
DYNAMIC random access memory , *MACHINE learning , *SERVICE learning , *HYBRID systems , *CENTRAL processing units - Abstract
The proliferation of machine learning services in the last few years has raised data privacy concerns. Homomorphic encryption (HE) enables inference using encrypted data but incurs 100x–10,000x memory and runtime overheads. Secure deep neural network (DNN) inference using HE is currently limited by computing and memory resources, with frameworks requiring hundreds of gigabytes of DRAM to evaluate small models. To overcome these limitations, in this paper we explore the feasibility of leveraging hybrid memory systems comprised of DRAM and persistent memory. In particular, we explore the recently released Intel® Optane™ PMem technology and the Intel® HE-Transformer nGraph® to run large neural networks such as MobileNetV2 (in its largest variant) and ResNet-50 for the first time in the literature. We present an in-depth analysis of the efficiency of the executions with different hardware and software configurations. Our results conclude that DNN inference using HE exhibits access patterns that are friendly to this memory configuration, yielding efficient executions. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
44. Detecting Outlier Machine Instances Through Gaussian Mixture Variational Autoencoder With One Dimensional CNN.
- Author
-
Su, Ya, Zhao, Youjian, Sun, Ming, Zhang, Shenglin, Wen, Xidao, Zhang, Yongsu, Liu, Xian, Liu, Xiaozhou, Tang, Junliang, Wu, Wenfei, and Pei, Dan
- Subjects
- *
GAUSSIAN mixture models , *TIME series analysis , *QUALITY of service , *STREET addresses , *MACHINERY , *CENTRAL processing units - Abstract
Today's large datacenters house a massive number of machines, each of which is closely monitored with multivariate time series (e.g., CPU idle, memory utilization) to ensure service quality. Detecting outlier machine instances from multivariate time series is crucial for service management. However, it is a challenging task due to the multiple classes and various shapes, high dimensionality, and lack of labels of multivariate time series. In this article, we propose DOMI, a novel unsupervised model that combines a Gaussian mixture VAE with a 1D-CNN to detect outlier machine instances. Its core idea is to capture the normal patterns of machine instances by learning latent representations that consider their shape characteristics, reconstruct the input data from the learned representations, and apply reconstruction probabilities to determine outliers. Moreover, DOMI interprets a detected outlier instance based on the reconstruction probability changes of its univariate time series. Extensive experiments have been conducted on a dataset collected over a 1.5-month period from 1821 machines deployed at ByteDance, a top global content service provider. DOMI achieves the best F1-score of 0.94 and AUC score of 0.99, significantly outperforming the best-performing baseline method by 0.08 and 0.03, respectively. Moreover, its interpretation accuracy is up to 0.93. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
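The detection rule itself is simple once a model is trained: score each machine instance by its reconstruction probability and flag low-scoring instances. The Python sketch below illustrates only that rule; recon_log_prob is a hypothetical stand-in for the trained Gaussian-mixture VAE with a 1D-CNN encoder, not DOMI's actual interface.

```python
# Minimal sketch of reconstruction-probability-based outlier detection.
import numpy as np

def detect_outliers(instances, recon_log_prob, threshold):
    scores = np.array([recon_log_prob(x) for x in instances])   # higher = more normal
    outliers = np.where(scores < threshold)[0]
    return outliers, scores

# Usage with a dummy scorer: instances with unusually large values score low.
demo = [np.random.randn(100, 8) for _ in range(10)] + [np.random.randn(100, 8) * 5]
flagged, _ = detect_outliers(demo, lambda x: -np.abs(x).mean(), threshold=-2.0)
print(flagged)                                                   # expect the last index
```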
45. Residual Energy Maximization for Wireless Powered Mobile Edge Computing Systems With Mixed-Offloading.
- Author
-
Wu, Mengru, Qi, Weijing, Park, Junhee, Lin, Peng, Guo, Lei, and Lee, Inkyu
- Subjects
- *
MOBILE computing , *COMPUTER systems , *WIRELESS power transmission , *EDGE computing , *LINEAR programming , *TIME management , *ENERGY harvesting - Abstract
This paper studies a joint design of resource allocation and task offloading in a wireless powered mobile edge computing network involving different types of computation tasks. To deal with diverse computation tasks, we explore a mixed-offloading paradigm to support the coexistence of partial and binary offloading modes. Specifically, devices harvest energy from an access point (AP) via wireless power transfer (WPT) and utilize the harvested energy to execute their computation tasks using partial or binary offloading. Based on a practical non-linear energy harvesting model, a residual energy maximization problem is formulated by jointly optimizing the transmit power of the AP, the offloading power of devices, the time allocation between WPT and task offloading, and the task partitions and binary offloading decisions of devices, which turns out to be a non-convex mixed-integer non-linear programming problem. Thus, we develop an efficient dual-layer optimization algorithm that decomposes the optimization problem into an inner- and outer-layer structure to obtain the resource allocation and offloading decisions. Simulation results show that our proposed scheme achieves residual energy gains compared to existing schemes. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
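The "practical non-linear energy harvesting model" referenced above is typically a sigmoidal saturation curve; a widely used form from the wireless-power literature is reproduced below, though the article's exact model and constants may differ.

```latex
% Sigmoidal non-linear energy-harvesting model (common in the literature):
P_{\mathrm{EH}} \;=\; \frac{\Psi(P_{\mathrm{in}}) - M\,\Omega}{1-\Omega},
\qquad
\Psi(P_{\mathrm{in}}) \;=\; \frac{M}{1 + e^{-a\,(P_{\mathrm{in}} - b)}},
\qquad
\Omega \;=\; \frac{1}{1 + e^{\,ab}}
```

Here P_in is the received RF power, M the saturation (maximum harvested) power, and a, b circuit-dependent constants.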
46. Distributed Receiver Processing for Extra-Large MIMO Arrays: A Message Passing Approach.
- Author
-
Amiri, Abolfazl, Rezaie, Sajad, Manchon, Carles Navarro, and de Carvalho, Elisabeth
- Abstract
We study the design of receivers in extra-large scale MIMO (XL-MIMO) systems, i.e., systems in which the base station is equipped with an antenna array of extremely large dimensions. While XL-MIMO systems can significantly increase spectral efficiency, they present two important challenges. One is the increased computational cost of multi-antenna processing. The second is the presence of spatial non-stationarities in the channel response, which imply that the mean energy of a given user's signal varies across the array. Such non-stationarities limit the performance of the system. In this paper, we propose a distributed receiver for such an XL-MIMO system that can address both challenges. Based on variational message passing (VMP), we propose a set of receiver options providing a range of complexity-performance characteristics to adapt to different requirements. Furthermore, we distribute the processing into local processing units (LPUs) that can perform most of the complex processing in parallel before sharing their outcomes with a central processing unit (CPU). Our designs are specifically tailored to exploit the spatial non-stationarities and require fewer computations than linear receivers. Our simulation study, performed with a channel model accounting for the special characteristics of XL-MIMO channels, confirms the superior performance of our proposals compared to state-of-the-art methods. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
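To make the local/central split concrete, the toy NumPy sketch below has each local processing unit form a matched-filter statistic over its own antenna slice, with the CPU merely summing the per-LPU statistics; the dimensions and the simple combining rule are illustrative and do not reflect the article's VMP message schedule.

```python
# Toy LPU/CPU split: per-slice matched filtering at the LPUs, summation at the CPU.
import numpy as np

def distributed_mf(H, y, n_lpus):
    M, K = H.shape                                       # antennas x users
    slices = np.array_split(np.arange(M), n_lpus)
    partial = [H[s].conj().T @ y[s] for s in slices]     # computed in parallel at LPUs
    return sum(partial)                                  # CPU: combine K-dim statistics

rng = np.random.default_rng(0)
H = (rng.normal(size=(256, 8)) + 1j * rng.normal(size=(256, 8))) / np.sqrt(2)
x = rng.integers(0, 2, 8) * 2 - 1                        # BPSK symbols
y = H @ x + 0.1 * (rng.normal(size=256) + 1j * rng.normal(size=256))
print(np.sign(distributed_mf(H, y, n_lpus=4).real))      # rough symbol estimates
```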
47. RIA-CSM: A Real-Time Impact-Aware Correlative Scan Matching Using Heterogeneous Multi-Core SoC.
- Author
-
Bao, Minjie, Wang, Ke, Li, Ruifeng, Ma, Baoteng, and Fan, Zhendong
- Abstract
In dynamic scenarios with a large flow of people, robots are extremely vulnerable to impacts. Conventional correlative scan matching (CSM) is imperfect: the algorithm obtains incorrect priors when the robot is impacted. Meanwhile, its extremely high computational complexity remains a challenging issue. In this paper, we propose a computationally efficient and robust loosely-coupled LIDAR-IMU-wheel fusion method named RIA-CSM, using a heterogeneous multi-core SoC. The proposed impact-aware sensor integration module provides robust priors for CSM. Instead of occupancy grids, truncated signed distance functions (TSDFs) are used to represent the map. The most time-consuming part of CSM is mapped to an FPGA-based CSM accelerator, while the remaining part is calculated by the multi-core CPU. A public data set is used to test the localization accuracy and real-time performance of the proposed RIA-CSM, which shows that real-time performance improves by 30 times over conventional CSM. Experimental results on a mobile robot platform under impact are qualitatively analyzed. Compared with the state of the art, our method realizes more robust localization. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
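The compute-heavy kernel that the article offloads to the FPGA is the exhaustive pose scoring of correlative scan matching. The Python sketch below shows that inner loop over a plain 2-D grid map; the search window, resolution, and scoring function are illustrative assumptions rather than RIA-CSM's TSDF-based implementation.

```python
# Toy correlative-scan-matching search: score every candidate pose around the
# prior by summing the map value at each transformed scan point.
import numpy as np

def csm_search(scan_xy, grid, res, prior_xyt, window=(0.2, 0.2, 0.1), steps=5):
    best_score, best_pose = -np.inf, prior_xyt
    for dx in np.linspace(-window[0], window[0], steps):
        for dy in np.linspace(-window[1], window[1], steps):
            for dt in np.linspace(-window[2], window[2], steps):
                x, y, t = prior_xyt[0] + dx, prior_xyt[1] + dy, prior_xyt[2] + dt
                c, s = np.cos(t), np.sin(t)
                pts = scan_xy @ np.array([[c, -s], [s, c]]).T + [x, y]
                idx = np.clip((pts / res).astype(int), 0, np.array(grid.shape) - 1)
                score = grid[idx[:, 0], idx[:, 1]].sum()   # higher = better match
                if score > best_score:
                    best_score, best_pose = score, (x, y, t)
    return best_pose, best_score
```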
48. Observing the Invisible: Live Cache Inspection for High-Performance Embedded Systems.
- Author
-
Tarapore, Dharmesh, Roozkhosh, Shahin, Brzozowski, Steven, and Mancuso, Renato
- Subjects
- *
SYSTEMS on a chip , *PHASOR measurement , *CACHE memory , *RANDOM access memory , *CENTRAL processing units , *RESOURCE allocation - Abstract
The vast majority of high-performance embedded systems implement multi-level CPU cache hierarchies. But the exact behavior of these CPU caches has historically been opaque to system designers. Absent expensive hardware debuggers, an understanding of cache makeup remains tenuous at best. This enduring opacity further obscures the complex interplay among applications and OS-level components, particularly as they compete for the allocation of cache resources. Notwithstanding the relegation of cache comprehension to proxies such as static cache analysis, performance counter-based profiling, and cache hierarchy simulations, the underpinnings of cache structure and evolution continue to elude software-centric solutions. In this article, we explore a novel method of studying cache contents and their evolution via snapshotting. Our method complements extant approaches for cache profiling to better formulate, validate, and refine hypotheses on the behavior of modern caches. We leverage cache introspection interfaces provided by vendors to perform live cache inspections without the need for external hardware. We present CacheFlow, a proof-of-concept Linux kernel module which snapshots cache contents on an NVIDIA Tegra TX1 system on chip and a Hardkernel Odroid XU4. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
49. Low-Latency PON PHY Implementation on GPUs for Fully Software-Defined Access Networks.
- Author
-
Suzuki, Takahiro, Kim, Sang-Yuep, Kani, Jun-ichi, and Yoshida, Tomoaki
- Subjects
- *
PASSIVE optical networks , *SOFTWARE-defined networking , *CENTRAL processing units , *GRAPHICS processing units - Abstract
In order to respond to the increasing diversification of requirements, studies on software-defined access networks to enhance the flexibility of network systems continue. For fully software-defined access networks to efficiently accommodate various passive optical networks (PONs), other interfaces, and services, the softwarization of physical-layer processing has been studied. However, current techniques do not achieve enough performance for practical applications, and the total latency of upstream and downstream processing is 39.0 ms. The load imposed by conventional interrupt-based implementations makes it difficult to reduce buffer size. This article proposes an accelerator-based polling implementation method, which imposes a lower load than the conventional interrupt approach and thus achieves superior latency. We evaluate the latency performance in real time when implementing the 10G-EPON functions. Our implementation realizes central processing unit (CPU) polling by adding a flag to the received data and repeatedly checking the flag on a dedicated core. Demonstration results show that a prototype system successfully implemented our polling method on CPUs with a low latency of 0.586 ms. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
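The polling idea described above, a dedicated core spinning on a flag appended to the received data instead of waiting for an interrupt, can be sketched in Python as follows; the buffer layout and flag value are illustrative assumptions.

```python
# Flag-polling sketch: a dedicated thread busy-waits on a "ready" byte appended
# to each received buffer, trading one core's cycles for lower wake-up latency.
import threading
import numpy as np

BUF_LEN, FLAG = 1024, 0xA5
frame = np.zeros(BUF_LEN + 1, dtype=np.uint8)     # payload + 1-byte "ready" flag

def poller(handle):
    while True:                                   # pinned to a dedicated core in practice
        if frame[BUF_LEN] == FLAG:                # spin on the flag, no interrupt
            handle(frame[:BUF_LEN].copy())
            frame[BUF_LEN] = 0                    # re-arm for the next frame
            break

t = threading.Thread(target=poller, args=(lambda p: print(p[:4]),))
t.start()
frame[:BUF_LEN] = 7                               # producer writes the payload...
frame[BUF_LEN] = FLAG                             # ...then sets the flag last
t.join()
```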
50. λDNN: Achieving Predictable Distributed DNN Training With Serverless Architectures.
- Author
-
Xu, Fei, Qin, Yiling, Chen, Li, Zhou, Zhi, and Liu, Fangming
- Subjects
- *
COST functions - Abstract
Serverless computing is becoming a promising paradigm for Distributed Deep Neural Network (DDNN) training in the cloud, as it allows users to decompose complex model training into a number of functions without managing virtual machines or servers. Though serverless platforms provide a simpler resource interface (i.e., function number and memory size), inadequate function resource provisioning (either under-provisioning or over-provisioning) easily leads to unpredictable DDNN training performance. Our empirical studies on AWS Lambda indicate that such unpredictable performance of serverless DDNN training is mainly caused by the resource bottleneck of Parameter Servers (PS) and small local batch sizes. In this article, we design and implement λDNN, a cost-efficient function resource provisioning framework that provides predictable performance for serverless DDNN training workloads while saving the budget of provisioned functions. Leveraging the PS network bandwidth and function CPU utilization, we build a lightweight analytical DDNN training performance model to enable the design of the λDNN resource provisioning strategy, so as to guarantee DDNN training performance with serverless functions. Extensive prototype experiments on AWS Lambda and complementary trace-driven simulations demonstrate that λDNN can deliver predictable DDNN training performance and save the monetary cost of function resources by up to 66.7 percent compared with state-of-the-art resource provisioning strategies, yet with an acceptable runtime overhead. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
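Model-driven provisioning of the kind described above boils down to searching for the cheapest (function count, memory size) pair whose predicted iteration time meets a target. The sketch below uses a deliberately crude timing model with made-up constants, so it illustrates only the shape of the search, not λDNN's analytical model.

```python
# Toy provisioning search: per-iteration time modeled as max(compute, PS comm).
# All constants (per-sample FLOPs, per-vCPU throughput, model size) are placeholders.
def iter_time(n_funcs, mem_mb, batch, ps_bw_mbps=1000.0, flops_per_sample=2e9):
    cpu_share = mem_mb / 1792.0                  # Lambda roughly scales vCPU with memory
    compute = (batch / n_funcs) * flops_per_sample / (cpu_share * 20e9)
    comm = 2 * 100.0 * 8 / ps_bw_mbps            # push + pull of a ~100 MB model
    return max(compute, comm)

def provision(target_s, batch=1024):
    best = None
    for n in range(1, 65):
        for mem in (512, 1024, 1792, 3008):
            if iter_time(n, mem, batch) <= target_s:
                cost = n * mem                   # crude proxy for GB-seconds billed
                if best is None or cost < best[0]:
                    best = (cost, n, mem)
    return best

print(provision(target_s=2.0))                   # -> cheapest feasible (cost, funcs, MB)
```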