2,684 results for "Program processors"
Search Results
2. A Graph Attention Network Approach to Partitioned Scheduling in Real-Time Systems.
- Author
-
Lee, Seunghoon and Lee, Jinkyu
- Abstract
Machine learning methods have been used to solve real-time scheduling problems, but none has yet employed an architecture that uses the influences between real-time tasks as input features. This letter proposes a novel approach to partitioned scheduling in real-time systems using graph machine learning. We present a graph representation of real-time task sets that enables graph machine-learning schemes to capture the influence between real-time tasks. By using a graph attention network (GAT) with this method, our model successfully partition-scheduled task sets that were previously deemed unschedulable by state-of-the-art partitioned scheduling algorithms. The GAT is used to establish relationships between nodes in the graph, which represent real-time tasks, and to learn how these relationships affect the schedulability of the system. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
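To make the graph construction described in the entry above concrete, here is a minimal sketch of encoding a task set as node features plus an influence-weighted adjacency matrix. The feature choice (period, WCET, utilization) and the pairwise-utilization edge weight are illustrative assumptions, not the features used by Lee and Lee; a GAT would then learn attention weights over these edges.

```python
# Hypothetical sketch: encode a real-time task set as a graph for a GNN/GAT.
# Node features and edge weights below are illustrative assumptions, not the
# feature set used in the letter above.
import numpy as np

tasks = [  # (period, worst-case execution time), implicit deadlines assumed
    (10.0, 3.0),
    (20.0, 8.0),
    (40.0, 10.0),
]

n = len(tasks)
# Node features: period, WCET, utilization.
X = np.array([[p, c, c / p] for p, c in tasks])

# Edge weights: one crude proxy for mutual influence is the combined utilization
# of a task pair (how much they would load a processor if co-located).
A = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        if i != j:
            A[i, j] = X[i, 2] + X[j, 2]

print("node features:\n", X)
print("influence-weighted adjacency:\n", A)
```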
3. Design and Implementation of Digital Down Converter for WiFi Network.
- Author
-
Datta, Debarshi and Dutta, Himadri Sekhar
- Abstract
This letter introduces a field-programmable gate array (FPGA)-based digital down converter (DDC) processing a sampling frequency of about 3.64 GHz to a down-converted frequency of 28.4375 MHz to match the IEEE 802.11ah WiFi HaLow standard. The proposed DDC uses a polyphase mixer (PM) and a resampling filter. The PM adopts parallel coordinate rotation digital computer (CORDIC) processors and lowpass filter arrays to reduce high-speed data rates with minimum resource utilization. Again, the resampling filter employs a cascaded integrator comb (CIC) filter associated with a parallel prefixed adder (PPA) and a multichannel systolic finite impulse response (FIR) filter implemented with canonical expression, attaining optimum hardware cost. Converting floating-point to fixed-point data types provides significant resource savings. Finally, the improved design is coded in the Xilinx Vivado synthesis tool and successfully tested on the FPGA Kintex-7 device. In contrast to other recent architectures, the proposed design substantially reduces area requirements and power utilization. The MATLAB tool verifies the design to achieve an acceptable spurious-free dynamic range (SFDR) of 115 dB. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
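A quick arithmetic check, not taken from the paper, of the overall decimation implied by the frequencies quoted in the abstract above: 3.64 GHz down to 28.4375 MHz corresponds to a factor of 128, i.e. a power-of-two decimation chain such as a CIC filter followed by FIR stages.

```python
# Quick arithmetic check (not from the paper): the overall decimation factor
# implied by the quoted sampling and output frequencies.
fs_in = 3.64e9        # input sampling frequency, Hz
fs_out = 28.4375e6    # down-converted output frequency, Hz

decimation = fs_in / fs_out
print(decimation)     # 128.0, i.e. 2**7
```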
4. Fast Settling Phase-Locked Loops: A Comprehensive Survey of Applications and Techniques [Feature].
- Author
-
Ali, Zeeshan, Paliwal, Pallavi, Ahmad, Meraj, Heidari, Hadi, and Gupta, Shalabh
- Abstract
Fast settling phase locked loops (PLLs) play a pivotal role in many applications requiring rapid attainment of a stable frequency and phase. In modern communication standards, these PLLs are extensively utilized to guarantee precise compliance with dynamic resource allocation requirements. In processors, these PLLs manage dynamic voltage frequency scaling. Moreover, the fast-settling PLLs expedite the scanning of frequency spectra in sophisticated electronic radar set-ups, proving particularly advantageous for imaging and scanning radar applications. The rapid response exhibited by these PLLs is also harnessed in quantum technologies, catering to the urgent need for precise frequency adjustments to manipulate qubit states effectively. The strategies employed to attain fast-settling PLLs are primarily classified into five broad techniques in this article: enhanced phase frequency detection, hybrid multiple subsystems, VCO start-up, gear shift, and look-up table or finite state machine. This article explores the fundamental operational principles encompassing these techniques and presents optimal settling times for each method reported in the literature. Finally, the architectures utilizing these techniques will be evaluated based on their figure of merit (FoM), settling time, and tuning range. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
5. Revisiting Topographic Horizons in the Era of Big Data and Parallel Computing
- Author
-
Dozier, Jeff
- Subjects
Physical Geography and Environmental Geoscience ,Earth Sciences ,Engineering ,Geomatic Engineering ,Geoinformatics ,Azimuth ,Parallel processing ,Program processors ,Matlab ,Computers ,Computational modeling ,Codes ,Big data applications ,digital elevation models ,parallel processing ,surface topography ,Artificial Intelligence and Image Processing ,Electrical and Electronic Engineering ,Geological & Geomatics Engineering ,Physical geography and environmental geoscience ,Geomatic engineering - Published
- 2022
6. Modular High-Performance Computing Using Chiplets.
- Author
-
Vinnakota, Bapi, Shalf, John M., and Mohrer, Kathryn
- Subjects
APPLICATION-specific integrated circuits ,COMPUTER systems ,CLOUD computing ,HIGH performance computing ,COMPUTER architecture - Abstract
The performance growth rate of high-performance computing (HPC) systems has fallen from 1000× to just 10× every eleven years. The HPC world, like large cloud service provider data centers, has turned to heterogeneous acceleration to deliver continued performance growth through specialization. Chiplets offer a new, compelling approach to scaling performance through adding workload-specific processors and massive bandwidth to memory into computing systems. If design and manufacturing challenges are resolved, chiplets can offer a cost-effective path for combining die from multiple function-optimized process nodes, and even from multiple vendors, into a single application-specific integrated circuit (ASIC). This article explores opportunities for building and improving the performance of bespoke HPC architectures using open-modular "chiplet" building blocks. The hypothesis developed is to use chiplets to extend the functional and physical modularity of modern HPC systems to within the semiconductor package. This planning can reduce the complexity and cost of assembling chiplets into an ASIC product and make it easier to build multiple product variants. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
7. Improving MPI Collective I/O for High Volume Non-Contiguous Requests with Intra-Node Aggregation
- Author
-
Kang, Q, Lee, S, Hou, K, Ross, R, Agrawal, A, Choudhary, A, and Liao, WK
- Subjects
Performance evaluation ,Benchmark testing ,Program processors ,Libraries ,Production ,Aggregates ,Writing ,Parallel I/O ,MPI collective I/O ,two-phase I/O ,non-contiguous I/O ,Distributed Computing ,Computer Software ,Communications Technologies - Abstract
Two-phase I/O is a well-known strategy for implementing collective MPI-IO functions. It redistributes I/O requests among the calling processes into a form that minimizes the file access costs. As modern parallel computers continue to grow into the exascale era, the communication cost of such request redistribution can quickly overwhelm collective I/O performance. This effect has been observed from parallel jobs that run on multiple compute nodes with a high count of MPI processes on each node. To reduce the communication cost, we present a new design for collective I/O by adding an extra communication layer that performs request aggregation among processes within the same compute nodes. This approach can significantly reduce inter-node communication contention when redistributing the I/O requests. We evaluate the performance and compare it with the original two-phase I/O on Cray XC40 parallel computers (Theta and Cori) with Intel KNL and Haswell processors. Using I/O patterns from two large-scale production applications and an I/O benchmark, we show our proposed method effectively reduces the communication cost and hence maintains the scalability for a large number of processes.
- Published
- 2020
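The intra-node aggregation idea described in the entry above can be illustrated with mpi4py's shared-memory communicator split. This is only a sketch of the communication pattern under assumed dummy requests, not the authors' MPI-IO implementation: requests are first gathered to one aggregator rank per node, and only aggregators take part in the inter-node exchange.

```python
# Illustrative sketch (not the authors' MPI-IO code): aggregate I/O requests
# within each compute node before any inter-node redistribution.
from mpi4py import MPI

world = MPI.COMM_WORLD
rank = world.Get_rank()

# Communicator containing only the ranks that share this node's memory.
node_comm = world.Split_type(MPI.COMM_TYPE_SHARED)
node_rank = node_comm.Get_rank()

# Each process has some (offset, length) I/O requests (dummy data here).
my_requests = [(rank * 1024, 1024)]

# Phase 1: intra-node aggregation to a per-node aggregator (node_rank == 0).
node_requests = node_comm.gather(my_requests, root=0)

# Phase 2: only aggregators join the (smaller) inter-node communicator that
# performs the usual two-phase I/O request redistribution.
color = 0 if node_rank == 0 else MPI.UNDEFINED
agg_comm = world.Split(color, key=rank)
if agg_comm != MPI.COMM_NULL:
    flat = [r for reqs in node_requests for r in reqs]
    all_node_requests = agg_comm.allgather(flat)
    if rank == 0:
        print("aggregated requests per node:", all_node_requests)
```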
8. Monitoring Software Execution Flow Through Power Consumption and Dynamic Time Warping.
- Author
-
Vidal, Boris, Moreno, Carlos, Fischmeister, Sebastian, and Carvajal, Gonzalo
- Abstract
This letter presents a technique for nonintrusive code execution tracking using side-channel signals of power consumption. Using a nearest-neighbor classifier that integrates the dynamic time warping distance with information from the control flow graph, it is possible to identify executed basic blocks from a trace of power consumption that exhibits temporal distortions due to assembly-level artifacts and varying operational conditions. Experimental results show that the proposed technique achieves over 95% precision when inferring the runtime execution flow of a cruise control application using unmarked traces of power consumption collected from different processors. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
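A minimal sketch of the classification core described in the entry above: a 1-nearest-neighbor classifier over the dynamic time warping distance between 1-D power traces. The control-flow-graph integration from the letter is omitted, and the reference traces here are toy data.

```python
# Minimal sketch: 1-nearest-neighbor over dynamic time warping (DTW) distance
# for 1-D power traces. The letter additionally integrates control-flow-graph
# information, which is omitted here.
import numpy as np

def dtw_distance(a, b):
    """Classic O(len(a)*len(b)) DTW with absolute-difference cost."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def classify(trace, references):
    """Return the label of the reference trace closest to `trace` under DTW."""
    return min(references, key=lambda kv: dtw_distance(trace, kv[1]))[0]

# Toy reference traces per basic block and a time-distorted observation.
refs = [("blockA", np.array([0.1, 0.9, 0.9, 0.1])),
        ("blockB", np.array([0.8, 0.2, 0.8, 0.2]))]
observed = np.array([0.1, 0.85, 0.9, 0.9, 0.15])   # stretched version of blockA
print(classify(observed, refs))                     # -> "blockA"
```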
9. A Python Multiprocessing Approach for Fast Geostatistical Simulations of Subglacial Topography.
- Author
-
Schoedl, Nathan W., MacKie, Emma J., Field, Michael J., Stubbs, Eric A., Zhang, Allan, Hibbs, Matthew, and Gravey, Mathieu
- Subjects
TOPOGRAPHY ,ICE sheets ,PYTHONS ,PYTHON programming language ,TOPOGRAPHIC maps ,GRID cells - Abstract
Realistically rough stochastic realizations of subglacial bed topography are crucial for improving our understanding of basal processes and quantifying uncertainty in sea level rise projections with respect to topographic uncertainty. This can be achieved with sequential Gaussian simulation (SGS), which is used to generate multiple nonunique realizations of geological phenomena that sample the uncertainty space. However, SGS is very CPU intensive, with a computational complexity of $O(Nk^3)$, where $N$ is the number of grid cells to simulate, and $k$ is the number of neighboring points used for conditioning. This complexity makes SGS prohibitively time-consuming to implement at ice sheet scales or fine resolutions. To reduce the time cost, we implement and test a multiprocess version of SGS using Python's multiprocessing module. By parallelizing the calculation of the weight parameters used in SGS, we achieve a speedup of 9.5 running on 16 processors for an $N$ of 128,097. This speedup—as well as the speedup from using multiple processors—increases with $N$. This speed improvement makes SGS viable for large-scale topography mapping and ensemble ice sheet modeling. Additionally, we have made our code repository and user tutorials publicly available (GitHub and Zenodo) so that others can use our multiprocess implementation of SGS on different datasets. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
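The parallelization pattern described in the entry above can be sketched with Python's multiprocessing module: the per-cell kriging weight solve (the $O(k^3)$ inner step of SGS) is distributed over a process pool. This is an illustration of the pattern only, with synthetic covariance systems, not the authors' published implementation.

```python
# Illustrative sketch of the parallelization pattern only (not the authors'
# SGS code): the per-cell kriging weight solve, the O(k^3) part of SGS, is
# farmed out to a pool of worker processes.
import numpy as np
from multiprocessing import Pool

def solve_weights(args):
    """Solve one k x k kriging system C w = c for a single grid cell."""
    C, c = args
    return np.linalg.solve(C, c)

def batch_weights(systems, nproc=4):
    with Pool(processes=nproc) as pool:
        return pool.map(solve_weights, systems)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    k = 16
    systems = []
    for _ in range(1000):                 # 1000 grid cells, k neighbors each
        A = rng.standard_normal((k, k))
        C = A @ A.T + k * np.eye(k)       # symmetric positive definite
        systems.append((C, rng.standard_normal(k)))
    weights = batch_weights(systems, nproc=4)
    print(len(weights), weights[0].shape)
```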
10. Novel Low Memory Footprint DNN Models for Edge Classification of Surgeons’ Postures.
- Author
-
Hanneman, Alex, Fawden, Terry, Branciforte, Marco, Virzi, Maria Celvisia, Moss, Esther L., Ost, Luciano, and Zecca, Massimiliano
- Abstract
Skill assessment is fundamental to enhance current laparoscopic surgical training and reduce the incidence of musculoskeletal injuries from performing these procedures. Recently, deep neural networks (DNNs) have been used to improve human posture and surgeons’ skills training. While they work well in the lab, they normally require significant computational power which makes it impossible to use them on edge devices. This letter presents two low memory footprint DNN models used for classifying laparoscopic surgical skill levels at the edge. Trained models were deployed on three Arm Cortex-M processors using the X-Cube-AI and Tensorflow Lite Micro (TFLM) libraries. Results show that the CUBE-AI-based models give the best relative performance, memory footprint, and accuracy tradeoffs when executed on the Cortex-M7. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
11. Tag-Sharer-Fusion Directory: A Scalable Coherence Directory With Flexible Entry Formats.
- Author
-
Qiu, Yudi, Jiao, Jie, Zeng, Xiaoyang, and Fan, Yibo
- Subjects
- *
SCALABILITY , *DIRECTORIES , *OVERHEAD costs , *MULTIPROCESSORS - Abstract
In large-scale chip multiprocessors (CMPs), the scalability of a coherence directory becomes more important as the number of cores increases. However, previously proposed scalable coherence directories typically reduce the directory storage overhead at the cost of one or more aspects of performance, accuracy, and complexity. In this article, we propose the tag-sharer-fusion (TSF) directory, a scalable coherence directory with low hardware complexity, as well as with high performance and accuracy. Each directory entry has just enough bits to store a single sharer pointer and is divided into two primary formats: tag and sharer, where sharer entries store sharers but not tags. Each private block is tracked by a tag entry, and each shared block is tracked by a combination of a tag entry and a sharer entry in the same set. Simulation of a 128-core chip-multiprocessor with the PARSEC and SPLASH-2x benchmarks shows that the TSF directory requires only a quarter of the area of a non-scalable full-map sparse directory to achieve similar performance and network traffic, both with an average overhead within 1%. The TSF directory outperforms the state-of-the-art Pool and way-combining directory proposals in terms of storage overhead, performance, and network traffic. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
12. The High Faulty Tolerant Capability of the Alternating Group Graphs.
- Author
-
Zhang, Hui, Hao, Rong-Xia, Qin, Xiao-Wen, Lin, Cheng-Kuan, and Hsieh, Sun-Yuan
- Subjects
- *
FAULT tolerance (Engineering) , *WIRELESS communications - Abstract
The matroidal connectivity and conditional matroidal connectivity are novel indicators for measuring real fault tolerability. In this paper, for the $n$-dimensional alternating group graph $AG_{n}$, the structural properties and (conditional) matroidal connectivity are studied based on the dimensional partition of $E(AG_{n})$. We prove that for $S\subseteq E(AG_{n})$, under some limitation on the number of faulty edges in each dimensional edge set, if $|S|\leq (n-1)!-1$, then $AG_{n}-S$ is connected. We study the value of the matroidal connectivity and conditional matroidal connectivity of $AG_{n}$. Furthermore, simulations have been carried out to compare the matroidal connectivity with other types of conditional connectivity in $AG_{n}$. The simulation results show that the matroidal connectivity significantly improves on the known fault-tolerant capability measures of alternating group graphs. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
13. Scheduling Parallel Real-Time Tasks on Virtual Processors.
- Author
-
Jiang, Xu, Liang, Haochun, Guan, Nan, Tang, Yue, Qiao, Lei, and Wang, Yi
- Subjects
- *
PARALLEL programming , *SCHEDULING , *MULTIPROCESSORS - Abstract
In many popular parallel programming models, e.g., OpenMP (OpenMP, 2013), applications are usually dispatched into several dedicated scheduling entities (commonly named "threads") for which the processor time of the physical platform is provided through the OS schedulers. This behavior calls for a hierarchical scheduling framework that considers each thread as a virtual processor (VP). Moreover, hierarchical scheduling allows separate applications to execute together on a common hardware platform, with each application having the "illusion" of executing on a dedicated component. However, the problem of scheduling parallel real-time tasks on a virtual multiprocessor platform has not yet been addressed. An analogous approach to virtual scheduling for parallel real-time tasks is federated scheduling, where each task exclusively executes on a set of dedicated physical processors. However, federated scheduling suffers significant resource wasting. In this article, we study the scheduling of real-time parallel tasks on virtual multiprocessors. As a physical processor is shared by virtual processors, tasks effectively share processors with each other. We conduct comprehensive performance evaluation to compare our proposed approach with existing methods of different types. Experiment results show that our approach consistently outperforms existing methods to a considerable extent under a wide range of parameter settings. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
14. Resilient Control for Multiagent Systems With a Sampled-Data Model Against DoS Attacks.
- Author
-
Fang, Fang, Li, Jiayu, Liu, Yajuan, and Park, Ju H.
- Abstract
To reduce the computational burden and resist the denial-of-service (DoS) attacks, a resilient distributed sampled-data control scheme is proposed for multiagent systems. The agent states are sampled periodically by the sensors. DoS attacks disrupt the data communication from transmitters to controllers randomly or periodically with a limited duration time. Information on DoS attacks can be obtained by introducing novel logic processors embedded in corresponding controllers. Next, the problem of resilient control can be converted into one concerned with the upper and lower bound of the sampling interval of an aperiodic sampled-data control system. Some sufficient criteria for developing resilient distributed controllers are derived using the novel looped Lyapunov functional approach and the free-matrix-based inequality method. Finally, two illustrative examples, unmanned aerial vehicles and the two-mass-spring systems, are provided to demonstrate the efficiency of the proposed resilient distributed sampled-data control protocols against the DoS attacks. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
15. A Parallel Boundary Element Method for the Electromagnetic Analysis of Large Structures With Lossy Conductors.
- Author
-
Marek, Damian, Sharma, Shashwat, and Triverio, Piero
- Subjects
- *
BOUNDARY element methods , *CONDUCTORS (Musicians) , *SKIN effect , *COMPUTATIONAL electromagnetics , *ELECTROMAGNETIC coupling - Abstract
In this article, we propose an efficient parallelization strategy for the boundary element method (BEM) solvers that perform the electromagnetic analysis of structures with lossy conductors. The proposed solver is accelerated with the adaptive integral method, can model both homogeneous and multilayered background media, and supports excitation via lumped ports or an incident field. Unlike existing parallel BEM solvers, we use a formulation that rigorously models the skin effect, which results in two coupled computational workloads. The external-problem workload models electromagnetic coupling between conductive objects, while the internal-problem workload describes field distributions within them. We propose a parallelization strategy that distributes these two workloads evenly over thousands of processing cores. The external-problem workload is balanced in the same manner as existing parallel solvers that employ approximate models for conductive objects. However, we assert that the internal-problem workload should be balanced by algorithms from the scheduling theory. The parallel scalability of the proposed solver is tested on three different structures found in both integrated circuits and metasurfaces. The proposed parallelization strategy runs efficiently on distributed-memory computers with thousands of CPU cores and outperforms competing strategies derived from existing methods. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
16. Stereo: Assignment and Scheduling in MPSoC Under Process Variation by Combining Stochastic and Decomposition Approaches.
- Author
-
Khodabandeloo, Behnam, Khonsari, Ahmad, Behnam, Payman, Majidi, Alireza, and Hajiesmaili, Mohammad Hassan
- Subjects
- *
STEREO vision (Computer science) , *MIXED integer linear programming , *SCHEDULING , *ELECTRONIC design automation - Abstract
Aggressive scaling in integrated circuits creates new challenges such as an increase in power density, temperature, and especially process variation in designing Multiprocessor Systems-on-Chip (MPSoC). While most of the previous works attempt to mitigate the process variation effects at the system level, the eventual design still suffers from the variability of frequency and leakage power. In this paper, we propose a method called Stereo that combines stochastic and decomposition approaches to solve task assignment and scheduling under process variation in MPSoCs. In our previous work, we formulated a Mixed Integer Linear Programming (MILP) problem for variation-aware task assignment and scheduling to optimize energy consumption while meeting the real-time constraints. To capture the stochastic behavior of process variation, we employed a chance-constrained programming technique to turn the problem into a corresponding stochastic optimization that can be solved by typical ILP solvers. However, it had a scalability problem. To address this issue, in this work, we leverage a Logic-based Benders Decomposition (LBD) approach to improve the running time for finding an optimal solution of assignments and schedulings under the process variation phenomenon. We carried out extensive experiments using the Embedded System Synthesis Benchmarks Suite (E3S). The experimental results of the Stereo method evince considerable improvements compared to the baseline method in terms of performance-yield and run-time. The Stereo-based MILP method ameliorates performance-yield up to 2× and run-time by 532×. Moreover, for manifold applications, the Stereo-based LBD method achieves a 3.47×-91.49× run-time improvement compared to the Stereo-based MILP approach and is capable of assigning and scheduling more than 50 tasks on 9 processors. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
17. Towards a Tractable Exact Test for Global Multiprocessor Fixed Priority Scheduling.
- Author
-
Burmyakov, Artem, Bini, Enrico, and Lee, Chang-Gun
- Subjects
- *
MULTIPROCESSORS , *SCHEDULING , *NUMBER systems , *SCIENTIFIC community , *NP-hard problems - Abstract
Scheduling algorithms are called “global” if they can migrate tasks between cores. Global scheduling algorithms are the de-facto standard practice for general purpose Operating Systems, to balance the workload between cores. However, the exact schedulability analysis of real-time applications for these algorithms is proven to be weakly NP-hard. Despite such hardness, the research community keeps investigating methods for an exact schedulability analysis, for its relevance and to tightly estimate the execution requirements of real-time systems. Due to the NP-hardness, the available exact tests are very time and memory demanding, even for sets of a few tasks. On the other hand, the available sufficient tests are very pessimistic, despite consuming fewer resources. Motivated by these observations, we propose an exact schedulability test for constrained-deadline sporadic tasks under a global multiprocessor fixed-priority scheduler, which is significantly faster and consumes less memory compared to any other available exact test. To derive a faster test, we exploit the idea of state-space pruning, aiming at reducing the number of feasible system states to be examined by the test. The resulting test is multiple orders of magnitude faster with respect to other state-of-the-art exact tests. Our C++ implementation is publicly available. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
18. Quantifying Information Leakage for Security Verification of Compiler Optimizations.
- Author
-
Panigrahi, Priyanka, Paul, Abhik, and Karfa, Chandan
- Subjects
- *
COMPILERS (Computer programs) , *INFORMATION technology security , *LEAKS (Disclosure of information) , *SECURITY management , *LEAKAGE - Abstract
Compiler optimizations can be functionally correct but not secure. In this work, we attempt to quantify the information leakage in a program for the security verification of compiler optimizations. Our work has the following contributions. We demonstrate that static taint analysis is applicable for the security verification of compiler optimizations. We develop a completely automated approach for quantifying the information leak in a program in the context of compiler optimizations. Our method avoids many false-positive scenarios due to implicit flow. It can handle leaks in a loop and propagates leaks over paths using the leak propagation vector. With our quantification parameters, we verify the relative security of source and transformed programs, considering the optimization phase of a compiler as a black box. Our experimental evaluations on benchmarks for various compiler optimizations in SPARK show that the SPARK compiler is actually leaky. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
19. Toward Register Spilling Security Using LLVM and ARM Pointer Authentication.
- Author
-
Fanti, Andrea, Chinea Perez, Carlos, Denis-Courmont, Remi, Roascio, Gianluca, and Ekberg, Jan-Erik
- Subjects
- *
REDUCED instruction set computers , *MEDICAL registries , *DESIGN protection , *COMPILERS (Computer programs) - Abstract
Modern reduced instruction set computer processors are based on a load/store architecture, where all computations are performed on register operands. Compilers therefore allocate registers based on demand, and when occupancy is at maximum, register contents are spilled onto the stack and then retrieved later as the data is needed. This phenomenon has security implications that cannot be ignored, as data on the stack is subject to well-known memory corruption attacks. Moreover, works presented so far mainly target the protection of pointers to code (e.g., return addresses), but are ineffective for protecting other context data on the stack. This article presents a security solution for spilled registers, generalizing the use of ARM pointer authentication (PA) for this purpose. The protection is enforced by the LLVM compiler via additional compiler passes and modifications. The solution provides guarantees for both integrity and confidentiality protection, and also addresses reuse-attack problems associated with PA usage. Experimental data collected demonstrates the effectiveness of the solution against corruption and eavesdropping. We test our solution using SPEC CPU 2017, which confirms the functional viability of our solution. Additionally, we expose real-world performance overhead metrics of our protection design on an ARM PA-enabled processor. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
20. Horae: A Hybrid I/O Request Scheduling Technique for Near-Data Processing-Based SSD.
- Author
-
Li, Jiali, Chen, Xianzhang, Liu, Duo, Li, Lin, Wang, Jiapin, Zeng, Zhaoyang, Tan, Yujuan, and Qiao, Lei
- Subjects
- *
SOLID state drives , *DATABASES , *RECOMMENDER systems , *SCHEDULING , *ELECTRONIC data processing , *RANDOM access memory - Abstract
Near-data processing (NDP) architecture is promised to break the bottleneck of data movement in many scenarios (e.g., databases and recommendation systems), which limits the efficiency of data processing. Different from traditional SSD, NDP-based SSD not only needs to handle normal I/Os (e.g., read and write), but also needs to handle NDP requests that contain data processing operations. NDP and normal I/O requests share some function units of NDP-based SSD, such as flash chips and embedded processors. However, existing works ignore the resource competition between normal I/Os and NDP requests, which drastically degrades the performance. In this article, we propose a novel scheduling technique called Horae, which can efficiently schedule hybrid NDP-normal I/O requests in NDP-based SSD to improve performance. Horae exploits the critical paths on critical resources to maximize the parallelism of multiple stages of requests. The experimental results on typical workloads show that Horae can significantly improve the performance of hybrid NDP-normal I/O requests over the state-of-the-art scheduling algorithms of NDP-based SSDs. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
21. Explainable Machine Learning for Intrusion Detection via Hardware Performance Counters.
- Author
-
Kuruvila, Abraham Peedikayil, Meng, Xingyu, Kundu, Shamik, Pandey, Gaurav, and Basu, Kanad
- Subjects
- *
MACHINE learning , *COMPUTER architecture , *INTRUSION detection systems (Computer security) , *SYSTEM failures , *ANTIVIRUS software , *COMPUTER systems - Abstract
The exponential proliferation of Malware over the past decade has threatened system security across a plethora of Internet of Things (IoT) devices. Furthermore, the improvements in computer architectures to include speculative branching and out-of-order executions have engendered new opportunities for adversaries to carry out microarchitectural attacks in these devices. Both Malware and microarchitectural attacks are imperative threats to computing systems, as their behaviors range from stealing sensitive data to total system failure. With the cat-and-mouse game between Anti-Virus Software (AVS) and attackers, the frequent bolstering of AVS induces large computational overhead. Consequently, hardware performance counter (HPC)-based detection strategies augmented with machine learning (ML) classifiers have gained popularity as a low overhead solution in identifying these malicious threats. However, ML models are operated as black boxes, which results in decisions that are not human understandable. Clarity of the models’ results facilitates the development of more robust systems. Existing explainable frameworks are only capable of determining each feature’s impact on a prediction which does not provide meaningful interpretable outcomes for HPC-based intrusion detection. In this article, we address this issue by proposing an explainable HPC-based double regression (HPCDR) ML framework. Our proposed technique provides relevant transparency through isolation of the most malevolent transient window of an application, thereby allowing a user to efficiently locate the pernicious instructions within the program. We evaluated HPCDR on five microarchitectural attacks and two Malware. HPCDR was successfully able to identify the most malicious function manifested in each intrusive application. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
22. Schedulability Analysis for Coscheduling Real-Time Tasks on Multiprocessors.
- Author
-
Dong, Zheng and Liu, Cong
- Subjects
- *
MULTIPROCESSORS , *HETEROGENEOUS computing , *PARALLEL processing , *TASK analysis , *INTEGRATED circuits - Abstract
The real-time coscheduling problem, where tasks may have multiple phases executing on different types of processors, is known to be hard. The (already hard) self-suspending task scheduling simplifies the coscheduling problem by assuming that the latency a task may experience on the other type of processors is naturally bounded, which is unfortunately not true in practice. In this article, we present a novel analysis technique, namely, the vertical view analysis, for analyzing the schedulability of coscheduling sporadic tasks under global earliest-deadline-first (GEDF) on a heterogeneous multiprocessor consisting of two types of processors. We derive both hard (no deadline miss) and soft (bounded response times) real-time utilization-based tests. To the best of our knowledge, these results are the first of their kind for the coscheduling problem and may allow real-time schedulability analysis to be carried out in more practical scenarios under heterogeneous computing. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
23. Simplified Approach for Acquisition of Submodule Capacitor Voltages of the Modular Multilevel Converter Using Low Sampling Rate Sensing and Estimation.
- Author
-
Purkayastha, Bishwajyoti and Bhattacharya, Tanmoy
- Subjects
- *
DATA transmission systems , *VOLTAGE , *CAPACITORS , *MATHEMATICAL analysis - Abstract
The modular multilevel converter (MMC), used in high-voltage dc applications, consists of hundreds of submodules in each arm. The operation and control of the MMC require sensing of the capacitor voltage of each submodule. The transmission of the sensed data from hundreds of submodules over individual channels leads to wiring complexity. This work suggests that time-multiplexing can be used to transmit the sensed data of multiple submodules across a common serial interface bus. Time multiplexing, however, decreases the sampling rate. Since the submodule capacitor voltage has both ac and dc components, an adequate sampling rate must be maintained to avoid the aliasing of ac components. In this article, an estimation scheme is investigated for reconstructing the ac components of the submodule capacitor voltages within the processing platform, leaving just the dc component to be transmitted over the common bus. The mathematical analysis of the data transmission system and proposed estimation scheme is performed, and the influence on the closed-loop performance is studied. The proposed scheme is experimentally verified in the laboratory. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
24. Automatic Generation of High-Performance Convolution Kernels on ARM CPUs for Deep Learning.
- Author
-
Meng, Jintao, Zhuang, Chen, Chen, Peng, Wahib, Mohamed, Schmidt, Bertil, Wang, Xiao, Lan, Haidong, Wu, Dou, Deng, Minwen, Wei, Yanjie, and Feng, Shengzhong
- Subjects
- *
CONVOLUTIONAL neural networks , *DEEP learning , *CONFIGURATION space , *ARTIFICIAL intelligence , *ONLINE algorithms - Abstract
We present FastConv, a template-based code auto-generation open-source library that can automatically generate high-performance deep learning convolution kernels for arbitrary matrix/tensor shapes. FastConv is based on the Winograd algorithm, which is reportedly the highest-performing algorithm for the time-consuming layers of convolutional neural networks. ARM CPUs cover a wide range of designs and specifications, from embedded devices to HPC-grade CPUs. This leads to the dilemma of how to consistently optimize Winograd-based convolution solvers for convolution layers of different shapes. FastConv addresses this problem by using templates to auto-generate multiple shapes of tuned kernel variants suitable for skinny tall matrices. As a performance portable library, FastConv transparently searches for the best combination of kernel shapes, cache tiles, scheduling of loop orders, packing strategies, access patterns, and online/offline computations. Auto-tuning is used to search the parameter configuration space for the best performance for a given target architecture and problem size. Results show that 1.02x to 1.40x, 1.14x to 2.17x, and 1.22x to 2.48x speedups are achieved over NNPACK, ARM NN, and FeatherCNN on Kunpeng 920. Furthermore, performance portability experiments with various convolution shapes show that FastConv achieves 1.2x to 1.7x and 2x to 22x speedups over NNPACK and the ARM NN inference engine using Winograd on Kunpeng 920. CPU performance portability evaluation on VGG-16 shows an average speedup over NNPACK of 1.42x, 1.21x, 1.26x, 1.37x, 2.26x, and 11.02x on Kunpeng 920, Snapdragon 835, 855, 888, Apple M1, and AWS Graviton2, respectively. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
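FastConv, as described above, builds on Winograd minimal filtering; the core transform can be shown for the simplest 1-D case, F(2,3), which produces two outputs of a 3-tap filter from four inputs using four multiplications instead of six. This numpy sketch shows only the transform, not FastConv's template-generated, auto-tuned ARM kernels.

```python
# Winograd minimal filtering, 1-D case F(2,3): two outputs of a 3-tap
# correlation from 4 inputs with 4 multiplications instead of 6.
import numpy as np

BT = np.array([[1, 0, -1, 0],
               [0, 1,  1, 0],
               [0, -1, 1, 0],
               [0, 1,  0, -1]], dtype=float)
G = np.array([[1.0, 0.0, 0.0],
              [0.5, 0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0, 0.0, 1.0]])
AT = np.array([[1, 1, 1, 0],
               [0, 1, -1, -1]], dtype=float)

def winograd_f23(d, g):
    """Compute two outputs of correlate(d, g) for len(d) == 4, len(g) == 3."""
    return AT @ ((G @ g) * (BT @ d))

d = np.array([1.0, 2.0, 3.0, 4.0])
g = np.array([0.5, 1.0, -1.0])
print(winograd_f23(d, g))                 # [-0.5  0. ]
print(np.correlate(d, g, mode="valid"))   # reference: same two values
```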
25. Scalable Unsupervised ML: Latency Hiding in Distributed Sparse Tensor Decomposition.
- Author
-
Abubaker, Nabil, Karsavuran, M. Ozan, and Aykanat, Cevdet
- Subjects
- *
BIN packing problem , *MATRIX decomposition , *PARALLEL algorithms , *SPARSE matrices , *SCALABILITY - Abstract
Latency overhead in distributed-memory parallel CPD-ALS scales with the number of processors, limiting the scalability of computing CPD of large irregularly sparse tensors. This overhead comes in the form of sparse reduce and expand operations performed on factor-matrix rows via point-to-point messages. We propose to hide the latency overhead through embedding all of the point-to-point messages incurred by the sparse reduce and expand into dense collective operations which already exist in the CPD-ALS. The conventional parallel CPD-ALS algorithm is not amenable for embedding so we propose a computation/communication rearrangement to enable the embedding. We embed the sparse expand and reduce into a hypercube-based ALL-REDUCE operation to limit the latency overhead to $O(\log_2 K)$ for a $K$-processor system. The embedding comes with the cost of increased bandwidth overhead due to the multi-hop routing of factor-matrix rows during the embedded-ALL-REDUCE. We propose an embedding scheme that takes advantage of the expand/reduce properties to reduce this overhead. Furthermore, we propose a novel recursive bipartitioning framework that enables simultaneous hypergraph partitioning and subhypergraph-to-subhypercube mapping to achieve subtensor-to-processor assignment with the objective of reducing the bandwidth overhead during the embedded-ALL-REDUCE. We also propose a bin-packing-based algorithm for factor-matrix row to processor assignment aiming at reducing processors’ maximum send and receive volumes during the embedded-ALL-REDUCE. Experiments on up to 4096 processors show that the proposed framework scales significantly better than the state-of-the-art point-to-point method. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
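The hypercube-based ALL-REDUCE that bounds latency to $O(\log_2 K)$ in the entry above can be illustrated with a small single-process simulation of recursive doubling over $K = 2^d$ ranks. The sketch below shows only the exchange structure; the paper's embedding of sparse expand/reduce messages into these steps is not reproduced.

```python
# Simulated recursive-doubling (hypercube) all-reduce for K = 2**d "processes",
# illustrating the O(log2 K) step count the paper's embedding relies on.
import numpy as np

def hypercube_allreduce(values):
    K = len(values)
    assert K & (K - 1) == 0, "K must be a power of two"
    vals = [np.array(v, dtype=float) for v in values]
    d = K.bit_length() - 1
    for step in range(d):                 # log2(K) exchange steps
        new_vals = []
        for rank in range(K):
            partner = rank ^ (1 << step)  # hypercube neighbor along dimension `step`
            new_vals.append(vals[rank] + vals[partner])
        vals = new_vals
    return vals                           # every rank now holds the full sum

K = 8
values = [np.full(4, r) for r in range(K)]   # rank r contributes a vector of r's
out = hypercube_allreduce(values)
print(out[0])                                # [28. 28. 28. 28.] on every rank
```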
26. High Bandwidth Thermal Covert Channel in 3-D-Integrated Multicore Processors.
- Author
-
Dhananjay, Krithika, Pavlidis, Vasilis F., Coskun, Ayse K., and Salman, Emre
- Subjects
MULTICORE processors ,THREE-dimensional integrated circuits ,BANDWIDTHS ,INTEGRATED circuits ,VERTICAL integration ,COMPUTER security - Abstract
Exploiting thermal coupling among the cores of a processor to secretly communicate sensitive information is a serious threat in mobile, desktop, and server platforms. Existing works on temperature-based covert communication typically rely on controlling the execution of high-power CPU stressing programs to transmit confidential information. Such covert channels with high-power programs are typically easier to detect as they cause significant rise in temperature. In this work, we demonstrate that by leveraging vertical integration, it is sufficient to execute typical SPLASH-2 benchmark applications to transfer 200 bits per second (bps) of secret data via thermal covert channels. The strong vertical thermal coupling among the cores of a 3-D multicore processor increases the rates of covert communication by $3.4\times $ compared to covert communication in conventional 2-D integrated circuits (ICs). Furthermore, we show that the bandwidth of this thermal communication in 3-D ICs is more resilient to thermal interference caused by applications running in other cores. This reduced interference significantly increases the danger posed by such attacks. We also investigate the effect of reducing intertier overlap between colluded cores and show that the covert channel bandwidth is reduced by up to 62% with no overlap. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
27. ESSA: Design of a Programmable Efficient Sparse Spiking Neural Network Accelerator.
- Author
-
Kuang, Yisong, Cui, Xiaoxin, Wang, Zilin, Zou, Chenglong, Zhong, Yi, Liu, Kefei, Dai, Zhenhui, Yu, Dunshan, Wang, Yuan, and Huang, Ru
- Subjects
ARTIFICIAL neural networks ,APPLICATION-specific integrated circuits ,GATE array circuits ,FIELD programmable gate arrays ,DATA compression - Abstract
Spiking neural networks (SNNs) have been witnessing the developing trends to reduce the model size and improve the hardware efficiency for area- and energy-based applications, which are processed by model pruning and data compressions. However, it is challenging to exploit the unstructured sparsity of SNNs for the dense neuromorphic processors. In this article, we present an efficient sparse SNN accelerator (ESSA), which leverages both the temporal sparsity of spike events and the spatial sparsity of weights in SNN inference. It provides both the compressed weights for sparse SNNs and the uncompressed weights for compact SNNs. The self-adaptive spike compression is proposed for sparse spike scenarios, leading to the improvement of throughput by $3.2\times $. ESSA executes a flexible fan-in–fan-out tradeoff by using combinable dendrites, which overcomes the fan-in limitation in neuromorphic systems. Furthermore, a low-latency intrachip spike multicast method is adopted to reduce the resource overhead. Implemented on the Xilinx Kintex Ultrascale field-programmable gate array (FPGA), ESSA achieves an equivalent performance of 253.1 GSOP/s and an energy efficiency of 32.1 GSOP/W for 75% weight sparsity at 140 MHz. The implementation of a four-layer fully connected SNN is expected to perform $2.6~\mu \text{s}$ per time step and the energy consumption is $14.6~\mu \text{J}$. Our results demonstrate that ESSA outperforms several state-of-the-art application-specific integrated circuit (ASIC) or FPGA neuromorphic processors. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
28. Multitimescale Mitigation for Performance Variability Improvement in Time-Critical Systems.
- Author
-
Lin, Ji-Yung, Weckx, Pieter, Mishra, Subrat, Spessot, Alessio, and Catthoor, Francky
- Subjects
TIME management - Abstract
Ensuring a timing guarantee is crucial for time-critical applications. However, this task becomes more challenging with the increasing performance variability generated by complicated modern hardware and software. A widespread solution to the problem is real-time scheduling, which depends on worst-case execution time (WCET) and dynamic voltage frequency scaling (DVFS). Although these techniques provide the necessary guarantees, they also exhibit important limitations from the long switching time of DVFS and the overly pessimistic execution time model of WCET. In this work, a multitimescale mitigation methodology is proposed to improve the way of tackling performance variability in both timing guarantee and energy saving. By using both the DVFS and heterogeneous datapath (HDP) knobs, this methodology can push the timescale of mitigation down to the submillisecond level. Moreover, this methodology can calculate a tight upper bound of execution time at run-time using dynamic scenarios (DSs). Simulation shows that the proposed methodology can ensure zero deadline misses with a smaller safety time margin than the method using only DVFS and WCET. This advantage can translate into an energy reduction by half compared to the conventional WCET-based method with a single DVFS knob. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
29. Quantum Computing and High-Performance Computing: Compilation Stack Similarities.
- Author
-
Alarcon, Sonia Lopez and Elster, Anne C.
- Subjects
QUANTUM computing ,QUANTUM computers ,GATE array circuits ,PROGRAMMING languages ,HIGH performance computing - Abstract
There is a great deal of focus on how quantum computing as an accelerator differs from other traditional high-performance computing (HPC) resources, including accelerators like GPUs and field-programmable gate arrays. In classical computing, how to design the interfaces that connect the different layers of the software stack, from the applications and high-level programming language description, through compilers and schedulers, and down to the hardware and gate level, has been critical. Likewise, quantum computing's interfaces enable access to quantum technology as a viable accelerator. From the ideation of the quantum application to the manipulation of the quantum chip, each interface has its challenges. In this article, we discuss the structure of this set of quantum interfaces, their many similarities to the traditional HPC compilation stack, and how these interfaces impact the potential of quantum computers as HPC accelerators. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
30. Trojan Resilient Computing in COTS Processors Under Zero Trust.
- Author
-
Hasan, Mahmudul, Cruz, Jonathan, Chakraborty, Prabuddha, Bhunia, Swarup, and Hoque, Tamzidul
- Subjects
TRUST ,SYSTEMS design ,COMPILERS (Computer programs) ,SUPPLY chains - Abstract
The commercial off-the-shelf (COTS) component-based ecosystem provides an attractive system design paradigm due to the drastic reduction in development time and cost compared to custom solutions. However, it brings in a growing concern of trustworthiness arising from the possibility of malicious embedded logic or hardware Trojans in COTS components. Existing hardware Trojan countermeasures are typically not applicable to COTS hardware due to the need for zero trust consideration for all supply chain entities, absence of golden models, and lack of observability of internal signals within the component. In this work, we propose a novel approach for runtime Trojan detection and resilience in untrusted COTS processors through judicious modifications in the software. The proposed approach does not rely on any hardware redundancy or architectural modification and hence seamlessly integrates with the COTS-based system design process. Trojan resilience is achieved through the execution of multiple functionally equivalent software variants. We have developed and implemented a solution for compiler-based automatic generation of program variants, metric-guided selection of variants, and their integration in a single executable. To evaluate the proposed approach, we first analyzed the effectiveness of program variants in avoiding the activation of a random pool of Trojans. Then, by implementing several Trojans in an OpenRISC 1000 processor, we analyzed the detectability and resilience under Trojan activation in both single and multiple variants. We also present delay and code size overhead for the automatically generated variants for several programs and discuss future research directions. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
31. A Comparison of Performance on WebGPU and WebGL in the Godot Game Engine
- Author
-
Fransson, Emil, Hermansson, Jonatan, and Hu, Yan
- Abstract
WebGL has been the standard API for rendering graphics on the web over the years. A new technology, WebGPU, was set for release in 2023 and utilizes many of the novel rendering approaches and features common to the native modern graphics APIs, such as Vulkan. Currently, very limited research exists regarding WebGPU's rasterization capabilities. In particular, no research exists about its capabilities when used as a rendering backend in game engines. This paper aims to investigate performance differences between WebGL and WebGPU. It is done in the context of the game engine Godot, and the measured performance is that of the CPU and GPU frame time. The results show that WebGPU performs better than WebGL when used as a rendering backend in Godot, for both the game tests and the synthetic tests. The comparisons clearly show that WebGPU achieves lower mean CPU and GPU frame times. © 2024 IEEE.
- Published
- 2024
- Full Text
- View/download PDF
32. Infrastructure for Exploring SIMT Architecture in General-Purpose Processors
- Author
-
Ferdman, M., Mai, H., Canpolat, O., Ratnasegar, N., Scott, D., Karman, N., and Wei, K.
- Abstract
2024 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS 2024), 5-7 May 2024, Indianapolis. Single-Instruction Multiple-Thread (SIMT) computing has enabled a revolution in graphics, high-performance computing, and artificial intelligence. However, despite its benefits in these domains, SIMT processing has been relegated to accelerators rather than becoming a feature of general-purpose computing. Although a number of recent works have explored the potential benefits of "GPSIMT," current research infrastructures for studying high-performance general-purpose CPUs provide no support for the SIMT architecture and its programming model. This work presents our initial efforts toward developing a full-system GPSIMT research infrastructure. We first describe how we extend the QEMU emulator with SIMT features, enabling ISA pathfinding for GPSIMT hardware and providing a platform for the rapid development of system software for GPSIMT. We then present our approach to leveraging the Chipyard hardware generation framework to develop a full-system GPSIMT exploration platform on an FPGA by extending the RISC-V Rocket Core. © 2024 IEEE.
- Published
- 2024
33. A Fully-Focused SAR Omega-K Closed-Form Algorithm for the Sentinel-6 Radar Altimeter: Methodology and Applications
- Author
-
Hernandez-Burgos, Sergi, Gibert, Ferran, Broquetas, Antoni, Kleinherenbrink, M., De la Cruz, Adrian Flores, Gomez-Olive, Adria, Garcia-Mondejar, Albert, and Aparici, Monica Roca i.
- Abstract
The 2-D frequency-based omega-K method is known to be a suitable algorithm for fully focused SAR (FF-SAR) radar altimeter processors, as its computational efficiency is much higher than that of equivalent time-based alternatives without much performance degradation. In this article, we provide a closed-form description of a 2-D frequency-domain omega-K algorithm specific to instruments such as Poseidon-4 onboard Sentinel-6. The processor is validated with real data from point targets and over the open ocean. Applications such as ocean swell retrieval and lead detection are demonstrated, showing the potential of the processor for future operational global-scale products.
- Published
- 2024
- Full Text
- View/download PDF
34. An $O(\log _3N)$ Algorithm for Reliability Assessment of 3-Ary $n$ -Cubes Based on $h$ -Extra Edge Connectivity.
- Author
-
Xu, Liqiong, Zhou, Shuming, and Hsieh, Sun-Yuan
- Subjects
- *
ALGORITHMS , *MULTIPROCESSORS , *FAULT tolerance (Engineering) - Abstract
Reliability evaluation of multiprocessor systems is of great significance to the design and maintenance of these systems. As two generalizations of traditional edge connectivity, extra edge connectivity and component edge connectivity are two important parameters to evaluate the fault-tolerant capability of multiprocessor systems. Quickly identifying the extra edge connectivity and the component edge connectivity of high order remains an open problem for many useful multiprocessor systems. In this article, we determine the $h$-extra edge connectivity of the 3-ary $n$-cube $Q_n^3$ for $h\in [1, \frac{3^n-1}{2}]$. Specifically, we divide the interval $[1, \frac{3^n-1}{2}]$ into some subintervals, characterize the monotonicity of $\lambda_h(Q_n^3)$ in these subintervals, and then deduce a recursive closed formula for $\lambda_h(Q_n^3)$. Based on this formula, an efficient algorithm with complexity $O(\log_3 N)$ is designed to determine the exact values of the $h$-extra edge connectivity of the 3-ary $n$-cube $Q_n^3$ for $h\in [1, \frac{3^n-1}{2}]$ completely. Moreover, we also determine the $g$-component edge connectivity of the 3-ary $n$-cube $Q_n^3$ ($n\geq 6$) for $1\leq g\leq 3^{\lceil \frac{n}{2}\rceil}$. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
35. Strong Reliability of Star Graphs Interconnection Networks.
- Author
-
Lin, Limei, Huang, Yanze, Hsieh, Sun-Yuan, and Xu, Li
- Subjects
- *
DISTRIBUTED computing , *FAULT tolerance (Engineering) , *PARALLEL programming , *HYPERCUBES - Abstract
For an interconnection network losing processors, it is important to determine the number of vertices in the maximal component of the surviving network. Moreover, the component connectivity is a significant indicator of the reliability of a network in the presence of failing processors. In this article, we first prove that when a set $M$ of at most $3n-7$ processors is deleted from an $n$-star graph, the surviving graph has a large component of size greater than or equal to $n!-|M|-3$. We then prove that when a set $M$ of at most $4n-9$ processors is deleted from an $n$-star graph, the surviving graph has a large component of size greater than or equal to $n!-|M|-5$. Finally, we also calculate the $r$-component connectivity of the $n$-star graph for $2\leq r\leq 5$. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
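A small numeric illustration of the first bound quoted in the entry above: for the $n$-star graph on $n!$ vertices, removing at most $3n-7$ vertices leaves a large component of at least $n!-|M|-3$ vertices. The values below simply evaluate that formula for small $n$.

```python
# Numeric illustration of the first bound stated above: in an n-star graph
# (n! vertices), deleting a set M of at most 3n - 7 vertices leaves a large
# component with at least n! - |M| - 3 vertices.
from math import factorial

for n in range(4, 8):
    max_deleted = 3 * n - 7
    guaranteed = factorial(n) - max_deleted - 3
    print(f"n={n}: |V|={factorial(n)}, |M|<={max_deleted}, "
          f"large component >= {guaranteed}")
```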
36. Evaluation of Cache Attacks on Arm Processors and Secure Caches.
- Author
-
Deng, Shuwen, Matyunin, Nikolay, Xiong, Wenjie, Katzenbeisser, Stefan, and Szefer, Jakub
- Subjects
- *
ARM microprocessors , *CACHE memory , *RADIO frequency - Abstract
Timing-based side and covert channels in processor caches continue to be a threat to modern computers. This work shows, for the first time, a systematic, large-scale analysis of Arm devices and the detailed results of attacks the processors are vulnerable to. Compared to x86, Arm uses different architectures, microarchitectural implementations, cache replacement policies, etc., which affects how attacks can be launched, and how security testing for the vulnerabilities should be done. To evaluate security, this paper presents security benchmarks specifically developed for testing Arm processors and their caches. The benchmarks are evaluated with sensitivity tests, which examine how sensitive the benchmarks are to having a correct configuration in the testing phase. Further, to evaluate a large number of devices, this work leverages a novel approach of using a cloud-based Arm device testbed for architectural and security research on timing channels and runs the benchmarks on 34 different physical devices. In parallel, there has been much interest in secure caches to defend against the various attacks. Consequently, this paper also investigates secure cache architectures using the proposed benchmarks. In particular, this paper implements and evaluates secure PL and RF caches, showing the security of PL and RF caches, but also uncovering new weaknesses. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
37. Bypassing Multicore Memory Bugs With Coarse-Grained Reconfigurable Logic.
- Author
-
Lee, Doowon and Bertacco, Valeria
- Subjects
- *
CACHE memory , *FINITE state machines , *ARM microprocessors , *MEMORY , *SYSTEMS design , *LOGIC - Abstract
Multicore systems deploy sophisticated memory hierarchies to improve memory operations’ throughput and latency by exploiting multiple levels of cache hierarchy and several complex memory-access instructions. As a result, the functional verification of the memory subsystem is one of the most challenging tasks in the overall system design effort, leading to many bugs in the released product. In this work, we propose MemPatch, a novel reconfigurable hardware solution to bypass such escaped bugs. To design MemPatch, we first analyzed publicly available errata documents and classified memory-related bugs by root cause and symptoms. We then leveraged that learning to design a specialized, reconfigurable detection fabric, implementing finite state machines that can model the bug-triggering events at the microarchitectural level. Finally, we complemented this detection logic with hardware offering multiple bug-bypassing options. Our evaluation of MemPatch mapped a multicore RISC-V out-of-order processor, augmented with our logic, to a Xilinx ZCU102 FPGA board. When configured to detect up to 32 distinct bugs, MemPatch entails 7.6% area and 7.3% power overheads. An estimate on a commercial ARM Cortex-A57 processor target indicates that the area overhead would be much lower, 1.0%. The performance impact was found to be no more than 1% in all cases. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
38. BlueVisor: Time-Predictable Hardware Hypervisor for Many-Core Embedded Systems.
- Author
-
Jiang, Zhe, Wei, Ran, Dong, Pan, Zhuang, Yan, Audsley, Neil C., and Gray, Ian
- Subjects
- *
COMPUTING platforms , *HYPERVISOR (Computer software) , *COMPUTER systems , *VIRTUAL machine systems , *HARDWARE - Abstract
Whilst virtualization was once restricted to large-scale computing platforms, it is now widely deployed on modern embedded computing systems. This has been driven by the availability of hardware support which alleviates the performance penalties incurred by traditional software virtualization technologies. In the domain of hard real-time systems, specialist virtualization technology that respects restricted timing requirements and constraints can be deployed to allow sharing of processors. However, other aspects of the embedded system (I/O, memory, and communication) are harder to analyze. In this paper, we argue that in order to support real-time virtualization on modern embedded systems, additional system-wide hardware support is required. We propose BlueVisor, an analyzable and scalable hardware hypervisor for many-core embedded systems, which enables time-predictable CPU, memory, and I/O virtualization, as well as fast interrupt handling and inter-VM communication. We describe the design and implementation of the real-time hypervisor, and demonstrate how a BlueVisor-based virtualization system can be leveraged to meet real-time requirements with a significant improvement in system performance and a low performance cost when executing different types of software. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
39. Leaking Secrets Through Modern Branch Predictors in the Speculative World.
- Author
-
Chowdhuryy, Md Hafizul Islam and Yao, Fan
- Subjects
- *COMPUTER systems, *SPECULATION, *TRANSIENT analysis, *SQUASHES
- Abstract
Transient execution attacks that exploit speculation have raised significant concerns in computer systems. Typically, branch predictors are leveraged to trigger mis-speculation in transient execution attacks. In this work, we demonstrate a new class of speculation-based attacks that targets the branch prediction unit (BPU). We find that speculative resolution of conditional branches (i.e., in nested speculation) alters the state of the pattern history table (PHT) in modern processors, and this state is not restored after the corresponding branches are later squashed. This characteristic allows attackers to exploit the BPU as the secret-transmitting medium in transient execution attacks. To evaluate the discovered vulnerability, we build a novel attack framework, BranchSpectre, that enables exfiltration of unintended secrets through observing speculative PHT updates (in the form of covert and side channels). We further investigate the PHT collision mechanism in the history-based predictor and the branch prediction mode transitions in Intel processors. Built upon such knowledge, we implement an ultra-high-speed covert channel (BranchSpectre-cc) as well as two side channels (i.e., BranchSpectre-v1 and BranchSpectre-v2) that rely merely on the BPU for mis-speculation triggering and secret inference in the speculative domain. Notably, the BranchSpectre side channels can take advantage of much simpler code patterns than those used in Spectre attacks. We present an extensive BranchSpectre code gadget analysis on a set of popular real-world application code bases, followed by a demonstration of a side-channel attack on OpenSSL. The evaluation results show substantially wider existence and higher exploitability of BranchSpectre code patterns in real-world software. Finally, we discuss several secure branch prediction mechanisms that can mitigate transient execution attacks exploiting modern branch predictors. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
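A hedged, simplified model of the observation behind entry 39: a pattern history table entry, modeled as a 2-bit saturating counter, is updated by a speculatively resolved branch and not rolled back on squash. The indexing, initial state, and update policy here are assumptions for illustration only.

```python
# Toy PHT entry: 2-bit saturating counter whose speculative update survives a squash.
class PHTEntry:
    def __init__(self):
        self.counter = 2            # weakly taken

    def predict(self):
        return self.counter >= 2

    def update(self, taken):
        self.counter = min(3, self.counter + 1) if taken else max(0, self.counter - 1)

entry = PHTEntry()
before = entry.counter
# Transient window: a branch whose direction depends on a secret bit resolves speculatively.
secret_bit = 1
entry.update(taken=bool(secret_bit))    # state changes inside speculation...
# ...the squash rolls back architectural state but, per the paper, not the PHT.
after = entry.counter
print(before, after)                     # 2 3 -> the secret is observable via later predictions
```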
40. Blocks: Challenging SIMDs and VLIWs With a Reconfigurable Architecture.
- Author
-
Wijtvliet, M., Kumar, A., and Corporaal, H.
- Subjects
- *FIELD programmable gate arrays, *ENERGY consumption
- Abstract
Demand for coarse grain reconfigurable architectures (CGRAs) has significantly increased in recent years as architectures need to be both energy efficient and flexible. However, most CGRAs are optimized for performance instead of energy efficiency. In this work, a novel paradigm for reconfigurable architectures, Blocks, is presented. Blocks uses two separate circuit-switched networks, one for control and one for the data path. This enables the runtime construction of energy-efficient application-specific VLIW-SIMD processors on a reconfigurable fabric. Its energy efficiency is demonstrated by comparing Blocks to four reference architectures: a VLIW, an SIMD, a commercial low-power microprocessor, and a traditional CGRA. All comparisons are based on commercial low-power 40-nm CMOS layout, including memories. Results show that Blocks can achieve a mean total energy reduction of 2.05×, 1.84×, 8.01×, and 1.22× over a VLIW, an SIMD, an energy-efficient microprocessor, and a traditional CGRA, respectively. At the same time, Blocks delivers equal or higher performance per area due to its ability to adapt to applications by reconfiguration. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
41. Locking Protocols for Parallel Real-Time Tasks With Semaphores Under Federated Scheduling.
- Author
-
Wang, Yang, Jiang, Xu, Guan, Nan, Tang, Yue, and Liu, Weichen
- Subjects
- *TASKS, *SCHEDULING
- Abstract
Suspension-based locks are widely used in real-time systems to coordinate simultaneous accesses to exclusive shared resources. Although suspension-based locks have been well studied for sequential real-time tasks, little work has been done on this topic for parallel real-time tasks. This article, for the first time, studies the problem of how to extend existing sequential-task locking protocols and their analysis techniques to the parallel task model. More specifically, we extend two locking protocols, OMLP and OMIP, which were designed for clustered scheduling of sequential real-time tasks, to federated scheduling of parallel real-time tasks. We present corresponding blocking analysis techniques and develop path-oriented techniques to analyze and account for blocking time. Schedulability tests with different efficiency and accuracy are further developed. Experiments are conducted to evaluate the performance of our proposed approaches against the state-of-the-art. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
42. Unifying Spatial Accelerator Compilation With Idiomatic and Modular Transformations.
- Author
-
Weng, Jian, Liu, Sihao, Kupsh, Dylan, and Nowatzki, Tony
- Subjects
- *COMPILERS (Computer programs), *CENTRAL processing units, *ENERGY consumption, *MICROELECTROMECHANICAL systems
- Abstract
Spatial accelerators provide high performance, energy efficiency, and flexibility. Recent design frameworks enable these architectures to be quickly designed and customized to a domain. However, constructing a compiler for this immense design space is challenging: first, because accelerators express programs with high-level idioms that are difficult to recognize; second, because it is unpredictable whether certain transformations are beneficial or will lead to infeasible hardware mappings. Our work develops a general spatial-accelerator compiler with two key ideas. First, we propose an approach to recognize and represent useful dataflow idioms, along with a novel idiomatic memory representation. Second, we propose the principle of modular compilation, which combines hardware-aware transformation selection and an iterative approach to handle uncertainty. Our compiler achieves a 2.3× speedup, and a 98.7× area-normalized speedup, over a high-end server central processing unit (CPU). [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
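To suggest what idiom recognition can look like in practice, here is a toy sketch (not the paper's compiler infrastructure) that spots a multiply-accumulate reduction in Python source using the standard ast module; the idiom choice and matching rules are illustrative assumptions.

```python
# Toy dataflow-idiom recognizer: detect acc += a[i] * b[i] inside a for loop.
import ast

def has_mac_idiom(src):
    """Return True if a for-loop body contains a multiply-accumulate statement."""
    tree = ast.parse(src)
    for loop in (n for n in ast.walk(tree) if isinstance(n, ast.For)):
        for stmt in ast.walk(loop):
            if (isinstance(stmt, ast.AugAssign)
                    and isinstance(stmt.op, ast.Add)
                    and isinstance(stmt.value, ast.BinOp)
                    and isinstance(stmt.value.op, ast.Mult)
                    and isinstance(stmt.value.left, ast.Subscript)
                    and isinstance(stmt.value.right, ast.Subscript)):
                return True
    return False

kernel = """
acc = 0
for i in range(n):
    acc += a[i] * b[i]
"""
print(has_mac_idiom(kernel))   # True: could map to a MAC/reduction unit on the fabric
```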
43. Secure Register Allocation for Trusted Code Generation.
- Author
-
Panigrahi, Priyanka, Sahithya, Vemuri, Karfa, Chandan, and Mishra, Prabhat
- Abstract
In this letter, we investigate the inherent vulnerability of register allocation (RA), the process by which the variables of a source program are mapped to hardware registers. This letter makes three important contributions. First, we show that RA is secure if there is no spilling. Next, we show that RA with spilling does not preserve the security properties of the source program. Our experimental evaluation using a wide variety of benchmarks demonstrates that RA in LLVM is not secure. Finally, we propose a secure RA in LLVM that does not introduce additional information leaks in the generated code due to spilling. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
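A minimal sketch of one possible policy behind entry 43's idea, assuming the fix is to exclude secret-tagged variables from spill candidates so secrets never reach stack memory; this is an assumption for illustration, not the paper's actual LLVM pass.

```python
# Toy spill-candidate selection that refuses to spill secret-tagged variables.
def choose_spill(live_vars, secret_vars, next_use):
    """Pick the non-secret live variable with the furthest next use."""
    candidates = [v for v in live_vars if v not in secret_vars]
    if not candidates:
        raise RuntimeError("cannot spill without leaking a secret; split or rematerialize instead")
    return max(candidates, key=lambda v: next_use[v])

live = {"key", "tmp", "counter"}
print(choose_spill(live, secret_vars={"key"},
                   next_use={"key": 3, "tmp": 9, "counter": 5}))   # tmp is spilled, key stays in a register
```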
44. Robust and Accurate Fine-Grain Power Models for Embedded Systems With No On-Chip PMU.
- Author
-
Nikov, Kris, Martinez, Marcos, Wegener, Simon, Nunez-Yanez, Jose, Chamski, Zbigniew, Georgiou, Kyriakos, and Eder, Kerstin
- Abstract
This letter presents a novel approach to event-based power modeling for embedded platforms that do not have a performance monitoring unit (PMU). The method involves complementing the target hardware platform, where the physical power data is measured, with another platform on which the CPU performance data needed for model generation can be collected. The methodology is used to generate accurate fine-grain power models for the Gaisler GR712RC dual-core LEON3 fault-tolerant SPARC processor, which has onboard power sensors but no PMU. A Kintex UltraScale field-programmable gate array (FPGA) is used as the support platform to obtain the required CPU performance data, by running a soft-core version of the same dual-core LEON3 as on the GR712RC, but with a PMU implemented. Both platforms execute the same benchmark set, and data collection is synchronized using per-sample timestamps so that the power sensor data from the GR712RC board can be matched to the PMU data from the FPGA. The synchronized samples are then processed by the Robust Energy and Power Predictor Selection (REPPS) software in order to generate power models. The models achieve less than 2% power estimation error when validated on an industrial use case and can follow program phases, which makes them suitable for runtime power profiling during development. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
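A minimal sketch of the event-based modeling step, assuming a linear model fit by ordinary least squares over synchronized counter and power samples; the event names and numbers are made up, and this is not the REPPS tool itself.

```python
# Fit per-event power weights from synchronized (counter, power) samples.
import numpy as np

# One row per synchronized sample: [cycles, icache_misses, dcache_misses] (invented values).
events = np.array([[1.0e6, 2.0e3, 5.0e3],
                   [2.0e6, 8.0e3, 1.0e4],
                   [1.5e6, 4.0e3, 7.0e3],
                   [3.0e6, 1.0e4, 2.0e4]])
power = np.array([0.62, 0.95, 0.78, 1.30])            # measured watts per sample

X = np.column_stack([np.ones(len(events)), events])   # constant (static power) term + events
weights, *_ = np.linalg.lstsq(X, power, rcond=None)

estimate = X @ weights
print("weights:", weights)
print("mean abs error (W):", np.mean(np.abs(estimate - power)))
```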
45. Energy-minimized Scheduling of Real-time Parallel Workflows on Heterogeneous Distributed Computing Systems.
- Author
-
Hu, Biao, Cao, Zhengcai, and Zhou, MengChu
- Abstract
Today's large-scale parallel workflows are often processed on heterogeneous distributed computing platforms. From an economic perspective, computing resource providers should minimize cost while offering high service quality. It is well recognized that energy consumption accounts for a large part of a computing system's total cost, and that timeliness and reliability are two important service indicators. This work studies the problem of scheduling a parallel workflow so as to minimize the system energy consumption under response-time and reliability constraints. We first mathematically formulate this problem as a non-linear mixed integer programming problem. Since this problem is hard to solve directly, we present highly efficient heuristic solutions. Specifically, we first develop an algorithm that minimizes the schedule length while meeting the reliability requirement, on top of which we propose a processor-merging algorithm and a slack-time reclamation algorithm using a dynamic voltage frequency scaling (DVFS) technique to reduce energy consumption. The processor-merging algorithm tries to turn off energy-inefficient processors so that energy consumption is minimized. The DVFS technique is applied to scale down the processor frequency at both the processor and task levels to reduce energy consumption. Experimental results on two real-life workflows and extensive synthetic parallel workflows demonstrate their effectiveness. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
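A toy model of the slack-time reclamation step, assuming a cubic dynamic-power model (P ∝ f³, so task energy ∝ f² × cycles) and illustrative numbers; the paper's actual DVFS and reliability models may differ.

```python
# Reclaim slack: scale the clock down just enough to finish exactly at the deadline.
def reclaim_slack(cycles, f_max, deadline, k=1.0):
    f = min(f_max, cycles / deadline)        # lowest frequency that still meets the deadline
    energy = k * f ** 2 * cycles             # E = P * t = (k*f^3) * (cycles/f)
    return f, energy

cycles, f_max, deadline = 2.0e9, 2.0e9, 1.5  # 2 Gcycles of work, 2 GHz core, 1.5 s deadline
f_dvfs, e_dvfs = reclaim_slack(cycles, f_max, deadline)
_, e_full = reclaim_slack(cycles, f_max, cycles / f_max)   # no slack: run at f_max
print(f"run at {f_dvfs / 1e9:.2f} GHz, energy saved {(1 - e_dvfs / e_full) * 100:.0f}%")
```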
46. Early Soft Error Reliability Analysis on RISC-V.
- Author
-
Lodea, Nicolas, Nunes, Willian, Zanini, Vitor, Sartori, Marcos, Ost, Luciano, Calazans, Ney, Garibotti, Rafael, and Marcon, Cesar
- Abstract
The adoption of RISC-V processors has bloomed in recent years, mainly due to the open-standard, royalty-free instruction set architecture. However, much remains to be done to help software engineers deliver highly reliable, bug-free applications and systems based on RISC-V IP designs. This work proposes an early soft error reliability assessment of a RISC-V processor, extending the previously proposed SOFIA fault injection framework. Results from 850k fault injections show that choosing the compiler flag -O2 to optimize performance causes 96% more Hang failures than -O0. Software engineers must evaluate compilation parameters on a case-by-case basis to find the best balance between performance and reliability. This work helps software engineers develop fault-tolerant RISC-V-based systems and applications more efficiently. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
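An illustrative fault-injection loop, assuming single-bit flips in a toy register file and a trivial stand-in workload; this is not the SOFIA framework, only a sketch of the campaign structure (inject, run, compare against a golden result).

```python
# Toy single-bit fault-injection campaign over a small register file.
import random

def run_program(regs):
    """Stand-in workload: only the low byte of r0 and all of r1 reach the output."""
    return (regs[0] & 0xFF) + regs[1]

def inject_bit_flip(regs, reg_idx, bit):
    faulty = list(regs)
    faulty[reg_idx] ^= (1 << bit)      # single-event upset in one register
    return faulty

random.seed(0)
regs = [7, 42, 1024, 3]
golden = run_program(regs)

for _ in range(5):                     # a few random injections
    r, b = random.randrange(len(regs)), random.randrange(32)
    outcome = run_program(inject_bit_flip(regs, r, b))
    status = "masked" if outcome == golden else "silent data corruption"
    print(f"flip reg {r} bit {b}: {status}")
```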
47. Quality-Aware Transcoding Task Allocation Under Limited Power in Live-Streaming Systems.
- Author
-
Lee, Dayoung and Song, Minseok
- Abstract
Transcoding in video live-streaming systems requires a lot of computation and, hence, a lot of power. Putting a limit on the power drawn by each of the transcoding processors in a server reduces the overall power consumption, but it also hinders the efficient allocation of transcoding tasks. We address this with a dynamic programming algorithm, together with a heuristic, which maximizes total processing capacity while limiting power consumption in a server with heterogeneous processors. A further greedy algorithm determines the bitrates at which content is transcoded for each channel and allocates transcoding tasks to processors, taking video quality, popularity, and workload balance into account. The initial assumption is that all contents are transcoded to all bitrates for every channel. The algorithm then gradually reduces the number of versions to be produced by transcoding, while minimizing the consequent reduction in popularity-weighted video quality and balancing the workload across processors. Experimental results show that our scheme improves aggregate popularity-weighted video quality under a power constraint by 3.82%–39.12% compared to benchmark methods. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
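A hedged sketch of the capacity-maximization step viewed as a multiple-choice knapsack solved by dynamic programming, assuming each processor exposes a few discrete (power, capacity) operating points and integer-watt granularity; the operating points below are invented, and the paper's formulation may differ.

```python
# Pick one operating point per processor to maximize capacity under a total power cap.
def max_capacity(processors, power_cap):
    """Multiple-choice knapsack DP over integer watts."""
    best = [0] + [-1] * power_cap                  # best[p] = max capacity at exactly p watts (-1 = unreachable)
    for points in processors:                      # one list of (power, capacity) pairs per processor
        nxt = [-1] * (power_cap + 1)
        for used in range(power_cap + 1):
            if best[used] < 0:
                continue
            for power, capacity in points:
                if used + power <= power_cap:
                    nxt[used + power] = max(nxt[used + power], best[used] + capacity)
        best = nxt
    return max(best)

procs = [[(10, 30), (20, 55)],                     # heterogeneous processors with per-point power/capacity
         [(15, 40), (25, 70)],
         [(5, 10), (12, 28)]]
print(max_capacity(procs, power_cap=50))           # 135: points (20,55), (25,70), (5,10)
```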
48. Olympus: Reaching Memory-Optimality on DNN Processors.
- Author
-
Cai, Xuyi, Wang, Ying, Tu, Kaijie, Gao, Chengsi, and Zhang, Lei
- Subjects
- *MATHEMATICAL optimization, *COMPUTER architecture
- Abstract
In DNN processors, main memory consumes much more energy than arithmetic operations. Therefore, many memory-oriented network scheduling (MONS) techniques have been introduced to exploit on-chip data reuse opportunities and reduce accesses to memory. However, deriving the theoretical lower bound of memory overhead for DNNs remains a significant challenge; such a bound would also shed light on how to reach memory-level optimality by means of network scheduling. Prior work on MONS mainly focused on disparate optimization techniques or missed some of the data-reuse opportunities in diverse network models, so its results are likely to deviate from the true memory optimality that can be achieved in processors. This paper introduces Olympus, which comprehensively considers the entire memory-level DNN scheduling space, formally analyzes the true memory optimality, and shows how to reach memory-optimal schedules for an arbitrary DNN running on a DNN processor. The key idea behind Olympus is to derive a true memory lower bound that accounts for both intra-layer and inter-layer reuse opportunities, which had not been simultaneously explored by prior works. Evaluation on state-of-the-art DNN processors of different architectures shows that Olympus can guarantee the minimum off-chip memory access, reducing DRAM accesses by 12.3%–85.6% and saving 7.4%–70.3% of energy on the latest network models. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
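A back-of-the-envelope sketch of why inter-layer reuse matters for such a memory lower bound, assuming two consecutive layers and invented tensor sizes; this is not Olympus's actual analysis, only an illustration that fusing layers removes the DRAM round trip of the intermediate feature map when it fits on chip.

```python
# Off-chip traffic (MB) for two consecutive layers, with and without inter-layer reuse.
def traffic_mb(input_fm, weights1, mid_fm, weights2, output_fm, fused):
    traffic = input_fm + weights1 + weights2 + output_fm
    if not fused:
        traffic += 2 * mid_fm        # write the intermediate feature map out, then read it back
    return traffic

sizes = dict(input_fm=6.0, weights1=4.5, mid_fm=12.0, weights2=9.0, output_fm=3.0)
print("layer-by-layer:", traffic_mb(fused=False, **sizes), "MB")   # 46.5 MB
print("fused         :", traffic_mb(fused=True,  **sizes), "MB")   # 22.5 MB
```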
49. A Fluid Scheduling Algorithm for DAG Tasks With Constrained or Arbitrary Deadlines.
- Author
-
Guan, Fei, Peng, Long, and Qiao, Jiaqing
- Subjects
- *DEADLINES, *PRODUCTION scheduling, *SCHEDULING, *ALGORITHMS, *TASKS, *PARALLEL algorithms, *GAUSSIAN channels
- Abstract
A number of scheduling algorithms have been proposed for real-time parallel tasks modeled as Directed Acyclic Graphs (DAGs). Many of them focus on scheduling DAG tasks with implicit deadlines. Fewer studies have considered DAG tasks with constrained deadlines or arbitrary deadlines. In this study, we propose a scheduling strategy based on fluid scheduling theory, targeting DAG tasks with constrained or arbitrary deadlines. We prove that the proposed algorithm has a capacity augmentation bound of $\frac{1}{2}(1+\beta+\sqrt{(1+\beta)^2-\frac{4}{m}})$ when scheduling multiple DAG tasks with constrained deadlines, in which $m$ is the number of processors and $\beta$ is the maximum ratio of task period to deadline. This value is lower than the current best result $\beta+2\sqrt{(\beta+1-\frac{1}{m})(1-\frac{1}{m})}$. We also prove that a capacity augmentation bound of $\frac{1}{2}(1+\sqrt{2}+\sqrt{(1+\sqrt{2})^2-\frac{4\sqrt{2}}{m}})$ is guaranteed by our algorithm in the case of scheduling multiple DAG tasks with deadlines greater than periods. To the best of our knowledge, this is the first capacity augmentation bound that has been proven for scheduling multiple DAG tasks with deadlines greater than periods. Our experiments show that our algorithm outperforms state-of-the-art scheduling algorithms in the percentage of schedulable task sets. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
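The two capacity augmentation bounds quoted in entry 49, evaluated numerically at one sample point as a sanity check; the formulas are taken from the abstract, while the chosen m and beta values are arbitrary.

```python
# Compare the proposed and previously best capacity augmentation bounds at (m, beta).
from math import sqrt

def proposed_bound(m, beta):
    return 0.5 * (1 + beta + sqrt((1 + beta) ** 2 - 4 / m))

def previous_bound(m, beta):
    return beta + 2 * sqrt((beta + 1 - 1 / m) * (1 - 1 / m))

m, beta = 8, 1.0
print(f"proposed: {proposed_bound(m, beta):.3f}, previous: {previous_bound(m, beta):.3f}")
# proposed ~1.94 vs. previous ~3.56, consistent with the abstract's claim of a lower bound
```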
50. A New Many-Objective Evolutionary Algorithm Based on Generalized Pareto Dominance.
- Author
-
Zhu, Shuwei, Xu, Lihong, Goodman, Erik D., and Lu, Zhichao
- Abstract
In the past several years, it has become apparent that the effectiveness of Pareto-dominance-based multiobjective evolutionary algorithms deteriorates progressively as the number of objectives in the problem, given by $M$, grows. This is mainly due to the poor discriminability of Pareto optimality in many-objective spaces (typically $M \geq 4$). As a consequence, research efforts have been driven in the general direction of developing solution ranking methods that do not rely on Pareto dominance (e.g., decomposition-based techniques), which can provide sufficient selection pressure. However, it is still a nontrivial issue for many existing non-Pareto-dominance-based evolutionary algorithms to deal with unknown irregular Pareto front shapes. In this article, a new many-objective evolutionary algorithm based on the generalization of Pareto optimality (GPO) is proposed, which is simple, yet effective, in addressing many-objective optimization problems. The proposed algorithm uses an “$(M-1)+1$” framework of GPO dominance, $(M-1)$-GPD for short, to rank solutions in the environmental selection step, in order to promote convergence and diversity simultaneously. To be specific, we apply $M$ symmetrical cases of $(M-1)$-GPD, each of which enhances the selection pressure of $M-1$ objectives by expanding the dominance area of solutions, while remaining unchanged for the one objective left out of that process. Experiments demonstrate that the proposed algorithm is very competitive with the state-of-the-art methods to which it is compared, on a variety of scalable benchmark problems. Moreover, experiments on three real-world problems have verified that the proposed algorithm can outperform the others on each of these problems. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
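A deliberately simplified sketch of the “(M-1)+1” ranking structure, assuming minimization and plain Pareto dominance on each leave-one-out subspace; the dominance-area expansion that defines $(M-1)$-GPD is omitted here, so this only illustrates the framework's shape, not the paper's actual operator.

```python
# Count in how many of the M leave-one-out subspaces solution a dominates solution b.
def dominates(a, b):
    """Standard Pareto dominance for minimization."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def leave_one_out_wins(a, b):
    M = len(a)
    wins = 0
    for dropped in range(M):                      # one symmetrical case per objective left out
        a_sub = tuple(a[i] for i in range(M) if i != dropped)
        b_sub = tuple(b[i] for i in range(M) if i != dropped)
        wins += dominates(a_sub, b_sub)
    return wins

a, b = (0.2, 0.5, 0.9, 0.1), (0.3, 0.5, 0.8, 0.4)
print(leave_one_out_wins(a, b), leave_one_out_wins(b, a))   # 1 0
```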