13 results on '"Weichen Liu"'
Search Results
2. CRIMP: Compact & Reliable DNN Inference on In-Memory Processing via Crossbar-Aligned Compression and Non-ideality Adaptation.
- Author
-
SHUO HUAI, HAO KONG, XIANGZHONG LUO, SHIQING LI, SUBRAMANIAM, RAVI, MAKAYA, CHRISTIAN, QIAN LIN, and WEICHEN LIU
- Subjects
COMPUTATIONAL complexity - Abstract
Crossbar-based In-Memory Processing (IMP) accelerators have been widely adopted to achieve high-speed and low-power computing, especially for deep neural network (DNN) models with numerous weights and high computational complexity. However, the floating-point (FP) arithmetic is not compatible with crossbar architectures. Also, redundant weights of current DNN models occupy too many crossbars, limiting the efficiency of crossbar accelerators. Meanwhile, due to the inherent non-ideal behavior of crossbar devices, like write variations, pre-trained DNN models suffer fromaccuracy degradation when it is deployed on a crossbarbased IMP accelerator for inference. Although some approaches are proposed to address these issues, they often fail to consider the interaction among these issues, and introduce significant hardware overhead for solving each issue. To deploy complex models on IMP accelerators, we should compact the model and mitigate the influence of device non-ideal behaviors without introducing significant overhead from each technique. In this paper, we first propose to reuse bit-shift units in crossbars for approximatelymultiplying scaling factors in our quantization scheme to avoid using FP processors. Second, we propose to apply kernel-group pruning and crossbar pruning to eliminate the hardware units for data aligning.We also design a zerorize-recover training process for our pruning method to achieve higher accuracy. Third, we adopt the runtime-aware nonideality adaptation with a self-compensation scheme to relieve the impact of non-ideality by exploiting the feature of crossbars. Finally, we integrate these three optimization procedures into one training process to form a comprehensive learning framework for co-optimization, which can achieve higher accuracy. The experimental results indicate that our comprehensive learning framework can obtain significant improvements over the original model when inferring on the crossbar-based IMP accelerator, with an average reduction of computing power and computing area by 100.02x and 17.37x, respectively. Furthermore, we can obtain totally integer-only, pruned, and reliable VGG-16 and ResNet-56 models for the Cifar-10 dataset on IMP accelerators, with accuracy drops of only 2.19% and 1.26%, respectively, without any hardware overhead. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
3. TAB: Unified and Optimized Ternary, Binary, and Mixed-precision Neural Network Inference on the Edge.
- Author
-
SHIEN ZHU, DUON, LUAN H. K., and WEICHEN LIU
- Subjects
DATA warehousing ,GRAPHICS processing units ,INTEGERS ,MATHEMATICAL convolutions - Abstract
Ternary Neural Networks (TNNs) and mixed-precision Ternary Binary Networks (TBNs) have demonstrated higher accuracy compared to Binary Neural Networks (BNNs) while providing fast, low-power, and memory efficient inference. Related works have improved the accuracy of TNNs and TBNs, but overlooked their optimizations on CPU and GPU platforms. First, there is no unified encoding for the binary and ternary values in TNNs and TBNs. Second, existing works store the 2-bit quantized data sequentially in 32/64-bit integers, resulting in bit-extraction overhead. Last, adopting standard 2-bit multiplications for ternary values leads to a complex computation pipeline, and efficient mixed-precision multiplication between ternary and binary values is unavailable. In this article, we propose TAB as a unified and optimized inference method for ternary, binary, and mixed precision neural networks. TAB includes unified value representation, efficient data storage scheme and novel bitwise dot product pipelines on CPU/GPU platforms. We adopt signed integers for consistent value representation across binary and ternary values. We introduce a bitwidth-last data format that stores the first and second bits of the ternary values separately to remove the bit extraction overhead. We design the ternary and binary bitwise dot product pipelines based on Gated-XOR using up to 40% fewer operations than State-Of The-Art (SOTA) methods. Theoretical speedup analysis shows that our proposed TAB-TNN is 2.3× fast as the SOTA ternary method RTN, 9.8× fast as 8-bit integer quantization (INT8), and 39.4× fast as 32-bit full-precision convolution (FP32). Experiment results on CPU and GPU platforms show that our TAB-TNN has achieved up to 34.6× speedup and 16× storage size reduction compared with FP32 layers. TBN, Binary-activation Ternary-weight Network (BTN), and BNN in TAB are up to 40.7×, 56.2×, and 72.2× as fast as FP32. TAB-TNN is up to 70.1% faster and 12.8% more power-efficient than RTN on Darknet-19 while keeping the same accuracy. TAB is open source as a PyTorch Extension1 for easy integration with existing CNN models [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
4. MARCO: A High-performance Task Mapping and Routing Co-optimization Framework for Point-to-Point NoC-based Heterogeneous Computing Systems.
- Author
-
HUI CHEN, ZIHAO ZHANG, PENG CHEN, XIANGZHONG LUO, SHIQING LI, and WEICHEN LIU
- Subjects
HETEROGENEOUS computing ,COMPUTER systems ,REINFORCEMENT learning ,ROUTING algorithms ,ALGORITHMS ,MAP design - Abstract
Heterogeneous computing systems (HCSs), which consist of various processing elements (PEs) that vary in their processing ability, are usually facilitated by the network-on-chip (NoC) to interconnect its components. The emerging point-to-point NoCs which support single-cycle-multi-hop transmission, reduce or eliminate the latency dependence on distance, addressing the scalability concern raised by high latency for long-distance transmission and enlarging the design space of the routing algorithm to search the non-shortest paths. For such point-to-point NoC-based HCSs, resource management strategies which are managed by compilers, scheduler, or controllers, e.g., mapping and routing, are complicated for the following reasons: (i) Due to the heterogeneity, mapping and routing need to optimize computation and communication concurrently (for homogeneous computing systems, only communication). (ii) Conducting mapping and routing consecutively cannot minimize the schedule length in most cases since the PEs with high processing ability may locate in the crowded area and suffer from high resource contention overhead. (iii) Since changing the mapping selection of one task will reconstruct the whole routing design space, the exploration of mapping and routing design space is challenging. Therefore, in this work, we propose MARCO, the mapping and routing co-optimization framework, to decrease the schedule length of applications on point-to-point NoC-based HCSs. Specifically, we revise the tabu search to explore the design space and evaluate the quality of mapping and routing. The advanced reinforcement learning (RL)algorithm, i.e., advantage actor-critic, is adopted to efficiently compute paths. We perform extensive experiments on various real applications, which demonstrates that the MARCO achieves a remarkable performance improvement in terms of schedule length (+44.94% ~ +50.18%) when compared with the state-of-the-art mapping and routing co-optimization algorithm for homogeneous computing systems. We also compare MARCO with different combinations of state-of-the-art mapping and routing approaches. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
5. Attack Mitigation of Hardware Trojans for Thermal Sensing via Micro-ring Resonator in Optical NoCs.
- Author
-
JUN ZHOU, MENGQUAN LI, PENGXING GUO, and WEICHEN LIU
- Subjects
OPTICAL resonators ,ARTIFICIAL neural networks ,SECURITY classification (Government documents) ,COMPUTER network security - Abstract
As an emerging role in new-generation on-chip communication, optical networks-on-chip (ONoCs) provide ultra-high bandwidth, low latency, and low power dissipation for data transfers. However, the thermo-optic effects of the photonic devices have a great impact on the operating performance and reliability of ONoCs, where the thermal-aware control with accurate measurements, e.g., thermal sensing, is typically applied to alleviate it. Besides, the temperature-sensitive ONoCs are prone to be attacked by the hardware Trojans (HTs) covertly embedded in the counterfeit integrated circuits (ICs) from the malicious third-party vendors, leading to performance degradation, denial-of-service (DoS), or even permanent damages. In this article, we focus on the tampering and snooping attacks during the thermal sensing via micro-ring resonator (MR) in ONoCs. Based on the provided workflow and attack model, a new structure of the anti-HT module is proposed to verify and protect the obtained data from the thermal sensor for attacks in its optical sampling and electronic transmission processes. In addition, we present the detection scheme based on the spiking neural networks (SNNs) to implement an accurate classification of the network security statuses for further high-level control. Evaluation results indicate that, with less than 1% extra area of a tile, our approach can significantly enhance the hardware security of thermal sensing for ONoC with trivial costs of up to 8.73%, 5.32%, and 6.14% in average latency, execution time, and energy consumption, respectively. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
6. Scheduling and Analysis of Parallel Real-Time Tasks with Semaphores.
- Author
-
Xu Jiang, Nan Guan, Weichen Liu, and Maolin Yang
- Subjects
SCHEDULING ,DIRECTED acyclic graphs ,SOFTWARE failures ,DATA structures ,REAL-time control - Abstract
This paper for the first time studies the scheduling and analysis of parallel real-time tasks with semaphores. In parallel task systems, each task may issue multiple requests to a semaphore, which raises new challenges to the design and analysis problems. We propose a new locking protocol LPP that limits the maximal number of requests to a semaphore by a task that can block other tasks at any time. We develop analysis techniques to safely bound the task response times, with which we prove that the best real-time performance is achieved if only one request to a semaphore by a task is allowed to block other tasks at a time. Experiments under different parameter settings are conducted to compare our proposed protocol and analysis techniques with the state-of-the-art spinlock protocol and analysis techniques for parallel real-time tasks. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
7. Hardware-Software Collaborative Thermal Sensing in Optical Network-on-Chip--based Manycore Systems.
- Author
-
MENGQUAN LI, WEICHEN LIU, NAN GUAN, YIYUAN XIE, and YAOYAO YE
- Subjects
TEMPERATURE distribution ,OPTICAL devices ,RELIABILITY in engineering - Abstract
Continuous technology scaling in manycore systems leads to severe overheating issues. To guarantee system reliability, it is critical to accurately yet efficiently monitor runtime temperature distribution for effective chip thermal management. As an emerging communication architecture for new-generation manycore systems, optical network-on-chip (ONoC) satisfies the communication bandwidth and latency requirements with low power dissipation. Moreover, observation shows that it can be leveraged for runtime thermal sensing. In this article, we propose a brand-new on-chip thermal sensing approach for ONoC-based manycore systems by utilizing the intrinsic thermal sensitivity of optical devices and the inter-processor communications in ONoCs. It requires no extra hardware but utilizes existing optical devices in ONoCs and combines them with lightweight software computation in a hardware-software collaborative manner. The effectiveness of the our approach is validated both at the device level and the system level through professional photonic simulations. Evaluation results based on synthetic communication traces and realistic benchmarks showthat our approach achieves an average temperature inaccuracy of only 0.6648 K compared to ground-truth values and is scalable to be applied for large-size ONoCs. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
8. Timing-Anomaly Free Dynamic Scheduling of Conditional DAG Tasks on Multi-Core Systems.
- Author
-
PENG CHEN, WEICHEN LIU, XU JIANG, QINGQIANG HE, and NAN GUAN
- Subjects
REACTION time ,SCHEDULING ,TASKS ,TARDINESS - Abstract
In this paper, we propose a novel approach to schedule conditional DAG parallel tasks, with which we can derive safe response time upper bounds significantly better than the state-of-the-art counterparts. The main idea is to eliminate the notorious timing anomaly in scheduling parallel tasks by enforcing certain order constraints among the vertices, and thus the response time bound can be accurately predicted off-line by somehow "simulating" the runtime scheduling. A key challenge to apply the timing-anomaly free scheduling approach to conditional DAG parallel tasks is that at runtime it may generate exponentially many instances from a conditional DAG structure. To deal with this problem, we develop effective abstractions, based on which a safe response time upper bound is computed in polynomial time. We also develop algorithms to explore the vertex orders to shorten the response time bound. The effectiveness of the proposed approach is evaluated by experiments with randomly generated DAG tasks with different parameter configurations. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
9. An Efficient UAV Hijacking Detection Method Using Onboard Inertial Measurement Unit.
- Author
-
ZHIWEI FENG, NAN GUAN, MINGSONG LV, WEICHEN LIU, QINGXU DENG, XUE LIU, and WANG YI
- Subjects
GPS receivers ,GLOBAL Positioning System ,COMPUTER engineering ,CYBER physical systems - Published
- 2018
- Full Text
- View/download PDF
10. Task Mapping on SMART NoC: Contention Matters, Not the Distance.
- Author
-
Lei Yang, Weichen Liu, Peng Chen, Nan Guan, and Mengquan Li
- Subjects
NETWORK operating system ,LOCAL area networks ,HEURISTIC algorithms ,MICROPROCESSORS ,MICROTECHNOLOGY - Abstract
On-chip communication is the bottleneck of system performance for NoC-based MPSoCs. SMART, a recently proposed NoC architecture, enables single-cycle multi-hop communications. In SMART NoCs, unconflicted messages can go through an express bypass and the communication efficiency is significantly improved, while conflicted messages have to be buffered for guaranteed delivery with extra delays. Therefore, that performance of SMART NoC may be seriously degraded when communication contention increases. In this paper, we present task mapping techniques to address this problem for SMART NoCs, with the consideration of communication contention, rather than interprocessor distance, by minimizing conflicts and thus maximizing bypass utilization. We first model the entire problem by ILP formulations to find the theoretically optimal solution, and further propose polynomial-time algorithms for contention-aware task mapping and message priority assignment. Communicating tasks can be mapped to distant processors in SMART NoCs as long as conflict-free communication paths can be established and bypass can be enabled. Evaluation results on real benchmarks show an average of 44:1% and 32:8% improvement in communication efficiency and application performance compared to state-ofthe- art techniques. The proposed heuristic algorithms only introduce 1:9% performance difference compared to the ILP model and are more scalable to large-size NoCs. [ABSTRACT FROM AUTHOR]
- Published
- 2017
- Full Text
- View/download PDF
11. An Efficient Technique of Application Mapping and Scheduling on Real-Time Multiprocessor Systems for Throughput Optimization.
- Author
-
WEICHEN LIU and CHUNHUA XIAO
- Subjects
APPLICATION software ,COMPUTER scheduling ,MATHEMATICAL mappings ,REAL-time computing ,MULTIPROCESSORS ,MATHEMATICAL optimization ,UBIQUITOUS computing - Abstract
Multiprocessor systems are becoming ubiquitous in today's embedded systems design. In this article, we address the problem of mapping an application represented by a Homogeneous Synchronous Dataflow (HSDF) graph onto a real-time multiprocessor platform with the objective of maximizing total throughput. We propose that the optimal solution to the problem is composed of three components: actor-to-processor mapping, retiming, and actor ordering on each processor. The entire problem is systematically modeled into a Boolean Satisfiability (SAT) problem such that the optimal solution can be guaranteed theoretically. In order to explore the vast solution space more efficiently, we develop a specific HSDF theory solver based on the special characteristics of the timed HSDF, and integrate it into the general search framework of the SAT solver. Two alternative integration methods based on branch-and-bound are presented to achieve early branch pruning in the search space; thus, the scalability is greatly improved. Extensive performance evaluation on synthetic examples and a case study on the realistic H.264 Video Decoder show that our approach provides as much as 76.9% throughput improvement, and is scalable to industry-sized applications. [ABSTRACT FROM AUTHOR]
- Published
- 2016
- Full Text
- View/download PDF
12. Crosstalk Noise and Bit Error Rate Analysis for Optical Network-on-Chip.
- Author
-
Yiyuan Xie, Nikdast, Mahdi, Jiang Xu, Wei Zhang, Qi Li, Xiaowen Wu, Yaoyao Ye, Xuan Wang, and Weichen Liu
- Subjects
NETWORKS on a chip ,OPTICAL communications ,CROSSTALK ,SIGNAL-to-noise ratio ,BIT rate ,ERROR rates - Abstract
Crosstalk noise is an intrinsic characteristic of photonic devices used by optical networks-on-chip (ONoCs) as well as a potential issue. For the first time, this paper analyzed and modeled the crosstalk noise, signal-to-noise ratio (SNR), and bit error rate (BER) of optical routers and ONoCs. The analytical models for crosstalk noise, minimum SNR, and maximum BER in meshbased ONoCs are presented. An automated crosstalk analyzer for optical routers is developed. We find that crosstalk noise significantly limits the scalability of ONoCs. For example, due to crosstalk noise, the maximum BER is 10
-3 on the 8×8 meshbased ONoC using an optimized crossbar-based optical router. To achieve the BER of 10-9 for reliable transmissions, the maximum ONoC size is 6×6. A novel compact high-SNR optical router is proposed to improve the maximum ONoC size to 8×8. [ABSTRACT FROM AUTHOR]- Published
- 2010
13. On-Chip Sensor Networks for Soft-Error Tolerant Real-Time Multiprocessor Systems-on-Chip.
- Author
-
WEICHEN LIU, XUAN WANG, JIANG XU, WEI ZHANG, YAOYAO YE, XIAOWEN WU, MAHDI NIKDAST, and ZHEHUI WANG
- Subjects
NETWORKS on a chip ,SENSOR networks ,TOLERANCE analysis (Engineering) ,MULTIPROCESSORS ,RELIABILITY in engineering - Abstract
As transistor density continues to increase with the advent of nanotechnology, reliability issues raised by the more frequent appearance of soft errors are becoming critical for future embedded multiprocessor systems design. State-of-the-art techniques for soft error protections targetingmultiprocessor systems result either high chip cost and area overhead or high performance degradation and energy consumption, and do not fulfill the increasing requirements for high performance and dependability. In this article we present a systematic approach, that is, the Sensor Networks-on-Chip (SENoC), to collaboratively and efficiently manage on-chip applications and overcome reliability threats to Multiprocessor Systems-on-Chip (MPSoC). A hardware-software collaborative approach is proposed to solve soft error problems: a hardware-based on-chip sensor network is built for soft error detection, and a software-based recovery mechanism is applied for soft error correction. A two-step scheduling scheme is presented for reliable application and chip management, combining an off-line static optimization stage for application performance maximization and an online lightweight dynamic adjustment stage to handle runtime variations and exceptions. This strategy introduces only trivial overhead on hardware design and much lower overhead on software control and execution, and hence performance degradation and energy consumption is greatly reduced. We build a cycle-accurate simulator using SystemC, and verify the effectiveness of our technique by comparing performance with related techniques on several real-world applications. [ABSTRACT FROM AUTHOR]
- Published
- 2014
- Full Text
- View/download PDF
Catalog
Discovery Service for Jio Institute Digital Library
For full access to our library's resources, please sign in.