106 results for "Yuan Xie"
Search Results
2. 2022 ICCAD CAD Contest Problem C
- Author
- Sicheng Li, Chen Bai, Xuechao Wei, Bizhao Shi, Yen-Kuang Chen, and Yuan Xie
- Published
- 2022
3. Cross-Domain and Cross-Modal Knowledge Distillation in Domain Adaptation for 3D Semantic Segmentation
- Author
- Miaoyu Li, Yachao Zhang, Yuan Xie, Zuodong Gao, Cuihua Li, Zhizhong Zhang, and Yanyun Qu
- Published
- 2022
4. Hierarchical Walking Transformer for Object Re-Identification
- Author
- Xudong Tian, Jun Liu, Zhizhong Zhang, Chengjie Wang, Yanyun Qu, Yuan Xie, and Lizhuang Ma
- Published
- 2022
5. Not All Pixels Are Matched
- Author
- Hanzhe Sun, Jun Liu, Zhizhong Zhang, Chengjie Wang, Yanyun Qu, Yuan Xie, and Lizhuang Ma
- Published
- 2022
6. Rethinking the Metric in Few-shot Learning
- Author
- Jinxiang Lai, Siqian Yang, Guannan Jiang, Xi Wang, Yuxi Li, Zihui Jia, Xiaochen Chen, Jun Liu, Bin-Bin Gao, Wei Zhang, Yuan Xie, and Chengjie Wang
- Subjects
- FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition
- Abstract
The few-shot learning problem focuses on recognizing unseen classes given a few labeled images. Recent efforts pay more attention to fine-grained feature embedding while ignoring the relationships among different distance metrics. In this paper, for the first time, we investigate the contributions of different distance metrics and propose an adaptive fusion scheme that brings significant improvements in few-shot classification. We start from a naive baseline of confidence summation and demonstrate the necessity of exploiting the complementary property of different distance metrics. Having identified the competition problem among them, we build on the baseline and propose an Adaptive Metrics Module (AMM) that decouples metric fusion into metric-prediction fusion and metric-loss fusion. The former encourages mutual complementarity, while the latter alleviates metric competition via multi-task collaborative learning. Based on AMM, we design a few-shot classification framework, AMTNet, comprising the AMM and a Global Adaptive Loss (GAL), to jointly optimize the few-shot task and an auxiliary self-supervised task, making the embedding features more robust. In experiments, the proposed AMM achieves 2% higher performance than the naive metric-fusion module, and AMTNet outperforms the state of the art on multiple benchmark datasets.
- Published
- 2022
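For readers skimming the abstract above, the metric-fusion idea reduces to combining the class probabilities produced by several distance metrics. A minimal NumPy sketch follows, assuming a prototype-based 5-way episode; the fixed weight `alpha` stands in for AMM's adaptively predicted fusion weights, which the paper learns rather than hand-sets.

```python
import numpy as np

# Hypothetical 5-way episode: one prototype per class plus one query embedding.
rng = np.random.default_rng(0)
prototypes = rng.normal(size=(5, 64))
query = rng.normal(size=(64,))

def cosine_logits(q, protos):
    q = q / np.linalg.norm(q)
    p = protos / np.linalg.norm(protos, axis=1, keepdims=True)
    return p @ q                                   # higher = closer

def neg_euclidean_logits(q, protos):
    return -np.linalg.norm(protos - q, axis=1)     # higher = closer

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Naive confidence-summation baseline: blend the per-metric probabilities.
p_cos = softmax(cosine_logits(query, prototypes))
p_euc = softmax(neg_euclidean_logits(query, prototypes))
alpha = 0.5            # AMM would predict this weight adaptively, not fix it
p_fused = alpha * p_cos + (1 - alpha) * p_euc
print("predicted class:", int(p_fused.argmax()))
```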
7. Global Meets Local: Effective Multi-Label Image Classification via Category-Aware Weak Supervision
- Author
- Jiawei Zhan, Jun Liu, Wei Tang, Guannan Jiang, Xi Wang, Bin-Bin Gao, Tianliang Zhang, Wenlong Wu, Wei Zhang, Chengjie Wang, and Yuan Xie
- Subjects
- FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition
- Abstract
Multi-label image classification, whose methods can be categorized into label-dependency and region-based approaches, is a challenging problem due to complex underlying object layouts. Although region-based methods are less likely to suffer from poor model generalizability than label-dependency methods, they often generate hundreds of meaningless or noisy proposals carrying non-discriminative information, and the contextual dependency among the localized regions is often ignored or over-simplified. This paper builds a unified framework that performs effective noisy-proposal suppression and lets global and local features interact for robust feature learning. Specifically, we propose category-aware weak supervision that concentrates on non-existent categories so as to provide deterministic information for local feature learning, restricting the local branch to focus on higher-quality regions of interest. Moreover, we develop a cross-granularity attention module to explore the complementary information between global and local features, which can build high-order feature correlations containing not only global-to-local but also local-to-local relations. Both advantages guarantee a boost in the performance of the whole network. Extensive experiments on two large-scale datasets (MS-COCO and VOC 2007) demonstrate that our framework achieves superior performance over state-of-the-art methods. (12 pages, 10 figures; published in ACM MM 2022.)
- Published
- 2022
8. Self-supervised Exclusive Learning for 3D Segmentation with Cross-Modal Unsupervised Domain Adaptation
- Author
- Yachao Zhang, Miaoyu Li, Yuan Xie, Cuihua Li, Cong Wang, Zhizhong Zhang, and Yanyun Qu
- Published
- 2022
9. Adjustable Memory-efficient Image Super-resolution via Individual Kernel Sparsity
- Author
- Xiaotong Luo, Mingliang Dai, Yulun Zhang, Yuan Xie, Ding Liu, Yanyun Qu, Yun Fu, and Junping Zhang
- Published
- 2022
10. INSPIRE
- Author
- Jilan Lin, Ling Liang, Zheng Qu, Ishtiyaque Ahmad, Liu Liu, Fengbin Tu, Trinabh Gupta, Yufei Ding, and Yuan Xie
- Published
- 2022
11. A synthesis framework for stitching surface code with superconducting quantum devices
- Author
- Anbang Wu, Gushu Li, Hezi Zhang, Gian Giacomo Guerreschi, Yufei Ding, and Yuan Xie
- Published
- 2022
12. DIMMining
- Author
- Guohao Dai, Zhenhua Zhu, Tianyu Fu, Chiyue Wei, Bangyan Wang, Xiangyu Li, Yuan Xie, Huazhong Yang, and Yu Wang
- Published
- 2022
13. Hyperscale FPGA-as-a-service architecture for large-scale distributed graph neural network
- Author
- Shuangchen Li, Dimin Niu, Yuhao Wang, Wei Han, Zhe Zhang, Tianchan Guan, Yijin Guan, Heng Liu, Linyong Huang, Zhaoyang Du, Fei Xue, Yuanwei Fang, Hongzhong Zheng, and Yuan Xie
- Published
- 2022
14. DOTA: detect and omit weak attentions for scalable transformer acceleration
- Author
- Zheng Qu, Liu Liu, Fengbin Tu, Zhaodong Chen, Yufei Ding, and Yuan Xie
- Published
- 2022
15. A one-for-all and o(v log(v))-cost solution for parallel merge style operations on sorted key-value arrays
- Author
- Bangyan Wang, Lei Deng, Fei Sun, Guohao Dai, Liu Liu, Yu Wang, and Yuan Xie
- Published
- 2022
16. Efficient tensor core-based GPU kernels for structured sparsity under reduced precision
- Author
- Liu Liu, Zhaodong Chen, Yuan Xie, Zheng Qu, and Yufei Ding
- Subjects
- Speedup, Artificial neural network, Computer science, Computation, Tensor (intrinsic definition), Kernel (statistics), Parallel computing, General-purpose computing on graphics processing units, Sparse matrix, Block (data storage)
- Abstract
The success of DNNs comes at the expense of excessive memory/computation cost, which can be addressed by jointly exploiting reduced precision and sparsity. Existing sparse GPU kernels, however, fail to achieve practical speedup over cuBLAS HGEMM under half precision. Kernels for fine-grained sparsity suffer from low data reuse, while those for coarse-grained sparsity are limited by the tension between kernel performance and model quality under different grain sizes. We propose column-vector-sparse-encoding, which has a smaller grain size than block sparsity at the same reuse rate. Column-vector-sparse-encoding can be applied to both SpMM and SDDMM, the two major sparse DNN operations. We also introduce Tensor-Core-based 1D Octet Tiling, which has efficient memory access and computation patterns under small grain sizes. Based on these, we design SpMM and SDDMM kernels that achieve 1.71-7.19x speedup over cuSPARSE. Practical speedup over cuBLAS HGEMM is achieved under >70% and >90% sparsity with a 4x1 grain size and half precision.
- Published
- 2021
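The abstract's column-vector-sparse-encoding can be illustrated without CUDA: store only the nonzero 4x1 column vectors of the weight matrix together with their column indices, then multiply each stored vector against the matching row of the dense input. The NumPy sketch below is a functional stand-in for the paper's tensor-core kernels; the sizes and sparsity level are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
M, K, N, V = 8, 16, 4, 4            # weight is MxK; V=4 is the vector (grain) height
W = rng.normal(size=(M, K))
# Zero out 75% of the 4x1 column vectors to mimic vector-wise structured sparsity.
mask = rng.random((M // V, K)) < 0.25
W *= np.repeat(mask, V, axis=0)

# Encode: for each band of 4 rows, keep surviving 4x1 vectors and their column index.
encoded = []
for b in range(M // V):
    cols = np.nonzero(mask[b])[0]
    encoded.append((cols, W[b*V:(b+1)*V, cols]))   # (indices, 4 x nnz values)

X = rng.normal(size=(K, N))

# SpMM using only the stored vectors: each 4x1 vector multiplies one row of X.
Y = np.zeros((M, N))
for b, (cols, vals) in enumerate(encoded):
    Y[b*V:(b+1)*V] = vals @ X[cols]

assert np.allclose(Y, W @ X)        # matches the dense product
```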
17. ENMC: Extreme Near-Memory Classification via Approximate Screening
- Author
- Zheng Qu, Liu Liu, Yuan Xie, Jilan Lin, and Yufei Ding
- Subjects
- Speedup, Computational complexity theory, Computer engineering, Computer science, Deep learning, Component (UML), Classifier (linguistics), Artificial intelligence, Language model, Central processing unit, Power budget
- Abstract
Extreme classification (XC) is an essential component of large-scale deep learning systems across a wide range of application domains, including image recognition, language modeling, and recommendation. As classification categories keep scaling in real-world applications, the classifier's parameters can reach several thousand gigabytes, far exceeding on-chip memory capacity. With the advent of near-memory processing (NMP) architectures, offloading the XC component onto NMP units can alleviate this memory-intensive problem. However, a naive NMP design with limited area and power budget cannot afford the computational complexity of full classification. To tackle the problem, we first propose a novel screening method that reduces computation and memory consumption by efficiently approximating the classification output and identifying a small portion of key candidates that require accurate results. Then, we design a new extreme-classification-tailored NMP architecture, ENMC, to support both screening and candidates-only classification. Overall, our approximate screening method achieves 7.3× speedup over the CPU baseline, and ENMC further improves performance by 7.4× and demonstrates 2.7× speedup over the state-of-the-art NMP baseline.
- Published
- 2021
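The screening-then-refine flow in the abstract can be mimicked in a few lines: score all classes cheaply at low precision, keep the top-k candidates, and re-score only those exactly. The NumPy sketch below uses naive int8 quantization as the approximate pass; ENMC's actual screening method, class count, and candidate-set size are assumptions here.

```python
import numpy as np

rng = np.random.default_rng(2)
C, D, k = 10_000, 128, 64            # classes, feature dim, candidates kept
W = rng.normal(size=(C, D)).astype(np.float32)
x = rng.normal(size=(D,)).astype(np.float32)

# Screening pass: crude int8 weights/activations stand in for approximate scores.
wscale = np.abs(W).max() / 127.0
xscale = np.abs(x).max() / 127.0
W_q = np.round(W / wscale).astype(np.int8)
x_q = np.round(x / xscale).astype(np.int8)
approx = W_q.astype(np.int32) @ x_q.astype(np.int32)   # cheap integer scores

candidates = np.argpartition(approx, -k)[-k:]           # small key-candidate set

# Accurate pass: full-precision scores for the candidates only.
exact = W[candidates] @ x
print("top-1 class:", int(candidates[exact.argmax()]))
```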
18. Improving Streaming Graph Processing Performance using Input Knowledge
- Author
- Abanti Basak, Zeshan A. Chishti, Jilan Lin, Zheng Qu, Yuan Xie, Alaa R. Alameldeen, and Yufei Ding
- Subjects
- Acceleration, Software, Degree (graph theory), Computer science, Computation, Big data, Locality, Granularity, Parallel computing, Complement (set theory)
- Abstract
Streaming graphs are ubiquitous in today's big data era. Prior work has improved the performance of streaming graph workloads without taking input characteristics into account. In this work, we demonstrate that input knowledge-driven software and hardware co-design is critical to optimizing the performance of streaming graph processing. To improve graph update efficiency, we first characterize the performance trade-offs of input-oblivious batch reordering. Guided by our findings, we propose input-aware batch reordering, which adaptively reorders input batches based on their degree distributions. To complement adaptive batch reordering, we propose updating graphs dynamically, based on their input characteristics, either in software (via update search coalescing) or in hardware (via acceleration support). To improve graph computation efficiency, we present input-aware work aggregation, which adaptively modulates the computation granularity based on inter-batch locality characteristics. Evaluated across 260 workloads, our input-aware techniques provide on average 4.55× and 2.6× improvement in graph update performance for different input types (on top of eliminating the performance degradation from input-oblivious batch reordering). Graph compute performance is improved by 1.26× (up to 2.7×).
- Published
- 2021
19. Faster-PPN: Towards Real-Time Semantic Segmentation with Dual Mutual Learning for Ultra-High Resolution Images
- Author
- Kai Li, Tong Wu, Yuan Xie, Bicheng Dai, Yanyun Qu, Kaisheng Wu, and Yun Fu
- Subjects
- Artificial neural network, Pixel, Feature (computer vision), Computer science, Selection (linguistics), Inference, Segmentation, Pattern recognition, Artificial intelligence, Resolution (logic), Representation (mathematics)
- Abstract
Despite recent progress on semantic segmentation, huge challenges remain in semantic segmentation of high- or ultra-high-resolution images. Although the latest collaborative global-local semantic segmentation methods such as GLNet [4] and PPN [18] have achieved impressive results, they are inefficient and ill-suited to practical applications. In this paper, we therefore propose a novel and efficient collaborative global-local framework built on PPN, named Faster-PPN, for semantic segmentation of high- or ultra-high-resolution images, which strikes a better trade-off between efficiency and effectiveness toward real-time speed. Specifically, we propose Dual Mutual Learning to improve the feature representations of the global and local branches, conducting knowledge distillation mutually between them. Furthermore, we design a Pixel Proposal Fusion Module that performs fine-grained selection, further reducing the redundant pixels used for fusion and thereby improving inference speed. Experimental results on three challenging high- or ultra-high-resolution datasets, DeepGlobe, ISIC, and BACH, demonstrate that Faster-PPN achieves the best accuracy, inference speed, and memory usage compared with state-of-the-art approaches. In particular, our method achieves real-time and near-real-time speeds of 36 FPS and 17.7 FPS on ISIC and DeepGlobe, respectively.
- Published
- 2021
20. On the Co-Design of Quantum Software and Hardware
- Author
- Gushu Li, Yunong Shi, Ali Javadi-Abhari, Anbang Wu, Yuan Xie, and Yufei Ding
- Subjects
- Hardware architecture, Exploit, Computer science, Design flow, Optimizing compiler, Software, Software system, Quantum, Computer hardware, Quantum computer
- Abstract
A quantum computing system naturally consists of two components: the software system and the hardware system. Quantum applications are programmed using the quantum software and then executed on the quantum hardware. However, the performance of existing quantum computing systems is still limited, and solving a practical problem beyond the capability of classical computers on a quantum computer has not yet been demonstrated. In this review, we argue that quantum software and hardware systems should be designed collaboratively to fully exploit the potential of quantum computing. We first review three related works: a hardware-aware quantum compiler optimization, an application-aware quantum hardware architecture design flow, and a co-design approach for emerging quantum computational chemistry. We then discuss potential future directions following the co-design principle.
- Published
- 2021
21. IronMan
- Author
- Cong Hao, Yuan Xie, and Nan Wu
- Subjects
- Computer science, Design space exploration, Pareto principle, Data flow diagram, Computer engineering, High-level synthesis, Simulated annealing, Genetic algorithm, Resource allocation, Reinforcement learning
- Abstract
Despite the great success of High-Level Synthesis (HLS) tools, we observe several unresolved challenges: 1) the high-level abstraction of programming styles in HLS conceals optimization opportunities; 2) existing HLS tools do not provide flexible trade-offs among different objectives and constraints; 3) the actual quality of the resulting RTL designs is hard to predict. To this end, we propose an end-to-end framework, IronMan. The primary goal is to enable flexible and automated design space exploration (DSE) that can provide either optimized solutions under user-specified constraints or Pareto trade-offs among different objectives (e.g., resource types, area, and latency). IronMan consists of three components: GPP (a highly accurate graph-neural-network-based performance predictor), RLMD (a reinforcement-learning-based DSE engine that explores optimized resource-allocation strategies), and CT (a code transformer that assists RLMD and GPP by extracting data-flow graphs from the original HLS C/C++). Experimental results show that 1) GPP achieves high prediction accuracy, reducing the prediction errors of HLS tools by 10.9X for resource usage and 5.7X for timing; 2) RLMD obtains optimized or Pareto solutions that outperform genetic algorithms and simulated annealing by 12.7% and 12.9%, respectively; 3) IronMan finds optimized solutions that perfectly match various DSP constraints, with 2.54X fewer DSPs and up to 6X shorter latency than those of HLS tools, while being up to 400X faster than meta-heuristic techniques and HLS tools.
- Published
- 2021
22. NEST
- Author
- Wenqin Huangfu, Peng Gu, Yuan Xie, Krishna T. Malladi, and Shuangchen Li
- Subjects
- Task (computing), Computer science, Embedded system, Scalability, Memory bandwidth, DIMM, Field-programmable gate array, Scheduling (computing)
- Abstract
With the ability to help wildlife conservation, precise medical care, and disease understanding, genomics analysis is becoming more and more important. Recently, with the development and wide adoption of Next-Generation Sequencing (NGS) technology, bio-data has grown exponentially, posing great challenges for k-mer counting, a widely used application in genomics analysis. Many hardware approaches have been explored to accelerate k-mer counting. Most of these approaches are compute-centric, i.e., based on CPUs/GPUs/FPGAs. However, the space for performance improvement is limited for compute-centric accelerators, because k-mer counting is a memory-bound application. By integrating memory and computation closely together and embracing higher memory bandwidth, Near-Data Processing (NDP) is a good candidate for accelerating k-mer counting. Unfortunately, due to challenges in communication, bandwidth utilization, workload balance, and redundant memory accesses, previous NDP accelerators for k-mer counting cannot fully unleash the power of NDP. To build a practical, scalable, high-performance, and energy-efficient NDP accelerator for k-mer counting, we perform hardware/software co-design and propose NEST, a DIMM-based Near-Data Processing accelerator for k-mer counting. To fully unleash the potential of the NEST architecture, we modify the k-mer counting algorithm and propose a dedicated workflow to support efficient parallelism; the proposed algorithm and workflow also reduce unnecessary inter-DIMM communication. To improve memory-bandwidth utilization, we propose a novel address-mapping scheme. The challenge of workload balance is addressed with the proposed task-scheduling technique. In addition, scattered memory access and task switching are proposed to eliminate redundant memory accesses. Experimental results show that NEST provides 677.33x/27.24x/6.02x performance improvement and 1076.14x/62.26x/4.30x energy reduction compared with a 48-thread CPU, a CPU/GPU hybrid approach, and a state-of-the-art NDP accelerator, respectively.
- Published
- 2020
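As a reference point for what NEST accelerates, k-mer counting itself is a short kernel: slide a window of length k over each read and count occurrences. The plain-Python baseline below shows the memory-bound access pattern (one table update per window) that the paper offloads to DIMMs; it is only a CPU reference, not the paper's algorithm.

```python
from collections import Counter

def kmer_counts(reads, k):
    """Count all length-k substrings across reads (the kernel NEST accelerates)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1   # one table update per window
    return counts

reads = ["ACGTACGTGA", "CGTACG"]
print(kmer_counts(reads, 4).most_common(3))
```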
23. fuseGNN
- Author
- Mingyu Yan, Guoqi Li, Yuan Xie, Maohua Zhu, Lei Deng, Shuangchen Li, and Zhaodong Chen
- Subjects
- Kernel (linear algebra), CUDA, Speedup, Artificial neural network, Computer science, Node (networking), Kernel (statistics), Parallel computing, General-purpose computing on graphics processing units, Throughput (business), Convolutional neural network, Graph
- Abstract
Graph convolutional neural networks (GNNs) have achieved state-of-the-art performance on tasks such as node classification and have become a new workload family in data centers. GNNs work on irregular graph-structured data in three distinct phases: Combination, Graph Processing, and Aggregation. While the Combination phase is well supported by sgemm kernels in cuBLAS, the other two phases remain inefficient on GPGPUs due to the lack of optimized CUDA kernels. In particular, the Aggregation phase introduces a large DRAM storage footprint and heavy data movement, and both the Aggregation and Graph Processing phases suffer from high kernel-launch time. These inefficiencies not only decrease training throughput but also prevent users from training GNNs on larger graphs on GPGPUs. Although these problems have been partially alleviated by recent studies, the existing optimizations are still insufficient. In this paper, we propose fuseGNN, an extension of PyTorch that provides highly optimized APIs and CUDA kernels for GNNs. First, two different programming abstractions for the Aggregation phase are utilized to handle graphs with different average degrees. Second, dedicated GPGPU kernels are developed for Aggregation and Graph Processing in both forward and backward passes, in which kernel fusion and other optimization strategies are applied to reduce kernel-launch time and latency and to exploit data-reuse opportunities. Evaluation on multiple benchmarks shows that fuseGNN achieves up to 5.3× end-to-end speedup over state-of-the-art frameworks, and the DRAM storage footprint is reduced by several orders of magnitude on large datasets.
- Published
- 2020
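The Aggregation phase the abstract identifies as a bottleneck is essentially a scatter-add over edges: each destination node sums the features of its source neighbors. A tiny NumPy sketch of that primitive follows; fuseGNN's fused CUDA kernels implement far more than this, so treat it only as a statement of the operation.

```python
import numpy as np

# Tiny graph: 4 nodes, directed edges (src -> dst).
src = np.array([0, 1, 2, 3, 0])
dst = np.array([1, 2, 3, 0, 2])
H = np.arange(8, dtype=np.float64).reshape(4, 2)  # node features after Combination

# Aggregation phase: sum each node's incoming neighbor features.
out = np.zeros_like(H)
np.add.at(out, dst, H[src])   # scatter-add, the core of the Aggregation kernels
print(out)
```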
24. MNSIM 2.0: A Behavior-Level Modeling Tool for Memristor-based Neuromorphic Computing Systems
- Author
- Yuan Xie, X. Sharon Hu, Kaizhong Qiu, Xiaoming Chen, Gokul Krishnan, Dimin Niu, Zhenhua Zhu, Lixue Xia, Yu Wang, Yu Cao, Guohao Dai, Huazhong Yang, and Hanbo Sun
- Subjects
- Structure (mathematical logic), Artificial neural network, Computer science, Design space exploration, Inference, Memristor, Neuromorphic engineering, Computer architecture, Architecture, Efficient energy use
- Abstract
Memristor-based neuromorphic computing systems provide alternative solutions to boost the computing energy efficiency of neural network (NN) algorithms. Because of the large-scale applications and the large architecture design space, many factors affect computing accuracy and system performance. In this work, we propose MNSIM 2.0, a behavior-level modeling tool for memristor-based neuromorphic computing systems, to model performance and help researchers perform early-stage design space exploration. Compared with the former version and other benchmarks, MNSIM 2.0 has the following new features: 1. At the algorithm level, MNSIM 2.0 supports inference-accuracy simulation for mixed-precision NNs considering non-ideal factors. 2. At the architecture level, a hierarchical modeling structure for PIM systems is proposed, so users can customize their designs in terms of devices, interfaces, processing units, buffer designs, and interconnections. 3. Two hardware-aware algorithm optimization methods are integrated into MNSIM 2.0 to realize software-hardware co-optimization.
- Published
- 2020
25. DeepSniffer
- Author
- Lei Deng, Ling Liang, Pengfei Zuo, Shuangchen Li, Timothy Sherwood, Xing Hu, Xinfeng Xie, Yu Ji, Chang Liu, Yufei Ding, and Yuan Xie
- Subjects
- Network architecture, Memory hierarchy, Relation (database), Computer science, Context (language use), Machine learning, Robustness (computer science), Complete information, Bus sniffing, Leverage (statistics), Artificial intelligence
- Abstract
As deep neural networks (DNNs) continue their reach into a wide range of application domains, the neural architecture of DNN models becomes an increasingly sensitive subject, due to either intellectual-property protection or risks of adversarial attacks. Previous studies leverage architecture-level events exposed by hardware platforms to extract model architecture information, but they suffer from the following limitations: requiring a priori knowledge of victim models, lacking robustness and generality, or obtaining incomplete information about the victim model architecture. Our paper proposes DeepSniffer, a learning-based model-extraction framework that obtains the complete model architecture without any prior knowledge of the victim model. It is robust to architectural and system noise introduced by the complex memory hierarchy and diverse run-time system optimizations. The basic idea of DeepSniffer is to learn the relation between extracted architectural hints (e.g., volumes of memory reads/writes obtained by side-channel or bus-snooping attacks) and model internal architectures. Taking GPU platforms as a showcase, DeepSniffer conducts model extraction by learning both the architecture-level execution features of kernels and the inter-layer temporal association information introduced by the common practice of DNN design. We demonstrate that DeepSniffer works experimentally in the context of an off-the-shelf Nvidia GPU platform running a variety of DNN models. The extracted models are directly helpful for crafting adversarial inputs. Our experimental results show that DeepSniffer achieves high model-extraction accuracy and thus improves the adversarial attack success rate from 14.6%-25.5% (without network architecture knowledge) to 75.9% (with the extracted network architecture). The DeepSniffer project has been released on GitHub.
- Published
- 2020
26. Joint-attention Discriminator for Accurate Super-resolution via Adversarial Training
- Author
- Rong Chen, Yanyun Qu, Cuihua Li, Yuan Xie, and Xiaotong Luo
- Subjects
- Discriminator, Visual perception, Joint attention, Computer science, Pattern recognition, Superresolution, Weighting, Feature (computer vision), Benchmark (computing), Artificial intelligence
- Abstract
Tremendous progress has been witnessed in single-image super-resolution (SR), where existing deep SR models achieve impressive performance on objective criteria such as PSNR and SSIM. However, most SR methods are limited in visual perception; for example, their results look too smooth. Generative adversarial networks (GANs) surpass most deep SR models in SR visual effects but are poor on objective criteria. To trade off objective and subjective SR performance, we design a joint-attention discriminator with which a GAN improves SR performance in PSNR and SSIM while maintaining the visual effect, compared with non-attention GAN-based SR models. The joint-attention discriminator contains dense channel-wise attention and cross-layer attention blocks. The former is applied in the shallow layers of the discriminator for channel-wise weighted combination of feature maps; the latter selects feature maps in the middle and deep layers for effective discrimination. Extensive experiments on six benchmark datasets show that our proposed discriminator, combined with different generators, achieves more realistic visual results.
- Published
- 2019
27. Alleviating Irregularity in Graph Analytics Acceleration
- Author
- Xin Ma, Xiaochun Ye, Han Li, Lei Deng, Abanti Basak, Zhimin Zhang, Peng Gu, Dongrui Fan, Mingyu Yan, Shuangchen Li, Yuan Xie, Itir Akgun, Xing Hu, and Yujing Feng
- Subjects
- Speedup, Computer science, Memory bandwidth, Dynamic priority scheduling, Parallel computing, Microarchitecture, Data dependency, Datapath, Programming paradigm, General-purpose computing on graphics processing units
- Abstract
Graph analytics is an emerging application domain that extracts insights by processing large volumes of highly connected data, namely graphs. Parallel processing of graphs has been exploited at the algorithm level, which in turn imposes three irregularities on computing and memory patterns that significantly hinder efficient architecture design. Some of these irregularities can be partially tackled by prior domain-specific accelerator designs with well-designed scheduling of data accesses, while others remain unsolved. Unlike prior efforts, we fully alleviate these irregularities at their origin: data-dependent program behavior. To achieve this goal, we propose GraphDynS, a hardware/software co-design with a decoupled datapath and data-aware dynamic scheduling. Aware of data dependencies extracted from the decoupled datapath, GraphDynS can elaborately schedule the program on the fly to maximize parallelism. To extract data dependencies at runtime, we propose a new programming model in synergy with a microarchitecture design that supports datapath decoupling. Using this data-dependency information, we present several data-aware strategies to dynamically schedule workloads, data accesses, and computations. Overall, GraphDynS achieves 4.4× speedup and 11.6× less energy on average with half the memory bandwidth compared to a state-of-the-art GPGPU-based solution. Compared to a state-of-the-art graph analytics accelerator, GraphDynS also achieves 1.9× speedup and 1.8× less energy on average using the same memory bandwidth.
- Published
- 2019
28. Sparse Tensor Core
- Author
- Tao Zhang, Yuan Xie, Maohua Zhu, and Zhenyu Gu
- Subjects
- Contextual image classification, Artificial neural network, Computer science, Computation, Object detection, Overhead (computing), Multiplication, Tensor, Pruning (decision trees), Algorithm, Decoding methods, Computer hardware, Sparse matrix
- Abstract
Deep neural networks have become a compelling solution for applications such as image classification, object detection, speech recognition, and machine translation. However, this great success comes at the cost of excessive computation due to over-provisioned parameter spaces. To improve the computation efficiency of neural networks, many pruning techniques have been proposed to reduce the number of multiply-accumulate (MAC) operations, resulting in highly sparse networks. Unfortunately, sparse neural networks often run slower than their dense counterparts on modern GPUs due to poor device utilization. In particular, as sophisticated hardware primitives (e.g., Tensor Cores) have been deployed to boost the performance of dense matrix multiplication by an order of magnitude, the performance of sparse neural networks lags behind significantly. In this work, we propose an algorithm/hardware co-design methodology to accelerate sparse neural networks. A novel pruning algorithm is devised to improve workload balance and reduce the decoding overhead of sparse neural networks, while new instructions and microarchitecture optimizations are proposed in Tensor Core to adapt to structurally sparse neural networks. Our experimental results show that the pruning algorithm achieves 63% performance gain with model accuracy sustained, and the hardware optimization gives an additional 58% performance gain with negligible area overhead.
- Published
- 2019
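The structured sparsity that sparse-tensor-core designs exploit can be pictured as balanced pruning: within every group of m weights along a row, keep only the n of largest magnitude so each hardware fragment sees a predictable workload. The NumPy sketch below shows the pattern with hypothetical m=4, n=2; the paper's actual pruning algorithm and Tensor Core instructions differ in detail.

```python
import numpy as np

def prune_balanced(W, m=4, n=2):
    """Keep the n largest-magnitude weights in every group of m along each row.

    Illustrates the balanced-sparsity pattern only; m=4, n=2 are assumptions.
    """
    Wg = W.reshape(W.shape[0], -1, m)                     # rows x groups x m
    smallest = np.argsort(np.abs(Wg), axis=2)[:, :, :m - n]
    mask = np.ones_like(Wg, dtype=bool)
    np.put_along_axis(mask, smallest, False, axis=2)      # zero the m-n smallest
    return (Wg * mask).reshape(W.shape)

W = np.random.default_rng(3).normal(size=(4, 8))
print(prune_balanced(W))
```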
29. MEDAL
- Author
- Xing Hu, Shuangchen Li, Peng Gu, Yuan Xie, Wenqin Huangfu, and Xueqi Li
- Subjects
- Speedup, Computer science, Computational genomics, DIMM, DNA sequencing, Memory module, Scalability, Memory footprint, Algorithm, DNA
- Abstract
Computational genomics has proven its great potential to support precise and customized health care. However, with the wide adoption of Next-Generation Sequencing (NGS) technology, DNA alignment, a crucial step in computational genomics, is becoming more and more challenging due to booming bio-data. Consequently, various hardware approaches have been explored to accelerate DNA seeding, the core and most time-consuming step in DNA alignment. Most previous hardware approaches leverage multi-core CPUs, GPUs, and FPGAs to accelerate DNA seeding. However, DNA seeding is memory-bound, whereas these hardware approaches focus on computation. For this reason, Near-Data Processing (NDP) is a better solution for DNA seeding. Unfortunately, existing NDP accelerators for DNA seeding face two grand challenges: fine-grained random memory access and the scalability demanded by booming bio-data. To address these challenges, we propose MEDAL, a practical, energy-efficient, Dual-Inline Memory Module (DIMM)-based NDP accelerator for DNA seeding built from off-the-shelf DRAM components. For small databases that fit within a single DRAM rank, we propose an intra-rank design, together with an algorithm-specific address mapping, bandwidth-aware data mapping, and Individual Chip Select (ICS), to address the challenge of fine-grained random memory access, improving parallelism and bandwidth utilization. Furthermore, to tackle the scalability challenge for large databases, we propose three inter-rank designs (polling-based communication, interrupt-based communication, and a Non-Volatile DIMM (NVDIMM)-based solution). In addition, we propose an algorithm-specific data compression technique to reduce the memory footprint, free up space for data mapping, and reduce communication overhead. Experimental results show that, for the three proposed designs, MEDAL achieves on average 30.50x/8.37x/3.43x speedup and 289.91x/6.47x/2.89x energy reduction compared with a 16-thread CPU baseline and two state-of-the-art NDP accelerators, respectively.
- Published
- 2019
30. Efficient System Architecture in the Era of Monolithic 3D
- Author
- Wenqin Huangfu, Yuan Xie, Dylan Stow, Gabriel H. Loh, Itir Akgun, and Xueqi Li
- Subjects
- Flexibility (engineering), Interconnection, Computer science, Computation, Process (computing), Coupling (computer programming), Computer architecture, Parallel communication, Scalability, Systems architecture
- Abstract
Emerging Monolithic Three-Dimensional (M3D) integration technology will not only provide improved circuit density through high-bandwidth coupling of multiple vertically stacked layers, but will also provide new architectural opportunities for on-chip computation, memory, and communication beyond the capabilities of existing process and packaging technologies. For example, with massively parallel communication between heterogeneous memory and compute layers, existing processing-in-memory architectures can be optimized and expanded into efficient and flexible near-data processors. Additionally, multiple tiers of interconnect can be dynamically leveraged to provide an efficient, scalable interconnect fabric that spans the three-dimensional system. This work explores some of the challenges and opportunities presented by M3D technology for emerging computer architectures, with a focus on improving efficiency and increasing system flexibility.
- Published
- 2019
31. Memory-Bound Proof-of-Work Acceleration for Blockchain Applications
- Author
- Yu Wang, Kun Wu, Shuangchen Li, Xing Hu, Guohao Dai, Yuan Xie, and Xinfeng Xie
- Subjects
- Computer science, Design space exploration, Hash function, Memory bandwidth, CAS latency, Embedded system, Proof-of-work system, Memory architecture, Context switch
- Abstract
Blockchain applications have shown huge potential in various domains. Proof of Work (PoW) is the key procedure in blockchain applications; it exhibits memory-bound behavior that hinders the performance improvement of blockchain accelerators. To mitigate the "memory wall" and improve the performance of memory-hard PoW accelerators, using Ethash as an example, we optimize the memory architecture from two perspectives: 1) hiding memory latency, by proposing a specialized context-switch design to overcome the uncertain latency of repeated memory requests; and 2) increasing memory-bandwidth utilization, by introducing on-chip memory that stores a portion of the Ethash directed acyclic graph (DAG) for larger effective memory bandwidth, and further proposing embedded NOR flash to fulfill this role. We then conduct extensive experiments to explore the design space of our optimized memory architecture for Ethash, including the number of hash cores and the on-chip/off-chip memory technologies and specifications. Based on this design space exploration, we provide guidance for designing memory-bound PoW accelerators. The experimental results show that our optimized designs achieve 8.7%-55% higher hash rate and 17%-120% higher hash rate per joule compared with the baseline design across different configurations.
- Published
- 2019
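To see why Ethash-style PoW is memory-bound, note that each hashing round mixes in a word fetched from a large DAG at a pseudo-random index, so throughput tracks random-access memory bandwidth rather than raw hash speed. The toy Python model below captures only that structure; the DAG size, round count, hash function, and difficulty target are all stand-ins, not Ethash itself.

```python
import hashlib

# Toy stand-in for a memory-hard PoW: every round does one pseudo-random DAG
# lookup, so performance is bound by memory accesses rather than hashing.
DAG = [hashlib.sha256(i.to_bytes(4, "little")).digest() for i in range(1 << 16)]

def toy_pow(header: bytes, nonce: int, rounds: int = 64) -> bytes:
    mix = hashlib.sha256(header + nonce.to_bytes(8, "little")).digest()
    for _ in range(rounds):
        idx = int.from_bytes(mix[:4], "little") % len(DAG)  # random DAG access
        mix = hashlib.sha256(mix + DAG[idx]).digest()
    return mix

target = 2 ** 248          # easy toy difficulty: first byte must be zero
nonce = 0
while int.from_bytes(toy_pow(b"block-header", nonce), "big") >= target:
    nonce += 1
print("found nonce:", nonce)
```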
32. CNNWire
- Author
- Lei Deng, Xing Hu, Jilan Lin, Yuan Xie, and Shuangchen Li
- Subjects
- Boosting (machine learning), Speedup, Artificial neural network, Processing element, Computer science, Inference, Parallel computing, Convolutional neural network, Efficient energy use, Resistive random-access memory
- Abstract
Resistive random access memory (ReRAM) demonstrates great potential for in-memory processing in neural network (NN) acceleration. However, since convolutional neural networks (CNNs) are widely known to be compute-bound, current ReRAM-based accelerators cannot support CNNs efficiently. In this paper, we propose, for the first time, a CNN accelerator with Winograd convolution on ReRAM (CNNWire), which minimizes multiplications to enable fast and efficient CNN inference. We realize the convolution with Winograd Processing Elements (WPEs) based on convolutional tiles, and design the interconnections between WPEs to improve data reuse. Finally, we introduce the full mapping flow to implement Winograd convolution. The results show that CNNWire gains a 3.85x boost in energy efficiency and a 3.24x speedup on average across different CNN benchmarks, compared with traditional GEMM-based mapping.
- Published
- 2019
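Winograd convolution, which CNNWire maps onto ReRAM, trades multiplications for additions: F(2,3) produces two outputs of a 3-tap filter with four elementwise multiplies instead of six. The NumPy check below uses the standard F(2,3) transform matrices and verifies against direct convolution; it says nothing about the ReRAM mapping itself.

```python
import numpy as np

# Standard Winograd F(2,3) transform matrices.
B_T = np.array([[1, 0, -1, 0],
                [0, 1,  1, 0],
                [0, -1, 1, 0],
                [0, 1,  0, -1]], dtype=float)
G   = np.array([[1.0, 0.0, 0.0],
                [0.5, 0.5, 0.5],
                [0.5, -0.5, 0.5],
                [0.0, 0.0, 1.0]])
A_T = np.array([[1, 1, 1, 0],
                [0, 1, -1, -1]], dtype=float)

d = np.array([1.0, 2.0, 3.0, 4.0])   # input tile of 4 samples
g = np.array([0.5, -1.0, 2.0])       # 3-tap filter

y_winograd = A_T @ ((G @ g) * (B_T @ d))            # only 4 elementwise multiplies
y_direct = np.convolve(d, g[::-1], mode="valid")    # correlation, as in CNNs
assert np.allclose(y_winograd, y_direct)
print(y_winograd)                                   # [4.5 6.0]
```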
33. Learning the sparsity for ReRAM
- Author
- Yuan Xie, Yu Wang, Zhenhua Zhu, and Jilan Lin
- Subjects
- Speedup, Artificial neural network, Computer science, Resistive random-access memory, Pruning (decision trees), Crossbar switch, Quantization (image processing), Cluster analysis, Algorithm, Sparse matrix
- Abstract
With its in-memory processing ability, ReRAM-based computing is becoming increasingly attractive for accelerating neural networks (NNs). However, most ReRAM-based accelerators cannot support efficient mapping of sparse NNs and must map the whole dense matrix onto the ReRAM crossbar array to achieve O(1) computation complexity. In this paper, we propose a sparse NN mapping scheme based on element clustering to achieve better ReRAM crossbar utilization. Further, we propose a crossbar-grained pruning algorithm that removes crossbars with low utilization. Finally, since most current ReRAM devices cannot achieve high precision, we analyze the effect of quantization precision on sparse NNs, and propose to perform high-precision composing in the analog domain with the corresponding peripheral circuits. In our experiments, we discuss how the system performs with different crossbar sizes in order to choose an optimized design. Our results show that our mapping scheme for sparse NNs with the proposed pruning algorithm achieves 3-5X energy efficiency and more than 2.5-6X speedup compared with accelerators for dense NNs. The accuracy experiments also show that our pruning method incurs almost no accuracy loss.
- Published
- 2019
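Crossbar-grained pruning as described above amounts to tiling the weight matrix at crossbar granularity and skipping tiles whose nonzero utilization is too low to justify a physical crossbar. A NumPy sketch follows; the 64x64 tile size, 5% threshold, and synthetic sparsity pattern are illustrative assumptions, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(4)
W = rng.normal(size=(256, 256)) * (rng.random((256, 256)) < 0.1)  # ~90% sparse
W[:, 128:] *= rng.random((256, 128)) < 0.2   # make the right half far sparser
XB = 64                  # crossbar (tile) size, an assumption
THRESH = 0.05            # drop tiles whose nonzero utilization is below 5%

kept, dropped = [], 0
for r in range(0, W.shape[0], XB):
    for c in range(0, W.shape[1], XB):
        tile = W[r:r + XB, c:c + XB]
        util = np.count_nonzero(tile) / tile.size
        if util >= THRESH:
            kept.append((r, c, tile))   # map this tile onto a physical crossbar
        else:
            dropped += 1                # crossbar-grained pruning: skip the tile

print(f"kept {len(kept)} crossbars, pruned {dropped}")
```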
34. GraphIA
- Author
- Gushu Li, Yu Wang, Shuangchen Li, Guohao Dai, and Yuan Xie
- Subjects
- Interconnection, Speedup, Computer science, Computation, Parallel computing, Scheduling (computing), Memory bank, Data deduplication, DRAM, Electronic circuit
- Abstract
Graph processing is widely used in various domains, while processing large-scale graphs has always been memory-bound. In-situ processing is a promising solution to overcome the "memory wall" in such memory-intensive applications. Previous accelerator designs for graph processing focused only on integrating more computing units inside memories or using more memory layers, rather than exploiting the huge parallelism available in memory banks. In this paper, we present GraphIA, an in-situ accelerator for large-scale graph processing based on DRAM technology. GraphIA couples large-capacity memory and computing resources in DRAM by connecting multiple chips with computation circuits inside. GraphIA chips are organized in a scaling ring interconnection, which maximizes per-chip bandwidth with minimal connection overhead and scales to larger graphs by using more chips. Banks in DRAM are organized into heterogeneous edge and vertex banks cooperating with customized peripheral circuits. Data duplication and scheduling schemes for the heterogeneous banks are further introduced to overcome the performance loss caused by irregular local and remote memory accesses in our multi-chip ring structure, achieving 1.63X and 1.16X speedup, respectively. According to our extensive experiments, the GraphIA in-situ accelerator achieves 217X speedup over CPU-DRAM designs.
- Published
- 2018
35. RADAR
- Author
- Wenqin Huangfu, Xing Hu, Yuan Xie, and Shuangchen Li
- Subjects
- Biological data, Speedup, Computer science, Volume (computing), Genomics, Sequence alignment, DNA sequencing, Computational science, Gene duplication, Sequence alignment algorithm, Radar, Energy (signal processing), DNA
- Abstract
Next-Generation Sequencing (NGS) technology has become an indispensable tool for studying genomics, resulting in exponential growth of biological data. Booming data volumes demand significant computational resources and create challenges for sequence alignment, the most fundamental application in bioinformatics. Consequently, many researchers have exploited both software and hardware methods to accelerate the most widely used sequence alignment algorithm, the Basic Local Alignment Search Tool (BLAST). However, prior work suffers from moving huge DNA databases from storage to computational units, and such data movement is both time- and energy-consuming. Based on the observation that the bottlenecks of BLAST involve a large number of comparison operations, we propose RADAR, a 3D Resistive Random Access Memory (ReRAM)-based DNA alignment accelerator architecture that performs most computational operations locally, without moving DNA databases. To improve storage density for various lengths of DNA sequences without hurting performance, we propose a dense data-mapping scheme to handle DNA sequences efficiently and a Tail Bits Duplication (TBD) technique to enable fully parallel computation in RADAR. Experimental results show that RADAR achieves 5114x speedup and 386x energy reduction compared to a single CPU, and outperforms multi-core/FPGA/GPU-based accelerators by between 53x and 1896x in processing speed.
- Published
- 2018
36. SNrram
- Author
- Yuan Xie, Chi Hong, Yu Ji, Peiqi Wang, Dongsheng Wang, and Yongqiang Lyu
- Subjects
- Speedup, Artificial neural network, Computer science, Deep learning, Memory management, Computer engineering, Artificial intelligence, Pruning (decision trees), Crossbar switch, Field-programmable gate array, Sparse matrix
- Abstract
The sparsity in deep neural networks can be leveraged by methods such as pruning and compression to help deploy large-scale deep neural networks onto hardware platforms, such as GPUs or FPGAs, with better performance and power efficiency. For RRAM crossbar-based architectures, however, the study of efficient methods that exploit network sparsity is still at an early stage. In this study, we propose SNrram, an efficient sparse neural network computation architecture using RRAM that exploits sparsity in both weights and activations. SNrram stores nontrivial weights and organizes them to eliminate zero-valued multiplications for better resource utilization. Experimental results show that SNrram saves RRAM resources by 69.8%, reduces power consumption by 35.9%, and achieves a 2.49× speedup on popular deep learning benchmarks, compared to a state-of-the-art RRAM-based neural network accelerator.
- Published
- 2018
37. Bridge the Gap between Neural Networks and Neuromorphic Hardware with a Neural Network Compiler
- Author
- Youhui Zhang, Yu Ji, Wenguang Chen, and Yuan Xie
- Subjects
- Spiking neural network, Artificial neural network, Computer science, Computation, Chip, Transformation (function), Software, Neuromorphic engineering, Computer engineering, Compiler
- Abstract
Different from developing neural networks (NNs) for general-purpose processors, development for NN chips usually faces hardware-specific restrictions, such as limited precision of network signals and parameters, constrained computation scale, and limited types of non-linear functions. This paper proposes a general methodology to address these challenges. We decouple NN applications from the target hardware by introducing a compiler that can transform an existing trained, unrestricted NN into an equivalent network that meets the given hardware's constraints. We propose multiple techniques to make the transformation adaptable to different kinds of NN chips and reliable under strict hardware constraints. We have built such a software tool that supports both spiking neural networks (SNNs) and traditional artificial neural networks (ANNs), and have demonstrated its effectiveness with a fabricated neuromorphic chip and a processing-in-memory (PIM) design. Tests show that the inference error caused by this solution is insignificant and the transformation time is much shorter than the retraining time. We have also carried out parameter-sensitivity evaluations to explore the trade-offs between network error and resource utilization for different transformation strategies, which can provide insights for co-design optimization of neuromorphic hardware and software.
- Published
- 2018
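One of the hardware restrictions the abstract names is limited parameter precision, and the most basic transformation a compiler like this must perform is re-expressing trained weights on the chip's quantization grid. The sketch below shows only that elementary step, uniform symmetric quantization, and measures the induced error; the paper's compiler does considerably more than this.

```python
import numpy as np

def quantize_weights(W, bits=4):
    """Uniform symmetric quantization to a target weight precision (assumed 4-bit)."""
    levels = 2 ** (bits - 1) - 1          # e.g., 7 positive levels for 4 bits
    scale = np.abs(W).max() / levels
    return np.round(W / scale) * scale, scale

W = np.random.default_rng(5).normal(size=(16, 16)).astype(np.float32)
W_hw, scale = quantize_weights(W, bits=4)
print("mean abs transformation error:", float(np.abs(W - W_hw).mean()))
```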
38. NEOFog
- Author
- Yongpan Liu, Jinyang Li, Xueqing Li, Zhibo Wang, Yuan Xie, Tongda Wu, Mahmut Kandemir, Kaisheng Ma, Jack Sampson, and Vijaykrishnan Narayanan
- Subjects
- Computer science, Distributed computing, Load balancing (computing), Computer Graphics and Computer-Aided Design, Multiplexing, Fog computing, Systems architecture, Energy harvesting, Wireless sensor network, Software
- Abstract
Nonvolatile processors have emerged as a promising solution for energy-harvesting scenarios, among which Wireless Sensor Networks (WSNs) provide some of the most important applications. In a typical distributed sensing system, due to differences in location, energy-harvester angle, power source, and so on, different nodes may have different amounts of energy available for use. While prior approaches have examined these challenges, they have not done so in the context of the features offered by nonvolatile computing approaches, which disrupt certain foundational assumptions. We propose a new set of nonvolatility-exploiting optimizations and embody them in the NEOFog system architecture. We discuss shifts in the trade-offs in data and program distribution for nonvolatile processing-based WSNs, showing how nonvolatile processing and nonvolatile RF support alter the benefits of computation-centric and communication-centric approaches. We also propose a new algorithm, specific to nonvolatile sensing systems, for load balancing both computation and communication demands. Collectively, the NV-aware optimizations in NEOFog increase the ability to perform in-fog processing by 4.2X, rising to 8X if virtualized nodes are 3X multiplexed.
- Published
- 2018
39. Incidental computing on IoT nonvolatile processors
- Author
- Yongpan Liu, Yuan Xie, Mahmut Kandemir, Kaisheng Ma, Jinyang Li, Xueqing Li, Vijaykrishnan Narayanan, and Jack Sampson
- Subjects
- Focus (computing), Computer science, Distributed computing, Real-time computing, Volume (computing), Variety (cybernetics), Backup, System on a chip, Energy (signal processing)
- Abstract
Batteryless IoT devices powered by energy harvesting face a fundamental imbalance between the potential volume of collected data and the amount of energy available for processing that data locally. However, many such devices perform similar operations on each new input record, which provides opportunities for mining the potential information in buffered historical data, at potentially lower effort, while processing new data, rather than abandoning old inputs due to limited computational energy. We call this approach incidental computing, and highlight synergies between this approach and approximation techniques when deployed on a non-volatile processor (NVP) platform. In addition to incidental computations, the backup and restore operations in an incidental NVP provide approximation opportunities and optimizations that are unique to NVPs. We propose a variety of incidental approximation approaches suited to NVPs, with a focus on approximate backup and restore and on approximate recomputation in the face of power interruptions. We perform RTL-level evaluation for many frequently used workloads and show that these incidental techniques provide an average of 4.2X more forward progress than precise NVP execution.
- Published
- 2017
40. DRISA
- Author
- Yuan Xie, Hongzhong Zheng, Shuangchen Li, Krishna T. Malladi, Dimin Niu, and Bob Brennan
- Subjects
- Flat memory model, Speedup, Computer science, Registered memory, Overlay, Parallel computing, CAS latency, CUDA pinned memory, Universal memory, Interleaved memory, Computing with memory, Static random-access memory, Massively parallel, Computer memory, Sense amplifier, Uniform memory access, Semiconductor memory, Memory controller, Extended memory, Physical address, Memory rank, DRAM
- Abstract
Data movement between the processing units and the memory in the traditional von Neumann architecture is creating the "memory wall" problem. To bridge the gap, two approaches have been studied: the memory-rich processor (more on-chip memory) and the compute-capable memory (processing-in-memory). However, the first has strong computing capability but limited memory capacity/bandwidth, whereas the second is exactly the opposite. To address this challenge, we propose DRISA, a DRAM-based Reconfigurable In-Situ Accelerator architecture, to provide both powerful computing capability and large memory capacity/bandwidth. DRISA is primarily composed of DRAM memory arrays in which every memory bitline can perform bitwise Boolean logic operations (such as NOR). DRISA can be reconfigured to compute various functions by combining the functionally complete Boolean logic operations with the proposed hierarchical internal data-movement designs. We further optimize DRISA for high performance by simultaneously activating multiple rows and subarrays to provide massive parallelism, unblocking internal data-movement bottlenecks, and optimizing activation latency and energy. We explore four design options and present a comprehensive case study demonstrating significant acceleration of convolutional neural networks. The experimental results show that DRISA can achieve 8.8× speedup and 1.2× better energy efficiency compared with ASICs, and 7.7× speedup and 15× better energy efficiency over GPUs with integer operations.
- Published
- 2017
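DRISA's claim that bitline NOR suffices rests on NOR being functionally complete: NOT, OR, AND, XOR, and from them arithmetic, can all be composed from it. The Python sketch below builds an 8-bit adder from NOR-derived gates to make that concrete; it models the logic only, not the DRAM row-activation mechanics.

```python
# Bitline NOR is functionally complete: every gate below is built from it,
# mirroring how a NOR-capable DRAM array could compose richer functions.
def NOR(a, b): return ~(a | b) & 1

def NOT(a):    return NOR(a, a)
def OR(a, b):  return NOT(NOR(a, b))
def AND(a, b): return NOR(NOT(a), NOT(b))
def XOR(a, b): return AND(OR(a, b), NOT(AND(a, b)))

def add8(x, y):
    """8-bit ripple-carry adder (LSB first) using only NOR-composed gates."""
    bits_x = [(x >> i) & 1 for i in range(8)]
    bits_y = [(y >> i) & 1 for i in range(8)]
    carry, out = 0, 0
    for i in range(8):
        s = XOR(XOR(bits_x[i], bits_y[i]), carry)
        carry = OR(AND(bits_x[i], bits_y[i]), AND(carry, XOR(bits_x[i], bits_y[i])))
        out |= s << i
    return out

assert add8(77, 58) == 135
```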
41. There and Back Again
- Author
- Itir Akgun, Gabriel H. Loh, Jieming Yin, Onur Kayiran, Yuan Xie, and Matthew Poremba
- Subjects
- Flat memory model, Computer science, Registered memory, Interleaved memory, Computing with memory, Random access memory, Distributed shared memory, Cache-only memory architecture, Uniform memory access, Semiconductor memory, Supercomputer, Extended memory, Non-volatile memory, Physical address, Memory management, Shared memory, Embedded system, Distributed memory, DRAM, Computer network
- Abstract
High-performance computing, enterprise, and datacenter servers are driving demands for higher total memory capacity as well as memory performance. Memory "cubes" with high per-package capacity (from 3D integration), along with high-speed point-to-point interconnects, provide a scalable memory-system architecture with the potential to deliver both capacity and performance. Multiple such cubes connected together can form a "memory network" (MN), but the design space for such MNs is quite vast, including multiple topology types and multiple memory technologies per memory cube. In this work, we first analyze several MN topologies with different mixes of memory-package technologies to understand the key trade-offs and bottlenecks for such systems. We find that most of an MN's performance challenges arise from the interconnection network that binds the memory cubes together. In particular, the arbitration schemes used to route through MNs, the ratio of NVM to DRAM, and the specific topologies used have a dramatic impact on performance and energy results. Our initial analysis indicates that introducing non-volatile memory to the MN presents a unique trade-off between memory-array latency and network latency. We observe that placing NVM cubes in a specific order in the MN improves performance by reducing the network size/diameter, up to a certain NVM-to-DRAM ratio. Novel MN topologies and arbitration schemes also provide performance and energy deltas by reducing the hop count of requests and responses in the MN. Based on our analyses, we introduce three techniques to address MN latency issues: (1) a distance-based arbitration scheme that improves queuing latencies throughout the network, (2) a skip-list topology, derived from the classic data structure, that improves network latency and link usage, and (3) the MetaCube, a denser memory cube that leverages advanced packaging technologies to improve latency by reducing MN size.
- Published
- 2017
42. TIME
- Author
-
Yu Wang, Lixue Xia, Yuan Xie, Ming Cheng, Yi Cai, Zhenhua Zhu, and Huazhong Yang
- Subjects
Random access memory, Artificial neural network, Computer science, Supervised learning, Memristor, Backpropagation, Resistive random-access memory, Application-specific integrated circuit, Memory architecture, Electronic engineering - Abstract
The training of neural networks (NNs) is usually time-consuming and resource-intensive. The memristor has shown its potential for NN computation. In particular, for metal-oxide resistive random access memory (RRAM), the crossbar structure and multi-bit capability can perform the matrix-vector product, the most common operation in NNs, with high precision. However, two challenges remain in realizing NN training. First, the current architecture can only support the inference phase of training and cannot perform backpropagation (BP) and the weight update of the NN. Second, NN training requires an enormous number of iterations and constantly updates the weights to reach convergence, which leads to large energy consumption because of the many write and read operations. In this work, we propose a novel architecture, TIME, and peripheral circuit designs to enable the training of NNs in RRAM. TIME supports BP and the weight update while maximizing the reuse of peripheral circuits for inference operations on RRAM. Meanwhile, a variability-free tuning scheme and gradual-write circuits are designed to reduce the cost of tuning RRAM. We explore the performance of both SL (supervised learning) and DRL (deep reinforcement learning) in TIME, and a specific mapping method for DRL is also introduced to further improve energy efficiency. Experimental results show that, in SL, TIME achieves 5.3× higher energy efficiency on average compared with the most powerful application-specific integrated circuits (ASICs) in the literature. In DRL, TIME achieves on average 126× higher energy efficiency than a GPU. If the cost of tuning RRAM can be further reduced, TIME has the potential to boost energy efficiency by two orders of magnitude compared with ASICs.
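To make the division of labor concrete, here is a toy functional model (ours, not TIME's circuits) of the three operations the crossbar must support for training: the forward matrix-vector product, the transpose product for BP, and an in-place conductance update:

```python
import numpy as np

# Functional sketch of an RRAM crossbar used for training. The crossbar stores
# the weight matrix as conductances; conductance range is bounded, which the
# update step must respect. Values and learning rate are illustrative.

rng = np.random.default_rng(0)
W = rng.uniform(0.0, 1.0, size=(4, 8))    # conductances (arbitrary units)

def crossbar_forward(W, x):
    return W @ x                           # bitline current summation

def crossbar_backward(W, delta):
    return W.T @ delta                     # drive columns, sense rows (for BP)

def crossbar_update(W, delta, x, lr=0.1):
    W += lr * np.outer(delta, x)           # gradual write pulses, cell by cell
    np.clip(W, 0.0, 1.0, out=W)            # conductance cannot leave its range

x = rng.uniform(size=8)
delta = rng.uniform(size=4)
y = crossbar_forward(W, x)
crossbar_update(W, delta, x)
```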
- Published
- 2017
43. Security Threats and Countermeasures in Three-Dimensional Integrated Circuits
- Author
-
Jaya Dofe, Qiaoyan Yu, Peng Gu, Yuan Xie, Eren Kursun, and Dylan Stow
- Subjects
Engineering, Hardware security module, Supply chain, SIGNAL (programming language), Computer security compromised by hardware failure, Integrated circuit, Computer security, Attack model, Hardware Trojan, Embedded system, Side channel attack - Abstract
Existing works on three-dimensional (3D) hardware security focus on leveraging the unique 3D characteristics to address the supply chain attacks that exist in 2D design. However, 3D ICs introduce specific and unexplored challenges, as well as new opportunities, for managing hardware security. In this paper, we analyze new security threats unique to 3D ICs. The corresponding attack models are summarized for future research. Furthermore, existing representative countermeasures, including split manufacturing, camouflaging, transistor locking, techniques against thermal-signal-based side-channel attacks, and the network-on-chip based shielding plane (NoCSIP) for different hardware threats, are reviewed and categorized. Moreover, preliminary countermeasures are proposed to thwart TSV-based hardware Trojan insertion attacks.
- Published
- 2017
44. ODESY
- Author
-
Linuo Xue, Peiyuan Wang, Yuan Xie, Yuanqing Cheng, and Jianlei Yang
- Subjects
CPU cache, Computer science, Energy consumption, Cell design, Memory cell, Embedded system, Scalability, Area density, Latency (engineering), Computer hardware - Abstract
STT-RAM (spin-transfer torque magnetic RAM) technology is a promising candidate for cache memory because of its high density, low standby power, and non-volatility. As technology scales, especially below the 40nm technology node, read disturbance becomes severe because the read current closely approaches the switching current. In addition, the read latency and access performance degrade significantly as well. The conventional 1T-1MTJ and 2T-2MTJ cell designs cannot address these challenges efficiently. In this paper, we propose a novel 3T-3MTJ cell structure using advanced perpendicular MTJ technology. This memory cell has higher storage density and better performance, and is particularly suitable for deeply scaled technology nodes. A two-stage sensing scheme is also proposed to facilitate the read operation of the 3T-3MTJ cell design. Circuit-level and architecture-level simulations show that the proposed 3T-3MTJ cell structure achieves a better tradeoff among storage density, access performance, energy consumption, and reliability compared to the prior 1T-1MTJ and 2T-2MTJ cell structures.
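A back-of-the-envelope sketch of the read-disturbance argument, with purely illustrative currents (not from the paper):

```python
# Illustrative numbers only: as STT-RAM scales, the cell switching current
# drops faster than the minimum usable read current, so the read/switch
# current ratio climbs toward 1 and an ordinary read can flip the cell.

nodes_nm    = [90, 65, 45, 40, 28]
i_switch_uA = [150, 100, 60, 50, 35]   # hypothetical switching currents
i_read_uA   = [40, 35, 30, 28, 25]     # hypothetical read currents

for node, isw, ird in zip(nodes_nm, i_switch_uA, i_read_uA):
    print(f"{node}nm: read/switch ratio = {ird / isw:.2f}")
# A smaller ratio means a safer read; around 40nm and below the margin collapses,
# motivating cell structures (like 3T-3MTJ) that restore sensing margin.
```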
- Published
- 2016
45. NVSim-CAM
- Author
-
Peng Gu, Liu Liu, Yuan Xie, Shuangchen Li, and Cong Xu
- Subjects
Magnetoresistive random-access memory, Computer science, Content-addressable memory, Resistive random-access memory, Non-volatile memory, Embedded system, Scalability, Static random-access memory - Abstract
Ternary content-addressable memory (TCAM) is widely used in network routers, fully associative caches, search engines, etc. While conventional SRAM-based TCAM suffers from poor scalability, the emerging nonvolatile memories (NVMs, e.g., MRAM, PCM, and ReRAM) are driving an evolution in TCAM design: they effectively reduce the cell size and bring significant energy reduction and scalability improvement. New applications such as associative processors/accelerators are facilitated by the emergence of nonvolatile TCAM (nvTCAM). However, nvTCAM design is challenging. In addition to the emerging devices' uncertainty, the nvTCAM cell structure is so diverse that it results in a design space too large to explore manually. To tackle these challenges, we propose a circuit-level model and develop a simulation tool, NVSim-CAM, which helps researchers make early design decisions and evaluate device/circuit innovations. The tool is validated against HSPICE simulations and data from fabricated chips. We also present a case study to illustrate how NVSim-CAM benefits nvTCAM design. In the case study, we propose a novel 3D vertical ReRAM-based TCAM cell, the 3DvTCAM. We project its advantages/disadvantages and explore the design space for the proposed cell with NVSim-CAM.
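The sketch below illustrates the kind of design-space sweep such a tool enables; the `estimate` cost function and all its parameters are hypothetical stand-ins, since the real numbers come from NVSim-CAM's circuit-level modeling rather than a closed-form formula:

```python
from itertools import product

# Hypothetical driver for an nvTCAM design-space exploration. The cost model
# here is a placeholder; in practice NVSim-CAM derives latency/energy from
# validated circuit-level analysis of the chosen device and cell structure.

def estimate(device, cells_per_matchline, banks):
    """Stand-in cost model returning (search latency, search energy)."""
    base = {"MRAM": 2.0, "PCM": 5.0, "ReRAM": 3.0}[device]
    latency = base + 0.5 * cells_per_matchline / banks
    energy = 0.1 * cells_per_matchline * banks
    return latency, energy

designs = []
for device, cells, banks in product(["MRAM", "PCM", "ReRAM"],
                                    [64, 128, 256], [1, 2, 4]):
    lat, en = estimate(device, cells, banks)
    designs.append((lat * en, device, cells, banks))  # energy-delay product

print("Best EDP design:", min(designs))
```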
- Published
- 2016
46. Cost analysis and cost-driven IP reuse methodology for SoC design based on 2.5D/3D integration
- Author
-
Itir Akgun, Yuan Xie, Peng Gu, Dylan Stow, and Russell Barnes
- Subjects
Computer science, Process (engineering), Scale (chemistry), Transistor, Reuse, Reliability engineering, Reduction (complexity), Product (business), Logic gate, Embedded system, Cost analysis - Abstract
Due to the increasing fabrication and design complexity at new process nodes, the cost-per-transistor trend originally identified in Moore's Law is slowing when traditional integration methods are used. However, emerging die-level integration technologies may be viable alternatives that can scale the number of transistors per integrated device while reducing the cost per transistor through yield improvements across multiple smaller dies. Additionally, the escalating overheads of non-recurring engineering costs, such as masks and verification, can be curtailed through die integration-enabled reuse of intellectual property across heterogeneous process technologies. In this paper, we present an analytical cost model for 3D and interposer-based 2.5D die integration and employ it to demonstrate the potential cost reductions across semiconductor markets. We also propose a methodology and platform for IP reuse based on these integration technologies, and explore the available reductions in overall product cost through reduced non-recurring engineering effort.
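The yield side of this argument can be sketched with the classic negative-binomial die-yield model (the paper's full cost model also accounts for assembly/bonding yield, interposer cost, and NRE; the defect-density and wafer-cost parameters below are illustrative):

```python
# Why splitting one large die into several smaller dies can cut cost:
# yield falls super-linearly with die area, so many small dies waste
# far less silicon per good die than one monolithic die.

def die_yield(area_mm2, d0_per_mm2=0.002, alpha=3.0):
    """Negative-binomial yield model with defect density d0 and clustering alpha."""
    return (1 + area_mm2 * d0_per_mm2 / alpha) ** (-alpha)

def cost_per_good_die(area_mm2, cost_per_mm2=0.1):
    return area_mm2 * cost_per_mm2 / die_yield(area_mm2)

monolithic = cost_per_good_die(600)        # one 600 mm^2 die
chiplets = 4 * cost_per_good_die(150)      # four 150 mm^2 dies, pre-assembly
print(f"monolithic: {monolithic:.1f}, chiplets: {chiplets:.1f}")
# ~165 vs ~80 (arbitrary units): the chiplet option wins before assembly
# costs, which the integration cost model must then add back in.
```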
- Published
- 2016
47. Neural network transformation under hardware constraints
- Author
-
Youhui Zhang, Wenguang Chen, Yuan Xie, and Yu Ji
- Subjects
Floating point, Artificial neural network, Computer science, Backpropagation, Winner-take-all, Range (mathematics), Transformation (function), Software, Neuromorphic engineering, Computer hardware - Abstract
There are a number of mature ways to train various kinds of ANNs (artificial neural networks), including the BP (backpropagation) algorithm. These training procedures are usually carried out on GPU-enabled machines; 16-/32-bit floating-point numbers are used as the NN parameters, without any limitation on the maximum fan-in/fan-out of a single neuron or on the type of activation functions. In contrast, neuromorphic chips [1][2][3] impose quite a few hardware-specific constraints: the limited fan-in/fan-out of a single neuron, the limited range of synaptic weights, and hardware neuron or activation-function types that are usually simpler than their software counterparts. These constraints make programming such chips difficult.
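As a hedged sketch of two such transformations (the specifics are ours, not the paper's algorithm), the code below clips and quantizes weights into a limited synaptic range, and splits an over-limit fan-in into hardware-sized chunks whose partial sums a subsequent "adder" neuron layer would combine:

```python
import numpy as np

# Toy transformations from a software-trained NN toward hardware constraints.
# MAX_FANIN and W_LEVELS are hypothetical chip parameters.

MAX_FANIN = 256
W_LEVELS = 16                                    # e.g., 4-bit signed synapses

def quantize(w, w_max=1.0):
    """Clip weights to [-w_max, w_max] and snap them to W_LEVELS discrete levels."""
    half = W_LEVELS // 2
    return np.round(np.clip(w, -w_max, w_max) / w_max * half) / half * w_max

def split_fanin(weights):
    """Split one neuron's inputs into chunks that each satisfy the fan-in limit;
    an extra layer of 'adder' neurons then combines the partial sums."""
    return [weights[i:i + MAX_FANIN] for i in range(0, len(weights), MAX_FANIN)]

w = np.random.default_rng(1).normal(scale=0.5, size=1000)
chunks = split_fanin(quantize(w))
assert all(len(c) <= MAX_FANIN for c in chunks) and len(chunks) == 4
```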
- Published
- 2016
48. NVSim-VXs
- Author
-
Ismail Bayram, Yuan Xie, Enes Eken, Cong Xu, Yi Chen, Wujie Wen, and Linghao Song
- Subjects
Random access memory, Computer science, Semiconductor memory, Energy consumption, Resistive random-access memory, Non-volatile memory, Embedded system, Electronic engineering, Cache, Computer memory - Abstract
Spin-transfer torque random access memory (STT-RAM) has recently received significant attention for its promising characteristics in cache and memory applications. As an early-stage modeling tool, NVSim has been widely adopted for simulations of emerging nonvolatile memory technologies in computer architecture research, including STT-RAM, ReRAM, PCM, etc. In this work, we introduce a new member of the NVSim family, NVSim-VXs, which enables statistical simulation of STT-RAM write performance, errors, and energy consumption. This enhanced model takes into account the impact of parametric variability in CMOS and MTJ devices as well as the chip operating temperature. It is also calibrated with Monte Carlo simulations based on macro-magnetic and SPICE models, covering five technology nodes between 22nm and 90nm. NVSim-VXs strongly supports the fast-growing needs of STT-RAM research on reliability analysis and enhancement, marking the next important stage of NVSim development.
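A toy version of such a statistical write model might look as follows; the lognormal switching-time distribution and its parameters are illustrative assumptions of ours, not NVSim-VXs's calibrated models:

```python
import numpy as np

# STT-RAM switching is stochastic: for a fixed write pulse, some cells fail
# to flip. Monte Carlo sampling over device variation estimates the write
# error rate (WER) as a function of pulse width.

rng = np.random.default_rng(42)

def write_error_rate(pulse_ns, median_ns=8.0, sigma=0.35, trials=100_000):
    # Lognormal stand-in for the switching-time distribution under
    # CMOS/MTJ parameter variation and thermal fluctuation.
    switch_time = rng.lognormal(mean=np.log(median_ns), sigma=sigma, size=trials)
    return np.mean(switch_time > pulse_ns)   # cell fails if pulse ends too soon

for pulse in (8, 10, 12, 15):
    print(f"{pulse} ns pulse -> WER ~ {write_error_rate(pulse):.4f}")
```

Longer pulses drive the WER down at the cost of write latency and energy, which is exactly the tradeoff such a model lets an architect quantify.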
- Published
- 2016
49. Pinatubo
- Author
-
Yu Lu, Yuan Xie, Qiaosha Zou, Shuangchen Li, Cong Xu, and Jishen Zhao
- Subjects
Flat memory model, Speedup, Computer science, Registered memory, Parallel computing, Memory architecture, Interleaved memory, Computing with Memory, Static random-access memory, Memory refresh, Massively parallel, Conventional memory, Computer memory, Random access memory, Sense amplifier, Cache-only memory architecture, Uniform memory access, Semiconductor memory, Memory map, Non-volatile memory, Physical address, Memory management, Shared memory, Embedded system, Computer data storage, Non-volatile random-access memory, DRAM memory - Abstract
Processing-in-memory (PIM) provides high bandwidth, massive parallelism, and high energy efficiency by implementing computations in main memory, thereby eliminating the overhead of data movement between the CPU and memory. While most recent work has focused on PIM in DRAM with 3D die-stacking technology, we propose to leverage the unique features of emerging non-volatile memory (NVM), such as resistance-based storage and current sensing, to enable efficient PIM design in NVM. We propose Pinatubo, a Processing In Non-volatile memory ArchiTecture for bUlk Bitwise Operations. Instead of integrating complex logic inside the cost-sensitive memory, Pinatubo redesigns the read circuitry so that it can compute the bitwise logic of two or more memory rows very efficiently and support one-step multi-row operations. Experimental results on data-intensive graph processing and database applications show that Pinatubo achieves a ∼500× speedup and ∼28,000× energy saving on bitwise operations, and a 1.12× overall speedup and 1.11× overall energy saving over a conventional processor.
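Functionally, activating multiple resistive rows with an adjusted sense-amplifier reference computes a bulk bitwise operation in a single read. The sketch below is a software analogy of those semantics, not the circuit itself:

```python
import numpy as np

# Semantics of Pinatubo-style multi-row reads, emulated in software. In the
# hardware, the bitline sees the parallel combination of all activated cells,
# and the sense-amplifier reference decides whether that reads as OR or AND.

def multi_row_or(rows):
    # Bitline conducts if ANY activated cell stores '1' (low resistance).
    return np.bitwise_or.reduce(rows)

def multi_row_and(rows):
    # With a shifted reference, the bitline reads '1' only if ALL cells are '1'.
    return np.bitwise_and.reduce(rows)

rows = np.array([[1, 0, 1, 1],
                 [0, 0, 1, 1],
                 [1, 0, 0, 1]], dtype=np.uint8)
print(multi_row_or(rows))   # [1 0 1 1] -- one "read" replaces two row fetches + ALU op
print(multi_row_and(rows))  # [0 0 0 1]
```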
- Published
- 2016
50. Fine-granularity tile-level parallelism in non-volatile memory architecture with two-dimensional bank subdivision
- Author
-
Yuan Xie, Matthew Poremba, and Tao Zhang
- Subjects
Flat memory model, Computer science, Registered memory, Parallel computing, Memory architecture, Interleaved memory, Memory refresh, Computer memory, Conventional memory, Random access memory, Sense amplifier, Cache-only memory architecture, Uniform memory access, Semiconductor memory, Memory map, Resistive random-access memory, Extended memory, Non-volatile memory, Memory management, Memory bank, Shared memory, Computer architecture, Non-volatile random-access memory, Memory rank, DRAM, DRAM memory - Abstract
Emerging memory technologies such as phase-change memory (PCM) and resistive RAM (RRAM) have been proposed as promising candidates for future DRAM replacement. Due to the nature of how these memories operate, unique properties (such as non-destructive reads and current sensing) can be exploited to further subdivide memory and provide increased parallelism with negligible overhead. In this work, we leverage these properties to design a fine-grained non-volatile memory (FgNVM), featuring two-dimensional bank subdivision for tile-level parallelism (TLP) in an NVM memory bank, with much finer granularity and increased parallelism compared to the one-dimensional bank subdivision for subarray-level parallelism (SALP) in a DRAM memory bank. With this new tile-level parallelism, three new memory access modes are proposed for further performance improvement and energy reduction: Partial-Activation, Multi-Activation, and Background Writes. Our experimental results show that the new architecture is highly effective in boosting non-volatile memory performance with significant energy reduction. To the best of our knowledge, this is the first work to study fine-granularity memory access in emerging non-volatile memory architectures.
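One way to picture two-dimensional bank subdivision is as an address-decoding change. In the hypothetical mapping below (the field widths are our assumption, not the paper's), addresses that differ in the tile bits map to different tiles and can therefore be activated concurrently:

```python
# Sketch of 2D bank subdivision: a physical address decodes to a
# (tile-row, tile-column) pair, so a bank exposes TILE_ROWS * TILE_COLS
# independent tiles rather than the single-dimension subarrays of DRAM SALP.

TILE_ROWS, TILE_COLS = 8, 8   # 64 tiles per bank vs. 8 subarrays under SALP

def tile_of(addr, row_bits=3, col_bits=3, offset_bits=12):
    """Decode hypothetical tile-row/tile-column fields from an address."""
    tile_col = (addr >> offset_bits) & ((1 << col_bits) - 1)
    tile_row = (addr >> (offset_bits + col_bits)) & ((1 << row_bits) - 1)
    return tile_row, tile_col

a, b = 0x0041_2000, 0x0041_3000   # differ only in the tile-column field
assert tile_of(a) != tile_of(b)   # -> the two accesses can proceed in parallel
```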
- Published
- 2016