18 results for "Jidong Zhai"
Search Results
2. UniQ: A Unified Programming Model for Efficient Quantum Circuit Simulation
- Author
- Chen Zhang, Haojie Wang, Zixuan Ma, Lei Xie, Zeyu Song, and Jidong Zhai
- Published
- 2022
3. Mitigating Crosstalk in Quantum Computers through Commutativity-Based Instruction Reordering
- Author
- Lei Xie, Jidong Zhai, and Weimin Zheng
- Subjects
Crosstalk, Quantum gate, Quantum decoherence, Computer engineering, Computer science, Logic gate, Fidelity, Electronic design automation, Quantum, Quantum computer - Abstract
Crosstalk is one of the major types of noise in quantum computers. To design high-fidelity quantum gates and large-scale quantum computers, effectively suppressing crosstalk is becoming increasingly important. Previous approaches to mitigating crosstalk rely on either hardware strategies, which are only applicable on limited platforms, or software techniques, which cannot fully exploit instruction parallelism. To address these challenges, we propose a commutativity-based compile-time instruction reordering approach. We first propose a complete set of generalized commutativity rules for the main types of quantum gates, and then we design a two-level bidirectional reordering algorithm that reorders gates according to these rules for effective crosstalk mitigation. Our approach not only captures both forward and backward instruction correlations but also reduces the effect of decoherence. We evaluate our approach with 117 quantum programs. Compared with the state of the art, our approach improves program fidelity by up to 120% (18% on average).
- Published
- 2021
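The reordering idea in the abstract above lends itself to a small illustration. The following is a minimal, hypothetical Python sketch (not the authors' implementation): it treats two gates as commuting when they act on disjoint qubits or are both diagonal, and uses that rule to move a gate earlier in an instruction list. The paper's generalized commutativity rules and two-level bidirectional algorithm go well beyond this.

```python
# Toy sketch of commutativity-based gate reordering (illustrative only).

from dataclasses import dataclass

@dataclass
class Gate:
    name: str
    qubits: tuple           # qubits the gate acts on
    diagonal: bool = False  # e.g. RZ, CZ are diagonal in the computational basis

def commute(a: Gate, b: Gate) -> bool:
    """A simple sufficient condition: disjoint qubits, or both diagonal."""
    if not set(a.qubits) & set(b.qubits):
        return True
    return a.diagonal and b.diagonal

def try_move_earlier(circuit, idx, target):
    """Move circuit[idx] to position `target` if it commutes with everything in between."""
    if all(commute(circuit[idx], circuit[j]) for j in range(target, idx)):
        gate = circuit.pop(idx)
        circuit.insert(target, gate)
        return True
    return False

if __name__ == "__main__":
    circ = [
        Gate("CZ", (0, 1), diagonal=True),
        Gate("RZ", (1,), diagonal=True),
        Gate("H",  (2,)),
    ]
    # The H on qubit 2 shares no qubits with the other gates, so it can be
    # scheduled first, e.g. to avoid firing it simultaneously with a
    # crosstalk-prone neighbor.
    try_move_earlier(circ, 2, 0)
    print([g.name for g in circ])   # ['H', 'CZ', 'RZ']
```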
4. Accelerating GPU Message Communication for Autonomous Navigation Systems
- Author
- Jiangming Jin, Hao Wu, Jidong Zhai, Yifan Gong, and Liu Wei
- Subjects
Data processing, Software, Speedup, Computer science, Autonomous Navigation System, Distributed computing, Memory pool, Throughput, Latency (engineering), Object detection - Abstract
Autonomous navigation systems consist of multiple software modules, such as sensing, object detection, and planning, to achieve traffic perception and fast decision making. Such a system generates a large amount of data and requires real-time data processing and communication. Although accelerators such as GPUs have been exploited to speed up data processing, communicating GPU messages between modules still lacks support, leading to high communication latency and resource contention. For such a latency-sensitive and resource-limited autonomous navigation system, high-performance and lightweight message communication is crucial. To obtain both high performance and low resource usage, we first propose a novel pub-centric memory pool and an on-the-fly offset conversion algorithm to avoid unnecessary data movement. Second, we combine these two techniques into an efficient message communication mechanism on a single GPU. Finally, we extend this approach to multiple GPUs and design a framework that natively supports GPU message communication for inter-process communication. A comprehensive evaluation shows that our approach reduces communication latency by 53.7% for PointCloud and Image messages compared to the state-of-the-art approach. Moreover, in a real autonomous navigation scenario, our approach reduces end-to-end latency by 29.2% and decreases resource usage by up to 58.9%.
- Published
- 2021
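To make the offset-conversion idea concrete, here is a hedged CPU-side Python analogue (the class and method names are assumptions, not the paper's API): messages reference their payload by an offset into a shared pool, so a consumer that maps the same pool can resolve the data without extra copies or pointer fix-ups.

```python
# Illustrative sketch (assumption, not the paper's actual interface) of the
# offset-conversion idea behind a pub-centric memory pool.

class MessagePool:
    def __init__(self, size):
        self.buf = bytearray(size)   # stands in for a pre-registered GPU memory pool
        self.head = 0

    def publish(self, payload: bytes) -> tuple:
        """Copy the payload into the pool once; return (offset, length) as the message."""
        off = self.head
        self.buf[off:off + len(payload)] = payload
        self.head += len(payload)
        return off, len(payload)

    def resolve(self, msg: tuple) -> bytes:
        """Consumer side: convert the offset back to data, with no pointer fix-up."""
        off, length = msg
        return bytes(self.buf[off:off + length])

if __name__ == "__main__":
    pool = MessagePool(1 << 20)
    msg = pool.publish(b"point-cloud frame 42")
    print(pool.resolve(msg))
```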
5. G-TADOC: Enabling Efficient GPU-Based Text Analytics without Decompression
- Author
- Onur Mutlu, Xiaoyong Du, Xipeng Shen, Zaifeng Pan, Jidong Zhai, Yanliang Zhou, and Feng Zhang
- Subjects
Lossless compression, Speedup, Computer science, Big data, Memory pool, Databases (cs.DB), Parallel computing, Data structure, Instruction set, Hardware Architecture (cs.AR), Synchronization (computer science), Massively parallel - Abstract
Text analytics directly on compression (TADOC) has proven to be a promising technology for big data analytics. GPUs are extremely popular accelerators for data analytics systems. Unfortunately, no prior work shows how to utilize GPUs to accelerate TADOC. We describe G-TADOC, the first framework that provides GPU-based text analytics directly on compression, effectively enabling efficient text analytics on GPUs without decompressing the input data. G-TADOC solves three major challenges. First, TADOC involves a large amount of dependencies, which makes it difficult to exploit massive parallelism on a GPU. We develop a novel fine-grained thread-level workload scheduling strategy for GPU threads, which partitions heavily dependent loads adaptively in a fine-grained manner. Second, in developing G-TADOC, thousands of GPU threads writing to the same result buffer leads to inconsistency, while directly using locks and atomic operations leads to large synchronization overhead. We develop a memory pool with thread-safe data structures on GPUs to handle such difficulties. Third, maintaining the sequence information among words is essential for lossless compression. We design a sequence-support strategy, which maintains high GPU parallelism while preserving sequence information. Our experimental evaluations show that G-TADOC provides a 31.1x average speedup compared to state-of-the-art TADOC. (Published at the 37th IEEE International Conference on Data Engineering, ICDE 2021.)
- Published
- 2021
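As background for the abstract above, TADOC represents text as grammar rules, and analytics can run on the rules directly. The sketch below is a sequential Python toy of word counting on a Sequitur-style grammar without decompression; G-TADOC's GPU scheduling, memory pool, and sequence-support strategies are not reproduced here.

```python
# Minimal sketch of "text analytics directly on compression": word counting
# over a grammar without expanding it. Sequential CPU illustration only.

from collections import Counter
from functools import lru_cache

# Grammar: each rule is a list of symbols; a symbol is either a word (str)
# or a rule id (int). Rule 0 is the root.
rules = {
    0: [1, "and", 1, "end"],
    1: ["big", "data", 2],
    2: ["on", "gpu"],
}

@lru_cache(maxsize=None)
def counts_of_rule(rid: int) -> Counter:
    """Word counts contributed by one expansion of rule `rid` (computed once per rule)."""
    c = Counter()
    for sym in rules[rid]:
        if isinstance(sym, int):
            c += counts_of_rule(sym)   # reuse sub-rule counts instead of re-scanning text
        else:
            c[sym] += 1
    return c

print(counts_of_rule(0))
# Counter({'big': 2, 'data': 2, 'on': 2, 'gpu': 2, 'and': 1, 'end': 1})
```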
6. Elan: Towards Generic and Efficient Elastic Training for Deep Learning
- Author
- Baodong Wu, Jidong Zhai, Xingcheng Zhang, Lei Xie, Yuanbo Wang, Peng Sun, and Shengen Yan
- Subjects
Computer science, Deep learning, Distributed computing, Training system, Initialization, Replication (computing), Asynchronous communication, Overhead (computing), Resource management, Artificial intelligence - Abstract
Elastic deep learning training, which promises better resource utilization and faster training, has been attracting increasing attention recently. Nevertheless, existing approaches to providing elasticity have certain limitations. They either fail to fully explore the parallelism of deep learning training when scaling out or lack an efficient mechanism to replicate training states among different devices. To address these limitations, we design Elan, a generic and efficient elastic training system for deep learning. In Elan, we propose a novel hybrid scaling mechanism to make a good trade-off between training efficiency and model performance when exploring more parallelism. We exploit the topology of the underlying devices to perform concurrent and IO-free training state replication. To avoid the high overhead of startup and initialization, we further propose an asynchronous coordination mechanism. Powered by these innovations, Elan provides high-performance (~1s) migration, scale-in, and scale-out support with negligible runtime overhead.
- Published
- 2020
7. Edge-Stream: a Stream Processing Approach for Distributed Applications on a Hierarchical Edge-computing System
- Author
- Guangyu Sun, Ping Han, Jidong Zhai, Zhe Zhou, Xiaoyang Wang, and Tong Meng
- Subjects
Information privacy, Computer science, Distributed computing, Cloud computing, Data sharing, Stream processing, Programming paradigm, Edge computing - Abstract
With the rapid growth of IoT devices, the traditional cloud computing scheme is inefficient for many IoT-based applications, mainly due to network data floods, long latency, and privacy issues. To this end, the edge computing scheme has been proposed to mitigate these problems. However, in an edge computing system, application development becomes more complicated, as it involves an increasing number of edge-node levels. Although some efforts have been introduced, existing edge computing frameworks still have limitations in various application scenarios. To overcome these limitations, we propose a new programming model called Edge-Stream. It is a simple and programmer-friendly model that can cover typical edge-computing scenarios. Besides, we address several new issues, such as data sharing and area awareness, in this model. We also implement a prototype edge-computing framework based on the Edge-Stream model. A comprehensive evaluation is provided based on the prototype. Experimental results demonstrate the effectiveness of the model.
- Published
- 2020
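For flavor only, here is a hypothetical minimal stream pipeline in Python (the API is made up; the abstract does not show Edge-Stream's real interface): operators are annotated with the level at which they would run, and records flow through them level by level.

```python
# Hypothetical sketch of a stream-style programming model for hierarchical
# edge computing. All names here are illustrative assumptions.

class Stream:
    def __init__(self, source, level="device"):
        self.source = source
        self.level = level
        self.ops = []

    def map(self, fn, level=None):
        """Append an operator, annotated with the level at which it would run."""
        self.ops.append((fn, level or self.level))
        return self

    def run(self):
        for record in self.source:
            for fn, level in self.ops:
                record = fn(record)   # conceptually executed at `level`
            yield record

readings = [{"sensor": 1, "temp": 21.5}, {"sensor": 2, "temp": 35.0}]
pipeline = (Stream(readings, level="device")
            .map(lambda r: {**r, "hot": r["temp"] > 30}, level="edge")
            .map(lambda r: r, level="cloud"))   # e.g. aggregation further upstream
print(list(pipeline.run()))
```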
8. GraphPi: High Performance Graph Pattern Matching through Effective Redundancy Elimination
- Author
- Mingshu Zhai, Tianhui Shi, Jidong Zhai, and Yi Xu
- Subjects
Optimal matching, Matching (graph theory), Computer science, Computation, Databases (cs.DB), Graph theory, Graph, Set (abstract data type), Distributed, Parallel, and Cluster Computing (cs.DC), Redundancy (engineering), Pattern matching, Algorithm - Abstract
Graph pattern matching, which aims to discover structural patterns in graphs, is considered one of the most fundamental graph mining problems in many real applications. Despite previous efforts, existing systems face two main challenges. First, the inherent symmetry of patterns can introduce a large amount of redundant computation. Second, different matching orders for a pattern have significant performance differences and are quite hard to predict. When these factors are combined, the problem becomes extremely complicated, and highly efficient pattern matching remains an open problem. To address these challenges, we propose GraphPi, a high-performance distributed pattern matching system. GraphPi utilizes a new algorithm based on 2-cycles in group theory to generate multiple sets of asymmetric restrictions, where each set can eliminate redundant computation completely. We further design an accurate performance model to determine the optimal matching order and asymmetric restriction set for efficient pattern matching. We evaluate GraphPi on the Tianhe-2A supercomputer. Results show that GraphPi outperforms the state-of-the-art system by up to 105× on 6 real-world graph datasets on a single node. We also scale GraphPi to 1,024 computing nodes (24,576 cores).
- Published
- 2020
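The redundancy-elimination idea can be illustrated with triangle counting in Python: without restrictions every triangle is matched once per automorphism of the pattern, while the asymmetric restriction v0 < v1 < v2 matches it exactly once. GraphPi generates such restriction sets automatically for arbitrary patterns; the sketch below only shows the hand-written triangle case.

```python
# Symmetry breaking in pattern matching: counting triangles.
# Without restrictions, every triangle is found 3! = 6 times (once per
# automorphism of the pattern); the restriction a < b < c finds it once.

from itertools import permutations

edges = {(0, 1), (1, 2), (0, 2), (2, 3), (1, 3)}
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)
nodes = sorted(adj)

def naive_count():
    return sum(1 for a, b, c in permutations(nodes, 3)
               if b in adj[a] and c in adj[b] and c in adj[a])

def restricted_count():
    # enforce a < b < c: each triangle is embedded exactly once
    return sum(1 for a in nodes for b in adj[a] if b > a
               for c in adj[b] if c > b and c in adj[a])

print(naive_count(), restricted_count())   # 12 and 2 (the graph has two triangles)
```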
9. Enabling Efficient Random Access to Hierarchically-Compressed Data
- Author
- Xiaoyong Du, Xipeng Shen, Onur Mutlu, Jidong Zhai, and Feng Zhang
- Subjects
Set (abstract data type), Tree traversal, Speedup, Computer engineering, Computer science, Random access - Abstract
Recent studies have shown the promise of direct data processing on hierarchically-compressed text documents. By removing the need for decompressing data, the direct data processing technique brings large savings in both time and space. However, its benefits have been limited to data traversal operations; for random accesses, direct data processing is several times slower than the state-of-the-art baselines. This paper presents a set of techniques that successfully eliminates this limitation and, for the first time, establishes the feasibility of effectively handling both data traversal operations and random data accesses on hierarchically-compressed data. The work yields a new library, which achieves a 3.1× speedup over the state of the art on random data accesses to compressed data, while preserving the capability of supporting traversal operations efficiently and providing large (3.9×) space savings.
- Published
- 2020
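A common enabler for random access on grammar-compressed data is precomputing the expanded length of every rule so a query can skip whole sub-rules. The Python sketch below illustrates that general idea under simple assumptions; it is not the paper's library, which also supports efficient traversal operations.

```python
# Random access on grammar-compressed text: precompute the expanded length
# of every rule, then answer s[i] by descending the grammar and skipping
# whole sub-rules, never decompressing the document.

rules = {               # symbols are literal strings or rule ids (int)
    0: [1, 1, "xyz", 1],
    1: ["ab", 2],
    2: ["cd"],
}

length = {}
def rule_len(rid):
    if rid not in length:
        length[rid] = sum(rule_len(s) if isinstance(s, int) else len(s)
                          for s in rules[rid])
    return length[rid]

def char_at(rid, i):
    """Return the i-th character of the expansion of rule `rid`."""
    for sym in rules[rid]:
        n = rule_len(sym) if isinstance(sym, int) else len(sym)
        if i < n:
            return char_at(sym, i) if isinstance(sym, int) else sym[i]
        i -= n
    raise IndexError(i)

full = "abcd" + "abcd" + "xyz" + "abcd"          # what rule 0 expands to
assert all(char_at(0, i) == full[i] for i in range(rule_len(0)))
print(char_at(0, 9))   # 'y'
```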
10. SCALANA: Automating Scaling Loss Detection with Graph Analysis
- Author
- Teng Yu, Xu Liu, Yuyang Jin, Jidong Zhai, Haojie Wang, Torsten Hoefler, and Xiongchao Tang
- Subjects
Computer science, Distributed computing, Tracing, Root-Cause Detection, Profiling (computer programming), Amdahl's law, Performance analysis, Static analysis, Graph theory, Scalability Bottleneck, Root cause analysis, Performance (cs.PF), Graph (abstract data type), Compiler - Abstract
Scaling a parallel program to modern supercomputers is challenging due to inter-process communication, Amdahl's law, and resource contention. Performance analysis tools for finding such scaling bottlenecks are based on either profiling or tracing. Profiling incurs low overhead but does not capture the detailed dependencies needed for root-cause analysis; tracing collects all information at prohibitive overhead. In this work, we design SCALANA, which uses static analysis techniques to achieve the best of both worlds: it enables the analyzability of traces at a cost similar to profiling. SCALANA first leverages static compiler techniques to build a Program Structure Graph, which records the main computation and communication patterns as well as the program's control structures. At runtime, we adopt lightweight techniques to collect performance data according to the graph structure and generate a Program Performance Graph. With this graph, we propose a novel approach, called backtracking root cause detection, which can automatically and efficiently detect the root cause of scaling loss. We evaluate SCALANA with real applications. Results show that our approach can effectively locate the root cause of scaling loss for real applications and incurs 1.73% overhead on average for up to 2,048 processes. We achieve up to 11.11% performance improvement by fixing the root causes detected by SCALANA on 2,048 processes. (Published at SC20: International Conference for High Performance Computing, Networking, Storage and Analysis.)
- Published
- 2020
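The backtracking root-cause idea can be sketched on a toy graph: start from the vertex whose time degrades most with scale and walk backward along dependency edges while predecessors also degrade. The structure, threshold, and vertex names below are illustrative assumptions, not SCALANA's actual Program Performance Graph.

```python
# Toy sketch of backtracking root-cause detection on a performance graph.
# Each vertex records its time at a small and a large process count; we
# backtrack from the vertex with the worst scaling loss along its
# predecessors to find the earliest vertex that still scales poorly.

graph = {
    # vertex: (time at 64 procs, time at 2048 procs, predecessors)
    "compute_loop": (10.0, 10.4, []),
    "pack_halo":    ( 1.0,  4.0, ["compute_loop"]),
    "mpi_waitall":  ( 2.0, 19.0, ["pack_halo"]),
    "reduce_norm":  ( 0.5,  0.6, ["mpi_waitall"]),
}

def scaling_loss(v):
    t_small, t_large, _ = graph[v]
    return t_large / t_small          # > 1 means the vertex slows down with scale

def backtrack_root_cause(threshold=2.0):
    worst = max(graph, key=scaling_loss)      # symptom: where the loss shows up
    cause = worst
    frontier = list(graph[worst][2])
    while frontier:
        v = frontier.pop()
        if scaling_loss(v) >= threshold:      # predecessor also degrades:
            cause = v                         # push the root cause earlier
            frontier.extend(graph[v][2])
    return worst, cause

print(backtrack_root_cause())   # ('mpi_waitall', 'pack_halo')
```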
11. CSE: Parallel Finite State Machines with Convergence Set Enumeration
- Author
- Yanzhi Wang, Youwei Zhuo, Zhongzhi Luan, Jidong Zhai, Xuehai Qian, Jinglei Cheng, and Qinyi Luo
- Subjects
Sequence, Finite-state machine, Correctness, Speedup, Computer science, Computation, Automaton, Set (abstract data type), Convergence, Path (graph theory), Enumeration, Algorithm - Abstract
Finite State Machine (FSM) computation is known to be embarrassingly sequential because the next state depends on the current state and the input symbol. Enumerative FSM breaks the data dependencies by cutting the input symbols into segments and processing all segments in parallel. Because the starting state of each segment (except the first) is unknown, each segment must compute the state transitions, i.e., state→state, for all states; each such computation is called an enumeration path. Current software and hardware implementations suffer from two drawbacks: (1) a large amount of state→state computation for the enumeration paths; and (2) optimizations are restricted by the need to correctly perform state→state and achieve only limited improvements. This paper proposes CSE, a Convergence-Set-Enumeration-based parallel FSM. Unlike prior approaches, CSE is based on a novel computation primitive, set→set, which maps N states to M states without giving the specific state→state mappings (which state is mapped to which). The set→set primitive has two key properties: (1) if M is equal to 1, i.e., all N states are mapped to the same state, the state→state results for all N states have been computed; and (2) using one-hot encoding, the hardware implementation cost of set→set is the same as that of state→state. The convergence property ensures that M is always less than N. The key idea of CSE is to partition all S states into n state sets, i.e., convergence sets (CSs). When processing a CS with set→set, if the states converge to a single state, we have successfully computed the enumeration path for each state in the CS; otherwise, we may need to re-execute the stage when the outcome of the previous stage falls in that CS. CSE is realized by two techniques: (1) convergence set prediction, which generates the convergence sets through random-input profiling that maximizes the probability of each CS converging to one state; and (2) a global re-execution algorithm, which ensures correctness by re-executing the non-converging stages with the known input state. Essentially, CSE reformulates the enumeration paths as set-based rather than singleton-based. We evaluate CSE with 13 benchmarks. It achieves average speedups of 2.0x/2.4x and maximum speedups of 8.6x/2.7x compared to Lookback Enumeration and the Parallel Automata Processor, respectively.
- Published
- 2018
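A small Python sketch of the enumerative-FSM setting may help: each segment is run from every candidate start state, and if all runs converge to one state, the segment's result no longer depends on the unknown input state. CSE's set→set primitive, convergence-set prediction, and re-execution machinery are not reproduced here.

```python
# Enumerative FSM processing with convergence checking (a toy illustration
# of the setting behind CSE, not its hardware set->set primitive).

# FSM over alphabet {'a','b'}: delta[state][symbol] -> next state
delta = {
    0: {"a": 1, "b": 0},
    1: {"a": 1, "b": 2},
    2: {"a": 1, "b": 0},
}

def run_segment(start_states, segment):
    """Map each candidate start state through the segment (the enumeration paths)."""
    out = {}
    for s in start_states:
        cur = s
        for sym in segment:
            cur = delta[cur][sym]
        out[s] = cur
    return out

segments = ["abba", "aab", "b"]      # processed independently, i.e. in parallel
all_states = set(delta)
for seg in segments:
    result = run_segment(all_states, seg)
    converged = len(set(result.values())) == 1
    print(seg, result, "converged" if converged else "needs the real input state")
```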
12. BitFlow: Exploiting Vector Parallelism for Binary Neural Networks on CPU
- Author
- Liu Wei, Su Lei, Jidong Zhai, Yuwei Hu, Yifan Gong, Jiangming Jin, Dinghua Li, and Yuhao Zhu
- Subjects
Speedup, Computational complexity theory, Artificial neural network, Computer science, Deep learning, Binary number, Parallel computing, Single-core, Artificial intelligence, Bitwise operation, Xeon Phi - Abstract
Deep learning has revolutionized computer vision and other fields since its big bang in 2012. However, it is challenging to deploy Deep Neural Networks (DNNs) in real-world applications due to their high computational complexity. Binary Neural Networks (BNNs) dramatically reduce computational complexity by replacing most arithmetic operations with bitwise operations. Existing implementations of BNNs have focused on GPUs or FPGAs and use the conventional image-to-column method, which does not perform well for binary convolution due to low arithmetic intensity and a memory pattern unfriendly to bitwise operations. We propose BitFlow, a gemm-operator-network three-level optimization framework for fully exploiting the computing power of BNNs on CPUs. BitFlow features a new class of algorithm named PressedConv for efficient binary convolution using a locality-aware layout and vector parallelism. We evaluate BitFlow with the VGG network. On a single core of an Intel Xeon Phi, BitFlow obtains a 1.8x speedup over unoptimized BNN implementations, and an 11.5x speedup over counterpart full-precision DNNs. Over 64 cores, BitFlow enables BNNs to run 1.1x faster than counterpart full-precision DNNs on a GPU (GTX 1080).
- Published
- 2018
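The arithmetic that BNNs exploit is easy to show: with {-1,+1} values packed into machine words, a dot product reduces to XNOR (or XOR) plus popcount. The Python sketch below verifies that identity; BitFlow's PressedConv layout and vectorization are not shown.

```python
# Core BNN arithmetic: for vectors with entries in {-1,+1} packed as bits
# (1 for +1, 0 for -1), the dot product is
#   dot = n - 2 * popcount(a XOR b)
# so XNOR/XOR + popcount replaces multiply-accumulate.

import random

def pack(bits):
    word = 0
    for i, b in enumerate(bits):
        word |= (b & 1) << i
    return word

n = 64
a_bits = [random.randint(0, 1) for _ in range(n)]
b_bits = [random.randint(0, 1) for _ in range(n)]

# reference dot product over {-1,+1}
a_pm = [2 * b - 1 for b in a_bits]
b_pm = [2 * b - 1 for b in b_bits]
ref = sum(x * y for x, y in zip(a_pm, b_pm))

# bitwise version: one XOR + one popcount per 64 elements
xor = pack(a_bits) ^ pack(b_bits)
dot = n - 2 * bin(xor).count("1")

assert dot == ref
print(dot)
```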
13. Scalable Graph Traversal on Sunway TaihuLight with Ten Million Cores
- Author
- Jidong Zhai, Wenguang Chen, Weimin Zheng, Xiongchao Tang, Youwei Zhuo, Wanwang Yin, Bowen Yu, and Heng Lin
- Subjects
Memory hierarchy, Computer science, Maximum flow problem, Breadth-first search, Parallel algorithm, Parallel computing, Graph, Graph traversal, Shortest path problem, Algorithm design, Sunway TaihuLight - Abstract
Interest has recently grown in efficiently analyzing unstructured data such as social network graphs and protein structures. A fundamental graph algorithm for such tasks is Breadth-First Search (BFS), the foundation of many other important graph algorithms such as computing shortest paths or finding the maximum flow in graphs. In this paper, we share our experience of designing and implementing the BFS algorithm on Sunway TaihuLight, a newly released machine with 40,960 nodes and 10.6 million accelerator cores. It tops the Top500 list of June 2016 with a 93.01 petaflops Linpack performance [1]. Designed for extremely large-scale computation and power efficiency, the processors of Sunway TaihuLight employ a unique heterogeneous many-core architecture and memory hierarchy. With its extremely large size, the machine provides both opportunities and challenges for implementing high-performance irregular algorithms such as BFS. We propose several techniques, including pipelined module mapping, contention-free data shuffling, and group-based message batching, to address the challenges of efficiently utilizing the features of this large-scale heterogeneous machine. We ultimately achieved 23,755.7 giga-traversed edges per second (GTEPS), which is the best among heterogeneous machines and the second overall on Graph500's June 2016 list [2].
- Published
- 2017
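For reference, the base algorithm being scaled is level-synchronous BFS, shown below in plain Python; the paper's contributions (pipelined module mapping, contention-free data shuffling, group-based message batching) are distributed-systems techniques layered on top of this.

```python
# Reference level-synchronous BFS. On Sunway TaihuLight the frontier and
# adjacency lists are partitioned across tens of thousands of nodes; this
# sketch only shows the sequential baseline.

from collections import deque

def bfs(adj, root):
    parent = {root: root}
    frontier = deque([root])
    while frontier:
        next_frontier = deque()
        for u in frontier:
            for v in adj.get(u, ()):
                if v not in parent:        # first visit: record BFS tree edge
                    parent[v] = u
                    next_frontier.append(v)
        frontier = next_frontier
    return parent

adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
print(bfs(adj, 0))   # parent pointers forming a BFS tree rooted at vertex 0
```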
14. FinePar: Irregularity-aware fine-grained workload partitioning on integrated architectures
- Author
- Feng Zhang, Bo Wu, Jidong Zhai, Bingsheng He, and Wenguang Chen
- Published
- 2017
15. A Fast Tridiagonal Solver for Intel MIC Architecture
- Author
- Yangtong Xu, Xinliang Wang, Weimin Zheng, Hai Xiang Lin, Jidong Zhai, and Wei Xue
- Subjects
Tridiagonal matrix, Computer science, Parallel algorithm, Double-precision floating-point format, Parallel computing, Solver, Single-precision floating-point format, Vectorization (mathematics), Algorithm design, Xeon Phi - Abstract
The tridiagonal solver is an important kernel in many scientific and engineering applications. Although quite a few parallel algorithms have been explored recently, challenges remain when solving tridiagonal systems on many-core architectures. In this paper, a quantitative analysis is conducted to guide the selection of algorithms on different architectures, and a fast register-PCR-pThomas algorithm for the Intel MIC architecture is presented to achieve the best utilization of in-core vectorization and registers. By choosing the most suitable number of PCR reductions, we further propose an improved algorithm, named register-PCR-half-pThomas, which minimizes the computational cost and the number of registers used. Optimizations based on manual prefetching and strength reduction are also discussed for the new algorithm. Evaluation on an Intel Xeon Phi 7110P shows that our register-PCR-pThomas algorithm outperforms the MKL solver by 7.7 times in single precision and 3.4 times in double precision. Moreover, our optimized register-PCR-half-pThomas further boosts performance by another 43.9% in single precision and 20.4% in double precision. Our best algorithm outperforms the latest official GPU library (cuSPARSE) on an NVIDIA K40 by 3.7 times and 2.4 times in single and double precision, respectively.
- Published
- 2016
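For context, the pThomas stage is built on the classic Thomas algorithm, sketched below in Python for a small diagonally dominant system; in the paper, PCR reduction steps first split the tridiagonal system into independent subsystems that are then solved this way in registers.

```python
# Reference Thomas algorithm for a tridiagonal system Ax = d, with sub-,
# main-, and super-diagonals a, b, c (a[0] and c[-1] unused). Only the
# baseline sweep is shown; the PCR splitting and register blocking are not.

def thomas(a, b, c, d):
    n = len(d)
    cp, dp = [0.0] * n, [0.0] * n
    cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
    for i in range(1, n):                      # forward elimination
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m if i < n - 1 else 0.0
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):             # back substitution
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

# 4x4 diagonally dominant example
a = [0.0, -1.0, -1.0, -1.0]
b = [4.0, 4.0, 4.0, 4.0]
c = [-1.0, -1.0, -1.0, 0.0]
d = [5.0, 5.0, 5.0, 5.0]
print(thomas(a, b, c, d))   # approximately [1.818, 2.273, 2.273, 1.818]
```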
16. To Co-run, or Not to Co-run: A Performance Study on Integrated Architectures
- Author
- Feng Zhang, Wenguang Chen, Jidong Zhai, Bingsheng He, and Shuhao Zhang
- Subjects
Computer architecture, CPU modes, Computer science, Benchmark (computing), Benchmarking, Central processing unit, CPU shielding, General-purpose computing on graphics processing units, Chip - Abstract
Architecture designers tend to integrate both CPUs and GPUs on the same chip to deliver energy-efficient designs. To effectively leverage the power of both CPUs and GPUs on integrated architectures, researchers have recently put substantial effort into co-running a single application on both the CPU and the GPU of such architectures. However, few studies have analyzed a wide range of parallel computation patterns on these architectures. In this paper, we port all programs in the Rodinia benchmark suite and co-run them on an integrated architecture. We find that co-running results are not always better than running the application on the CPU only or the GPU only. Among the 20 programs, 3 benefit from co-running, 12 achieve the best performance using the GPU only, and 2 achieve the best performance using the CPU only. The remaining 3 programs show no performance preference for different devices. We also characterize the workloads and summarize the patterns, providing system insights for co-running on integrated architectures.
- Published
- 2015
17. Optimizing Seam Carving on multi-GPU systems for real-time image resizing
- Author
- Wenguang Chen, Jidong Zhai, Ikjoon Kim, and Yan Li
- Subjects
Speedup, Seam carving, Computer science, Computer vision, Image processing, Artificial intelligence, Resizing, Multi-GPU, Image (mathematics) - Abstract
Image resizing is increasingly important for picture sharing and exchange between various personal electronic devices. Seam Carving is a state-of-the-art approach for effective image resizing because of its content-aware characteristic. However, complex computation and memory access patterns make it time-consuming and prevent its wide use in real-time image processing. To address these problems, we propose a novel algorithm, called Non-Cumulative Seam Carving (NCSC), which removes the main computation bottleneck. Furthermore, we propose an adaptive multi-seam algorithm for better parallelism on GPU platforms. Finally, we implement our algorithm on a multi-GPU platform. Results show that our approach achieves a maximum 140× speedup on a two-GPU system over the sequential version. It takes only 0.11 seconds to resize a 1024×640 image to half its width, compared to 15.5 seconds with traditional seam carving.
- Published
- 2014
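For reference, classic seam carving finds a minimal seam by dynamic programming over a cumulative energy map, as in the Python sketch below; the paper's NCSC algorithm removes exactly this cumulative-energy bottleneck and extracts multiple seams per pass, which is not shown here.

```python
# Classic seam-carving step: dynamic programming over an energy map to find
# one minimal vertical seam (the baseline that NCSC improves on).

def find_vertical_seam(energy):
    h, w = len(energy), len(energy[0])
    cost = [row[:] for row in energy]
    for y in range(1, h):                       # cumulative minimum energy
        for x in range(w):
            lo, hi = max(0, x - 1), min(w - 1, x + 1)
            cost[y][x] += min(cost[y - 1][lo:hi + 1])
    seam = [min(range(w), key=lambda x: cost[h - 1][x])]
    for y in range(h - 2, -1, -1):              # backtrack the seam upward
        x = seam[-1]
        lo, hi = max(0, x - 1), min(w - 1, x + 1)
        seam.append(min(range(lo, hi + 1), key=lambda i: cost[y][i]))
    return list(reversed(seam))                 # one x coordinate per row

energy = [
    [3, 1, 4, 1],
    [2, 7, 1, 8],
    [2, 8, 1, 8],
]
print(find_vertical_seam(energy))   # [1, 2, 2]: the lowest-energy top-to-bottom path
```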
18. CYPRESS: Combining Static and Dynamic Analysis for Top-Down Communication Trace Compression
- Author
- Wenguang Chen, Jidong Zhai, Xiongchao Tang, Xiaosong Ma, and Jianfei Hu
- Subjects
Tree (data structure), Asynchronous communication, Computer science, Overhead (computing), Algorithm design, Parallel computing, Static analysis, Compile time, Volume (compression) - Abstract
Communication traces are increasingly important, both for the performance analysis and optimization of parallel applications and for designing next-generation HPC systems. Meanwhile, the problem size and the execution scale on supercomputers keep growing, producing prohibitive volumes of communication traces. To reduce the size of communication traces, existing dynamic compression methods introduce large compression overhead as the job scale grows. We propose a hybrid static-dynamic method that leverages information acquired from static analysis to facilitate more effective and efficient dynamic trace compression. Our proposed scheme, Cypress, extracts a program communication structure tree at compile time using inter-procedural analysis. This tree naturally contains crucial iterative computing features such as the loop structure, allowing subsequent runtime compression to "fill in", in a "top-down" manner, event details into the known communication template. Results show that Cypress reduces intra-process and inter-process compression overhead by up to 5× and 9×, respectively, over state-of-the-art dynamic methods, while introducing only very low compile-time overhead.
- Published
- 2014
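The "fill in a known template" idea can be illustrated with a toy example: if static analysis fixes the communication skeleton of a loop, the runtime only needs to record the fields that vary per iteration. The field names and format below are assumptions for illustration, not Cypress's trace format.

```python
# Toy illustration of top-down trace compression against a static template.
# Static analysis provides the loop's fixed communication skeleton; at
# runtime only the fields that differ from the template are recorded.

template = [
    {"op": "MPI_Send", "dest": "left",  "tag": 7},   # known at compile time
    {"op": "MPI_Recv", "src":  "right", "tag": 7},
]

def template_value(slot, key):
    return slot.get(key, None)

def compress(trace_events, template, iters):
    """Keep only event fields whose values differ from the template (e.g. message size)."""
    compressed = []
    for it in range(iters):
        for slot, event in zip(template, trace_events[it]):
            extras = {k: v for k, v in event.items() if template_value(slot, k) != v}
            compressed.append(extras)
    return compressed

# Two loop iterations; only the byte count varies across iterations.
events = [
    [{"op": "MPI_Send", "dest": "left", "tag": 7, "bytes": 4096},
     {"op": "MPI_Recv", "src": "right", "tag": 7, "bytes": 4096}],
    [{"op": "MPI_Send", "dest": "left", "tag": 7, "bytes": 8192},
     {"op": "MPI_Recv", "src": "right", "tag": 7, "bytes": 8192}],
]
print(compress(events, template, iters=2))
# [{'bytes': 4096}, {'bytes': 4096}, {'bytes': 8192}, {'bytes': 8192}]
```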