Author: "Zhao, Kaiyong" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Zhao, Kaiyong"' showing total 131 results

Start Over Author "Zhao, Kaiyong"

131 results on '"Zhao, Kaiyong"'

1. FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression

Author: Tang, Zhenheng, Kang, Xueze, Yin, Yiming, Pan, Xinglin, Wang, Yuxin, He, Xin, Wang, Qiang, Zeng, Rongfei, Zhao, Kaiyong, Shi, Shaohuai, Zhou, Amelie Chi, Li, Bo, He, Bingsheng, and Chu, Xiaowen
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: To alleviate hardware scarcity in training large deep neural networks (DNNs), particularly large language models (LLMs), we present FusionLLM, a decentralized training system designed and implemented for training DNNs using geo-distributed GPUs across different computing clusters or individual devices. Decentralized training faces significant challenges regarding system design and efficiency, including: 1) the need for remote automatic differentiation (RAD), 2) support for flexible model definitions and heterogeneous software, 3) heterogeneous hardware leading to low resource utilization or the straggler problem, and 4) slow network communication. To address these challenges, in the system design, we represent the model as a directed acyclic graph of operators (OP-DAG). Each node in the DAG represents the operator in the DNNs, while the edge represents the data dependency between operators. Based on this design, 1) users are allowed to customize any DNN without caring low-level operator implementation; 2) we enable the task scheduling with the more fine-grained sub-tasks, offering more optimization space; 3) a DAG runtime executor can implement RAD withour requiring the consistent low-level ML framework versions. To enhance system efficiency, we implement a workload estimator and design an OP-Fence scheduler to cluster devices with similar bandwidths together and partition the DAG to increase throughput. Additionally, we propose an AdaTopK compressor to adaptively compress intermediate activations and gradients at the slowest communication links. To evaluate the convergence and efficiency of our system and algorithms, we train ResNet-101 and GPT-2 on three real-world testbeds using 48 GPUs connected with 8 Mbps~10 Gbps networks. Experimental results demonstrate that our system and method can achieve 1.45 - 9.39x speedup compared to baseline methods while ensuring convergence.
Published: 2024

2. CF-NeRF: Camera Parameter Free Neural Radiance Fields with Incremental Learning

Author: Yan, Qingsong, Wang, Qiang, Zhao, Kaiyong, Chen, Jie, Li, Bo, Chu, Xiaowen, and Deng, Fei
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Neural Radiance Fields (NeRF) have demonstrated impressive performance in novel view synthesis. However, NeRF and most of its variants still rely on traditional complex pipelines to provide extrinsic and intrinsic camera parameters, such as COLMAP. Recent works, like NeRFmm, BARF, and L2G-NeRF, directly treat camera parameters as learnable and estimate them through differential volume rendering. However, these methods work for forward-looking scenes with slight motions and fail to tackle the rotation scenario in practice. To overcome this limitation, we propose a novel \underline{c}amera parameter \underline{f}ree neural radiance field (CF-NeRF), which incrementally reconstructs 3D representations and recovers the camera parameters inspired by incremental structure from motion (SfM). Given a sequence of images, CF-NeRF estimates the camera parameters of images one by one and reconstructs the scene through initialization, implicit localization, and implicit optimization. To evaluate our method, we use a challenging real-world dataset NeRFBuster which provides 12 scenes under complex trajectories. Results demonstrate that CF-NeRF is robust to camera rotation and achieves state-of-the-art results without providing prior information and constraints., Comment: Accepted at the Thirty-Eighth AAAI Conference on Artificial Intelligence (AAAI24)
Published: 2023

3. FusionAI: Decentralized Training and Deploying LLMs with Massive Consumer-Level GPUs

Author: Tang, Zhenheng, Wang, Yuxin, He, Xin, Zhang, Longteng, Pan, Xinglin, Wang, Qiang, Zeng, Rongfei, Zhao, Kaiyong, Shi, Shaohuai, He, Bingsheng, and Chu, Xiaowen
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Computer Science - Networking and Internet Architecture
Abstract: The rapid growth of memory and computation requirements of large language models (LLMs) has outpaced the development of hardware, hindering people who lack large-scale high-end GPUs from training or deploying LLMs. However, consumer-level GPUs, which constitute a larger market share, are typically overlooked in LLM due to their weaker computing performance, smaller storage capacity, and lower communication bandwidth. Additionally, users may have privacy concerns when interacting with remote LLMs. In this paper, we envision a decentralized system unlocking the potential vast untapped consumer-level GPUs in pre-training, inference and fine-tuning of LLMs with privacy protection. However, this system faces critical challenges, including limited CPU and GPU memory, low network bandwidth, the variability of peer and device heterogeneity. To address these challenges, our system design incorporates: 1) a broker with backup pool to implement dynamic join and quit of computing providers; 2) task scheduling with hardware performance to improve system efficiency; 3) abstracting ML procedures into directed acyclic graphs (DAGs) to achieve model and task universality; 4) abstracting intermediate represention and execution planes to ensure compatibility of various devices and deep learning (DL) frameworks. Our performance analysis demonstrates that 50 RTX 3080 GPUs can achieve throughputs comparable to those of 4 H100 GPUs, which are significantly more expensive.
Published: 2023

4. Rethinking Disparity: A Depth Range Free Multi-View Stereo Based on Disparity

Author: Yan, Qingsong, Wang, Qiang, Zhao, Kaiyong, Li, Bo, Chu, Xiaowen, and Deng, Fei
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Existing learning-based multi-view stereo (MVS) methods rely on the depth range to build the 3D cost volume and may fail when the range is too large or unreliable. To address this problem, we propose a disparity-based MVS method based on the epipolar disparity flow (E-flow), called DispMVS, which infers the depth information from the pixel movement between two views. The core of DispMVS is to construct a 2D cost volume on the image plane along the epipolar line between each pair (between the reference image and several source images) for pixel matching and fuse uncountable depths triangulated from each pair by multi-view geometry to ensure multi-view consistency. To be robust, DispMVS starts from a randomly initialized depth map and iteratively refines the depth map with the help of the coarse-to-fine strategy. Experiments on DTUMVS and Tanks\&Temple datasets show that DispMVS is not sensitive to the depth range and achieves state-of-the-art results with lower GPU memory., Comment: Accepted at the Thirty-Seventh AAAI Conference on Artificial Intelligence (AAAI23)
Published: 2022

5. SphereDepth: Panorama Depth Estimation from Spherical Domain

Author: Yan, Qingsong, Wang, Qiang, Zhao, Kaiyong, Li, Bo, Chu, Xiaowen, and Deng, Fei
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: The panorama image can simultaneously demonstrate complete information of the surrounding environment and has many advantages in virtual tourism, games, robotics, etc. However, the progress of panorama depth estimation cannot completely solve the problems of distortion and discontinuity caused by the commonly used projection methods. This paper proposes SphereDepth, a novel panorama depth estimation method that predicts the depth directly on the spherical mesh without projection preprocessing. The core idea is to establish the relationship between the panorama image and the spherical mesh and then use a deep neural network to extract features on the spherical domain to predict depth. To address the efficiency challenges brought by the high-resolution panorama data, we introduce two hyper-parameters for the proposed spherical mesh processing framework to balance the inference speed and accuracy. Validated on three public panorama datasets, SphereDepth achieves comparable results with the state-of-the-art methods of panorama depth estimation. Benefiting from the spherical domain setting, SphereDepth can generate a high-quality point cloud and significantly alleviate the issues of distortion and discontinuity., Comment: Conference accept at 3DV 2022
Published: 2022

6. EASNet: Searching Elastic and Accurate Network Architecture for Stereo Matching

Author: Wang, Qiang, Shi, Shaohuai, Zhao, Kaiyong, and Chu, Xiaowen
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Recent advanced studies have spent considerable human efforts on optimizing network architectures for stereo matching but hardly achieved both high accuracy and fast inference speed. To ease the workload in network design, neural architecture search (NAS) has been applied with great success to various sparse prediction tasks, such as image classification and object detection. However, existing NAS studies on the dense prediction task, especially stereo matching, still cannot be efficiently and effectively deployed on devices of different computing capabilities. To this end, we propose to train an elastic and accurate network for stereo matching (EASNet) that supports various 3D architectural settings on devices with different computing capabilities. Given the deployment latency constraint on the target device, we can quickly extract a sub-network from the full EASNet without additional training while the accuracy of the sub-network can still be maintained. Extensive experiments show that our EASNet outperforms both state-of-the-art human-designed and NAS-based architectures on Scene Flow and MPI Sintel datasets in terms of model accuracy and inference speed. Particularly, deployed on an inference GPU, EASNet achieves a new SOTA 0.73 EPE on the Scene Flow dataset with 100 ms, which is 4.5$\times$ faster than LEAStereo with a better quality model.
Published: 2022

7. FADNet++: Real-Time and Accurate Disparity Estimation with Configurable Networks

Author: Wang, Qiang, Shi, Shaohuai, Zheng, Shizhen, Zhao, Kaiyong, and Chu, Xiaowen
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: Deep neural networks (DNNs) have achieved great success in the area of computer vision. The disparity estimation problem tends to be addressed by DNNs which achieve much better prediction accuracy than traditional hand-crafted feature-based methods. However, the existing DNNs hardly serve both efficient computation and rich expression capability, which makes them difficult for deployment in real-time and high-quality applications, especially on mobile devices. To this end, we propose an efficient, accurate, and configurable deep network for disparity estimation named FADNet++. Leveraging several liberal network design and training techniques, FADNet++ can boost its accuracy with a fast model inference speed for real-time applications. Besides, it enables users to easily configure different sizes of models for balancing accuracy and inference efficiency. We conduct extensive experiments to demonstrate the effectiveness of FADNet++ on both synthetic and realistic datasets among six GPU devices varying from server to mobile platforms. Experimental results show that FADNet++ and its variants achieve state-of-the-art prediction accuracy, and run at a significant order of magnitude faster speed than existing 3D models. With the constraint of running at above 15 frames per second (FPS) on a mobile GPU, FADNet++ achieves a new state-of-the-art result for the SceneFlow dataset., Comment: arXiv admin note: substantial text overlap with arXiv:2003.10758
Published: 2021

8. Evolutionary probability density reconstruction of stochastic dynamic responses based on physics-aided deep learning

Author: Xu, Zidong, Wang, Hao, Zhao, Kaiyong, Zhang, Han, Liu, Yun, and Lin, Yuxuan
Published: 2024
Full Text: View/download PDF

9. Efficient simulation of two-spatial dimensional turbulent wind fields based on the factorization of random functions

Author: Liu, Yun, Wang, Hao, Xu, Zidong, and Zhao, Kaiyong
Published: 2024
Full Text: View/download PDF

10. FADNet: A Fast and Accurate Network for Disparity Estimation

Author: Wang, Qiang, Shi, Shaohuai, Zheng, Shizhen, Zhao, Kaiyong, and Chu, Xiaowen
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: Deep neural networks (DNNs) have achieved great success in the area of computer vision. The disparity estimation problem tends to be addressed by DNNs which achieve much better prediction accuracy in stereo matching than traditional hand-crafted feature based methods. On one hand, however, the designed DNNs require significant memory and computation resources to accurately predict the disparity, especially for those 3D convolution based networks, which makes it difficult for deployment in real-time applications. On the other hand, existing computation-efficient networks lack expression capability in large-scale datasets so that they cannot make an accurate prediction in many scenarios. To this end, we propose an efficient and accurate deep network for disparity estimation named FADNet with three main features: 1) It exploits efficient 2D based correlation layers with stacked blocks to preserve fast computation; 2) It combines the residual structures to make the deeper model easier to learn; 3) It contains multi-scale predictions so as to exploit a multi-scale weight scheduling training technique to improve the accuracy. We conduct experiments to demonstrate the effectiveness of FADNet on two popular datasets, Scene Flow and KITTI 2015. Experimental results show that FADNet achieves state-of-the-art prediction accuracy, and runs at a significant order of magnitude faster speed than existing 3D models. The codes of FADNet are available at https://github.com/HKBU-HPML/FADNet.
Published: 2020

11. IRS: A Large Naturalistic Indoor Robotics Stereo Dataset to Train Deep Models for Disparity and Surface Normal Estimation

Author: Wang, Qiang, Zheng, Shizhen, Yan, Qingsong, Deng, Fei, Zhao, Kaiyong, and Chu, Xiaowen
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Robotics
Abstract: Indoor robotics localization, navigation, and interaction heavily rely on scene understanding and reconstruction. Compared to the monocular vision which usually does not explicitly introduce any geometrical constraint, stereo vision-based schemes are more promising and robust to produce accurate geometrical information, such as surface normal and depth/disparity. Besides, deep learning models trained with large-scale datasets have shown their superior performance in many stereo vision tasks. However, existing stereo datasets rarely contain the high-quality surface normal and disparity ground truth, which hardly satisfies the demand of training a prospective deep model for indoor scenes. To this end, we introduce a large-scale synthetic but naturalistic indoor robotics stereo (IRS) dataset with over 100K stereo RGB images and high-quality surface normal and disparity maps. Leveraging the advanced rendering techniques of our customized rendering engine, the dataset is considerably close to the real-world captured images and covers several visual effects, such as brightness changes, light reflection/transmission, lens flare, vivid shadow, etc. We compare the data distribution of IRS with existing stereo datasets to illustrate the typical visual attributes of indoor scenes. Besides, we present DTN-Net, a two-stage deep model for surface normal estimation. Extensive experiments show the advantages and effectiveness of IRS in training deep models for disparity estimation, and DTN-Net provides state-of-the-art results for normal estimation compared to existing methods.
Published: 2019

12. Simulation of turbulent wind field in multi-spatial dimensions using a novel non-uniform FFT enhanced stochastic wave-based spectral representation method

Author: Zhao, Kaiyong, Wang, Hao, and Xu, Zidong
Published: 2023
Full Text: View/download PDF

13. Layer-wise Adaptive Gradient Sparsification for Distributed Deep Learning with Convergence Guarantees

Author: Shi, Shaohuai, Tang, Zhenheng, Wang, Qiang, Zhao, Kaiyong, and Chu, Xiaowen
Subjects: Computer Science - Machine Learning, Computer Science - Distributed, Parallel, and Cluster Computing, Statistics - Machine Learning
Abstract: To reduce the long training time of large deep neural network (DNN) models, distributed synchronous stochastic gradient descent (S-SGD) is commonly used on a cluster of workers. However, the speedup brought by multiple workers is limited by the communication overhead. Two approaches, namely pipelining and gradient sparsification, have been separately proposed to alleviate the impact of communication overheads. Yet, the gradient sparsification methods can only initiate the communication after the backpropagation, and hence miss the pipelining opportunity. In this paper, we propose a new distributed optimization method named LAGS-SGD, which combines S-SGD with a novel layer-wise adaptive gradient sparsification (LAGS) scheme. In LAGS-SGD, every worker selects a small set of "significant" gradients from each layer independently whose size can be adaptive to the communication-to-computation ratio of that layer. The layer-wise nature of LAGS-SGD opens the opportunity of overlapping communications with computations, while the adaptive nature of LAGS-SGD makes it flexible to control the communication time. We prove that LAGS-SGD has convergence guarantees and it has the same order of convergence rate as vanilla S-SGD under a weak analytical assumption. Extensive experiments are conducted to verify the analytical assumption and the convergence performance of LAGS-SGD. Experimental results on a 16-GPU cluster show that LAGS-SGD outperforms the original S-SGD and existing sparsified S-SGD without losing obvious model accuracy., Comment: 8 pages. To appear at ECAI 2020
Published: 2019

14. Benchmarking the Performance and Energy Efficiency of AI Accelerators for AI Training

Author: Wang, Yuxin, Wang, Qiang, Shi, Shaohuai, He, Xin, Tang, Zhenheng, Zhao, Kaiyong, and Chu, Xiaowen
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Machine Learning
Abstract: Deep learning has become widely used in complex AI applications. Yet, training a deep neural network (DNNs) model requires a considerable amount of calculations, long running time, and much energy. Nowadays, many-core AI accelerators (e.g., GPUs and TPUs) are designed to improve the performance of AI training. However, processors from different vendors perform dissimilarly in terms of performance and energy consumption. To investigate the differences among several popular off-the-shelf processors (i.e., Intel CPU, NVIDIA GPU, AMD GPU, and Google TPU) in training DNNs, we carry out a comprehensive empirical study on the performance and energy efficiency of these processors by benchmarking a representative set of deep learning workloads, including computation-intensive operations, classical convolutional neural networks (CNNs), recurrent neural networks (LSTM), Deep Speech 2, and Transformer. Different from the existing end-to-end benchmarks which only present the training time, We try to investigate the impact of hardware, vendor's software library, and deep learning framework on the performance and energy consumption of AI training. Our evaluation methods and results not only provide an informative guide for end-users to select proper AI accelerators, but also expose some opportunities for the hardware vendors to improve their software library., Comment: Revised some minor issues
Published: 2019

15. AutoML: A Survey of the State-of-the-Art

Author: He, Xin, Zhao, Kaiyong, and Chu, Xiaowen
Subjects: Computer Science - Machine Learning, Computer Science - Computer Vision and Pattern Recognition, Statistics - Machine Learning
Abstract: Deep learning (DL) techniques have penetrated all aspects of our lives and brought us great convenience. However, building a high-quality DL system for a specific task highly relies on human expertise, hindering the applications of DL to more areas. Automated machine learning (AutoML) becomes a promising solution to build a DL system without human assistance, and a growing number of researchers focus on AutoML. In this paper, we provide a comprehensive and up-to-date review of the state-of-the-art (SOTA) in AutoML. First, we introduce AutoML methods according to the pipeline, covering data preparation, feature engineering, hyperparameter optimization, and neural architecture search (NAS). We focus more on NAS, as it is currently very hot sub-topic of AutoML. We summarize the performance of the representative NAS algorithms on the CIFAR-10 and ImageNet datasets and further discuss several worthy studying directions of NAS methods: one/two-stage NAS, one-shot NAS, and joint hyperparameter and architecture optimization. Finally, we discuss some open problems of the existing AutoML methods for future research., Comment: automated machine learning (AutoML), published in journal of Knowledge-Based Systems
Published: 2019
Full Text: View/download PDF

16. Vision-based Robotic Grasping From Object Localization, Object Pose Estimation to Grasp Estimation for Parallel Grippers: A Review

Author: Du, Guoguang, Wang, Kai, Lian, Shiguo, and Zhao, Kaiyong
Subjects: Computer Science - Robotics, Computer Science - Computer Vision and Pattern Recognition
Abstract: This paper presents a comprehensive survey on vision-based robotic grasping. We conclude three key tasks during vision-based robotic grasping, which are object localization, object pose estimation and grasp estimation. In detail, the object localization task contains object localization without classification, object detection and object instance segmentation. This task provides the regions of the target object in the input data. The object pose estimation task mainly refers to estimating the 6D object pose and includes correspondence-based methods, template-based methods and voting-based methods, which affords the generation of grasp poses for known objects. The grasp estimation task includes 2D planar grasp methods and 6DoF grasp methods, where the former is constrained to grasp from one direction. These three tasks could accomplish the robotic grasping with different combinations. Lots of object pose estimation methods need not object localization, and they conduct object localization and object pose estimation jointly. Lots of grasp estimation methods need not object localization and object pose estimation, and they conduct grasp estimation in an end-to-end manner. Both traditional methods and latest deep learning-based methods based on the RGB-D image inputs are reviewed elaborately in this survey. Related datasets and comparisons between state-of-the-art methods are summarized as well. In addition, challenges about vision-based robotic grasping and future directions in addressing these challenges are also pointed out., Comment: This is a pre-print of an article published in Artificial Intelligence Review. The final authenticated version is available online at: https://doi.org/10.1007/s10462-020-09888-5. Related refs are summarized at: https://github.com/GeorgeDu/vision-based-robotic-grasping
Published: 2019
Full Text: View/download PDF

17. A Distributed Synchronous SGD Algorithm with Global Top-$k$ Sparsification for Low Bandwidth Networks

Author: Shi, Shaohuai, Wang, Qiang, Zhao, Kaiyong, Tang, Zhenheng, Wang, Yuxin, Huang, Xiang, and Chu, Xiaowen
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Machine Learning
Abstract: Distributed synchronous stochastic gradient descent (S-SGD) has been widely used in training large-scale deep neural networks (DNNs), but it typically requires very high communication bandwidth between computational workers (e.g., GPUs) to exchange gradients iteratively. Recently, Top-$k$ sparsification techniques have been proposed to reduce the volume of data to be exchanged among workers. Top-$k$ sparsification can zero-out a significant portion of gradients without impacting the model convergence. However, the sparse gradients should be transferred with their irregular indices, which makes the sparse gradients aggregation difficult. Current methods that use AllGather to accumulate the sparse gradients have a communication complexity of $O(kP)$, where $P$ is the number of workers, which is inefficient on low bandwidth networks with a large number of workers. We observe that not all top-$k$ gradients from $P$ workers are needed for the model update, and therefore we propose a novel global Top-$k$ (gTop-$k$) sparsification mechanism to address the problem. Specifically, we choose global top-$k$ largest absolute values of gradients from $P$ workers, instead of accumulating all local top-$k$ gradients to update the model in each iteration. The gradient aggregation method based on gTop-$k$ sparsification reduces the communication complexity from $O(kP)$ to $O(k\log P)$. Through extensive experiments on different DNNs, we verify that gTop-$k$ S-SGD has nearly consistent convergence performance with S-SGD, and it has only slight degradations on generalization performance. In terms of scaling efficiency, we evaluate gTop-$k$ on a cluster with 32 GPU machines which are interconnected with 1 Gbps Ethernet. The experimental results show that our method achieves $2.7-12\times$ higher scaling efficiency than S-SGD and $1.1-1.7\times$ improvement than the existing Top-$k$ S-SGD., Comment: 10 pages. Add discussion with more experimental results. To appear at the ICDCS2019 workshop
Published: 2019

18. Efficient simulation of non-stationary non-homogeneous wind field: Fusion of multi-dimensional interpolation and NUFFT

Author: Tao, Tianyou, He, Jiaye, Wang, Hao, and Zhao, Kaiyong
Published: 2023
Full Text: View/download PDF

19. EASNet: Searching Elastic and Accurate Network Architecture for Stereo Matching

Author: Wang, Qiang, primary, Shi, Shaohuai, additional, Zhao, Kaiyong, additional, and Chu, Xiaowen, additional
Published: 2022
Full Text: View/download PDF

20. Efficient simulation of fully non-stationary random wind field based on reduced 2D hermite interpolation

Author: Tao, Tianyou, Wang, Hao, and Zhao, Kaiyong
Published: 2021
Full Text: View/download PDF

21. AutoML: A survey of the state-of-the-art

Author: He, Xin, Zhao, Kaiyong, and Chu, Xiaowen
Published: 2021
Full Text: View/download PDF

22. Vision-based robotic grasping from object localization, object pose estimation to grasp estimation for parallel grippers: a review

Author: Du, Guoguang, Wang, Kai, Lian, Shiguo, and Zhao, Kaiyong
Published: 2021
Full Text: View/download PDF

23. VoxelMap++: Mergeable Voxel Mapping Method for Online LiDAR(-Inertial) Odometry

Author: Wu, Chang, primary, You, Yuan, additional, Yuan, Yifei, additional, Kong, Xiaotong, additional, Zhang, Ying, additional, Li, Qiyan, additional, and Zhao, Kaiyong, additional
Published: 2024
Full Text: View/download PDF

24. Rethinking Disparity: A Depth Range Free Multi-View Stereo Based on Disparity

Author: Yan, Qingsong, primary, Wang, Qiang, additional, Zhao, Kaiyong, additional, Li, Bo, additional, Chu, Xiaowen, additional, and Deng, Fei, additional
Published: 2023
Full Text: View/download PDF

25. Analysis of wind vibration response of suspended derrick under downburst.

Author: Lin Yuxuan, Xu Zidong, Wang Hao, Liu Yaodong, Zhang Han, and Zhao Kaiyong
Subjects: MICROBURSTS, CONSTRUCTION safety measures, WIND speed, TENSILE strength, FINITE element method
Abstract: Copyright of Journal of Southeast University (English Edition) is the property of Journal of Southeast University Editorial Office and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)
Published: 2023
Full Text: View/download PDF

26. Rethinking Disparity: A Depth Range Free Multi-View Stereo based on Disparity

Author: Yan, Qingsong, Wang, Qiang, Zhao, Kaiyong, Li, Bo, Chu, Xiaowen, Deng, Fei, Yan, Qingsong, Wang, Qiang, Zhao, Kaiyong, Li, Bo, Chu, Xiaowen, and Deng, Fei
Abstract: Existing learning-based multi-view stereo (MVS) methods rely on the depth range to build the 3D cost volume and may fail when the range is too large or unreliable. To address this problem, we propose a disparity-based MVS method based on the epipolar disparity flow (E-flow), called DispMVS, which infers the depth information from the pixel movement between two views. The core of DispMVS is to construct a 2D cost volume on the image plane along the epipolar line between each pair (between the reference image and several source images) for pixel matching and fuse uncountable depths triangulated from each pair by multi-view geometry to ensure multi-view consistency. To be robust, DispMVS starts from a randomly initialized depth map and iteratively refines the depth map with the help of the coarse-to-fine strategy. Experiments on DTUMVS and Tanks&Temple datasets show that DispMVS is not sensitive to the depth range and achieves state-of-the-art results with lower GPU memory. Copyright © 2023, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
Published: 2023

27. Benchmarking the Memory Hierarchy of Modern GPUs

Author: Mei, Xinxin, Zhao, Kaiyong, Liu, Chengjian, Chu, Xiaowen, Hutchison, David, Series editor, Kanade, Takeo, Series editor, Kittler, Josef, Series editor, Kleinberg, Jon M., Series editor, Kobsa, Alfred, Series editor, Mattern, Friedemann, Series editor, Mitchell, John C., Series editor, Naor, Moni, Series editor, Nierstrasz, Oscar, Series editor, Pandu Rangan, C., Series editor, Steffen, Bernhard, Series editor, Terzopoulos, Demetri, Series editor, Tygar, Doug, Series editor, Weikum, Gerhard, Series editor, Hsu, Ching-Hsien, editor, Shi, Xuanhua, editor, and Salapura, Valentina, editor
Published: 2014
Full Text: View/download PDF

28. Study on the EPSD of Wind-Induced Responses of the Sutong Bridge Using Harmonic Wavelets.

Author: Xu, Zidong, Wang, Hao, and Zhao, Kaiyong
Subjects: TYPHOONS, STRUCTURAL health monitoring, LONG-span bridges, HARMONIC suppression filters, STOCHASTIC processes, CONTINUOUS bridges
Abstract: Many long-span bridges are located at typhoon prone regions. With the continuous increase of the bridge span, the typhoon-induced buffeting becomes more and more prominent. In this study, based on the structural health monitoring system installed in the Sutong Bridge, the recorded buffeting responses of the main girder during typhoons Damrey and Haikui were analyzed. The run test method demonstrated that the recorded acceleration responses can be regarded as zero-mean non-stationary random processes. Hence, to capture the energy distribution of the recorded data in the time-frequency domain, the evolutionary power spectral density (EPSD) estimation was conducted using efficient generalized harmonic wavelet (GHW) and filtered harmonic wavelet (FHW), respectively. Compared with the GHW, narrower wavelet bandwidth is required by the FHW to yield a compromise between the time and frequency resolution. For the FHW-based method, the power spectral density amplitudes of the averaging EPSDs are slightly larger for certain major frequency components than those obtained by the Pwelch method. Results show that the non-stationary features of the buffeting of long-span bridges during Typhoon events should be considered. This study can also provide references for non-stationary buffeting analysis of other long-span bridges during extreme wind events. [ABSTRACT FROM AUTHOR]
Published: 2023
Full Text: View/download PDF

29. Practical Random Linear Network Coding on GPUs

Author: Chu, Xiaowen, Zhao, Kaiyong, Yuen, David A., editor, Wang, Long, editor, Chi, Xuebin, editor, Johnsson, Lennart, editor, Ge, Wei, editor, and Shi, Yaolin, editor
Published: 2013
Full Text: View/download PDF

30. SphereDepth: Panorama Depth Estimation from Spherical Domain

Author: Yan, Qingsong, primary, Wang, Qiang, additional, Zhao, Kaiyong, additional, Li, Bo, additional, Chu, Xiaoweo, additional, and Deng, Fei, additional
Published: 2022
Full Text: View/download PDF

31. Practical Random Linear Network Coding on GPUs

Author: Chu, Xiaowen, Zhao, Kaiyong, Wang, Mea, Hutchison, David, Series editor, Kanade, Takeo, Series editor, Kittler, Josef, Series editor, Kleinberg, Jon M., Series editor, Mattern, Friedemann, Series editor, Mitchell, John C., Series editor, Naor, Moni, Series editor, Nierstrasz, Oscar, Series editor, Pandu Rangan, C., Series editor, Steffen, Bernhard, Series editor, Sudan, Madhu, Series editor, Terzopoulos, Demetri, Series editor, Tygar, Doug, Series editor, Vardi, Moshe Y., Series editor, Weikum, Gerhard, Series editor, Fratta, Luigi, editor, Schulzrinne, Henning, editor, Takahashi, Yutaka, editor, and Spaniol, Otto, editor
Published: 2009
Full Text: View/download PDF

32. SepRep: A Novel Reputation Evaluation Model in Peer-to-Peer Networks

Author: Chen, Xiaowei, Zhao, Kaiyong, Chu, Xiaowen, Hutchison, David, editor, Kanade, Takeo, editor, Kittler, Josef, editor, Kleinberg, Jon M., editor, Mattern, Friedemann, editor, Mitchell, John C., editor, Naor, Moni, editor, Nierstrasz, Oscar, editor, Pandu Rangan, C., editor, Steffen, Bernhard, editor, Sudan, Madhu, editor, Terzopoulos, Demetri, editor, Tygar, Doug, editor, Vardi, Moshe Y., editor, Weikum, Gerhard, editor, Rong, Chunming, editor, Jaatun, Martin Gilje, editor, Sandnes, Frode Eika, editor, Yang, Laurence T., editor, and Ma, Jianhua, editor
Published: 2008
Full Text: View/download PDF

33. Detecting “protein words” through unsupervised word segmentation [version 1; referees: 1 approved with reservations, 1 not approved]

Author: Wang Liang and Zhao Kaiyong
Subjects: Method Article, Articles, Protein Chemistry & Proteomics, Structural Genomics, Theory & Simulation, Word segmentation, Protein sequence, Protein secondary structure, Unsupervised method, Soft counting, Protein word, Gene finding, Description length
Abstract: Unsupervised word segmentation methods were applied to analyze protein sequences. Protein sequences, such as “MTMDKSELVQKA…,” were used as input to these methods. Segmented protein word sequences, such as “MTM DKSE LVQKA,” were then obtained. We compared the protein words derived via unsupervised segmentation and protein secondary structure segmentation. An interesting finding is that unsupervised word segmentation is more efficient than secondary structure segmentation in expressing information. Our experiment also suggests the presence of several “protein ruins” in current non-coding regions.
Published: 2015
Full Text: View/download PDF

34. A parallel lattice Boltzmann method for large eddy simulation on multiple GPUs

Author: Li, Qinjian, Zhong, Chengwen, Li, Kai, Zhang, Guangyong, Lu, Xiaowei, Zhang, Qing, Zhao, Kaiyong, and Chu, Xiaowen
Published: 2014
Full Text: View/download PDF

35. Benchmarking the Memory Hierarchy of Modern GPUs

Author: Mei, Xinxin, primary, Zhao, Kaiyong, additional, Liu, Chengjian, additional, and Chu, Xiaowen, additional
Published: 2014
Full Text: View/download PDF

36. IRS: A Large Naturalistic Indoor Robotics Stereo Dataset to Train Deep Models for Disparity and Surface Normal Estimation

Author: Wang, Qiang, primary, Zheng, Shizhen, additional, Yan, Qingsong, additional, Deng, Fei, additional, Zhao, Kaiyong, additional, and Chu, Xiaowen, additional
Published: 2021
Full Text: View/download PDF

37. Reputation and trust management in heterogeneous peer-to-peer networks

Author: Chu, Xiaowen, Chen, Xiaowei, Zhao, Kaiyong, and Liu, Jiangchuan
Published: 2010
Full Text: View/download PDF

38. G-BLASTN: accelerating nucleotide alignment by graphics processors

Author: Zhao, Kaiyong and Chu, Xiaowen
Published: 2014
Full Text: View/download PDF

39. Benchmarking the Performance and Energy Efficiency of AI Accelerators for AI Training

Author: Wang, Yuxin, Wang, Qiang, Shi, Shaohuai, He, Xin, Tang, Zhenheng, Zhao, Kaiyong, Chu, Xiaowen, Wang, Yuxin, Wang, Qiang, Shi, Shaohuai, He, Xin, Tang, Zhenheng, Zhao, Kaiyong, and Chu, Xiaowen
Abstract: Deep learning has become widely used in complex AI applications. Yet, training a deep neural network (DNNs) model requires a considerable amount of calculations, long running time, and much energy. Nowadays, many-core AI accelerators (e.g., GPUs and TPUs) are designed to improve the performance of AI training. However, processors from different vendors perform dissimilarly in terms of performance and energy consumption. To investigate the differences among several popular off-the-shelf processors (i.e., Intel CPU, NVIDIA GPU, AMD GPU, and Google TPU) in training DNNs, we carry out a comprehensive empirical study on the performance and energy efficiency of these processors1 by benchmarking a representative set of deep learning workloads, including computation-intensive operations, classical convolutional neural networks (CNNs), recurrent neural networks (LSTM), Deep Speech 2, and Transformer. Different from the existing end-to-end benchmarks which only present the training time, We try to investigate the impact of hardware, vendor's software library, and deep learning framework on the performance and energy consumption of AI training. Our evaluation methods and results not only provide an informative guide for end users to select proper AI accelerators, but also expose some opportunities for the hardware vendors to improve their software library. © 2020 IEEE.
Published: 2020

40. Layer-wise Adaptive Gradient Sparsification for Distributed Deep Learning with Convergence Guarantees

Author: Shi, Shaohuai, Tang, Zhenheng, Wang, Qiang, Zhao, Kaiyong, Chu, Xiaowen, Shi, Shaohuai, Tang, Zhenheng, Wang, Qiang, Zhao, Kaiyong, and Chu, Xiaowen
Published: 2020

41. Non-Stationary Turbulent Wind Field Simulation of Long-Span Bridges Using the Updated Non-Negative Matrix Factorization-Based Spectral Representation Method

Author: Hao Wang, Zhao Kaiyong, Qingxin Zhu, Zhang Han, Zidong Xu, and Hui Gao
Subjects: Computer science, Fast Fourier transform, non-stationary turbulence simulation, suspension bridge, 020101 civil engineering, 02 engineering and technology, 0201 civil engineering, Matrix decomposition, Non-negative matrix factorization, non-negative matrix factorization, 0203 mechanical engineering, General Materials Science, Instrumentation, Physics::Atmospheric and Oceanic Physics, Fluid Flow and Transfer Processes, evolutionary power spectral density, Computer simulation, Basis (linear algebra), Turbulence, Process Chemistry and Technology, Mathematical analysis, General Engineering, Spectral density, Aeroelasticity, Computer Science Applications, 020303 mechanical engineering & transports, fast Fourier transform (FFT), Physics::Space Physics
Abstract: Numerical simulation of the turbulent wind field on long-span bridges is an important task in structural buffeting analysis when it comes to the system non-linearity. As for non-stationary extreme wind events, some efforts have been paid to update the classic spectral representation method (SRM) and the fast Fourier transform (FFT) has been introduced to improve the computational efficiency. Here, the non-negative matrix factorization-based FFT-aided SRM has been updated to generate not only the horizontal non-stationary turbulent wind field, but also the vertical one. Specifically, the evolutionary power spectral density (EPSD) is estimated to characterize the non-stationary feature of the field-measured wind data during Typhoon Wipha at the Runyang Suspension Bridge (RSB) site. The coherence function considering the phase angles is utilized to generate the turbulent wind fields for towers. The simulation accuracy is validated by comparing the simulated and target auto-/cross-correlation functions. Results show that the updated method performs well in generating the non-stationary turbulent wind field. The obtained wind fields will provide the research basis for analyzing the non-stationary buffeting behavior of the RSB and other wind-sensitive structures in adjacent regions.
Published: 2019
Full Text: View/download PDF

42. Practical Random Linear Network Coding on GPUs

Author: Chu, Xiaowen, primary, Zhao, Kaiyong, additional, and Wang, Mea, additional
Published: 2009
Full Text: View/download PDF

43. SOAP3: ultra-fast GPU-based parallel alignment tool for short reads

Author: Liu, Chi-Man, Wong, Thomas, Wu, Edward, Luo, Ruibang, Yiu, Siu-Ming, Li, Yingrui, Wang, Bingqiang, Yu, Chang, Chu, Xiaowen, Zhao, Kaiyong, Li, Ruiqiang, and Lam, Tak-Wah
Published: 2012
Full Text: View/download PDF

44. Vision-based robotic grasping from object localization, object pose estimation to grasp estimation for parallel grippers: a review

Author: Du, Guoguang, primary, Wang, Kai, additional, Lian, Shiguo, additional, and Zhao, Kaiyong, additional
Published: 2020
Full Text: View/download PDF

45. FADNet: A Fast and Accurate Network for Disparity Estimation

Author: Wang, Qiang, primary, Shi, Shaohuai, additional, Zheng, Shizhen, additional, Zhao, Kaiyong, additional, and Chu, Xiaowen, additional
Published: 2020
Full Text: View/download PDF

46. Benchmarking the Performance and Energy Efficiency of AI Accelerators for AI Training

Author: Wang, Yuxin, primary, Wang, Qiang, additional, Shi, Shaohuai, additional, He, Xin, additional, Tang, Zhenheng, additional, Zhao, Kaiyong, additional, and Chu, Xiaowen, additional
Published: 2020
Full Text: View/download PDF

47. Non-Stationary Turbulent Wind Field Simulation of Long-Span Bridges Using the Updated Non-Negative Matrix Factorization-Based Spectral Representation Method

Author: Xu, Zidong, primary, Wang, Hao, additional, Zhang, Han, additional, Zhao, Kaiyong, additional, Gao, Hui, additional, and Zhu, Qingxin, additional
Published: 2019
Full Text: View/download PDF

48. Efficient simulation of fully non-stationary random wind field based on reduced 2D hermite interpolation

Author: Zhao Kaiyong, Tianyou Tao, and Hao Wang
Subjects: 0209 industrial biotechnology, Hermite polynomials, Computer science, Mechanical Engineering, Fast Fourier transform, MathematicsofComputing_NUMERICALANALYSIS, Wind field, Aerospace Engineering, Spectral density, 02 engineering and technology, 01 natural sciences, Computer Science Applications, 020901 industrial engineering & automation, Control and Systems Engineering, Hermite interpolation, 0103 physical sciences, Signal Processing, Coherence (signal processing), Trigonometry, 010301 acoustics, Algorithm, Civil and Structural Engineering, Cholesky decomposition
Abstract: The spectral representation method (SRM) has been widely used to simulate stationary or non-stationary wind fields for engineering structures. Although several attempts have been made to realize the invoking of Fast Fourier Transform (FFT), the SRM is still very inefficient to simulate the fully non-stationary wind field with a time-varying coherence due to the extremely time-consuming Cholesky decompositions and large memory requirement. In this paper, a reduced 2D Hermite interpolation-enhanced approach is developed to further improve the efficiency of SRM in simulating fully non-stationary wind fields. Central to this approach is the interpolation procedure which requires Cholesky decompositions and storage of cross power spectral density matrix (CEPSD) elements only at interpolation knots. Thus the computational costs of Cholesky decompositions and memory requirement are dramatically decreased. The number of Cholesky decompositions is then fixed with no relation to the segments of frequency and duration of wind samples, which eliminates the Cholesky decomposition as a cause that affects the simulation efficiency. Meanwhile, each element in the decomposed CEPSD matrix is decoupled into products of time- and frequency-dependent functions by the reduced 2D Hermite interpolation, so the FFT can be used to expedite the summation of trigonometric terms. Apart from using FFT, another merit of the proposed approach is that an accelerated FFT algorithm can be incorporated to further improve the simulation efficiency based on the specific decoupled expression of frequency-dependent functions. The parametric analysis shows that the proposed approach is very efficient in comparison with the existing method using proper orthogonal decomposition (POD), and it provides a desired level of simulation accuracy when appropriate interpolation interval is selected. The case study in simulating the fully non-stationary wind field of a long-span cable-stayed bridge demonstrates the effectiveness of the proposed approach with verifications on both evolutionary power spectra and correlation functions.
Published: 2021

49. A distributed synchronous SGD algorithm with global top-k Sparsification for low bandwidth networks

Author: Shi, Shaohuai, Wang, Qiang, Zhao, Kaiyong, Tang, Zhenheng, Wang, Yuxin, Huang, Xiang, Chu, Xiaowen, Shi, Shaohuai, Wang, Qiang, Zhao, Kaiyong, Tang, Zhenheng, Wang, Yuxin, Huang, Xiang, and Chu, Xiaowen
Abstract: Distributed synchronous stochastic gradient descent (S-SGD) with data parallelism has been widely used in training large-scale deep neural networks (DNNs), but it typically requires very high communication bandwidth between computational workers (e.g., GPUs) to exchange gradients iteratively. Recently, Top-k sparsification techniques have been proposed to reduce the volume of data to be exchanged among workers and thus alleviate the network pressure. Top-k sparsification can zero-out a significant portion of gradients without impacting the model convergence. However, the sparse gradients should be transferred with their indices, and the irregular indices make the sparse gradients aggregation difficult. Current methods that use AllGather to accumulate the sparse gradients have a communication complexity of O(kP), where P is the number of workers, which is inefficient on low bandwidth networks with a large number of workers. We observe that not all top-k gradients from P workers are needed for the model update, and therefore we propose a novel global Top-k (gTop-k) sparsification mechanism to address the difficulty of aggregating sparse gradients. Specifically, we choose global top-k largest absolute values of gradients from P workers, instead of accumulating all local top-k gradients to update the model in each iteration. The gradient aggregation method based on gTop-k sparsification, namely gTopKAllReduce, reduces the communication complexity from O(kP) to O(k log P). Through extensive experiments on different DNNs, we verify that gTop-k S-SGD has nearly consistent convergence performance with S-SGD, and it has only slight degradations on generalization performance. In terms of scaling efficiency, we evaluate gTop-k on a cluster with 32 GPU machines which are interconnected with 1 Gbps Ethernet. The experimental results show that our method achieves 2.7-12× higher scaling efficiency than S-SGD with dense gradients and 1.1-1.7× improvement than the existing Top-k S-S
Published: 2019

50. A Convergence Analysis of Distributed SGD with Communication-Efficient Gradient Sparsification

Author: Shi, Shaohuai, primary, Zhao, Kaiyong, additional, Wang, Qiang, additional, Tang, Zhenheng, additional, and Chu, Xiaowen, additional
Published: 2019
Full Text: View/download PDF

Catalog

Books, media, physical & digital resources

See catalog results

Searchworks

Select search scope, currently: Articles Catalog books, media & more in Jio Institute collections Articles journal articles & other e-resources

Search

Search Constraints

Refine your results

Search Limiters

Topic

Publication Year Range

Language

Publication Type

Journal

Database

Publisher

131 results on '"Zhao, Kaiyong"'

Search Results

Catalog

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources