Descriptor: "Compute Unified Device Architecture (CUDA)" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Compute Unified Device Architecture (CUDA)"' showing total 318 results

1. CompreCity: Accelerating the Traveling Salesman Problem on GPU with data compression

Author: Yalcin, Salih, Usul, Hamdi Burak, and Yalcin, Gulay
Published: 2025

2. rupMC: a ray-unit parallel marching cubes algorithm on CPU/GPU heterogeneous architectures

Author: Xue Yang, Shuo Yun, Qingfeng Guan, and Huan Gao
Subjects: Open-source Message Passing Interface (OpenMPI), Compute Unified Device Architecture (CUDA), CPU/GPU heterogeneous architecture, Iso-surface extraction, marching cubes (MC), Mathematical geography. Cartography, GA1-1776
Abstract: The marching cubes (MC) algorithm is widely used for extracting isosurfaces from volume data and 3D visualizations because of its effectiveness and robustness but requires extensive memory and computing time for large-scale applications. Additionally, MC isosurfaces lack topologic information, making them difficult to use in some geologic applications. To overcome these limitations, this study proposes an enhanced MC using CPU/GPU heterogeneous architecture called the ray-unit parallel MC (rupMC) algorithm. First, ray units form the basic voxel to determine how the surface intersects to reduce repeated computations and enhance efficiency. Then, rupMC uses multiple computing processes and threads on a CPU/GPU heterogeneous architecture to process points concurrently. Finally, the unique surface intersection indices are preserved to compose the surface triangles, and the topological surface information is directly embedded in the triangle compositions. Experiments on five stratum datasets of varying sizes demonstrated that rupMC ran dozens of times faster than other serial MC implementations and 4 times faster than a parallel DMC. rupMC demonstrated high scalability and adaptability to various CPUs/GPUs and datasets of various sizes. rupMC has remarkable capabilities for efficiently and feasibly extracting precise surface intersections and triangles, making it well-suited for large-scale and high-density applications.
Published: 2024

3. GPU-oriented parallel algorithm for terrain masking detection.

Author: 孙卡 and 俞俗强
Abstract: Copyright of Journal of Computer Engineering & Applications is the property of Beijing Journal of Computer Engineering & Applications Journal Co Ltd. and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)
Published: 2024

4. A parallel computing framework for real-time moving object detection on high resolution videos.

Author: Hashmi, Mohammad Farukh, Ayele, Eskinder, Naik, Banoth Thulasya, and Keskar, Avinash G.
Subjects: OBJECT recognition (Computer vision), REAL-time computing, PARALLEL programming, VIDEO surveillance, CAMCORDERS, PARALLEL processing
Abstract: Graphic Processing Units (GPUs) are becoming very important in the present day. Their high computational capabilities with high speed and accuracy are making them a very strong force in communication engineering. In recent times, their need has increased tremendously due to the increasing range of applications. Video surveillance is an important field where very heavy computations must be performed on videos to reliably detect the motion of an object in suspicious situations. The various analyses on video can be used to extract information and process data to generate actionable intelligent conclusions. However, CPUs fail to deliver real-time results when it comes to high-resolution videos from a large number of cameras simultaneously. Thankfully, there is a lot of graphic hardware available nowadays, which comprises powerful hardware processors often intended to process data in parallel and so greatly accelerates the processes being done on them. An accelerated algorithm is required for processing petabytes of data from security cameras and video surveillance satellites in real time. In this paper, we propose a method of using GPUs in detecting the motion of an object at different junctions in video surveillance. The results show a great gain in performance when the proposed method runs on GPUs and CPUs in terms of speed as well as accuracy. The new parallel processing approaches are developed on each phase of the algorithm to enhance the efficiency of the system. The proposed algorithm achieved an average speed-up of 50.094x for lower resolution video frames (320 × 240, 720 × 480, 1024 × 768) and 38.012x for higher resolution video frames (1360 × 768, 1920 × 1080) on GPU, which is superior to CPU processing. [ABSTRACT FROM AUTHOR]
Published: 2024

5. GPU Implementation and Optimization of a High-Order Spectral Difference Method for Aeroacoustic Problems.

Author: Zhang, Dongfei and Gao, Junhui
Subjects: *GRAPHICS processing units, *COMPUTING platforms, *PARALLEL programming, *AEROACOUSTICS
Abstract: This study focuses on the implementation of the spectral difference (SD) method on hexahedral elements on NVIDIA graphics processing units (GPUs) using the Compute Unified Device Architecture (CUDA) for aeroacoustic problems. Three problems were addressed in the implementation of this study: thread parallelism strategy optimization within the GPU, data access patterns management, and multi-GPU parallelization implementation. Computational speed testing showed that the three factors significantly affect the efficiency of the code on the GPU. The implemented GPU solver was validated using an inviscid problem and a viscous problem. The numerical results show that the GPU solver achieves the same level of accuracy as the CPU program, with remarkable speed improvements. Specifically, compared with a single CPU core with a turbo boost frequency of 3.2 GHz (Intel Xeon Silver 4210), the inviscid case tested on an RTX 2070 Super GPU achieved acceleration of 122.4×, and the viscous case conducted on an RTX 3090 GPU achieved acceleration of 229.7×. Additionally, the GPU solver exhibits a parallel efficiency exceeding 93% when performing parallel computing on a platform with multiple RTX 3090 GPU cards. Furthermore, the GPU-accelerated computational aeroacoustics solver was applied to compute the noise from a low-speed propeller. The computed results were compared with experimental data, and the excellent agreement demonstrated the effectiveness and feasibility of the GPU implementation of the SD solver. [ABSTRACT FROM AUTHOR]
Published: 2024

6. High-precision parallel computing model of solute transport based on GPU acceleration.

Author: Zhang, Shang-hong, Zhang, Rong-qi, Li, Wen-da, Yang, Xi-yan, and Zhou, Yang
Abstract: The scenario simulation analysis of water environmental emergencies is very important for risk prevention and control, and emergency response. To quickly and accurately simulate the transport and diffusion process of high-intensity pollutants during sudden environmental water pollution events, in this study, a high-precision pollution transport and diffusion model for unstructured grids based on Compute Unified Device Architecture (CUDA) is proposed. The finite volume method of a total variation diminishing limiter with the Kong proposed r-factor is used to reduce numerical diffusion and oscillation errors in the simulation of pollutants under sharp concentration conditions, and graphics processing unit acceleration technology is used to improve computational efficiency. The advection diffusion process of the model is verified numerically using two benchmark cases, and the efficiency of the model is evaluated using an engineering example. The results demonstrate that the model performs well in the simulation of material transport in the presence of sharp concentration. Additionally, it has high computational efficiency. The accelerated model runs 46 times faster than the single-threaded original model. The efficiency of the accelerated model meets the requirements of an engineering application, and the rapid early warning and assessment of water pollution accidents is achieved. [ABSTRACT FROM AUTHOR]
Published: 2024

7. Enabling Sustainable Development Through Artificial Intelligence-Based Surveillance System on Cloud Platform

Author: Kharbanda, Aryaman, Rana, Varun, Baghela, Nakshatra Kumar, Fatima, Mehtab, Filipe, Joaquim, Editorial Board Member, Ghosh, Ashish, Editorial Board Member, Prates, Raquel Oliveira, Editorial Board Member, Zhou, Lizhu, Editorial Board Member, Whig, Pawan, editor, Silva, Nuno, editor, Elngar, Ahmed A., editor, Aneja, Nagender, editor, and Sharma, Pavika, editor
Published: 2023

8. CATE: A fast and scalable CUDA implementation to conduct highly parallelized evolutionary tests on large scale genomic data

Author: Deshan Perera, Elsa Reisenhofer, Said Hussein, Eve Higgins, Christian D. Huber, and Quan Long
Subjects: compute unified device architecture (CUDA), multiprocessing and threading, population genetics, tests for molecular evolution, Ecology, QH540-549.5, Evolution, QH359-425
Abstract: Statistical tests for molecular evolution provide quantifiable insights into the selection pressures that govern a genome's evolution. Increasing sample sizes used for analysis leads to higher statistical power. However, this requires more computational nodes or longer computational time. CATE (CUDA Accelerated Testing of Evolution) is a computational solution to this problem comprised of two main innovations. The first is a file organization system coupled with a novel search algorithm and the second is a large‐scale parallelization of algorithms using both graphical processing unit (GPU) and central processing unit. CATE is capable of conducting evolutionary tests such as Tajima's D, Fu and Li's, and Fay and Wu's test statistics, McDonald–Kreitman Neutrality Index, Fixation Index and Extended Haplotype Homozygosity. CATE is magnitudes faster than standard tools with benchmarks estimating it being on average over 180 times faster. For instance, CATE processes all 54,849 human genes for all 22 autosomal chromosomes across the five super populations present in the 1000 Genomes Project in less than 30 min while counterpart software took 3.62 days. This proven framework has the potential to be adapted for GPU‐accelerated large‐scale parallel analyses of many evolutionary and genomic analyses.
Published: 2023

9. Research on a GPU parallel algorithm for asymptotic homogenization in multiscale topology optimization.

Author: 夏兆辉, 刘健力, 高百川, 聂涛, 余琛, 陈龙, and 余金桂
Abstract: Copyright of Journal of Zhejiang University (Science Edition) is the property of Journal of Zhejiang University (Science Edition) Editorial Office.
Published: 2023

10. GPU Implementation of Graph-Regularized Sparse Unmixing With Superpixel Structures

Author: Zeng Li, Jie Chen, Muhammad Mobeen Movania, and Susanto Rahardja
Subjects: Compute unified device architecture (CUDA), graphics processing unit (GPU), hyperspectral images, parallel processing, sparse unmixing, superpixel, Ocean engineering, TC1501-1800, Geophysics. Cosmic physics, QC801-809
Abstract: To enhance spectral unmixing performance, a large number of algorithms have simultaneously investigated spatial and spectral information in hyperspectral images. However, sophisticated algorithms with high computational complexity can be very time-consuming when a large amount of data is involved in processing hyperspectral images. In this article, we first introduce a group sparse graph-regularized unmixing method with superpixel structure, to promote piecewise consistency of abundances and reduce computational burden. Segmenting the image into several nonoverlapped superpixels also makes it possible to decompose the unmixing problem into uncoupled subproblems that can be processed in parallel. An implementation for the proposed algorithm on graphics processing units (GPUs) is then developed based on the NVIDIA compute unified device architecture (CUDA) framework. The proposed scheme achieves parallelism at both the intrasuperpixel and intersuperpixel levels, where multiple concurrent streams have been used to enable multiple kernels to execute on the device simultaneously. Simulation results with a series of experiments demonstrate advantages of the proposed algorithm. The performance of the GPU implementation also illustrates that the parallel scheme largely expedites the implementation.
Published: 2023

11. CATE: A fast and scalable CUDA implementation to conduct highly parallelized evolutionary tests on large scale genomic data.

Author: Perera, Deshan, Reisenhofer, Elsa, Hussein, Said, Higgins, Eve, Huber, Christian D., and Long, Quan
Subjects: MOLECULAR evolution, CENTRAL processing units, GENOMICS, HAPLOTYPES, STATISTICAL power analysis, HOMOZYGOSITY, EVOLUTIONARY algorithms, SEARCH algorithms
Abstract: Statistical tests for molecular evolution provide quantifiable insights into the selection pressures that govern a genome's evolution. Increasing sample sizes used for analysis leads to higher statistical power. However, this requires more computational nodes or longer computational time. CATE (CUDA Accelerated Testing of Evolution) is a computational solution to this problem comprised of two main innovations. The first is a file organization system coupled with a novel search algorithm and the second is a large‐scale parallelization of algorithms using both graphical processing unit (GPU) and central processing unit. CATE is capable of conducting evolutionary tests such as Tajima's D, Fu and Li's, and Fay and Wu's test statistics, McDonald–Kreitman Neutrality Index, Fixation Index and Extended Haplotype Homozygosity. CATE is magnitudes faster than standard tools with benchmarks estimating it being on average over 180 times faster. For instance, CATE processes all 54,849 human genes for all 22 autosomal chromosomes across the five super populations present in the 1000 Genomes Project in less than 30 min while counterpart software took 3.62 days. This proven framework has the potential to be adapted for GPU‐accelerated large‐scale parallel analyses of many evolutionary and genomic analyses. [ABSTRACT FROM AUTHOR]
Published: 2023

12. Speeding up Genetic Programming Based Symbolic Regression Using GPUs

Author: Zhang, Rui, Lensen, Andrew, Sun, Yanan, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Khanna, Sankalp, editor, Cao, Jian, editor, Bai, Quan, editor, and Xu, Guandong, editor
Published: 2022

13. GPU-Oriented Parallel Algorithm for Histogram Statistical Image Enhancement

Author: XIAO Han, SUN Lupeng, LI Cailin, ZHOU Qinglei
Subjects: histogram statistics, local enhancement, local mean, graphics processing unit (GPU), compute unified device architecture (CUDA), parallel algorithm, Electronic computers. Computer science, QA75.5-76.95
Abstract: Histogram statistics has important applications in the fields of image enhancement and target detection. However, with the increasing size of the image and the higher real-time requirements, the processing process of the histogram statistical local enhancement algorithm is slow and cannot reach the expected satisfactory speed. In view of this deficiency, this paper realizes the parallel processing of histogram statistical image enhancement algorithm on graphics processing unit (GPU) platform, which improves the processing speed of large format digital images. Firstly, the efficiency of data access is improved by making full use of compute unified device architecture (CUDA) active thread block and active thread to process different sub-image blocks and pixels in parallel. Then, the parallelization of histogram statistical image enhancement algorithm on GPU platform is realized by using kernel configuration parameter optimization and data parallel computing technology. Finally, the efficient data transmission mode between the host and the device is adopted, which further shortens the execution time of the system on the heterogeneous computing platform. The results show that for images with different image sizes, the processing speed of the image histogram statistical parallel algorithm is two orders of magnitude higher than that of the CPU serial algorithm. It takes 787.11 ms to process an image with an image size of 3241×3685. The processing speed of the parallel algorithm is increased by 261.35 times. It lays a good foundation for the realization of real-time large-scale image processing.
Published: 2022

14. High-performance solutions of geographically weighted regression in R

Author: Binbin Lu, Yigong Hu, Daisuke Murakami, Chris Brunsdon, Alexis Comber, Martin Charlton, and Paul Harris
Subjects: Non-stationarity, big data, parallel computing, Compute Unified Device Architecture (CUDA), Geographically Weighted models (GWmodel), Mathematical geography. Cartography, GA1-1776, Geodesy, QB275-343
Abstract: As an established spatial analytical tool, Geographically Weighted Regression (GWR) has been applied across a variety of disciplines. However, its usage can be challenging for large datasets, which are increasingly prevalent in today’s digital world. In this study, we propose two high-performance R solutions for GWR via Multi-core Parallel (MP) and Compute Unified Device Architecture (CUDA) techniques, respectively GWR-MP and GWR-CUDA. We compared GWR-MP and GWR-CUDA with three existing solutions available in Geographically Weighted Models (GWmodel), Multi-scale GWR (MGWR) and Fast GWR (FastGWR). Results showed that all five solutions perform differently across varying sample sizes, with no single solution a clear winner in terms of computational efficiency. Specifically, solutions given in GWmodel and MGWR provided acceptable computational costs for GWR studies with a relatively small sample size. For a large sample size, GWR-MP and FastGWR provided coherent solutions on a Personal Computer (PC) with a common multi-core configuration, and GWR-MP provided more efficient computing capacity for each core or thread than FastGWR. For cases when the sample size was very large, and for these cases only, GWR-CUDA provided the most efficient solution, although its I/O cost with small samples should be noted. In summary, GWR-MP and GWR-CUDA provided complementary high-performance R solutions to existing ones, where for certain data-rich GWR studies, they should be preferred.
Published: 2022

15. High-performance solutions of geographically weighted regression in R.

Author: Lu, Binbin, Hu, Yigong, Murakami, Daisuke, Brunsdon, Chris, Comber, Alexis, Charlton, Martin, and Harris, Paul
Subjects: SAMPLE size (Statistics), PARALLEL programming
Abstract: As an established spatial analytical tool, Geographically Weighted Regression (GWR) has been applied across a variety of disciplines. However, its usage can be challenging for large datasets, which are increasingly prevalent in today's digital world. In this study, we propose two high-performance R solutions for GWR via Multi-core Parallel (MP) and Compute Unified Device Architecture (CUDA) techniques, respectively GWR-MP and GWR-CUDA. We compared GWR-MP and GWR-CUDA with three existing solutions available in Geographically Weighted Models (GWmodel), Multi-scale GWR (MGWR) and Fast GWR (FastGWR). Results showed that all five solutions perform differently across varying sample sizes, with no single solution a clear winner in terms of computational efficiency. Specifically, solutions given in GWmodel and MGWR provided acceptable computational costs for GWR studies with a relatively small sample size. For a large sample size, GWR-MP and FastGWR provided coherent solutions on a Personal Computer (PC) with a common multi-core configuration, and GWR-MP provided more efficient computing capacity for each core or thread than FastGWR. For cases when the sample size was very large, and for these cases only, GWR-CUDA provided the most efficient solution, although its I/O cost with small samples should be noted. In summary, GWR-MP and GWR-CUDA provided complementary high-performance R solutions to existing ones, where for certain data-rich GWR studies, they should be preferred. [ABSTRACT FROM AUTHOR]
Published: 2022

16. Optimization of Ray-Tracing Algorithm for Simulation of PMD Sensors

Author: Lade, Sangita, Kulkarni, Purva, Saraf, Prasad, Nartam, Purva, Patil, Aniket, Kacprzyk, Janusz, Series Editor, Pal, Nikhil R., Advisory Editor, Bello Perez, Rafael, Advisory Editor, Corchado, Emilio S., Advisory Editor, Hagras, Hani, Advisory Editor, Kóczy, László T., Advisory Editor, Kreinovich, Vladik, Advisory Editor, Lin, Chin-Teng, Advisory Editor, Lu, Jie, Advisory Editor, Melin, Patricia, Advisory Editor, Nedjah, Nadia, Advisory Editor, Nguyen, Ngoc Thanh, Advisory Editor, Wang, Jun, Advisory Editor, Swain, Debabala, editor, Pattnaik, Prasant Kumar, editor, and Athawale, Tushar, editor
Published: 2021

17. Real-Time Variants of Vertical Synchrosqueezing: Application to Radar Remote Sensing

Author: Karol Abratkiewicz and Jacek Gambrych
Subjects: Compute unified device architecture (CUDA), radar remote sensing, radar signal processing, time–frequency (TF) analysis, vertical synchrosqueezing (VSS), Ocean engineering, TC1501-1800, Geophysics. Cosmic physics, QC801-809
Abstract: This article presents thorough research on using high-order vertical synchrosqueezing (VSS) in different radar remote sensing applications. The method well established in the literature is examined and compared to the novel form of third-order VSS and first-order VSS using an enhanced estimator of the instantaneous frequency, both proposed by the authors. An investigation shows that the two introduced variants of VSS are characterized by preserved capabilities (understood as the possibility to concentrate the time-frequency distribution and its reconstruction) with significantly reduced computation cost. The research shows that in practical radar remote sensing applications, high-order VSS can be successfully replaced by the approach proposed in this article with a lower computational burden. Furthermore, the methods are validated under numerical experiments, both simulated and real-life, which showed the efficiency of the proposed methods in radar signal processing, particular component extraction, and signal decomposition. Moreover, the authors developed the real-time graphical-processing-unit-based implementation of the proposed techniques and presented its efficiency in practical conditions.
Published: 2022

18. Accelerating Edit-Distance Sequence Alignment on GPU Using the Wavefront Algorithm

Author: Quim Aguado-Puig, Santiago Marco-Sola, Juan Carlos Moure, David Castells-Rufas, Lluc Alvarez, Antonio Espinosa, and Miquel Moreto
Subjects: Approximate string matching, compute unified device architecture (CUDA), edit-distance, graphics processing unit (GPU), Levenshtein distance, pairwise sequence alignment, Electrical engineering. Electronics. Nuclear engineering, TK1-9971
Abstract: Sequence alignment remains a fundamental problem with practical applications ranging from pattern recognition to computational biology. Traditional algorithms based on dynamic programming are hard to parallelize, require significant amounts of memory, and fail to scale for large inputs. This work presents eWFA-GPU, a GPU (graphics processing unit)-accelerated tool to compute the exact edit-distance sequence alignment based on the wavefront alignment algorithm (WFA). This approach exploits the similarities between the input sequences to accelerate the alignment process while requiring less memory than other algorithms. Our implementation takes full advantage of the massive parallel capabilities of modern GPUs to accelerate the alignment process. In addition, we propose a succinct representation of the alignment data that successfully reduces the overall amount of memory required, allowing the exploitation of the fast shared memory of a GPU. Our results show that our GPU implementation outperforms the baseline edit-distance WFA implementation running on a 20-core machine by 3-9×. As a result, eWFA-GPU is up to 265 times faster than state-of-the-art CPU implementation, and up to 56 times faster than state-of-the-art GPU implementations.
Published: 2022

19. GPU-Oriented Parallel Algorithm for Histogram Statistical Image Enhancement.

Author: 肖汉, 孙陆鹏, 李彩林, and 周清雷
Subjects: PARALLEL algorithms, HETEROGENEOUS computing, IMAGE intensifiers, IMAGE processing, STRUCTURAL optimization, DIGITAL images, PIXELS, GRAPHICS processing units
Abstract: Copyright of Journal of Frontiers of Computer Science & Technology is the property of Beijing Journal of Computer Engineering & Applications Journal Co Ltd.
Published: 2022

20. Numerical simulations of nano-particle's drag forces using DSMC method for various Knudsen numbers.

Author: Shin, Sang Woo and Lee, Sang Hwan
Subjects: *KNUDSEN flow, *POISEUILLE flow, *DRAG force, *GRANULAR flow, *MICROCHANNEL flow, *COMPUTER simulation
Abstract: In this study, high-vacuum flow was analyzed using the direct simulation Monte Carlo (DSMC) method, and various forces acting on fine particles in a high-vacuum flow field were studied. The DSMC method is a Lagrangian method that models the flow as particles and analyzes the collisions and behaviors of each particle, which requires substantial computing resources. To validate the DSMC method, computational results of a Poiseuille flow in a microchannel are compared with analytical results. In addition, the force acting on the particles in the high-vacuum rarefied gas region was verified using the outputs of previous studies. This numerical analysis makes it possible to study regimes in which experiments are difficult to carry out. For drag forces as a function of the Knudsen number, which indicates the ratio of vacuum and the particle size, it was confirmed that the drag force can be predicted through the empirical formula of previous studies. [ABSTRACT FROM AUTHOR]
Published: 2022

21. CUDA Accelerated HAPO (C-HAPO) Algorithm for Fast Responses in Vehicular Ad Hoc Networks

Author: Jindal, Vinita, Bedi, Punam, Verma, Ajit Kumar, Series Editor, Kapur, P. K., Series Editor, Kumar, Uday, Series Editor, Singh, Ompal, editor, and Khatri, Sunil Kumar, editor
Published: 2020

22. A GPU-Accelerated Discontinuous Galerkin Method for Solving Two-Dimensional Laminar Flows.

Author: GAO Huanqin, CHEN Hongquan, ZHANG Jiale, XU Shengguan, and GAO Yukun
Subjects: LAMINAR flow, GRAPHICS processing units, GALERKIN methods, TWO-dimensional models, COMPUTER simulation
Abstract: Copyright of Transactions of Nanjing University of Aeronautics & Astronautics is the property of Editorial Department of Journal of Nanjing University of Aeronautics & Astronautics.
Published: 2022

23. Enhanced Spatial–Temporal Savitzky–Golay Method for Reconstructing High-Quality NDVI Time Series: Reduced Sensitivity to Quality Flags and Improved Computational Efficiency.

Author: Yang, Xue, Chen, Jin, Guan, Qingfeng, Gao, Huan, and Xia, Wei
Subjects: *TIME series analysis, *NORMALIZED difference vegetation index, *GRAPHICS processing units
Abstract: The spatial–temporal Savitzky–Golay (STSG) method for noise reduction can address the problem of temporally continuous normalized difference vegetation index (NDVI) gaps and effectively increase local low NDVI values without overcorrection. However, STSG largely depends on the quality flags of the NDVI time-series data, and inaccurate quality flags yield misleading final results. STSG also requires extensive computing time when used in large-scale applications. This study proposes an enhanced method, called compute unified device architecture (CUDA)-based STSG (cuSTSG), to address the aforementioned limitations of STSG. First, cosine similarities between the annual NDVI time series were used to identify and exclude the NDVI values with inaccurate quality flags from the NDVI seasonal growth trajectory. Second, computational performance was improved by reducing redundant computations and parallelizing computationally intensive procedures using CUDA on graphics processing units (GPUs). Experiments on four MODIS NDVI time-series datasets of various sizes and regions showed that compared with the original STSG, cuSTSG reduced the mean absolute errors of the final products by 4.90%, 7.77%, 11.76%, and 2.06%, respectively. The results also showed that cuSTSG on a GPU achieved a speed-up of more than 75 times compared with the Interactive Data Language-implemented STSG, and of more than 30 times compared with the C++-implemented STSG. cuSTSG can effectively mitigate the impacts of inaccurate quality flags on final products and generate high-quality NDVI time series at large scales with high accuracy and performance. The source code of cuSTSG is available at https://github.com/HPSCIL/cuSTSG. [ABSTRACT FROM AUTHOR]
Published: 2022

24. Solving Electromagnetic Scattering Problems With Tens of Billions of Unknowns Using GPU Accelerated Massively Parallel MLFMA.

Author: He, Wei-Jia, Yang, Zeng, Huang, Xiao-Wei, Wang, Wu, Yang, Ming-Lin, and Sheng, Xin-Qing
Subjects: *ELECTROMAGNETIC wave scattering, *GRAPHICS processing units, *HETEROGENEOUS computing, *RANDOM access memory
Abstract: In this article, a massively parallel approach of the multilevel fast multipole algorithm (PMLFMA) on graphics processing unit (GPU) heterogeneous platform, noted as GPU-PMLFMA, is presented for solving extremely large electromagnetic scattering problems involving tens of billions of unknowns. In this approach, the flexible and efficient ternary partitioning scheme is employed at first to partition the MLFMA octree among message-passing interface (MPI) processes. Then, the computationally intensive parts of the PMLFMA on each MPI process (matrix filling, aggregation and disaggregation, and so on) are accelerated by using the GPU. Different parallelization strategies consistent with the ternary parallel MLFMA approach are designed for GPU to ensure high computational throughput. A special memory usage strategy is designed to improve computational efficiency and benefit data reuse. The CPU/GPU asynchronous computing pattern is designed with OpenMP and compute unified device architecture (CUDA) to accelerate the CPU and GPU execution parts, respectively, and to overlap their computation time. GPU architecture-based optimization strategies are implemented to further improve the computational efficiency. Numerical results demonstrate that the proposed GPU-PMLFMA can achieve over three times speedup, compared with the eight-threaded conventional PMLFMA. Solutions of scattering by electrically large and complicated objects with about 24 000 wavelengths and over 41.8 billion unknowns are presented. [ABSTRACT FROM AUTHOR]
Published: 2022
Full Text: View/download PDF

25. Toward Large-Scale Evolutionary Multitasking: A GPU-Based Paradigm.

Author: Huang, Yuxiao, Feng, Liang, Qin, Alex Kai, Chen, Meng, and Tan, Kay Chen
Subjects: CENTRAL processing units, EVOLUTIONARY algorithms, KNOWLEDGE transfer, GRAPHICS processing units
Abstract: Evolutionary multitasking (EMT), which shares knowledge across multiple tasks while the optimization progresses online, has demonstrated superior performance in terms of both optimization quality and convergence speed over its single-task counterpart in solving complex optimization problems. However, most of the existing EMT algorithms only consider handling two tasks simultaneously. As the computational cost incurred in the evolutionary search and knowledge transfer increased rapidly with the number of optimization tasks, these EMT algorithms cannot meet today’s requirements of optimization service on the cloud for many real-world applications, where hundreds or thousands of optimization requests (labeled as large-scale EMT) are often received simultaneously and require to be optimized in a short time. Recently, graphics processing unit (GPU) computing has attracted extensive attention to accelerate the applications possessing large-scale data volume that are traditionally handled by the central processing unit (CPU). Taking this cue, toward large-scale EMT, in this article, we propose a new EMT paradigm based on the island model with the compute unified device architecture (CUDA), which is able to handle a large number of continuous optimization tasks efficiently and effectively. Moreover, under the proposed paradigm, we develop the GPU-based implicit and explicit knowledge transfer mechanisms for EMT. To evaluate the performance of the proposed paradigm, comprehensive empirical studies have been conducted against its CPU-based counterpart in large-scale EMT. [ABSTRACT FROM AUTHOR]
Published: 2022
Full Text: View/download PDF

26. An improved multistage preconditioner on GPUs for compositional reservoir simulation

Author: Zhao, Li, Li, Shizhe, Zhang, Chen-Song, Feng, Chunsheng, and Shu, Shi
Published: 2023
Full Text: View/download PDF

27. High-Performance Flow Classification of Big Data Using Hybrid CPU-GPU Clusters of Cloud Environments

Author: Fazel-Najafabadi, Azam, Abbasi, Mahdi, Attar, Hani H., Amer, Ayman, Taherkordi, Amir, Shokrollahi, Azad, Khosravi, Mohammad R., and Solyman, Ahmed A.
Abstract: The network switches in the data plane of Software Defined Networking (SDN) are empowered by an elementary process, in which an enormous number of packets, which resemble big volumes of data, are classified into specific flows by matching them against a set of dynamic rules. This basic process accelerates the processing of data, so that instead of processing singular packets repeatedly, corresponding actions are performed on corresponding flows of packets. In this paper, first, we address limitations of a typical packet classification algorithm like Tuple Space Search (TSS). Then, we present a set of different scenarios to parallelize it on different parallel processing platforms, including Graphics Processing Units (GPUs), clusters of Central Processing Units (CPUs), and hybrid clusters. Experimental results show that the hybrid cluster provides the best platform for parallelizing packet classification algorithms, which promises an average throughput rate of 4.2 Million packets per second (Mpps). That is, the hybrid cluster produced by the integration of Compute Unified Device Architecture (CUDA), Message Passing Interface (MPI), and the OpenMP programming model could classify 0.24 million packets per second more than the GPU cluster scheme. Such a packet classifier satisfies the required processing speed in the programmable network systems that would be used to communicate big medical data.
Published: 2024
Full Text: View/download PDF

28. Sparse Linear Spectral Unmixing of Hyperspectral Images Using Expectation-Propagation.

Author: Li, Zeng, Altmann, Yoann, Chen, Jie, Mclaughlin, Stephen, and Rahardja, Susanto
Subjects: *SUPERVISED learning, *ISING model, *GRAPHICS processing units, *LATENT variables, *COMPUTATIONAL complexity
Abstract: This article presents a novel Bayesian approach for hyperspectral image unmixing. The observed pixels are modeled by a linear combination of material signatures weighted by their corresponding abundances. A spike-and-slab abundance prior is adopted to promote sparse mixtures and an Ising prior model is used to capture spatial correlation of the mixture support across pixels. We approximate the posterior distribution of the abundances using the expectation-propagation (EP) method. We show that it can significantly reduce the computational complexity of the unmixing stage and meanwhile provide uncertainty measures, compared to expensive Monte Carlo strategies traditionally considered for uncertainty quantification. Moreover, many variational parameters within each EP factor can be updated in a parallel manner, which enables mapping of efficient algorithmic architectures based on graphics processing units (GPUs). Under the same approximate Bayesian framework, we then extend the proposed algorithm to semi-supervised unmixing, whereby the abundances are viewed as latent variables and the expectation-maximization (EM) algorithm is used to refine the endmember matrix. Experimental results on synthetic data and real hyperspectral data illustrate the benefits of the proposed framework over state-of-art linear unmixing methods. [ABSTRACT FROM AUTHOR]
Published: 2022
Full Text: View/download PDF

29. Parallelization and Optimization of Application for Phonon BTE

Author: WEN Minhua, LIU Yongzhi, BAO Hua, HU Yue, SHEN Yongxing, WEI Jianwen, LIN Xinhua
Subjects: parallel acceleration, boltzmann transport equation (bte), dgx-2, compute unified device architecture (cuda), Electronic computers. Computer science, QA75.5-76.95
Abstract: Heat conduction at the submicron scale can be predicted effectively using the Boltzmann transport equation (BTE) for phonons. Compared with stochastic methods, the deterministic approach represented by the finite volume method for the phonon BTE is considered more promising for practical engineering problems. However, the finite volume method requires a large number of iteration steps and a long iteration time. To this end, a GPU parallel acceleration scheme for the iterative solution part of the phonon BTE is proposed, and an appropriate thread allocation method and data storage format are designed. This paper also applies loop unrolling and kernel function merging to optimize the iteration process. In addition, a multi-GPU version of the phonon BTE application is implemented using a direction-based parallel strategy with MPI+CUDA, CUDA-Aware MPI, and NCCL (NVIDIA Collective Communications Library). Experimental results show that the single-GPU version on a V100 is up to 31.5X faster than the serial implementation on an Intel Xeon Gold 6248. The multi-GPU version with NCCL yields 83% parallel efficiency on 8 DGX-2 nodes with a total of 128 V100 GPUs, which is 57% higher than the parallel method using MPI+CUDA.
Published: 2020
Full Text: View/download PDF

30. Performance engineering for HEVC transform and quantization kernel on GPUs

Author: Mate Čobrnić, Alen Duspara, Leon Dragić, Igor Piljić, and Mario Kovač
Subjects: integer discrete cosine transform (dct), high efficiency video coding (hevc), graphics processor unit (gpu), matrix multiplication, compute unified device architecture (cuda), Control engineering systems. Automatic machinery (General), TJ212-225, Automation, T59.5
Abstract: Continuous growth of video traffic and video services, especially in the field of high resolution and high-quality video content, places heavy demands on video coding and its implementations. High Efficiency Video Coding (HEVC) standard doubles the compression efficiency of its predecessor H.264/AVC at the cost of high computational complexity. To address those computing issues high-performance video processing takes advantage of heterogeneous multiprocessor platforms. In this paper, we present a highly performance-optimized HEVC transform and quantization kernel with all-zero-block (AZB) identification designed for execution on a Graphics Processor Unit (GPU). Performance optimization strategy involved all three aspects of parallel design, exposing as much of the application’s intrinsic parallelism as possible, exploitation of high throughput memory and efficient instruction usage. It combines efficient mapping of transform blocks to thread-blocks and efficient vectorized access patterns to shared memory for all transform sizes supported in the standard. Two different GPUs of the same architecture were used to evaluate proposed implementation. Achieved processing times are 6.03 and 23.94 ms for DCI 4K and 8K Full Format, respectively. Speedup factors compared to CPU, cuBLAS and AVX2 implementations are up to 80, 19 and 4 times respectively. Proposed implementation outperforms previous work 1.22 times.
Published: 2020
Full Text: View/download PDF

31. Fast, Sub-pixel Accurate Digital Image Correlation Algorithm Powered by Heterogeneous (CPU-GPU) Framework

Author: Thiagu, Mullai, Subramanian, Sankara J., Nasre, Rupesh, Zimmerman, Kristin B., Series Editor, Lamberti, Luciano, editor, Lin, Ming-Tzer, editor, Furlong, Cosme, editor, Sciammarella, Cesar, editor, Reu, Phillip L., editor, and Sutton, Michael A, editor
Published: 2019
Full Text: View/download PDF

32. Active Foreground Neural Network

Author: Aggarwal, Ayush, Gupta, Subhash Chand, Angrisani, Leopoldo, Series Editor, Arteaga, Marco, Series Editor, Panigrahi, Bijaya Ketan, Series Editor, Chakraborty, Samarjit, Series Editor, Chen, Jiming, Series Editor, Chen, Shanben, Series Editor, Chen, Tan Kay, Series Editor, Dillmann, Rüdiger, Series Editor, Duan, Haibin, Series Editor, Ferrari, Gianluigi, Series Editor, Ferre, Manuel, Series Editor, Hirche, Sandra, Series Editor, Jabbari, Faryar, Series Editor, Jia, Limin, Series Editor, Kacprzyk, Janusz, Series Editor, Khamis, Alaa, Series Editor, Kroeger, Torsten, Series Editor, Liang, Qilian, Series Editor, Martin, Ferran, Series Editor, Ming, Tan Cher, Series Editor, Minker, Wolfgang, Series Editor, Misra, Pradeep, Series Editor, Möller, Sebastian, Series Editor, Mukhopadhyay, Subhas, Series Editor, Ning, Cun-Zheng, Series Editor, Nishida, Toyoaki, Series Editor, Pascucci, Federica, Series Editor, Qin, Yong, Series Editor, Seng, Gan Woon, Series Editor, Speidel, Joachim, Series Editor, Veiga, Germano, Series Editor, Wu, Haitao, Series Editor, Zhang, Junjie James, Series Editor, Mishra, Sukumar, editor, Sood, Yog Raj, editor, and Tomar, Anuradha, editor
Published: 2019
Full Text: View/download PDF

33. Topology Optimization Using GPGPU

Author: Gavranovic, Stefan, Hartmann, Dirk, Wever, Utz, Oñate, Eugenio, Series Editor, Minisci, Edmondo, editor, Vasile, Massimiliano, editor, Periaux, Jacques, editor, Gauger, Nicolas R., editor, Giannakoglou, Kyriakos C., editor, and Quagliarella, Domenico, editor
Published: 2019
Full Text: View/download PDF

34. Implementation of Mass Transfer Model on Parallel Computational System

Author: Pavluš, Miron, Bačinský, Tomáš, Greguš, Michal, Xhafa, Fatos, Series Editor, Barolli, Leonard, editor, Kryvinska, Natalia, editor, Enokido, Tomoya, editor, and Takizawa, Makoto, editor
Published: 2019
Full Text: View/download PDF

35. An Efficient Parallel Implementation of CPU Scheduling Algorithms Using Data Parallel Algorithms

Author: Agrawal, Suvigya, Yadav, Aishwarya, Parwani, Disha, Mayya, Veena, Kacprzyk, Janusz, Series Editor, Pal, Nikhil R., Advisory Editor, Bello Perez, Rafael, Advisory Editor, Corchado, Emilio S., Advisory Editor, Hagras, Hani, Advisory Editor, Kóczy, László T., Advisory Editor, Kreinovich, Vladik, Advisory Editor, Lin, Chin-Teng, Advisory Editor, Lu, Jie, Advisory Editor, Melin, Patricia, Advisory Editor, Nedjah, Nadia, Advisory Editor, Nguyen, Ngoc Thanh, Advisory Editor, Wang, Jun, Advisory Editor, Kamal, Raj, editor, Henshaw, Michael, editor, and Nair, Pramod S., editor
Published: 2019
Full Text: View/download PDF

36. Parallel Computing for Cell Mapping

Author: Sun, Jian-Qiao, Xiong, Fu-Rui, Schütze, Oliver, and Hernández, Carlos
Published: 2019
Full Text: View/download PDF

37. cuFSDAF: An Enhanced Flexible Spatiotemporal Data Fusion Algorithm Parallelized Using Graphics Processing Units.

Author: Gao, Huan, Zhu, Xiaolin, Guan, Qingfeng, Yang, Xue, Yao, Yao, Zeng, Wen, and Peng, Xuantong
Subjects: *GRAPHICS processing units, *MULTISENSOR data fusion, *IMAGE fusion, *DEEP learning, *ALGORITHMS, *REMOTE sensing, *SURFACE dynamics
Abstract: Spatiotemporal data fusion is a cost-effective way to produce remote sensing images with high spatial and temporal resolutions using multisource images. Using spectral unmixing analysis and spatial interpolation, the flexible spatiotemporal data fusion (FSDAF) algorithm is suitable for heterogeneous landscapes and capable of capturing abrupt land-cover changes. However, the extensive computational complexity of FSDAF prevents its use in large-scale applications and mass production. Besides, the domain decomposition strategy of FSDAF causes accuracy loss at the edges of subdomains due to the insufficient consideration of edge effects. In this study, an enhanced FSDAF (cuFSDAF) is proposed to address these problems, and includes three main improvements. First, the TPS interpolator is replaced by an accelerated inverse distance weighted (IDW) interpolator to reduce computational complexity. Second, the algorithm is parallelized based on the compute unified device architecture (CUDA), a widely used parallel computing framework for graphics processing units (GPUs). Third, an adaptive domain decomposition (ADD) method is proposed to improve the fusion accuracy at the edges of subdomains and to enable GPUs with varying computing capacities to deal with datasets of any size. Experiments showed that, while obtaining similar accuracies to FSDAF and an up-to-date deep-learning-based method, cuFSDAF reduced the computing time significantly and achieved speed-ups of 140.3–182.2 over the original FSDAF program. cuFSDAF is capable of efficiently producing fused images with both high spatial and temporal resolutions to support applications for large-scale and long-term land surface dynamics. Source code and test data are available at https://github.com/HPSCIL/cuFSDAF. [ABSTRACT FROM AUTHOR]
Published: 2022
Full Text: View/download PDF

38. Advanced GNSS-R Signals Processing With GPUs

Author: Oriol Cervello i Nogues, Daniel Pascual, Raul Onrubia, and Adriano Camps
Subjects: Compute unified device architecture (CUDA), global navigation satellite system reflectometry (GNSS-R), graphics processing unit (GPU) processing, parallel computing, real-time processing, Ocean engineering, TC1501-1800, Geophysics. Cosmic physics, QC801-809
Abstract: Global navigation satellite system reflectometry (GNSS-R) is a group of techniques that uses satellite navigation signals as signals of opportunity for remote sensing applications. In GNSS-R, large amounts of data are acquired and have to be processed. Computation time is typically the bottleneck for ground and airborne experiments. This article presents an efficient solution for off-line GNSS-R processing data taking advantage of graphics processing units (GPUs). After comparing to the typically used CPU languages, such as MATLAB and C++, the advantage of using parallel processing on the GPU is clear. GPU-based computation can reduce the processing time by as much as 95% of the acquisition time of the data. An implementation taking advantage of a home-use GPU is proposed for the data processing units. Thanks to its efficiency, even real-time processing experiments are feasible.
Published: 2020
Full Text: View/download PDF

39. Parallel Palm Print Identification Using Fractional Coefficients of Palm Edge Transformed Images on GPU

Author: Gudadhe, Santwana S., Thakare, A. D., Dhote, C. A., Kacprzyk, Janusz, Series editor, Pal, Nikhil R., Advisory editor, Bello Perez, Rafael, Advisory editor, Corchado, Emilio S., Advisory editor, Hagras, Hani, Advisory editor, Kóczy, László T., Advisory editor, Kreinovich, Vladik, Advisory editor, Lin, Chin-Teng, Advisory editor, Lu, Jie, Advisory editor, Melin, Patricia, Advisory editor, Nedjah, Nadia, Advisory editor, Nguyen, Ngoc Thanh, Advisory editor, Wang, Jun, Advisory editor, Dash, Subhransu Sekhar, editor, Das, Swagatam, editor, and Panigrahi, Bijaya Ketan, editor
Published: 2018
Full Text: View/download PDF

40. Parallel Computations

Author: Awange, Joseph L., Paláncz, Béla, Lewis, Robert H., and Völgyesi, Lajos
Published: 2018
Full Text: View/download PDF

41. An Overview of Hardware Implementation of Membrane Computing Models.

Author: GEXIANG ZHANG, ZEYI SHANG, VERLAN, SERGEY, MARTÍNEZ-DEL-AMOR, MIGUEL Á., CHENGXUN YUAN, VALENCIA-CABRERA, LUIS, and PÉREZ-JIMÉNEZ, MARIO J.
Subjects: *BIOLOGICALLY inspired computing, *GATE array circuits, *FIELD programmable gate arrays, *PARALLEL algorithms, *PARALLEL programming
Abstract: The model of membrane computing, also known under the name of P systems, is a bio-inspired large-scale parallel computing paradigm having a good potential for the design of massively parallel algorithms. For its implementation it is very natural to choose hardware platforms that have important inherent parallelism, such as field-programmable gate arrays (FPGAs) or compute unified device architecture (CUDA)-enabled graphic processing units (GPUs). This article performs an overview of all existing approaches of hardware implementation in the area of P systems. The quantitative and qualitative attributes of FPGA-based implementations and CUDA-enabled GPU-based simulations are compared to evaluate the two methodologies. [ABSTRACT FROM AUTHOR]
Published: 2021
Full Text: View/download PDF

42. Parallel Implementation of the GRFT Algorithm Based on a Third-Order Motion Model.

Author: 冯伟刚 and 张顺生
Abstract: Copyright of Journal of Signal Processing is the property of Journal of Signal Processing and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)
Published: 2021
Full Text: View/download PDF

43. Audio fingerprint hierarchy searching strategies on GPGPU massively parallel computer

Author: Toan Nguyen Mau and Yasushi Inoguchi
Subjects: Audio fingerprint, massively parallel, parallel processing, GPGPU, compute unified device architecture (CUDA), K-modes, K-means, locality sensitive hashing (LSH), Telecommunication, TK5101-6720, Information technology, T58.5-58.64
Abstract: Audio fingerprints were developed to represent audio based on the content of its waveform. With an audio fingerprint database, songs and music can be managed with high reliability and flexibility. However, with today's well-developed Internet, audio data have grown ever larger, which makes managing audio and music data more difficult. Two problems must be solved when the audio fingerprint database becomes big data: the database must be large enough to store 10 million audio fingerprints, and the search strategies must find the nearest song in acceptable time for thousands of queries at once [Nguyen Mau, T., & Inoguchi, Y. (2016). Audio fingerprint hierarchy searching on massively parallel with multi-gpgpus using K-modes and lsh. Eighth international conference on knowledge and systems engineering (KSE) (pp. 49–54). IEEE]. In this research, we propose methods for storing audio fingerprints using multiple GPGPUs and nearest-song searching strategies based on these databases. We also show that our methods produce significant results for deploying a real system in the future.
Published: 2018
Full Text: View/download PDF

44. Robust Optimization for Audio FingerPrint Hierarchy Searching on Massively Parallel with Multi-GPGPUs Using K-modes and LSH

Author: Mau, Toan Nguyen, Inoguchi, Yasushi, Duy, Vo Hoang, editor, Dao, Tran Trong, editor, Kim, Sang Bong, editor, Tien, Nguyen Tan, editor, and Zelinka, Ivan, editor
Published: 2017
Full Text: View/download PDF

45. High-Performance Computing for Earthquake Disaster Simulation of Urban Buildings

Author: Lu, Xinzheng and Guan, Hong
Published: 2017
Full Text: View/download PDF

46. High-Performance Computing and Visualization for Earthquake Disaster Simulation of Tall Buildings

Author: Lu, Xinzheng and Guan, Hong
Published: 2017
Full Text: View/download PDF

47. Performance engineering for HEVC transform and quantization kernel on GPUs.

Author: Čobrnić, Mate, Duspara, Alen, Dragić, Leon, Piljić, Igor, and Kovač, Mario
Subjects: VIDEO coding, COMPUTATIONAL complexity, GRAPHICS processing units, VIDEO processing, MULTIPROCESSORS, DISCRETE cosine transforms, VIDEO on demand
Abstract: Continuous growth of video traffic and video services, especially in the field of high resolution and high-quality video content, places heavy demands on video coding and its implementations. High Efficiency Video Coding (HEVC) standard doubles the compression efficiency of its predecessor H.264/AVC at the cost of high computational complexity. To address those computing issues high-performance video processing takes advantage of heterogeneous multiprocessor platforms. In this paper, we present a highly performance-optimized HEVC transform and quantization kernel with all-zero-block (AZB) identification designed for execution on a Graphics Processor Unit (GPU). Performance optimization strategy involved all three aspects of parallel design, exposing as much of the application's intrinsic parallelism as possible, exploitation of high throughput memory and efficient instruction usage. It combines efficient mapping of transform blocks to thread-blocks and efficient vectorized access patterns to shared memory for all transform sizes supported in the standard. Two different GPUs of the same architecture were used to evaluate proposed implementation. Achieved processing times are 6.03 and 23.94 ms for DCI 4K and 8K Full Format, respectively. Speedup factors compared to CPU, cuBLAS and AVX2 implementations are up to 80, 19 and 4 times respectively. Proposed implementation outperforms previous work 1.22 times. [ABSTRACT FROM AUTHOR]
Published: 2020
Full Text: View/download PDF

48. Research on Parallelization and Optimization of a Phonon BTE Application.

Author: 文敏华, 刘永志, 鲍华, 胡跃, 沈泳星, 韦建文, and 林新华
Abstract: Copyright of Journal of Frontiers of Computer Science & Technology is the property of Beijing Journal of Computer Engineering & Applications Journal Co Ltd. and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)
Published: 2020
Full Text: View/download PDF

49. Multiple-GPU-Based Simulation of Ka-Band Helix Traveling Wave Tube.

Author: Wang, Xiaoyue, Chen, Qi, and Li, Mingzhi
Subjects: *TRAVELING-wave tubes, *GRAPHICS processing units, *MICROWAVE devices, *MESSAGE passing (Computer science), *ALGORITHMS, *FINITE difference method
Abstract: The finite-difference time-domain (FDTD) algorithm and the particle-in-cell (PIC) method-based simulation are classic approaches to design and optimize the helix traveling wave tubes (TWTs). In this article, a multiple-graphics processing unit (GPU)-based 3-D-FDTD-PIC parallel program is developed to complete the full-wave simulation of a Ka-band helix TWT in the time domain. The specific parallel simulation scheme is given. The code based on compute unified device architecture (CUDA) and message passing interface (MPI) can run on a scalable heterogeneous cluster consisting of multiple CPUs and GPUs, substantially improving the simulation speed and shortening the development period of TWTs. In this article, the specific parameters of the experimental TWT are given. The simulation results are found basically consistent with the measured values and theoretical analysis, verifying the correctness of this code. Moreover, this program can realistically restore the physical processes occurring in the tube in a short period of time, which means it can be applied as an efficient simulation tool for further research of TWTs or even other microwave power devices based on beam–wave interaction. [ABSTRACT FROM AUTHOR]
Published: 2020
Full Text: View/download PDF

50. GPU-Accelerated Computation of Time-Evolving Electromagnetic Backscattering Field From Large Dynamic Sea Surfaces.

Author: Linghu, Longxiang, Wu, Jiaji, Wu, Zhensen, Jeon, Gwanggil, and Wang, Xiaobin
Abstract: An efficient facet-based composite scattering model (FBCSM) is developed for calculating the time-evolving electromagnetic (EM) scattering field (TESF) to study the normalized radar cross section and Doppler spectrum characteristics from dynamic sea surfaces. The dynamic sea surface comprises two-scale profiles: small-scale capillary ripples modulated by large-scale gravity waves, which are modeled by millions of small facets. In microwave bands, two scattering mechanisms, quasi-specular scattering with respect to gravity waves and Bragg scattering with respect to ripples, are taken into account in the FBCSM for computation of the time-evolving EM scattering field under diverse polarizations. However, it may be very time-consuming and difficult to calculate the TESF due to the high resolution and dynamic complexity of the large dynamic sea surface. In this paper, the NVIDIA Tesla K80 graphics processing unit (GPU) with the compute unified device architecture is utilized to improve the computational performance of the TESF. The whole GPU-based TESF computation includes the optimal use of temporary variables, shared memory, constant memory and register, fast-math compiler options, asynchronous data transfer, and the most suitable block size and number of registers. By utilizing the proposed five improvement strategies, a significant speedup of 1200× can be achieved for computation of TESF from large dynamic sea surfaces for microwave bands compared with the single-threaded C program executed on the Intel(R) Core(TM) i5-3450 CPU. [ABSTRACT FROM AUTHOR]
Published: 2020
Full Text: View/download PDF
