7,704 results on '"graphics processing unit"'
Search Results
2. A numerical toy model of Langevin dynamics provides real-time visualization of colloidal microdroplet evaporation
- Author
-
Derkachov, G., Jakubczyk, T., Alikhanzadeh-Arani, S., Wojciechowski, T., and Jakubczyk, D.
- Published
- 2025
- Full Text
- View/download PDF
3. GAAS: GPU accelerated absorption simulator
- Author
-
Callahan, Charles S., Bresler, Sean M., Coburn, Sean C., Long, David A., and Rieker, Gregory B.
- Published
- 2025
- Full Text
- View/download PDF
4. Efficient multi-GPU implementation of a moving boundary approach in rotor flow simulation using LBM and level-set method
- Author
-
Sun, Xiangcheng and Wang, Xian
- Published
- 2025
- Full Text
- View/download PDF
5. Deep Learning Technique for Computer Vision-Based Pose Estimation for Augmented Reality
- Author
-
Palanimeera, J., Ponmozhi, K., Jyothi, Kanagaraj, Dargar, Shashi Kant, Birla, Shilpi, Angrisani, Leopoldo, Series Editor, Arteaga, Marco, Series Editor, Chakraborty, Samarjit, Series Editor, Chen, Shanben, Series Editor, Chen, Tan Kay, Series Editor, Dillmann, Rüdiger, Series Editor, Duan, Haibin, Series Editor, Ferrari, Gianluigi, Series Editor, Ferre, Manuel, Series Editor, Jabbari, Faryar, Series Editor, Jia, Limin, Series Editor, Kacprzyk, Janusz, Series Editor, Khamis, Alaa, Series Editor, Kroeger, Torsten, Series Editor, Li, Yong, Series Editor, Liang, Qilian, Series Editor, Martín, Ferran, Series Editor, Ming, Tan Cher, Series Editor, Minker, Wolfgang, Series Editor, Misra, Pradeep, Series Editor, Mukhopadhyay, Subhas, Series Editor, Ning, Cun-Zheng, Series Editor, Nishida, Toyoaki, Series Editor, Oneto, Luca, Series Editor, Panigrahi, Bijaya Ketan, Series Editor, Pascucci, Federica, Series Editor, Qin, Yong, Series Editor, Seng, Gan Woon, Series Editor, Speidel, Joachim, Series Editor, Veiga, Germano, Series Editor, Wu, Haitao, Series Editor, Zamboni, Walter, Series Editor, Tan, Kay Chen, Series Editor, Tripathi, Anshuman, editor, Soni, Amit, editor, Tiwari, Manish, editor, Swarnkar, Anil, editor, and Sahariya, Jagrati, editor
- Published
- 2025
- Full Text
- View/download PDF
6. GPU optimization techniques to accelerate optiGAN—a particle simulation GAN
- Author
-
Srikanth, Anirudh, Trigila, Carlotta, and Roncali, Emilie
- Subjects
Information and Computing Sciences, Applied Computing, Machine Learning, generative adversarial networks, graphics processing unit, performance optimization, radiation detector, multidimensional probability distributions, Monte-Carlo simulation, Applied computing, Machine learning
- Abstract
The demand for specialized hardware to train AI models has increased in tandem with the increase in model complexity over recent years. The graphics processing unit (GPU) is one such piece of hardware, capable of parallelizing operations performed on a large chunk of data. Companies like Nvidia, AMD, and Google have been constantly scaling up hardware performance as fast as they can. Nevertheless, there is still a gap between the required processing power and the processing capacity of the hardware. To increase hardware utilization, the software has to be optimized too. In this paper, we present some general GPU optimization techniques we used to efficiently train the optiGAN model, a Generative Adversarial Network that is capable of generating multidimensional probability distributions of optical photons at the photodetector face in radiation detectors, on an 8 GB Nvidia Quadro RTX 4000 GPU. We analyze and compare the performance of all the optimizations based on the execution time and the memory consumed using the Nvidia Nsight Systems profiler tool. The optimizations gave approximately a 4.5x increase in runtime performance when compared to naive training on the GPU, without compromising the model performance. Finally, we discuss optiGAN's future work and how we plan to scale the model on GPUs.
- Published
- 2024
7. Symmetric Tridiagonal Eigenvalue Solver Across CPU Graphics Processing Unit (GPU) Nodes.
- Author
-
Hernández-Rubio, Erika, Estrella-Cruz, Alberto, Meneses-Viveros, Amilcar, Rivera-Rivera, Jorge Alberto, Barbosa-Santillán, Liliana Ibeth, and Chapa-Vergara, Sergio Víctor
- Subjects
SYMMETRIC matrices, GRAPHICS processing units, EIGENVALUES, DENSITY functional theory, EIGENVECTORS
- Abstract
In this work, an improved and scalable implementation of Cuppen's algorithm for diagonalizing symmetric tridiagonal matrices is presented. This approach uses a hybrid-heterogeneous parallelization technique, taking advantage of GPU and CPU in a distributed hardware architecture. Cuppen's algorithm is a theoretical concept and a powerful tool in various scientific and engineering applications. It is a key player in matrix diagonalization, finding its use in Functional Density Theory (FDT) and Spectral Clustering. This highly efficient and numerically stable algorithm computes eigenvalues and eigenvectors of symmetric tridiagonal matrices, making it a crucial component in many computational methods. One of the challenges in parallelizing algorithms for GPUs is their limited memory capacity. However, we overcome this limitation by utilizing multiple nodes with both CPUs and GPUs. This enables us to solve subproblems that fit within the memory of each device in parallel and subsequently combine these subproblems to obtain the complete solution. The hybrid-heterogeneous approach proposed in this work outperforms the state-of-the-art libraries and also maintains a high degree of accuracy in terms of orthogonality and quality of eigenvectors. Furthermore, the sequential version of the algorithm with our approach in this work demonstrates superior performance and potential for practical use. In the experiments carried out, it was possible to verify that the performance of the implementation that was carried out scales by 2× using two graphic cards in the same node. Notably, Symmetric Tridiagonal Eigenvalue Solvers are fundamental to solving more general eigenvalue problems. Additionally, the divide-and-conquer approach employed in this implementation can be extended to singular value solvers. 
Given the wide range of eigenvalue problems encountered in scientific and engineering domains, this work is essential in advancing computational methods for efficient and accurate matrix diagonalization. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
8. Parallel implementation of discrete cosine transform and its inverse for image compression applications.
- Author
-
Mukherjee, Debasish
- Subjects
*DISCRETE cosine transforms, *GRAPHICS processing units
- Abstract
This paper presents the graphics processing unit (GPU) implementation of two-dimensional discrete cosine transform (2D DCT) and inverse discrete cosine transform (2D IDCT) for image compression applications. Based on the trigonometric properties, the transform matrices are simplified, resulting in reduced computation over the naive implementation. To assess its performance, the output image quality is measured in terms of several metrics and found to be better than all other existing transforms. To further improve the timings, a GPU implementation of the proposed transforms is obtained by exploiting the inter-level parallelism among threads and blocks in addition to efficiently accessing data from the shared memory resources. This has resulted in significant improvement in speedup (more than 5k) for both the transforms. The proposed GPU implementation of 2D DCT is compared in terms of processing time and is shown to outperform the existing work across all image dimensions. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
9. Batched sparse direct solver design and evaluation in SuperLU_DIST.
- Author
-
Boukaram, Wajih, Hong, Yuxi, Liu, Yang, Shi, Tianyi, and Li, Xiaoye S
- Subjects
*FUNCTION algebras, *DIRECTED acyclic graphs, *LINEAR algebra, *FACTORIZATION, *BANDWIDTHS
- Abstract
Over the course of interactions with various application teams, the need for batched sparse linear algebra functions has emerged in order to make more efficient use of the GPUs for many small and sparse linear algebra problems. In this paper, we present our recent work on a batched sparse direct solver for GPUs. The sparse LU factorization is computed by the levels of the elimination tree, leveraging the batched dense operations at each level and a new batched Scatter GPU kernel. The sparse triangular solve is computed by the level sets of the directed acyclic graph (DAG) of the triangular matrix. Batched operations overcome the large overhead associated with launching many small kernels. For medium sized matrix batches with not-so-small bandwidth, using an NVIDIA A100 GPU, our new batched sparse direct solver is orders of magnitude faster than a batched banded solver and uses less than one-tenth of the memory. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
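The level-set triangular solve mentioned in the SuperLU_DIST abstract above can be illustrated with a small sketch. This is not the authors' GPU kernel: it is a plain-Python illustration, assuming a lower-triangular matrix stored as one dictionary of nonzeros per row, and the names `level_sets` and `solve_by_levels` are hypothetical. Rows in the same level have no dependencies on one another, which is what lets a batched or parallel implementation process them together.

```python
def level_sets(rows):
    """Group the rows of a sparse lower-triangular matrix into dependency levels.

    rows[i] is a dict {column: value} holding the nonzeros of row i,
    including the diagonal entry. Row i depends on every row j < i
    for which rows[i] has a nonzero in column j.
    """
    level = [0] * len(rows)
    for i, row in enumerate(rows):
        deps = [j for j in row if j < i]
        # A row sits one level above its deepest dependency.
        level[i] = 1 + max((level[j] for j in deps), default=-1)
    grouped = {}
    for i, lv in enumerate(level):
        grouped.setdefault(lv, []).append(i)
    return [grouped[lv] for lv in sorted(grouped)]


def solve_by_levels(rows, b):
    """Solve L x = b by forward substitution, one level at a time."""
    x = [0.0] * len(b)
    for same_level in level_sets(rows):
        # Rows within one level are independent of each other, so a GPU
        # implementation could assign each of them to its own thread.
        for i in same_level:
            partial = sum(v * x[j] for j, v in rows[i].items() if j < i)
            x[i] = (b[i] - partial) / rows[i][i]
    return x


# Illustrative 4x4 lower-triangular system (made-up values).
L = [
    {0: 2.0},
    {1: 1.0},
    {0: 1.0, 2: 1.0},
    {1: 1.0, 2: 1.0, 3: 2.0},
]
print(level_sets(L))
print(solve_by_levels(L, [2.0, 3.0, 4.0, 9.0]))
```

In this toy system rows 0 and 1 share a level, so only three sequential steps are needed for four rows; the deeper the dependency chain, the less parallelism the level sets expose.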
10. Revealing Hidden Features of Chaotic Systems Using High-Performance Bifurcation Analysis Tools Based on CUDA Technology.
- Author
-
Rybin, Vyacheslav, Butusov, Denis, Shirnin, Kirill, and Ostrovskii, Valerii
- Subjects
*GRAPHICS processing units, *BIFURCATION diagrams, *ORDINARY differential equations, *TASK analysis, *TEST systems
- Abstract
Bifurcation analysis is an essential tool in nonlinear dynamics. Bifurcation diagrams help to discover subtle features of investigated dynamics such as chaotic and periodic regimes, hidden attractors and fixed points. However, plotting high-resolution bifurcation diagrams can be a computationally challenging task, especially in multiparametric evaluation. It should be noted that the bifurcation analysis is a task with natural parallelism and thus can be efficiently solved using hybrid central-graphics processing architectures. In this paper, we propose an advanced algorithm and special software for plotting bifurcation diagrams using calculations accelerated by graphics processing unit in combination with a highly efficient semi-implicit ordinary differential equation solver. Time series processing is based on the extraction of amplitude and phase features for the density-based spatial clustering of applications with noise to determine oscillation periodicity. We showcase the features of the application of proposed solutions on a set of test chaotic systems. The performance of the analysis algorithms is investigated in comparison with conventional solutions based on central processing unit and several approaches known from the literature. We explicitly show that the proposed algorithm outperforms known solutions in both calculation speed and precision. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
11. Transient Fault Detection in Tensor Cores for Modern GPUs.
- Author
-
Hafezan, Mohammad Hassan and Atoofian, Ehsan
- Subjects
ARTIFICIAL neural networks, MACHINE learning, AUTONOMOUS vehicles
- Abstract
Deep neural networks (DNNs) have emerged as an effective solution for many machine learning applications. However, the great success comes with the cost of excessive computation. The Volta graphics processing unit (GPU) from NVIDIA introduced a specialized hardware unit called tensor core (TC) aiming at meeting the growing computation demand needed by DNNs. Most previous studies on TCs have focused on performance improvement through the utilization of the TC's high degree of parallelism. However, as DNNs are deployed into security-sensitive applications such as autonomous driving, the reliability of TCs is as important as performance. In this work, we exploit the unique architectural characteristics of TCs and propose a simple and implementation-efficient hardware technique called fault detection in tensor core (FDTC) to detect transient faults in TCs. In particular, FDTC exploits the zero-valued weights that stem from network pruning as well as sparse activations arising from the common ReLU operator to verify tensor operations. The high level of sparsity in tensors allows FDTC to run original and verifying products simultaneously, leading to zero performance penalty. For applications with a low sparsity rate, FDTC relies on temporal redundancy to re-execute effectual products. FDTC schedules the execution of verifying products only when multipliers are idle. Our experimental results reveal that FDTC offers 100% fault coverage with no performance penalty and small energy overhead in TCs. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
12. Multi-phase-field framework for predicting epitaxial growth microstructure during multi-track, multi-layer scanning in selective laser melting additive manufacturing
- Author
-
高木知弘, 高橋侑希, and 坂根慎治
- Abstract
In this study, a multi-phase-field (MPF) framework for predicting epitaxial grain growth in selective laser melting (SLM) additive manufacturing (AM) with multi-track and multi-layer scanning was developed. The spatiotemporal change in temperature was approximated using the Rosenthal equation, which provides a theoretical solution for the temperature distribution due to a moving point heat source. The powder bed was modeled as a polycrystalline layer. Large-scale MPF simulations for SLM-AM were performed using parallel computing with multiple graphics processing units. Using the MPF framework developed herein, we simulated SLM-AM with four tracks and four layers for 316L stainless steel. By observing the epitaxial grain growth process on two-dimensional cross-sections and in three dimensions, we clarified a typical growth procedure of grains with characteristic 3D shapes. The MPF framework will potentially enable a systematic estimation of the material microstructures formed during SLM-AM. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
13. Can GPU performance increase faster than the code error rate?
- Author
-
dos Santos, Fernando Fernandes and Rech, Paolo
- Subjects
*ERROR rates, *GRAPHICS processing units, *CONVOLUTIONAL neural networks, *DATA corruption, *SOLUTION strengthening
- Abstract
Graphics processing units (GPUs) are the reference architecture to accelerate high-performance computing applications and the training/inference of convolutional neural networks. For both these domains, performance and reliability are two of the main constraints. It is believed that the only way to increase reliability is to sacrifice performance, e.g., using redundancies. We show in this paper that this is not always the case. As a very promising result, we found that most GPU performance improvements also bring the benefit of increasing the number of executions correctly completed before experiencing a silent data corruption (SDC). We consider four common GPU performance optimizations: architectural solutions, software implementations, compiler optimizations, and thread degree of parallelism. We compare different implementations of a variety of parallel codes and, through beam experiments and application profiling, we show that the performance improvement typically (but not necessarily) increases the GPU SDC rate. Nevertheless, for the vast majority of the configurations the performance gain is much higher than the SDC rate increase, allowing a larger amount of correct data to be processed. As we show, the programmer's choices can increase the number of correctly completed executions by up to 25× without redesigning the algorithm or including specific hardening solutions. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
14. NM-SpMM: A semi-structured sparse matrix multiplication algorithm for domestic heterogeneous vector processor.
- Author
-
JIANG Jing-fei, HE Yuan-hong, XU Jin-wei, XU Shi-yao, and QIAN Xi-fu
- Abstract
Deep neural networks have achieved excellent results in natural language processing, computer vision and other fields. Due to the growth of the scale of data processed by intelligent applications and the rapid development of large models, the inference performance of deep neural networks is increasingly demanding. N:M semi-structured sparse scheme has become one of the hot technologies to balance the computing power demand and application effect. The domestic heterogeneous vector processor FT-M7032 provides more space for data parallelism and instruction parallelism development in intelligent model processing. In order to address the challenges of N:M semi-structured sparse model computation with various sparse patterns, a flexible configurable sparse matrix multiplication algorithm NM-SpMM is proposed for FT-M7032. NM-SpMM designs an efficient compressed offset address sparse encoding format COA, which avoids the impact of semi-structured parameter configuration on sparse data access. Based on the COA, NM-SpMM performs fine-grained optimization of sparse matrix multiplication in different dimensions. The experimental results on FT-M7032 single core show that NM-SpMM can obtain 1.73~21.00 times speedup compared to dense matrix multiplication, and 0.04~1.04 times speedup compared to NVIDIA V100 GPU with CuSPARSE. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
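The N:M semi-structured sparsity scheme named in the NM-SpMM abstract above keeps at most N nonzeros in every group of M consecutive weights. The sketch below illustrates only that general idea in plain Python; it does not reproduce the paper's COA encoding, and the function names and the values-plus-offsets layout are assumptions made for illustration.

```python
def prune_n_m(row, n, m):
    """Keep the n largest-magnitude entries in each group of m consecutive values.

    Returns (values, offsets): the kept values in order, and for each one
    its position (0..m-1) inside its group. This layout is a generic
    illustration, not the COA format from the paper.
    """
    if len(row) % m != 0:
        raise ValueError("row length must be a multiple of m")
    values, offsets = [], []
    for start in range(0, len(row), m):
        group = row[start:start + m]
        by_magnitude = sorted(range(m), key=lambda k: abs(group[k]), reverse=True)
        for k in sorted(by_magnitude[:n]):
            values.append(group[k])
            offsets.append(k)
    return values, offsets


def sparse_dot(values, offsets, n, m, x):
    """Dot product of an N:M-compressed row with a dense vector x."""
    total = 0.0
    for idx, (v, off) in enumerate(zip(values, offsets)):
        group = idx // n  # each group contributes exactly n stored values
        total += v * x[group * m + off]
    return total


# Made-up weights pruned to 2:4 sparsity.
weights = [0.1, -3.0, 2.0, 0.0, 5.0, 0.2, -0.3, 4.0]
vals, offs = prune_n_m(weights, 2, 4)
print(vals, offs)
print(sparse_dot(vals, offs, 2, 4, [1, 2, 3, 4, 5, 6, 7, 8]))
```

Because every group stores the same number of values, the position of any group in the compressed arrays can be computed directly, which is the property that makes such formats friendly to vector hardware.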
15. RISMiCal: A software package to perform fast RISM/3D‐RISM calculations.
- Author
-
Maruyama, Yutaka and Yoshida, Norio
- Subjects
*INTEGRATED software, *CHEMICAL processes, *CHEMICAL reactions, *GRAPHICS processing units
- Abstract
Solvent plays an essential role in a variety of chemical, physical, and biological processes that occur in the solution phase. The reference interaction site model (RISM) and its three‐dimensional extension (3D‐RISM) serve as powerful computational tools for modeling solvation effects in chemical reactions, biological functions, and structure formations. We present the RISM integrated calculator (RISMiCal) program package, which is based on RISM and 3D‐RISM theories with fast GPU code. RISMiCal has been developed as an integrated RISM/3D‐RISM program that has interfaces with external programs such as Gaussian16, GAMESS, and Tinker. Fast 3D‐RISM programs for single‐ and multi‐GPU codes written in CUDA would enhance the availability of these hybrid methods because they require the performance of many computationally expensive 3D‐RISM calculations. We expect that our package can be widely applied for chemical and biological processes in solvent. The RISMiCal package is available at https://rismical-dev.github.io. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
16. Parallelization of Pigeonhole Sort for Efficient Data Sorting.
- Author
-
Sai Datta, Pasupuleti Rohith, Kamath, Chinmaya D., Kini, N. Gopalakrishna, and B., Ashwath Rao
- Subjects
GRAPHICS processing units, PARALLEL programming, MESSAGE passing (Computer science), PARALLEL algorithms
- Abstract
The need for parallel sorting algorithms has been driven by the increasing need for large-scale datasets to be processed efficiently. Pigeonhole sorting is one of the sorting algorithms that sorts in linear time. This study focuses on improving the performance of the Pigeonhole Sorting method by employing parallel programming techniques, specifically the Message Passing Interface (MPI) and Compute Unified Device Architecture (CUDA). The primary objective is to develop and assess parallel solutions for Pigeonhole Sorting, with the aim of optimizing sorting efficiency in data-intensive applications. Commencing with a comprehensive analysis of the sequential design of the Pigeonhole Sorting algorithm, this work proceeds to create parallel implementations using CUDA for Graphics Processing Unit (GPU) acceleration and MPI for distributed memory parallelism. This work contributes valuable insights into adapting the Pigeonhole Sorting algorithm to parallel contexts. The findings emphasize the potential advantages of parallelization in reducing the overall computation time. [ABSTRACT FROM AUTHOR]
- Published
- 2024
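For readers unfamiliar with the pigeonhole sort discussed in the record above, the following is a minimal sequential sketch in Python, plus a chunked counting step that mimics how work might be split across MPI ranks or CUDA blocks. It is an illustration under those assumptions, not the authors' implementation, and the helper names are hypothetical.

```python
def pigeonhole_counts(values, lo, hi):
    """Count how many times each integer in [lo, hi] occurs in values."""
    holes = [0] * (hi - lo + 1)
    for v in values:
        holes[v - lo] += 1
    return holes


def pigeonhole_sort(values, chunks=1):
    """Sort integers by counting them into holes, one hole per possible key.

    With chunks > 1 the input is split into pieces that are counted
    independently and then merged, standing in for per-rank or per-block
    histograms followed by a reduction.
    """
    if not values:
        return []
    lo, hi = min(values), max(values)
    size = -(-len(values) // chunks)  # ceiling division
    partial = [
        pigeonhole_counts(values[start:start + size], lo, hi)
        for start in range(0, len(values), size)
    ]
    # Reduction step: add the per-chunk histograms hole by hole.
    holes = [sum(column) for column in zip(*partial)]
    result = []
    for offset, count in enumerate(holes):
        result.extend([lo + offset] * count)
    return result


print(pigeonhole_sort([8, 3, 2, 7, 4, 6, 8]))
print(pigeonhole_sort([8, 3, 2, 7, 4, 6, 8], chunks=3))
```

The running time grows with the number of items plus the size of the key range, so the method suits dense integer keys; a wide, sparse key range wastes both memory and time on empty holes.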
17. Development of Real-Time Hybrid Detection System Using Deep Learning for Security Applications
- Author
-
Sharma, Prachi, Lamba, Anil Kumar, Kacprzyk, Janusz, Series Editor, Gomide, Fernando, Advisory Editor, Kaynak, Okyay, Advisory Editor, Liu, Derong, Advisory Editor, Pedrycz, Witold, Advisory Editor, Polycarpou, Marios M., Advisory Editor, Rudas, Imre J., Advisory Editor, Wang, Jun, Advisory Editor, Swaroop, Abhishek, editor, Kansal, Vineet, editor, Fortino, Giancarlo, editor, and Hassanien, Aboul Ella, editor
- Published
- 2024
- Full Text
- View/download PDF
18. Performance Analysis for Image Rendering on Different Operating Systems
- Author
-
Hidalgo, Luis Diego, Cordero, Juan José, Phillips, Àngel, Howlett, Robert J., Series Editor, Jain, Lakhmi C., Series Editor, Choudrie, Jyoti, editor, Tuba, Eva, editor, Perumal, Thinagaran, editor, and Joshi, Amit, editor
- Published
- 2024
- Full Text
- View/download PDF
19. Design and optimization of haze prediction model based on particle swarm optimization algorithm and graphics processor
- Author
-
Zuhan Liu, Kexin Zhao, Xuehu Liu, and Huan Xu
- Subjects
Haze prediction, Support vector regression, Parallel computing, Graphics Processing Unit, Medicine, Science
- Abstract
With the rapid expansion of industrialization and urbanization, fine Particulate Matter (PM2.5) pollution has escalated into a major global environmental crisis. This pollution severely affects human health and ecosystem stability. Accurately predicting PM2.5 levels is essential. However, air quality forecasting currently faces challenges in processing vast data and enhancing model accuracy. Deep learning models are widely applied for their superior learning and fitting abilities in haze prediction. Yet, they are limited by optimization challenges, long training periods, high data quality needs, and a tendency towards overfitting. Furthermore, the complex internal structures and mechanisms of these models complicate the understanding of haze formation. In contrast, traditional Support Vector Regression (SVR) methods perform well with complex non-linear data but struggle with increased data volumes. To address this, we developed CUDA-based code to optimize SVR algorithm efficiency. We also combined SVR with Genetic Algorithms (GA), Sparrow Search Algorithm (SSA), and Particle Swarm Optimization (PSO) to identify the optimal haze prediction model. Our results demonstrate that the model combining intelligent algorithms with Central Processing Unit-Graphics Processing Unit (CPU-GPU) heterogeneous parallel computing significantly outpaces the PSO-SVR model in training speed. It achieves a computation time that is 6.21–35.34 times faster. Compared to other models, the Particle Swarm Optimization-Central Processing Unit-Graphics Processing Unit-Support Vector Regression (PSO-CPU-GPU-SVR) model stands out in haze prediction, offering substantial speed improvements and enhanced stability and reliability while maintaining high accuracy. This breakthrough not only advances the efficiency and accuracy of haze prediction but also provides valuable insights for real-time air quality monitoring and decision-making.
- Published
- 2024
- Full Text
- View/download PDF
20. Development and Validation of 3D Core Physics Code STORK Based on GPU Acceleration
- Author
-
YU Lulin, YANG Gaosheng, CHEN Guohua, BEI Hua, JIANG Xiaofeng, GAO Mingmin, and WANG Tao
- Subjects
neutron transport, graphics processing unit, method of characteristics, on-line homogenization, pin-by-pin, sp3, super homogenization method, Nuclear engineering. Atomic power, TK9001-9401, Nuclear and particle physics. Atomic energy. Radioactivity, QC770-798
- Abstract
A 3D neutron transport computational code, STORK, has been developed based on a small-scale multi-GPU computing platform, utilizing the coupled approach of the two-dimensional full-core layer-by-layer transport calculation by the method of characteristics (MOC) and the 3D pin-by-pin simplified P3 (SP3) calculation. In this code, firstly, the core was layered according to the axial characteristics and the two-dimensional multi-group (69-group) transport equation was solved by MOC method (with fully reflective boundary conditions in the axial direction) for each axial layer. Secondly, utilizing the results from 2D MOC calculations, based on the equivalent homogenization theory and the super-homogenization (SPH) technology, the heterogenous cells were homogenized, which produced the few-group homogenous cross sections as well as SPH factors. Finally, the 3D whole-core pin-by-pin SP3 calculation was carried out to obtain cell flux and power distribution. Moreover, the constructive solid geometry (CSG) was applied to enhance the complex geometric modeling capability in STORK. A combination of the enhanced neutron flow method and the equivalence theory was used to perform resonance calculations and a pre-produced table of resonance interference factors was adopted to handle the resonance interference effects. During 2D transport calculation, a two-level unstructured coarse mesh finite difference method was applied to accelerate the convergence of the MOC calculation. In the 3D pin-by-pin calculation, the 3D SP3 equations were solved by the transverse integration technique and the nodal expansion method with group transverse-integrated neutron fluxes approximated by the parabola expansion in the radial direction and by semi-analytical expansion in the axial direction. 
In terms of code development, hybrid programming in CUDA, C++ and Python was adopted, and all the computational modules were developed based on CUDA/C++ with a large number of performance optimizations, so that 2D MOC calculations at each layer of the core could be carried out on multiple GPUs at the same time. To maximize computational efficiency, the computationally-intensive modules in STORK, including MOC calculation, CMFD, resonance calculation, burnup calculation, and SP3 calculation modules, were executed on the GPU. The validation of the STORK code through the C5G7 3D Rodded problem and VERA benchmark problems demonstrates its high computational accuracy, with a radial assembly power error of less than 1%. However, due to the code's direct utilization of the energy spectrum of the adjacent layers' active regions for the axial reflector and the lack of consideration for neutron leakage from neighboring axial layers, significant discrepancies in axial power occur near the reflector and in fuel layers containing spacer grids, but they remain below 3%. More importantly, developed based on the CPU/GPU heterogeneous system, the code exhibits significant advantages in terms of computational efficiency and cost compared to similar neutron transport software.
- Published
- 2024
- Full Text
- View/download PDF
21. Distributed data processing and task scheduling based on GPU parallel computing
- Author
-
Li, Jun
- Published
- 2024
- Full Text
- View/download PDF
22. Exploring GPU acceleration framework for climate based daylight modeling
- Author
-
Du, Sida, Zhao, Yongqing, Tian, Zhen, Geisler-Moroder, David, and Wang, Wei
- Published
- 2024
- Full Text
- View/download PDF
23. Predicting optimal sparse general matrix-matrix multiplication algorithm on GPUs.
- Author
-
Wei, Bingxin, Wang, Yizhuo, Chang, Fangli, Gao, Jianhua, and Ji, Weixing
- Subjects
*MULTIPLICATION, *ALGORITHMS, *MACHINE design
- Abstract
Sparse General Matrix-Matrix Multiplication (SpGEMM) has played an important role in a number of applications. So far, many efficient algorithms have been proposed to improve the performance of SpGEMM on GPUs. However, the performance of each algorithm for matrices of different structures varies a lot. There is no algorithm that can achieve the optimal performance of SpGEMM computation on all matrices. In this article, we design a machine learning based approach for predicting the optimal SpGEMM algorithm on input matrices. By extracting features from input matrices, we utilize LightGBM and XGBoost to train different lightweight models. The models are capable of predicting the best performing algorithm with low inference overhead and high accuracy for the given input matrices. We also investigate the impact of tree depth on model accuracy and inference overhead. Our evaluation shows that an increase in tree depth leads to a corresponding increase in prediction accuracy, reaching a maximum of approximately 85%, while resulting in increased inference overhead of approximately 10 µs. Compared with the state-of-the-art algorithms on three GPU platforms, our method achieves better overall performance. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
24. Efficient implementation of low-order-precision smoothed particle hydrodynamics.
- Author
-
Hosono, Natsuki and Furuichi, Mikito
- Subjects
*HYDRODYNAMICS, *ARTIFICIAL intelligence, *SURFACE interactions
- Abstract
Smoothed particle hydrodynamics (SPH) method is widely accepted as a flexible numerical treatment for surface boundaries and interactions. High-resolution simulations of hydrodynamic events require high-performance computing (HPC). There is a need for an SPH code that runs efficiently on modern supercomputers involving accelerators such as NVIDIA or AMD graphics processing units. In this work, we applied half-precision, which is widely used in artificial intelligence, to the SPH method. However, improving HPC performance at such low-order precisions is a challenge. An as-is implementation with half-precision will have lower computational cost than that of float/double precision simulations, but also worsens the simulation accuracy. We propose a scaling and shifting method that maintains the simulation accuracy near the level of float/double precision. By examining the impact of half-precision on the simulation accuracy and time-to-solution, we demonstrated that the use of half-precision can improve the computational performance of SPH simulations for scientific purposes without sacrificing the accuracy. In addition, we demonstrated that the efficiency of half-precision depends on the architecture used. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
25. Design and optimization of haze prediction model based on particle swarm optimization algorithm and graphics processor.
- Author
-
Liu, Zuhan, Zhao, Kexin, Liu, Xuehu, and Xu, Huan
- Subjects
PARTICLE swarm optimization, GRAPHICS processing units, PREDICTION models, HAZE, AIR quality monitoring, DEEP learning
- Abstract
With the rapid expansion of industrialization and urbanization, fine Particulate Matter (PM2.5) pollution has escalated into a major global environmental crisis. This pollution severely affects human health and ecosystem stability. Accurately predicting PM2.5 levels is essential. However, air quality forecasting currently faces challenges in processing vast data and enhancing model accuracy. Deep learning models are widely applied for their superior learning and fitting abilities in haze prediction. Yet, they are limited by optimization challenges, long training periods, high data quality needs, and a tendency towards overfitting. Furthermore, the complex internal structures and mechanisms of these models complicate the understanding of haze formation. In contrast, traditional Support Vector Regression (SVR) methods perform well with complex non-linear data but struggle with increased data volumes. To address this, we developed CUDA-based code to optimize SVR algorithm efficiency. We also combined SVR with Genetic Algorithms (GA), Sparrow Search Algorithm (SSA), and Particle Swarm Optimization (PSO) to identify the optimal haze prediction model. Our results demonstrate that the model combining intelligent algorithms with Central Processing Unit-Graphics Processing Unit (CPU-GPU) heterogeneous parallel computing significantly outpaces the PSO-SVR model in training speed. It achieves a computation time that is 6.21–35.34 times faster. Compared to other models, the Particle Swarm Optimization-Central Processing Unit-Graphics Processing Unit-Support Vector Regression (PSO-CPU-GPU-SVR) model stands out in haze prediction, offering substantial speed improvements and enhanced stability and reliability while maintaining high accuracy. This breakthrough not only advances the efficiency and accuracy of haze prediction but also provides valuable insights for real-time air quality monitoring and decision-making. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
26. Accelerated Augmented Reality Holographic 4k Video Projections Based on Lidar Point Clouds for Automotive Head‐Up Displays.
- Author
-
Skirnewskaja, Jana, Montelongo, Yunuen, Sha, Jinze, Wilkes, Phil, and Wilkinson, Timothy D.
- Subjects
- *
OPTICAL scanners , *HEAD-up displays , *POINT cloud , *AUGMENTED reality , *DRIVER assistance systems , *OPTICAL radar - Abstract
Identifying road obstacles hidden from the driver's field of view can ensure road safety in transportation. Current driver assistance systems such as 2D head‐up displays are limited to the projection area on the windshield of the car. An augmented reality holographic point cloud video projection system is developed to display objects aligned with real‐life objects in size and distance within the driver's field of view. Light Detection and Ranging (LiDAR) point cloud data collected with a 3D laser scanner is transformed into layered 3D replay field objects consisting of 400 k points. GPU‐accelerated computing generated real‐time holograms 16.6 times faster than the CPU processing time. The holographic projections are obtained with a Spatial Light Modulator (SLM) (3840×2160 px) and virtual Fresnel lenses, which enlarged the driver's eye box to 25 mm × 36 mm. Real‐time scanned road obstacles from different perspectives provide the driver a full view of risk factors such as generated depth in 3D mode and the ability to project any scanned object from different angles in 360°. The 3D holographic projection technology allows for maintaining the driver's focus on the road instead of the windshield and enables assistance by projecting road obstacles hidden from the driver's field of view. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
27. Time optimization for simulation of PMD 3D camera.
- Author
-
Lade, Sangita Gautam, Pawale, Sanjesh, and Patil, Aniket
- Subjects
COMPUTER vision ,PARALLEL algorithms ,RAY tracing ,PARALLEL programming ,AUGMENTED reality - Abstract
This research aims to enhance the efficiency of simulating 3D cameras utilizing Photonic Mixer Devices (PMD) technology, crucial for applications in computer vision, robotics, and augmented reality. Despite their significance, the computational demands of simulating PMD 3D cameras present substantial challenges in time and resource management. This study proposes a novel approach to optimizing simulation time without sacrificing accuracy, achieved through advanced algorithms and parallel computing techniques. Through a comprehensive analysis of existing simulation methodologies, bottlenecks are identified, and tailored optimization techniques are implemented. The system is designed to simulate PMD sensors, wherein ray tracing precedes power calculation, essential for determining pixel radiance and irradiance. However, the inherent computational intensity of the sequential power calculation algorithm presents a challenge of speed, particularly for PMD sensor simulation reliant on fast-imaging technology. To address this issue, a parallel algorithm leveraging General Purpose Graphics Processing Units (GPGPUs) is proposed and implemented. Experimentation is carried out on a Volta (GV100) Graphics Processing Unit (GPU) with block sizes varying from 32 to 1024 in multiples of 32. Experimental results demonstrate significant speed enhancements, with a maximum speedup of 78% using the Volta GPU with a block size of 1024, thereby showcasing the efficacy of the proposed methodology. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
28. GPU parallel computation strategy for electrothermal coupling problems using improved assembly-free FEM.
- Author
-
Wu, Shaowen, Wang, Youyuan, Hou, Jinhong, and Meng, Ruixiao
- Subjects
PARALLEL algorithms ,GRAPHICS processing units ,FINITE element method ,DEGREES of freedom - Abstract
The analysis of electrothermal coupling problems finds extensive application in engineering. However, for large-scale electrothermal coupling problems, the time cost and storage requirements for solving them using the finite element method (FEM) are substantial. We optimize the finite element electrothermal coupling computation from two aspects: computational speed and storage usage. Based on the assembly-free FEM, we explore the symmetry of element matrices to reduce storage for second-order tetrahedral elements and propose a graphics processing unit (GPU) parallel algorithm to improve computational speed. At the same time, we allocate the parallel parts of an electrothermal coupling problem to two GPUs to improve the speed further. In addition, for the three types of boundary conditions in electrothermal coupling problems, we design parallel application methods suitable for assembly-free FEM. Finally, we compare our strategy with methods from other literature through the numerical experiment. Our method reduces the element matrices' storage by 45%. Compared with the solution process using the element level method and degree of freedom level method, our strategy achieves average acceleration ratios of 5.83 and 1.38, respectively. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
29. Development and verification of STORK, a GPU-accelerated three-dimensional reactor core physics code.
- Author
-
俞陆林, 杨高升, 陈国华, 卑华, 蒋校丰, 高明敏, and 王涛
- Abstract
Copyright of Atomic Energy Science & Technology is the property of Editorial Board of Atomic Energy Science & Technology and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)
- Published
- 2024
- Full Text
- View/download PDF
30. Utilizing Machine Learning Techniques for Worst-Case Execution Time Estimation on GPU Architectures
- Author
-
Vikash Kumar, Behnaz Ranjbar, and Akash Kumar
- Subjects
Timing analysis ,machine learning ,WCET analysis ,graphics processing unit ,measurement-based approach ,Electrical engineering. Electronics. Nuclear engineering ,TK1-9971 - Abstract
The massive parallelism provided by Graphics Processing Units (GPUs) to accelerate compute-intensive tasks makes them preferable for Real-Time Systems such as autonomous vehicles. Such systems require the execution of heavy Machine Learning (ML) and Computer Vision applications, which rely on the computing power of GPUs. However, such systems need a guarantee of timing predictability. This means that the Worst-Case Execution Time (WCET) of the application must be estimated tightly and safely so that each application can be scheduled before its deadline to avoid catastrophic consequences. As more applications use GPUs, running many applications simultaneously on the same GPU becomes necessary. To provide predictable performance while applications are running in parallel, the system must be WCET-aware, which GPUs do not fully support in a multitasking environment. Nvidia recently added a feature called the Multi-Process Service. It allows different applications to run simultaneously in the same CUDA context by partitioning the compute resources of the GPU. Using this feature, we can measure the interference from co-running GPU applications to estimate WCET. In this paper, we propose a novel technique to estimate the WCET of a GPU kernel using an ML approach. Our approach is based on the application’s source, and the model is trained on a large data set. The approach is flexible and can be applied to different GPU-sharing mechanisms. We allow the victim and enemy kernels of the GPU to execute in parallel to get the maximum interference from the enemy to estimate the WCET of the victim kernel. Enemy kernels are chosen to cause a higher slowdown by acquiring the resources of the victim kernel. We compare our implementation with state-of-the-art approaches to show its effectiveness. 
Our ML approach reduces the time by 99% in most cases because inferences take only seconds to predict WCET, and the resource consumption required to estimate WCET compared to traditional approaches is minimal because we don’t need to execute the application on GPU for hours. Although our approach does not offer safety guarantees because of its empirical nature, we observed that predicted WCETs are always higher than any observed execution times for all benchmarks, and the maximum overestimation factor observed is 11x.
- Published
- 2024
- Full Text
- View/download PDF
31. Real-Time Incoherent Digital Holography System Using an Embedded Graphic Processing Unit
- Author
-
Mahiro Baba, Tatsuki Tahara, Tomoyoshi Ito, and Tomoyoshi Shimobaba
- Subjects
Digital holography ,graphics processing unit ,holography ,incoherent holography ,Electrical engineering. Electronics. Nuclear engineering ,TK1-9971 - Abstract
Incoherent digital holography (IDH) is a technique that enables single-shot three-dimensional (3D) imaging by recording holograms with incoherent light, such as sunlight and LEDs. IDH is expected to be a next-generation 3D measurement technique, but there are two challenges: speeding up both the denoising required for IDH due to insufficient light and the diffraction calculations required to obtain reconstructed images. We propose a real-time IDH system in which single-exposure phase-shifting digital holography, which can record multiple phase-shifted holograms in a single exposure, is combined with an embedded graphic processing unit (GPU). The proposed system enables real-time color imaging of the real world from incoherent holograms with $2448 \times 2048$ pixels while reducing noise at 21.2 frames per second. The effect of the denoising was evaluated by speckle contrast, showing that noise was well reduced. Compared to five existing studies using field-programmable gate arrays and GPUs, the proposed system is capable of computing large hologram reproduction at high speed. The proposed system is a prototype for future incoherent holographic cameras.
- Published
- 2024
- Full Text
- View/download PDF
32. Performance-Oriented and Sustainability-Oriented Design of an Effective Android Malware Detector
- Author
-
Sana Qadir, Amna Naeem, Mehdi Hussain, Huma Ghafoor, and Aisha Hassan Abdalla Hashim
- Subjects
Malware detection ,machine learning ,graphics processing unit ,performance ,sustainability ,Electrical engineering. Electronics. Nuclear engineering ,TK1-9971 - Abstract
Effective Android malware detection is a complex problem because of the rapidly-evolving, complicated, and diverse nature of malware. The design of malware detectors should prioritise high detection rate, efficient use of computational resources, and sustainability. Keeping these design priorities in mind, we develop and empirically evaluate four different classifiers. Firstly, to ensure high detection rate, we use a dataset compiled using hybrid analysis of a diverse set of apps. Unlike most publicly-available Android datasets, the dynamic analysis of each app was carried out on a real device and not on a virtual setup. This means that this dataset contains a comprehensive profile of sophisticated malware capable of changing its behaviour on a virtual setup. Secondly, to enhance efficiency, we explore the use of a GPU-based setup and different feature selection techniques. Lastly, we emphasize sustainability by training the models using apps that date back to the beginning of the Android ecosystem i.e. from 2008 until 2020. Our results show that Random Forest (RF) is the most effective classifier with the highest accuracy of 97.86%. This accuracy is 2.78% higher than the best accuracy reported in existing literature. The data also shows that RF is the most sustainable classifier with minimal decrease in F1 score for over-time performance. With regard to efficiency, we find that Logistic Regression (LR) is the best option and that the training time of most models improves significantly when a GPU-based setup is used instead of a CPU-based setup.
- Published
- 2024
- Full Text
- View/download PDF
33. GPU-accelerated body-internal electric field exposure simulation using low-frequency magnetic field sampling points
- Author
-
Haussmann, Norman, Stroka, Steven, Schmuelling, Benedikt, and Clemens, Markus
- Published
- 2023
- Full Text
- View/download PDF
34. Symmetric Tridiagonal Eigenvalue Solver Across CPU Graphics Processing Unit (GPU) Nodes
- Author
-
Erika Hernández-Rubio, Alberto Estrella-Cruz, Amilcar Meneses-Viveros, Jorge Alberto Rivera-Rivera, Liliana Ibeth Barbosa-Santillán, and Sergio Víctor Chapa-Vergara
- Subjects
Cuppen’s algorithm ,Eigenvalue Solver ,Graphics Processing Unit ,Hybrid-Heterogeneous computing ,Technology ,Engineering (General). Civil engineering (General) ,TA1-2040 ,Biology (General) ,QH301-705.5 ,Physics ,QC1-999 ,Chemistry ,QD1-999 - Abstract
In this work, an improved and scalable implementation of Cuppen’s algorithm for diagonalizing symmetric tridiagonal matrices is presented. This approach uses a hybrid-heterogeneous parallelization technique, taking advantage of GPU and CPU in a distributed hardware architecture. Cuppen’s algorithm is a theoretical concept and a powerful tool in various scientific and engineering applications. It is a key player in matrix diagonalization, finding its use in Functional Density Theory (FDT) and Spectral Clustering. This highly efficient and numerically stable algorithm computes eigenvalues and eigenvectors of symmetric tridiagonal matrices, making it a crucial component in many computational methods. One of the challenges in parallelizing algorithms for GPUs is their limited memory capacity. However, we overcome this limitation by utilizing multiple nodes with both CPUs and GPUs. This enables us to solve subproblems that fit within the memory of each device in parallel and subsequently combine these subproblems to obtain the complete solution. The hybrid-heterogeneous approach proposed in this work outperforms the state-of-the-art libraries and also maintains a high degree of accuracy in terms of orthogonality and quality of eigenvectors. Furthermore, the sequential version of the algorithm with our approach in this work demonstrates superior performance and potential for practical use. In the experiments carried out, it was possible to verify that the performance of the implementation that was carried out scales by 2× using two graphic cards in the same node. Notably, Symmetric Tridiagonal Eigenvalue Solvers are fundamental to solving more general eigenvalue problems. Additionally, the divide-and-conquer approach employed in this implementation can be extended to singular value solvers. 
Given the wide range of eigenvalue problems encountered in scientific and engineering domains, this work is essential in advancing computational methods for efficient and accurate matrix diagonalization.
- Published
- 2024
- Full Text
- View/download PDF
35. Achieving performance portability in Gaussian basis set density functional theory on accelerator based architectures in NWChemEx
- Author
-
Williams-Young, David B, Bagusetty, Abhishek, de Jong, Wibe A, Doerfler, Douglas, van Dam, Hubertus JJ, Vázquez-Mayagoitia, Álvaro, Windus, Theresa L, and Yang, Chao
- Subjects
Information and Computing Sciences ,Applied Computing ,Bioengineering ,Density functional theory ,Accelerator ,Graphics processing unit ,Performance portability ,Distributed Computing ,Cognitive Sciences ,Distributed computing and systems software - Abstract
The numerical integration of the exchange–correlation (XC) potential is one of the primary computational bottlenecks in Gaussian basis set Kohn–Sham density functional theory (KS-DFT). To achieve optimal performance and accuracy, care must be taken in this numerical integration to preserve local sparsity so as to allow for near linear weak scaling with system size. This leads to an integration scheme with several performance critical kernels which must be hand optimized for each architecture of interest. As the set of available accelerator hardware grows more diverse, a key challenge for developers of KS-DFT software is to maintain performance portability across a wide range of computational architectures. In this work, we examine a modular software design pattern which decouples the implementation details of performance critical kernels from the expression of high-level algorithmic workflows in a device-agnostic language such as C++, thus allowing developers to target existing and emerging accelerator hardware within a single code base. We consider the efficacy of such a design pattern in the numerical integration of the XC potential by demonstrating its ability to achieve performance portability across a set of accelerator architectures which are representative of those on current and future U.S. Department of Energy Leadership Computing Facilities.
- Published
- 2021
36. An exploration of quantitative models and algorithms for vehicle routing optimization and traveling salesman problems
- Author
-
Oskari Lähdeaho and Olli-Pekka Hilmola
- Subjects
Vehicle Route Optimization ,Traveling Salesman Problem ,Branch and Bound ,Logistics ,Parallel Computing ,Graphics Processing Unit ,Marketing. Distribution of products ,HF5410-5417.5 ,Management. Industrial management ,HD28-70 - Abstract
This study presents optimization models for large vehicle routing problems using a spreadsheet solver and the Python programming language, with a graphics card providing additional computing power. Near optimality is feasible and attainable with spreadsheet tools and models for solving real-life problems. However, increasing the availability of additional computing power through graphics processing and visualization is now a viable option for decision-makers and problem-solvers. This study shows that managers and decision-makers can solve vehicle routing optimization problems even with limited access to sophisticated optimization tools.
- Published
- 2024
- Full Text
- View/download PDF
37. COMPUTING WEAK DISTANCE BETWEEN THE 2-SPHERE AND ITS NONSMOOTH APPROXIMATIONS.
- Author
-
KAZUKI KOGA
- Subjects
- *
PIECEWISE linear approximation , *FAST Fourier transforms , *FOURIER transforms , *STATISTICAL smoothing , *GRAPHICS processing units - Abstract
A novel algorithm is proposed for quantitative comparisons between compact surfaces embedded in the three-dimensional Euclidean space. The key idea is to identify those objects with the associated surface measures and compute a weak distance between them using the Fourier transform on the ambient space. In particular, the inhomogeneous Sobolev norm of negative order for a difference between two surface measures is evaluated via the Plancherel theorem, which amounts to approximating a weighted integral norm of smooth data on the frequency space. This approach allows several advantages, including high accuracy due to fast-converging numerical quadrature rules, acceleration by the nonuniform fast Fourier transform, and parallelization on many-core processors. In numerical experiments, the 2-sphere, which is an example whose Fourier transform is explicitly known, is compared with its icosahedral discretization, and it is observed that the piecewise linear approximations converge to the smooth object at the quadratic rate up to small truncation. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
38. Massive Parallelization of Massive Sample-Size Survival Analysis.
- Author
-
Yang, Jianxiao, Schuemie, Martijn J., Ji, Xiang, and Suchard, Marc A.
- Subjects
- *
SURVIVAL analysis (Biometry) , *PROPORTIONAL hazards models , *PARALLEL algorithms , *GRAPHICS processing units , *REGRESSION analysis , *MEDICAL supplies - Abstract
Large-scale observational health databases are increasingly popular for conducting comparative effectiveness and safety studies of medical products. However, an increasing number of patients poses computational challenges when fitting survival regression models in such studies. In this article, we use Graphics Processing Units (GPUs) to parallelize the computational bottlenecks of massive sample-size survival analyses. Specifically, we develop and apply time- and memory-efficient single-pass parallel scan algorithms for Cox proportional hazards models and forward-backward parallel scan algorithms for Fine-Gray models for analysis with and without a competing risk using a cyclic coordinate descent optimization approach. We demonstrate that GPUs accelerate the computation of fitting these complex models in large databases by orders of magnitude as compared to traditional multi-core CPU parallelism. Our implementation enables efficient large-scale observational studies involving millions of patients and thousands of patient characteristics. The above implementation is available in the open-source R package Cyclops. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
39. Design of real-time GNSS-R software-defined receiver for coastal altimetry using GPS/BDS/QZSS signals.
- Author
-
Meng, Xinyue, Gao, Fan, Xu, Tianhe, He, Yunqiao, Wang, Nazi, and Ning, Baojiao
- Abstract
Global navigation satellite system reflectometry (GNSS-R) has considerable potential for monitoring sea surface height with high spatiotemporal resolution at low cost. However, because of the immaturity of reflected signal processing, no commercial GNSS-R receiver can provide reliable altimetry measurements. Typically, raw intermediate-frequency data are collected and processed using a software-defined receiver (SDR), which allows full access for signal processing and testing innovative algorithms. Since high-precision code-ranging measurements from open-loop tracking are needed for GNSS-R altimetry, the sampling rate of raw IF data is usually several times that of conventional data used for navigation and positioning. Therefore, the increased data load makes processing very slow when using a computer with only a conventional central processing unit (CPU). To overcome such inefficiency, a graphics processing unit (GPU) was utilized in this study to design the GNSS-R altimetry SDR. As GPU can provide massive parallel computing performance, the correlators were implemented on it, while some procedures with low computational requirements were still implemented on the CPU. The performance of the developed SDR was evaluated by processing GNSS-R raw IF data highly sampled at 62 MHz from a coastal experiment, which has a central frequency of 1176.45 MHz. Then, code-level altimetry solutions were retrieved from BeiDou navigation satellite system (BDS) B2a and quasi-zenith satellite system (QZSS)/global positioning system (GPS) L5 signals. To optimize the SDR, different integration times and error control methods were tested. Results showed that centimeter-level GNSS-R code altimetry solutions can be achieved using QZSS geostationary orbit satellite signals in the case of real-time operation. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
40. Design and Development of a CCSDS 131.2-B Software-Defined Radio Receiver Based on Graphics Processing Unit Accelerators.
- Author
-
Ciardi, Roberto, Giuffrida, Gianluca, Bertolucci, Matteo, and Fanucci, Luca
- Subjects
SOFTWARE radio ,TELECOMMUNICATION systems ,EARTH stations ,SIGNAL processing ,RADIO technology ,GRAPHICS processing units ,VIDEO coding - Abstract
In recent years, the number of Earth Observation missions has been exponentially increasing. Satellites dedicated to these missions usually embark with payloads that produce large amounts of data that need to be transmitted towards ground stations in time-limited windows. Moreover, the noisy nature of the link between satellites and ground stations makes it hard to achieve reliable communication. To address these problems, a standard for a flexible advanced coding and modulation scheme for high-rate telemetry applications has been defined by the Consultative Committee for Space Data Systems (CCSDS). The defined standard, referred to as CCSDS 131.2-B, makes use of Serially Concatenated Convolutional Codes (SCCC) based on 27 ModCods to optimize transmission quality. A limiting factor in the adoption of this standard is represented by the complexity and the cost of the hardware required for developing high-performance receivers. In the last decade, the performance of software has grown due to the advancement of general-purpose processing hardware, leading to the development of many high-performance software systems even in the telecommunication sector. These are commonly referred to as Software-Defined Radio (SDR), indicating a radio communication system in which components that are usually implemented in hardware, by means of FPGAs or ASICs, are instead implemented in software, offering many advantages such as flexibility, modularity, extensibility, cheaper maintenance, and cost saving. This paper proposes the development of an SDR based on NVIDIA Graphics Processing Units (GPU) for implementing the receiver end of the CCSDS 131.2-B standard. At first, a brief description of the CCSDS 131.2-B standard is given, focusing on the architecture of the transmitter and receiver sides. 
Then, the receiver architecture is shown, giving an overview of its functional blocks and of the implementation choices made to optimize the processing of the signal, especially for the SCCC Decoder. Finally, the performance of the system is analyzed in terms of data-rate and error correction and compared with other SW systems to highlight the achieved improvements. The presented system has been demonstrated to be a perfect solution for CCSDS 131.2-B-compliant device testing and for its use in science missions, providing a valid low-cost alternative with respect to the state-of-the-art HW receivers. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
41. Implementation of real‐time hybrid simulation based on Python‐graphics processing unit computing.
- Author
-
Dong, Xiaohui, Tang, Zhenyun, and Du, Xiuli
- Subjects
HYBRID computer simulation ,PYTHON programming language ,GRAPHICS processing units ,DEGREES of freedom ,HIGH performance computing ,COMPUTER simulation - Abstract
Summary: Real‐time hybrid simulation is a testing method that combines physical experiments and numerical simulations, which can increase the dimensions of experimental specimens and reduce the error of scaling testing. Currently, the maximum degrees of freedom of numerical models are 7000 in real time. To improve the scale of numerical simulation in real time, a testing framework based on Python and graphics processing unit was proposed in this paper. The maximum degrees of freedom of the numerical model exceeded 24,000 with the testing framework. The testing capacity of real‐time hybrid simulation was significantly improved by the graphics processing unit calculations. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
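42. Research on parallel acceleration of the SAR back-projection algorithm based on GPU coarse- and fine-grained parallelism and mixed precision.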
- Author
-
田卫明, 刘富强, 谢 鑫, 王长军, 王 健, and 邓云开
- Abstract
Copyright of Journal of Signal Processing is the property of Journal of Signal Processing and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)
- Published
- 2023
- Full Text
- View/download PDF
43. Improving graphics processing unit performance based on neural network direct memory access controller.
- Author
-
Kumar, Santosh, Neelappa, Bhusare, Saroja, and Yatnalli, Veeramma
- Subjects
CONVOLUTIONAL neural networks ,LONG-term memory ,RECURRENT neural networks ,GRAPHICS processing units ,BACK propagation - Abstract
This paper proposes the design and implementation of a back-propagation algorithm (BPA) based neural network direct memory access (DMA) controller for multimedia applications. The proposed DMA controller works with the back-propagation training algorithm. The advantage of the BPA is that it works on the gradient of the loss with respect to the network weights, so it is used as the training algorithm for the DMA controller. The proposed method is tested with different workload characteristics: heavy, medium, and normal workloads. The performance parameters considered are accuracy, precision, recall, and F1-score. The proposed method is compared with existing methods such as convolutional neural network (CNN), recurrent neural network (RNN), long short-term memory (LSTM), and gated recurrent unit (GRU). Finally, the proposed design gives better performance than the existing methods. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
44. Hardware Acceleration of Explainable AI
- Author
-
Pan, Zhixin and Mishra, Prabhat
- Published
- 2023
- Full Text
- View/download PDF
45. Explainable AI Acceleration Using Tensor Processing Units
- Author
-
Pan, Zhixin and Mishra, Prabhat
- Published
- 2023
- Full Text
- View/download PDF
46. Recipe for Fast Large-Scale SVM Training: Polishing, Parallelism, and More RAM!
- Author
-
Glasmachers, Tobias, Filipe, Joaquim, Editorial Board Member, Ghosh, Ashish, Editorial Board Member, Prates, Raquel Oliveira, Editorial Board Member, Zhou, Lizhu, Editorial Board Member, Calders, Toon, editor, Vens, Celine, editor, Lijffijt, Jefrey, editor, and Goethals, Bart, editor
- Published
- 2023
- Full Text
- View/download PDF
47. Accelerating Operations on Permutations Using Graphics Processing Units
- Author
-
Lavdanskyi, Artem, Faure, Emil, Skutskyi, Artem, Bazilo, Constantine, Xhafa, Fatos, Series Editor, Faure, Emil, editor, Danchenko, Olena, editor, Bondarenko, Maksym, editor, Tryus, Yurii, editor, Bazilo, Constantine, editor, and Zaspa, Grygoriy, editor
- Published
- 2023
- Full Text
- View/download PDF
48. AI Accelerators for Standalone Computer
- Author
-
Kim, Taewoo, Lee, Junyong, Jung, Hyeonseong, Kim, Shiho, Mishra, Ashutosh, editor, Cha, Jaekwang, editor, Park, Hyunbin, editor, and Kim, Shiho, editor
- Published
- 2023
- Full Text
- View/download PDF
49. Massive Parallelization Boosts Big Bayesian Multidimensional Scaling
- Author
-
Holbrook, Andrew J, Lemey, Philippe, Baele, Guy, Dellicour, Simon, Brockmann, Dirk, Rambaut, Andrew, and Suchard, Marc A
- Subjects
Mathematical Sciences ,Statistics ,Networking and Information Technology R&D (NITRD) ,Emerging Infectious Diseases ,Influenza ,Biodefense ,Pneumonia & Influenza ,Infectious Diseases ,2.2 Factors relating to the physical environment ,Bayesian phylogeography ,Graphics processing unit ,Hamiltonian Monte Carlo ,Massive parallelization ,Single-instruction ,multiple-data ,GPU ,SIMD ,stat.CO ,Econometrics ,Statistics & Probability - Abstract
Big Bayes is the computationally intensive co-application of big data and large, expressive Bayesian models for the analysis of complex phenomena in scientific inference and statistical learning. Standing as an example, Bayesian multidimensional scaling (MDS) can help scientists learn viral trajectories through space-time, but its computational burden prevents its wider use. Crucial MDS model calculations scale quadratically in the number of observations. We partially mitigate this limitation through massive parallelization using multi-core central processing units, instruction-level vectorization and graphics processing units (GPUs). Fitting the MDS model using Hamiltonian Monte Carlo, GPUs can deliver more than 100-fold speedups over serial calculations and thus extend Bayesian MDS to a big data setting. To illustrate, we employ Bayesian MDS to infer the rate at which different seasonal influenza virus subtypes use worldwide air traffic to spread around the globe. We examine 5392 viral sequences and their associated 14 million pairwise distances arising from the number of commercial airline seats per year between viral sampling locations. To adjust for shared evolutionary history of the viruses, we implement a phylogenetic extension to the MDS model and learn that subtype H3N2 spreads most effectively, consistent with its epidemic success relative to other seasonal influenza subtypes. Finally, we provide MassiveMDS, an open-source, stand-alone C++ library and rudimentary R package, and discuss program design and high-level implementation with an emphasis on important aspects of computing architecture that become relevant at scale.
- Published
- 2021
50. A GPU-enabled acceleration algorithm for the CAM5 cloud microphysics scheme.
- Author
-
Hong, Yan, Wang, Yuzhu, Zhang, Xuanying, Wang, Xiaocong, Zhang, He, and Jiang, Jinrong
- Subjects
- *
MICROPHYSICS , *METEOROLOGICAL research , *ATMOSPHERIC models , *ALGORITHMS , *GEOGRAPHIC names , *PARALLEL algorithms , *GRAPHICS processing units - Abstract
The National Center for Atmospheric Research released a global atmosphere model named Community Atmosphere Model version 5.0 (CAM5), which aimed to provide a global climate simulation for meteorological research. Among them, the cloud microphysics scheme is extremely time-consuming, so developing efficient parallel algorithms faces large-scale and chronic simulation challenges. Due to the wide application of GPU in the fields of science and engineering and the NVIDIA's mature and stable CUDA platform, we ported the code to GPU to accelerate computing. In this paper, by analyzing the parallelism of CAM5 cloud microphysical schemes (CAM5 CMS) in different dimensions, corresponding GPU-based one-dimensional (1D) and two-dimensional (2D) parallel acceleration algorithms are proposed. Among them, the 2D parallel algorithm exploits finer-grained parallelism. In addition, we present a data transfer optimization method between the CPU and GPU to further improve the overall performance. Finally, GPU version of the CAM5 CMS (GPU-CMS) was implemented. The GPU-CMS can obtain a speedup of 141.69 × on a single NVIDIA A100 GPU with I/O transfer. In the case without I/O transfer, compared to the baseline performance on a single Intel Xeon E5-2680 CPU core, the 2D acceleration algorithm obtained a speedup of 48.75 × , 280.11 × , and 507.18 × on a single NVIDIA K20, P100, and A100 GPU, respectively. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF