1,939 results for "Graphics processing unit"
Search Results
2. Simulation of Brownian Motion for Molecular Communications on a Graphics Processing Unit
- Author
-
Tian, Yun, Rogers, Uri, Cain, Tobias, Ji, Yanqing, Shen, Fangyang, Kacprzyk, Janusz, Series Editor, Pal, Nikhil R., Advisory Editor, Bello Perez, Rafael, Advisory Editor, Corchado, Emilio S., Advisory Editor, Hagras, Hani, Advisory Editor, Kóczy, László T., Advisory Editor, Kreinovich, Vladik, Advisory Editor, Lin, Chin-Teng, Advisory Editor, Lu, Jie, Advisory Editor, Melin, Patricia, Advisory Editor, Nedjah, Nadia, Advisory Editor, Nguyen, Ngoc Thanh, Advisory Editor, Wang, Jun, Advisory Editor, and Latifi, Shahram, editor
- Published
- 2020
- Full Text
- View/download PDF
3. Accelerating all-pairs shortest path algorithms for bipartite graphs on graphics processing units.
- Author
-
Hanif, Muhammad Kashif, Zimmermann, Karl-Heinz, and Anees, Asad
- Subjects
BIPARTITE graphs ,GRAPHICS processing units ,GRAPH algorithms ,PHYSICAL sciences ,UNDIRECTED graphs ,MATRIX multiplications - Abstract
Bipartite graphs are used to model and represent many real-world problems in the biological and physical sciences. Finding shortest paths in bipartite graphs is an important task with numerous applications. Different dynamic-programming-based solutions for finding shortest paths exist, differing in complexity and in the graph structures they target, and the computational complexity of these algorithms is a major concern. This work formulates parallel versions of the Floyd-Warshall and Torgasin-Zimmermann algorithms to compute shortest paths in bipartite graphs efficiently. These algorithms are mapped to the graphics processing unit using the tropical (min-plus) matrix product. The performance of different realizations and parameters is compared for the Floyd-Warshall and Torgasin-Zimmermann algorithms. The parallel implementation of the Torgasin-Zimmermann algorithm attained a speedup factor of almost 274 over the serial Floyd-Warshall algorithm on randomly generated undirected graphs. [ABSTRACT FROM AUTHOR] (An illustrative code sketch follows this record.)
- Published
- 2022
- Full Text
- View/download PDF
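The tropical product the authors map to the GPU is an ordinary matrix product with (min, +) in place of (×, +). A minimal CUDA sketch of one such product step, my own simplification rather than the paper's code:

```cuda
#include <cfloat>

// One min-plus (tropical) product: C[i][j] = min_k (A[i][k] + B[k][j]).
// With A = B = the current distance matrix, this closes all 2-hop routes.
__global__ void minPlusProduct(const float* A, const float* B, float* C, int n) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n || col >= n) return;
    float best = FLT_MAX;
    for (int k = 0; k < n; ++k)
        best = fminf(best, A[row * n + k] + B[k * n + col]);
    C[row * n + col] = best;  // cheapest route from row to col via any k
}
```

Repeatedly squaring the distance matrix with this kernel, about log2(n) times, yields all-pairs shortest paths; a shared-memory tiled variant (as sketched under record 25 below) is the usual next optimization.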
4. Graphics processing unit acceleration of the island model genetic algorithm using the CUDA programming platform.
- Author
-
Janssen, Dylan M., Pullan, Wayne, and Liew, Alan Wee‐Chung
- Subjects
GRAPHICS processing units ,GENETIC algorithms ,GENETIC models ,EVOLUTIONARY algorithms - Abstract
Genetic algorithms are a practical approach for finding near-optimal solutions to NP-hard problems. In this work we exploit the parallel processing capability of graphics processing units and NVIDIA's CUDA programming platform to accelerate the island model genetic algorithm, modifying the evolutionary operations to fit the hardware architecture, and achieve significant computational speedups. [ABSTRACT FROM AUTHOR] (An illustrative code sketch follows this record.)
- Published
- 2022
- Full Text
- View/download PDF
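One common GPU mapping for island-model GAs, sketched below under my own assumptions (the paper's operator modifications are not shown): each thread block evolves one island, and between epochs a migration kernel passes each island's best individual around a ring through global memory.

```cuda
// Ring migration: island i deposits its best genome into island (i+1)'s inbox.
// Launch with gridDim.x == nIslands; one block per island.
__global__ void migrateRing(const float* best, float* inbox,
                            int nIslands, int genomeLen) {
    int isl = blockIdx.x;
    int dst = (isl + 1) % nIslands;                 // ring topology
    for (int g = threadIdx.x; g < genomeLen; g += blockDim.x)
        inbox[dst * genomeLen + g] = best[isl * genomeLen + g];
}
```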
5. An Efficient Method for Defining Multivariate Functions Using Expression Templates for Arrays in C++ and CUDA
- Author
-
Hossein Mahmoodi Darian
- Subjects
expression templates ,variadic templates ,c++ ,graphics processing unit ,cuda ,Engineering design ,TA174 - Abstract
In this paper an efficient method for defining multivariate functions using expression templates for array computations in computational fluid dynamics simulations in C++ is introduced. The method is implemented using variadic templates, a relatively recent C++ feature. One of the advantages of the method is its ease of use for practitioners in computational fields: the user can define and use a function with any number of input arguments without knowing template programming concepts. The present method may replace conventional expression templates in developing numerical libraries. For three different functions, including arithmetic operations and trigonometric functions, the efficiency of the proposed method for arrays of different sizes is compared with that of conventional expression templates, two different C++ syntaxes, and Fortran. Furthermore, the performance of the method in terms of compilation time and executable file size is demonstrated. A similar comparison on Graphics Processing Units (GPUs) using CUDA is made and the efficiency of the method is shown. The results indicate that, for any array size, the present method performs very well in terms of computational time, compilation time, and executable file size. Finally, as an application of the proposed method, a numerical simulation is performed. (An illustrative code sketch follows this record.)
- Published
- 2018
- Full Text
- View/download PDF
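To illustrate the variadic-template interface described above, here is a deliberately minimal sketch of my own, far simpler than the paper's expression-template machinery: a single variadic function template applies a user-supplied n-ary function elementwise over any number of arrays, and the user never touches template internals.

```cuda
#include <cstdio>
#include <cmath>

// Apply f elementwise: out[i] = f(x0[i], x1[i], ..., xk[i]) for any arity.
template <typename F, typename... Arrays>
void apply(F f, double* out, int n, const Arrays*... xs) {
    for (int i = 0; i < n; ++i)
        out[i] = f(xs[i]...);   // pack expansion picks element i of each array
}

int main() {
    double a[4] = {1, 2, 3, 4}, b[4] = {4, 3, 2, 1}, c[4] = {.5, .5, .5, .5};
    double r[4];
    // A user-defined ternary function; no template knowledge required.
    apply([](double x, double y, double z) { return sin(x) * y + z; }, r, 4, a, b, c);
    for (double v : r) printf("%f\n", v);
    return 0;
}
```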
6. A VNS with Parallel Evaluation of Solutions for the Inverse Lighting Problem
- Author
-
Decia, Ignacio, Leira, Rodrigo, Pedemonte, Martín, Fernández, Eduardo, Ezzatti, Pablo, Hutchison, David, Series Editor, Kanade, Takeo, Series Editor, Kittler, Josef, Series Editor, Kleinberg, Jon M., Series Editor, Mattern, Friedemann, Series Editor, Mitchell, John C., Series Editor, Naor, Moni, Series Editor, Pandu Rangan, C., Series Editor, Steffen, Bernhard, Series Editor, Terzopoulos, Demetri, Series Editor, Tygar, Doug, Series Editor, Weikum, Gerhard, Series Editor, Squillero, Giovanni, editor, and Sim, Kevin, editor
- Published
- 2017
- Full Text
- View/download PDF
7. An automatic self‐initialized clustering method for brain tissue segmentation and pathology detection from magnetic resonance human head scans with graphics processing unit machine.
- Author
-
Thiruvenkadam, Kalaiselvi, Nagarajan, Kalaichelvi, and Padmanaban, Sriramakrishnan
- Subjects
GRAPHICS processing units ,MAGNETIC resonance ,WHITE matter (Nerve tissue) ,CEREBROSPINAL fluid ,PARALLEL programming ,TISSUES ,GRAY matter (Nerve tissue) - Abstract
The proposed work introduces a fully automatic modified fuzzy c-means (MFCM) algorithm for segmenting brain tissue into gray matter (GM), white matter (WM), and cerebrospinal fluid (CSF), and for identifying pathological conditions in magnetic resonance human head scans. The present work implements histogram smoothing using a Gaussian distribution to find the number of clusters (K) and the cluster centers (C) that initialize the MFCM algorithm. The modification incorporates the local influence of each pixel through the median of its local neighborhood. This increases the computational load, so a parallel programming environment such as a graphics processing unit is needed to keep processing times practical. The parallel MFCM is implemented with the Compute Unified Device Architecture (CUDA) language and reduces processing time by up to 80-fold over the serial MATLAB implementation and 20-fold over the C implementation. The method is evaluated on the Internet Brain Segmentation Repository (IBSR20) T1W dataset. The quantitative and qualitative results of the proposed method are compared with state-of-the-art methods using the Dice coefficient (DC). The proposed method yields a high DC of 0.84 ± 0.03 for GM, 0.83 ± 0.04 for WM, and 0.41 ± 0.12 for CSF segmentation. In post-processing, 3D volumes of the segmented regions are constructed and compared with the gold standard quantitatively and qualitatively. [ABSTRACT FROM AUTHOR] (An illustrative code sketch follows this record.)
- Published
- 2021
- Full Text
- View/download PDF
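The per-pixel neighborhood median is the data-parallel core that motivates the GPU port. A hedged CUDA sketch of that step alone, assuming a 3x3 window and clamped borders (my formulation, not the authors' code):

```cuda
// One thread per pixel: write the median of the 3x3 neighborhood of img.
__global__ void localMedian3x3(const float* img, float* med, int w, int h) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;
    float v[9]; int m = 0;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx) {
            int xx = min(max(x + dx, 0), w - 1);   // clamp at image borders
            int yy = min(max(y + dy, 0), h - 1);
            v[m++] = img[yy * w + xx];
        }
    for (int i = 1; i < 9; ++i) {                  // insertion sort of 9 values
        float key = v[i]; int j = i - 1;
        while (j >= 0 && v[j] > key) { v[j + 1] = v[j]; --j; }
        v[j + 1] = key;
    }
    med[y * w + x] = v[4];                         // middle element = median
}
```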
8. High performance bioinformatics and computational biology on general-purpose graphics processing units
- Author
-
Ling, Cheng, Benkrid, Khaled., and Erdogan, Ahmet
- Subjects
572.8 ,Graphics Processing Unit ,GPU ,CUDA ,Compute Unified Device Architecture ,HPC ,High Performance Computing - Abstract
Bioinformatics and Computational Biology (BCB) is a relatively new multidisciplinary field which brings together many aspects of biology, computer science, statistics, and engineering. Bioinformatics extracts useful information from biological data and makes it more intuitive and understandable by applying principles of information sciences, while computational biology harnesses computational approaches and technologies to answer biological questions conveniently. Recent years have seen an explosion in the size of biological data at a rate which outpaces the rate of increase in the computational power of mainstream computer technologies, namely general purpose processors (GPPs). The aim of this thesis is to explore the use of off-the-shelf Graphics Processing Unit (GPU) technology in the high-performance and efficient implementation of BCB applications in order to meet the demands of biological data growth at affordable cost. The thesis presents detailed designs and implementations of GPU solutions for a number of BCB algorithms in two widely used BCB applications, namely biological sequence alignment and phylogenetic analysis. Biological sequence alignment can be used to determine the potential information about a newly discovered biological sequence from other well-known sequences through similarity comparison. Phylogenetic analysis, on the other hand, is concerned with the investigation of the evolution and relationships among organisms, and has many uses in the fields of systems biology and comparative genomics. In molecular-based phylogenetic analysis, the relationship between species is estimated by inferring the common history of their genes, and phylogenetic trees are then constructed to illustrate evolutionary relationships among genes and organisms. However, both biological sequence alignment and phylogenetic analysis are computationally expensive applications, as their computing and memory requirements grow polynomially or even worse with the size of sequence databases. The thesis first presents a multi-threaded parallel design of the Smith-Waterman (SW) algorithm alongside an implementation on NVIDIA GPUs. A novel technique is put forward to remove the restriction on the length of the query sequence present in previous GPU-based implementations of the SW algorithm. Based on this implementation, the difference between the two main task parallelization approaches (inter-task and intra-task parallelization) is presented. The resulting GPU implementation matches the speed of existing GPU implementations while providing more flexibility, i.e. flexible sequence lengths in real-world applications. It also outperforms an equivalent GPP-based implementation by 15x-20x. After this, the thesis presents the first reported multi-threaded design and GPU implementation of the Gapped BLAST with Two-Hit method algorithm, which is widely used for aligning biological sequences heuristically; this achieved up to 3x speedup compared to the most optimised GPP implementations. The thesis then presents a multi-threaded design and GPU implementation of a Neighbor-Joining (NJ)-based method for phylogenetic tree construction and multiple sequence alignment (MSA), achieving 8x-20x speedup compared to an equivalent GPP implementation based on the widely used ClustalW software. The NJ method, however, only gives one possible tree, which strongly depends on the evolutionary model used.
A more advanced method uses maximum likelihood (ML) for scoring phylogenies with Markov Chain Monte Carlo (MCMC)-based Bayesian inference. The latter was the subject of another multi-threaded design and GPU implementation presented in this thesis, which achieved 4x-8x speedup compared to an equivalent GPP implementation based on the widely used MrBayes software. Finally, the thesis presents a general evaluation of the designs and implementations achieved in this work as a step towards the evaluation of GPU technology in BCB computing, in the context of other computer technologies including GPPs and Field Programmable Gate Array (FPGA) technology. (An illustrative code sketch follows this record.)
- Published
- 2012
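The inter-/intra-task distinction above rests on the anti-diagonal structure of Smith-Waterman: every cell on one anti-diagonal depends only on the two preceding diagonals, so a whole diagonal can be scored in parallel. A hedged sketch of intra-task parallelism with a simple linear gap penalty (`gap` negative); the affine-gap bookkeeping of real implementations is omitted:

```cuda
// Score anti-diagonal d (cells with i + j == d), reading diagonals d-1, d-2.
// Diagonal buffers are stored indexed by row i.
__global__ void swDiagonal(const char* q, int qLen, const char* s, int sLen,
                           const int* prev2, const int* prev1, int* curr,
                           int d, int gap, int match, int mismatch) {
    int i = blockIdx.x * blockDim.x + threadIdx.x + 1;  // row in [1, qLen]
    int j = d - i;                                      // column on diagonal d
    if (i > qLen || j < 1 || j > sLen) return;
    int sub = (q[i - 1] == s[j - 1]) ? match : mismatch;
    int val = prev2[i - 1] + sub;            // from cell (i-1, j-1)
    val = max(val, prev1[i - 1] + gap);      // from cell (i-1, j)
    val = max(val, prev1[i] + gap);          // from cell (i, j-1)
    curr[i] = max(val, 0);                   // local-alignment floor
}
```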
9. INVESTIGATION OF C++ VARIADIC TEMPLATES FOR NUMERICAL METHODS AND FINITE DIFFERENCE SCHEMES.
- Author
-
DARIAN, HOSSEIN MAHMOODI
- Subjects
- *
FINITE difference method , *COMPILERS (Computer programs) , *C++ , *FINITE differences , *GRAPHICS processing units , *RIEMANN-Hilbert problems - Abstract
In this paper we investigate the utilization of variadic templates for numerical methods and finite difference schemes. Specifically, an efficient method for defining multivariate functions in the framework of expression templates for array-based computations in C++ and CUDA is introduced. One of the advantages of the method is its ease of use for practitioners in computational fields: the user can define and use a function with any number of input arguments without knowing template programming. For three different functions, the efficiency of the proposed method for arrays of different sizes is compared with that of other implementations in C++ and with Fortran. A Roofline analysis is presented for the C++ implementations. Furthermore, for different compilers, the performance of the method in terms of compilation time, executable file size, and vectorization status is demonstrated. A similar comparison on graphics processing units (GPUs) using CUDA is made and the efficiency of the method is shown. The results indicate that, for any array size, the present method performs very well in terms of computational time, compilation time, and executable file size. The variadic templates are also utilized to define linear and nonlinear finite difference schemes. The performance of three finite difference schemes is compared with that of a plain C implementation; the results show that the proposed method for nonlinear schemes matches the performance of the plain C implementation. Finally, as practical applications, two numerical simulations are carried out: the discharge process of a lead-acid battery cell on a CPU, and a two-dimensional Riemann problem using high-order weighted ENO (WENO) schemes on a GPU. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
10. Fitness evaluation reuse for accelerating GPU-based evolutionary induction of decision trees.
- Author
-
Jurczuk, Krzysztof, Czajkowski, Marcin, and Kretowski, Marek
- Subjects
- *
DECISION trees , *DATA mining , *GRAPHICS processing units , *BIOLOGICAL evolution , *BIG data , *TECHNOLOGICAL progress - Abstract
Decision trees (DTs) are one of the most popular white-box machine-learning techniques. Traditionally, DTs are induced using a top-down greedy search that may lead to sub-optimal solutions. One of the emerging alternatives is evolutionary induction, inspired by biological evolution. It searches for the tree structure and the node tests simultaneously, which results in less complex DTs with at least comparable prediction performance. However, the evolutionary search is computationally expensive, and its effective application to big data mining needs algorithmic and technological progress. In this paper, noting that many trees or their parts reappear during the evolution, we propose a reuse strategy. A fixed number of recently processed individuals (DTs) is stored in a so-called repository. The part of each repository entry related to fitness calculations is kept on the CPU side to limit CPU/GPU memory transactions; the rest (tree structures) resides on the GPU side to speed up the search for similar DTs. As the most time-demanding task of the induction is the DTs' evaluation, the GPU first searches the repository for similar DTs to reuse; only if this fails does it evaluate the DT from the ground up. Large artificial and real-life datasets and various repository strategies are tested. Results show that reusing information from previous generations can further accelerate the original GPU-based solution, especially for large-scale data. To give an idea of the overall acceleration scale, the proposed solution can process even billions of objects in a few hours on a single GPU workstation. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
11. A highly-efficient locally encoded boundary scheme for lattice Boltzmann method on GPU.
- Author
-
Zhang, Zehua, Peng, Cheng, Li, Chengxiang, Zhang, Hua, Xian, Tao, and Wang, Lian-Ping
- Subjects
- *
LATTICE Boltzmann methods , *FLOW simulations , *FLUID flow , *THREE-dimensional flow , *POROUS materials - Abstract
The lattice Boltzmann method (LBM) is an algorithm for simulating fluid flows whose locality and simplicity make it well suited to GPU acceleration and to the simulation of complex flows. However, LBM simulations involving complex solid boundaries require each boundary node to know the types of all its neighbor nodes (fluid or solid) while executing boundary conditions, which entails substantial data transfer between global and local memory on the GPU. These transfer operations consume a large share of the run time and can significantly affect simulation efficiency. This article proposes a novel boundary processing scheme that encodes the neighbor nodes' information into a single integer stored on the local node. We choose two- and three-dimensional porous-medium flows to test the performance of the proposed scheme on complex boundary geometries and compare it with the usual schemes that redundantly retrieve information from neighbors. The comparison shows that our proposed scheme can improve overall computing efficiency by up to 40% for 3D flow simulations through porous media, the improvement coming from reduced time spent on data transfer. • Novel scheme encodes neighbor node info into single integers by binary encoding and stores them locally. • The single integer includes local and neighboring node types and boundary condition types. • Proposed scheme improves GPU efficiency by up to 40% for 3D simulation of flow through porous media. • The approach can be combined with other techniques, such as indirect addressing, to enhance performance further. [ABSTRACT FROM AUTHOR] (An illustrative code sketch follows this record.)
- Published
- 2024
- Full Text
- View/download PDF
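The encoding itself is a small bitmask; the sketch below is my own minimal layout for a D2Q9-style lattice (one bit per neighbor direction), whereas the paper's format also packs boundary-condition types into the same word.

```cuda
// Precompute, per boundary node: set bit q iff the neighbor in direction q is solid.
__host__ __device__ inline int encodeNeighbors(const int* type, int idx,
                                               const int* nbrOffset) {
    int code = 0;
    for (int q = 0; q < 8; ++q)
        if (type[idx + nbrOffset[q]] == 1)   // 1 = solid node
            code |= (1 << q);
    return code;
}

// In the boundary kernel, a single integer load answers every neighbor query,
// replacing eight scattered reads of the neighbor-type array.
__device__ inline bool neighborIsSolid(int code, int q) {
    return (code >> q) & 1;
}
```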
12. Fast Hybrid BSA-DE-SA Algorithm on GPU
- Author
-
Brévilliers, Mathieu, Abdelkafi, Omar, Lepagnot, Julien, Idoumghar, Lhassane, Hutchison, David, Series editor, Kanade, Takeo, Series editor, Kittler, Josef, Series editor, Kleinberg, Jon M., Series editor, Mattern, Friedemann, Series editor, Mitchell, John C., Series editor, Naor, Moni, Series editor, Pandu Rangan, C., Series editor, Steffen, Bernhard, Series editor, Terzopoulos, Demetri, Series editor, Tygar, Doug, Series editor, Weikum, Gerhard, Series editor, Siarry, Patrick, editor, Idoumghar, Lhassane, editor, and Lepagnot, Julien, editor
- Published
- 2016
- Full Text
- View/download PDF
13. Toward Large-Scale Evolutionary Multitasking: A GPU-Based Paradigm
- Author
-
Meng Chen, Liang Feng, Kay Chen Tan, A. K. Qin, and Yuxiao Huang
- Subjects
Continuous optimization ,Optimization problem ,Computer science ,business.industry ,Distributed computing ,Graphics processing unit ,Cloud computing ,Theoretical Computer Science ,CUDA ,Computational Theory and Mathematics ,Human multitasking ,Central processing unit ,Explicit knowledge ,business ,Software - Abstract
Evolutionary Multi-Tasking (EMT), which shares knowledge across multiple tasks while the optimization progresses online, has demonstrated superior performance in both optimization quality and convergence speed over its single-task counterpart in solving complex optimization problems. However, most existing EMT algorithms handle only two tasks simultaneously. As the computational cost incurred in the evolutionary search and knowledge transfer increases rapidly with the number of optimization tasks, these EMT algorithms cannot meet today's requirements for optimization services on the cloud in many real-world applications, where hundreds or thousands of optimization requests (labelled large-scale EMT) are often received simultaneously and must be optimized in a short time. Recently, Graphics Processing Unit (GPU) computing has attracted extensive attention for accelerating applications with large data volumes that are traditionally handled by the Central Processing Unit (CPU). Taking this cue, in this paper we propose a new EMT paradigm for large-scale EMT, based on the island model with the Compute Unified Device Architecture (CUDA), which can handle a large number of continuous optimization tasks efficiently and effectively. Moreover, under the proposed paradigm, we develop GPU-based implicit and explicit knowledge transfer mechanisms for EMT. To evaluate the performance of the proposed paradigm, comprehensive empirical studies have been conducted against its CPU-based counterpart on large-scale EMT.
- Published
- 2022
14. Level 2 Reformulation Linearization Technique–Based Parallel Algorithms for Solving Large Quadratic Assignment Problems on Graphics Processing Unit Clusters.
- Author
-
Date, Ketan and Nagi, Rakesh
- Subjects
- *
QUADRATIC assignment problem , *PARALLEL algorithms , *ASSIGNMENT problems (Programming) , *MULTICORE processors , *GRAPHICS processing units , *PARALLEL programming , *PARALLEL processing , *COLLEGE facilities - Abstract
This paper discusses efficient parallel algorithms for obtaining strong lower bounds and exact solutions for large instances of the quadratic assignment problem (QAP). Our parallel architecture comprises both multicore processors and compute unified device architecture-enabled NVIDIA graphics processing units (GPUs) on the Blue Waters Supercomputing Facility at the University of Illinois at Urbana-Champaign. We propose a novel parallelization of the Lagrangian dual ascent algorithm on the GPUs, which is used for solving a QAP formulation based on the level-2 reformulation linearization technique. The linear assignment subproblems in this procedure are solved using our accelerated Hungarian algorithm [Date K, Rakesh N (2016) GPU-accelerated Hungarian algorithms for the linear assignment problem. Parallel Computing 57:52-72]. We embed this accelerated dual-ascent algorithm in a parallel branch-and-bound scheme and conduct extensive computational experiments on single and multiple GPUs, using problem instances with up to 42 facilities from the quadratic assignment problem library (QAPLIB). The experiments suggest that our GPU-based approach is scalable and can be used to obtain tight lower bounds on large QAP instances. Our accelerated branch-and-bound scheme comfortably solves Nugent and Taillard instances (up to 30 facilities) from the QAPLIB using a modest number of GPUs. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
15. Fully Bayesian Analysis of RNA-seq Counts for the Detection of Gene Expression Heterosis.
- Author
-
Landau, Will, Niemi, Jarad, and Nettleton, Dan
- Subjects
- *
MONTE Carlo method , *HETEROSIS , *BAYESIAN analysis , *MARKOV chain Monte Carlo , *GENE expression , *NULL hypothesis - Abstract
Heterosis, or hybrid vigor, is the enhancement of the phenotype of hybrid progeny relative to their inbred parents. Heterosis is extensively used in agriculture, and the underlying mechanisms are unclear. To investigate the molecular basis of phenotypic heterosis, researchers search tens of thousands of genes for heterosis with respect to expression in the transcriptome. Difficulty arises in the assessment of heterosis due to composite null hypotheses and nonuniform distributions for p-values under these null hypotheses. Thus, we develop a general hierarchical model for count data and a fully Bayesian analysis in which an efficient parallelized Markov chain Monte Carlo algorithm ameliorates the computational burden. We use our method to detect gene expression heterosis in a two-hybrid plant-breeding scenario, both in a real RNA-seq maize dataset and in simulation studies. In the simulation studies, we show our method has well-calibrated posterior probabilities and credible intervals when the model assumed in analysis matches the model used to simulate the data. Although model misspecification can adversely affect calibration, the methodology is still able to accurately rank genes. Finally, we show that hyperparameter posteriors are extremely narrow and an empirical Bayes (eBayes) approach based on posterior means from the fully Bayesian analysis provides virtually equivalent posterior probabilities, credible intervals, and gene rankings relative to the fully Bayesian solution. This evidence of equivalence provides support for the use of eBayes procedures in RNA-seq data analysis if accurate hyperparameter estimates can be obtained. Supplementary materials for this article are available online. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
16. Candidate Set Parallelization Strategies for Ant Colony Optimization on the GPU
- Author
-
Dawson, Laurence, Stewart, Iain A., Hutchison, David, editor, Kanade, Takeo, editor, Kittler, Josef, editor, Kleinberg, Jon M., editor, Mattern, Friedemann, editor, Mitchell, John C., editor, Naor, Moni, editor, Nierstrasz, Oscar, editor, Pandu Rangan, C., editor, Steffen, Bernhard, editor, Sudan, Madhu, editor, Terzopoulos, Demetri, editor, Tygar, Doug, editor, Vardi, Moshe Y., editor, Weikum, Gerhard, editor, Kołodziej, Joanna, editor, Di Martino, Beniamino, editor, Talia, Domenico, editor, and Xiong, Kaiqi, editor
- Published
- 2013
- Full Text
- View/download PDF
17. An image generator based on neural networks in GPU
- Author
-
Halamo Reis, Thiago W. Silva, Elmar U. K. Melcher, Alisson V. Brito, and Antonio Marcus Nogueira Lima
- Subjects
Distributed Computing Environment ,Artificial neural network ,Computer Networks and Communications ,Computer science ,ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION ,Graphics processing unit ,computer.software_genre ,Image (mathematics) ,Set (abstract data type) ,CUDA ,Hardware and Architecture ,Media Technology ,Data mining ,General-purpose computing on graphics processing units ,computer ,Software ,Generator (mathematics) - Abstract
Existing image databases contain little diversity of images, and in many situations no suitable image base is available at all, so additional effort must be spent capturing images and creating datasets. Many of these datasets contain only a single object in each image, yet the scenarios in which projects must operate in production often require several objects per image. Thus, it is necessary to expand original datasets into more complex ones with specific combinations to achieve the goal of the application. This work proposes a technique for image generation that extends an initial dataset. It is designed generically, so that it works with various kinds of images and can create a dataset from a few initial images. The generated image set is used in a distributed environment, in which datasets with specific images can be produced for particular applications. The generation of images combines two methods: generation by deformation and generation by a neural network. The main contributions of this work are the specification and implementation of an image-generating component that can be easily integrated with heterogeneous devices capable of parallel computing, such as General Purpose Graphics Processing Units (GPGPUs). In contrast to existing methods, the proposed approach uses the image generator to enlarge an initial image bank through the combination of the two methods. Experiments generating handwritten digits are presented to validate the proposed approach. The generator was designed with CUDA and GPU-optimized libraries such as TensorFlow-specific modules. The results can optimize the integration process by simulating possible stimulus choices, avoiding problems in the image-generation test phase.
- Published
- 2021
18. Real-Time Nonlinear Finite Element Computations on GPU: Handling of Different Element Types
- Author
-
Joldes, Grand R., Wittek, Adam, Miller, Karol, Wittek, Adam, editor, Nielsen, Poul M.F., editor, and Miller, Karol, editor
- Published
- 2011
- Full Text
- View/download PDF
19. Performance evaluation of GPU- and cluster-computing for parallelization of compute-intensive tasks
- Author
-
Peter Mandl, Alexander Döschl, and Max-Emanuel Keller
- Subjects
CUDA ,Computer Networks and Communications ,Distributed algorithm ,Computer science ,Computer cluster ,Scalability ,Spark (mathematics) ,Graphics processing unit ,Brute-force search ,Parallel computing ,General-purpose computing on graphics processing units ,Information Systems - Abstract
Purpose: This paper aims to evaluate different approaches for the parallelization of compute-intensive tasks. The study compares a Java multi-threaded algorithm, distributed computing solutions with the MapReduce (Apache Hadoop) and resilient distributed dataset (RDD) (Apache Spark) paradigms, and a graphics processing unit (GPU) approach with Numba for the compute unified device architecture (CUDA). Design/methodology/approach: The paper uses a simple but computationally intensive puzzle as a case study for experiments. To find all solutions using brute-force search, 15! permutations had to be computed and tested against the solution rules. The implementations were benchmarked on Amazon EC2 instances for performance and scalability measurements. Findings: The comparison of the Apache Hadoop and Apache Spark solutions under Amazon EMR showed that the processing time measured in CPU minutes with Spark was up to 30% lower, with Spark benefiting especially from an increasing number of tasks. With the CUDA implementation, more than 16 times faster execution is achievable for the same price compared to the Spark solution. Apart from the multi-threaded implementation, the processing times of all solutions scale approximately linearly. Finally, several application suggestions for the different parallelization approaches are derived from the insights of this study. Originality/value: Numerous studies have examined the performance of parallelization approaches, most of them dealing with processing large amounts of data or mathematical problems. This work, in contrast, compares these technologies on their ability to implement computationally intensive distributed algorithms. (An illustrative code sketch follows this record.)
- Published
- 2021
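Brute-forcing 15! candidates maps naturally onto one permutation per GPU thread: each thread decodes its global index into a distinct permutation via the factorial number system, so no thread needs to communicate. A sketch under my own assumptions; `checkSolution` is a hypothetical placeholder for the puzzle's rule test:

```cuda
#include <cstdint>

// Decode rank r into the r-th permutation of {0, ..., n-1} (Lehmer code).
__device__ void unrank(uint64_t r, int n, int* perm) {
    int pool[16];
    for (int i = 0; i < n; ++i) pool[i] = i;
    for (int i = 0; i < n; ++i) {
        uint64_t f = 1;
        for (int k = 2; k <= n - 1 - i; ++k) f *= k;   // (n-1-i)!
        int pick = (int)(r / f);
        r %= f;
        perm[i] = pool[pick];
        for (int k = pick; k < n - 1 - i; ++k) pool[k] = pool[k + 1];
    }
}

__global__ void search(uint64_t base, uint64_t total, int n, int* hits) {
    uint64_t idx = base + blockIdx.x * (uint64_t)blockDim.x + threadIdx.x;
    if (idx >= total) return;
    int perm[16];
    unrank(idx, n, perm);
    // if (checkSolution(perm, n)) atomicAdd(hits, 1);  // hypothetical rule test
}
```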
20. Development of High-Performance Algorithms for the Segmentation of Fundus Images Using a Graphics Processing Unit
- Author
-
N. Yu. Ilyasova, N. S. Demin, and Aleksandr Shirokanev
- Subjects
CUDA ,Speedup ,Computer science ,Pattern recognition (psychology) ,Key (cryptography) ,Graphics processing unit ,Parallel algorithm ,Segmentation ,Computer Vision and Pattern Recognition ,Fundus (eye) ,Computer Graphics and Computer-Aided Design ,Algorithm ,eye diseases - Abstract
Diabetic retinopathy is a dangerous fundus disease that leads to irreversible loss of vision; with untimely or incorrect treatment, blindness results. Currently, laser coagulation is a common treatment method: an ophthalmologist uses a laser to apply a series of burns to the retina, and the success of the operation depends entirely on the experience of the doctor. Automatically forming a preliminary plan of coagulates addresses several problems related to the operation, such as lengthy manual placement of coagulates and adjustment of laser power; the probability of a doctor's error is reduced, and the preparation time for the operation shrinks significantly. One of the key stages in forming the plan is the segmentation of the fundus image. This stage relies on texture features whose computation takes a long time. Accordingly, this study proposes high-performance algorithms for the segmentation of fundus images using CUDA technology, which significantly speed up the sequential versions and outperform existing parallel algorithms.
- Published
- 2021
21. A GPU-Accelerated Modified Unsharp-Masking Method for High-Frequency Background- Noise Suppression
- Author
-
Chi-Kuang Sun and Bhaskar Jyoti Borah
- Subjects
General Computer Science ,Computer science ,Graphics processing unit ,02 engineering and technology ,Background noise ,CUDA ,Signal-to-noise ratio ,0202 electrical engineering, electronic engineering, information engineering ,General Materials Science ,Computer vision ,Electrical and Electronic Engineering ,Noise measurement ,High-frequency noise cancellation ,business.industry ,Noise (signal processing) ,General Engineering ,life-science imaging ,020207 software engineering ,TK1-9971 ,unsharp-masking ,CUDA-acceleration ,Analog signal ,020201 artificial intelligence & image processing ,Electrical engineering. Electronics. Nuclear engineering ,Artificial intelligence ,business ,Unsharp masking - Abstract
A digitized analog signal often carries a high-frequency noisy background that degrades the signal-to-noise ratio (SNR), particularly when signal strength is low. Although many hardware- and software-based approaches have been reported to date, it remains challenging to retrieve noise-contaminated low-frequency information efficiently in real time without degrading the original bandwidth. In this paper, we report a modified unsharp-masking (UM)-based, Graphics Processing Unit (GPU)-accelerated algorithm to efficiently suppress a high-frequency noisy background in a digitized two-dimensional image. The proposed idea works effectively even when the noise density is high and the signal of interest is comparable to or weaker than the maximum noise level; while suppressing the noisy background, the original resolution remains least compromised. We first explore the effectiveness of the algorithm on simulated images and then extend the demonstration to a real-world life-science imaging application. To secure real-time applicability, we implement the algorithm with Compute Unified Device Architecture (CUDA) acceleration and preserve a <300 μs processing time for a 1000×1000-sized 8-bit data set. (An illustrative code sketch follows this record.)
- Published
- 2021
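For orientation, this is the classical unsharp-masking decomposition with the high-frequency residual attenuated (gain < 1) instead of amplified as in sharpening; the paper's actual modification differs in detail, so treat this as a hedged sketch only.

```cuda
// One thread per pixel: split into low-pass + residual, then damp the residual.
__global__ void umSuppress(const float* in, float* out, int w, int h, float gain) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;
    float low = 0.0f;
    for (int dy = -1; dy <= 1; ++dy)               // 3x3 box blur = low-pass part
        for (int dx = -1; dx <= 1; ++dx) {
            int xx = min(max(x + dx, 0), w - 1);
            int yy = min(max(y + dy, 0), h - 1);
            low += in[yy * w + xx];
        }
    low /= 9.0f;
    float residual = in[y * w + x] - low;          // carries the HF noise
    out[y * w + x] = low + gain * residual;        // gain in [0,1) suppresses it
}
```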
23. Efficient Nearest-Neighbor Data Sharing in GPUs
- Author
-
Mohammad Sadrosadati, Babak Falsafi, Hajar Falahati, Negin Nematollahi, Hamid Sarbazi-Azad, Marzieh Barkhordar, and Mario Drumond
- Subjects
Data sharing ,CUDA ,Hardware and Architecture ,Stencil code ,Computer science ,Data exchange ,Graphics processing unit ,Redundancy (engineering) ,Benchmark (computing) ,Parallel computing ,General-purpose computing on graphics processing units ,Software ,Information Systems - Abstract
Stencil codes (a.k.a. nearest-neighbor computations) are widely used in image processing, machine learning, and scientific applications. Stencil codes incur nearest-neighbor data exchange because the value of each point in the structured grid is calculated as a function of its own value and the values of a subset of its nearest-neighbor points. When running on Graphics Processing Units (GPUs), stencil codes exhibit a high degree of data sharing between nearest-neighbor threads. Sharing is typically implemented through shared memories, shuffle instructions, and on-chip caches, and it often incurs performance overheads due to redundancy in memory accesses. In this article, we propose Neighbor Data (NeDa), a direct nearest-neighbor data sharing mechanism that uses two registers embedded in each streaming processor (SP) that can be accessed by nearest-neighbor SP cores. The registers are compiler-allocated and serve as a data exchange mechanism that eliminates nearest-neighbor shared accesses. NeDa is embedded carefully with local wires between SP cores so as to minimize the impact on density. We place and route NeDa in an open-source GPU and show a small area overhead of 1.3%. Cycle-accurate simulation indicates an average performance improvement of 21.8% and a power reduction of up to 18.3% for stencil codes in General-Purpose Graphics Processing Unit (GPGPU) standard benchmark suites. We show that NeDa's performance is within 13.2% of an ideal GPU with no overhead for nearest-neighbor data exchange. (An illustrative code sketch follows this record.)
- Published
- 2020
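NeDa itself is a hardware mechanism, but the software baseline it competes with, shuffle-based neighbor exchange, looks roughly like this (my own minimal 1-D three-point stencil; warp-edge lanes fall back to global loads, and the outermost grid points effectively clamp to their own value):

```cuda
// Each thread holds one grid point and borrows neighbors from adjacent lanes.
__global__ void stencil1D(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (i < n) ? in[i] : 0.0f;
    float left  = __shfl_up_sync(0xffffffff, v, 1);   // value from lane - 1
    float right = __shfl_down_sync(0xffffffff, v, 1); // value from lane + 1
    int lane = threadIdx.x & 31;
    if (lane == 0  && i > 0)     left  = in[i - 1];   // warp-boundary fixup
    if (lane == 31 && i + 1 < n) right = in[i + 1];
    if (i < n) out[i] = 0.25f * left + 0.5f * v + 0.25f * right;
}
```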
24. El requisito de sostenibilidad en una aplicación de medición de impacto [The sustainability requirement in an impact-measurement application]
- Author
-
Rodriguez Redondo, David, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Gil, Marisa
- Subjects
CPU (Unitat Central de Processament) ,Docker ,CPU (Central Processing Unit) ,Application ,Performance ,Infografia ,NoSQL ,CUDA ,GPU (Graphics Processing Unit) ,GPU (Unitat de Processament Gràfic) ,Database ,Bases de dades ,Databases ,Energy efficiency ,Redis ,Eficiencia energètica ,Memory ,Base de Dades ,Rendiment ,Graphics Processing Unit ,Aplicació ,Memòria ,Python - Abstract
The aim of this project is to answer the question of the viability of running an application, whose main tasks consist of communicating with a database, in different environments. Traditionally, the execution of a program is managed by the Central Processing Unit (CPU). Nonetheless, Graphics Processing Units (GPUs) have shown great potential for executing more complex tasks or tasks that require very heavy data processing. This project observes the implications of running the same application in a scenario where most tasks are executed by a GPU. All in all, energy usage and performance implications are considered to determine whether this transition is really a viable option.
- Published
- 2022
25. Accelerating Viterbi algorithm on graphics processing units.
- Author
-
Hanif, Muhammad and Zimmermann, Karl-Heinz
- Subjects
- *
HIDDEN Markov models , *VITERBI decoding , *GRAPHICS processing units , *MATRICES (Mathematics) , *CUDA (Computer architecture) , *COMPUTATIONAL complexity - Abstract
The Viterbi algorithm is used in different scientific applications, including biological sequence alignment, speech recognition, and probabilistic inference. However, the high computational complexity of the Viterbi algorithm is a major concern, and accelerating it is important, especially when the number of states or the length of the sequences increases significantly. In this paper, a parallel solution to improve the performance of the Viterbi algorithm is presented. This is achieved by formulating a matrix-product-based algorithm, which has been mapped to an NVIDIA graphics processing unit. The performance for different parameters and realizations is compared. The results show that the matrix product is not a viable option for a small number of states; however, for a large number of states, the matrix-product solution using shared memory gains good performance compared with the serial version. [ABSTRACT FROM AUTHOR] (An illustrative code sketch follows this record.)
- Published
- 2017
- Full Text
- View/download PDF
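In log-probability space, one Viterbi step is a matrix product over the (max, +) semiring, so the shared-memory variant mentioned above is ordinary tiled GEMM with the multiply-add swapped out. A hedged sketch, assuming a 16x16 tile and a large negative value as the semiring zero:

```cuda
#define T 16  // launch with blockDim = (T, T)

__global__ void maxPlusTiled(const float* A, const float* B, float* C, int n) {
    __shared__ float sA[T][T], sB[T][T];
    int row = blockIdx.y * T + threadIdx.y;
    int col = blockIdx.x * T + threadIdx.x;
    float best = -1e30f;                       // semiring "zero"
    for (int t = 0; t < (n + T - 1) / T; ++t) {
        int ak = t * T + threadIdx.x, bk = t * T + threadIdx.y;
        sA[threadIdx.y][threadIdx.x] = (row < n && ak < n) ? A[row * n + ak] : -1e30f;
        sB[threadIdx.y][threadIdx.x] = (bk < n && col < n) ? B[bk * n + col] : -1e30f;
        __syncthreads();
        for (int k = 0; k < T; ++k)            // (max, +) replaces (+, *)
            best = fmaxf(best, sA[threadIdx.y][k] + sB[k][threadIdx.x]);
        __syncthreads();
    }
    if (row < n && col < n) C[row * n + col] = best;
}
```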
26. GPU accelerated implementation of NCI calculations using promolecular density.
- Author
-
Rubez, Gaëtan, Etancelin, Jean‐Matthieu, Vigouroux, Xavier, Krajecki, Michael, Boisson, Jean‐Charles, and Hénon, Eric
- Subjects
- *
DENSITOMETERS , *GRAPHICS processing units , *MOTHERBOARDS , *COMPILERS (Computer programs) , *LIGANDS (Chemistry) - Abstract
The NCI approach is a modern tool for revealing noncovalent chemical interactions and is particularly attractive for describing ligand-protein binding. A custom implementation of NCI using promolecular density is presented, designed to leverage the computational power of NVIDIA graphics processing unit (GPU) accelerators through the CUDA programming model. The performance of three code versions is examined on a test set of 144 systems. NCI calculations are particularly well suited to the GPU architecture, which drastically reduces the computational time: on a single compute node, the dual-GPU version delivers a 39-fold improvement for the biggest instance compared to the optimal OpenMP parallel run (C code, icc compiler) with 16 CPU cores. Energy consumption measurements carried out on both CPU and GPU NCI tests show that the GPU approach provides substantial energy savings. © 2017 Wiley Periodicals, Inc. [ABSTRACT FROM AUTHOR]
- Published
- 2017
- Full Text
- View/download PDF
27. Toward Optimal Computation of Ultrasound Image Reconstruction Using CPU and GPU.
- Author
-
Udomchai Techavipoo, Denchai Worasawate, Wittawat Boonleelakul, Rachaporn Keinprasit, Treepop Sunpetchniyom, Nobuhiko Sugino, and Pairash Thajchayapong
- Subjects
- *
ULTRASONIC imaging , *SIGNAL processing , *BEAMFORMING , *TIME delay systems , *IMAGE reconstruction , *CENTRAL processing units , *GRAPHICS processing units - Abstract
An ultrasound image is reconstructed from echo signals received by the array elements of a transducer. The time of flight of an echo depends on the distance from the focus to each array element, so the received echo signals must be delayed to make their wavefronts phase-coherent before summation. In digital beamforming, the required delays do not always fall on the sampled points. Commonly, the values of the delayed signals are estimated by the values of the nearest samples; this method is fast and easy but inaccurate. Other methods can increase the accuracy of the delayed signals and, consequently, the quality of the beamformed signals; for example, in-phase (I)/quadrature (Q) interpolation is more time-consuming but provides more accurate values than the nearest samples. This paper compares signals after dynamic receive beamforming in which the echo signals are delayed by two methods: the nearest-sample method and I/Q interpolation. Comparisons of the visual quality of the reconstructed images and the quality of the beamformed signals are reported. Moreover, the computational speed of these methods is optimized by reorganizing the data-processing flow and by applying the graphics processing unit (GPU); the use of single- and double-precision floating-point formats for intermediate data is also considered, and speeds with and without these optimizations are compared. [ABSTRACT FROM AUTHOR] (An illustrative code sketch follows this record.)
- Published
- 2016
- Full Text
- View/download PDF
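A fractional delay on baseband I/Q data amounts to interpolating the complex envelope and rotating its phase at the demodulation frequency. The sketch below is my own formulation of that idea, not the paper's code; `fs` is the sampling rate and `f0` the demodulation frequency.

```cuda
// One thread per output sample: delay I/Q data by a fractional sample count.
__global__ void delayIQ(const float2* iq, float2* out, const float* delay,
                        int n, float fs, float f0) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float t = delay[i] * fs;                   // delay in (fractional) samples
    int k = (int)floorf(t);
    float frac = t - k;
    if (k < 0 || k + 1 >= n) { out[i] = make_float2(0.f, 0.f); return; }
    // Linear interpolation of the complex envelope between samples k and k+1.
    float I = (1.f - frac) * iq[k].x + frac * iq[k + 1].x;
    float Q = (1.f - frac) * iq[k].y + frac * iq[k + 1].y;
    // Phase rotation restores the carrier phase removed by demodulation.
    float ph = 2.f * 3.14159265f * f0 * delay[i];
    float c = cosf(ph), s = sinf(ph);
    out[i] = make_float2(I * c - Q * s, I * s + Q * c);
}
```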
28. GPU Programming Productivity in Different Abstraction Paradigms
- Author
-
Philip Merlin Uesbeck, Patrick Daleiden, and Andreas Stefik
- Subjects
Coprocessor ,General Computer Science ,Programming language ,Computer science ,Graphics processing unit ,computer.software_genre ,Supercomputer ,Assignment ,Education ,Computer graphics ,CUDA ,General-purpose computing on graphics processing units ,Programmer ,computer - Abstract
Coprocessor architectures in High Performance Computing are prevalent in today's scientific computing clusters and require specialized knowledge for proper utilization. Various alternative paradigms for parallel and offload computation exist, but little is known about the human-factors impacts of using the different paradigms. With computer science student participants from the University of Nevada, Las Vegas, who had no previous exposure to Graphics Processing Unit programming, our study compared NVIDIA CUDA C/C++ as a control group against the Thrust library, whose designers claim that its higher level of abstraction enhances programmer productivity. The trial was conducted with 91 participants and administered through our computerized testing platform. Although the study was narrowly focused on the basic steps of an offloaded computation problem and was not intended as a comprehensive evaluation of the superiority of either approach, we found evidence that although Thrust was designed for ease of use, its abstractions tended to confuse students and in several cases diminished productivity. Specifically, the Thrust abstractions for (i) memory allocation through a C++ Standard Template Library-style vector, (ii) memory transfers between the host and the GPU coprocessor through an overloaded assignment operator, and (iii) execution of an offloaded routine through a generic transform library call instead of a CUDA kernel routine all performed either equal to or worse than CUDA. (An illustrative code sketch follows this record.)
- Published
- 2020
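For readers unfamiliar with the three Thrust abstractions under test, this compilable sketch shows them together; it illustrates the idioms named in the abstract, not the study's actual task code.

```cuda
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/functional.h>

int main() {
    thrust::host_vector<float> h(1 << 20, 1.5f);
    thrust::device_vector<float> d = h;          // (i) STL-style allocation and
                                                 // (ii) host->device copy by assignment
    thrust::device_vector<float> r(d.size());
    thrust::transform(d.begin(), d.end(), r.begin(),
                      thrust::negate<float>());  // (iii) generic transform call
                                                 //       in place of a CUDA kernel
    h = r;                                       // device->host copy by assignment
    return 0;
}
```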
29. GPU acceleration of ADMM for large-scale quadratic programming
- Author
-
John Lygeros, Michel Schubiger, and Goran Banjac
- Subjects
Computer Networks and Communications ,Computer science ,Graphics processing unit ,02 engineering and technology ,Parallel computing ,Quadratic programming ,GPU computing ,Alternating direction method of multipliers ,Theoretical Computer Science ,CUDA ,Artificial Intelligence ,FOS: Mathematics ,0202 electrical engineering, electronic engineering, information engineering ,Mathematics - Optimization and Control ,Parallelizable manifold ,020206 networking & telecommunications ,Solver ,Optimization and Control (math.OC) ,Hardware and Architecture ,Feature (computer vision) ,Convex optimization ,020201 artificial intelligence & image processing ,Software - Abstract
The alternating direction method of multipliers (ADMM) is a powerful operator-splitting technique for solving structured convex optimization problems. Thanks to its relatively low per-iteration computational cost and its ability to exploit sparsity in the problem data, it is particularly suitable for large-scale optimization; however, the method may still take prohibitively long to compute solutions to very large problem instances. Although ADMM is known to be parallelizable, this feature is rarely exploited in real implementations. In this paper we exploit the parallel computing architecture of a graphics processing unit (GPU) to accelerate ADMM. We build our solver on top of OSQP, a state-of-the-art implementation of ADMM for quadratic programming. Our open-source CUDA C implementation has been tested on many large-scale problems and shown to be up to two orders of magnitude faster than the CPU implementation. (An illustrative code sketch follows this record.)
- Published
- 2020
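One reason ADMM maps well to GPUs: for the box-constrained QPs that OSQP targets, a key sub-step is projecting a vector onto [l, u], which is elementwise and embarrassingly parallel. A minimal sketch of that single sub-step (my simplification, not the solver's code):

```cuda
// z[i] = clamp(v[i], l[i], u[i]); every element is independent.
__global__ void projectBox(const float* v, const float* l, const float* u,
                           float* z, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) z[i] = fminf(fmaxf(v[i], l[i]), u[i]);
}
```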
30. Improving Parallel Sorting of Numeric Arrays by the Merge Method [Удосконалення паралельного сортування масивів чисел методом злиття]
- Subjects
010302 applied physics ,Hardware architecture ,Very-large-scale integration ,Sorting algorithm ,Computer science ,Graphics processing unit ,020206 networking & telecommunications ,flow graph ,parallel sorting ,merge method ,sorting algorithms ,pairwise comparison ,data array ,graphics processor ,02 engineering and technology ,Parallel computing ,01 natural sciences ,CUDA ,Shared memory ,0103 physical sciences ,0202 electrical engineering, electronic engineering, information engineering ,lcsh:SD1-669.5 ,General Earth and Planetary Sciences ,Central processing unit ,SIMD ,lcsh:Forestry ,General Environmental Science
The current stage of information technology is characterized by the accumulation of large amounts of data. To process such arrays, systems often rely on sorting operations, which occupy about 40% of the total time spent working with the data. The main ways to increase the speed of sorting are to develop parallel methods and algorithms aimed at massively parallel computing devices with large shared memory (GPUs, which are SIMD processors) and to implement them on devices built with modern elements such as very-large-scale integration (VLSI). A review of the literature shows that existing sorting methods consist of two main operations: pairwise comparison, and permutation of elements when their order violates the sorting condition. Most of these methods are oriented toward sequential implementation and are unsuitable for parallel computing. Merge sort, however, is an exception: thanks to its basic operation and algorithm, it scales well on parallel systems, since the input array can be split into a large number of mutually independent subsets, each of which can be processed in parallel. Given these characteristics, the best option for a software implementation is a GPU with the CUDA programming model. CUDA is Nvidia's software and hardware architecture for parallel computing, which can significantly increase computing performance because it targets many-core architectures. Using CUDA and appropriate algorithms reduces sorting time by about 100 times compared to using only the CPU. Thus, the paper considers the approach and steps for improving the parallel merge-sort method and provides an algorithm for implementing merge sort on a graphics processing unit with the CUDA architecture. (An illustrative code sketch follows this record.)
- Published
- 2020
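A minimal formulation of the parallel idea (mine, not the paper's algorithm): in bottom-up merge sort, each pass merges disjoint pairs of sorted runs, and every pair is independent, so one GPU thread can own one pair.

```cuda
// Merge adjacent sorted runs of length `width` from src into dst.
__global__ void mergePass(const int* src, int* dst, int n, int width) {
    int pair = blockIdx.x * blockDim.x + threadIdx.x;
    int lo = 2 * width * pair;
    if (lo >= n) return;
    int mid = min(lo + width, n), hi = min(lo + 2 * width, n);
    int i = lo, j = mid, k = lo;
    while (i < mid && j < hi)
        dst[k++] = (src[i] <= src[j]) ? src[i++] : src[j++];
    while (i < mid) dst[k++] = src[i++];
    while (j < hi)  dst[k++] = src[j++];
}
```

Launching this kernel with ping-pong buffers for width = 1, 2, 4, ..., n/2 sorts the array in about log2(n) passes; the per-thread sequential merge is the naive variant that finer-grained schemes such as merge path refine.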
32. Parallel Implenetation of the GPR Techniques for Detecting and Mapping Ancient Buildings by Using CUDA
- Author
-
Salih Bayar and M. Cihat Mumcu
- Subjects
Electromagnetic field ,Computer science ,Acoustics ,Engineering ,Graphics processing unit ,Finite-difference time domain method ,law.invention ,CUDA ,Perfectly matched layer ,law ,Ground-penetrating radar ,Finite difference time-domain (FDTD),Ground penetrating radar (GPR),perfectly matched layer (PML),buried objects,Parallel Programming ,General-purpose computing on graphics processing units ,Radar
Ground-penetrating radar (GPR) is an ultra-wideband electromagnetic sensor used to detect objects that may be hidden behind a wall or embedded within it. The GPR method works by recording, at the receiver, the reflections of electromagnetic waves that an antenna positioned at the interface transmits into the subsurface. Buried structures are detected from the collected data using computer programs and various filters; searching for designated targets hidden within walls, such as air pockets, assists archaeologists. In this work, the Lorentz model was used for the dispersion of the soil. A perfectly matched layer (PML), extended to match dispersive media, served as the absorbing boundary condition to simulate open space. The finite-difference time-domain (FDTD) method was used to discretize the partial differential equations for time-stepping the electromagnetic fields. Because the FDTD calculation is computationally slow, general-purpose programming on the graphics processing unit (GPGPU) was employed to address the problem: the 3-D FDTD method was implemented on the GPU using CUDA and ran 10 times faster. (An illustrative code sketch follows this record.)
- Published
- 2020
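For orientation, the core of any FDTD scheme is a pair of interleaved, fully local updates, which is why it maps so directly to one-thread-per-cell GPU kernels. A hedged 1-D vacuum sketch (my simplification; the paper's 3-D dispersive and PML terms add auxiliary arrays but parallelize the same way):

```cuda
// Leapfrog Yee updates: E and H live on staggered grids and alternate steps.
__global__ void updateE(float* Ez, const float* Hy, int n, float ce) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= 1 && i < n)
        Ez[i] += ce * (Hy[i] - Hy[i - 1]);
}

__global__ void updateH(const float* Ez, float* Hy, int n, float ch) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n - 1)
        Hy[i] += ch * (Ez[i + 1] - Ez[i]);
}
```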
33. A GPU-based numerical model coupling hydrodynamical and morphological processes
- Author
-
Chunhong Hu, Yu Tong, Baozhu Pan, Yongde Kang, Junqiang Xia, and Jingming Hou
- Subjects
Finite volume method ,Discretization ,Water flow ,Computer science ,Stratigraphy ,0207 environmental engineering ,Graphics processing unit ,Geology ,02 engineering and technology ,010501 environmental sciences ,01 natural sciences ,Riemann solver ,Computational science ,CUDA ,symbols.namesake ,symbols ,MUSCL scheme ,020701 environmental engineering ,Shallow water equations ,ComputingMethodologies_COMPUTERGRAPHICS ,0105 earth and related environmental sciences - Abstract
Sediment transport simulations are important in practical engineering. In this study, a graphics processing unit (GPU)-based numerical model coupling hydrodynamic and morphological processes was developed to simulate water flow, sediment transport, and morphological change. Aiming to predict sediment transport and scouring processes accurately, the model resolves the realistic features of sediment transport and uses GPU-based parallel computing to accelerate the calculation. The model was created in the framework of a Godunov-type finite volume scheme for the shallow water equations (SWEs), which are discretized into algebraic equations by the finite volume method. The fluxes of mass and momentum are computed by the Harten, Lax, and van Leer Contact (HLLC) approximate Riemann solver, the friction source terms are calculated by the proposed splitting point-implicit method, and these values are evaluated using a novel 2D edge-based MUSCL scheme. The code was programmed in C++ and CUDA and runs on GPUs to substantially accelerate the computation. The aim of the work was to develop a GPU-based numerical model of hydrodynamic and morphological processes; the novelty lies in applying GPU techniques in the numerical model, making it possible to simulate sediment transport and bed evolution in a high-resolution yet efficient manner. The model was applied to two cases to evaluate bed evolution and the effects of morphological changes on flood patterns at high resolution, indicating that the GPU-based high-resolution hydro-geomorphological model is capable of reproducing morphological processes. The computational times for this test case on the GPU and CPU were 298.1 and 4531.2 s, respectively, a 15.2-fold acceleration. Compared with the traditional CPU implementation at high grid resolution, the proposed GPU-based high-resolution numerical model improved computation speed by 2.0-12.83 times for different grid resolutions while remaining computationally efficient.
- Published
- 2020
34. Parallel implementation of the non-overlapping template matching test using CUDA
- Author
-
Jianguo Zhang, Pu Li, Yuncai Wang, Anbang Wang, and Li Kaikai
- Subjects
Computer Networks and Communications ,Computer science ,Template matching ,Graphics processing unit ,Parallel computing ,CUDA ,Test suite ,NIST ,Electrical and Electronic Engineering ,Performance improvement ,Blossom algorithm ,Statistical hypothesis testing - Abstract
The NIST (National Institute of Standards and Technology) statistical test suite, widely recognized as the most authoritative, is used to verify the randomness of binary sequences. The Non-overlapping Template Matching Test, the 7th test of the NIST Test Suite, is remarkably time-consuming, and its slow performance is one of the major hurdles in the testing process. In this paper, we present an efficient bit-parallel matching algorithm and a segmented scan-based strategy for execution on a Graphics Processing Unit (GPU) using NVIDIA Compute Unified Device Architecture (CUDA). Experimental results show a significant performance improvement for the parallelized Non-overlapping Template Matching Test: it runs 483 times faster than the original NIST implementation without attenuating the accuracy of the test results.
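A simplified picture of the parallelization: the test examines fixed-length blocks independently, so each GPU thread can scan its own segment, advancing by the template length after a hit (the non-overlapping rule). The sketch below stores one bit per byte for clarity; the paper's contribution packs bits into words and matches them bit-parallel, which this does not reproduce.

```cuda
// Illustrative sketch, not the paper's bit-parallel algorithm: each thread
// counts non-overlapping occurrences of an m-bit template in its segment.
__global__ void nonOverlapMatch(const unsigned char *seq, long long n,
                                const unsigned char *tmpl, int m,
                                int segLen, unsigned int *hits)
{
    long long s = ((long long)blockIdx.x * blockDim.x + threadIdx.x) * segLen;
    if (s >= n) return;
    long long end = s + segLen;              // this thread's block boundary
    if (end > n - m + 1) end = n - m + 1;

    unsigned int local = 0;
    for (long long i = s; i < end; ) {
        bool match = true;
        for (int j = 0; j < m; ++j)
            if (seq[i + j] != tmpl[j]) { match = false; break; }
        if (match) { ++local; i += m; }      // non-overlapping: jump by m
        else       { ++i; }
    }
    if (local) atomicAdd(hits, local);       // accumulate global match count
}
```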
- Published
- 2020
35. A parallel hybrid implementation of the 2D acoustic wave equation
- Author
-
Niyaz Tokmagambetov, Michael Ruzhansky, and Arshyn Altybay
- Subjects
Multi-core processor ,Differential equation ,Applied Mathematics ,Computational Mechanics ,Graphics processing unit ,Parallel algorithm ,General Physics and Astronomy ,Statistical and Nonlinear Physics ,Computational science ,Alternating direction implicit method ,CUDA ,Mechanics of Materials ,Modeling and Simulation ,Acoustic wave equation ,Engineering (miscellaneous) ,Cyclic reduction - Abstract
In this paper, we propose a hybrid parallel programming approach for the numerical solution of a two-dimensional acoustic wave equation on a single computer, using an implicit finite difference scheme. First, we transform the differential equation into an implicit finite-difference equation and then, using the alternating direction implicit (ADI) method, split it into two sub-equations. Using the cyclic reduction algorithm, we calculate an approximate solution. Finally, we parallelize this algorithm on graphics processing unit (GPU), GPU + Open Multi-Processing (OpenMP), and hybrid (GPU + OpenMP + message passing interface (MPI)) computing platforms. The special focus is on improving the performance of the parallel algorithms, measuring acceleration in terms of execution time. We show that the code running on the hybrid approach gives the expected results by comparing our results to those obtained by running the same simulation on a classical processor core, Compute Unified Device Architecture (CUDA), and CUDA + OpenMP implementations.
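The reason ADI parallelizes well is that each splitting step produces one independent tridiagonal system per grid line. The paper solves these with cyclic reduction; purely as an illustration of the batch structure, the sketch below instead assigns one system per thread and runs the serial Thomas algorithm on it (the storage layout and names are assumptions).

```cuda
// Sketch: batch of `nsys` independent tridiagonal systems of size n, one
// thread each. a/b/c are sub/main/super diagonals; d holds the RHS and is
// overwritten with the solution; cp is scratch space of the same size.
__global__ void thomasBatch(const float *a, const float *b, const float *c,
                            float *d, float *cp, int n, int nsys)
{
    int s = blockIdx.x * blockDim.x + threadIdx.x;
    if (s >= nsys) return;
    const float *as = a + (long long)s * n, *bs = b + (long long)s * n,
                *cs = c + (long long)s * n;
    float *ds = d + (long long)s * n, *cps = cp + (long long)s * n;

    cps[0] = cs[0] / bs[0];                       // forward elimination
    ds[0]  = ds[0] / bs[0];
    for (int i = 1; i < n; ++i) {
        float w = 1.0f / (bs[i] - as[i] * cps[i - 1]);
        cps[i] = cs[i] * w;
        ds[i]  = (ds[i] - as[i] * ds[i - 1]) * w;
    }
    for (int i = n - 2; i >= 0; --i)              // back substitution
        ds[i] -= cps[i] * ds[i + 1];
}
```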
- Published
- 2020
36. Performance gains with Compute Unified Device Architecture-enabled eddy current correction for diffusion MRI
- Author
-
Stuart M. Grieve, Jerome Joseph Maller, Simon Vogrin, and Thomas Welton
- Subjects
Adult ,Male ,Workstation ,Computer science ,Motherboard ,Graphics processing unit ,Computational science ,CUDA ,Software ,Connectome ,Image Processing, Computer-Assisted ,Eddy current ,Humans ,Graphics ,Human Connectome Project ,General Neuroscience ,Brain ,Diffusion Magnetic Resonance Imaging - Abstract
Correcting for eddy currents, movement-induced distortion, and gradient inhomogeneities is imperative when processing diffusion MRI (dMRI) data, but is highly resource-intensive. Recently, Compute Unified Device Architecture (CUDA) support was implemented for the widely used eddy-correction software 'eddy', which reduces processing time and allows more comprehensive correction. We investigated the processing speed, performance, and compatibility of CUDA-enabled eddy-current correction compared to commonly used non-CUDA implementations. Four representative dMRI datasets from the Human Connectome Project, the Alzheimer's Disease Neuroimaging Initiative, and the Chronic Diseases Connectome Project were processed on high-specification and regular workstations through three different configurations of 'eddy'. Processing times and graphics processing unit (GPU) resources used were monitored and compared. Using CUDA reduced the 'eddy' processing time by a factor of up to five. The CUDA slice-to-volume correction method was also faster than non-CUDA eddy except when datasets were large. We make a series of recommendations for eddy configuration and hardware. We suggest that users of eddy-correction software for dMRI processing utilise CUDA and take advantage of the slice-to-volume correction option. We recommend that users run eddy on computers with at least 32 GB of motherboard random access memory (RAM) and a graphics card with at least 4.5 GB RAM and 3750 cores to optimise processing time.
- Published
- 2020
37. GPU-based matrix-free finite element solver exploiting symmetry of elemental matrices
- Author
-
Sachin S. Gautam, Utpal Kiran, and Deepak Sharma
- Subjects
Numerical Analysis ,Speedup ,Computer science ,Graphics processing unit ,Solver ,Finite element method ,Computer Science Applications ,Theoretical Computer Science ,Computational science ,Computational Mathematics ,CUDA ,Matrix (mathematics) ,Computational Theory and Mathematics ,Multiplication ,Software ,Sparse matrix - Abstract
Matrix-free solvers for the finite element method (FEM) avoid assembly of elemental matrices and replace the sparse matrix-vector multiplication required in iterative solution methods by an element-level dense matrix-vector product. In this paper, a novel matrix-free strategy for FEM is proposed which computes the element-level matrix-vector product using only the symmetric part of the elemental matrices. The proposed strategy is developed to take advantage of the massive parallelism of the Graphics Processing Unit (GPU). A unique data structure is also introduced which ensures localized and coalesced memory access suitable for a GPU while storing only the symmetric part of the elemental matrices. In addition, the proposed strategy emphasizes efficient use of the register cache, uniform workload distribution, reduced thread synchronization, and sufficient granularity to make the best use of GPU resources. The performance of the proposed strategy is evaluated by solving elasticity and heat conduction problems using 4-noded quadrilateral elements with two degrees of freedom (DOFs) and one DOF per node, respectively, and is compared with matrix-free solver strategies on GPU from the literature. A maximum speedup of 4.9× is obtained for the elasticity problem and a maximum speedup of 3.2× for the heat conduction problem. Further, the proposed strategy requires the least GPU memory of the strategies compared.
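The central trick, storing only the upper triangle of each symmetric elemental matrix and reusing each off-diagonal entry twice during the product, can be sketched as below. The flat packed layout, the DOF map, and NDOF = 8 (a 4-noded quadrilateral with two DOFs per node) are illustrative assumptions, not the paper's exact data structure.

```cuda
// Sketch: element-level y_e = K_e x_e using the packed upper triangle of
// the symmetric K_e, one thread per element, atomic scatter to global y.
#define NDOF 8
__global__ void matrixFreeSymSpmv(const float *keUpper, // nelem * NDOF*(NDOF+1)/2
                                  const int *dofMap,    // nelem * NDOF global DOF ids
                                  const float *x, float *y, int nelem)
{
    int e = blockIdx.x * blockDim.x + threadIdx.x;
    if (e >= nelem) return;
    const float *ke  = keUpper + (long long)e * (NDOF * (NDOF + 1) / 2);
    const int   *map = dofMap  + (long long)e * NDOF;

    float xe[NDOF], ye[NDOF];
    for (int i = 0; i < NDOF; ++i) { xe[i] = x[map[i]]; ye[i] = 0.0f; }

    int p = 0;                                  // packed row-major index
    for (int i = 0; i < NDOF; ++i) {
        ye[i] += ke[p++] * xe[i];               // diagonal k_ii
        for (int j = i + 1; j < NDOF; ++j, ++p) {
            ye[i] += ke[p] * xe[j];             // k_ij ...
            ye[j] += ke[p] * xe[i];             // ... and k_ji by symmetry
        }
    }
    for (int i = 0; i < NDOF; ++i) atomicAdd(&y[map[i]], ye[i]);
}
```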
- Published
- 2020
38. Applying the swept rule for solving explicit partial differential equations on heterogeneous computing systems
- Author
-
Anthony S. Walker, Daniel J. Magee, and Kyle E. Niemeyer
- Subjects
Speedup ,Computer science ,Graphics processing unit ,Symmetric multiprocessor system ,Parallel computing ,Stencil ,Theoretical Computer Science ,Euler equations ,CUDA ,Hardware and Architecture ,General-purpose computing on graphics processing units ,Massively parallel ,Software ,Information Systems - Abstract
Applications that exploit the architectural details of high-performance computing (HPC) systems have become increasingly valuable in academia and industry over the past two decades. The most important hardware development of the last decade in HPC has been the general purpose graphics processing unit (GPGPU), a class of massively parallel devices that now contributes the majority of computational power in the top 500 supercomputers. As these systems grow, small costs such as latency, due to the fixed cost of memory accesses and communication, accumulate in a large simulation and become a significant barrier to performance. The swept time-space decomposition rule is a communication-avoiding technique for time-stepping stencil update formulas that attempts to reduce latency costs. This work extends the swept rule by targeting heterogeneous CPU/GPU architectures representative of current and future HPC systems. We compare our approach to a naive decomposition scheme with two test equations using an MPI+CUDA pattern on 40 processes over two nodes containing one GPU. The swept rule produces a factor of 1.9 to 23 speedup for the heat equation and a factor of 1.1 to 2.0 speedup for the Euler equations, using the same processors and work distribution, and with the best possible configurations. These results show the potential effectiveness of the swept rule for different equations and numerical schemes on massively parallel compute systems that incur substantial latency costs.
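For context, the naive baseline the swept rule competes with looks like the kernel below: one explicit step of the 1-D heat equation per launch, with the kernel-launch boundary acting as the global synchronization whose latency the swept rule amortizes (a minimal sketch, not the paper's code).

```cuda
// One forward-Euler step of u_t = alpha * u_xx; r = alpha*dt/dx^2 (<= 0.5
// for stability). Host code launches this once per time step.
__global__ void heatStep(const float *u, float *uNew, int n, float r)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < 1 || i >= n - 1) return;             // endpoints held fixed
    uNew[i] = u[i] + r * (u[i - 1] - 2.0f * u[i] + u[i + 1]);
}
```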
- Published
- 2020
39. A comparative study of bounce-back and immersed boundary method in LBM for turbulent flow simulation
- Author
-
Akshay Prakash, Alankar Agarwal, and Sandeep Gupta
- Subjects
Physics ,Drag coefficient ,Turbulence ,Graphics processing unit ,Lattice Boltzmann methods ,Mechanics ,Immersed boundary method ,CUDA ,Boundary value problem ,Large eddy simulation - Abstract
In this study, we compared the modified bounce-back (BB) method and the immersed boundary (IB) method for the treatment of the no-slip boundary condition in the Lattice Boltzmann Method (LBM) for turbulent flow. The benchmark case of flow past a square cylinder confined in a rectangular duct was simulated with the two schemes at Re_d = 3000. The turbulent flow was modeled using Large Eddy Simulation (LES), with the conventional Smagorinsky subgrid scheme used to resolve the small-scale motions. The proposed algorithm is parallelized to run on an NVIDIA Tesla P100 Graphics Processing Unit (GPU) using the Compute Unified Device Architecture (CUDA) programming language. The results obtained from the bounce-back and immersed boundary methods have been validated against the experimental and LES data reported in the literature for different turbulent statistics, including time-averaged velocities, root-mean-square velocity fluctuations, and Reynolds shear stress. The two approaches are then compared in terms of the accuracy of their drag coefficient predictions, and the computational efficiency of both methods on the GPU is also reported.
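Of the two treatments compared, the bounce-back rule is the simpler to express on a GPU: at a solid node, each post-streaming distribution is replaced by its opposite-direction partner, which enforces no-slip. The sketch below shows the full-way variant for a D2Q9 lattice; the structure-of-arrays layout, direction ordering, and `solid` mask are assumptions for illustration.

```cuda
// Full-way bounce-back on a D2Q9 lattice (directions: 0 rest, 1 E, 2 N,
// 3 W, 4 S, 5 NE, 6 NW, 7 SW, 8 SE), one thread per lattice node.
__constant__ int OPP[9] = {0, 3, 4, 1, 2, 7, 8, 5, 6};

__global__ void bounceBack(float *f, const unsigned char *solid, int nx, int ny)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= nx || y >= ny) return;
    int node = y * nx + x;
    if (!solid[node]) return;                    // fluid nodes untouched

    int npop = nx * ny;                          // stride between directions
    for (int i = 1; i < 9; ++i) {
        int j = OPP[i];
        if (j > i) {                             // swap each opposite pair once
            float t = f[i * npop + node];
            f[i * npop + node] = f[j * npop + node];
            f[j * npop + node] = t;
        }
    }
}
```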
- Published
- 2020
40. Strength Check of Aircraft Parts Based on Multi-GPU Clusters for Fast Calculation of Sparse Linear Equations
- Author
-
Yuhua Zhang and Binxing Hu
- Subjects
architecture ,General Computer Science ,Computer science ,General Engineering ,Graphics processing unit ,Parallel algorithm ,Condition monitoring ,CUDA ,Finite element method ,Computational science ,Sparse matrix ,PCG ,Matrix (mathematics) ,Conjugate gradient method ,RCM ,General Materials Science ,Linear equation ,Cholesky decomposition - Abstract
In order to improve the cost-effectiveness ratio, next-generation vehicles need to meet reuse requirements while adopting a lighter structural weight, so it is necessary to realize strength calculation and condition monitoring of key components in the digital twin. Most current monitoring methods are based on the characteristics of various data acquisition systems, but they require the support of a large amount of flight data. The disadvantages of this strategy can be avoided by reducing the structure of aircraft components to a finite element model and quickly checking the key components in the health management system. To solve the problem of fast calculation of the finite element model of key aircraft components, a parallel algorithm and framework for a large-scale sparse-matrix preconditioned conjugate gradient method based on CUDA (Compute Unified Device Architecture) technology is proposed for a multi-GPU (Graphics Processing Unit) workstation cluster environment. When the sparse matrix is too large to be processed on a single workstation, this paper discusses how to realize optimized data segmentation in the distributed multi-GPU computing environment. For the iterative solution with matrix preconditioning, two strategies are proposed, parallelized matrix bandwidth reduction and incomplete Cholesky decomposition, and asynchronous task concurrency and load balancing strategies are designed on top of this architecture. Calculations on examples from a standard sparse matrix database show that the proposed algorithm and architecture can solve large-scale sparse systems quickly and efficiently and can complete fast strength verification of vehicle components.
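The kernel at the heart of any GPU conjugate-gradient solver is the sparse matrix-vector product; a minimal one-thread-per-row CSR version is sketched below. The paper's contributions (RCM bandwidth-reduction and incomplete-Cholesky preconditioning, multi-GPU partitioning, load balancing) sit on top of this building block and are not shown.

```cuda
// Sketch: y = A*x for a CSR matrix, one thread per row.
__global__ void spmvCsr(const int *rowPtr, const int *colIdx,
                        const double *val, const double *x,
                        double *y, int nrows)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= nrows) return;
    double sum = 0.0;
    for (int k = rowPtr[row]; k < rowPtr[row + 1]; ++k)
        sum += val[k] * x[colIdx[k]];            // accumulate row dot product
    y[row] = sum;
}
```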
- Published
- 2020
41. An Effective SAT Solver Utilizing ACO Based on Heterogenous Systems
- Author
-
Muhammad Osama, Aziza I. Hussein, Ammar M. Hassan, Mohammed Moness, and Hassan Youness
- Subjects
Multi-core processor ,General Computer Science ,Computer science ,Ant colony optimization algorithms ,General Engineering ,Graphics processing unit ,GPU ,CUDA ,Parallel computing ,Solver ,pure-literal elimination ,Ant colony optimization ,satisfiability ,DPLL algorithm ,General Materials Science ,Variable elimination ,heterogeneous ,Boolean satisfiability problem - Abstract
This paper presents new parallel strategies for preprocessing and solving the Boolean Satisfiability (SAT) problem on heterogeneous systems of multicore and many-core CPUs and Graphics Processing Units (GPUs) using Open Multi-Processing (OpenMP) and NVIDIA CUDA. We propose highly efficient parallel techniques for SAT simplification using the variable elimination method based on the Davis-Putnam-Logemann-Loveland (DPLL) splitting rule, performed with a shared-memory model on a multicore CPU platform, where the clause subsumption elimination and pure-literal removal techniques are performed entirely on the CUDA framework. We demonstrate the efficiency of an evolutionary SAT solver combined with the suggested heterogeneous preprocessing, which leads to substantial acceleration and improvements in solution quality. The parallelization of the evolutionary SAT solver is executed with an Ant Colony Optimization (ACO) scheme utilizing CUDA (Compute Unified Device Architecture). We perform thorough benchmarks to test the performance of our preprocessor and solver implementations against various random SAT formulas. The proposed H-SAT preprocessor scheme obtained a speedup factor of 15x over the sequential implementation, with statistical reductions of the original CNF of up to 49% and 43% in the numbers of literals and clauses, respectively, and H-SAT strengthened the solvability of the ACO solver by 100% in some cases.
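One of the embarrassingly parallel preprocessing pieces is easy to picture: to find pure literals, count positive and negative occurrences of each variable over all clause literals in parallel, then eliminate variables for which one polarity never occurs. The sketch below assumes a DIMACS-style encoding (literal +v or -v for variable v, all clause literals flattened into one array); it is an illustration, not the paper's implementation.

```cuda
// Sketch: per-literal polarity counting with atomics. posCnt/negCnt are
// zero-initialized arrays indexed 1..nvars; a variable v is a pure literal
// if posCnt[v] == 0 or negCnt[v] == 0 afterwards.
__global__ void countPolarities(const int *lits, long long nlits,
                                unsigned int *posCnt, unsigned int *negCnt)
{
    long long i = (long long)blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nlits) return;
    int lit = lits[i];
    if (lit > 0) atomicAdd(&posCnt[lit],  1u);
    else         atomicAdd(&negCnt[-lit], 1u);
}
```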
- Published
- 2020
42. FPGA-Based Scale-Out Prototyping of Degridding Algorithm for Accelerating Square Kilometre Array Telescope Data Processing
- Author
-
Yuefeng Song, Sen Du, Shijin Song, Junjie Hou, and Yongxin Zhu
- Subjects
gridding/degridding ,Data processing ,Speedup ,General Computer Science ,Computer science ,scientific data processing ,General Engineering ,Graphics processing unit ,square kilometre array ,Antenna array ,CUDA ,Memory architecture ,Benchmark (computing) ,General Materials Science ,Field-programmable gate array ,Algorithm ,FPGA - Abstract
The SKA (Square Kilometre Array) radio telescope will become the most sensitive telescope by correlating a large number of antenna nodes to form a giant antenna array. The data generated by such a large number of antenna nodes pose a huge storage problem and require real-time processing to make the best use of them, and SKA Scientific Data Processing becomes the bottleneck of the whole processing flow. However, the existing high-performance CPU- and GPU (Graphics Processing Unit)-based solutions cannot satisfy the performance and power-budget requirements well [1]. Owing to the high energy efficiency of hardware accelerators and the flexibility and cost of prototype design, in this paper we explore an FPGA (Field-Programmable Gate Array)-based prototype of one of the most computationally demanding procedures in SKA scientific data processing: degridding. Through analysis of the algorithm's behavior and bottlenecks, we design and optimize the memory architecture and computing logic of the FPGA-based prototype. Besides, considering the relations between the data required to process multiple spectral channels, we reuse the shared data when processing neighboring spectral channels, which further improves performance. The functionality and performance of our design have been verified on the target FPGA board, and software-based benchmarks were also measured on comparable CPU and GPU platforms, indicating that the FPGA-based prototype achieves 2.74 times and 2.03 times speedup, and 7.64 times and 7.42 times energy efficiency, over the MPI (Message Passing Interface)-based CPU benchmark and the CUDA (Compute Unified Device Architecture)-based GPU benchmark, respectively.
- Published
- 2020
43. Real-Time Lung Tumor Tracking Using a CUDA Enabled Nonrigid Registration Algorithm for MRI
- Author
-
Pierre Boulanger, Kumaradevan Punithakumar, Michelle Noga, Jihyun Yun, Gino Fallone, and Nazanin Tahmasebi
- Subjects
Medical technology ,Computer science ,Computation ,Biomedical Engineering ,Graphics processing unit ,Non-rigid image registration ,radiation therapy ,CUDA ,image segmentation ,parallel computing ,General Medicine ,lung mobile tumors ,GPU computing ,tumor tracking ,Parallel processing (DSP implementation) ,Shared memory ,compute unified device architecture ,Central processing unit ,General-purpose computing on graphics processing units ,Algorithm - Abstract
Objective: This study intends to develop an accurate, real-time tumor tracking algorithm for automated radiation therapy for cancer treatment using Graphics Processing Unit (GPU) computing. Although a previous moving-mesh-based tumor tracking approach has been shown to successfully delineate tumor regions from a sequence of magnetic resonance images, the algorithm is computationally intensive, and its computation time on standard Central Processing Unit (CPU) processors is too slow for clinical use, especially in an automated radiation therapy system. Method: A re-implementation of the algorithm on a low-cost parallel GPU-based computing platform is used to accelerate this computation to a speed amenable to clinical usage. Several components of the registration algorithm, such as the computation of the similarity metric, are inherently parallel and fit well with the GPU's parallel processing capabilities. Numerically solving a partial differential equation to generate the mesh deformation is one of the computationally intensive components, and it has been accelerated by utilizing the much faster shared memory on the GPU. Results: Implemented on an NVIDIA Tesla K40c GPU, the proposed approach yielded a computational acceleration of over 5 times its CPU implementation and an average Dice score of 0.87 evaluated over 600 images acquired from six patients. Conclusion: This study demonstrated that the GPU computing approach can be used to accelerate tumor tracking for automated radiation therapy of mobile lung tumors. Clinical Impact: Accurately tracking mobile tumor boundaries in real time is important for automating radiation therapy, and the proposed study offers an excellent option for fast tumor region tracking in cancer treatment.
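The "inherently parallel" similarity-metric step mentioned above follows the classic GPU reduction pattern; a sketch for a plain sum-of-squared-differences metric is shown below (the paper's actual metric and mesh solver are more involved; the block size of 256 and the flat image layout are assumptions).

```cuda
// Sketch: sum of squared differences between two images, tree-reduced in
// shared memory per block, one atomic add per block. Launch with
// blockDim.x == 256 and *result zero-initialized.
__global__ void ssdMetric(const float *fixedImg, const float *movingImg,
                          long long n, float *result)
{
    __shared__ float partial[256];
    long long i = (long long)blockIdx.x * blockDim.x + threadIdx.x;
    float d = 0.0f;
    if (i < n) {
        float diff = fixedImg[i] - movingImg[i];
        d = diff * diff;
    }
    partial[threadIdx.x] = d;
    __syncthreads();

    for (int s = blockDim.x / 2; s > 0; s >>= 1) {   // tree reduction
        if (threadIdx.x < s) partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) atomicAdd(result, partial[0]);
}
```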
- Published
- 2020
44. Neural network training acceleration using NVIDIA CUDA technology for image recognition
- Author
-
Alexander A Fertsev
- Subjects
cuda ,image recognition ,neural networks ,levenberg-marquardt method ,graphics processing unit ,Mathematics - Abstract
In this paper, an implementation of a neural network trained by an algorithm based on the Levenberg-Marquardt method is presented. Training of the neural network was accelerated by almost 9 times using NVIDIA CUDA technology. The implemented neural network is used for the recognition of noisy images.
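For reference, each Levenberg-Marquardt training step solves a damped normal-equation system for the weight update (this is the standard form of the method, not notation taken from the paper):

$$(J^\top J + \lambda I)\,\delta = J^\top r,$$

where $J$ is the Jacobian of the network errors with respect to the weights, $r$ is the residual vector, and $\lambda$ is the damping factor. The dense $J^\top J$ product dominates the cost and is the part that maps naturally onto the GPU.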
- Published
- 2012
45. Accelerating Haze Removal Algorithm Using CUDA
- Author
-
Xianyun Wu, Keyan Wang, Yunsong Li, Bormin Huang, and Kai Liu
- Subjects
Speedup ,parallel computing implementation ,Computer science ,Graphics processing unit ,haze removal ,1080p ,real-time ,Frame rate ,Stream processing ,CUDA ,High-definition video ,dark channel prior ,Filter (video) ,Median filter ,General Earth and Planetary Sciences ,Algorithm - Abstract
The dark channel prior (DCP)-based single-image haze removal algorithm achieves excellent performance. However, due to the high complexity of the algorithm, it is difficult to satisfy the demands of real-time processing. In this article, we present a Graphics Processing Unit (GPU)-accelerated parallel computing method for real-time high-definition video haze removal. First, based on the memory access pattern, we propose a simple but effective filter method called the transposed filter, combined with the fast local-minimum filter algorithm and the integral image algorithm. The proposed method successfully accelerates the parallel minimum filter algorithm and the parallel mean filter algorithm. Meanwhile, we adopt an inter-frame atmospheric light constraint to suppress flicker noise in the video haze removal and to simplify the estimation of atmospheric light. Experimental results show that our implementation can process 1080p video sequences at 167 frames per second. Compared with a single-threaded Central Processing Unit (CPU) implementation, the speedup is up to 226× with asynchronous stream processing, which qualifies the method for real-time high-definition video haze removal.
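The first stage of the DCP pipeline is a per-pixel minimum over the color channels followed by a local minimum filter; a naive windowed version is sketched below (window radius, layout, and border clamping are assumptions; the paper replaces the brute-force window scan with a separable fast local-minimum filter plus the transposed pass).

```cuda
// Sketch: dark channel of an RGB image, one thread per pixel, naive
// (2r+1)x(2r+1) window scan with clamped borders.
__global__ void darkChannel(const uchar3 *rgb, unsigned char *dark,
                            int w, int h, int r)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;

    int m = 255;
    for (int dy = -r; dy <= r; ++dy)
        for (int dx = -r; dx <= r; ++dx) {
            int xx = min(max(x + dx, 0), w - 1);   // clamp at image borders
            int yy = min(max(y + dy, 0), h - 1);
            uchar3 p = rgb[yy * w + xx];
            int c = min((int)p.x, min((int)p.y, (int)p.z));  // channel min
            m = min(m, c);
        }
    dark[y * w + x] = (unsigned char)m;
}
```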
- Published
- 2021
46. Analysis of Ionicity-Magnetism Competition in 2D-MX3 Halides towards a Low-Dimensional Materials Study Based on GPU-Enabled Computational Systems
- Author
-
Sergey I. Malkovsky, A. N. Chibisov, and Alexey Kartsev
- Subjects
ESSL ,Computer science ,General Chemical Engineering ,ferromagnetism ,Graphics processing unit ,GPU ,CUDA ,Material Design ,biquadratic exchange ,Computational science ,CrI3 ,Chemistry ,Quantum ESPRESSO ,Scalability ,2D magnets ,Periodic boundary conditions ,General Materials Science ,RuCl3 ,Central processing unit ,CPU ,density functional theory - Abstract
The acceleration of parallel high-throughput first-principles calculations in the context of 3D (three-dimensional) periodic boundary conditions for low-dimensional systems, and particularly 2D materials, is an important issue for new material design, where scalability rapidly deflates due to the use of large, mostly empty unit cells with a significant number of atoms that must mimic layered structures in vacuum. In this report, we explored the scalability and performance of the Quantum ESPRESSO package in a hybrid central processing unit - graphics processing unit (CPU-GPU) environment. The study was carried out in comparison to CPU-based systems for simulations of 2D magnets, where a significant improvement in computational speed was achieved based on the IBM ESSL SMP CUDA library. As an example of physics-related results, we have computed and discussed the ionicity-covalency interplay and the related ferromagnetic (FM) and antiferromagnetic (AFM) exchange competition for some CrX3 compounds. Further, it is demonstrated how this exchange interplay leads to high-order effects in the magnetism of the 1L-RuCl3 compound.
- Published
- 2021
47. GPU-accelerated Hungarian algorithms for the Linear Assignment Problem.
- Author
-
Date, Ketan and Nagi, Rakesh
- Subjects
*GRAPHICS processing units ,*PARALLEL programming ,*CUDA (Computer architecture) ,*MOTHERBOARDS ,*PARALLEL algorithms - Abstract
In this paper, we describe parallel versions of two different variants (classical and alternating tree) of the Hungarian algorithm for solving the Linear Assignment Problem (LAP). We have chosen Compute Unified Device Architecture (CUDA) enabled NVIDIA Graphics Processing Units (GPU) as the parallel programming architecture because of its ability to perform intense computations on arrays and matrices. The main contribution of this paper is an efficient parallelization of the augmenting path search phase of the Hungarian algorithm. Computational experiments on problems with up to 25 million variables reveal that the GPU-accelerated versions are extremely efficient in solving large problems, as compared to their CPU counterparts. Tremendous parallel speedups are achieved for problems with up to 400 million variables, which are solved within 13 seconds on average. We also tested multi-GPU versions of the two variants on up to 16 GPUs, which show decent scaling behavior for problems with up to 1.6 billion variables and dense cost matrix structure. [ABSTRACT FROM AUTHOR]
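The easiest phase to parallelize, and a useful warm-up for the structure of such solvers, is the initial row reduction of the dense cost matrix; a one-thread-per-row sketch is below. The paper's main contribution, the parallel augmenting-path search, is considerably more involved and is not reproduced here.

```cuda
// Sketch: Hungarian-algorithm row reduction, one thread per row of an
// n x n dense cost matrix: subtract each row's minimum from the row.
__global__ void rowReduce(float *cost, int n)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n) return;
    float *r = cost + (long long)row * n;
    float m = r[0];
    for (int j = 1; j < n; ++j) m = fminf(m, r[j]);   // row minimum
    for (int j = 0; j < n; ++j) r[j] -= m;            // create zeros
}
```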
- Published
- 2016
- Full Text
- View/download PDF
48. Application of Ray-Tracing Method in Electromagnetic Numerical Simulation Algorithm
- Author
-
Haibo He, Ming Jin, and Xueda Niu
- Subjects
Commercial software ,Radar cross-section ,CUDA ,FEKO ,Computer simulation ,Computer science ,Graphics processing unit ,Reflection (physics) ,Ray tracing (graphics) ,Algorithm - Abstract
Several methods are proposed to accelerate the calculation of the RCS (radar cross section) by integrating shooting and bouncing rays-physical optics (SBR-PO) methods on a graphics processing unit (GPU). The SBR-PO method is used to solve the multiple-reflection behavior of electromagnetic waves on the target model, and the scattered field is calculated by the PO integral. The ray-tracing engine OptiX and the Compute Unified Device Architecture (CUDA) are used to reduce the calculation time of the SBR-PO method and improve the efficiency of the RCS calculation. The results calculated by the SBR-PO method are compared with those from the commercial software FEKO, fully demonstrating the feasibility and efficiency of integrating the SBR-PO method with OptiX and the GPU.
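The geometric core of each SBR bounce is the specular reflection of a ray direction about the local surface normal, r = d - 2(d.n)n; a device-function sketch is below (ray launching, the PO surface currents, and the far-field integral are not shown).

```cuda
// Sketch: specular reflection of direction d about unit normal n.
__device__ float3 reflectDir(float3 d, float3 n)
{
    float dn = d.x * n.x + d.y * n.y + d.z * n.z;     // d . n
    return make_float3(d.x - 2.0f * dn * n.x,
                       d.y - 2.0f * dn * n.y,
                       d.z - 2.0f * dn * n.z);
}
```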
- Published
- 2021
49. CUDA-enabled Programming for Accelerating Flood Simulation
- Author
-
Min-Jui Huang, Chin-Pin Ko, Mohammad Alkhaleefah, Yang-Lang Chang, Chiang-An Hsu, and Praveen Kumar Chittem
- Subjects
CUDA ,Data dependency ,Computer science ,Simulation modeling ,Graphics processing unit ,Parallel computing ,Supercomputer - Abstract
Floods are among the most frequent disasters, causing widespread damage and resulting in loss of lives and property. In this paper, we present the Sinotech Engineering Consultants Hydrodynamic (SEC-HY21) simulation model to predict floods and estimate their damage efficiently. However, SEC-HY21 suffers from a slow simulation rate due to its data-dependency structure, which makes its numerical model hard to parallelize. In this research, near real-time flood simulation has been achieved through a Compute Unified Device Architecture (CUDA) parallel implementation on an NVIDIA Graphics Processing Unit (GPU), accelerating the slowest module in the SEC-HY21 package, namely iFlux. The experimental results show that the CUDA-based parallel implementation makes SEC-HY21 simulation ∼14× faster than before.
- Published
- 2021
50. Parallelization of GKV benchmark using OpenACC
- Author
-
Makoto Morishita, Toru Nagai, Takahiro Katagiri, and Satoshi Ohshima
- Subjects
CUDA ,TOP500 ,Speedup ,Memory hierarchy ,Computer science ,Benchmark (computing) ,Graphics processing unit ,Central processing unit ,Parallel computing ,General-purpose computing on graphics processing units - Abstract
The computing power of the Graphics Processing Unit (GPU) has received great attention in recent years, as 140 supercomputers with NVIDIA GPUs were ranked in the TOP500 for November 2020 [1]. However, CUDA, which is widely used in GPU programming, must be written at a low level and often requires specialized knowledge of the GPU memory hierarchy and execution models. In this study, we used OpenACC [2], which semi-automatically generates kernel code from directives inserted into a program, to speed up the application. The target application was a benchmark program based on the plasma turbulence analysis code, the gyrokinetic Vlasov code (GKV). With our OpenACC implementation, kernel2, kernel3, and kernel4 of the benchmark ran 31.43, 7.08, and 10.74 times faster, respectively, than sequential CPU execution. Thus, we succeeded in increasing the application's speed. In the future, we will port the rest of the code to the GPU environment to run the entire GKV on GPUs.
- Published
- 2021