2,187 results on '"LU decomposition"'
Search Results
152. UAV positioning algorithm based on LU decomposition preprocessing
- Author
-
Lei Yang, Zhigang Li, and Yonglun Li
- Subjects
Compressed sensing ,Positioning system ,law ,Computer science ,Received signal strength indication ,Node (networking) ,Grid ,Wireless sensor network ,Algorithm ,LU decomposition ,Restricted isometry property ,law.invention - Abstract
Because the observation matrix in traditional target positioning algorithms for wireless sensor networks does not satisfy the Restricted Isometry Property, a sparse target positioning algorithm based on LU decomposition is proposed. The algorithm applies the principles of compressed sensing to the grid-based target positioning problem using the Received Signal Strength Indication. The LU decomposition method is used to decompose the observation matrix, which not only satisfies the Restricted Isometry Property but also reduces the impact on the sparsity of the original signal. Experiments on a UAV positioning system show that the positioning performance of the LU-decomposition-based target positioning algorithm is superior to that of the sparse node positioning algorithm based on Orth preprocessing.
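The preprocessing idea described above can be sketched as follows (a minimal illustration, not the authors' code; the matrix sizes and the sparse target vector are invented for the example): decomposing the observation matrix Phi as P L U and multiplying the measurements by the inverses of P and L yields an equivalent system whose sensing matrix is the triangular factor U.

```python
import numpy as np
from scipy.linalg import lu, solve_triangular

rng = np.random.default_rng(0)
m, n = 20, 50                        # measurements < unknowns
Phi = rng.standard_normal((m, n))    # hypothetical observation matrix
x = np.zeros(n)
x[[3, 17, 41]] = [1.0, -2.0, 0.5]    # sparse target vector on the grid
y = Phi @ x                          # RSSI-style measurements

# LU-decompose the observation matrix: Phi = P @ L @ U
P, L, U = lu(Phi)
# Preprocess the measurements; the equivalent system is U @ x = y_pre,
# with the triangular factor U acting as the new sensing matrix
y_pre = solve_triangular(L, P.T @ y, lower=True)
assert np.allclose(U @ x, y_pre)
```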
- Published
- 2021
153. Accelerated 3D Image Reconstruction for Resource Constrained Systems
- Author
-
Andreas Asmann, Brian Stewart, Andrew M. Wallace, and Yun Wu
- Subjects
Computer science ,020206 networking & telecommunications ,02 engineering and technology ,Iterative reconstruction ,Least squares ,LU decomposition ,law.invention ,Compressed sensing ,law ,Linear algebra ,0202 electrical engineering, electronic engineering, information engineering ,020201 artificial intelligence & image processing ,Image resolution ,Algorithm ,Linear equation - Abstract
We demonstrate an efficient and accelerated implementation of a parallel sparse depth reconstruction framework using compressed sensing (CS) techniques. Recent work suggests that CS can be split into smaller subproblems. This allows us to efficiently pre-compute important components of the LU decomposition and the subsequent linear algebra needed to solve the sets of linear equations found in algorithms such as the alternating direction method of multipliers (ADMM). For comparison, a fully discrete least-squares reconstruction method is also presented. We also investigate how reduced precision can be leveraged to lower the number of logic units in field-programmable gate array (FPGA) implementations of such sparse imaging systems. We show that the number of logic units, the memory requirements, and the power consumption are all reduced significantly, by over 70%, with minimal impact on the quality of reconstruction. This demonstrates the feasibility of novel high-resolution, low-power, high-frame-rate light detection and ranging (LiDAR) depth imagers based on sparse illumination.
- Published
- 2021
154. HPC LINPACK Parameter Optimization on Homo-/Heterogeneous System of ARM Neoverse N1SDP
- Author
-
Ju-Yeob Kim, Yong Cheol Peter Cho, Young-Su Kwon, Jinho Han, Jinkyu Kim, Je-Seok Ham, and Chun-Gi Lyuh
- Subjects
Profiling (computer programming) ,Matrix (mathematics) ,Acceleration ,Computer science ,law ,Subroutine ,Benchmark (computing) ,Graph (abstract data type) ,Parallel computing ,System of linear equations ,LU decomposition ,law.invention - Abstract
HPL (High Performance Linpack) is the standard benchmark used to evaluate supercomputers (high-performance computing systems) around the world. HPL solves a linear system of equations, Ax = b, through a series of mathematical processes such as 2D block-cyclic matrix distribution and LU decomposition. In order to achieve the best performance, optimization of key HPL parameters is essential. In this paper, we propose an optimization analysis technique for HPL parameters based on HPLinpack performance results and the efficiency graph pattern on a homogeneous system consisting of the Neoverse N1SDP (System Development Platform) from ARM. Results show the significant influence of GEMM operations on performance; the need for their acceleration was further verified through HPL function-call profiling. Finally, we show the effectiveness of the proposed parameter optimization methods on a heterogeneous system. An HPL implementation modified for heterogeneous systems exhibited considerably improved performance when tested on the Neoverse N1SDP and the Nvidia RTX 2060 GPU.
- Published
- 2021
155. Image Fusion Using LUKR in Multi-Modal Authentication
- Author
-
Suresh Yadlapati, P. E. S. N. Krishna Prasad, Sagar Yeruva, and Pavan Kumar Kalluri
- Subjects
Kronecker product ,Image fusion ,Authentication ,Computer science ,Data security ,computer.software_genre ,LU decomposition ,law.invention ,symbols.namesake ,Modal ,law ,Singular value decomposition ,symbols ,Decomposition (computer science) ,Data mining ,computer - Abstract
Nowadays, data are available in different forms and obtainable from multiple sources. Although many algorithms are available, they remain vulnerable to attackers, so it is necessary to improve their security levels in order to increase data security. Such security is expected in banking, railways, hospitals, and many other areas. In most existing algorithms, security is provided at only a single level; at the same time, how features are extracted is equally important. In this paper, we present how security can be implemented at multiple levels and how features can be extracted using Lower-Upper (LU) decomposition and Singular Value Decomposition (SVD). For security purposes, the extracted features can be decrypted using the Khatri-Rao product in both cases. Comparing the results, LU decomposition with Khatri-Rao (LUKR) gives better results than Singular Value Decomposition with Khatri-Rao (SVDKR).
- Published
- 2021
156. ADELUS: A Performance-Portable Dense LU Solver for Distributed-Memory Hardware-Accelerated Systems
- Author
-
Vinh Dang, Sivasankaran Rajamanickam, and J.D. Kotulski
- Subjects
Software portability ,law ,Linear system ,Computer Science::Mathematical Software ,Distributed memory ,Parallel computing ,Solver ,System of linear equations ,Supercomputer ,LU decomposition ,Pivot element ,law.invention - Abstract
Solving dense systems of linear equations is essential in applications encountered in physics, mathematics, and engineering. This paper describes our current efforts toward the development of the ADELUS package for current and next-generation distributed, accelerator-based, high-performance computing platforms. The package solves dense linear systems using partial-pivoting LU factorization on distributed-memory systems with CPUs/GPUs. The matrix is block-mapped onto distributed memory on CPUs/GPUs and is solved as if it were torus-wrapped, for an optimal balance of computation and communication. A permutation operation is performed to restore the results, so the torus-wrap distribution is transparent to the user. This package targets performance portability by leveraging the abstractions provided in the Kokkos and Kokkos Kernels libraries. Comparisons of the performance gains against the state-of-the-art SLATE and DPLASMA GESV functionalities on the Summit supercomputer are provided. Preliminary performance results from large-scale electromagnetic simulations using ADELUS are also presented. The solver achieves 7.7 Petaflops on 7600 GPUs of the Sierra supercomputer, translating to 16.9% efficiency.
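A single-node analogue of the operation ADELUS performs at scale (a sketch only, not the ADELUS API; problem size invented): factor a dense matrix once with partial-pivoting LU, then solve via triangular substitutions.

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(0)
n = 500
A = rng.standard_normal((n, n))   # dense system matrix
b = rng.standard_normal(n)        # right-hand side

lu_piv = lu_factor(A)             # LU factorization with partial pivoting
x = lu_solve(lu_piv, b)           # forward/backward triangular solves
assert np.allclose(A @ x, b)
```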
- Published
- 2021
157. Affine Projection Algorithm Over Acoustic Sensor Networks for Active Noise Control
- Author
-
Miguel Ferrer, Alberto Gonzalez, Maria de Diego, and Gema Pinero
- Subjects
Acoustics and Ultrasonics ,Computer science ,Affine projection algorithm ,law.invention ,Reduction (complexity) ,030507 speech-language pathology & audiology ,03 medical and health sciences ,symbols.namesake ,Gaussian elimination ,law ,TEORIA DE LA SEÑAL Y COMUNICACIONES ,Computer Science (miscellaneous) ,Electrical and Electronic Engineering ,Projection (set theory) ,Woodbury matrix identity ,Active noise control ,Adaptive algorithm ,Approximation algorithm ,LU decomposition ,Computational Mathematics ,Acoustic sensor networks ,symbols ,Adaptive filters ,Distributed algorithms ,0305 other medical science ,Algorithm - Abstract
Acoustic sensor networks (ASNs) are an effective solution for implementing active noise control (ANC) systems using distributed adaptive algorithms. On one hand, ASNs provide scalable systems where the signal processing load is distributed among the network nodes. On the other hand, their noise reduction performance is comparable to that of their respective centralized processing systems. In this sense, the distributed multiple error filtered-x least mean squares (DMEFxLMS) adaptive algorithm has been shown to achieve the same performance as its centralized counterpart as long as there are no communication constraints in the underlying ASN. Regarding affine projection (AP) adaptive algorithms, some distributed approaches that are approximated versions of the multichannel filtered-x affine projection (MFxAP) algorithm have been previously proposed. These AP algorithms can efficiently share the processing load among the nodes, but at the expense of worsening their convergence properties. In this paper we develop the exact distributed multichannel filtered-x AP (EFxAP) algorithm, which obtains the same solution as the MFxAP algorithm as long as there are no communication constraints in the underlying ASN. In the EFxAP algorithm, each node can compute a part of or the entire inverse matrix needed by the centralized MFxAP algorithm. Thus, we propose three different strategies that obtain significant computational savings: (1) Gaussian elimination, (2) block LU factorization, and (3) the matrix inversion lemma. As a result, each node computes only between 25% and 60% of the number of multiplications required by direct inversion of the matrix.
Regarding the performance in transient and steady states, the EFxAP exhibits the fastest convergence and the highest noise level reduction for any size of the acoustic network and any projection order of the AP algorithm, compared to the DMEFxLMS and two previously reported distributed AP algorithms. This work was supported by the EU together with the Spanish Government through RTI2018-098085B-C41 (MINECO/FEDER) and by the Generalitat Valenciana through PROMETEO/2019/109.
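The third strategy named in the abstract, the matrix inversion lemma (Woodbury identity), can be sketched as follows (an illustration with invented sizes, not the EFxAP algorithm itself): given a known inverse of A, the inverse of a low-rank update A + U V^T is obtained by inverting only a small k x k matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 3
A = rng.standard_normal((n, n)) + n * np.eye(n)   # well-conditioned base matrix
U = rng.standard_normal((n, k))
V = rng.standard_normal((n, k))

Ainv = np.linalg.inv(A)                           # assumed already available
# Woodbury: (A + U V^T)^-1 = Ainv - Ainv U (I + V^T Ainv U)^-1 V^T Ainv,
# which replaces an n x n inversion by a k x k one
inner = np.linalg.inv(np.eye(k) + V.T @ Ainv @ U)
updated_inv = Ainv - Ainv @ U @ inner @ V.T @ Ainv
assert np.allclose(updated_inv, np.linalg.inv(A + U @ V.T))
```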
- Published
- 2021
158. New Models for Solving Time-Varying LU Decomposition by Using ZNN Method and ZeaD Formulas
- Author
-
Xiao Liu, Liangjie Ming, Jinjin Guo, Yunong Zhang, and Zhonghua Li
- Subjects
Article Subject ,Discretization ,General Mathematics ,020208 electrical & electronic engineering ,02 engineering and technology ,LU decomposition ,law.invention ,symbols.namesake ,law ,0202 electrical engineering, electronic engineering, information engineering ,Euler's formula ,symbols ,QA1-939 ,Applied mathematics ,020201 artificial intelligence & image processing ,Realization (systems) ,Zhang neural network ,Mathematics - Abstract
In this paper, by employing the Zhang neural network (ZNN) method, an effective continuous-time LU decomposition (CTLUD) model is first proposed, analyzed, and investigated for solving the time-varying LU decomposition problem. Then, for the convenience of digital hardware realization, this paper proposes three discrete-time models, obtained by discretizing the proposed CTLUD model with the Euler, 4-instant Zhang et al. discretization (ZeaD), and 8-instant ZeaD formulas, respectively. Furthermore, the proposed models are used to perform the LU decomposition of three time-varying matrices with different dimensions. Results indicate that the proposed models are effective for solving the time-varying LU decomposition problem, and that the 8-instant ZeaD LU decomposition model has the highest precision among the three discrete-time models.
- Published
- 2021
159. Numerical Solution of Biot Equations in Quasi-static State
- Author
-
Alena Kopylova, Mikhail Novikov, Vadim Lisitsa, and Sergey A. Solovyev
- Subjects
Algebraic equation ,Biot number ,law ,Frequency domain ,Finite difference ,Applied mathematics ,Solver ,Linear equation ,LU decomposition ,Mathematics ,law.invention ,Stiffness matrix - Abstract
This paper presents a numerical algorithm to simulate low-frequency loading of fluid-filled poroelastic materials and estimate the effective frequency-dependent strain-stress relations for such media. The algorithm solves the Biot equations in the quasi-static state in the frequency domain. Thus, a large-scale system of linear algebraic equations has to be solved for each temporal frequency. We use a direct solver based on LU decomposition to solve this system of linear equations. According to the presented numerical examples, the suggested algorithm allows reconstructing the stiffness tensor within a wide range of frequencies for realistic large-scale samples within several minutes. Thus, the estimation of frequency-dependent stiffness tensors can be done in a routine manner and statistical data may be accumulated.
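The per-frequency direct-solve pattern described above can be sketched with a sparse LU solver (a toy stand-in system, not the Biot discretization; the tridiagonal K and identity C are invented placeholders for stiffness and damping operators):

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import splu

n = 200
main = 2.0 * np.ones(n)
off = -1.0 * np.ones(n - 1)
K = sp.diags([off, main, off], [-1, 0, 1], format="csc")  # stand-in stiffness
C = sp.identity(n, format="csc")                          # stand-in damping
b = np.ones(n)                                            # loading vector

# One sparse LU factorization and solve per temporal frequency
for w in (1.0, 2.0, 4.0):
    A = (K + 1j * w * C).tocsc()
    x = splu(A).solve(b.astype(complex))
    assert np.allclose(A @ x, b)
```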
- Published
- 2021
160. Gaussian Elimination versus Greedy Methods for the Synthesis of Linear Reversible Circuits
- Author
-
Benoît Valiron, Cyril Allouche, Timothée Goubault de Brugière, Marc Baboulin, Simon Martiel, Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS), Laboratoire Interdisciplinaire des Sciences du Numérique (LISN), Institut National de Recherche en Informatique et en Automatique (Inria)-CentraleSupélec-Université Paris-Saclay-Centre National de la Recherche Scientifique (CNRS), Laboratoire Méthodes Formelles (LMF), Institut National de Recherche en Informatique et en Automatique (Inria)-CentraleSupélec-Université Paris-Saclay-Centre National de la Recherche Scientifique (CNRS)-Ecole Normale Supérieure Paris-Saclay (ENS Paris Saclay), CentraleSupélec, and Atos Bull
- Subjects
FOS: Computer and information sciences ,synthesis ,Computer science ,Computer Science - Emerging Technologies ,FOS: Physical sciences ,costs ,law.invention ,symbols.namesake ,quantum ,Operator (computer programming) ,benchmark ,Gaussian elimination ,[PHYS.QPHY]Physics [physics]/Quantum Physics [quant-ph] ,law ,factorization ,[INFO]Computer Science [cs] ,qubit ,computer ,Electronic circuit ,Quantum computer ,Quantum Physics ,quantum computation ,random ,General Medicine ,LU decomposition ,Linear reversible circuit ,Emerging Technologies (cs.ET) ,Bounded function ,Qubit ,Scalability ,operator ,symbols ,Quantum Physics (quant-ph) ,Algorithm - Abstract
Linear reversible circuits represent a subclass of reversible circuits with many applications in quantum computing. These circuits can be efficiently simulated by classical computers, and their size is polynomially bounded by the number of qubits, making them a good candidate for deploying efficient methods to reduce computational costs. We propose a new algorithm for synthesizing any linear reversible operator by using an optimized version of the Gaussian elimination algorithm coupled with a tuned LU factorization. We also improve the scalability of purely greedy methods. Overall, on random operators, our algorithms improve the state-of-the-art methods for specific ranges of problem sizes: the custom Gaussian elimination algorithm provides the best results for large problem sizes (n > 150), while the purely greedy methods provide quasi-optimal results when n < 30. On a benchmark of reversible functions, we manage to significantly reduce the CNOT count and the depth of the circuit while keeping other metrics of importance (T-count, T-depth) as low as possible.
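The baseline that the paper's optimized variant improves on can be sketched as plain Gaussian elimination over GF(2) (a minimal illustration, not the authors' tuned algorithm): each recorded row addition corresponds to one CNOT gate.

```python
import numpy as np

def synthesize_cnots(M):
    """Reduce an invertible boolean matrix to the identity by row
    additions over GF(2), recording each addition as a CNOT(control,
    target). Plain Gauss-Jordan, no cost optimization."""
    A = M.copy() % 2
    n = A.shape[0]
    cnots = []
    for col in range(n):
        if A[col, col] == 0:                 # bring a pivot into place
            r = next(r for r in range(col + 1, n) if A[r, col])
            A[col] ^= A[r]
            cnots.append((r, col))
        for row in range(n):                 # clear the rest of the column
            if row != col and A[row, col]:
                A[row] ^= A[col]
                cnots.append((col, row))
    return cnots

M = np.array([[1, 1, 0], [0, 1, 0], [0, 1, 1]], dtype=np.uint8)
cnots = synthesize_cnots(M)
# CNOTs are self-inverse, so replaying the list in reverse on the
# identity reconstructs the operator M
B = np.eye(3, dtype=np.uint8)
for c, t in reversed(cnots):
    B[t] ^= B[c]
assert (B == M).all()
```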
- Published
- 2021
161. Unconstrained representation of orthogonal matrices with application to common principal components
- Author
-
Antonio Punzo and Luca Bagnato
- Subjects
Statistics and Probability ,LU decomposition ,Common principal components ,Leptokurtic-normal distribution ,Triangular matrix ,Permutation matrix ,01 natural sciences ,law.invention ,Combinatorics ,010104 statistics & probability ,Matrix (mathematics) ,QR decomposition ,law ,0502 economics and business ,Orthogonal matrix ,0101 mathematics ,050205 econometrics ,Mathematics ,05 social sciences ,FG algorithm ,Computational Mathematics ,Invertible matrix ,Settore SECS-S/01 - STATISTICA ,Product (mathematics) ,Statistics, Probability and Uncertainty ,Unit (ring theory) - Abstract
Many statistical problems involve the estimation of a (d × d) orthogonal matrix Q. Such an estimation is often challenging due to the orthonormality constraints on Q. To cope with this problem, we use the well-known PLU decomposition, which factorizes any invertible (d × d) matrix as the product of a (d × d) permutation matrix P, a (d × d) unit lower triangular matrix L, and a (d × d) upper triangular matrix U. Thanks to the QR decomposition, we find the formulation of U when the PLU decomposition is applied to Q. We call the result the PLR decomposition; it produces a one-to-one correspondence between Q and the d(d-1)/2 entries below the diagonal of L, which are advantageously unconstrained real values. Thus, once the decomposition is applied, regardless of the objective function under consideration, we can use any classical unconstrained optimization method to find the minimum (or maximum) of the objective function with respect to L. For illustrative purposes, we apply the PLR decomposition in common principal components analysis (CPCA) for the maximum likelihood estimation of the common orthogonal matrix when a multivariate leptokurtic-normal distribution is assumed in each group. Compared to the commonly used normal distribution, the leptokurtic-normal has an additional parameter governing the excess kurtosis; this makes the estimation of Q in CPCA more robust against mild outliers. The usefulness of the PLR decomposition in leptokurtic-normal CPCA is illustrated by two biometric data analyses.
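The counting argument above can be checked numerically (a sketch of the PLU step only, not the full PLR construction): factoring an orthogonal Q as P L U leaves exactly d(d-1)/2 unconstrained entries below the unit diagonal of L.

```python
import numpy as np
from scipy.linalg import lu, qr

rng = np.random.default_rng(1)
d = 4
Q, _ = qr(rng.standard_normal((d, d)))   # a random (d x d) orthogonal matrix

# PLU decomposition: Q = P @ L @ U with L unit lower triangular
P, L, U = lu(Q)
assert np.allclose(P @ L @ U, Q)
assert np.allclose(np.diag(L), 1.0)      # unit diagonal, carries no information

# The entries below the diagonal of L are unconstrained real values
free_params = L[np.tril_indices(d, k=-1)]
assert free_params.size == d * (d - 1) // 2   # 6 free parameters for d = 4
```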
- Published
- 2021
162. One Step at a Time: The Origins of Sequential Simulation and Beyond
- Author
-
J. Jaime Gómez-Hernández and R. Mohan Srivastava
- Subjects
Generality ,INGENIERIA HIDRAULICA ,Computer science ,Covariance matrix ,Gaussian ,Random functions ,Triangular matrix ,Large grids ,LU decomposition ,law.invention ,symbols.namesake ,Mathematics (miscellaneous) ,Stochastic processes ,Kriging ,law ,Path (graph theory) ,symbols ,General Earth and Planetary Sciences ,Extreme value theory ,Algorithm - Abstract
In the mid-1980s, still in his young 40s, Andre Journel was already recognized as one of the giants of geostatistics. Many of the contributions from his new research program at Stanford University had centered around the indicator methods that he developed: indicator kriging and multiple indicator kriging. But when his second crop of graduate students arrived at Stanford, indicator methods still lacked an approach to conditional simulation that was not tainted by what Andre called the 'Gaussian disease'; early indicator simulations went through the tortuous path of converting all indicators to Gaussian variables, running a turning bands simulation, and truncating the resulting multi-Gaussian realizations. When he conceived of sequential indicator simulation (SIS), even Andre likely did not recognize the generality of an approach to simulation that tackled the simulation task one step at a time. The early enthusiasm for SIS was its ability, in its multiple-indicator form, to cure the Gaussian disease and to build realizations in which spatial continuity did not deteriorate in the extreme values. Much of Stanford's work in the 1980s focused on petroleum geostatistics, where extreme values (the high-permeability fracture zones and the low-permeability shale barriers) have much stronger anisotropy, and much longer ranges of correlation in the maximum continuity direction, than mid-range values. With multi-Gaussian simulations necessarily imparting weaker continuity to the extremes, SIS was an important breakthrough. The generality of the sequential approach was soon recognized, first through its analogy with multivariate unconditional simulation achieved using the lower triangular matrix of an LU decomposition of the covariance matrix as the multiplier of random normal deviates. Modifying LU simulation so that it became conditional gave rise to sequential Gaussian simulation (SGS), an algorithm that shared much in common with SIS.
With nagging implementation details like the sequential path and the search neighborhood being common to both methods, improvements in either SIS or SGS often became improvements to the other. Almost half of the contributors to this Special Issue became students of Andre in the classes of 1984-1988, and several are the pioneers of SIS and SGS. Others who studied later with Andre explored and developed the first multipoint statistics simulation procedures, which are based on the same concept that underlies sequential simulation. Among his many significant intellectual accomplishments, one of the cornerstones of Andre Journel's legacy was sequential simulation, built one step at a time. The first author wishes to acknowledge the financial contribution of the Spanish Ministry of Science and Innovation through Project Number PID2019-109131RB-I00.
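The LU simulation idea mentioned above can be sketched in a few lines (a hedged illustration with an invented 1-D exponential covariance; the lower factor is computed here via Cholesky, which for a symmetric positive-definite covariance plays the role of the lower LU factor): multiplying the lower triangular factor by i.i.d. normal deviates yields a field with the prescribed covariance.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50
t = np.arange(n)
# Invented example: exponential covariance on a 1-D grid
C = np.exp(-np.abs(t[:, None] - t[None, :]) / 10.0)

L = np.linalg.cholesky(C)             # lower triangular factor of C
z = L @ rng.standard_normal(n)        # one unconditional realization

# Empirical check: many realizations reproduce the target covariance
Z = L @ rng.standard_normal((n, 5000))
C_emp = Z @ Z.T / 5000
assert np.abs(C_emp - C).max() < 0.2
```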
- Published
- 2021
- Full Text
- View/download PDF
163. Parallel LU Decomposition Algorithm for Exa-Scale Computing Using Spark Ignite
- Author
-
S. Harini, Komal Thakkar, and Aswathy Ravikumar
- Subjects
Block LU decomposition ,Matrix (mathematics) ,law ,Computer science ,Data parallelism ,Spark (mathematics) ,Cache ,Parallel computing ,Supercomputer ,Time complexity ,LU decomposition ,law.invention - Abstract
LU decomposition is one of the most efficient algorithms that can be applied to various operations, such as solving linear equations, finding the determinant of a given matrix, and matrix inversion. In any real-time application involving one of the above scenarios, the matrix size will be gigantic and the matrix cannot be efficiently decomposed on a single node. In this paper, we propose a scalable algorithm for LU decomposition that is frugal in terms of time. This is accomplished by pipelining block LU decomposition on a multi-node Apache Spark system that is integrated with Apache Ignite. With the introduction of Ignite RDDs, the entire dataset is available to all the nodes through a shared Ignite cache layer, and only references to memory locations need to be passed across nodes. This is especially significant for exa-scale computing, where network latency is a major issue. The proposed algorithm is future-oriented and ready to deal with the efficient decomposition of large matrices, with a time complexity of O(N²).
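The block LU decomposition that the abstract pipelines across Spark nodes can be sketched serially as follows (a minimal right-looking variant without pivoting, assuming a matrix for which pivoting is unnecessary, e.g. diagonally dominant; this is not the paper's distributed implementation):

```python
import numpy as np
from scipy.linalg import solve_triangular

def lu_unblocked(A):
    """Unpivoted Doolittle LU of a small diagonal block."""
    n = A.shape[0]
    L, U = np.eye(n), A.astype(float).copy()
    for j in range(n - 1):
        L[j + 1:, j] = U[j + 1:, j] / U[j, j]
        U[j + 1:, :] -= np.outer(L[j + 1:, j], U[j, :])
    return L, U

def block_lu(A, b=2):
    """Right-looking block LU without pivoting."""
    A = A.astype(float).copy()
    n = A.shape[0]
    L, U = np.eye(n), np.zeros((n, n))
    for k in range(0, n, b):
        e = min(k + b, n)
        # Factor the diagonal block
        L[k:e, k:e], U[k:e, k:e] = lu_unblocked(A[k:e, k:e])
        if e < n:
            # Panel solves: U12 = L11^-1 A12 and L21 = A21 U11^-1
            U[k:e, e:] = solve_triangular(L[k:e, k:e], A[k:e, e:], lower=True)
            L[e:, k:e] = solve_triangular(U[k:e, k:e].T, A[e:, k:e].T,
                                          lower=True).T
            # Schur-complement update of the trailing submatrix
            A[e:, e:] -= L[e:, k:e] @ U[k:e, e:]
    return L, U

rng = np.random.default_rng(0)
n = 8
A = rng.standard_normal((n, n)) + n * np.eye(n)   # diagonally dominant
L, U = block_lu(A, b=3)
assert np.allclose(L @ U, A)
```

In the distributed setting, each panel solve and each Schur-complement update is an independent task, which is what makes the blocks natural units to pipeline across nodes.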
- Published
- 2021
164. Optimizing Sparse Matrix Storage for the Big Data Era
- Author
-
Raúl Marichal, Pablo Ezzatti, and Ernesto Dufrechou
- Subjects
Numerical linear algebra ,Computer science ,MathematicsofComputing_NUMERICALANALYSIS ,Evolutionary algorithm ,Context (language use) ,Graph theory ,Parallel computing ,computer.software_genre ,Row and column spaces ,LU decomposition ,law.invention ,Matrix (mathematics) ,law ,computer ,Sparse matrix - Abstract
The efficient handling of sparse matrices is essential in many applications. In particular, they are critical in Big Data applications that involve large graphs, as these are often represented as sparse matrices. In this context, it is necessary to provide matrix storage formats that save memory, avoid pointless computations, and enable convenient memory accesses. Reordering techniques, which permute the rows and columns of the matrix to achieve better nonzero patterns, have long been applied in the sparse numerical linear algebra field. However, their use has focused on reducing the matrix bandwidth or the fill-in during the LU factorization. In this work, we study the application of reordering techniques to enhance sparse hybrid storage formats. Concretely, we develop an evolutionary algorithm (EA) to provide the matrix reordering, following different optimization goals. The results show that the nonzero patterns obtained after applying the EA’s reorderings are superior to those obtained through standard reorderings, regarding the use of hybrid sparse matrix formats.
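A standard reordering baseline of the kind the abstract contrasts with can be sketched as follows (reverse Cuthill-McKee for bandwidth reduction, not the paper's evolutionary algorithm; the random test matrix is invented):

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import reverse_cuthill_mckee

A = sp.random(60, 60, density=0.05, random_state=42)
A = ((A + A.T) > 0).astype(float).tocsr()   # symmetric nonzero pattern

def bandwidth(M):
    r, c = M.nonzero()
    return int(np.abs(r - c).max())

# Permute rows and columns with the same permutation to shrink bandwidth
perm = reverse_cuthill_mckee(A, symmetric_mode=True)
B = A[perm][:, perm]
print(bandwidth(A), "->", bandwidth(B))     # bandwidth typically shrinks
```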
- Published
- 2021
165. Computationally-Efficient Frequency-Domain Wavefield Reconstruction Inversion with Direct Solver
- Author
-
Hossein S. Aghamiry, Ali Gholami, and Stéphane Operto
- Subjects
Loop (topology) ,Operator (computer programming) ,Computer science ,law ,Computation ,Frequency domain ,Biconvex optimization ,Solver ,Algorithm ,Inversion (discrete mathematics) ,LU decomposition ,law.invention - Abstract
We propose a double iteration refinement approach for wavefield reconstruction inversion (WRI). The first is a refinement loop based on the alternating direction method of multipliers (ADMM), which solves full waveform inversion (FWI) formulated as a PDE-constrained biconvex optimization problem, known as iteratively refined (IR)-WRI. The second loop, the main subject of this abstract, decreases the number of LU factorizations required in IR-WRI; hence we call the new algorithm double iteration refinement. The main computational burden of IR-WRI is the LU factorization required at each iteration to reconstruct the so-called data-assimilated wavefield. Here, we use the LU factorization of an approximate static operator for this task, which allows us to perform the LU factorization only once, prior to the IR-WRI iterations. We then use a second, inner refinement loop to compensate for the errors introduced by this approximation and clean up the wavefield. Numerical examples show the effectiveness of this double iteration refinement approach for the efficient application of WRI.
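The factor-once / refine-many pattern described above can be sketched generically (a toy dense example, not the WRI operator; the perturbation size is invented): factor an approximate operator once, then use an inner refinement loop to correct for the approximation error.

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(3)
n = 100
A0 = rng.standard_normal((n, n)) + n * np.eye(n)   # approximate static operator
A = A0 + 0.01 * rng.standard_normal((n, n))        # true operator, A ~ A0
b = rng.standard_normal(n)

lu_piv = lu_factor(A0)          # one LU factorization, done up front
x = lu_solve(lu_piv, b)         # initial solve with the approximate factors
for _ in range(20):             # inner refinement loop cleans up the error
    r = b - A @ x               # residual w.r.t. the true operator
    x += lu_solve(lu_piv, r)    # correction reuses the same factors
assert np.allclose(A @ x, b)
```

The loop converges because the iteration contracts with factor roughly equal to the norm of A0^-1 (A - A0), which is small when the static operator is a good approximation.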
- Published
- 2021
166. Mathematics in Computer Science volume / Common Factors in Fraction-Free Matrix Decompositions
- Author
-
Middeke, Johannes, Jeffrey, David J., and Koutschan, Christoph
- Subjects
LU decomposition ,Fraction-free algorithms ,Exact linear system solving ,Gaussian elimination ,Smith–Jacobson normal form - Abstract
We consider LU and QR matrix decompositions using exact computations. We show that fraction-free Gauß-Bareiss reduction leads to triangular matrices having a non-trivial number of common row factors. We identify two types of common factors: systematic and statistical. Systematic factors depend on the reduction process, independent of the data, while statistical factors depend on the specific data. We relate the existence of row factors in the LU decomposition to factors appearing in the Smith–Jacobson normal form of the matrix. For statistical factors, we identify some of the mechanisms that create them and give estimates of the frequency of their occurrence. Similar observations apply to the common factors in a fraction-free QR decomposition. Our conclusions are tested experimentally.
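The fraction-free reduction the abstract builds on can be sketched as the classic Bareiss algorithm (a minimal version assuming nonzero leading pivots; the 3x3 integer matrix is invented): every intermediate entry stays an integer because each division is exact, and the last pivot equals the determinant.

```python
def bareiss(M):
    """Fraction-free Gauss-Bareiss elimination on an integer matrix.
    Returns an upper-triangular integer matrix; the last pivot of the
    result equals det(M). Assumes nonzero leading pivots."""
    A = [row[:] for row in M]
    n = len(A)
    prev = 1                    # previous pivot, initialized to 1
    for k in range(n - 1):
        for i in range(k + 1, n):
            for j in range(k + 1, n):
                # Exact integer division: divisibility is guaranteed
                A[i][j] = (A[i][j] * A[k][k] - A[i][k] * A[k][j]) // prev
            A[i][k] = 0
        prev = A[k][k]
    return A

M = [[2, 3, 1], [4, 7, 5], [6, 18, 22]]
U = bareiss(M)
assert U[-1][-1] == -16         # last pivot equals det(M), exactly
```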
- Published
- 2021
167. Fault-Tolerant LU Factorization Is Low Cost
- Author
-
Daniel Alberto Torres González, Laure Petrucci, and Camille Coti
- Subjects
law ,Computer science ,Computation ,Distributed computing ,Scalability ,Overhead (computing) ,Fault tolerance ,Square matrix ,LU decomposition ,law.invention - Abstract
At large scale, failures are statistically frequent and need to be taken into account. Tolerating failures has become a major challenge in parallel computing: as the size of systems grows, failures become more common, and some computation units are expected to fail during the execution of a program. Algorithms used in these programs must be scalable while being resilient to the hardware failures that will happen during the execution. In this paper, we present an algorithm that takes advantage of intrinsic properties of scalable communication-avoiding LU algorithms in order to make them fault-tolerant and proceed with the computation in spite of failures. We evaluate the overhead of the fault tolerance mechanisms with respect to failure-free execution on both tall-and-skinny matrices (TSLU) and square matrices (CALU), as well as the cost of a failure during the execution.
- Published
- 2021
168. Factorisation Path Based Refactorisation for High-Performance LU Decomposition in Real-Time Power System Simulation
- Author
-
Lennart Schumacher, Lukas Razik, Jan Dinkelbach, Andrea Benigni, and Antonello Monti
- Subjects
direct linear solvers ,matrix decomposition ,high-performance computing ,power system simulation ,Technology ,Control and Optimization ,Speedup ,Computer science ,Computation ,Energy Engineering and Power Technology ,Matrix decomposition ,law.invention ,Electric power system ,Matrix (mathematics) ,Power system simulation ,law ,Electrical and Electronic Engineering ,Engineering (miscellaneous) ,Renewable Energy, Sustainability and the Environment ,LU decomposition ,Benchmark (computing) ,ddc:620 ,Algorithm ,Energy (miscellaneous) - Abstract
Energies: open-access journal of related scientific research, technology development and studies in policy and management, 14(23), 7989 (2021). doi:10.3390/en14237989. Special Issue "Challenge and Research Trends of Power System Simulation" (guest editors: Prof. Dr. Chul-Hwan Kim and Prof. Dr. Gi-Hyeon Gwon). Published by MDPI, Basel.
- Published
- 2021
- Full Text
- View/download PDF
169. Automatic Parallelization of Affine Programs for Distributed Memory Systems
- Author
-
Shamil Magomedov and Artem S. Lebedev
- Subjects
Optimization problem ,Computer science ,Node (networking) ,Parallel computing ,computer.software_genre ,LU decomposition ,law.invention ,Automatic parallelization ,law ,Data exchange ,Polytope model ,Compiler ,Affine transformation ,computer - Abstract
The paper addresses the problems of spatially distributing data and computations and of organizing data exchange within a pool of parallel processes, in order to perform parallelizing optimization with data-locality improvement when compiling affine programs for distributed-memory systems with MPI support. The presented method of spatial distribution of data and computations relies on a polyhedral framework implementing the idea of reducing the construction of affine transformations of the program to a multi-objective optimization problem. Data and computation placements are constructed according to the spatial-locality principle and satisfy the forward-communication-only property. There is no single master node orchestrating the computational process and storing all the data to be processed; all the parallel processes are equal. Finally, an MPI program with MPI_Send and MPI_Recv invocations is generated. The concept of a communication polyhedron is introduced for modeling information exchange within an MPI communicator. Three linear algebra algorithms are taken as benchmarks: LU decomposition, syr2k, and atax. The parallelization results are compared with the output of the Pluto compiler in terms of performance.
- Published
- 2021
170. Block Low-Rank Matrices with Shared Bases: Potential and Limitations of the BLR^2 Format
- Author
-
Alfredo Buttari, Théo Mary, Cleve Ashcraft, ANSYS, Algorithmes Parallèles et Optimisation (IRIT-APO), Institut de recherche en informatique de Toulouse (IRIT), Université Toulouse 1 Capitole (UT1)-Université Toulouse - Jean Jaurès (UT2J)-Université Toulouse III - Paul Sabatier (UT3), Université Fédérale Toulouse Midi-Pyrénées-Université Fédérale Toulouse Midi-Pyrénées-Centre National de la Recherche Scientifique (CNRS)-Institut National Polytechnique (Toulouse) (Toulouse INP), Université Fédérale Toulouse Midi-Pyrénées-Université Toulouse 1 Capitole (UT1)-Université Toulouse - Jean Jaurès (UT2J)-Université Toulouse III - Paul Sabatier (UT3), Université Fédérale Toulouse Midi-Pyrénées, Performance et Qualité des Algorithmes Numériques (PEQUAN), Laboratoire d'Informatique de Paris 6 (LIP6), Sorbonne Université (SU)-Centre National de la Recherche Scientifique (CNRS)-Sorbonne Université (SU)-Centre National de la Recherche Scientifique (CNRS), Université Toulouse 1 Capitole (UT1), Université Fédérale Toulouse Midi-Pyrénées-Université Fédérale Toulouse Midi-Pyrénées-Université Toulouse - Jean Jaurès (UT2J)-Université Toulouse III - Paul Sabatier (UT3), Université Fédérale Toulouse Midi-Pyrénées-Centre National de la Recherche Scientifique (CNRS)-Institut National Polytechnique (Toulouse) (Toulouse INP), Université Fédérale Toulouse Midi-Pyrénées-Université Toulouse 1 Capitole (UT1), Centre National de la Recherche Scientifique (CNRS), and LIP6
- Subjects
Rank (linear algebra) ,Astrophysics::High Energy Astrophysical Phenomena ,010103 numerical & computational mathematics ,Astrophysics::Cosmology and Extragalactic Astrophysics ,computer.software_genre ,01 natural sciences ,law.invention ,Separable space ,Matrix (mathematics) ,law ,numerical linear algebra ,[INFO]Computer Science [cs] ,0101 mathematics ,[MATH]Mathematics [math] ,Column (data store) ,Astrophysics::Galaxy Astrophysics ,Mathematics ,Block (data storage) ,Discrete mathematics ,hierarchical matrices ,Numerical linear algebra ,block low-rank matrices ,Special class ,LU decomposition ,Data sparse matrices ,LU factorization ,computer ,Analysis ,block separable matrices - Abstract
International audience; We investigate a special class of data-sparse rank-structured matrices that combine a flat block low-rank (BLR) partitioning with the use of shared (called nested in the hierarchical case) bases. This format is to H² matrices what BLR is to H matrices: we therefore call it the BLR² matrix format. We present algorithms for the construction and LU factorization of BLR² matrices, and perform their cost analysis, both asymptotically and for a fixed problem size. With weak admissibility, BLR² matrices reduce to block separable matrices (the flat version of HBS/HSS). Our analysis and numerical experiments reveal some limitations of BLR² matrices with weak admissibility, which we propose to overcome with two approaches: strong admissibility, and the use of multiple shared bases per row and column.
- Published
- 2020
171. Improved Data Locality Using Morton-order Curve on the Example of LU Decomposition
- Author
-
Claudia Plant, Christian Bohm, and Martin Perdacher
- Subjects
020203 distributed computing ,Hardware_MEMORYSTRUCTURES ,Memory hierarchy ,CPU cache ,Computer science ,020207 software engineering ,02 engineering and technology ,Parallel computing ,LU decomposition ,Matrix multiplication ,law.invention ,Matrix decomposition ,Tree traversal ,law ,0202 electrical engineering, electronic engineering, information engineering ,Cache ,Nested loop join - Abstract
The LU decomposition is an essential element used in many linear algebra applications. Furthermore, it is used in LINPACK to benchmark the performance of modern multi-core processor environments. These processors offer a large memory hierarchy including multiple registers and various levels of cache. Registers and the L1 data cache are small in size but also very fast. The L2 or L3 cache memory is usually shared among cores and is larger but slower. For the LU decomposition, the latency of fetching data from main memory to the registers to perform a calculation also depends on the input matrix's memory access pattern. Here, we look at the block factorization algorithm, where the LU decomposition performance depends on the performance of the matrix multiplication. In both cases, the LU decomposition and the matrix multiplication, such a matrix is traversed by three nested loops. In this paper, we propose to traverse such loops in an order defined by a space-filling curve. This traversal dramatically improves data locality and offers effective exploitation of the memory hierarchy. Besides the canonical (line-by-line) access pattern, we demonstrate the traversal in Hilbert, Peano, and Morton order. Our extensive experiments show that the Morton order (or Z-order) and the inverse Morton order (or И-order) have better runtime performance than the others.
- Published
- 2020
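The Z-order traversal benchmarked in the entry above comes down to bit interleaving of the row and column indices; a minimal sketch (the function name and the 4×4 example are mine, not from the paper):

```python
def morton_index(i, j, bits=16):
    """Interleave the bits of row i and column j to form the Morton
    (Z-order) index; nearby (i, j) pairs map to nearby indices, which
    is what improves cache locality for blocked loop nests."""
    z = 0
    for b in range(bits):
        z |= ((i >> b) & 1) << (2 * b + 1)  # row bits at odd positions
        z |= ((j >> b) & 1) << (2 * b)      # column bits at even positions
    return z

# Traverse a 4x4 matrix in Morton order instead of row-by-row:
order = sorted(((i, j) for i in range(4) for j in range(4)),
               key=lambda ij: morton_index(*ij))
```

The first four visited cells form the top-left 2×2 block, then the traversal recurses over the remaining quadrants, which is the locality property the paper exploits.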
172. A Novel DDM Based on Full Basis Functions Division for Solving Electromagnetic Scattering
- Author
-
Nian Fushun, Zongjing Gu, Baoguo Yang, Yan Chen, and Shengli Liang
- Subjects
Computer science ,020206 networking & telecommunications ,Domain decomposition methods ,Basis function ,02 engineering and technology ,Division (mathematics) ,Topology ,LU decomposition ,law.invention ,Matrix (mathematics) ,law ,Triangle mesh ,0202 electrical engineering, electronic engineering, information engineering ,Perfect conductor ,Numerical stability - Abstract
A novel domain decomposition method based on Full Basis Function division (FBF-DDM) is proposed to analyze scattering problems from electrically large and complex perfect electric conductor (PEC) targets. Specifically, the original model surface is divided into several open subdomains according to the distribution of basis functions, and it is not necessary to add an artificial surface or buffer region between adjacent subdomains. Meanwhile, the single-layer toothed triangular mesh cells located on one side of the boundary line between adjacent subdomains are shared by the two adjacent subdomains to ensure the integrity of the basis functions. To improve simulation accuracy, each subdomain is solved independently using LU decomposition. Furthermore, the coupling between different subdomains is calculated in the near-field manner to avoid storing the mutual impedance matrix, and thus the memory requirement is greatly reduced. Numerical results demonstrate the validity and stability of the proposed method.
- Published
- 2020
173. Generalized Expanded-Blaum-Roth Codes and Their Efficient Encoding/Decoding
- Author
-
Yunghsiang S. Han, Hanxu Hom, Patrick P. C. Lee, You Wu, and Guojun Han
- Subjects
Discrete mathematics ,Computer science ,Prime number ,020206 networking & telecommunications ,020207 software engineering ,02 engineering and technology ,Vandermonde matrix ,LU decomposition ,law.invention ,law ,Encoding (memory) ,0202 electrical engineering, electronic engineering, information engineering ,Code (cryptography) ,Redundancy (engineering) ,Parity (mathematics) ,Column (data store) ,Decoding methods - Abstract
An Expanded-Blaum-Roth (EBR) code encodes a (p – 1) × k information array into a p × p array such that any bit in a column can be recovered within that column, and any k out of the p columns can retrieve all (p – 1) × k information bits, where p is a prime number. In this paper, we generalize the construction of EBR codes with a more flexible parameter: the number of bits stored in a column in the proposed construction can be not only a prime number but also an even number. In addition, we present an efficient encoding/decoding method for the proposed generalized EBR codes based on the LU factorization of the Vandermonde matrix. We show that the proposed encoding/decoding method has lower computational complexity than the existing method. Moreover, we show that the minimum symbol distance of generalized EBR codes is the same as that of EBR codes for some parameters.
- Published
- 2020
174. 2D Static Resource Allocation for Compressed Linear Algebra and Communication Constraints
- Author
-
Lionel Eyraud-Dubois, Mathieu Verite, Olivier Beaumont, High-End Parallel Algorithms for Challenging Numerical Simulations (HiePACS), Laboratoire Bordelais de Recherche en Informatique (LaBRI), Université de Bordeaux (UB)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB)-Centre National de la Recherche Scientifique (CNRS)-Université de Bordeaux (UB)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB)-Centre National de la Recherche Scientifique (CNRS)-Inria Bordeaux - Sud-Ouest, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria), Université de Bordeaux (UB), This work is supported by the Région Nouvelle-Aquitaine, under grant 2018-1R50119 'HPC scalable ecosystem'., ANR-19-CE46-0009,SOLHARIS,Solveurs pour architectures hétérogènes utilisant des supports d'exécution, objectif scalabilité(2019), ANR-17-CE25-0004,DASH,Ordonnancement de données pour le calcul haute-performance(2017), Verite, Mathieu, Solveurs pour architectures hétérogènes utilisant des supports d'exécution, objectif scalabilité - - SOLHARIS2019 - ANR-19-CE46-0009 - AAPG2019 - VALID, Ordonnancement de données pour le calcul haute-performance - - DASH2017 - ANR-17-CE25-0004 - AAPG2017 - VALID, Université de Bordeaux (UB)-Centre National de la Recherche Scientifique (CNRS)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB)-Université de Bordeaux (UB)-Centre National de la Recherche Scientifique (CNRS)-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB)-Inria Bordeaux - Sud-Ouest, and Université Bordeaux Segalen - Bordeaux 2-Université Sciences et Technologies - Bordeaux 1-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB)-Centre National de la Recherche Scientifique 
(CNRS)-Université Bordeaux Segalen - Bordeaux 2-Université Sciences et Technologies - Bordeaux 1-École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB)-Centre National de la Recherche Scientifique (CNRS)-Inria Bordeaux - Sud-Ouest
- Subjects
Computer science ,010103 numerical & computational mathematics ,02 engineering and technology ,Parallel computing ,Row and column spaces ,01 natural sciences ,Load Balancing ,law.invention ,BLR Format ,Load management ,law ,[INFO.INFO-DC] Computer Science [cs]/Distributed, Parallel, and Cluster Computing [cs.DC] ,Bock Cyclic ,0202 electrical engineering, electronic engineering, information engineering ,0101 mathematics ,Communication complexity ,020203 distributed computing ,Linear system ,Compression ,Randomized Algorithms ,Load balancing (computing) ,LU decomposition ,Matrix multiplication ,Linear Algebra ,Linear algebra ,BLR ,[INFO.INFO-DC]Computer Science [cs]/Distributed, Parallel, and Cluster Computing [cs.DC] - Abstract
International audience; This paper addresses static resource allocation problems for irregular distributed parallel applications. More precisely, we focus on two classical tiled linear algebra kernels: the Matrix Multiplication (MM) and the LU decomposition (LU) algorithms on large linear systems. On parallel distributed platforms, data exchanges can dramatically degrade the performance of linear algebra kernels, and in this context, compression techniques such as Block Low Rank (BLR) compression are good candidates both for limiting data storage on each processor and data exchanges between processors. On the other hand, the use of the BLR representation makes the static allocation of tiles to processors more complex. Indeed, the load associated with each tile depends on its compression factor, which induces a heterogeneous load balancing problem. In turn, solving this load balancing problem optimally might lead to complex allocation schemes, where the tiles allocated to a given processor are scattered all over the matrix. This in turn induces communication costs, since matrix multiplication and LU decomposition heavily rely on broadcast operations along rows and columns of processors, so that the communication volume is minimized when the maximal number of different processors on any row and column is minimized. In the fully homogeneous case, the 2D Block Cyclic (BC) allocation solves both the load balancing and communication minimization issues simultaneously, but it might lead to bad load balancing in the heterogeneous case. Our goal in this paper is to propose data allocation schemes dedicated to the BLR format and to prove that it is possible to obtain a good makespan while simultaneously balancing the load and minimizing the maximal number of different processors in any row or column.
- Published
- 2020
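The 2D Block Cyclic baseline the abstract above compares against can be sketched in a few lines (the mapping below is the textbook BC rule for a p × q processor grid, not code from the paper):

```python
def block_cyclic_owner(i, j, p, q):
    """Map tile (i, j) to a processor on a p x q grid: the classic
    2D Block Cyclic (BC) distribution. Each tile row touches at most
    q distinct processors and each tile column at most p, which is
    what bounds the row/column broadcast volume."""
    return (i % p) * q + (j % q)

# Owners of the first row of tiles on a 2x3 grid:
owners_row0 = [block_cyclic_owner(0, j, 2, 3) for j in range(6)]
```

On a homogeneous platform this mapping balances load and minimizes the per-row/per-column processor count at the same time, which is exactly the property the paper tries to preserve under heterogeneous (compressed) tile loads.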
175. A Modified Principal Component Regression Method for Quality-related Fault Detection
- Author
-
Wenxiao Gao, Aihua Zhang, and Zhongdang Yu
- Subjects
business.industry ,Computer science ,Pattern recognition ,LU decomposition ,Field (computer science) ,Fault detection and isolation ,Matrix decomposition ,law.invention ,law ,Principal component analysis ,Principal component regression ,Artificial intelligence ,business ,Coefficient matrix ,Statistical hypothesis testing - Abstract
As an effective data dimensionality reduction technique, principal component analysis (PCA) is widely used in the field of process monitoring. Principal component regression (PCR), an improved version of PCA, obtains the coefficient matrix between input and output by least-squares regression of the score matrix on the quality variables. However, the ability of PCR to detect quality-related faults still needs improvement. Focusing on this issue, a modified PCR (MPCR) method for quality-related fault detection is proposed in this paper. LU decomposition is introduced to further decompose the coefficient matrix of PCR, decomposing process variables into quality-related and quality-independent parts, and corresponding test statistics are designed to make the algorithm more suitable for modern industrial systems. The effectiveness of the algorithm is verified on a numerical example and the Tennessee Eastman process. The results show that the MPCR algorithm has a higher fault detection rate and better tracking performance than traditional PLS.
- Published
- 2020
176. Fast solvers for tridiagonal Toeplitz linear systems
- Author
-
Yi Yin, Yulin Zhang, Zhongyun Liu, Shan Li, and Universidade do Minho
- Subjects
15B05 ,65F05 ,010103 numerical & computational mathematics ,Tridiagonal Toeplitz matrices ,15A23 ,01 natural sciences ,law.invention ,Combinatorics ,law ,Beta (velocity) ,0101 mathematics ,65F10 ,Ciências Naturais::Matemáticas ,Physics ,Pivoting ,Science & Technology ,Tridiagonal matrix ,Diagonally dominant ,Applied Mathematics ,010102 general mathematics ,Linear system ,Block LU factorization ,Toeplitz matrix ,LU decomposition ,Computational Mathematics ,Schur complement ,Elementary transformation ,Matemáticas [Ciências Naturais] ,Diagonally dominant matrix - Abstract
Let A be a tridiagonal Toeplitz matrix denoted by A = Tritoep(β, α, γ). The matrix A is said to be: strictly diagonally dominant if |α| > |β| + |γ|, weakly diagonally dominant if |α| ≥ |β| + |γ|, subdiagonally dominant if |β| ≥ |α| + |γ|, and superdiagonally dominant if |γ| ≥ |α| + |β|. In this paper, we consider the solution of a tridiagonal Toeplitz system Ax = b, where A is subdiagonally dominant, superdiagonally dominant, or weakly diagonally dominant, respectively. We first consider the case of A being subdiagonally dominant. We transform A into a block 2×2 matrix by an elementary transformation and then solve the linear system using the block LU factorization. Compared with the LU factorization method with pivoting, our algorithm requires fewer flops and needs less memory storage and data transmission. In particular, our algorithm outperforms the LU factorization method with pivoting in terms of computing efficiency. We then deal with the superdiagonally dominant and weakly diagonally dominant cases, respectively. Numerical experiments are finally given to illustrate the effectiveness of our algorithms. This work was supported by the National Natural Science Foundation of China under Grant no. 11371075, the Hunan Key Laboratory of mathematical modeling and analysis in engineering, the research innovation program of Changsha University of Science and Technology for postgraduate students (Grant CX2019SS34), and Portuguese funds through FCT - Fundação para a Ciência, within Projects UIDB/00013/2020 and UIDP/00013/2020.
- Published
- 2020
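For reference alongside the entry above, the pivot-free LU sweep for a tridiagonal Toeplitz system (safe in the strictly diagonally dominant case) is only a few lines; this is the baseline Thomas-style algorithm, not the authors' block-LU variant for the sub/superdiagonally dominant cases:

```python
def tritoep_solve(beta, alpha, gamma, b):
    """Solve A x = b where A = Tritoep(beta, alpha, gamma): beta on the
    subdiagonal, alpha on the diagonal, gamma on the superdiagonal.
    Standard LU forward sweep + back substitution, no pivoting."""
    n = len(b)
    d = [0.0] * n                    # modified diagonal of U
    y = [0.0] * n                    # forward-substitution result
    d[0], y[0] = alpha, b[0]
    for k in range(1, n):
        m = beta / d[k - 1]          # multiplier (entry of L)
        d[k] = alpha - m * gamma
        y[k] = b[k] - m * y[k - 1]
    x = [0.0] * n
    x[-1] = y[-1] / d[-1]
    for k in range(n - 2, -1, -1):   # back substitution with U
        x[k] = (y[k] - gamma * x[k + 1]) / d[k]
    return x
```

For example, `tritoep_solve(1.0, 4.0, 1.0, [5.0, 6.0, 5.0])` solves the 3×3 system Tritoep(1, 4, 1)·x = b whose exact solution is x = (1, 1, 1).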
177. Robust Image Watermarking Using DWT, SVD, Chirp-Z and LU Decomposition
- Author
-
Mary Agoyi
- Subjects
Computer science ,ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION ,Image processing ,Watermark ,computer.file_format ,JPEG ,LU decomposition ,law.invention ,symbols.namesake ,Gaussian noise ,law ,Singular value decomposition ,symbols ,computer ,Digital watermarking ,Algorithm ,Histogram equalization - Abstract
The exponential increase in the use of multimedia technologies has made connectivity simpler, more convenient and quicker, but has also contributed to a number of infringements of digital content. Digital watermarking is known to offer a solution to compliance issues related to copyright infringement in multimedia technology. This paper proposes a robust approach to image watermarking by integrating the benefits of four algorithms: the discrete wavelet transform (DWT), the chirp z-transform (CZT), lower-upper (LU) decomposition and singular value decomposition (SVD). In this method, using a 1-level DWT, the image is broken down into its frequency sub-bands. The LL sub-band is then converted into the z-domain using the CZT. LU decomposition is then used to further decompose the image into two matrices, each of which is used to insert the watermark. SVD is applied to each of the matrices, and the singular values of the watermark are inserted into the matrices of the decomposed image. The robustness and imperceptibility of the proposed solution were tested by subjecting the watermarked image to various image processing operations. Experimental findings show that the proposed technique has high imperceptibility and a reasonable level of robustness against signal processing operations such as filtering, scaling, JPEG compression, rotation, gamma correction, blurring, cropping, Gaussian noise, contrast enhancement, histogram equalization, and salt & pepper noise.
- Published
- 2020
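The LU step shared by this and the neighboring watermarking entries is the textbook Doolittle decomposition; a generic sketch (not the paper's implementation, and without the pivoting a production solver would use):

```python
def lu_doolittle(a):
    """Doolittle LU without pivoting: A = L U with unit lower-triangular
    L. Watermarking schemes like the ones above embed bits into entries
    of L and/or U; this shows only the decomposition itself."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    U = [[0.0] * n for _ in range(n)]
    for i in range(n):
        L[i][i] = 1.0
        for j in range(i, n):        # row i of U
            U[i][j] = a[i][j] - sum(L[i][k] * U[k][j] for k in range(i))
        for j in range(i + 1, n):    # column i of L
            L[j][i] = (a[j][i] - sum(L[j][k] * U[k][i] for k in range(i))) / U[i][i]
    return L, U
```

For A = [[4, 3], [6, 3]] this yields L = [[1, 0], [1.5, 1]] and U = [[4, 3], [0, -1.5]], and L·U reconstructs A exactly.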
178. Hedgehog: Understandable Scheduler-Free Heterogeneous Asynchronous Multithreaded Data-Flow Graphs
- Author
-
Gerson C. Kroiz, Loïc Yon, Walid Keyrouz, Alexandre Bardakoff, Timothy Blattner, Bruno Bachelet, National Institute of Standards and Technology [Gaithersburg] (NIST), Laboratoire d'Informatique, de Modélisation et d'Optimisation des Systèmes (LIMOS), Ecole Nationale Supérieure des Mines de St Etienne-Centre National de la Recherche Scientifique (CNRS)-Université Clermont Auvergne [2017-2020] (UCA [2017-2020]), Department of Mathematics [Maryland], University of Maryland [College Park], University of Maryland System-University of Maryland System, and Ecole Nationale Supérieure des Mines de St Etienne (ENSM ST-ETIENNE)-Université Clermont Auvergne [2017-2020] (UCA [2017-2020])-Centre National de la Recherche Scientifique (CNRS)
- Subjects
010302 applied physics ,Multi-core processor ,Computer science ,Dataflow ,Separation of concerns ,[INFO.INFO-SE]Computer Science [cs]/Software Engineering [cs.SE] ,02 engineering and technology ,Parallel computing ,Thread (computing) ,01 natural sciences ,LU decomposition ,law.invention ,Data flow diagram ,law ,Asynchronous communication ,0103 physical sciences ,0202 electrical engineering, electronic engineering, information engineering ,Programming paradigm ,020201 artificial intelligence & image processing ,[INFO.INFO-DC]Computer Science [cs]/Distributed, Parallel, and Cluster Computing [cs.DC] - Abstract
Getting performance on high-end heterogeneous nodes is challenging. This is due to the large semantic gap between a computation's specification (possibly mathematical formulas or an abstract sequential algorithm) and its parallel implementation; this gap obscures the program's parallel structures and how it gains or loses performance. We present Hedgehog, a library aimed at coarse-grain parallelism. It explicitly embeds a dataflow graph in a program and uses this graph at runtime to drive the program's execution so it takes advantage of hardware parallelism (multicore CPUs and multiple accelerators). Hedgehog has asynchronicity built in. It statically binds individual threads to graph nodes, which are ready to fire when any of their inputs are available. This allows Hedgehog to avoid using a global scheduler and the loss of performance associated with global synchronizations and the management of thread pools. Hedgehog provides a separation of concerns and distinguishes between compute and state-maintenance tasks. Its API reflects this separation and allows a developer to gain a better understanding of performance when executing the graph. Hedgehog is implemented as a C++17 headers-only library. One feature of the framework is its low overhead; it transfers control of data between two nodes in about 1 µs. This low overhead combines with Hedgehog's API to provide essentially cost-free profiling of the graph, thereby enabling experimentation for performance, which enhances a developer's insight into a program's performance. Hedgehog's asynchronous dataflow graph supports a data-streaming programming model both within and between graphs. We demonstrate the effectiveness of this approach by highlighting the performance of streaming implementations of two numerical linear algebra routines, which are comparable to existing libraries: matrix multiplication achieves over 95% of the theoretical peak of 4 GPUs; LU decomposition with partial pivoting starts streaming partial final result blocks 40% earlier than waiting for the full result. The relative ease and understandability of obtaining performance with Hedgehog promises to enable non-specialists to target performance on high-end single nodes.
- Published
- 2020
179. A novel zero-watermark algorithm based on LU decomposition in NSST domain.
- Author
-
Han, Shao-Cheng and Zhang, Zhao-Ning
- Abstract
A novel zero-watermark algorithm based on LU decomposition and the Non-Subsampled Shearlet Transform (NSST) is proposed for copyright protection of digital images. In this method, the host image is first transformed with NSST, and a sub-image is extracted from the low-frequency approximation image at random positions determined by a Logistic chaotic system; the sub-image is then divided into non-overlapping sub-blocks. Next, LU decomposition is performed on each sub-block, and the zero-watermark is derived by judging the numerical relationship between the sum of the first-row elements of each sub-block's U matrix and the mean of these sums over all sub-blocks. Experimental results show that the method is robust to added noise, filtering and JPEG compression, and can resist cropping and RST attacks to some extent. [ABSTRACT FROM PUBLISHER]
- Published
- 2012
- Full Text
- View/download PDF
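The bit-derivation rule described in the entry above reduces to thresholding each block's first-row U sum against the mean of all such sums; a minimal sketch (the tie-breaking choice at exact equality is my assumption, not stated in the abstract):

```python
def zero_watermark_bits(u_row_sums):
    """Derive the zero-watermark: one bit per sub-block, set by
    comparing each block's first-row sum of its U factor against
    the mean of all the sums (ties counted as 1 here)."""
    mean = sum(u_row_sums) / len(u_row_sums)
    return [1 if s >= mean else 0 for s in u_row_sums]
```

E.g. sums of (1.0, 3.0, 2.0) have mean 2.0, giving the watermark bits (0, 1, 1); verification later recomputes the same bits from a suspect image and compares.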
180. Design and analysis of layered coarse-grained reconfigurable architecture.
- Author
-
Rakossy, Zoltan Endre, Naphade, Tejas, and Chattopadhyay, Anupam
- Abstract
Coarse-grained reconfigurable architectures (CGRAs) represent an important class of programmable accelerators with a significant performance advantage for data-driven, systolic algorithms. In this paper, we present a novel CGRA where data access, data transport and execution are separately layered into dedicated, independent structures. The proposed architecture concept allows for independent control and optimization on each layer to address the storage access bottleneck, faced by state-of-the-art CGRAs. The architecture is programmable and the implementation is derived from a high-level language specification, allowing fast design exploration, debugging and simulation. Up to 50% run-time performance improvement and 5× area-time-energy product gain of the layered CGRA over a non-layered one is demonstrated with 2 case studies from demanding linear algebra applications. [ABSTRACT FROM PUBLISHER]
- Published
- 2012
- Full Text
- View/download PDF
181. IMAGE DENOISING USING LU DECOMPOSITION AND FEATURE EXTRACTION USING GLCM.
- Author
-
Nagarajan, D.
- Subjects
IMAGE denoising ,DOMAIN decomposition methods ,TEXTURE analysis (Image processing) ,DATA extraction ,DATA compression - Abstract
This paper proposes noise removal using LU decomposition and feature extraction using the gray-level co-occurrence matrix (GLCM). The size of the image is reduced after decomposition, and the compression ratio is calculated. When the compression ratio is low, the noise present in the data is also low. Image data contains redundant information, and it is therefore useful to decompose the data. Image decomposition techniques fall into two classes: spatial domain and frequency domain. Spatial-domain techniques operate on the image's gray-scale values. Texture is an important feature of an image. The GLCM is used to obtain second-order statistical features of an image and operates in the spatial domain. The features of an image include color, texture, shape and domain-specific features. The texture features energy, entropy, homogeneity, correlation and contrast have been calculated. The aim of the paper is to extract the texture features of an image and to compare the size of the image before and after decomposition. The compression ratio is calculated and the performance is evaluated before and after LU decomposition. [ABSTRACT FROM AUTHOR]
- Published
- 2017
- Full Text
- View/download PDF
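A GLCM and two of the texture features listed in the entry above can be computed as follows (a generic Haralick-style sketch, not the paper's code; `levels` assumes the image is already quantized to that many gray levels):

```python
def glcm(img, dx=1, dy=0, levels=4):
    """Gray-level co-occurrence matrix for pixel offset (dx, dy),
    normalized to probabilities, plus the energy and contrast
    features (standard second-order statistics)."""
    h, w = len(img), len(img[0])
    C = [[0] * levels for _ in range(levels)]
    for y in range(h):
        for x in range(w):
            y2, x2 = y + dy, x + dx
            if 0 <= y2 < h and 0 <= x2 < w:
                C[img[y][x]][img[y2][x2]] += 1   # count the pair
    total = sum(map(sum, C))
    P = [[c / total for c in row] for row in C]  # joint probabilities
    energy = sum(p * p for row in P for p in row)
    contrast = sum((i - j) ** 2 * P[i][j]
                   for i in range(levels) for j in range(levels))
    return P, energy, contrast

P, energy, contrast = glcm([[0, 0, 1], [0, 0, 1]], dx=1, dy=0, levels=2)
```

The entropy, homogeneity and correlation features mentioned in the abstract follow the same pattern: each is a weighted sum over the normalized matrix P.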
182. Single-channel color information security system using LU decomposition in gyrator transform domains.
- Author
-
Abuturab, Muhammad Rafiq
- Subjects
- *
GYRATORS , *MATHEMATICAL decomposition , *INFORMATION theory , *DATA encryption , *CIPHERS , *COLOR image processing - Abstract
Abstract: A novel single-channel color information security system based on LU decomposition in gyrator transform domains is proposed. The original color image to be encoded is separated into its red, green and blue channels. These are modulated by corresponding random phase functions and then independently Fourier transformed. The transformed images of the red and green channels are multiplied and then inverse Fourier transformed. The resulting image is phase- and amplitude-truncated to obtain an encrypted image and an asymmetric decryption key, respectively. The encrypted image is multiplied by the transformed image of the blue channel, and LU decomposition is then performed. Finally, the L and U parts are individually gyrator transformed at different transformation angles, which can be assigned to two different authorized users. The proposed single-channel encryption system is more compact than conventional three-channel encryption systems. Additionally, the ciphertexts are not color images but gray images with obscure properties. The presented LU form is asymmetric. The two transformation angles of the gyrator transform, the three decryption keys for the three channels and the one asymmetric decryption key significantly improve the security and robustness of the proposed method. The encryption system can be realized digitally or optically. Numerical simulations demonstrate the feasibility and effectiveness of the suggested algorithms. [Copyright © Elsevier]
- Published
- 2014
- Full Text
- View/download PDF
183. A Low Complexity-High Throughput QC-LDPC Encoder.
- Author
-
Mahdi, Ahmed and Paliouras, Vassilis
- Subjects
- *
ENCODING , *LOW density parity check codes , *FACTORIZATION , *SPARSE matrices , *ERROR correction (Information theory) - Abstract
This paper introduces hardware architectures for encoding Quasi-Cyclic Low-Density Parity-Check (QC-LDPC) codes. The proposed encoders are based on appropriate factorization and subsequent compression of the involved matrices by means of a novel technique that exploits features of recursively constructed QC-LDPC codes. This approach leads to linear encoding time complexity and requires a constant number of clock cycles for the computation of parity bits for all the constructed codes of various lengths that stem from a common base matrix. The proposed architectures are flexible, as they are parameterized and can support multiple code rates and codes of different lengths simply by appropriate initialization of memories and determination of data bus widths. Implementation results show that the proposed encoding technique is more efficient for some LDPC codes than previously proposed solutions. Both serial and parallel architectures are proposed. Hardware instantiations of the proposed serial encoders demonstrate high throughput with low area complexity for code words of many thousands of bits, achieving an area reduction compared to prior art. Furthermore, parallelization is shown to efficiently support multi-Gbps solutions at the cost of a moderate area increase. The proposed encoders are shown to outperform the current state of the art in throughput-area ratio and area-time complexity by 10 to 80 times for codes of comparable error-correction strength. [ABSTRACT FROM PUBLISHER]
- Published
- 2014
- Full Text
- View/download PDF
184. On the Parallel I/O Optimality of Linear Algebra Kernels: Near-Optimal LU Factorization
- Author
-
Timo Schneider, Torsten Hoefler, Alexandros Nikolaos Ziogas, Maciej Besta, Grzegorz Kwasniewski, and Tal Ben-Nun
- Subjects
Discrete mathematics ,FOS: Computer and information sciences ,020203 distributed computing ,Numerical linear algebra ,Computer science ,ScaLAPACK ,Model of computation ,020207 software engineering ,02 engineering and technology ,computer.software_genre ,Upper and lower bounds ,Parallel I/O ,LU decomposition ,law.invention ,Computer Science - Distributed, Parallel, and Cluster Computing ,law ,Linear algebra ,0202 electrical engineering, electronic engineering, information engineering ,Distributed, Parallel, and Cluster Computing (cs.DC) ,computer ,Abstraction (linguistics) - Abstract
Dense linear algebra kernels, such as linear solvers or tensor contractions, are fundamental components of many scientific computing applications. In this work, we present a novel method of deriving parallel I/O lower bounds for this broad family of programs. Based on the X-partitioning abstraction, our method explicitly captures inter-statement dependencies. Applying our analysis to LU factorization, we derive COnfLUX, an LU algorithm with a parallel I/O cost of N³/(P√M) communicated elements per processor, only 1/3× over our established lower bound. We evaluate COnfLUX on various problem sizes, demonstrating empirical results that match our theoretical analysis, communicating asymptotically less than Cray ScaLAPACK or SLATE, and outperforming the asymptotically optimal CANDMC library. Running on 1,024 nodes of Piz Daint, COnfLUX communicates 1.6× less than the second-best implementation and is expected to communicate 2.1× less on a full-scale run on Summit. (13 pages without references, 12 figures; submitted to PPoPP 2021: 26th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming.)
- Published
- 2020
185. s-Expanded Transfer Function using UaL Decomposition with Convolution & Deconvolution
- Author
-
Reza Hashemian
- Subjects
Discrete mathematics ,Matrix (mathematics) ,law ,Decomposition (computer science) ,Deconvolution ,Tuple ,LU decomposition ,RL circuit ,Matrix decomposition ,Mathematics ,Convolution ,law.invention - Abstract
This article provides two new features that in combination facilitate the construction of a transfer function for RC and RL circuits in s-expanded polynomials. First is the introduction of tuple-based array manipulations, where multiplications and divisions are replaced with convolutions and deconvolutions, respectively. Second, it introduces a new UaL decomposition for tuple-based matrices. The UaL decomposition is shown to be computationally more efficient than the traditional LU factorization. It is shown that the growth of the entries in both the U and L matrices is gradual and follows a geometric pattern as the decomposition progresses. Another important feature of the UaL decomposition is that it does not generate nonzero remainders; simply put, it is "cancellation free". The method is applied to the MNA representation of an RC (or RL) circuit. Finally, it is shown that working in the tuple-based format puts the U and L matrix entries into s-expanded form, such as P(s) = a_n s^n + a_(n-1) s^(n-1) + … + a_0, which in turn produces s-expanded transfer functions.
- Published
- 2020
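The core tuple-based operation described above, replacing polynomial multiplication with convolution of coefficient tuples, can be sketched directly (a generic illustration; the function name is mine, and deconvolution is the corresponding polynomial long division):

```python
def conv(p, q):
    """Multiply two polynomials stored as coefficient tuples
    (lowest degree first) by discrete convolution: the coefficient
    of s^k in the product is sum of p[i] * q[j] over i + j = k."""
    r = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            r[i + j] += a * b
    return tuple(r)

# (1 + s) * (2 + 3s) = 2 + 5s + 3s^2
product = conv((1, 1), (2, 3))
```

Because every matrix entry stays a coefficient tuple, the result of the decomposition is already in the s-expanded form P(s) = a_n s^n + … + a_0 that the article targets.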
186. Analog Solutions of Systems of Linear Equations on a Configurable Platform
- Author
-
Aishwarya Natarajan and Jennifer Hasler
- Subjects
Computer science ,Differential equation ,Numerical analysis ,Linear system ,MathematicsofComputing_NUMERICALANALYSIS ,System of linear equations ,LU decomposition ,law.invention ,Computational science ,Matrix decomposition ,Matrix (mathematics) ,law ,ComputingMethodologies_SYMBOLICANDALGEBRAICMANIPULATION ,Linear equation - Abstract
Even though analog computation is better suited to solving differential equations (ODEs, PDEs), it sometimes needs to solve systems of linear equations. This discussion focuses on analog solutions of linear equation systems, implemented on a configurable platform. Digital systems rely on solving linear equations as a fundamental numerical computation. Systems of linear equations are used to solve static circuits, illustrating that at least a reduced class of analog physical linear system computation should be possible. The analog approaches utilize iterative techniques, setting up a set of ODEs to solve the system of linear equations, rather than relying on matrix decompositions (e.g., LU decomposition). The approach allows for multiple potential configurable circuit implementations. A set of amplifier networks has been designed to demonstrate the solutions for different matrices. These techniques provide energy-efficient continuous-time solutions. The resulting algorithm has been studied considering the analog numerical analysis of the solution and convergence time.
- Published
- 2020
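The iterative analog approach, which replaces matrix factorization with an ODE whose equilibrium is the solution of $Ax = b$, can be mimicked digitally with a simple gradient-flow integration (a hedged sketch assuming the dynamics $\dot{x} = -A^{T}(Ax - b)$, not the authors' amplifier-network design):

```python
def solve_by_gradient_flow(A, b, dt=0.01, steps=5000):
    # Forward-Euler integration of dx/dt = -A^T (A x - b);
    # the unique equilibrium of this flow satisfies A x = b.
    n = len(b)
    x = [0.0] * n
    for _ in range(steps):
        # residual r = A x - b
        r = [sum(A[i][j] * x[j] for j in range(n)) - b[i] for i in range(n)]
        # gradient g = A^T r
        g = [sum(A[i][j] * r[i] for i in range(n)) for j in range(n)]
        x = [x[j] - dt * g[j] for j in range(n)]
    return x

# 2x + y = 3, x + 3y = 4 has the exact solution x = y = 1
x = solve_by_gradient_flow([[2.0, 1.0], [1.0, 3.0]], [3.0, 4.0])
```

The step size `dt` must stay below $2/\lambda_{\max}(A^{T}A)$ for the explicit integration to remain stable, a discrete counterpart of the convergence-time analysis mentioned in the abstract.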
187. Swept-frequency order-recursive method of moments for the efficient analysis of resonant microwave circuits.
- Author
-
Naishadham, Krishna
- Subjects
- *
MICROWAVE circuits , *WAVELENGTHS , *OPTICAL MEMS , *DOMAIN decomposition methods , *DIELECTRIC resonators , *COMPUTATIONAL complexity - Abstract
In the analysis of resonant microwave circuits by the method of moments (MoM), we typically subdivide the surface with a cell density determined by the smallest wavelength and compute the response at several frequencies using this single mesh. Although interpolation can be used in frequency, this approach typically involves matrix operations of complexity $O(kN^3)$ at the k discrete frequencies of MoM computation, and can become prohibitive for large structures. In this article, we develop an order-recursive method for solving the linear system, wherein the solution at the higher frequencies is iteratively constructed from the LU decomposition at the lower frequencies without having to solve the linear system from scratch at each frequency. The proposed order-recursive MoM allows for systematic adaptive gridding as the frequency increases, and avoids a large portion of matrix recomputation, potentially enabling significant reductions in computation time and memory. © 2014 Wiley Periodicals, Inc. Microwave Opt Technol Lett 56:195-198, 2014
- Published
- 2014
- Full Text
- View/download PDF
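The cost structure being exploited above, factorizing once and then reusing the LU factors for many cheap solves, can be sketched in a few lines (a generic Doolittle factorization without pivoting, not the paper's order-recursive update):

```python
def lu_decompose(A):
    # Doolittle LU without pivoting: A = L U, with unit diagonal on L.
    n = len(A)
    L = [[float(i == j) for j in range(n)] for i in range(n)]
    U = [row[:] for row in A]
    for k in range(n):
        for i in range(k + 1, n):
            L[i][k] = U[i][k] / U[k][k]
            for j in range(k, n):
                U[i][j] -= L[i][k] * U[k][j]
    return L, U

def lu_solve(L, U, b):
    # Forward then backward substitution: L y = b, then U x = y.
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = b[i] - sum(L[i][j] * y[j] for j in range(i))
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (y[i] - sum(U[i][j] * x[j] for j in range(i + 1, n))) / U[i][i]
    return x

A = [[4.0, 3.0], [6.0, 3.0]]
L, U = lu_decompose(A)            # O(n^3), paid once
x = lu_solve(L, U, [10.0, 12.0])  # O(n^2) per right-hand side; x == [1.0, 2.0]
```

The factorization cost is paid once while each additional solve costs only the two substitutions, which is the saving the order-recursive frequency sweep generalizes.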
188. Investigating the Use of Pipelined LU Decomposition to Solve Systems of Linear Equations
- Author
-
Anas Bushnag
- Subjects
Computer science ,law ,Multithreading ,Linear system ,Parallel computing ,Method of moments (statistics) ,Coefficient matrix ,System of linear equations ,LU decomposition ,Matrix decomposition ,law.invention ,Block (data storage) - Abstract
Large linear systems arise in many applications, sometimes with millions of equations (e.g., in aircraft manufacturing). Solving these systems requires methods that allow faster processing and easier implementation. The LU decomposition method and its pipelined version are implemented to illustrate their efficiency. A comprehensive comparison between the sequential LU decomposition, multithreaded LU decomposition, and TPL methods is conducted under different coefficient matrix block sizes to represent these systems. The results show that the multithreaded and TPL approaches reduce the processing time required to compute the coefficient matrix compared to sequential processing. This finding will help improve the performance of some techniques in the literature by parallelizing the LU decomposition part of the system.
- Published
- 2020
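The parallel structure that multithreaded LU variants exploit, namely that the row updates within one elimination step are independent, can be sketched as follows (illustrative only; in CPython the GIL limits real speedup, and this is not the paper's TPL implementation):

```python
from concurrent.futures import ThreadPoolExecutor

def lu_inplace_parallel(A, workers=4):
    # In-place Doolittle LU without pivoting. At each step k the updates of
    # rows k+1..n-1 are independent, so they are dispatched to a thread pool.
    n = len(A)

    def eliminate(k, i):
        m = A[i][k] / A[k][k]
        for j in range(k, n):
            A[i][j] -= m * A[k][j]
        A[i][k] = m  # store the L multiplier below the diagonal

    with ThreadPoolExecutor(max_workers=workers) as pool:
        for k in range(n - 1):
            # list() forces all row updates of step k to finish before step k+1
            list(pool.map(lambda i: eliminate(k, i), range(k + 1, n)))
    return A  # U on/above the diagonal, L multipliers below

A = [[4.0, 3.0], [6.0, 3.0]]
lu_inplace_parallel(A)
# A == [[4.0, 3.0], [1.5, -1.5]]
```

The barrier between steps (the `list(...)` call) is exactly what a pipelined version relaxes: later steps may begin on rows that earlier steps have already finished.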
189. Solving Block Low-Rank Linear Systems by LU Factorization is Numerically Stable
- Author
-
Nicholas J. Higham, Théo Mary, Department of Mathematics [Manchester] (School of Mathematics), University of Manchester [Manchester], Performance et Qualité des Algorithmes Numériques (PEQUAN), LIP6, Sorbonne Université (SU)-Centre National de la Recherche Scientifique (CNRS)-Sorbonne Université (SU)-Centre National de la Recherche Scientifique (CNRS), Centre National de la Recherche Scientifique (CNRS), and Sorbonne Université (SU)
- Subjects
Rank (linear algebra) ,floating-point arithmetic ,General Mathematics ,010103 numerical & computational mathematics ,computer.software_genre ,01 natural sciences ,Stability (probability) ,law.invention ,Factorization ,law ,Applied mathematics ,[INFO]Computer Science [cs] ,0101 mathematics ,[MATH]Mathematics [math] ,Mathematics ,Numerical linear algebra ,Applied Mathematics ,Linear system ,[INFO.INFO-NA]Computer Science [cs]/Numerical Analysis [cs.NA] ,LU decomposition ,rounding error analysis ,010101 applied mathematics ,Computational Mathematics ,numerical stability ,Block low-rank matrices ,LU factorization ,Round-off error ,computer ,[MATH.MATH-NA]Mathematics [math]/Numerical Analysis [math.NA] ,Numerical stability - Abstract
Block low-rank (BLR) matrices possess a blockwise low-rank property that can be exploited to reduce the complexity of numerical linear algebra algorithms. The impact of these low-rank approximations on the numerical stability of the algorithms in floating-point arithmetic has not previously been analysed. We present rounding error analysis for the solution of a linear system by LU factorization of BLR matrices. Assuming that a stable pivoting scheme is used, we prove backward stability: the relative backward error is bounded by a modest constant times $\varepsilon $, where the low-rank threshold $\varepsilon $ is the parameter controlling the accuracy of the blockwise low-rank approximations. In addition to this key result, our analysis offers three new insights into the numerical behaviour of BLR algorithms. First, we compare the use of a global or local low-rank threshold and find that a global one should be preferred. Second, we show that performing intermediate recompressions during the factorization can significantly reduce its cost without compromising numerical stability. Third, we consider different BLR factorization variants and determine the update–compress–factor variant to be the best. Tests on a wide range of matrices from various real-life applications show that the predictions from the analysis are realized in practice.
- Published
- 2020
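The blockwise low-rank compression that BLR matrices rely on can be sketched with a truncated SVD controlled by a threshold $\varepsilon$ (an illustrative sketch with an assumed smooth-kernel test block, not the paper's BLR code):

```python
import numpy as np

def compress_block(B, eps):
    # Low-rank approximation of a block: keep singular values above
    # eps * ||B||_2, as in a global-threshold BLR scheme.
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    k = max(1, int(np.sum(s > eps * s[0])))
    return U[:, :k] * s[:k], Vt[:k, :]

# A smooth, numerically low-rank test block (Cauchy-like kernel)
x = np.linspace(0.0, 1.0, 50)
B = 1.0 / (2.0 + x[:, None] + x[None, :])

X, Y = compress_block(B, eps=1e-8)
rel_err = np.linalg.norm(B - X @ Y, 2) / np.linalg.norm(B, 2)
# rel_err is at most eps by construction, with rank X.shape[1] far below 50
```

The threshold `eps` plays the role of the low-rank threshold $\varepsilon$ in the backward error bound of the abstract: the blockwise truncation error, not the working precision, dominates the accuracy.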
190. Mamdani FIS Combined With LU Decomposition Method and Two-Level LWT for Image Watermarking Technique
- Author
-
Areej M. Abduldaim, Raghad I. Sabri, and Jumana Waleed
- Subjects
business.industry ,Computer science ,Data_MISCELLANEOUS ,Wavelet transform ,Watermark ,Pattern recognition ,Image processing ,Fuzzy logic ,LU decomposition ,Matrix decomposition ,law.invention ,Robustness (computer science) ,law ,Artificial intelligence ,business ,Digital watermarking - Abstract
Fuzzy theory and linear algebra play a substantial role in various sciences, with perhaps the biggest impact in computer science. In the field of image processing, digital image watermarking is the inclusion of a secret logo or image, called the watermark, into a host image to obtain a watermarked image that guarantees copyright protection or ownership. The watermarking is acceptable if it achieves specific qualities such as imperceptibility and robustness. In this paper, a Mamdani fuzzy inference system (FIS) is employed on the original image to obtain a controller parameter that finds a trade-off between imperceptibility and robustness; it is combined with the algebraic matrix decomposition method LU to propose a digital watermarking technique in which the watermark is embedded in the LH2 band produced by applying two levels of the lifting wavelet transform (LWT), using modular arithmetic (mod). The experimental results show that the proposed algorithm is resistant to many attacks.
- Published
- 2020
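One level of a lifting wavelet transform, which the technique above applies twice before embedding, can be sketched for the 1-D Haar case (an unnormalized textbook lifting step, not the paper's implementation):

```python
def haar_lifting(x):
    # One level of the lifting scheme for the Haar wavelet:
    # split into even/odd samples, predict, then update.
    even, odd = x[0::2], x[1::2]
    detail = [o - e for e, o in zip(even, odd)]         # predict step
    approx = [e + d / 2 for e, d in zip(even, detail)]  # update step
    return approx, detail

approx, detail = haar_lifting([2.0, 4.0, 6.0, 8.0])
# approx == [3.0, 7.0], detail == [2.0, 2.0]
```

Applying the same step again to `approx` gives the second level; in 2-D, separable row/column passes produce the LL, LH, HL, and HH bands, of which LH2 is the embedding target here.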
191. LU Factorization of Any Matrix
- Author
-
Marc Stromberg
- Subjects
Pure mathematics ,Matrix (mathematics) ,Invertible matrix ,Factorization ,law ,Moore–Penrose pseudoinverse ,LU decomposition ,Square (algebra) ,Mathematics ,law.invention - Abstract
The usual methods of factoring a matrix depend on its being square and nonsingular. We show that neither of these properties is required for factorization and that any matrix has a factorization of the form PA = LU, and then present a few consequences of the factorization.
- Published
- 2020
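For square matrices, including singular ones, the existence of a factorization PA = LU can be illustrated with partial pivoting that simply skips a zero pivot column (a sketch for the square case; the paper covers arbitrary rectangular matrices as well):

```python
def plu(A):
    # PA = LU with partial pivoting; works for singular square matrices too,
    # since a zero pivot just leaves a zero on the diagonal of U.
    n = len(A)
    U = [row[:] for row in A]
    L = [[float(i == j) for j in range(n)] for i in range(n)]
    P = list(range(n))  # row permutation
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(U[i][k]))
        if p != k:
            U[k], U[p] = U[p], U[k]
            P[k], P[p] = P[p], P[k]
            for j in range(k):  # swap already-computed L entries too
                L[k][j], L[p][j] = L[p][j], L[k][j]
        if U[k][k] == 0.0:
            continue  # entire pivot column is zero: nothing to eliminate
        for i in range(k + 1, n):
            L[i][k] = U[i][k] / U[k][k]
            for j in range(k, n):
                U[i][j] -= L[i][k] * U[k][j]
    return L, U, P

# A singular matrix still factors: row 1 is twice row 0
L, U, P = plu([[1.0, 2.0], [2.0, 4.0]])
# P == [1, 0], L == [[1.0, 0.0], [0.5, 1.0]], U == [[2.0, 4.0], [0.0, 0.0]]
```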
192. Stability analysis of a diagonally implicit scheme of block backward differentiation formula for stiff pharmacokinetics models
- Author
-
Zarina Bibi Ibrahim, Zanariah Abdul Majid, Hazizah Mohd Ijam, and Norazak Senu
- Subjects
Backward differentiation formula ,Algebra and Number Theory ,Partial differential equation ,lcsh:Mathematics ,Applied Mathematics ,Diagonal ,Triangular matrix ,Stability (learning theory) ,Diagonally implicit ,lcsh:QA1-939 ,LU decomposition ,law.invention ,Block backward differentiation formula ,law ,Ordinary differential equation ,Applied mathematics ,Stability ,Pharmacokinetics models ,Stiff ODEs ,Analysis ,Block (data storage) ,Mathematics - Abstract
In this paper, we analyze the stability criteria of a method suited to ordinary differential equation models. The relevant proof that the method satisfies the condition of stiff stability is also provided. The aim of this paper is therefore to construct an efficient two-point block method based on the backward differentiation formula which is A-stable and convergent. The new diagonally implicit scheme is formulated to approximate the solution of pharmacokinetics models. By implementing the algorithm, the numerical solution to the models is compared with a few existing methods and established stiff solvers. The method yields significant advantages when the diagonally implicit form, with a lower triangular matrix and identical diagonal elements, is considered. The formula is designed in such a way that it permits a maximum of one LU decomposition for each integration stage.
- Published
- 2020
193. Solving linear systems over idempotent semifields through LU-factorization
- Author
-
Shaban Ghalandarzadeh, Sedighe Jamshidvand, Fateme Olia, and Amirhossein Amiraslani
- Subjects
Algebra ,law ,General Mathematics ,Linear system ,Idempotence ,Extension (predicate logic) ,Algebra over a field ,Square matrix ,Square (algebra) ,LU decomposition ,Mathematics ,law.invention - Abstract
In this paper, we introduce and analyze a new LU-factorization technique for square matrices over idempotent semifields. In particular, more emphasis is put on “max-plus” algebra here but the work is extended to other idempotent semifields as well. We first determine the conditions under which a square matrix has LU factors. Next, using this technique, we propose a method for solving square linear systems of equations whose system matrices are LU-factorizable. We also give conditions for an LU-factorizable system to have solutions. This work is an extension of similar techniques over fields. Maple® procedures for this LU-factorization are also included.
- Published
- 2020
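In the max-plus semiring, matrix "addition" is entrywise max and "multiplication" is max-of-sums. A small sketch shows a matrix together with max-plus LU factors whose product recovers it (an illustrative example, not the paper's factorization algorithm):

```python
NEG_INF = float("-inf")  # the additive identity of max-plus

def maxplus_matmul(A, B):
    # Max-plus matrix product: (A (x) B)[i][j] = max_k (A[i][k] + B[k][j]).
    n, m, p = len(A), len(B), len(B[0])
    return [[max(A[i][k] + B[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]

# L is lower triangular with 0 (the multiplicative identity) on its diagonal,
# U is upper triangular; their max-plus product is an LU-factorizable matrix.
L = [[0.0, NEG_INF], [1.0, 0.0]]
U = [[2.0, 3.0], [NEG_INF, 5.0]]
A = maxplus_matmul(L, U)
# A == [[2.0, 3.0], [3.0, 5.0]]
```

As in the abstract, not every square matrix over an idempotent semifield admits such factors; the paper characterizes exactly when they exist.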
194. Constructing Circuit Transfer Functions directly from the State Equations, Using Convolutions
- Author
-
Reza Hashemian
- Subjects
Discrete mathematics ,Polynomial ,Nodal admittance matrix ,020208 electrical & electronic engineering ,Characteristic equation ,02 engineering and technology ,LU decomposition ,020202 computer hardware & architecture ,Matrix decomposition ,Admittance parameters ,law.invention ,Matrix (mathematics) ,law ,0202 electrical engineering, electronic engineering, information engineering ,Tuple ,Mathematics - Abstract
Given the Nodal Admittance Matrix (NAM) of a linear circuit, this article provides means to generate any desired transfer function, or s-expanded polynomial, in a very efficient way. Three features are used in combination to facilitate this task. The first is the use of the state space and characteristic equation. The second is tuple-based array manipulation, in which multiplications and divisions are replaced with convolutions and deconvolutions. The last is the newly developed UaL decomposition for tuple-based matrices [7]. The UaL decomposition is proven to be computationally more efficient than the traditional LU factorization. It is shown that the growth of the entries in both the $U$ and $L$ matrices is gradual and follows a geometric pattern as the decomposition progresses. Another important feature of the UaL decomposition is that it does not generate nonzero remainders, or simply, it is an “error free” procedure. The method is applied to the Branch Admittance Matrix (BAM) representation of an RC (or RL) circuit. Finally, it is shown that working in the tuple-based format puts the $U$ and $L$ matrix entries into s-expanded form, such as $P(s)=a_{n}s^{n}+a_{n-1}s^{n-1}+\cdots+a_{0}$, which in turn produces $s$-expanded transfer functions.
- Published
- 2020
195. A Preconditioned Krylov Subspace Iterative Methods for Inverse Source Problem by Virtue of a Regularizing LM-DRBEM
- Author
-
Abdessamad El Madkouri and Abdellatif Ellabib
- Subjects
Biconjugate gradient method ,Iterative method ,Applied Mathematics ,010102 general mathematics ,Krylov subspace ,Residual ,01 natural sciences ,LU decomposition ,law.invention ,010101 applied mathematics ,Computational Mathematics ,law ,Reciprocity (electromagnetism) ,Conjugate gradient method ,Applied mathematics ,0101 mathematics ,Boundary element method ,Mathematics - Abstract
In this paper, we investigate an effective methodology for identifying a source term, based on a combined discontinuous dual reciprocity boundary element method and a regularized Levenberg–Marquardt algorithm. The mathematical model is described by minimizing the discrepancy between the potential measured at some points within the problem domain and the one estimated by the discontinuous dual reciprocity boundary element approximation. To provide a qualitative evaluation, the derived dense, nonsymmetric matrix system of the forward problem is solved by means of LU factorization and several preconditioned Krylov subspace iterative algorithms, namely the stabilized biconjugate gradient, the generalized minimal residual, and the conjugate gradient squared methods. Because of noise interference, a regularized Levenberg–Marquardt algorithm is adopted to retrieve the unknown source term. Extensive experiments demonstrate the effectiveness and stability of the proposed approach.
- Published
- 2020
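A preconditioned Krylov subspace iteration can be illustrated in miniature with a Jacobi-preconditioned conjugate gradient on a small symmetric system (a stand-in sketch only; the paper's systems are nonsymmetric and are solved with BiCGSTAB, GMRES, and CGS instead):

```python
def pcg(A, b, M_inv_diag, tol=1e-10, maxiter=100):
    # Conjugate gradient with a diagonal (Jacobi) preconditioner M = diag(A):
    # each iteration applies A once and the cheap preconditioner once.
    n = len(b)
    x = [0.0] * n
    r = b[:]
    z = [M_inv_diag[i] * r[i] for i in range(n)]
    p = z[:]
    rz = sum(r[i] * z[i] for i in range(n))
    for _ in range(maxiter):
        Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        alpha = rz / sum(p[i] * Ap[i] for i in range(n))
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        if sum(ri * ri for ri in r) ** 0.5 < tol:
            break
        z = [M_inv_diag[i] * r[i] for i in range(n)]
        rz_new = sum(r[i] * z[i] for i in range(n))
        p = [z[i] + (rz_new / rz) * p[i] for i in range(n)]
        rz = rz_new
    return x

A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x = pcg(A, b, [1 / 4, 1 / 3])  # converges to the solution of A x = b
```

For the dense nonsymmetric BEM systems of the paper, an LU-based preconditioner (e.g., an incomplete or block factorization) would replace the diagonal one used here.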
196. Node Numbering for Stabilizing Preconditioners Based on Incomplete LU Decomposition
- Author
-
Stephen L. Wood, William K. Anderson, and Kevin Jacobson
- Subjects
Combinatorics ,law ,Computer science ,Node (networking) ,Numbering ,LU decomposition ,law.invention - Published
- 2020
197. Comparison on Vandermond and Cauchy MDS Array Codes in Distributed Storage Systems
- Author
-
Xiangshou Yu, Hanxu Hou, and Guojun Han
- Subjects
Discrete mathematics ,Computational complexity theory ,Computer science ,Mathematics::History and Overview ,MathematicsofComputing_NUMERICALANALYSIS ,Cauchy distribution ,Data_CODINGANDINFORMATIONTHEORY ,Vandermonde matrix ,LU decomposition ,Mathematics::Numerical Analysis ,law.invention ,Matrix (mathematics) ,ComputingMethodologies_PATTERNRECOGNITION ,law ,Encoding (memory) ,ComputingMethodologies_SYMBOLICANDALGEBRAICMANIPULATION ,Computer Science::Multimedia ,Hardware_CONTROLSTRUCTURESANDMICROPROGRAMMING ,Decoding methods ,Cauchy matrix ,Computer Science::Information Theory - Abstract
Binary maximum distance separable (MDS) array codes are an important class of MDS codes with low computational complexity, as only XOR operations are involved in the encoding/decoding procedures. Most existing binary MDS array codes are constructed based on a Vandermonde or Cauchy matrix, as efficient decoding methods exist for Vandermonde and Cauchy linear systems. We refer to array codes whose encoding matrix is a Vandermonde matrix or a Cauchy matrix as Vandermonde MDS array codes and Cauchy MDS array codes, respectively. In this paper, we implement Vandermonde MDS array codes and Cauchy MDS array codes, and evaluate their encoding and decoding performance. Our results show that Vandermonde MDS array codes have better encoding/decoding performance than Cauchy MDS array codes. Specifically, the encoding rate of Vandermonde MDS array codes is about 58% higher than that of Cauchy MDS array codes, and the decoding rate is about 70% higher. In our implementation, the efficient decoding method is based on the LU factorization of the Vandermonde and Cauchy matrices; thus, only some patterns of the decoding procedure of Vandermonde MDS array codes are considered in this paper.
- Published
- 2020
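The role of Vandermonde systems in erasure decoding can be sketched in real arithmetic: k data symbols define a polynomial, and any k surviving evaluations recover them (a toy analogue; the paper's codes work over binary fields with XOR operations, and their fast decoder uses the LU structure of the Vandermonde matrix):

```python
import numpy as np

k = 3
data = np.array([5.0, 2.0, 7.0])          # polynomial coefficients (the data)
points = np.array([1.0, 2.0, 3.0, 4.0])   # n = 4 distinct evaluation points

# Encode: each share is the polynomial evaluated at one point,
# i.e. a row of the Vandermonde matrix times the data vector.
shares = np.vander(points, k, increasing=True) @ data

# Decode after one erasure: any k surviving shares give an invertible
# Vandermonde subsystem (distinct points), so the data is recoverable.
survivors = [0, 2, 3]
V = np.vander(points[survivors], k, increasing=True)
recovered = np.linalg.solve(V, shares[survivors])
```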
198. Mixed Precision Block Fused Multiply-Add: Error Analysis and Application to GPU Tensor Cores
- Author
-
Nicholas J. Higham, Srikara Pranesh, Florent Lopez, Pierre Blanchard, Théo Mary, Department of Mathematics [Manchester] (School of Mathematics), University of Manchester [Manchester], Innovative Computing Laboratory [Knoxville] (ICL), The University of Tennessee [Knoxville], Centre National de la Recherche Scientifique (CNRS), Performance et Qualité des Algorithmes Numériques (PEQUAN), LIP6, and Sorbonne Université (SU)-Centre National de la Recherche Scientifique (CNRS)-Sorbonne Université (SU)-Centre National de la Recherche Scientifique (CNRS)
- Subjects
fused multiply-add ,Floating point ,floating-point arithmetic ,Carry (arithmetic) ,matrix multiplication ,010103 numerical & computational mathematics ,01 natural sciences ,law.invention ,Computational science ,Matrix (mathematics) ,law ,Tensor (intrinsic definition) ,[INFO]Computer Science [cs] ,0101 mathematics ,[MATH]Mathematics [math] ,NVIDIA GPU ,Mathematics ,Block (data storage) ,Multiply–accumulate operation ,Applied Mathematics ,[INFO.INFO-NA]Computer Science [cs]/Numerical Analysis [cs.NA] ,LU decomposition ,Matrix multiplication ,rounding error analysis ,Computational Mathematics ,tensor cores ,LU factorization ,[INFO.INFO-DC]Computer Science [cs]/Distributed, Parallel, and Cluster Computing [cs.DC] ,[MATH.MATH-NA]Mathematics [math]/Numerical Analysis [math.NA] - Abstract
Computing units that carry out a fused multiply-add (FMA) operation with matrix arguments, referred to as tensor units by some vendors, have great potential for use in scientific computing. However, these units are inherently mixed precision, and existing rounding error analyses do not support them. We consider a mixed precision block FMA that generalizes both the usual scalar FMA and existing tensor units. We describe how to exploit such a block FMA in the numerical linear algebra kernels of matrix multiplication and LU factorization and give detailed rounding error analyses of both kernels. An important application is to GMRES-based iterative refinement with block FMAs, for which our analysis provides new insight. Our framework is applicable to the tensor core units in the NVIDIA Volta and Turing GPUs. For these we compare matrix multiplication and LU factorization with TC16 and TC32 forms of FMA, which differ in the precision used for the output of the tensor cores. Our experiments on an NVIDIA V100 GPU confirm the predictions of the analysis that the TC32 variant is much more accurate than the TC16 one, and they show that the accuracy boost is obtained with almost no performance loss.
- Published
- 2020
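The accuracy gap between the two output precisions can be emulated in software (a crude sketch: fp16 inputs with fp32 accumulation stands in for the TC32-style mode, while accumulating in fp16 throughout stands in for TC16; real tensor cores additionally compute the partial products exactly):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((64, 64)).astype(np.float16)
B = rng.standard_normal((64, 64)).astype(np.float16)

# "TC32"-style: fp16 inputs, all products and sums carried out in fp32
C32 = A.astype(np.float32) @ B.astype(np.float32)

# "TC16"-style: accumulate the rank-1 updates in fp16 throughout
C16 = np.zeros((64, 64), np.float16)
for k in range(64):
    C16 = (C16 + np.outer(A[:, k], B[k, :])).astype(np.float16)

ref = A.astype(np.float64) @ B.astype(np.float64)
err32 = np.max(np.abs(C32 - ref))
err16 = np.max(np.abs(C16 - ref))
# err16 is substantially larger than err32, matching the analysis's prediction
```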
199. Design and Implementation of MIPS Processor for LU Decomposition based on FPGA
- Author
-
Safaa S. Omran and Rusul Saad Khalil
- Subjects
Computer science ,Hardware description language ,Integrated circuit ,Square matrix ,LU decomposition ,Computational science ,law.invention ,Computer Science::Hardware Architecture ,Microprocessor ,Matrix (mathematics) ,law ,VHDL ,Field-programmable gate array ,computer ,computer.programming_language - Abstract
Solving a set of linear equations requires computing the inverse of a square matrix whose order equals the number of equations, an operation that involves many mathematical calculations. To address this, the LU decomposition of the matrix is used, which computes two matrices: a lower triangular matrix and an upper triangular matrix. This paper presents a design for a 32-bit MIPS (microprocessor without interlocked pipelined stages) processor with the instructions required to calculate the LU matrices. The design was implemented in VHDL (very high speed integrated circuit hardware description language) and then realized on an FPGA (field programmable gate array), a Xilinx Spartan 6. The results for the different parts of the processor are presented in the form of test bench waveforms, the architecture of the system is demonstrated, and the results match the theoretical ones.
- Published
- 2020
200. Accelerating Linear Algebra Kernels on a Massively Parallel Reconfigurable Architecture
- Author
-
Ronald G. Dreslinski, David Blaauw, Hun-Seok Kim, Subhankar Pal, Chaitali Chakrabarti, Trevor Mudge, A. Soorishetty, and Jian Zhou
- Subjects
010302 applied physics ,Computer science ,Triangular matrix ,Control reconfiguration ,02 engineering and technology ,Parallel computing ,Solver ,01 natural sciences ,LU decomposition ,020202 computer hardware & architecture ,QR decomposition ,law.invention ,Computer Science::Hardware Architecture ,Matrix (mathematics) ,law ,0103 physical sciences ,Linear algebra ,0202 electrical engineering, electronic engineering, information engineering ,Cache ,Massively parallel ,Transformer (machine learning model) - Abstract
Much of the recent work on domain-specific architectures has focused on bridging the gap between performance/efficiency and programmability. We consider one such example architecture, Transformer, consisting of light-weight cores interconnected by caches and crossbars that supports run-time reconfiguration between shared and private cache mode operations. We present customized implementations of a select set of linear algebra kernels, namely, triangular matrix solver, LU decomposition, QR decomposition and matrix inversion, on Transformer. The performance of the kernel algorithms is evaluated with respect to execution time and energy efficiency. Our study shows that each kernel achieves high performance for a certain cache mode and that this cache mode can change when the matrix size changes, making a case for run-time reconfiguration.
- Published
- 2020
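One of the kernels listed, QR decomposition, can be sketched with classical Gram-Schmidt (illustrative only, not the hand-tuned Transformer implementation):

```python
import numpy as np

def gram_schmidt_qr(A):
    # Classical Gram-Schmidt QR: orthonormalize the columns of A into Q,
    # recording the projection coefficients in upper-triangular R.
    m, n = A.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    for j in range(n):
        v = A[:, j].copy()
        for i in range(j):
            R[i, j] = Q[:, i] @ A[:, j]
            v -= R[i, j] * Q[:, i]
        R[j, j] = np.linalg.norm(v)
        Q[:, j] = v / R[j, j]
    return Q, R

A = np.array([[1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
Q, R = gram_schmidt_qr(A)
# Q has orthonormal columns and Q @ R reconstructs A
```

Like the triangular solve and LU kernels, the inner loop here is dominated by dot products and AXPY-style updates, which is what makes the shared-vs-private cache mode choice matter on such fabrics.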