770 results for "Enrique S. Quintana-Ortí"
Search Results
52. A New Generation of Task-Parallel Algorithms for Matrix Inversion in Many-Threaded CPUs.
53. High Performance and Energy Efficient Integer Matrix Multiplication for Deep Learning.
54. Evaluation of MPI Allreduce for Distributed Training of Convolutional Neural Networks.
55. Performance Modeling for Distributed Training of Convolutional Neural Networks.
56. Tiled Algorithms for Efficient Task-Parallel ℌ-Matrix Solvers.
57. Multiprecision Block-Jacobi for Iterative Triangular Solves.
58. Balanced and Compressed Coordinate Layout for the Sparse Matrix-Vector Product on GPUs.
59. High Performance and Portable Convolution Operators for Multicore Processors.
60. Accelerating distributed deep neural network training with pipelined MPI allreduce.
61. Machine learning for optimal selection of sparse triangular system solvers on GPUs.
62. Factorized solution of generalized stable Sylvester equations using many-core GPU accelerators.
63. On the performance of a GPU-based SoC in a distributed spatial audio system.
64. Low precision matrix multiplication for efficient deep learning in NVIDIA Carmel processors.
65. DMRlib: Easy-Coding and Efficient Resource Management for Job Malleability.
66. Adaptive Precision Block-Jacobi for High Performance Preconditioning in the Ginkgo Linear Algebra Software.
67. Compression and load balancing for efficient sparse matrix-vector product on multicore processors and graphics processing units.
68. Reformulating the direct convolution for high-performance deep learning inference on ARM processors.
69. Towards Continuous Benchmarking: An Automated Performance Evaluation Framework for High Performance Software.
70. Cholesky and Gram-Schmidt Orthogonalization for Tall-and-Skinny QR Factorizations on Graphics Processors.
71. Theoretical Scalability Analysis of Distributed Deep Convolutional Neural Networks.
72. Automatic Selection of Sparse Triangular Linear System Solvers on GPUs through Machine Learning Techniques.
73. Analysis of model parallelism for distributed neural networks.
74. Structure-Aware Calculation of Many-Electron Wave Function Overlaps on Multicore Processors.
75. Programming parallel dense matrix factorizations with look-ahead and OpenMP.
76. Integration and exploitation of intra-routine malleability in BLIS.
77. Tall-and-skinny QR factorization with approximate Householder reflectors on graphics processors.
78. Performance modeling of the sparse matrix-vector product via convolutional neural networks.
79. Acceleration of PageRank with Customized Precision Based on Mantissa Segmentation.
80. Analysis of Threading Libraries for High Performance Computing.
81. Reduction to Band Form for the Singular Value Decomposition on Graphics Accelerators.
82. Extending ILUPACK with a Task-Parallel Version of BiCG for Dual-GPU Servers.
83. Fast Blocking of Householder Reflectors on Graphics Processors.
84. High-Performance GPU Implementation of PageRank with Reduced Precision Based on Mantissa Segmentation.
85. Extending ILUPACK with a GPU Version of the BiCGStab Method.
86. Residual Replacement in Mixed-Precision Iterative Refinement for Sparse Linear Systems.
87. Selecting optimal SpMV realizations for GPUs via machine learning.
88. Introduction to the Special Issue related to the Power-Aware Computing Workshop 2019 - PACO 2019.
89. High performance and energy efficient inference for deep learning on ARM processors.
90. Exploiting nested task-parallelism in the H-LU factorization.
91. Look-ahead in the two-sided reduction to compact band forms for symmetric eigenvalue problems and the SVD.
92. Fast block QR update in digital signal processing.
93. Noise estimation for hyperspectral subspace identification on FPGAs.
94. An efficient GPU version of the preconditioned GMRES method.
95. Accelerating the SRP-PHAT algorithm on multi- and many-core platforms using OpenCL.
96. A Case for Malleable Thread-Level Linear Algebra Libraries: The LU Factorization With Partial Pivoting.
97. FloatX: A C++ Library for Customized Floating-Point Arithmetic.
98. Variable-size batched Gauss-Jordan elimination for block-Jacobi preconditioning on graphics processors.
99. Dynamic look-ahead in the reduction to band form for the singular value decomposition.
100. Accelerating the task/data-parallel version of ILUPACK's BiCG in multi-CPU/GPU configurations.