16 results for "Tze Meng Low"
Search Results
2. quickLD: An efficient software for linkage disequilibrium analyses
- Author
-
Pavlos Pavlidis, Nikolaos Alachiotis, Tze Meng Low, Charalampos Theodoris, Computer Architecture Design and Test for Embedded Systems, and Digital Society Institute
- Subjects
0106 biological sciences ,0301 basic medicine ,Genetic Linkage ,UT-Hybrid-D ,Computer program ,Parallel computing ,Biology ,010603 evolutionary biology ,01 natural sciences ,Linkage Disequilibrium ,Computer Programs ,03 medical and health sciences ,Software ,Genetic drift ,High-performance software ,computer program ,Linkage disequilibrium ,Genetics ,Resource Article ,Graphics ,Ecology, Evolution, Behavior and Systematics ,Selection (genetic algorithm) ,Multi-core processor ,business.industry ,Aggregate (data warehouse) ,RESOURCE ARTICLES ,high‐performance software ,030104 developmental biology ,Pairwise comparison ,business ,Algorithms ,Biotechnology - Abstract
Software tools for linkage disequilibrium (LD) analyses are designed to calculate LD among all genetic variants in a single region. Since compute and memory requirements grow quadratically with the distance between variants, using these tools for long-range LD calculations leads to long execution times and increased allocation of memory resources. Furthermore, widely used tools do not fully utilize the computational resources of modern processors and/or graphics processing cards, limiting future large-scale analyses on thousands of samples. We present quickLD, a stand-alone and open-source software that computes several LD-related statistics, including the commonly used r2. quickLD calculates pairwise LD between genetic variants in a single region or in arbitrarily distant regions with negligible memory requirements. Moreover, quickLD achieves up to 95% and 97% of the theoretical peak performance of a CPU and a GPU, respectively, enabling 21.5× faster processing than current state-of-the-art software on a multicore processor and 49.5× faster processing when the aggregate processing power of a multicore CPU and a GPU is harnessed. quickLD can also be used in studies of selection, recombination, genetic drift, inbreeding and gene flow. The software is available at https://github.com/pephco/quickLD. Published in: Molecular Ecology Resources
- Published
- 2021
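The r2 statistic that quickLD computes reduces, for a pair of biallelic SNPs, to a short formula over haplotype frequencies: r2 = D^2 / (pA(1-pA)pB(1-pB)) with D = pAB - pA*pB. A minimal pure-Python sketch (not quickLD's optimized implementation; the function name and 0/1 haplotype encoding are ours):

```python
def r_squared(x, y):
    """Pairwise LD (r^2) between two biallelic SNPs, given 0/1 haplotype lists."""
    n = len(x)
    p_a = sum(x) / n                              # allele frequency at locus A
    p_b = sum(y) / n                              # allele frequency at locus B
    p_ab = sum(a * b for a, b in zip(x, y)) / n   # joint haplotype frequency
    d = p_ab - p_a * p_b                          # LD coefficient D
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
```

For perfectly linked loci (identical 0/1 columns) this returns 1.0; for independent loci it returns 0.0. The quadratic blow-up the abstract describes comes from evaluating this for every pair of variants.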
3. Editorial: Scalable Bioinformatics
- Author
-
Nikolaos Alachiotis, Tze Meng Low, Pavlos Pavlidis, Computer Architecture Design and Test for Embedded Systems, and Digital Society Institute
- Subjects
SARS–CoV–2 ,Genetics ,SMUFIN ,Molecular Medicine ,Basic local-alignment search tool ,Copy number variations ,QH426-470 ,Autism spectrum disorder ,BLAST ,Genetics (clinical) - Published
- 2022
4. SPIRAL: Extreme Performance Portability
- Author
-
Doru Thom Popovici, Daniele G. Spampinato, Tze Meng Low, James C. Hoe, Markus Püschel, Jose M. F. Moura, Jeremy Johnson, Richard Veras, and Franz Franchetti
- Subjects
Multi-core processor ,Correctness ,Computer science ,business.industry ,Image processing ,010103 numerical & computational mathematics ,02 engineering and technology ,Supercomputer ,01 natural sciences ,Software portability ,Software ,Computer engineering ,0202 electrical engineering, electronic engineering, information engineering ,020201 artificial intelligence & image processing ,0101 mathematics ,Electrical and Electronic Engineering ,Equivalence (formal languages) ,business ,Mobile device - Abstract
In this paper, we address the question of how to automatically map computational kernels to highly efficient code for a wide range of computing platforms and establish the correctness of the synthesized code. More specifically, we focus on two fundamental problems that software developers are faced with: performance portability across the ever-changing landscape of parallel platforms and correctness guarantees for sophisticated floating-point code. The problem is approached as follows: We develop a formal framework to capture computational algorithms, computing platforms, and program transformations of interest, using a unifying mathematical formalism we call operator language (OL). Then we cast the problem of synthesizing highly optimized computational kernels for a given machine as a strongly constrained optimization problem that is solved by search and a multistage rewriting system. Since all rewrite steps are semantics preserving, our approach establishes equivalence between the kernel specification and the synthesized program. This approach is implemented in the SPIRAL system, and we demonstrate it with a selection of computational kernels from the signal and image processing domain, software-defined radio, and robotic vehicle control. Our target platforms range from mobile devices, desktops, and server multicore processors to large-scale high-performance and supercomputing systems, and we demonstrate performance comparable to expertly hand-tuned code across kernels and platforms.
- Published
- 2018
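The multistage rewriting idea, where every rule is semantics-preserving so the synthesized program provably matches its specification, can be illustrated on a toy expression language. This sketch is ours and only stands in for, rather than reproduces, SPIRAL's operator language (OL):

```python
# Expressions are nested tuples; each rule maps an expression to an
# equivalent one, so any sequence of rewrites preserves meaning.

def simplify(expr):
    """Bottom-up rewriting to a fixed point."""
    if not isinstance(expr, tuple):
        return expr
    op, *args = expr
    args = [simplify(a) for a in args]
    expr = (op, *args)
    # Semantics-preserving rules: x+0 -> x, x*1 -> x, x*0 -> 0.
    if op == "add" and 0 in args:
        rest = [a for a in args if a != 0]
        return simplify(rest[0]) if len(rest) == 1 else ("add", *rest)
    if op == "mul" and 1 in args:
        rest = [a for a in args if a != 1]
        return simplify(rest[0]) if len(rest) == 1 else ("mul", *rest)
    if op == "mul" and 0 in args:
        return 0
    return expr
```

SPIRAL's rules operate on algorithm and hardware descriptions rather than arithmetic identities, but the correctness argument is the same: the output is reached from the input purely through equivalences.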
5. Delayed Asynchronous Iterative Graph Algorithms
- Author
-
Mark P. Blanco, Scott McMillan, and Tze Meng Low
- Subjects
Performance (cs.PF) ,FOS: Computer and information sciences ,Computer Science - Performance ,Computer Science - Distributed, Parallel, and Cluster Computing ,Distributed, Parallel, and Cluster Computing (cs.DC) - Abstract
Iterative graph algorithms often compute intermediate values and update them as computation progresses. Updated output values are used as inputs for computations in current or subsequent iterations; hence the number of iterations required for values to converge can potentially reduce if the newest values are asynchronously made available to other updates computed in the same iteration. In a multi-threaded shared memory system, the immediate propagation of updated values can cause memory contention that may offset the benefit of propagating updates sooner. In some cases, the benefit of a smaller number of iterations may be diminished by each iteration taking longer. Our key idea is to combine the low memory contention that synchronous approaches have with the faster information sharing of asynchronous approaches. Our hybrid approach buffers updates from threads locally before committing them to the global store to control how often threads may cause conflicts for others while still sharing data within one iteration and hence speeding convergence. On a 112-thread CPU system, our hybrid approach attains up to 4.5%-19.4% speedup over an asynchronous approach for Pagerank and up to 1.9%-17% speedup over asynchronous Bellman Ford SSSP. Further, our hybrid approach attains 2.56x better performance than the synchronous approach. Finally, we provide insights as to why delaying updates is not helpful on certain graphs where connectivity is clustered on the main diagonal of the adjacency matrix.
Comment: 6 pages, 6 figures, 2 tables, IEEE High Performance Extreme Computing (HPEC) Conference 2021
- Published
- 2021
- Full Text
- View/download PDF
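The hybrid idea, buffering updates locally before committing them to the global store, can be sketched in a single-threaded Bellman-Ford-style SSSP. The buffer_size parameter and batching policy below are illustrative choices of ours, not values from the paper:

```python
import math

def sssp_buffered(edges, n, src, buffer_size=4):
    """SSSP where relaxations are staged in a local buffer and committed to
    the global distance array in batches -- a sequential stand-in for one
    thread's behavior in the paper's hybrid scheme."""
    dist = [math.inf] * n
    dist[src] = 0.0
    changed = True
    while changed:
        changed = False
        buffer = {}
        for u, v, w in edges:
            # Read through the local buffer first, then the global store.
            base = min(dist[u], buffer.get(u, math.inf))
            if base + w < min(dist[v], buffer.get(v, math.inf)):
                buffer[v] = base + w
                if len(buffer) >= buffer_size:   # commit a full batch
                    for node, d in buffer.items():
                        if d < dist[node]:
                            dist[node] = d
                            changed = True
                    buffer.clear()
        for node, d in buffer.items():           # commit the tail
            if d < dist[node]:
                dist[node] = d
                changed = True
    return dist
```

With buffer_size=1 this degenerates to eager (asynchronous-style) propagation; a large buffer approaches the synchronous extreme, which is exactly the trade-off the abstract describes.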
6. qLD: High-performance Computation of Linkage Disequilibrium on CPU and GPU
- Author
-
Tze Meng Low, Pavlos Pavlidis, Nikolaos Alachiotis, Charalampos Theodoris, Computer Architecture Design and Test for Embedded Systems, and Digital Society Institute
- Subjects
0106 biological sciences ,0303 health sciences ,Linkage disequilibrium ,Exploit ,Computer science ,business.industry ,Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) ,Computation ,High performance computation ,22/2 OA procedure ,GPU ,Parallel computing ,010603 evolutionary biology ,01 natural sciences ,03 medical and health sciences ,Kernel (linear algebra) ,Software ,Linear algebra ,business ,030304 developmental biology - Abstract
Linkage disequilibrium (LD) is the non-random association between alleles at different loci. Assessing LD in thousands of genomes and/or millions of single-nucleotide polymorphisms (SNPs) exhibits excessive time and memory requirements that can potentially hinder future large-scale genomic analyses. To this end, we introduce qLD (quickLD) (https://github.com/StrayLamb2/qLD), a highly optimized open-source software that assesses LD based on Pearson's correlation coefficient. qLD exploits the fact that the computational kernel for calculating LD can be cast in terms of dense linear algebra operations. In addition, the software employs memory-aware techniques to lower memory requirements, and parallel GPU architectures to further shorten analysis times. qLD delivers up to 5x faster processing than the current state-of-the-art software implementation when run on the same CPU, and up to 29x when computation is offloaded to a GPU. Furthermore, the software is designed to quantify allele associations between arbitrarily distant loci in a time- and memory-efficient way, thereby facilitating the evaluation of long-range LD and the detection of co-evolved genes. We showcase qLD on the analysis of 22,554 complete SARS-CoV-2 genomes. Presented at: 2020 IEEE 20th International Conference on Bioinformatics and Bioengineering
- Published
- 2020
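qLD's key observation, that the LD kernel can be cast as dense linear algebra, is visible in a few lines of NumPy: joint frequencies for all SNP pairs fall out of a single matrix product. A hedged sketch (function name and 0/1 haplotype encoding are ours; the real software adds memory-aware blocking and GPU offload):

```python
import numpy as np

def pairwise_r2(H):
    """All-pairs LD (r^2) via one dense matrix product.
    H is SNPs x haplotypes with 0/1 entries."""
    m, n = H.shape
    p = H.mean(axis=1)                  # per-SNP allele frequencies
    pab = (H @ H.T) / n                 # joint frequencies for every SNP pair
    D = pab - np.outer(p, p)            # LD coefficients for every pair
    denom = np.outer(p * (1 - p), p * (1 - p))
    return D * D / denom
```

Because the whole computation is one GEMM-like product plus elementwise work, it inherits the performance engineering of optimized BLAS kernels, which is the point the abstract makes.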
7. 3D Coded SUMMA: Communication-Efficient and Robust Parallel Matrix Multiplication
- Author
-
Viveck R. Cadambe, Vipul Gupta, Tze Meng Low, Yaoqing Yang, Pulkit Grover, Kannan Ramchandran, Christian Engelmann, and Haewon Jeong
- Subjects
Computer science ,Computation ,0202 electrical engineering, electronic engineering, information engineering ,Redundancy (engineering) ,020206 networking & telecommunications ,010103 numerical & computational mathematics ,02 engineering and technology ,Parallel computing ,0101 mathematics ,Error detection and correction ,01 natural sciences ,Matrix multiplication - Abstract
In this paper, we propose a novel fault-tolerant parallel matrix multiplication algorithm called 3D Coded SUMMA that achieves higher failure-tolerance than replication-based schemes for the same amount of redundancy. This work bridges the gap between recent developments in coded computing and fault-tolerance in high-performance computing (HPC). The core idea of coded computing is the same as algorithm-based fault-tolerance (ABFT), which is weaving redundancy into the computation using error-correcting codes. In particular, we show that MatDot codes, an innovative code construction for parallel matrix multiplications, can be integrated into three-dimensional SUMMA (Scalable Universal Matrix Multiplication Algorithm [30]) in a communication-avoiding manner. To tolerate any two node failures, the proposed 3D Coded SUMMA requires ~50% less redundancy than replication, while the overhead in execution time is only about 5-10%.
- Published
- 2020
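The MatDot construction at the heart of 3D Coded SUMMA can be demonstrated at small scale: four workers each multiply one encoded block pair, and the true product is recovered from any three of them by polynomial interpolation. The two-block split, evaluation points, and survivor set below are our illustrative choices:

```python
import numpy as np

def matdot_demo(A, B, eval_points=(1.0, 2.0, 3.0, 4.0), survivors=(0, 2, 3)):
    """MatDot-coded matrix multiplication with 2 blocks: A@B is the x^1
    coefficient of pA(x)@pB(x), recoverable from any 3 worker outputs."""
    k = A.shape[1] // 2
    A1, A2 = A[:, :k], A[:, k:]          # column blocks of A
    B1, B2 = B[:k, :], B[k:, :]          # row blocks of B
    # Worker i computes pA(xi) @ pB(xi) with pA(x)=A1+x*A2, pB(x)=B2+x*B1;
    # the product is A1@B2 + (A1@B1 + A2@B2)*x + A2@B1*x^2, and the
    # x^1 coefficient equals A@B.
    results = {i: (A1 + x * A2) @ (B2 + x * B1)
               for i, x in enumerate(eval_points)}
    xs = [eval_points[i] for i in survivors]
    V = np.array([[1.0, x, x * x] for x in xs])  # Vandermonde in 1, x, x^2
    W = np.linalg.inv(V)                          # interpolation weights
    # Row 1 of W extracts the x^1 coefficient from the surviving results.
    return sum(w * results[i] for w, i in zip(W[1], survivors))
```

With four workers and a degree-2 product polynomial, any single node failure is tolerated, whereas plain replication would need a full duplicate of each multiplication.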
8. Exploration of Fine-Grained Parallelism for Load Balancing Eager K-truss on GPU and CPU
- Author
-
Kyungjoo Kim, Mark Blanco, and Tze Meng Low
- Subjects
FOS: Computer and information sciences ,Computer Science - Performance ,Computer science ,Computation ,Truss ,020207 software engineering ,02 engineering and technology ,Parallel computing ,Load balancing (computing) ,Performance (cs.PF) ,Computer Science - Distributed, Parallel, and Cluster Computing ,020204 information systems ,Linear algebra ,0202 electrical engineering, electronic engineering, information engineering ,Central processing unit ,Graph algorithms ,Distributed, Parallel, and Cluster Computing (cs.DC) ,Implementation - Abstract
In this work we present a performance exploration on Eager K-truss, a linear-algebraic formulation of the K-truss graph algorithm. We address performance issues related to load imbalance of parallel tasks in symmetric, triangular graphs by presenting a fine-grained parallel approach to executing the support computation. This approach also increases available parallelism, making it amenable to GPU execution. We demonstrate our fine-grained parallel approach using implementations in Kokkos and evaluate them on an Intel Skylake CPU and an Nvidia Tesla V100 GPU. Overall, we observe between a 1.26-1.48x improvement on the CPU and a 9.97-16.92x improvement on the GPU due to our fine-grained parallel formulation.
Comment: 2019 IEEE High Performance Extreme Computing Conference (HPEC)
- Published
- 2020
- Full Text
- View/download PDF
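The support computation that the paper parallelizes has a compact linear-algebraic form: for adjacency matrix A, the support of edge (i, j), i.e. the number of triangles it participates in, is the (i, j) entry of A @ A masked elementwise by A. A NumPy stand-in (the paper's implementations use Kokkos, not NumPy):

```python
import numpy as np

def edge_support(A):
    """Support of each edge via linear algebra: entry (i, j) of (A @ A)
    counts the common neighbors of i and j; masking by A keeps only
    entries that correspond to actual edges."""
    return (A @ A) * A

# Complete graph on 4 vertices: every edge lies in exactly 2 triangles.
A_k4 = np.array([[0, 1, 1, 1],
                 [1, 0, 1, 1],
                 [1, 1, 0, 1],
                 [1, 1, 1, 0]])
```

The load-imbalance issue the abstract targets arises because rows of A (vertex neighborhoods) vary wildly in length, so distributing this product at row granularity leaves some tasks far heavier than others.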
9. Analytical cache modeling and tilesize optimization for tensor contractions
- Author
-
Fabrice Rastello, Rui Li, Tze Meng Low, Aravind Sukumaran-Rajam, P. Sadayappan, Richard Veras, Atanas Rountev, Ohio State University [Columbus] (OSU), Louisiana State University (LSU), Carnegie Mellon University [Pittsburgh] (CMU), Compiler Optimization and Run-time Systems (CORSE), Inria Grenoble - Rhône-Alpes, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Laboratoire d'Informatique de Grenoble (LIG ), and Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019])-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019])
- Subjects
Memory hierarchy ,Computer science ,Locality ,tensor contraction ,performance modeling ,[INFO.INFO-SE]Computer Science [cs]/Software Engineering [cs.SE] ,010103 numerical & computational mathematics ,02 engineering and technology ,model-driven design-space exploration ,computer.software_genre ,01 natural sciences ,[INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL] ,Bottleneck ,Permutation ,Tensor (intrinsic definition) ,0202 electrical engineering, electronic engineering, information engineering ,020201 artificial intelligence & image processing ,Compiler ,Cache ,[INFO.INFO-DC]Computer Science [cs]/Distributed, Parallel, and Cluster Computing [cs.DC] ,0101 mathematics ,domain-specific compiler optimization ,computer ,Algorithm - Abstract
Data movement between processor and memory hierarchy is a fundamental bottleneck that limits the performance of many applications on modern computer architectures. Tiling and loop permutation are key techniques for improving data locality. However, selecting effective tile-sizes and loop permutations is particularly challenging for tensor contractions due to the large number of loops. Even state-of-the-art compilers usually produce sub-optimal tile-sizes and loop permutations, as they rely on naive cost models. In this paper we provide an analytical model based approach to multi-level tile-size optimization and permutation selection for tensor contractions. Our experimental results show that this approach achieves comparable or better performance than state-of-the-art frameworks and libraries for tensor contractions.
- Published
- 2019
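The contrast the paper draws, analytical models versus empirical autotuning, can be sketched with a deliberately simplified one-level model: pick the largest tile whose working set fits in cache, by evaluating a closed-form footprint rather than timing candidate configurations. The three-tile footprint formula and parameter names below are our simplification, not the paper's multi-level model:

```python
def pick_tile_sizes(cache_bytes, elem_bytes=8, max_tile=512):
    """Choose square tile sizes (Ti, Tj, Tk) for a tiled matmul-like
    contraction so that one tile each of A, B and C fits in the cache,
    maximizing the tile volume (work per byte moved)."""
    best, best_vol = (1, 1, 1), 0
    for t in range(1, max_tile + 1):
        footprint = 3 * t * t * elem_bytes      # A, B, C tiles, each t x t
        if footprint <= cache_bytes and t * t * t > best_vol:
            best, best_vol = (t, t, t), t * t * t
    return best
```

The paper's model additionally handles multiple cache levels, non-square tiles, loop permutations, and the many-loop structure of general tensor contractions, where exhaustive empirical search becomes impractical.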
10. Linear algebraic depth-first search
- Author
-
Daniele G. Spampinato, Tze Meng Low, and Upasana Sridhar
- Subjects
Computer Science::Performance ,Discrete mathematics ,Tree traversal ,Computer science ,Breadth-first search ,Linear algebra ,Topological sorting ,Graph (abstract data type) ,Adjacency matrix ,Algebraic number ,Computer Science::Data Structures and Algorithms ,Distributed File System ,MathematicsofComputing_DISCRETEMATHEMATICS - Abstract
There is a recent push by a segment of the graph community to implement graph algorithms in the language of linear algebra. However, graph algorithms that depend on depth-first search (DFS) techniques are often highlighted as limitations of the linear algebraic approach as linear algebraic formulation of DFS algorithms are few, if any. This paper provides a linear algebraic approach for developing DFS graph algorithms and demonstrates its use for defining three classical DFS-based computations: Binary tree traversal, topological sort, and biconnected components.
- Published
- 2019
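As rough intuition for driving DFS from an adjacency matrix, the next child at each step can be read off the current vertex's row with visited vertices masked out. This pure-Python sketch is only a stand-in; the paper's actual linear-algebraic formulation is richer than row scanning:

```python
def dfs_order(A, root=0):
    """DFS preorder where child selection is a masked lookup in the
    current vertex's adjacency-matrix row (A is a 0/1 matrix as nested
    lists)."""
    n = len(A)
    visited = [False] * n
    order, stack = [], [root]
    while stack:
        v = stack[-1]
        if not visited[v]:
            visited[v] = True
            order.append(v)
        # Mask out visited vertices, then take the smallest-index neighbor.
        nxt = [u for u in range(n) if A[v][u] and not visited[u]]
        if nxt:
            stack.append(nxt[0])
        else:
            stack.pop()
    return order
```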
11. Delta-stepping SSSP: from Vertices and Edges to GraphBLAS Implementations
- Author
-
Upasana Sridhar, Rahul Mayuranath, Tze Meng Low, Daniele G. Spampinato, Scott McMillan, and Mark Blanco
- Subjects
FOS: Computer and information sciences ,020203 distributed computing ,Theoretical computer science ,Computer science ,Duality (optimization) ,020207 software engineering ,02 engineering and technology ,Vertex (geometry) ,Linear algebra ,Computer Science - Data Structures and Algorithms ,0202 electrical engineering, electronic engineering, information engineering ,Data Structures and Algorithms (cs.DS) ,Algebraic number ,Implementation ,Dijkstra's algorithm ,Sparse matrix ,MathematicsofComputing_DISCRETEMATHEMATICS - Abstract
GraphBLAS is an interface for implementing graph algorithms. Algorithms implemented using the GraphBLAS interface are cast in terms of linear algebra-like operations. However, many graph algorithms are canonically described in terms of operations on vertices and/or edges. Despite the known duality between these two representations, the differences in the way algorithms are described using the two approaches can pose considerable difficulties in the adoption of the GraphBLAS as a standard interface for development. This paper investigates a systematic approach for translating a graph algorithm described in the canonical vertex and edge representation into an implementation that leverages the GraphBLAS interface. We present a two-step approach to this problem. First, we express common vertex- and edge-centric design patterns using a linear algebraic language. Second, we map this intermediate representation to the GraphBLAS interface. We illustrate our approach by translating the delta-stepping single source shortest path algorithm from its canonical description to a GraphBLAS implementation, and highlight lessons learned when implementing using GraphBLAS.
Comment: 10 pages, 4 figures, IPDPSW GRAPL 2019 Workshop
- Published
- 2019
- Full Text
- View/download PDF
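The translation target the paper describes can be previewed with the classic semiring view of SSSP: one relaxation step is a min-plus vector-matrix product, iterated to a fixed point (delta-stepping adds bucketing on top of this primitive). A plain-Python sketch, not GraphBLAS API code:

```python
INF = float("inf")

def minplus_relax(d, W):
    """One relaxation as a min-plus vector-matrix product:
    new d[j] = min(d[j], min_i d[i] + W[i][j]).
    W[i][j] is the edge weight i->j, INF where there is no edge."""
    n = len(d)
    return [min(d[j], min(d[i] + W[i][j] for i in range(n)))
            for j in range(n)]

def sssp(W, src=0):
    """Iterate min-plus relaxations to a fixed point (Bellman-Ford in
    semiring clothing)."""
    d = [INF] * len(W)
    d[src] = 0.0
    while True:
        nd = minplus_relax(d, W)
        if nd == d:
            return d
        d = nd
```

Mapping vertex/edge-centric pseudocode onto operations like this one is exactly the translation problem the paper systematizes.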
12. The BLIS Framework
- Author
-
Michael Kistler, Mikhail Smelyanskiy, Tze Meng Low, Vernon Austel, Robert A. van de Geijn, Lee Killough, Tyler M. Smith, Field G. Van Zee, John A. Gunnels, Francisco D. Igual, Bryan Marker, and Xianyi Zhang
- Subjects
020203 distributed computing ,Vendor ,Computer science ,Programming language ,Applied Mathematics ,010103 numerical & computational mathematics ,02 engineering and technology ,computer.software_genre ,01 natural sciences ,Software framework ,Software portability ,Computer architecture ,Linear algebra ,0202 electrical engineering, electronic engineering, information engineering ,Leverage (statistics) ,0101 mathematics ,IBM ,Implementation ,computer ,Software - Abstract
BLIS is a new software framework for instantiating high-performance BLAS-like dense linear algebra libraries. We demonstrate how BLIS acts as a productivity multiplier by using it to implement the level-3 BLAS on a variety of current architectures. The systems for which we demonstrate the framework include state-of-the-art general-purpose, low-power, and many-core architectures. We show, with very little effort, how the BLIS framework yields sequential and parallel implementations that are competitive with the performance of ATLAS, OpenBLAS (an effort to maintain and extend the GotoBLAS), and commercial vendor implementations such as AMD’s ACML, IBM’s ESSL, and Intel’s MKL libraries. Although most of this article focuses on single-core implementation, we also provide compelling results that suggest the framework’s leverage extends to the multithreaded domain.
- Published
- 2016
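The layering BLIS exposes is, at its core, a fixed structure of cache-blocking loops around a small kernel. A pure-Python sketch of those outer loops (real BLIS adds data packing and a register-level micro-kernel below them; the default block sizes here are placeholders for the per-architecture tuning parameters):

```python
def blocked_gemm(A, B, C, mc=64, kc=64, nc=64):
    """C += A @ B with the cache-blocking loop structure BLIS makes
    explicit. A, B, C are nested lists; mc/kc/nc are block sizes."""
    m, k = len(A), len(B)
    n = len(B[0])
    for jc in range(0, n, nc):            # column panels of C (L3-resident)
        for pc in range(0, k, kc):        # k-dimension blocks (packed in BLIS)
            for ic in range(0, m, mc):    # row panels of C (L2-resident)
                for i in range(ic, min(ic + mc, m)):
                    for p in range(pc, min(pc + kc, k)):
                        a = A[i][p]
                        for j in range(jc, min(jc + nc, n)):
                            C[i][j] += a * B[p][j]
    return C
```

Because this structure is fixed, instantiating the framework on a new architecture mostly means supplying a micro-kernel and block sizes, which is the "productivity multiplier" claim in the abstract.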
13. A Unified Coded Deep Neural Network Training Strategy based on Generalized PolyDot codes
- Author
-
Tze Meng Low, Ziqian Bai, Pulkit Grover, Sanghamitra Dutra, and Haewon Jeong
- Subjects
Nonlinear system ,Artificial neural network ,Computer science ,Computation ,0202 electrical engineering, electronic engineering, information engineering ,020206 networking & telecommunications ,02 engineering and technology ,010501 environmental sciences ,01 natural sciences ,Algorithm ,Decoding methods ,Matrix multiplication ,0105 earth and related environmental sciences - Abstract
This paper has two main contributions. First, we propose a novel coding technique - Generalized PolyDot - for matrix-vector products that advances on existing techniques for coded matrix operations under storage and communication constraints. Next, we use Generalized PolyDot for the problem of training large Deep Neural Networks (DNNs) using unreliable nodes that are prone to soft-errors, e.g., bit flips during computation that produce erroneous outputs. An additional difficulty imposed by the problem of DNN training is that the parameter values (weight matrices) are updated at every iteration, and thus require a prohibitively large encoding cost at every iteration if we naively extend existing coded computing techniques. Thus, we propose a “unified” coded DNN training strategy where we weave coding into the operations of DNN training itself, so that the weight matrices, once initially encoded, remain encoded during updates with negligible encoding/decoding overhead per iteration. Moreover, our strategy can also allow for errors even in the nonlinear step of training. Finally, our coded DNN training strategy is completely decentralized: no assumptions on the presence of a master node are made, which avoids any single point of failure under soft-errors. Our strategy can provide unboundedly better error tolerance than the competing replication strategy and an MDS-code-based strategy [1].
- Published
- 2018
14. Analytical Modeling is Enough for High Performance BLIS
- Author
-
Tze Meng Low, Francisco D. Igual, Tyler M. Smith, and Enrique S. Quintana-Ortí
- Subjects
Matrix multiplication ,Theoretical computer science ,Computer science ,Libraries ,010103 numerical & computational mathematics ,02 engineering and technology ,computer.software_genre ,01 natural sciences ,Mathematics of computing ,Mathematical software performance ,Analytical modeling ,Software ,Development (topology) ,High performance ,0202 electrical engineering, electronic engineering, information engineering ,Linear algebra ,0101 mathematics ,020203 distributed computing ,business.industry ,Programming language ,Applied Mathematics ,Multiplication ,business ,computer - Abstract
We show how the BLAS-like Library Instantiation Software (BLIS) framework, which provides a more detailed layering of the GotoBLAS (now maintained as OpenBLAS) implementation, allows one to analytically determine tuning parameters for high-end instantiations of the matrix-matrix multiplication. This is of both practical and scientific importance, as it greatly reduces the development effort required for the implementation of the level-3 BLAS while also advancing our understanding of how hierarchically layered memories interact with high-performance software. This allows the community to move on from valuable engineering solutions (empirical autotuning) to scientific understanding (analytical insight).
This research was sponsored in part by NSF grants ACI-1148125/1340293 and CCF-0917167. Enrique S. Quintana-Ortí was supported by project TIN2011-23283 of the Ministerio de Ciencia e Innovación and FEDER. Francisco D. Igual was supported by project TIN2012-32180 of the Ministerio de Ciencia e Innovación.
- Published
- 2016
15. Scalable parallelization of FLAME code via the workqueuing model
- Author
-
Robert A. van de Geijn, Tze Meng Low, Paolo Bientinesi, and Field G. Van Zee
- Subjects
Automatic parallelization ,Shared memory ,Computer science ,Applied Mathematics ,Linear algebra ,Memory architecture ,Scalability ,Parallel algorithm ,Multiprocessing ,Distributed memory ,Parallel computing ,Software - Abstract
We discuss the OpenMP parallelization of linear algebra algorithms that are coded using the Formal Linear Algebra Methods Environment (FLAME) API. This API expresses algorithms at a higher level of abstraction, avoids the use of loop and array indices, and represents these algorithms as they are formally derived and presented. We report on two implementations of the workqueuing model, neither of which requires the use of explicit indices to specify parallelism. The first implementation uses the experimental taskq pragma, which may influence the adoption of a similar construct into OpenMP 3.0. The second workqueuing implementation is domain-specific to FLAME but allows us to illustrate the benefits of sorting tasks according to their computational cost prior to parallel execution. In addition, we discuss how scalable parallelization of dense linear algebra algorithms via OpenMP will require a two-dimensional partitioning of operands much like a 2D data distribution is needed on distributed memory architectures. We illustrate the issues and solutions by discussing the parallelization of the symmetric rank-k update and report impressive performance on an SGI system with 14 Itanium2 processors.
- Published
- 2008
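The benefit of sorting tasks by computational cost before parallel execution can be illustrated with greedy longest-processing-time scheduling: handing the biggest tasks out first keeps workers evenly loaded. This is our sketch of the general idea, not the FLAME runtime's queueing mechanism:

```python
def schedule(tasks, n_workers):
    """Sort tasks by decreasing cost, then greedily assign each to the
    least-loaded worker (classic LPT scheduling). `tasks` is a list of
    task costs; returns the final load on each worker."""
    loads = [0.0] * n_workers
    for cost in sorted(tasks, reverse=True):
        w = loads.index(min(loads))     # least-loaded worker
        loads[w] += cost
    return loads
```

Processing tasks in arbitrary order can leave one worker stuck with a late large task; sorting first bounds the makespan much more tightly, which is why the paper sorts before dispatching.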
16. Exploiting Symmetry in Tensors for High Performance: Multiplication with Symmetric Tensors
- Author
-
Martin D. Schatz, Tze Meng Low, Tamara G. Kolda, and Robert A. van de Geijn
- Subjects
FOS: Computer and information sciences ,Multilinear algebra ,Applied Mathematics ,Computation ,Numerical Analysis (math.NA) ,010103 numerical & computational mathematics ,15-02 (Primary) ,Data structure ,Topology ,01 natural sciences ,010305 fluids & plasmas ,Computational Mathematics ,Matrix (mathematics) ,0103 physical sciences ,FOS: Mathematics ,Symmetric tensor ,Multiplication ,Computer Science - Mathematical Software ,Tensor ,Mathematics - Numerical Analysis ,0101 mathematics ,Symmetry (geometry) ,Mathematical Software (cs.MS) ,Mathematics - Abstract
Symmetric tensor operations arise in a wide variety of computations. However, the benefits of exploiting symmetry in order to reduce storage and computation is in conflict with a desire to simplify memory access patterns. In this paper, we propose a blocked data structure (Blocked Compact Symmetric Storage) wherein we consider the tensor by blocks and store only the unique blocks of a symmetric tensor. We propose an algorithm-by-blocks, already shown of benefit for matrix computations, that exploits this storage format by utilizing a series of temporary tensors to avoid redundant computation. Further, partial symmetry within temporaries is exploited to further avoid redundant storage and redundant computation. A detailed analysis shows that, relative to storing and computing with tensors without taking advantage of symmetry and partial symmetry, storage requirements are reduced by a factor of $ O\left( m! \right)$ and computational requirements by a factor of $O\left( (m+1)!/2^m \right)$, where $ m $ is the order of the tensor. However, as the analysis shows, care must be taken in choosing the correct block size to ensure these storage and computational benefits are achieved (particularly for low-order tensors). An implementation demonstrates that storage is greatly reduced and the complexity introduced by storing and computing with tensors by blocks is manageable. Preliminary results demonstrate that computational time is also reduced. The paper concludes with a discussion of how insights in this paper point to opportunities for generalizing recent advances in the domain of linear algebra libraries to the field of multi-linear computation.
- Published
- 2013
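The storage idea, keeping only the unique blocks of a symmetric object, is easiest to see in the order-2 case (a symmetric matrix). A NumPy sketch of the indexing (function names are ours; the paper's Blocked Compact Symmetric Storage generalizes this to order-m tensors, where the O(m!) savings come from):

```python
import numpy as np

def pack_symmetric(M, b):
    """Store only the unique b x b blocks of a symmetric matrix:
    block (i, j) with i <= j is kept; (j, i) is implied by symmetry."""
    n = M.shape[0] // b
    return {(i, j): M[i*b:(i+1)*b, j*b:(j+1)*b].copy()
            for i in range(n) for j in range(i, n)}

def get_block(packed, i, j):
    """Read any block; lower-triangle blocks come from the transpose."""
    return packed[(i, j)] if i <= j else packed[(j, i)].T
```

Computing by blocks against this layout avoids both the redundant storage and, with care, the redundant arithmetic that a dense representation would incur, at the price of the block-size choice the paper analyzes.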