16 results for "Tze Meng Low"
Search Results
2. quickLD: An efficient software for linkage disequilibrium analyses
- Author
-
Pavlos Pavlidis, Nikolaos Alachiotis, Tze Meng Low, Charalampos Theodoris, Computer Architecture Design and Test for Embedded Systems, and Digital Society Institute
- Subjects
0106 biological sciences ,0301 basic medicine ,Genetic Linkage ,UT-Hybrid-D ,Computer program ,Parallel computing ,Biology ,010603 evolutionary biology ,01 natural sciences ,Linkage Disequilibrium ,Computer Programs ,03 medical and health sciences ,Software ,Genetic drift ,High-performance software ,computer program ,Linkage disequilibrium ,Genetics ,Resource Article ,Graphics ,Ecology, Evolution, Behavior and Systematics ,Selection (genetic algorithm) ,Multi-core processor ,business.industry ,Aggregate (data warehouse) ,RESOURCE ARTICLES ,high‐performance software ,030104 developmental biology ,Pairwise comparison ,business ,Algorithms ,Biotechnology - Abstract
Software tools for linkage disequilibrium (LD) analyses are designed to calculate LD among all genetic variants in a single region. Since compute and memory requirements grow quadratically with the distance between variants, using these tools for long-range LD calculations leads to long execution times and increased allocation of memory resources. Furthermore, widely used tools do not fully utilize the computational resources of modern processors and/or graphics processing cards, limiting future large-scale analyses on thousands of samples. We present quickLD, a stand-alone and open-source software that computes several LD-related statistics, including the commonly used r2. quickLD calculates pairwise LD between genetic variants in a single region or in arbitrarily distant regions with negligible memory requirements. Moreover, quickLD achieves up to 95% and 97% of the theoretical peak performance of a CPU and a GPU, respectively, enabling 21.5× faster processing than current state-of-the-art software on a multicore processor and 49.5× faster processing when the aggregate processing power of a multicore CPU and a GPU is harnessed. quickLD can also be used in studies of selection, recombination, genetic drift, inbreeding and gene flow. The software is available at https://github.com/pephco/quickLD. Published in: Molecular Ecology Resources
- Published
- 2021
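The r2 statistic that quickLD computes reduces, for a pair of biallelic SNPs, to a short formula over haplotype frequencies: r2 = D^2 / (pA(1-pA)pB(1-pB)) with D = pAB - pA*pB. A minimal pure-Python sketch (not quickLD's optimized implementation; the function name and 0/1 haplotype encoding are ours):

```python
def r_squared(x, y):
    """Pairwise LD (r^2) between two biallelic SNPs, given 0/1 haplotype lists."""
    n = len(x)
    p_a = sum(x) / n                              # allele frequency at locus A
    p_b = sum(y) / n                              # allele frequency at locus B
    p_ab = sum(a * b for a, b in zip(x, y)) / n   # joint haplotype frequency
    d = p_ab - p_a * p_b                          # LD coefficient D
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
```

For perfectly linked loci (identical 0/1 columns) this returns 1.0; for independent loci it returns 0.0. The quadratic blow-up the abstract describes comes from evaluating this for every pair of variants.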
3. Editorial: Scalable Bioinformatics
- Author
-
Nikolaos Alachiotis, Tze Meng Low, Pavlos Pavlidis, Computer Architecture Design and Test for Embedded Systems, and Digital Society Institute
- Subjects
SARS–CoV–2 ,Genetics ,SMUFIN ,Molecular Medicine ,Basic local-alignment search tool ,Copy number variations ,QH426-470 ,Autism spectrum disorder ,BLAST ,Genetics (clinical) - Published
- 2022
4. SPIRAL: Extreme Performance Portability
- Author
-
Doru Thom Popovici, Daniele G. Spampinato, Tze Meng Low, James C. Hoe, Markus Püschel, Jose M. F. Moura, Jeremy Johnson, Richard Veras, and Franz Franchetti
- Subjects
Multi-core processor ,Correctness ,Computer science ,business.industry ,Image processing ,010103 numerical & computational mathematics ,02 engineering and technology ,Supercomputer ,01 natural sciences ,Software portability ,Software ,Computer engineering ,0202 electrical engineering, electronic engineering, information engineering ,020201 artificial intelligence & image processing ,0101 mathematics ,Electrical and Electronic Engineering ,Equivalence (formal languages) ,business ,Mobile device - Abstract
In this paper, we address the question of how to automatically map computational kernels to highly efficient code for a wide range of computing platforms and establish the correctness of the synthesized code. More specifically, we focus on two fundamental problems that software developers are faced with: performance portability across the ever-changing landscape of parallel platforms and correctness guarantees for sophisticated floating-point code. The problem is approached as follows: We develop a formal framework to capture computational algorithms, computing platforms, and program transformations of interest, using a unifying mathematical formalism we call operator language (OL). Then we cast the problem of synthesizing highly optimized computational kernels for a given machine as a strongly constrained optimization problem that is solved by search and a multistage rewriting system. Since all rewrite steps are semantics preserving, our approach establishes equivalence between the kernel specification and the synthesized program. This approach is implemented in the SPIRAL system, and we demonstrate it with a selection of computational kernels from the signal and image processing domain, software-defined radio, and robotic vehicle control. Our target platforms range from mobile devices, desktops, and server multicore processors to large-scale high-performance and supercomputing systems, and we demonstrate performance comparable to expertly hand-tuned code across kernels and platforms.
- Published
- 2018
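The multistage rewriting idea, where every rule is semantics-preserving so the synthesized program provably matches its specification, can be illustrated on a toy expression language. This sketch is ours and only stands in for, rather than reproduces, SPIRAL's operator language (OL):

```python
# Expressions are nested tuples; each rule maps an expression to an
# equivalent one, so any sequence of rewrites preserves meaning.

def simplify(expr):
    """Bottom-up rewriting to a fixed point."""
    if not isinstance(expr, tuple):
        return expr
    op, *args = expr
    args = [simplify(a) for a in args]
    expr = (op, *args)
    # Semantics-preserving rules: x+0 -> x, x*1 -> x, x*0 -> 0.
    if op == "add" and 0 in args:
        rest = [a for a in args if a != 0]
        return simplify(rest[0]) if len(rest) == 1 else ("add", *rest)
    if op == "mul" and 1 in args:
        rest = [a for a in args if a != 1]
        return simplify(rest[0]) if len(rest) == 1 else ("mul", *rest)
    if op == "mul" and 0 in args:
        return 0
    return expr
```

SPIRAL's rules operate on algorithm and hardware descriptions rather than arithmetic identities, but the correctness argument is the same: the output is reached from the input purely through equivalences.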
5. Delayed Asynchronous Iterative Graph Algorithms
- Author
-
Mark P. Blanco, Scott McMillan, and Tze Meng Low
- Subjects
Performance (cs.PF) ,FOS: Computer and information sciences ,Computer Science - Performance ,Computer Science - Distributed, Parallel, and Cluster Computing ,Distributed, Parallel, and Cluster Computing (cs.DC) - Abstract
Iterative graph algorithms often compute intermediate values and update them as computation progresses. Updated output values are used as inputs for computations in current or subsequent iterations; hence the number of iterations required for values to converge can potentially reduce if the newest values are asynchronously made available to other updates computed in the same iteration. In a multi-threaded shared memory system, the immediate propagation of updated values can cause memory contention that may offset the benefit of propagating updates sooner. In some cases, the benefit of a smaller number of iterations may be diminished by each iteration taking longer. Our key idea is to combine the low memory contention that synchronous approaches have with the faster information sharing of asynchronous approaches. Our hybrid approach buffers updates from threads locally before committing them to the global store to control how often threads may cause conflicts for others while still sharing data within one iteration and hence speeding convergence. On a 112-thread CPU system, our hybrid approach attains up to 4.5%-19.4% speedup over an asynchronous approach for Pagerank and up to 1.9%-17% speedup over asynchronous Bellman Ford SSSP. Further, our hybrid approach attains 2.56x better performance than the synchronous approach. Finally, we provide insights as to why delaying updates is not helpful on certain graphs where connectivity is clustered on the main diagonal of the adjacency matrix.
Comment: 6 pages, 6 figures, 2 tables, IEEE High Performance Extreme Computing (HPEC) Conference 2021
- Published
- 2021
- Full Text
- View/download PDF
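The hybrid idea, buffering updates locally before committing them to the global store, can be sketched in a single-threaded Bellman-Ford-style SSSP. The buffer_size parameter and batching policy below are illustrative choices of ours, not values from the paper:

```python
import math

def sssp_buffered(edges, n, src, buffer_size=4):
    """SSSP where relaxations are staged in a local buffer and committed to
    the global distance array in batches -- a sequential stand-in for one
    thread's behavior in the paper's hybrid scheme."""
    dist = [math.inf] * n
    dist[src] = 0.0
    changed = True
    while changed:
        changed = False
        buffer = {}
        for u, v, w in edges:
            # Read through the local buffer first, then the global store.
            base = min(dist[u], buffer.get(u, math.inf))
            if base + w < min(dist[v], buffer.get(v, math.inf)):
                buffer[v] = base + w
                if len(buffer) >= buffer_size:   # commit a full batch
                    for node, d in buffer.items():
                        if d < dist[node]:
                            dist[node] = d
                            changed = True
                    buffer.clear()
        for node, d in buffer.items():           # commit the tail
            if d < dist[node]:
                dist[node] = d
                changed = True
    return dist
```

With buffer_size=1 this degenerates to eager (asynchronous-style) propagation; a large buffer approaches the synchronous extreme, which is exactly the trade-off the abstract describes.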
6. qLD: High-performance Computation of Linkage Disequilibrium on CPU and GPU
- Author
-
Tze Meng Low, Pavlos Pavlidis, Nikolaos Alachiotis, Charalampos Theodoris, Computer Architecture Design and Test for Embedded Systems, and Digital Society Institute
- Subjects
0106 biological sciences ,0303 health sciences ,Linkage disequilibrium ,Exploit ,Computer science ,business.industry ,Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) ,Computation ,High performance computation ,22/2 OA procedure ,GPU ,Parallel computing ,010603 evolutionary biology ,01 natural sciences ,03 medical and health sciences ,Kernel (linear algebra) ,Software ,Linear algebra ,business ,030304 developmental biology - Abstract
Linkage disequilibrium (LD) is the non-random association between alleles at different loci. Assessing LD in thousands of genomes and/or millions of single-nucleotide polymorphisms (SNPs) exhibits excessive time and memory requirements that can potentially hinder future large-scale genomic analyses. To this end, we introduce qLD (quickLD) (https://github.com/StrayLamb2/qLD), a highly optimized open-source software that assesses LD based on Pearson's correlation coefficient. qLD exploits the fact that the computational kernel for calculating LD can be cast in terms of dense linear algebra operations. In addition, the software employs memory-aware techniques to lower memory requirements, and parallel GPU architectures to further shorten analysis times. qLD delivers up to 5x faster processing than the current state-of-the-art software implementation when run on the same CPU, and up to 29x when computation is offloaded to a GPU. Furthermore, the software is designed to quantify allele associations between arbitrarily distant loci in a time- and memory-efficient way, thereby facilitating the evaluation of long-range LD and the detection of co-evolved genes. We showcase qLD on the analysis of 22,554 complete SARS-CoV-2 genomes. Presented at: 2020 IEEE 20th International Conference on Bioinformatics and Bioengineering
- Published
- 2020
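qLD's key observation, that the LD kernel can be cast as dense linear algebra, is visible in a few lines of NumPy: joint frequencies for all SNP pairs fall out of a single matrix product. A hedged sketch (function name and 0/1 haplotype encoding are ours; the real software adds memory-aware blocking and GPU offload):

```python
import numpy as np

def pairwise_r2(H):
    """All-pairs LD (r^2) via one dense matrix product.
    H is SNPs x haplotypes with 0/1 entries."""
    m, n = H.shape
    p = H.mean(axis=1)                  # per-SNP allele frequencies
    pab = (H @ H.T) / n                 # joint frequencies for every SNP pair
    D = pab - np.outer(p, p)            # LD coefficients for every pair
    denom = np.outer(p * (1 - p), p * (1 - p))
    return D * D / denom
```

Because the whole computation is one GEMM-like product plus elementwise work, it inherits the performance engineering of optimized BLAS kernels, which is the point the abstract makes.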
7. 3D Coded SUMMA: Communication-Efficient and Robust Parallel Matrix Multiplication
- Author
-
Viveck R. Cadambe, Vipul Gupta, Tze Meng Low, Yaoqing Yang, Pulkit Grover, Kannan Ramchandran, Christian Engelmann, and Haewon Jeong
- Subjects
Computer science ,Computation ,0202 electrical engineering, electronic engineering, information engineering ,Redundancy (engineering) ,020206 networking & telecommunications ,010103 numerical & computational mathematics ,02 engineering and technology ,Parallel computing ,0101 mathematics ,Error detection and correction ,01 natural sciences ,Matrix multiplication - Abstract
In this paper, we propose a novel fault-tolerant parallel matrix multiplication algorithm called 3D Coded SUMMA that achieves higher failure-tolerance than replication-based schemes for the same amount of redundancy. This work bridges the gap between recent developments in coded computing and fault-tolerance in high-performance computing (HPC). The core idea of coded computing is the same as algorithm-based fault-tolerance (ABFT), which is weaving redundancy into the computation using error-correcting codes. In particular, we show that MatDot codes, an innovative code construction for parallel matrix multiplications, can be integrated into three-dimensional SUMMA (Scalable Universal Matrix Multiplication Algorithm [30]) in a communication-avoiding manner. To tolerate any two node failures, the proposed 3D Coded SUMMA requires ~50% less redundancy than replication, while the overhead in execution time is only about 5-10%.
- Published
- 2020
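The MatDot construction at the heart of 3D Coded SUMMA can be demonstrated at small scale: four workers each multiply one encoded block pair, and the true product is recovered from any three of them by polynomial interpolation. The two-block split, evaluation points, and survivor set below are our illustrative choices:

```python
import numpy as np

def matdot_demo(A, B, eval_points=(1.0, 2.0, 3.0, 4.0), survivors=(0, 2, 3)):
    """MatDot-coded matrix multiplication with 2 blocks: A@B is the x^1
    coefficient of pA(x)@pB(x), recoverable from any 3 worker outputs."""
    k = A.shape[1] // 2
    A1, A2 = A[:, :k], A[:, k:]          # column blocks of A
    B1, B2 = B[:k, :], B[k:, :]          # row blocks of B
    # Worker i computes pA(xi) @ pB(xi) with pA(x)=A1+x*A2, pB(x)=B2+x*B1;
    # the product is A1@B2 + (A1@B1 + A2@B2)*x + A2@B1*x^2, and the
    # x^1 coefficient equals A@B.
    results = {i: (A1 + x * A2) @ (B2 + x * B1)
               for i, x in enumerate(eval_points)}
    xs = [eval_points[i] for i in survivors]
    V = np.array([[1.0, x, x * x] for x in xs])  # Vandermonde in 1, x, x^2
    W = np.linalg.inv(V)                          # interpolation weights
    # Row 1 of W extracts the x^1 coefficient from the surviving results.
    return sum(w * results[i] for w, i in zip(W[1], survivors))
```

With four workers and a degree-2 product polynomial, any single node failure is tolerated, whereas plain replication would need a full duplicate of each multiplication.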
8. Exploration of Fine-Grained Parallelism for Load Balancing Eager K-truss on GPU and CPU
- Author
-
Kyungjoo Kim, Mark Blanco, and Tze Meng Low
- Subjects
FOS: Computer and information sciences ,Computer Science - Performance ,Computer science ,Computation ,Truss ,020207 software engineering ,02 engineering and technology ,Parallel computing ,Load balancing (computing) ,Performance (cs.PF) ,Computer Science - Distributed, Parallel, and Cluster Computing ,020204 information systems ,Linear algebra ,0202 electrical engineering, electronic engineering, information engineering ,Central processing unit ,Graph algorithms ,Distributed, Parallel, and Cluster Computing (cs.DC) ,Implementation - Abstract
In this work we present a performance exploration on Eager K-truss, a linear-algebraic formulation of the K-truss graph algorithm. We address performance issues related to load imbalance of parallel tasks in symmetric, triangular graphs by presenting a fine-grained parallel approach to executing the support computation. This approach also increases available parallelism, making it amenable to GPU execution. We demonstrate our fine-grained parallel approach using implementations in Kokkos and evaluate them on an Intel Skylake CPU and an Nvidia Tesla V100 GPU. Overall, we observe between a 1.26-1.48x improvement on the CPU and a 9.97-16.92x improvement on the GPU due to our fine-grained parallel formulation.
Comment: 2019 IEEE High Performance Extreme Computing Conference (HPEC)
- Published
- 2020
- Full Text
- View/download PDF
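The support computation that the paper parallelizes has a compact linear-algebraic form: for adjacency matrix A, the support of edge (i, j), i.e. the number of triangles it participates in, is the (i, j) entry of A @ A masked elementwise by A. A NumPy stand-in (the paper's implementations use Kokkos, not NumPy):

```python
import numpy as np

def edge_support(A):
    """Support of each edge via linear algebra: entry (i, j) of (A @ A)
    counts the common neighbors of i and j; masking by A keeps only
    entries that correspond to actual edges."""
    return (A @ A) * A

# Complete graph on 4 vertices: every edge lies in exactly 2 triangles.
A_k4 = np.array([[0, 1, 1, 1],
                 [1, 0, 1, 1],
                 [1, 1, 0, 1],
                 [1, 1, 1, 0]])
```

The load-imbalance issue the abstract targets arises because rows of A (vertex neighborhoods) vary wildly in length, so distributing this product at row granularity leaves some tasks far heavier than others.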
9. Analytical cache modeling and tilesize optimization for tensor contractions
- Author
-
Fabrice Rastello, Rui Li, Tze Meng Low, Aravind Sukumaran-Rajam, P. Sadayappan, Richard Veras, Atanas Rountev, Ohio State University [Columbus] (OSU), Louisiana State University (LSU), Carnegie Mellon University [Pittsburgh] (CMU), Compiler Optimization and Run-time Systems (CORSE), Inria Grenoble - Rhône-Alpes, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Laboratoire d'Informatique de Grenoble (LIG ), and Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019])-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes [2016-2019] (UGA [2016-2019])
- Subjects
Memory hierarchy ,Computer science ,Locality ,tensor contraction ,performance modeling ,[INFO.INFO-SE]Computer Science [cs]/Software Engineering [cs.SE] ,010103 numerical & computational mathematics ,02 engineering and technology ,model-driven design-space exploration ,computer.software_genre ,01 natural sciences ,[INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL] ,Bottleneck ,Permutation ,Tensor (intrinsic definition) ,0202 electrical engineering, electronic engineering, information engineering ,020201 artificial intelligence & image processing ,Compiler ,Cache ,[INFO.INFO-DC]Computer Science [cs]/Distributed, Parallel, and Cluster Computing [cs.DC] ,0101 mathematics ,domain-specific compiler optimization ,computer ,Algorithm - Abstract
Data movement between processor and memory hierarchy is a fundamental bottleneck that limits the performance of many applications on modern computer architectures. Tiling and loop permutation are key techniques for improving data locality. However, selecting effective tile-sizes and loop permutations is particularly challenging for tensor contractions due to the large number of loops. Even state-of-the-art compilers usually produce sub-optimal tile-sizes and loop permutations, as they rely on naive cost models. In this paper we provide an analytical model based approach to multi-level tile-size optimization and permutation selection for tensor contractions. Our experimental results show that this approach achieves comparable or better performance than state-of-the-art frameworks and libraries for tensor contractions.
- Published
- 2019
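The contrast the paper draws, analytical models versus empirical autotuning, can be sketched with a deliberately simplified one-level model: pick the largest tile whose working set fits in cache, by evaluating a closed-form footprint rather than timing candidate configurations. The three-tile footprint formula and parameter names below are our simplification, not the paper's multi-level model:

```python
def pick_tile_sizes(cache_bytes, elem_bytes=8, max_tile=512):
    """Choose square tile sizes (Ti, Tj, Tk) for a tiled matmul-like
    contraction so that one tile each of A, B and C fits in the cache,
    maximizing the tile volume (work per byte moved)."""
    best, best_vol = (1, 1, 1), 0
    for t in range(1, max_tile + 1):
        footprint = 3 * t * t * elem_bytes      # A, B, C tiles, each t x t
        if footprint <= cache_bytes and t * t * t > best_vol:
            best, best_vol = (t, t, t), t * t * t
    return best
```

The paper's model additionally handles multiple cache levels, non-square tiles, loop permutations, and the many-loop structure of general tensor contractions, where exhaustive empirical search becomes impractical.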
10. Linear algebraic depth-first search
- Author
-
Daniele G. Spampinato, Tze Meng Low, and Upasana Sridhar
- Subjects
Computer Science::Performance ,Discrete mathematics ,Tree traversal ,Computer science ,Breadth-first search ,Linear algebra ,Topological sorting ,Graph (abstract data type) ,Adjacency matrix ,Algebraic number ,Computer Science::Data Structures and Algorithms ,Distributed File System ,MathematicsofComputing_DISCRETEMATHEMATICS - Abstract
There is a recent push by a segment of the graph community to implement graph algorithms in the language of linear algebra. However, graph algorithms that depend on depth-first search (DFS) techniques are often highlighted as limitations of the linear algebraic approach as linear algebraic formulation of DFS algorithms are few, if any. This paper provides a linear algebraic approach for developing DFS graph algorithms and demonstrates its use for defining three classical DFS-based computations: Binary tree traversal, topological sort, and biconnected components.
- Published
- 2019
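As rough intuition for driving DFS from an adjacency matrix, the next child at each step can be read off the current vertex's row with visited vertices masked out. This pure-Python sketch is only a stand-in; the paper's actual linear-algebraic formulation is richer than row scanning:

```python
def dfs_order(A, root=0):
    """DFS preorder where child selection is a masked lookup in the
    current vertex's adjacency-matrix row (A is a 0/1 matrix as nested
    lists)."""
    n = len(A)
    visited = [False] * n
    order, stack = [], [root]
    while stack:
        v = stack[-1]
        if not visited[v]:
            visited[v] = True
            order.append(v)
        # Mask out visited vertices, then take the smallest-index neighbor.
        nxt = [u for u in range(n) if A[v][u] and not visited[u]]
        if nxt:
            stack.append(nxt[0])
        else:
            stack.pop()
    return order
```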
11. Delta-stepping SSSP: from Vertices and Edges to GraphBLAS Implementations
- Author
-
Upasana Sridhar, Rahul Mayuranath, Tze Meng Low, Daniele G. Spampinato, Scott McMillan, and Mark Blanco
- Subjects
FOS: Computer and information sciences ,020203 distributed computing ,Theoretical computer science ,Computer science ,Duality (optimization) ,020207 software engineering ,02 engineering and technology ,Vertex (geometry) ,Linear algebra ,Computer Science - Data Structures and Algorithms ,0202 electrical engineering, electronic engineering, information engineering ,Data Structures and Algorithms (cs.DS) ,Algebraic number ,Implementation ,Dijkstra's algorithm ,Sparse matrix ,MathematicsofComputing_DISCRETEMATHEMATICS - Abstract
GraphBLAS is an interface for implementing graph algorithms. Algorithms implemented using the GraphBLAS interface are cast in terms of linear algebra-like operations. However, many graph algorithms are canonically described in terms of operations on vertices and/or edges. Despite the known duality between these two representations, the differences in the way algorithms are described using the two approaches can pose considerable difficulties in the adoption of the GraphBLAS as a standard interface for development. This paper investigates a systematic approach for translating a graph algorithm described in the canonical vertex and edge representation into an implementation that leverages the GraphBLAS interface. We present a two-step approach to this problem. First, we express common vertex- and edge-centric design patterns using a linear algebraic language. Second, we map this intermediate representation to the GraphBLAS interface. We illustrate our approach by translating the delta-stepping single source shortest path algorithm from its canonical description to a GraphBLAS implementation, and highlight lessons learned when implementing using GraphBLAS.
Comment: 10 pages, 4 figures, IPDPSW GRAPL 2019 Workshop
- Published
- 2019
- Full Text
- View/download PDF
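The translation target the paper describes can be previewed with the classic semiring view of SSSP: one relaxation step is a min-plus vector-matrix product, iterated to a fixed point (delta-stepping adds bucketing on top of this primitive). A plain-Python sketch, not GraphBLAS API code:

```python
INF = float("inf")

def minplus_relax(d, W):
    """One relaxation as a min-plus vector-matrix product:
    new d[j] = min(d[j], min_i d[i] + W[i][j]).
    W[i][j] is the edge weight i->j, INF where there is no edge."""
    n = len(d)
    return [min(d[j], min(d[i] + W[i][j] for i in range(n)))
            for j in range(n)]

def sssp(W, src=0):
    """Iterate min-plus relaxations to a fixed point (Bellman-Ford in
    semiring clothing)."""
    d = [INF] * len(W)
    d[src] = 0.0
    while True:
        nd = minplus_relax(d, W)
        if nd == d:
            return d
        d = nd
```

Mapping vertex/edge-centric pseudocode onto operations like this one is exactly the translation problem the paper systematizes.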
12. The BLIS Framework
- Author
-
Michael Kistler, Mikhail Smelyanskiy, Tze Meng Low, Vernon Austel, Robert A. van de Geijn, Lee Killough, Tyler M. Smith, Field G. Van Zee, John A. Gunnels, Francisco D. Igual, Bryan Marker, and Xianyi Zhang
- Subjects
020203 distributed computing ,Vendor ,Computer science ,Programming language ,Applied Mathematics ,010103 numerical & computational mathematics ,02 engineering and technology ,computer.software_genre ,01 natural sciences ,Software framework ,Software portability ,Computer architecture ,Linear algebra ,0202 electrical engineering, electronic engineering, information engineering ,Leverage (statistics) ,0101 mathematics ,IBM ,Implementation ,computer ,Software - Abstract
BLIS is a new software framework for instantiating high-performance BLAS-like dense linear algebra libraries. We demonstrate how BLIS acts as a productivity multiplier by using it to implement the level-3 BLAS on a variety of current architectures. The systems for which we demonstrate the framework include state-of-the-art general-purpose, low-power, and many-core architectures. We show, with very little effort, how the BLIS framework yields sequential and parallel implementations that are competitive with the performance of ATLAS, OpenBLAS (an effort to maintain and extend the GotoBLAS), and commercial vendor implementations such as AMD’s ACML, IBM’s ESSL, and Intel’s MKL libraries. Although most of this article focuses on single-core implementation, we also provide compelling results that suggest the framework’s leverage extends to the multithreaded domain.
- Published
- 2016
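The layering BLIS exposes is, at its core, a fixed structure of cache-blocking loops around a small kernel. A pure-Python sketch of those outer loops (real BLIS adds data packing and a register-level micro-kernel below them; the default block sizes here are placeholders for the per-architecture tuning parameters):

```python
def blocked_gemm(A, B, C, mc=64, kc=64, nc=64):
    """C += A @ B with the cache-blocking loop structure BLIS makes
    explicit. A, B, C are nested lists; mc/kc/nc are block sizes."""
    m, k = len(A), len(B)
    n = len(B[0])
    for jc in range(0, n, nc):            # column panels of C (L3-resident)
        for pc in range(0, k, kc):        # k-dimension blocks (packed in BLIS)
            for ic in range(0, m, mc):    # row panels of C (L2-resident)
                for i in range(ic, min(ic + mc, m)):
                    for p in range(pc, min(pc + kc, k)):
                        a = A[i][p]
                        for j in range(jc, min(jc + nc, n)):
                            C[i][j] += a * B[p][j]
    return C
```

Because this structure is fixed, instantiating the framework on a new architecture mostly means supplying a micro-kernel and block sizes, which is the "productivity multiplier" claim in the abstract.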
13. A Unified Coded Deep Neural Network Training Strategy based on Generalized PolyDot codes
- Author
-
Tze Meng Low, Ziqian Bai, Pulkit Grover, Sanghamitra Dutra, and Haewon Jeong
- Subjects
Nonlinear system ,Artificial neural network ,Computer science ,Computation ,0202 electrical engineering, electronic engineering, information engineering ,020206 networking & telecommunications ,02 engineering and technology ,010501 environmental sciences ,01 natural sciences ,Algorithm ,Decoding methods ,Matrix multiplication ,0105 earth and related environmental sciences - Abstract
This paper has two main contributions. First, we propose a novel coding technique - Generalized PolyDot - for matrix-vector products that advances on existing techniques for coded matrix operations under storage and communication constraints. Next, we use Generalized PolyDot for the problem of training large Deep Neural Networks (DNNs) using unreliable nodes that are prone to soft-errors, e.g., bit flips during computation that produce erroneous outputs. An additional difficulty imposed by the problem of DNN training is that the parameter values (weight matrices) are updated at every iteration, and thus require a prohibitively large encoding cost at every iteration if we naively extend existing coded computing techniques. Thus, we propose a “unified” coded DNN training strategy where we weave coding into the operations of DNN training itself, so that the weight matrices, once initially encoded, remain encoded during updates with negligible encoding/decoding overhead per iteration. Moreover, our strategy can also allow for errors even in the nonlinear step of training. Finally, our coded DNN training strategy is completely decentralized: no assumptions on the presence of a master node are made, which avoids any single point of failure under soft-errors. Our strategy can provide unboundedly better error tolerance than the competing replication strategy and an MDS-code-based strategy [1].
- Published
- 2018
14. Analytical Modeling is Enough for High Performance BLIS
- Author
-
Tze Meng Low, Francisco D. Igual, Tyler M. Smith, and Enrique S. Quintana-Ortí
- Subjects
Matrix multiplication ,Theoretical computer science ,Computer science ,Libraries ,010103 numerical & computational mathematics ,02 engineering and technology ,computer.software_genre ,01 natural sciences ,Mathematics of computing ,Mathematical software performance ,Analytical modeling ,Software ,Development (topology) ,High performance ,0202 electrical engineering, electronic engineering, information engineering ,Linear algebra ,0101 mathematics ,020203 distributed computing ,business.industry ,Programming language ,Applied Mathematics ,Multiplication ,business ,computer - Abstract
We show how the BLAS-like Library Instantiation Software (BLIS) framework, which provides a more detailed layering of the GotoBLAS (now maintained as OpenBLAS) implementation, allows one to analytically determine tuning parameters for high-end instantiations of the matrix-matrix multiplication. This is of both practical and scientific importance, as it greatly reduces the development effort required for the implementation of the level-3 BLAS while also advancing our understanding of how hierarchically layered memories interact with high-performance software. This allows the community to move on from valuable engineering solutions (empirical autotuning) to scientific understanding (analytical insight).
This research was sponsored in part by NSF grants ACI-1148125/1340293 and CCF-0917167. Enrique S. Quintana-Ortí was supported by project TIN2011-23283 of the Ministerio de Ciencia e Innovación and FEDER. Francisco D. Igual was supported by project TIN2012-32180 of the Ministerio de Ciencia e Innovación.
- Published
- 2016
15. Scalable parallelization of FLAME code via the workqueuing model
- Author
-
Robert A. van de Geijn, Tze Meng Low, Paolo Bientinesi, and Field G. Van Zee
- Subjects
Automatic parallelization ,Shared memory ,Computer science ,Applied Mathematics ,Linear algebra ,Memory architecture ,Scalability ,Parallel algorithm ,Multiprocessing ,Distributed memory ,Parallel computing ,Software - Abstract
We discuss the OpenMP parallelization of linear algebra algorithms that are coded using the Formal Linear Algebra Methods Environment (FLAME) API. This API expresses algorithms at a higher level of abstraction, avoids the use of loop and array indices, and represents these algorithms as they are formally derived and presented. We report on two implementations of the workqueuing model, neither of which requires the use of explicit indices to specify parallelism. The first implementation uses the experimental taskq pragma, which may influence the adoption of a similar construct into OpenMP 3.0. The second workqueuing implementation is domain-specific to FLAME but allows us to illustrate the benefits of sorting tasks according to their computational cost prior to parallel execution. In addition, we discuss how scalable parallelization of dense linear algebra algorithms via OpenMP will require a two-dimensional partitioning of operands much like a 2D data distribution is needed on distributed memory architectures. We illustrate the issues and solutions by discussing the parallelization of the symmetric rank-k update and report impressive performance on an SGI system with 14 Itanium2 processors.
- Published
- 2008
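The benefit of sorting tasks by computational cost before parallel execution can be illustrated with greedy longest-processing-time scheduling: handing the biggest tasks out first keeps workers evenly loaded. This is our sketch of the general idea, not the FLAME runtime's queueing mechanism:

```python
def schedule(tasks, n_workers):
    """Sort tasks by decreasing cost, then greedily assign each to the
    least-loaded worker (classic LPT scheduling). `tasks` is a list of
    task costs; returns the final load on each worker."""
    loads = [0.0] * n_workers
    for cost in sorted(tasks, reverse=True):
        w = loads.index(min(loads))     # least-loaded worker
        loads[w] += cost
    return loads
```

Processing tasks in arbitrary order can leave one worker stuck with a late large task; sorting first bounds the makespan much more tightly, which is why the paper sorts before dispatching.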
16. Exploiting Symmetry in Tensors for High Performance: Multiplication with Symmetric Tensors
- Author
-
Martin D. Schatz, Tze Meng Low, Tamara G. Kolda, and Robert A. van de Geijn
- Subjects
FOS: Computer and information sciences ,Multilinear algebra ,Applied Mathematics ,Computation ,Numerical Analysis (math.NA) ,010103 numerical & computational mathematics ,15-02 (Primary) ,Data structure ,Topology ,01 natural sciences ,010305 fluids & plasmas ,Computational Mathematics ,Matrix (mathematics) ,0103 physical sciences ,FOS: Mathematics ,Symmetric tensor ,Multiplication ,Computer Science - Mathematical Software ,Tensor ,Mathematics - Numerical Analysis ,0101 mathematics ,Symmetry (geometry) ,Mathematical Software (cs.MS) ,Mathematics - Abstract
Symmetric tensor operations arise in a wide variety of computations. However, the benefits of exploiting symmetry in order to reduce storage and computation is in conflict with a desire to simplify memory access patterns. In this paper, we propose a blocked data structure (Blocked Compact Symmetric Storage) wherein we consider the tensor by blocks and store only the unique blocks of a symmetric tensor. We propose an algorithm-by-blocks, already shown of benefit for matrix computations, that exploits this storage format by utilizing a series of temporary tensors to avoid redundant computation. Further, partial symmetry within temporaries is exploited to further avoid redundant storage and redundant computation. A detailed analysis shows that, relative to storing and computing with tensors without taking advantage of symmetry and partial symmetry, storage requirements are reduced by a factor of $ O\left( m! \right)$ and computational requirements by a factor of $O\left( (m+1)!/2^m \right)$, where $ m $ is the order of the tensor. However, as the analysis shows, care must be taken in choosing the correct block size to ensure these storage and computational benefits are achieved (particularly for low-order tensors). An implementation demonstrates that storage is greatly reduced and the complexity introduced by storing and computing with tensors by blocks is manageable. Preliminary results demonstrate that computational time is also reduced. The paper concludes with a discussion of how insights in this paper point to opportunities for generalizing recent advances in the domain of linear algebra libraries to the field of multi-linear computation.
- Published
- 2013
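The storage idea, keeping only the unique blocks of a symmetric object, is easiest to see in the order-2 case (a symmetric matrix). A NumPy sketch of the indexing (function names are ours; the paper's Blocked Compact Symmetric Storage generalizes this to order-m tensors, where the O(m!) savings come from):

```python
import numpy as np

def pack_symmetric(M, b):
    """Store only the unique b x b blocks of a symmetric matrix:
    block (i, j) with i <= j is kept; (j, i) is implied by symmetry."""
    n = M.shape[0] // b
    return {(i, j): M[i*b:(i+1)*b, j*b:(j+1)*b].copy()
            for i in range(n) for j in range(i, n)}

def get_block(packed, i, j):
    """Read any block; lower-triangle blocks come from the transpose."""
    return packed[(i, j)] if i <= j else packed[(j, i)].T
```

Computing by blocks against this layout avoids both the redundant storage and, with care, the redundant arithmetic that a dense representation would incur, at the price of the block-size choice the paper analyzes.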