14 results for "Felix Wolf"
Search Results
2. Tool-Supported Mini-App Extraction to Facilitate Program Analysis and Parallelization
- Author
-
Florian Dewald, Heiko Mantel, Christian Bischof, Felix Wolf, Mohammad Norouzi, and Jan-Patrick Lehr
- Subjects
Reduction (complexity), Identification (information), Source lines of code, Program analysis, Computer science, Automatic identification and data capture, Code (cryptography), Key (cryptography), Cyclomatic complexity, Parallel computing
- Abstract
The size and complexity of high-performance computing applications present a serious challenge to manual reasoning about program behavior. The vastness and diversity of code bases often break automatic analysis tools, which could otherwise be used. As a consequence, developers resort to mini-apps, i.e., trimmed-down proxies of the original programs that retain key performance characteristics. Unfortunately, their construction is difficult and time-consuming, which prevents their mass production. In this paper, we propose a systematic and tool-supported approach to extract mini-apps from large-scale applications that reduces the manual effort needed to create them. Our approach covers the stages of kernel identification, data capture, code extraction, and representativeness validation. We demonstrate it using an astrophysics simulation with ≈ 8.5 million lines of code and extract a mini-app with only ≈ 1,100 lines of code. For the mini-app, we evaluate the reduction of code complexity and execution similarity, and show how it enables the tool-supported discovery of unexploited parallelization opportunities, reducing the simulation's runtime significantly.
- Published
- 2021
- Full Text
- View/download PDF
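The representativeness-validation stage described in the abstract above can be illustrated with a small sketch. This is not the paper's implementation; all function names, profile shapes, and the 5% tolerance are illustrative assumptions. The idea: a mini-app is accepted if its per-kernel runtime shares stay close to those of the original application.

```python
# Hypothetical sketch of a representativeness check: accept a mini-app if
# every kernel's share of total runtime deviates from the original by at
# most `tolerance`. All names and numbers are illustrative.

def runtime_shares(profile):
    """Normalize a {kernel: seconds} profile to fractional shares."""
    total = sum(profile.values())
    return {k: t / total for k, t in profile.items()}

def is_representative(original, mini_app, tolerance=0.05):
    """Compare per-kernel runtime shares between the two profiles."""
    orig, mini = runtime_shares(original), runtime_shares(mini_app)
    return all(abs(orig[k] - mini.get(k, 0.0)) <= tolerance for k in orig)

# Example: a trimmed-down proxy that preserves the hot kernel's dominance.
original = {"solver": 80.0, "io": 15.0, "setup": 5.0}
mini_app = {"solver": 8.2, "io": 1.4, "setup": 0.4}
```

Even though the mini-app runs an order of magnitude less work in absolute terms, its relative runtime distribution matches the original, which is the property such a check would test.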
3. Accelerating winograd convolutions using symbolic computation and meta-programming
- Author
-
Matthew W. Moskewicz, Ali Jannesari, Felix Wolf, Tim Beringer, and Arya Mazaheri
- Subjects
Computational complexity theory, Computer science, Deep learning, Parallel computing, Symbolic computation, Metaprogramming, Convolution, CUDA, Software portability, Code generation, Artificial intelligence
- Abstract
Convolution operations are essential constituents of convolutional neural networks. Their efficient and performance-portable implementation demands tremendous programming effort and fine-tuning. Winograd's minimal filtering algorithm is a well-known method to reduce the computational complexity of convolution operations. Unfortunately, existing implementations of this algorithm are either vendor-specific or hard-coded to support a small subset of convolutions, thus limiting their versatility and performance portability. In this paper, we propose a novel method to optimize Winograd convolutions based on symbolic computation. Taking advantage of meta-programming and auto-tuning, we further introduce a system to automate the generation of efficient and portable Winograd convolution code for various GPUs. We show that our optimization technique can effectively exploit repetitive patterns, enabling us to reduce the number of arithmetic operations by up to 62% without compromising numerical stability. Moreover, we demonstrate in experiments that we can generate efficient kernels with runtimes close to deep-learning libraries, requiring only a minimum of programming effort, which confirms the performance portability of our approach.
- Published
- 2020
- Full Text
- View/download PDF
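The building block this paper optimizes, Winograd's minimal filtering algorithm, can be shown in its smallest 1-D instance, F(2,3): two outputs of a 3-tap filter computed with 4 multiplications instead of the direct method's 6. The sketch below is the textbook formulation, not the paper's GPU code generator.

```python
def direct_f23(d, g):
    """Direct 1-D convolution: two outputs, six multiplications."""
    return [d[0]*g[0] + d[1]*g[1] + d[2]*g[2],
            d[1]*g[0] + d[2]*g[1] + d[3]*g[2]]

def winograd_f23(d, g):
    """Winograd F(2,3): the same two outputs with four multiplications.
    The filter-side factors can be precomputed once per filter."""
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2
    m3 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2
    m4 = (d[1] - d[3]) * g[2]
    return [m1 + m2 + m3, m2 - m3 - m4]

d, g = [1.0, 2.0, 3.0, 4.0], [1.0, 0.5, 0.25]
fast, slow = winograd_f23(d, g), direct_f23(d, g)
```

Saving a third of the multiplications per tile is what makes the 2-D version attractive for convolutional layers, at the cost of the numerical-stability concerns the abstract mentions.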
4. Automatic construct selection and variable classification in OpenMP
- Author
-
Felix Wolf, Mohammad Norouzi, and Ali Jannesari
- Subjects
Correctness, Speedup, Semantics (computer science), Computer science, Construct (python library), Parallel computing, Variable (computer science), Task (computing), Software design pattern, Code (cryptography)
- Abstract
A major task of parallelization with OpenMP is to decide where in a program to insert which OpenMP construct such that speedup is maximized and correctness is preserved. Another challenge is the classification of variables that appear in a construct according to their data-sharing semantics. Manual classification is tedious and error-prone. Moreover, the choice of the data-sharing attribute can significantly affect performance. Grounded on the notion of parallel design patterns, we propose a method that identifies code regions to parallelize and selects appropriate OpenMP constructs for them. Also, we classify variables in the chosen constructs by analyzing data dependences that have been dynamically extracted from the program. Using our approach, we created OpenMP versions of 49 sequential benchmarks and compared them with the code produced by three state-of-the-art parallelization tools: our codes are faster in most cases, with average speedups relative to any of the three ranging from 1.8 to 2.7. Additionally, we automatically reclassified variables of OpenMP programs parallelized manually or with the help of these tools, improving their execution time by up to 29%.
- Published
- 2019
- Full Text
- View/download PDF
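The dependence-based variable classification the abstract describes can be sketched as a small decision procedure. This is a hedged illustration, not the paper's algorithm: the access facts, their names, and the decision order are all assumptions, and real classification must consider many more cases.

```python
# Toy classifier: map simple per-variable access facts (assumed to come
# from dynamic dependence analysis of a parallel loop) to an OpenMP
# data-sharing attribute. Illustrative only.

def classify(var):
    if not var["written"]:
        return "shared"  # read-only data can safely be shared
    if var["reduction_op"]:
        # accumulations like sum += x map to a reduction clause
        return "reduction({})".format(var["reduction_op"])
    if var["written_before_read_each_iteration"]:
        return "private"  # each iteration uses its own fresh copy
    return "shared"       # fallback; may require explicit synchronization

facts = {
    "n":   {"written": False, "reduction_op": None,
            "written_before_read_each_iteration": False},
    "tmp": {"written": True,  "reduction_op": None,
            "written_before_read_each_iteration": True},
    "sum": {"written": True,  "reduction_op": "+",
            "written_before_read_each_iteration": False},
}
attributes = {name: classify(v) for name, v in facts.items()}
```

The point of the example is the abstract's claim that the attribute choice matters: misclassifying `tmp` as shared would force synchronization, while misclassifying `sum` as private would break correctness.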
5. Unveiling Thread Communication Bottlenecks Using Hardware-Independent Metrics
- Author
-
Ali Jannesari, Felix Wolf, and Arya Mazaheri
- Subjects
Profiling (computer programming), Computer science, Distributed computing, Locality, Thread (computing), Data structure, Processor affinity, Shared memory, Multithreading, Scalability, Cache
- Abstract
A critical factor for developing robust shared-memory applications is the efficient use of the cache and the communication between threads. Inappropriate data structures, algorithm design, and inefficient thread affinity may result in superfluous communication between threads/cores and severe performance problems. For this reason, state-of-the-art profiling tools focus on thread communication and behavior to present different metrics that enable programmers to write cache-friendly programs. The data shared between a pair of threads should be reused with a reasonable distance to preserve data locality. However, existing tools do not take into account the locality of communication events and mainly focus on analyzing the amount of communication instead. In this paper, we introduce a new method to analyze performance and communication bottlenecks that arise from data-access patterns and thread interactions of each code region. We propose new hardware-independent metrics to characterize thread communication and provide suggestions for applying appropriate optimizations on a specific code region. We evaluated our approach on the SPLASH and Rodinia benchmark suites. Experimental results validate the effectiveness of our approach by finding communication locality issues due to inefficient data structures and/or poor algorithm implementations. By applying the suggested optimizations, we improved the performance in Rodinia benchmarks by up to 56%. Furthermore, by varying the input size we demonstrated the ability of our method to assess the cache usage and scalability of a given application in terms of its inherent communication.
- Published
- 2018
- Full Text
- View/download PDF
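A hardware-independent notion of thread communication, as advocated above, can be made concrete with a toy model: a communication event occurs when one thread reads a memory block last written by another thread. The event format and block granularity below are illustrative assumptions, not the paper's metrics.

```python
from collections import defaultdict

def communication_matrix(trace):
    """trace: list of (thread_id, op, block) with op in {'R', 'W'}.
    Returns {(writer, reader): count} of cross-thread
    read-after-write events; same-thread reuse is not communication."""
    last_writer = {}
    comm = defaultdict(int)
    for tid, op, block in trace:
        if op == "W":
            last_writer[block] = tid
        elif block in last_writer and last_writer[block] != tid:
            comm[(last_writer[block], tid)] += 1
    return dict(comm)

trace = [
    (0, "W", "a"), (1, "R", "a"),  # thread 0 -> thread 1
    (1, "W", "b"), (0, "R", "b"),  # thread 1 -> thread 0
    (0, "W", "c"), (0, "R", "c"),  # local reuse: not counted
]
matrix = communication_matrix(trace)
```

The abstract's criticism is that counting such events alone is insufficient; the paper additionally considers how far apart (in reuse distance) the producing write and consuming read are, which this sketch omits.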
6. Brief Announcement
- Author
-
Ali Jannesari, Felix Wolf, and Rohit Atre
- Subjects
Profiling (computer programming), Data parallelism, Programming language, Computer science, Computation, Task parallelism, Parallel computing, Static analysis, Implicit parallelism, Instruction-level parallelism, Language construct
- Abstract
Discovering which code sections in a sequential program can be made to run in parallel is the first step in parallelizing it, and programmers routinely struggle in this step. Most of the current parallelism discovery techniques focus on specific language constructs while trying to identify such code sections. In contrast, we propose to concentrate on the computations performed by a program. In our approach, a program is treated as a collection of computations communicating with one another using a number of variables. Each computation is represented as a Computational Unit (CU). A CU contains the inputs and outputs of a computation, and the three phases of a computation: read, compute, and write. Based on the notion of CUs, we present a unified framework to identify both loop and task parallelism in sequential programs.
- Published
- 2017
- Full Text
- View/download PDF
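The CU abstraction from the abstract above can be encoded minimally: each CU carries a read set and a write set, and two CUs may run in parallel when neither touches what the other writes. This is an illustrative sketch of the idea, not the framework's data structure.

```python
class CU:
    """A Computational Unit: inputs (read set) and outputs (write set)
    of one read-compute-write computation."""
    def __init__(self, reads, writes):
        self.reads, self.writes = set(reads), set(writes)

def independent(a, b):
    """True when there is no read-write or write-write overlap in
    either direction, so a and b could execute in parallel."""
    return (a.writes.isdisjoint(b.reads)
            and b.writes.isdisjoint(a.reads)
            and a.writes.isdisjoint(b.writes))

cu1 = CU(reads={"x"}, writes={"y"})
cu2 = CU(reads={"x"}, writes={"z"})  # shares only a read with cu1
cu3 = CU(reads={"y"}, writes={"w"})  # consumes what cu1 produces
```

Shared reads (as between `cu1` and `cu2`) are harmless, which is exactly why a CU-level view can expose parallelism that a construct-level view misses.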
7. Preventing the explosion of exascale profile data with smart thread-level aggregation
- Author
-
Felix Wolf, Sergei Shudler, and Daniel Lorenz
- Subjects
Engineering, Compression ratio, Parallel computing, Thread (computing), Analysis tools, Exascale computing, Data compression
- Abstract
State-of-the-art performance analysis tools, such as Score-P, record performance profiles on a per-thread basis. However, for exascale systems the number of threads is expected to be on the order of a billion, which would result in extremely large performance profiles. Yet users rarely inspect the individual per-thread data. In this paper, we propose to aggregate per-thread performance data in each process to reduce its amount to a reasonable size. Our goal is to aggregate the threads such that thread-level performance issues are still visible and analyzable. Therefore, we implemented four aggregation strategies in Score-P: (i) SUM -- aggregates all threads of a process into a process profile; (ii) SET -- calculates statistical key data as well as the sum; (iii) KEY -- identifies three threads (i.e., key threads) of particular interest for performance analysis and aggregates the rest of the threads; (iv) CALLTREE -- clusters threads that have the same call-tree structure. For each of these strategies we evaluate the compression ratio and how well they maintain thread-level performance behavior. The aggregation does not incur any additional performance overhead at application run-time.
- Published
- 2015
- Full Text
- View/download PDF
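The first two aggregation strategies listed in the abstract, SUM and SET, can be sketched on a toy per-thread profile (one metric value per thread for a single call path). The data layout is an assumption for illustration; Score-P's actual profile format is richer.

```python
def aggregate_sum(values):
    """SUM strategy: collapse all threads of a process into one number."""
    return sum(values)

def aggregate_set(values):
    """SET strategy: keep statistical key data alongside the sum, so an
    outlier thread remains visible after aggregation."""
    return {"sum": sum(values), "min": min(values),
            "max": max(values), "mean": sum(values) / len(values)}

per_thread_time = [1.0, 2.0, 3.0, 10.0]  # thread 3 is an outlier
summed = aggregate_sum(per_thread_time)
stats = aggregate_set(per_thread_time)
```

The example shows the trade-off the paper evaluates: SUM compresses best but hides the outlier entirely, while SET keeps enough statistics (here, the max) to reveal a thread-level imbalance.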
8. Exascaling Your Library
- Author
-
Alexandru Calotoiu, Alexandre Strube, Felix Wolf, Sergei Shudler, and Torsten Hoefler
- Subjects
Analytical expressions, Computer science, Programming language, Scalability, Regression testing, Benchmarking, Software engineering, Supercomputer, Field (computer science)
- Abstract
Many libraries in the HPC field encapsulate sophisticated algorithms with clear theoretical scalability expectations. However, hardware constraints or programming bugs may sometimes render these expectations inaccurate or even plainly wrong. While algorithm engineers have already been advocating the systematic combination of analytical performance models with practical measurements for a very long time, we go one step further and show how this comparison can become part of automated testing procedures. The most important applications of our method include initial validation, regression testing, and benchmarking to compare implementation and platform alternatives. Advancing the concept of performance assertions, we verify asymptotic scaling trends rather than precise analytical expressions, relieving the developer from the burden of having to specify and maintain very fine grained and potentially non-portable expectations. In this way, scalability validation can be continuously applied throughout the whole development cycle with very little effort. Using MPI as an example, we show how our method can help uncover non-obvious limitations of both libraries and underlying platforms.
- Published
- 2015
- Full Text
- View/download PDF
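The core check, verifying an asymptotic scaling trend against measurements rather than a precise analytical expression, can be sketched as a log-log regression. This toy version (names, slack value, and power-law model are assumptions) passes when the empirically fitted exponent of t(p) ≈ c·p^k stays near the expectation.

```python
import math

def fitted_exponent(procs, times):
    """Least-squares slope in log-log space, i.e. the empirical k
    in t(p) ~ c * p^k."""
    xs = [math.log(p) for p in procs]
    ys = [math.log(t) for t in times]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

def scaling_ok(procs, times, expected_k, slack=0.2):
    """A scalability assertion: the measured exponent must stay within
    `slack` of the expected asymptotic trend."""
    return abs(fitted_exponent(procs, times) - expected_k) <= slack

procs = [2, 4, 8, 16, 32]
linear_cost = [3.0 * p for p in procs]         # behaves like O(p)
quadratic_cost = [0.5 * p * p for p in procs]  # violates O(p)
```

Because only the exponent is checked, the constant factor c can vary across platforms, which is precisely the portability argument the abstract makes for trends over exact expressions.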
9. The Basic Building Blocks of Parallel Tasks
- Author
-
Ali Jannesari, Rohit Atre, and Felix Wolf
- Subjects
Identification (information), Computer science, Process (engineering), Block (programming), Benchmark (computing), Code (cryptography), Parallelism (grammar), Task parallelism, Parallel computing, Implementation
- Abstract
Discovery of parallelization opportunities in sequential programs can greatly reduce the time and effort required to parallelize any application. Identification and analysis of code that contains little to no internal parallelism can also help expose potential parallelism. This paper provides a technique to identify a block of code called Computational Unit (CU) that performs a unit of work in a program. A CU can assist in discovering the potential parallelism in a sequential program by acting as a basic building block for tasks. CUs are used along with dynamic analysis information to identify the tasks that contain tightly coupled code within them. This process in turn reveals the tasks that are weakly dependent or independent. The independent tasks can be run in parallel and the dependent tasks can be analyzed to check if the dependences can be resolved. To evaluate our technique, different benchmark applications are parallelized using our identified tasks and the speedups are reported. In addition, existing parallel implementations of the applications are compared with the identified tasks for the respective applications.
- Published
- 2015
- Full Text
- View/download PDF
10. Dependence-Based Code Transformation for Coarse-Grained Parallelism
- Author
-
Zhen Li, Bo Zhao, Weiguo Wu, Ali Jannesari, and Felix Wolf
- Subjects
Profiling (computer programming), Multi-core processor, Code refactoring, Data parallelism, Computer science, Task parallelism, Parallel computing, Implicit parallelism, Scalable parallelism, Instruction-level parallelism
- Abstract
Multicore architectures are becoming more common today. Many software products implemented sequentially have failed to exploit the potential parallelism of multicore architectures. Significant re-engineering and refactoring of existing software is needed to support the use of new hardware features. Due to the high cost of manual transformation, an automated approach to transforming existing software and taking advantage of multicore architectures would be highly beneficial. We propose a novel auto-parallelization approach, which integrates data-dependence profiling, task-parallelism extraction, and source-to-source transformation. Coarse-grained task parallelism is detected based on a concept called Computational Unit (CU). We use dynamic profiling information to gather control- and data-dependences among tasks and generate a task graph. In addition, we develop a source-to-source transformation tool based on LLVM, which can perform high-level code restructuring. It transforms the generated task graph with loop parallelism and task parallelism of sequential code into parallel code using Intel Threading Building Blocks (TBB). We evaluated NAS Parallel Benchmark applications, three applications from the PARSEC benchmark suite, and real-world applications. The obtained results confirm that our approach is able to achieve promising performance with minor user interference. The average speedups of loop parallelization and task parallelization are 3.12x and 9.92x, respectively.
- Published
- 2015
- Full Text
- View/download PDF
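One way to picture what happens between the generated task graph and the emitted TBB code is to group tasks into "waves" whose members have no dependences on one another and can be spawned together. This is a generic topological-scheduling sketch under assumed graph data, not the paper's transformation tool.

```python
def schedule_waves(deps):
    """deps: {task: set of tasks it depends on}.
    Returns a list of waves; tasks in one wave are mutually independent
    and could be spawned as a parallel group (e.g., a TBB task group)."""
    remaining = {t: set(d) for t, d in deps.items()}
    waves = []
    while remaining:
        ready = sorted(t for t, d in remaining.items() if not d)
        if not ready:
            raise ValueError("dependence cycle")
        waves.append(ready)
        for t in ready:
            del remaining[t]
        for d in remaining.values():
            d.difference_update(ready)
    return waves

# A toy dependence graph: B and C both wait on A; D waits on B and C.
waves = schedule_waves({"A": set(), "B": {"A"}, "C": {"A"}, "D": {"B", "C"}})
```

Here the middle wave exposes the coarse-grained parallelism: B and C run concurrently, and the wave boundaries mark the synchronization points the generated code must respect.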
11. Catching Idlers with Ease
- Author
-
Marc-André Hermanns, David Böhme, Markus Geimer, Felix Wolf, Daniel Lorenz, and Guoyong Mao
- Subjects
Profiling (computer programming), Wait state, Software portability, Low overhead, Computer science, Scalability, Real-time computing, Timestamp
- Abstract
Load imbalance usually introduces wait states into the execution of parallel programs. Being able to identify and quantify wait states is therefore essential for the diagnosis and remediation of this phenomenon. An established method of detecting wait states is to generate event traces and compare relevant timestamps across process boundaries. However, large trace volumes usually prevent the analysis of longer execution periods. In this paper, we present an extremely lightweight wait-state profiler that does not rely on traces and can therefore estimate wait states in MPI codes with arbitrarily long runtimes. The profiler combines scalability with portability and low overhead.
- Published
- 2014
- Full Text
- View/download PDF
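The basic arithmetic behind a wait-state estimate can be shown for the simplest case, a collective synchronization point: every rank waits for the last arrival, so its wait time is the gap between the latest arrival timestamp and its own. The data shape is an illustrative assumption; the profiler itself works without full traces.

```python
def wait_states(arrivals):
    """arrivals: {rank: arrival timestamp at a barrier}.
    Returns {rank: estimated wait time}: everyone waits for the
    last arrival, which itself waits not at all."""
    latest = max(arrivals.values())
    return {rank: latest - t for rank, t in arrivals.items()}

# Rank 2 arrives last; the early ranks absorb the load imbalance as waiting.
waits = wait_states({0: 10.0, 1: 12.0, 2: 15.0})
```

Summing such per-rank waits over many synchronization points yields the kind of aggregate wait-state estimate a trace-free profiler can maintain at low overhead.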
12. Understanding the formation of wait states in applications with one-sided communication
- Author
-
Felix Wolf, Marc-André Hermanns, David Böhme, and Manfred Miklosch
- Subjects
Core (game theory), Cover (telecommunications), Computer science, One sided, Distributed computing, Benchmark (computing), Root cause, Critical path method
- Abstract
To better understand the formation of wait states in MPI programs and to support the user in finding optimization targets in the case of load imbalance, a major source of wait states, we added in our earlier work two new trace-analysis techniques to Scalasca, a performance analysis tool designed for large-scale applications. In this paper, we show how the two techniques, which were originally restricted to two-sided and collective MPI communication, are extended to cover also one-sided communication. We demonstrate our experiences with benchmark programs and a mini-application representing the core of the POP ocean model.
- Published
- 2013
- Full Text
- View/download PDF
13. Space-efficient time-series call-path profiling of parallel applications
- Author
-
Brian J. N. Wylie, Felix Wolf, and Zoltán Szebenyi
- Subjects
Profiling (computer programming), Computer science, Parallel computing, Cluster analysis, Semantic compression
- Abstract
The performance behavior of parallel simulations often changes considerably as the simulation progresses --- with potentially process-dependent variations of temporal patterns. While call-path profiling is an established method of linking a performance problem to the context in which it occurs, call paths reveal only little information about the temporal evolution of performance phenomena. However, generating call-path profiles separately for thousands of iterations may exceed available buffer space --- especially when the call tree is large and more than one metric is collected. In this paper, we present a runtime approach for the semantic compression of call-path profiles based on incremental clustering of a series of single-iteration profiles that scales in terms of the number of iterations without sacrificing important performance details. Our approach offers low runtime overhead by using only a condensed version of the profile data when calculating distances and accounts for process-dependent variations by making all clustering decisions locally.
- Published
- 2009
- Full Text
- View/download PDF
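The incremental-clustering idea behind the semantic compression described above can be sketched in a few lines: keep one representative profile per cluster and merge each new iteration's profile into the nearest representative if it is close enough. The distance metric and threshold here are illustrative knobs, not the paper's.

```python
def distance(a, b):
    """Manhattan distance between two profile vectors (illustrative)."""
    return sum(abs(x - y) for x, y in zip(a, b))

def cluster_profiles(profiles, threshold=1.0):
    """One streaming pass over per-iteration profiles.
    Returns (representatives, assignment): buffer space grows with the
    number of clusters, not the number of iterations."""
    reps, assignment = [], []
    for p in profiles:
        best = min(range(len(reps)),
                   key=lambda i: distance(p, reps[i]), default=None)
        if best is not None and distance(p, reps[best]) <= threshold:
            assignment.append(best)
        else:
            reps.append(p)
            assignment.append(len(reps) - 1)
    return reps, assignment

# Iterations 0-2 look alike; iteration 3 changes behavior and opens a cluster.
reps, assignment = cluster_profiles(
    [[1.0, 2.0], [1.1, 2.0], [0.9, 2.1], [5.0, 9.0]])
```

Four iterations compress into two representatives while the behavioral shift at iteration 3 stays visible, which is the "compression without losing important details" property the abstract claims.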
14. Scalable massively parallel I/O to task-local files
- Author
-
Ventsislav Petkov, Wolfgang Frings, and Felix Wolf
- Subjects
File system, Input/output, Computer science, Computer file, Stub file, Device file, Class implementation file, Unix file types, Virtual file system, Torrent file, File Control Block, Self-certifying File System, Journaling file system, Data file, Operating system, Versioning file system, Fork (file system), SSH File Transfer Protocol, File synchronization, File system fragmentation, Flash file system
- Abstract
Parallel applications often store data in multiple task-local files, for example, to remember checkpoints, to circumvent memory limitations, or to record performance data. When operating at very large processor configurations, such applications often experience scalability limitations when the simultaneous creation of thousands of files causes metadata-server contention, when large file counts complicate file management, or when operations on those files destabilize the file system. SIONlib is a parallel I/O library that addresses this problem by transparently mapping a large number of task-local files onto a small number of physical files, using internal metadata handling and block alignment to ensure high performance. While requiring only minimal source-code changes, SIONlib significantly reduces file-creation overhead and simplifies file handling without penalizing read and write performance. We evaluate SIONlib's efficiency with up to 288 K tasks and report significant performance improvements in two application scenarios.
- Published
- 2009
- Full Text
- View/download PDF
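The mapping SIONlib performs can be pictured with a small layout sketch: many task-local logical files share one physical file, with each task's region aligned to a block boundary so concurrent writers do not interfere. The block size and function names below are illustrative assumptions, not SIONlib's API.

```python
BLOCK = 4096  # assumed file-system block size

def align_up(n, block=BLOCK):
    """Round n up to the next multiple of the block size."""
    return (n + block - 1) // block * block

def layout(chunk_sizes):
    """chunk_sizes: requested bytes per task.
    Returns (per-task offsets in the shared physical file, total size).
    Block alignment keeps each task's writes in its own blocks."""
    offsets, pos = [], 0
    for size in chunk_sizes:
        offsets.append(pos)
        pos += align_up(size)
    return offsets, pos

# Three tasks asking for uneven amounts still get block-aligned regions.
offsets, total = layout([100, 5000, 4096])
```

With such a layout, one file creation replaces thousands, which is where the metadata-server relief described in the abstract comes from; the alignment padding is the price paid for contention-free writes.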