A dynamic block-level execution profiler

Authors :: Francis B. Moreira
Matthias Diener
Marco A. Z. Alves
Philippe O. A. Navaux
Israel Koren
Source :: Parallel Computing. 54:15-28
Publication Year :: 2016
Publisher :: Elsevier BV, 2016.
Abstract: We introduce a hardware-based mechanism to dynamically profile application blocks. Profiling information is used to prioritize critical memory loads during execution. Our mechanism yields better accuracy and performance gains than previous proposals. We extensively analyze how our mechanism improves performance. Results show that it alleviates prefetch inter-core interference.

Most performance-enhancing mechanisms in current processors, such as branch predictors or prefetchers, rely on program characteristics monitored at the granularity of single instructions. However, many of these characteristics can be obtained at the basic-block level instead. The coarser granularity allows a larger portion of the code to be examined, enabling more accurate profiling and a detailed analysis of the different types of instructions executed within a block. Block-level analysis can therefore be advantageous for performance-enhancing mechanisms, as it allows us to look at how the instructions influence each other, and thus to detect complex behavior patterns.

In this paper, we present the Dynamic Block-Level Execution Profiler (DBLEP), an online mechanism that operates at the basic-block level and profiles micro-architectural bottlenecks, such as delinquent memory loads, hard-to-predict branches, and contention for functional units. DBLEP provides information that can be used to reduce the impact of these bottlenecks. A prefetch dropping scheme and a memory controller policy were developed to use the code profiling information provided by DBLEP. By taking advantage of the high profiling accuracy, these mechanisms improve the processor's performance by up to 18.6% (5.3% on average). We show that our mechanism's performance is comparable to that of mechanisms that work at single-instruction granularity, while using less hardware.
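The abstract does not describe DBLEP's internal structures, so the following is only a minimal software sketch of the general idea: counters are kept per basic block, blocks with many missing loads are flagged as critical, and memory requests are ordered so that demand loads from critical blocks go first and prefetches go last. Every name here (`BlockStats`, `BlockProfiler`, `prioritize`), the miss-ratio threshold, and the three-level ordering are assumptions made for illustration, not the paper's actual hardware design or policy.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class BlockStats:
    """Counters for one basic block (hypothetical fields, not from the paper)."""
    executions: int = 0
    loads: int = 0
    load_misses: int = 0
    branch_mispredicts: int = 0


class BlockProfiler:
    """Toy profiler keyed by the start address of each basic block."""

    def __init__(self, miss_ratio_threshold=0.25, min_loads=8):
        # Both parameters are illustrative assumptions.
        self.table = defaultdict(BlockStats)
        self.miss_ratio_threshold = miss_ratio_threshold
        self.min_loads = min_loads

    def record(self, block_addr, loads, load_misses, mispredicted):
        """Accumulate the events observed during one execution of a block."""
        stats = self.table[block_addr]
        stats.executions += 1
        stats.loads += loads
        stats.load_misses += load_misses
        stats.branch_mispredicts += int(mispredicted)

    def is_critical(self, block_addr):
        """A block is 'critical' once enough of its loads have missed."""
        stats = self.table.get(block_addr)
        if stats is None or stats.loads < self.min_loads:
            return False
        return stats.load_misses / stats.loads >= self.miss_ratio_threshold


def prioritize(requests, profiler):
    """Order (kind, block_addr) requests: critical demand loads first,
    other demand loads next, prefetches last. The sort is stable, so
    arrival order is preserved within each class."""
    def rank(request):
        kind, block_addr = request
        if kind == "prefetch":
            return 2
        return 0 if profiler.is_critical(block_addr) else 1

    return sorted(requests, key=rank)


profiler = BlockProfiler()
for _ in range(4):
    profiler.record(0x400, loads=4, load_misses=2, mispredicted=False)
    profiler.record(0x800, loads=4, load_misses=0, mispredicted=True)

pending = [("prefetch", 0x400), ("demand", 0x800), ("demand", 0x400)]
ordered = prioritize(pending, profiler)
```

In this toy run, the block at `0x400` crosses the assumed miss-ratio threshold, so its demand load is moved ahead of the other requests, while the prefetch is served last. A real prefetch dropping scheme would discard low-priority prefetches under contention rather than merely reorder them.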