42 results for "Joan-Manuel Parcerisa"
Search Results
2. DTM-NUCA: Dynamic Texture Mapping-NUCA for Energy-Efficient Graphics Rendering.
- Author
-
David Corbalán-Navarro, Juan L. Aragón, Joan-Manuel Parcerisa, and Antonio González 0001
- Published
- 2022
- Full Text
- View/download PDF
3. DTexL: Decoupled Raster Pipeline for Texture Locality.
- Author
-
Diya Joseph, Juan L. Aragón, Joan-Manuel Parcerisa, and Antonio González 0001
- Published
- 2022
- Full Text
- View/download PDF
4. TCOR: A Tile Cache with Optimal Replacement.
- Author
-
Diya Joseph, Juan L. Aragón, Joan-Manuel Parcerisa, and Antonio González 0001
- Published
- 2022
- Full Text
- View/download PDF
5. Rendering Elimination: Early Discard of Redundant Tiles in the Graphics Pipeline.
- Author
-
Martí Anglada, Enrique de Lucas, Joan-Manuel Parcerisa, Juan L. Aragón, Pedro Marcuello, and Antonio González 0001
- Published
- 2019
- Full Text
- View/download PDF
6. Early Visibility Resolution for Removing Ineffectual Computations in the Graphics Pipeline.
- Author
-
Martí Anglada, Enrique de Lucas, Joan-Manuel Parcerisa, Juan L. Aragón, and Antonio González 0001
- Published
- 2019
- Full Text
- View/download PDF
7. Ultra-low power render-based collision detection for CPU/GPU systems.
- Author
-
Enrique de Lucas, Pedro Marcuello, Joan-Manuel Parcerisa, and Antonio González 0001
- Published
- 2015
- Full Text
- View/download PDF
8. Eliminating redundant fragment shader executions on a mobile GPU via hardware memoization.
- Author
-
José-María Arnau, Joan-Manuel Parcerisa, and Polychronis Xekalakis
- Published
- 2014
- Full Text
- View/download PDF
9. Parallel frame rendering: Trading responsiveness for energy on a mobile GPU.
- Author
-
José-María Arnau, Joan-Manuel Parcerisa, and Polychronis Xekalakis
- Published
- 2013
- Full Text
- View/download PDF
10. TEAPOT: a toolset for evaluating performance, power and image quality on mobile graphics systems.
- Author
-
José-María Arnau, Joan-Manuel Parcerisa, and Polychronis Xekalakis
- Published
- 2013
- Full Text
- View/download PDF
11. Boosting mobile GPU performance with a decoupled access/execute fragment processor.
- Author
-
José-María Arnau, Joan-Manuel Parcerisa, and Polychronis Xekalakis
- Published
- 2012
- Full Text
- View/download PDF
12. Early Register Release for Out-of-Order Processors with Register Windows.
- Author
-
Eduardo Quiñones, Joan-Manuel Parcerisa, and Antonio González 0001
- Published
- 2007
- Full Text
- View/download PDF
13. Improving Branch Prediction and Predicated Execution in Out-of-Order Processors.
- Author
-
Eduardo Quiñones, Joan-Manuel Parcerisa, and Antonio González 0001
- Published
- 2007
- Full Text
- View/download PDF
14. Selective predicate prediction for out-of-order processors.
- Author
-
Eduardo Quiñones, Joan-Manuel Parcerisa, and Antonio González 0001
- Published
- 2006
- Full Text
- View/download PDF
15. Memory Bank Predictors.
- Author
-
Stefan Bieschewski, Joan-Manuel Parcerisa, and Antonio González 0001
- Published
- 2005
- Full Text
- View/download PDF
16. Efficient Interconnects for Clustered Microarchitectures.
- Author
-
Joan-Manuel Parcerisa, Julio Sahuquillo, Antonio González 0001, and José Duato
- Published
- 2002
- Full Text
- View/download PDF
17. Reducing wire delay penalty through value prediction.
- Author
-
Joan-Manuel Parcerisa and Antonio González 0001
- Published
- 2000
- Full Text
- View/download PDF
18. Dynamic Cluster Assignment Mechanisms.
- Author
-
Ramon Canal, Joan-Manuel Parcerisa, and Antonio González 0001
- Published
- 2000
- Full Text
- View/download PDF
19. A Cost-Effective Clustered Architecture.
- Author
-
Ramon Canal, Joan-Manuel Parcerisa, and Antonio González 0001
- Published
- 1999
- Full Text
- View/download PDF
20. The Synergy of Multithreading and Access/Execute Decoupling.
- Author
-
Joan-Manuel Parcerisa and Antonio González 0001
- Published
- 1999
- Full Text
- View/download PDF
21. Dynamic Sampling Rate: Harnessing Frame Coherence in Graphics Applications for Energy-Efficient GPUs
- Author
-
Martí Anglada, Enrique de Lucas, Joan-Manuel Parcerisa, Juan L. Aragón, Antonio González, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Universitat Politècnica de Catalunya. ARCO - Microarquitectura i Compiladors
- Subjects
FOS: Computer and information sciences ,Informàtica::Infografia [Àrees temàtiques de la UPC] ,Energia -- Consum ,Computer Vision and Pattern Recognition (cs.CV) ,Computer Science - Computer Vision and Pattern Recognition ,GPU ,Fragment shading ,Unitats de processament gràfic ,Theoretical Computer Science ,Energy consumption ,Hardware and Architecture ,Hardware Architecture (cs.AR) ,Three-dimensional imaging ,Tile-based rendering ,Computer Science - Hardware Architecture ,Sampling ,Informàtica::Arquitectura de computadors [Àrees temàtiques de la UPC] ,Graphics processing units ,Software ,Imatgeria tridimensional ,Information Systems ,ComputingMethodologies_COMPUTERGRAPHICS - Abstract
In real-time rendering, a 3D scene is modelled with meshes of triangles that the GPU projects to the screen. They are discretized by sampling each triangle at regular space intervals to generate fragments, to which a shader program then adds texture and lighting effects. Realistic scenes require detailed geometric models, complex shaders, high-resolution displays and high screen refresh rates, which all come at a great compute time and energy cost. This cost is often dominated by the fragment shader, which runs for each sampled fragment. Conventional GPUs sample the triangles once per pixel; however, there are many screen regions with low variation that produce identical fragments and could be sampled at lower than pixel rate with no loss in quality. Additionally, as temporal frame coherence makes consecutive frames very similar, such variations are usually maintained from frame to frame. This work proposes Dynamic Sampling Rate (DSR), a novel hardware mechanism to reduce redundancy and improve the energy efficiency in graphics applications. DSR analyzes the spatial frequencies of the scene once it has been rendered. Then, it leverages the temporal coherence in consecutive frames to decide, for each region of the screen, the lowest sampling rate to employ in the next frame that maintains image quality. We evaluate the performance of a state-of-the-art mobile GPU architecture extended with DSR for a wide variety of applications. Experimental results show that DSR is able to remove most of the redundancy inherent in the color computations at fragment granularity, which brings average speedups of 1.68x and energy savings of 40%. This work has been supported by the CoCoUnit ERC Advanced Grant of the EU's Horizon 2020 program (Grant No. 833057), the Spanish State Research Agency (MCIN/AEI) under Grant PID2020-113172RB-I00, the ICREA Academia program, and the Generalitat de Catalunya under Grant FI-DGR 2016.
Funding was provided by Ministerio de Economía, Industria y Competitividad, Gobierno de España (Grant No. TIN2016-75344-R).
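The per-tile decision that DSR makes each frame can be sketched as follows (a minimal illustration, assuming a scalar spatial-variation metric per tile and the three rates shown; the metric, thresholds, and rate set are hypothetical stand-ins for the paper's hardware frequency analysis):

```python
def next_frame_rate(spatial_variation, threshold=8.0):
    """Pick the coarsest sampling rate for a tile in the next frame.

    Low spatial variation in the frame just rendered means neighbouring
    pixels are near-identical, so the tile can be shaded below pixel rate;
    temporal frame coherence makes the choice hold for the next frame.
    The variation metric and thresholds here are hypothetical.
    """
    if spatial_variation < threshold / 4:
        return 0.25   # e.g. one fragment shaded per 2x2 pixel block
    if spatial_variation < threshold:
        return 0.5
    return 1.0        # conventional one-sample-per-pixel shading

# Flat background tiles get a quarter rate; detailed tiles stay at full rate.
rates = {tile: next_frame_rate(v)
         for tile, v in {"sky": 0.3, "text": 12.0}.items()}
```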
- Published
- 2022
- Full Text
- View/download PDF
22. The Latency Hiding Effectiveness of Decoupled Access/Execute Processors.
- Author
-
Joan-Manuel Parcerisa and Antonio González 0001
- Published
- 1998
- Full Text
- View/download PDF
23. Eliminating Cache Conflict Misses through XOR-Based Placement Functions.
- Author
-
Antonio González 0001, Mateo Valero, Nigel P. Topham, and Joan-Manuel Parcerisa
- Published
- 1997
- Full Text
- View/download PDF
24. Early Visibility Resolution for Removing Ineffectual Computations in the Graphics Pipeline
- Author
-
Enrique de Lucas, Juan L. Aragón, Antonio Gonzalez, Joan-Manuel Parcerisa, Marti Anglada, Universitat Politècnica de Catalunya. Doctorat en Arquitectura de Computadors, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Universitat Politècnica de Catalunya. ARCO - Microarquitectura i Compiladors
- Subjects
Computer science ,Computation ,Resolution (electron density) ,Visibility (geometry) ,ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION ,Energies::Eficiència energètica [Àrees temàtiques de la UPC] ,Infografia ,So, imatge i multimèdia::Creació multimèdia::Imatge digital [Àrees temàtiques de la UPC] ,Energy conservation ,Graphics pipeline ,Pipeline transport ,Computer graphics ,Energy efficiency ,Computer graphics (images) ,Tile-based rendering ,Visibility ,Energia -- Estalvi ,Tiled rendering ,ComputingMethodologies_COMPUTERGRAPHICS ,Efficient energy use - Abstract
GPUs' main workload is real-time image rendering. These applications take a description of an (animated) scene and produce the corresponding image(s). An image is rendered by computing the colors of all its pixels. It is normal for multiple objects to overlap at each pixel. Consequently, a significant amount of processing is devoted to objects that will not be visible in the final image, in spite of the widespread use of the Early Depth Test in modern GPUs, which attempts to discard computations related to occluded objects. Since animations are created by a sequence of similar images, visibility usually does not change much across consecutive frames. Based on this observation, we present Early Visibility Resolution (EVR), a mechanism that leverages the visibility information obtained in a frame to predict the visibility in the following one. Our proposal speculatively determines visibility much earlier in the pipeline than the Early Depth Test. We leverage this early visibility estimation to remove ineffectual computations at two different granularities: pixel-level and tile-level. Results show that such optimizations lead to 39% performance improvement and 43% energy savings for a set of commercial Android graphics applications running on state-of-the-art mobile GPUs.
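The prediction EVR makes can be sketched in software (a rough illustration only: the per-tile depth summary and the depth convention, smaller = closer, are assumptions; the paper's hardware keeps richer visibility information, and mispredictions are still caught by the regular Early Depth Test):

```python
class EarlyVisibilityResolution:
    """Predict occlusion from last frame's per-tile visibility (a sketch)."""

    def __init__(self):
        self.prev_far = {}   # tile id -> farthest depth that was visible

    def predicted_occluded(self, tile, nearest_depth):
        # An object wholly behind everything visible in the previous frame
        # is speculatively discarded before shading.
        far = self.prev_far.get(tile)
        return far is not None and nearest_depth > far

    def end_of_frame(self, tile, farthest_visible_depth):
        # Record this frame's visibility to guide the next frame.
        self.prev_far[tile] = farthest_visible_depth
```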
- Published
- 2019
25. Rendering Elimination: Early Discard of Redundant Tiles in the Graphics Pipeline
- Author
-
Marti Anglada, Juan L. Aragón, Antonio González, Pedro Marcuello, Joan-Manuel Parcerisa, Enrique de Lucas, Universitat Politècnica de Catalunya. Doctorat en Arquitectura de Computadors, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Universitat Politècnica de Catalunya. ARCO - Microarquitectura i Compiladors
- Subjects
FOS: Computer and information sciences ,Speedup ,Computer science ,Memory bandwidth ,Energy consumption ,Parallel computing ,Supercomputers ,Graphics pipeline ,Rendering (computer graphics) ,Energy efficiency ,Supercomputadors ,Hardware Architecture (cs.AR) ,Tiled rendering ,High performance computing ,Graphics ,Computer Science - Hardware Architecture ,Informàtica::Arquitectura de computadors::Arquitectures paral·leles [Àrees temàtiques de la UPC] ,Càlcul intensiu (Informàtica) ,Efficient energy use ,Tile based rendering ,ComputingMethodologies_COMPUTERGRAPHICS - Abstract
GPUs are one of the most energy-consuming components for real-time rendering applications, since a large number of fragment shading computations and memory accesses are involved. Main memory bandwidth is especially taxing for battery-operated devices such as smartphones. Tile-Based Rendering GPUs divide the screen space into multiple tiles that are independently rendered in on-chip buffers, thus reducing memory bandwidth and energy consumption. We have observed that, in many animated graphics workloads, a large number of screen tiles have the same color across adjacent frames. In this paper, we propose Rendering Elimination (RE), a novel micro-architectural technique that accurately determines, by comparing signatures, whether a tile will be identical to the same tile in the preceding frame before rasterization. Since RE identifies redundant tiles early in the graphics pipeline, it completely avoids the computation and memory accesses of the most power-consuming stages of the pipeline, which substantially reduces the execution time and the energy consumption of the GPU. For widely used Android applications, we show that RE achieves an average speedup of 1.74x and energy reduction of 43% for the GPU/Memory system, surpassing by far the benefits of Transaction Elimination, a state-of-the-art memory bandwidth reduction technique available in some commercial Tile-Based Rendering GPUs.
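The signature comparison at the heart of RE can be sketched in software (a minimal illustration; the tile-input representation and the MD5 hash are hypothetical stand-ins for the paper's hardware signatures):

```python
import hashlib

def tile_signature(tile_inputs):
    """Hash a serialized form of a tile's rendering inputs (a sketch)."""
    data = repr(sorted(tile_inputs.items())).encode()
    return hashlib.md5(data).hexdigest()

class RenderingElimination:
    """Skip a tile whose signature matches the previous frame's."""

    def __init__(self):
        self.prev_signatures = {}   # tile id -> signature from last frame

    def should_render(self, tile_id, tile_inputs):
        sig = tile_signature(tile_inputs)
        redundant = self.prev_signatures.get(tile_id) == sig
        self.prev_signatures[tile_id] = sig
        # A redundant tile keeps its previous color: rasterization,
        # shading, and the associated memory traffic are all avoided.
        return not redundant
```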
- Published
- 2018
- Full Text
- View/download PDF
26. Eliminating redundant fragment shader executions on a mobile GPU via hardware memoization
- Author
-
Polychronis Xekalakis, Jose-Maria Arnau, Joan-Manuel Parcerisa, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Universitat Politècnica de Catalunya. ARCO - Microarquitectura i Compiladors
- Subjects
Informàtica::Infografia [Àrees temàtiques de la UPC] ,Speedup ,Computer science ,Memoization ,ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION ,Software rendering ,General Medicine ,Parallel computing ,Animation ,Rendering (computer graphics) ,Redundancy ,Redundancy (engineering) ,Rendering (Computer graphics) ,Locality of reference ,Animació per ordinador ,Graphics processing units ,Shader - Abstract
Redundancy is at the heart of graphical applications. In fact, generating an animation typically involves a succession of extremely similar images. In terms of rendering these images, this behavior translates into the creation of many fragment programs with the exact same input data. We have measured this fragment redundancy for a set of commercial Android applications, and found that more than 40% of the fragments used in a frame have already been computed in a prior frame. In this paper we try to exploit this redundancy using fragment memoization. Unfortunately, this is not an easy task, as most of the redundancy exists across frames, rendering most HW-based schemes unfeasible. We thus first take a step back and analyze the temporal locality of the redundant fragments, their complexity, and the number of inputs typically seen in fragment programs. The result of our analysis is a task-level memoization scheme that easily outperforms the current state-of-the-art in low-power GPUs. More specifically, our experimental results show that our scheme is able to remove 59.7% of the redundant fragment computations on average. This materializes to a significant speedup of 17.6% on average, while also improving the overall energy efficiency by 8.9% on average.
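Fragment memoization as described above can be sketched with a lookup table keyed by the fragment program's inputs (a minimal illustration; the capacity, keying, and toy shader are hypothetical, and the paper's scheme operates at task granularity in hardware):

```python
class FragmentMemoizer:
    """Cache fragment-shader outputs keyed by their inputs (a sketch)."""

    def __init__(self, capacity=1 << 14):
        self.table = {}
        self.capacity = capacity
        self.hits = 0

    def shade(self, inputs, shader):
        # `inputs` must be hashable, e.g. a tuple of interpolated attributes.
        if inputs in self.table:
            self.hits += 1              # redundant fragment: reuse the color
            return self.table[inputs]
        color = shader(inputs)
        if len(self.table) < self.capacity:
            self.table[inputs] = color
        return color

calls = []
def toy_shader(inputs):
    calls.append(inputs)                # stands in for an expensive shader
    return tuple(2 * x for x in inputs)

memo = FragmentMemoizer()
memo.shade((1, 2), toy_shader)
memo.shade((1, 2), toy_shader)          # inter-frame redundancy: shader skipped
```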
- Published
- 2014
27. An energy-efficient memory unit for clustered microarchitectures
- Author
-
Stefan Bieschewski, Antonio González, Joan-Manuel Parcerisa, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Universitat Politècnica de Catalunya. ARCO - Microarquitectura i Compiladors
- Subjects
Computer science ,Cache memory ,Memòria cau ,Distributed architectures ,02 engineering and technology ,Parallel computing ,Cache memories ,01 natural sciences ,Theoretical Computer Science ,Parallel architectures ,0103 physical sciences ,0202 electrical engineering, electronic engineering, information engineering ,Memòria ràpida de treball (Informàtica) ,Microprocessors ,Informàtica::Arquitectura de computadors [Àrees temàtiques de la UPC] ,Store buffer ,010302 applied physics ,020202 computer hardware & architecture ,Microarchitecture ,Memory management ,Computational Theory and Mathematics ,Hardware and Architecture ,Microprocessadors ,Distributed memory ,Cache ,Clustered architectures ,Software ,SPECfp ,Efficient energy use - Abstract
Whereas clustered microarchitectures themselves have been extensively studied, the memory units for these clustered microarchitectures have received relatively little attention. This article discusses some of the inherent challenges of clustered memory units and shows how these can be overcome. Clustered memory pipelines work well with the late allocation of load/store queue entries and physically unordered queues. Yet this approach has characteristic problems such as queue overflows and allocation patterns that lead to deadlocks. We propose techniques to solve each of these problems and show that a distributed memory unit can offer significant energy savings and speedups over a centralized unit. For instance, compared to a centralized cache with a load/store queue of 64/24 entries, our four-cluster distributed memory unit with load/store queues of 16/8 entries each consumes 31 percent less energy and performs 4.7 percent better on SPECint, and consumes 36 percent less energy and performs 7 percent better on SPECfp.
- Published
- 2016
28. Leveraging Register Windows to Reduce Physical Registers to the Bare Minimum
- Author
-
Joan-Manuel Parcerisa, Antonio González, Eduardo Quinones, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Universitat Politècnica de Catalunya. ARCO - Microarquitectura i Compiladors
- Subjects
Instruction register ,Memory buffer register ,Computer science ,Informàtica::Enginyeria del software [Àrees temàtiques de la UPC] ,File organization (Computer science) ,Register file ,Parallel computing ,computer.software_genre ,Fitxers informàtics -- Organització ,Theoretical Computer Science ,Early register release ,Memory address register ,Control register ,Hardware register ,Hardware_REGISTER-TRANSFER-LEVELIMPLEMENTATION ,Out-of-order execution ,Processor register ,Software architecture ,FLAGS register ,Register renaming ,Register window ,Stack register ,Microarchitecture ,Physical register file ,Computational Theory and Mathematics ,Hardware and Architecture ,Status register ,Operating system ,Programari -- Disseny ,Hardware_CONTROLSTRUCTURESANDMICROPROGRAMMING ,Memory data register ,computer ,Software ,Register windows ,Register allocation - Abstract
Register windows are an architectural technique that reduces the memory operations required to save and restore registers across procedure calls. Their effectiveness depends on the size of the register file. Such register requirements are normally increased for out-of-order execution, because it requires registers for the in-flight instructions in addition to the architectural ones. However, a large register file has an important cost in terms of area and power and may even affect the cycle time. In this paper, we propose a software/hardware early register release technique that leverages register windows to drastically reduce the register requirements, and hence the register file cost. Contrary to the common belief that out-of-order processors with register windows would need a large physical register file, this paper shows that the physical register file size may be reduced to the bare minimum by using this novel microarchitecture. Moreover, our proposal has much lower hardware complexity than previous approaches, and requires minimal changes to a conventional register window scheme. Performance studies show that the proposed technique can reduce the number of physical registers to the number of logical registers plus one (the minimum number to guarantee forward progress) and still achieve almost the same performance as an unbounded register file.
- Published
- 2010
29. [Untitled]
- Author
-
Joan-Manuel Parcerisa, Ramon Canal, and Antonio González
- Subjects
Scheme (programming language) ,Multi-core processor ,Computer science ,Workload ,Parallel computing ,Theoretical Computer Science ,Microarchitecture ,Theory of computation ,Code (cryptography) ,Overhead (computing) ,Cluster analysis ,computer ,Software ,Information Systems ,computer.programming_language - Abstract
Recent works show that delays introduced in the issue and bypass logic will become critical for wide-issue superscalar processors. One of the proposed solutions is clustering the processor core. Clustered architectures benefit from a less complex partitioned processor core and thus incur less critical delays. In this paper, we propose a dynamic instruction steering logic for these clustered architectures that decides at decode time the cluster where each instruction is executed. The performance of clustered architectures depends on the inter-cluster communication overhead and the workload balance. We present a scheme that uses runtime information to optimize the trade-off between these figures. The evaluation shows that this scheme can achieve an average speed-up of 35% over a conventional 8-way issue (4 int + 4 fp) machine and that it outperforms other previous proposals, either static or dynamic.
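A dynamic steering decision trading communication against workload balance can be sketched as a cost minimization (an illustration only; the cost model and the communication penalty weight are hypothetical, not the paper's actual heuristic):

```python
def steer(sources, cluster_of, pending, comm_penalty=2):
    """Pick the cluster with the lowest combined cost (a sketch).

    sources    -- registers the instruction reads
    cluster_of -- producing cluster of each register steered so far
    pending    -- per-cluster count of queued instructions (workload)
    """
    best, best_cost = 0, float("inf")
    for c in range(len(pending)):
        # One inter-cluster copy per source value produced elsewhere.
        comm = sum(comm_penalty for s in sources
                   if s in cluster_of and cluster_of[s] != c)
        cost = comm + pending[c]
        if cost < best_cost:
            best, best_cost = c, cost
    return best
```

At decode time the steering logic would call this per instruction, then record the chosen cluster in `cluster_of` for the instruction's destination register.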
- Published
- 2001
30. Neither more nor less: optimizing thread-level parallelism for GPGPUs
- Author
-
Polychronis Xekalakis, Jose-Maria Arnau, and Joan-Manuel Parcerisa
- Subjects
Memory management ,Computer science ,business.industry ,Bandwidth (computing) ,Task parallelism ,Mobile telephony ,Parallel computing ,business ,Scheduling (computing) - Published
- 2013
31. TEAPOT: a toolset for evaluating performance, power and image quality on mobile graphics systems
- Author
-
Joan-Manuel Parcerisa, Polychronis Xekalakis, Jose-Maria Arnau, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Universitat Politècnica de Catalunya. ARCO - Microarquitectura i Compiladors
- Subjects
Mobile computing ,Informàtica::Infografia [Àrees temàtiques de la UPC] ,Computer science ,Image quality ,Opengl es ,Computer graphics ,Informàtica mòbil ,Low-power graphics ,Computer engineering ,Computer graphics (images) ,Simulation infrastructure ,Android (operating system) ,Graphics ,Mobile gpu ,ComputingMethodologies_COMPUTERGRAPHICS - Abstract
In this paper we present TEAPOT, a full system GPU simulator, whose goal is to allow the evaluation of the GPUs that reside in mobile phones and tablets. To this end, it has a cycle-accurate GPU model for evaluating performance, power models for the GPU, the memory subsystem and OLED screens, and image quality metrics. Unlike prior GPU simulators, TEAPOT supports the OpenGL ES 1.1/2.0 API, so that it can simulate all commercial graphical applications available for Android systems. To illustrate potential uses of this simulation infrastructure, we perform two case studies. We first turn our attention to evaluating the impact of the OS when simulating graphical applications. We show that the overall GPU power/performance is greatly affected by common OS tasks, such as image composition, and argue that application-level simulation is not sufficient to understand the overall GPU behavior. We then utilize the capabilities of TEAPOT to perform studies that trade image quality for energy. We demonstrate that by allowing for small distortions in the overall image quality, a significant amount of energy can be saved.
- Published
- 2013
- Full Text
- View/download PDF
32. Work in progress-improving feedback using an automatic assessment tool
- Author
-
Ruben Tous, C. Perez, Jordi Tubella, M. Fernandez, Joan-Manuel Parcerisa, D. López, Carlos Alvarez, Daniel Jiménez-González, Javier Alonso, and P. Barlet
- Subjects
Correctness ,Assembly language ,business.industry ,Computer science ,media_common.quotation_subject ,Work in process ,Test (assessment) ,Microarchitecture ,World Wide Web ,Debugging ,Web page ,ComputingMilieux_COMPUTERSANDEDUCATION ,Virtual learning environment ,Software engineering ,business ,computer ,computer.programming_language ,media_common - Abstract
First-year computer science students usually develop assembly programs to learn processor architecture. Homework exercises are done on paper, while those in lab sessions are solved with the aid of programming tools. Students perceive theory and lab as different subjects, so they don't use lab tools to test their theory-solved problems. Moreover, during lab sessions, students often tend to ask for the teacher's guidance and advice instead of using the debugging tools, because these are new and unfriendly to them and do not offer quick and clear feedback. In this paper we present an automatic and friendly assessment tool, SISA-EMU, with a novel feature: exercise-driven feedback with the teacher's expertise. It provides correctness information and clues that help students fix their most common mistakes for each individual problem (rather than generic debug information) without the physical presence of a teacher. SISA-EMU is currently in a pre-deploy phase via a Moodle learning platform, and we will have first evaluation results by the end of the current term.
- Published
- 2008
33. Selective predicate prediction for out-of-order processors
- Author
-
Antonio González, Eduardo Quiñones, and Joan-Manuel Parcerisa
- Subjects
Scheme (programming language) ,Out-of-order execution ,Transformation (function) ,Theoretical computer science ,Ideal (set theory) ,Computer science ,Branch misprediction ,Control (linguistics) ,computer ,Rename ,Predicate (grammar) ,computer.programming_language - Abstract
If-conversion transforms control dependencies into data dependencies by using a predication mechanism. It is useful to eliminate hard-to-predict branches and to reduce the severe performance impact of branch mispredictions. However, the use of predicated execution in out-of-order processors has to deal with two problems: there can be multiple definitions for a single destination register at rename time, and instructions with a false predicate consume unnecessary resources. Predicting predicates is an effective approach to address both problems. However, predicting predicates that come from hard-to-predict branches is not beneficial in general, because this approach reverses the if-conversion transformation, losing its potential benefits. In this paper we propose a new scheme that dynamically selects which predicates are worth predicting, and which ones are more effective in their if-converted form. We show that our approach significantly outperforms previously proposed schemes. Moreover, it performs within 5% of an ideal scheme with perfect predicate prediction.
- Published
- 2006
34. Memory bank predictors
- Author
-
Antonio González, S. Bieschewski, Joan-Manuel Parcerisa, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Universitat Politècnica de Catalunya. ARCO - Microarquitectura i Compiladors
- Subjects
Computer science ,Cache coloring ,CPU cache ,Cache memory ,Pipeline burst cache ,Memòria cau ,Cache pollution ,Cache-oblivious algorithm ,Non-uniform memory access ,Clustered microarchitectures ,Write-once ,Cache invalidation ,Superscalar ,Memòria ràpida de treball (Informàtica) ,Cache algorithms ,Informàtica::Arquitectura de computadors [Àrees temàtiques de la UPC] ,Snoopy cache ,Hardware_MEMORYSTRUCTURES ,Parallel processing (Electronic computers) ,business.industry ,MESI protocol ,Processament en paral·lel (Ordinadors) ,Cache-only memory architecture ,Uniform memory access ,MESIF protocol ,Distributed cache ,Microarchitecture ,Smart Cache ,Memory bank ,Memory bank prediction ,Computer architecture ,Bus sniffing ,Hit rate ,Page cache ,Cache ,business ,Computer network - Abstract
Cache memories are commonly implemented through multiple memory banks to improve bandwidth and latency. The early knowledge of the data cache bank that an instruction will access can help to improve the performance in several ways. One scenario that is likely to become increasingly important is clustered microprocessors with a distributed cache. This work presents a study of different cache bank predictors. We show that effective bank predictors can be implemented with relatively low cost. For instance, a predictor of approximately 4 Kbytes is shown to achieve an average hit rate of 78% for SPECint2000 when used to predict accesses to an 8-bank cache memory in a contemporary superscalar processor. We also show how a predictor can be used to reduce the communication latency caused by memory accesses in a clustered microarchitecture with a distributed cache design.
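A simple last-bank predictor of the kind evaluated above can be sketched as follows (a minimal illustration; the table size, indexing, and line size are hypothetical, and the paper compares several predictor designs):

```python
class BankPredictor:
    """Last-bank predictor indexed by instruction address (a sketch)."""

    def __init__(self, entries=1024, banks=8, line_bytes=64):
        self.table = [0] * entries
        self.entries, self.banks, self.line_bytes = entries, banks, line_bytes

    def predict(self, pc):
        # Early in the pipeline only the PC is known, so predict the bank
        # this load/store touched the last time it executed.
        return self.table[pc % self.entries]

    def train(self, pc, address):
        # Once the address resolves, record the true bank, which is a
        # function of the accessed cache-line address.
        self.table[pc % self.entries] = (address // self.line_bytes) % self.banks
```

In a clustered design, the predicted bank lets the steering logic place the instruction near the cache bank it will likely access, hiding part of the inter-cluster communication latency.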
- Published
- 2005
- Full Text
- View/download PDF
35. The latency hiding effectiveness of decoupled access/execute processors
- Author
-
Antonio González, Joan-Manuel Parcerisa, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Universitat Politècnica de Catalunya. ARCO - Microarquitectura i Compiladors
- Subjects
Cache storage ,Hardware_MEMORYSTRUCTURES ,Parallel processing (Electronic computers) ,business.industry ,Computer science ,Cache memory ,Processament en paral·lel (Ordinadors) ,Memòria cau ,Parallel computing ,EPIC ,CAS latency ,Issue logic ,Embedded system ,Memòria ràpida de treball (Informàtica) ,Computer architecture ,Cache ,Delays ,Latency (engineering) ,Architecture ,business ,Speculation ,Informàtica::Arquitectura de computadors [Àrees temàtiques de la UPC] ,Compile time - Abstract
Several studies have demonstrated that out-of-order execution processors may not be the most adequate organization for wide-issue processors due to the increasing penalties that wire delays cause in the issue logic. The main target of out-of-order execution is to hide functional unit latencies and memory latency. However, the former can be quite effectively handled at compile time and this observation is one of the main arguments for the emerging EPIC architectures. In this paper, we demonstrate that a decoupled access/execute organization is very effective at hiding memory latency, even when it is very long. This paper presents a thorough evaluation of such processor organization. First, a generic decoupled access/execute architecture is defined and evaluated. Then the benefits of a lockup-free cache, control speculation and a store-load bypass mechanism under such an architecture are evaluated. Our analysis indicates that memory latency can be almost completely hidden by such techniques.
- Published
- 2002
36. Reducing wire delay penalty through value prediction
- Author
-
Antonio González, Joan-Manuel Parcerisa, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Universitat Politècnica de Catalunya. ARCO - Microarquitectura i Compiladors
- Subjects
Computer science ,Estructura lògica ,Real-time computing ,Value (computer science) ,Workstation clusters ,Workload ,Context (language use) ,Arquitectura d'ordinadors ,Reliability engineering ,Microarchitecture ,Logic synthesis ,Logic design ,Computer architecture ,Delays ,Informàtica::Arquitectura de computadors [Àrees temàtiques de la UPC] ,Degradation (telecommunications) - Abstract
In this paper we show that value prediction can be used to avoid the penalty of long wire delays by predicting the data that is communicated through these long wires and validating the prediction locally, where the value is produced. Only in the case of a misprediction is the long wire delay experienced. We apply this concept to a clustered microarchitecture in order to reduce inter-cluster communication. The predictability of values provides the dynamic instruction partitioning hardware with fewer constraints to optimize the trade-off between communication requirements and workload balance, which is the most critical issue of the partitioning scheme. We show that value prediction reduces the penalties caused by inter-cluster communication by 18% on average for a realistic implementation of a 4-cluster microarchitecture.
37. Improving latency tolerance of multithreading through decoupling
- Author
-
Joan-Manuel Parcerisa, Antonio González, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Universitat Politècnica de Catalunya. ARCO - Microarquitectura i Compiladors
- Subjects
Computer science ,Clock rate ,Processor scheduling ,Thread (computing) ,Parallel computing ,Simultaneous multithreading ,CAS latency ,Theoretical Computer Science ,Instruction set ,Super-threading ,Access/execute decoupling ,Hardware complexity ,Superscalar ,Simultaneous multithreading processors ,Temporal multithreading ,Informàtica::Arquitectura de computadors [Àrees temàtiques de la UPC] ,Parallel processing (Electronic computers) ,business.industry ,Processament en paral·lel (Ordinadors) ,Instruction-level parallelism ,Microarchitecture ,Computational Theory and Mathematics ,Hardware and Architecture ,Multithreading ,Embedded system ,Latency hiding ,business ,Software - Abstract
The increasing hardware complexity of dynamically scheduled superscalar processors may compromise the scalability of this organization to make an efficient use of future increases in transistor budget. SMT processors, designed over a superscalar core, are therefore directly concerned by this problem. The article presents and evaluates a novel processor microarchitecture which combines two paradigms: simultaneous multithreading and access/execute decoupling. Since its decoupled units issue instructions in order, this architecture is significantly less complex, in terms of critical path delays, than a centralized out-of-order design, and better suited to future growth in issue width and clock speed. We investigate how both techniques complement each other. Since decoupling features an excellent memory latency hiding efficiency, the large amount of parallelism exploited by multithreading may be used to hide the latency of functional units and keep them fully utilized. The study shows that, by adding decoupling to a multithreaded architecture, fewer threads are needed to achieve maximum throughput. Therefore, in addition to the obvious hardware complexity reduction, it places lower demands on the memory system. The study also reveals that multithreading by itself exhibits little memory latency tolerance. Results suggest that most of the latency hiding effectiveness of SMT architectures comes from the dynamic scheduling. On the other hand, decoupling is very effective at hiding memory latency. An increase in the cache miss penalty from 1 to 32 cycles reduces the performance of a 4-context multithreaded decoupled processor by less than 2 percent. For the nondecoupled multithreaded processor, the loss of performance is about 23 percent.
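A rough illustration of why decoupling tolerates memory latency (a toy in-order timeline with assumed latencies, not the paper's evaluation model): the access unit slips ahead of the execute unit and issues loads long before their results are consumed, so the miss penalty overlaps with useful work instead of stalling the pipeline.

```python
from collections import deque

MEM_LATENCY = 32  # cache-miss penalty in cycles (assumed value)

def run(decoupled, n_loads=100, work_per_load=8):
    """Toy in-order timeline: each of n_loads loads is followed by
    work_per_load cycles of execution that consume its result."""
    time = 0
    ready = deque()
    if decoupled:
        # The access unit slips ahead and issues all loads back-to-back,
        # one per cycle, long before the execute unit needs the results.
        for i in range(n_loads):
            ready.append(i + MEM_LATENCY)
    for _ in range(n_loads):
        if decoupled:
            done = ready.popleft()
        else:
            done = time + MEM_LATENCY  # load issued only when reached
        time = max(time, done)         # stall until the value arrives
        time += work_per_load          # execute the dependent work
    return time
```

In this sketch the coupled timeline pays the full miss penalty on every load, while the decoupled one pays it only once, at start-up, which mirrors the abstract's observation that decoupling is barely sensitive to the miss penalty.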
38. Dynamic cluster assignment mechanisms
- Author
-
Joan-Manuel Parcerisa, Ramon Canal, Antonio González, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Universitat Politècnica de Catalunya. ARCO - Microarquitectura i Compiladors
- Subjects
Parallel processing (Electronic computers) ,Computer science ,Processament en paral·lel (Ordinadors) ,Workload ,Parallel computing ,Steering logic ,Microarchitecture ,Instruction set ,Read-write memory ,Clustered microarchitectures ,Dynamically scheduled processors ,Benchmark (computing) ,Code (cryptography) ,Resource allocation ,Dynamic code partitioning ,Informàtica::Arquitectura de computadors [Àrees temàtiques de la UPC] ,Integer (computer science) - Abstract
Clustered microarchitectures are an effective approach to reducing the penalties caused by wire delays inside a chip. Current superscalar processors in fact have a two-cluster microarchitecture with a naive code partitioning approach: integer instructions are allocated to one cluster and floating-point instructions to the other. This partitioning scheme is simple and requires no communication between the two clusters (other than through memory), but it is in general far from optimal because the workload is not evenly distributed most of the time. In fact, when the processor is running integer programs, the workload is extremely unbalanced, since the FP cluster is not used at all. In this work we investigate run-time mechanisms that dynamically distribute the instructions of a program among these two clusters. By optimizing the trade-off between inter-cluster communication penalty and workload balance, the proposed schemes achieve an average speed-up of 36% for the SpecInt95 benchmark suite.
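A minimal sketch of the kind of run-time steering heuristic this abstract discusses (the function name and the imbalance threshold are illustrative choices, not the paper's exact scheme): send each instruction to the cluster holding most of its source operands to minimize communication, unless the workload gap between clusters grows too large.

```python
def steer(sources, cluster_of, workload, imbalance_limit=4):
    """Choose a cluster (0 or 1) for an instruction.

    sources:    register names read by the instruction
    cluster_of: register name -> cluster where its producer was placed
    workload:   pending-instruction count per cluster (mutated here)
    """
    votes = [0, 0]
    for reg in sources:
        if reg in cluster_of:
            votes[cluster_of[reg]] += 1
    # Prefer the cluster holding most operands (fewer communications)...
    target = 0 if votes[0] >= votes[1] else 1
    # ...but override the choice when the workload is too unbalanced.
    if workload[target] - workload[1 - target] > imbalance_limit:
        target = 1 - target
    workload[target] += 1
    return target
```

For example, with both operands produced on cluster 0 and a balanced workload, the instruction stays on cluster 0; once cluster 0 is heavily loaded, the heuristic accepts one communication to rebalance.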
39. Visibility rendering order: Improving energy efficiency on mobile GPUs through frame coherence
- Author
-
Antonio González, Enrique de Lucas, Pedro Marcuello, Joan-Manuel Parcerisa, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Universitat Politècnica de Catalunya. ARCO - Microarquitectura i Compiladors
- Subjects
Graphics Pipeline ,Computer science ,Color computer graphics ,Rasterization ,GPU ,Rendering (computer graphics) ,Rendering ,Imatges -- Processament -- Tècniques digitals ,Rendering (Computer graphics) ,Tiled rendering ,Graphics ,Image processing -- Digital techniques ,Tile based rendering ,Informàtica::Infografia [Àrees temàtiques de la UPC] ,Energy-efficiency ,Graphics pipeline ,Infografia en color ,Pixel shading ,Topological order ,Occlusion culling ,Tile based deferred rendering ,Computational Theory and Mathematics ,Computer engineering ,Hardware and Architecture ,Fragment processing ,Signal Processing ,Visibility ,Shading ,Graphics processing units ,Efficient energy use - Abstract
During real-time graphics rendering, objects are processed by the GPU in the order they are submitted by the CPU, and occluded surfaces are often processed even though they will not be part of the final image, thus wasting precious time and energy. To help discard occluded surfaces, most current GPUs include an Early-Depth test before the fragment processing stage. However, to be effective it requires that opaque objects be processed in front-to-back order. Depth sorting and other occlusion-culling techniques at the object level incur overheads that are only offset for applications with substantial depth and/or fragment shading complexity, which is often not the case in mobile workloads. We propose a novel architectural technique for mobile GPUs, Visibility Rendering Order (VRO), which reorders objects front-to-back entirely in hardware by exploiting the fact that objects in animated graphics applications tend to keep their relative depth order across consecutive frames (temporal coherence). Since order relationships are already tested by the Depth Test, VRO incurs minimal energy overhead because it only requires a small hardware unit to capture that information and use it later to guide the rendering of the following frame. Moreover, unlike other approaches, this unit works in parallel with the graphics pipeline without any performance overhead. We illustrate the benefits of VRO using various unmodified commercial 3D applications, for which VRO achieves a 27% speed-up and a 14.8% energy reduction on average over a state-of-the-art mobile GPU.
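The reordering step can be sketched as a topological sort over occlusion edges observed by the previous frame's Depth Test (an illustrative software model of the idea, not the hardware unit itself): if object A covered object B in frame N, temporal coherence suggests submitting A before B in frame N+1.

```python
from collections import defaultdict, deque

def visibility_order(objects, occludes):
    """Front-to-back order for the next frame, derived from the
    occlusion edges (a, b) -- "a was in front of b" -- recorded by the
    previous frame's depth test, via Kahn's topological sort. Objects
    caught in a cycle (or newly appeared) keep submission order at the
    end, where the regular Early-Depth test still handles them."""
    indeg = {o: 0 for o in objects}
    adj = defaultdict(list)
    for a, b in occludes:
        adj[a].append(b)
        indeg[b] += 1
    queue = deque(o for o in objects if indeg[o] == 0)
    order = []
    while queue:
        o = queue.popleft()
        order.append(o)
        for b in adj[o]:
            indeg[b] -= 1
            if indeg[b] == 0:
                queue.append(b)
    order += [o for o in objects if o not in order]  # cycles / new objects
    return order
```

A misordered object is never incorrect, only a lost energy saving: the Depth Test still resolves final visibility, which is why such a predictive reordering can run off the critical path.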
40. Improving branch prediction and predicated execution in out-of-order processors
- Author
-
Antonio González, Eduardo Quinones, Joan-Manuel Parcerisa, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Universitat Politècnica de Catalunya. ARCO - Microarquitectura i Compiladors
- Subjects
Computer science ,Branch ,Parallel computing ,Out of order ,Instruction set ,Branch predication ,Instruction sets ,Degradation ,Hardware ,Proposals ,Informàtica::Arquitectura de computadors [Àrees temàtiques de la UPC] ,Accuracy ,Compilers (Computer programs) ,Pipelines ,Out-of-order execution ,Compiladors (Programes d'ordinador) ,Registers ,Branch predictor ,Predicate (grammar) ,Costs ,Branch target predictor ,Computer aided instruction ,Compiler ,Algorithm - Abstract
If-conversion is a compiler technique that reduces the misprediction penalties caused by hard-to-predict branches, transforming control dependencies into data dependencies. Although it is globally beneficial, it has a negative side-effect because the removal of branches eliminates useful correlation information necessary for conventional branch predictors. The remaining branches may become harder to predict. However, in predicated ISAs with a compare-branch model, the correlation information not only resides in branches, but also in compare instructions that compute their guarding predicates. When a branch is removed, its correlation information is still available in its compare instruction. We propose a branch prediction scheme based on predicate prediction. It has three advantages: First, since the prediction is not done on a branch basis but on a predicate define basis, branch removal after if-conversion does not lose any correlation information, so accuracy is not degraded. Second, the mechanism we propose permits using the computed value of the branch predicate when available, instead of the predicted value, thus effectively achieving 100% accuracy on such early-resolved branches. Third, as shown in previous work, the selective predicate prediction is a very effective technique to implement if-conversion on out-of-order processors, since it avoids the problem of multiple register definitions and reduces the unnecessary resource consumption of nullified instructions. Hence, our approach enables a very efficient implementation of if-conversion for an out-of-order processor, with almost no additional hardware cost, because the same hardware is used to predict the predicates of if-converted code and to predict branches without accuracy degradation.
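A toy model of the predictor this abstract describes (a simple last-value table standing in for a real predictor; the class and method names are illustrative): predictions are made per predicate define rather than per branch, and once the defining compare instruction has executed, its computed value overrides the prediction, giving 100% accuracy on such early-resolved branches.

```python
class PredicatePredictor:
    """Predicts branch outcomes through their guarding predicates."""

    def __init__(self):
        self.history = {}   # predicate id -> last committed value
        self.computed = {}  # predicate id -> value once its compare executes

    def on_compare_executed(self, pred_id, value):
        # The compare that defines this predicate has produced its result.
        self.computed[pred_id] = value

    def predict(self, pred_id):
        # Early-resolved case: the defining compare already executed, so
        # the actual value is used and the branch cannot mispredict.
        if pred_id in self.computed:
            return self.computed[pred_id]
        return self.history.get(pred_id, True)  # default: predict taken

    def commit(self, pred_id, value):
        self.history[pred_id] = value
        self.computed.pop(pred_id, None)
```

Because the table is indexed by predicate define rather than by branch address, if-converting away a branch does not discard its correlation history, which is the first advantage the abstract lists.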
41. On-chip interconnects and instruction steering schemes for clustered microarchitectures
- Author
-
Julio Sahuquillo, José Duato, Antonio González, Joan-Manuel Parcerisa, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Universitat Politècnica de Catalunya. ARCO - Microarquitectura i Compiladors
- Subjects
Computer science ,Multiprocessing ,Instruction set ,Superscalar ,Multiprocessors ,System on a chip ,Informàtica::Arquitectura de computadors [Àrees temàtiques de la UPC] ,Intercluster communication ,Interconnection ,Estructura lògica ,On-chip interconnects ,Complexity ,Multiprocessadors ,Clustered microarchitecture ,Microarchitecture ,Logic synthesis ,Computational Theory and Mathematics ,Computer architecture ,Hardware and Architecture ,Embedded system ,Logic design ,Signal Processing ,Instruction steering - Abstract
Clustering is an effective microarchitectural technique for reducing the impact of wire delays, the complexity, and the power requirements of microprocessors. In this work, we investigate the design of on-chip interconnection networks for clustered superscalar microarchitectures. This new class of interconnects has demands and characteristics different from traditional multiprocessor networks. In particular, in a clustered microarchitecture, a low intercluster communication latency is essential for high performance. We propose some point-to-point cluster interconnects and new improved instruction steering schemes. The results show that these point-to-point interconnects achieve much better performance than bus-based ones, and that the connectivity of the network together with effective steering schemes are key for high performance. We also show that these interconnects can be built with simple hardware and achieve a performance close to that of an idealized contention-free model.
42. Omega-Test: A Predictive Early-Z Culling to Improve the Graphics Pipeline Energy-Efficiency
- Author
-
David Corbalan-Navarro, Enrique de Lucas, Joan-Manuel Parcerisa, Juan Luis Aragon, Antonio Gonzalez, Marti Anglada, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Universitat Politècnica de Catalunya. ARCO - Microarquitectura i Compiladors
- Subjects
Speedup ,Computer science ,Mobile processors ,Processor architecture ,Rendering (computer graphics) ,Visibility determination ,Graphics processors ,Rendering (Computer graphics) ,Computer vision ,Graphics ,Informàtica::Arquitectura de computadors [Àrees temàtiques de la UPC] ,Low-power design ,Hidden line/surface removal ,Pixel ,Visibility (geometry) ,Frame (networking) ,Portable devices ,Computer Graphics and Computer-Aided Design ,Graphics pipeline ,Unitats de processament gràfic ,Energy-aware systems ,Hardware architecture ,Signal Processing ,Computer Vision and Pattern Recognition ,Artificial intelligence ,Graphics processing units ,Software ,Efficient energy use - Abstract
The most common task of GPUs is to render images in real time. When rendering a 3D scene, a key step is to determine which parts of every object are visible in the final image. There are different approaches to solve the visibility problem, the Z-Test being the most common. A major factor that significantly penalizes the energy efficiency of a GPU, especially in the mobile arena, is the so-called overdraw, which happens when a portion of an object is shaded and rendered but finally occluded by another object. This useless work results in a waste of energy; however, a conventional Z-Test avoids only a fraction of it. In this paper we present a novel microarchitectural technique, the Omega-Test, to drastically reduce the overdraw on a Tile-Based Rendering (TBR) architecture. Graphics applications exhibit a great degree of inter-frame coherence, which makes the output of a frame very similar to the previous one. The proposed approach leverages this frame-to-frame coherence by using the resulting information of the Z-Test for a tile (a buffer containing all the computed pixel depths of the tile), which is discarded by today's GPUs, to predict the visibility of the same tile in the next frame. As a result, the Omega-Test identifies occluded parts of the scene early and avoids the rendering of non-visible surfaces, eliminating costly computations and off-chip memory accesses. Our experimental evaluation shows average EDP savings in the overall GPU/memory system of 26.4% and an average speedup of 16.3% for the evaluated benchmarks. This work has been supported by the CoCoUnit ERC Advanced Grant of the EU's Horizon 2020 program (grant No. 833057), the Spanish State Research Agency under grant TIN2016-75344-R (AEI/FEDER, EU), and the ICREA Academia program. D. Corbalan-Navarro has been supported by a PhD research fellowship from the University of Murcia.
- Full Text
- View/download PDF
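The prediction step of the Omega-Test abstract above can be sketched as follows (a deliberate simplification with hypothetical names; the real unit operates on the tile's hardware depth buffer): the per-pixel depths a tile resolved to in frame N serve as a visibility oracle for frame N+1, and any misprediction is only a lost saving, corrected by the conventional Z-Test.

```python
def predicted_visible(prev_frame_depth, x, y, frag_depth):
    """Predict whether a fragment at tile position (x, y) will survive
    the Z-Test, by comparing its depth against the final depth the same
    position resolved to in the previous frame (smaller = closer).
    Fragments predicted occluded skip shading and memory traffic."""
    return frag_depth <= prev_frame_depth[y][x]

# Final depths of a 2x2 tile from the previous frame (illustrative values).
prev = [[0.50, 0.90],
        [0.70, 1.00]]
```

A fragment at (0, 0) with depth 0.40 is predicted visible (it is closer than last frame's 0.50 surface), while one at depth 0.60 is predicted occluded and can skip shading.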