19 results on '"Joan-Manuel Parcerisa"'
Search Results
2. Dynamic Sampling Rate: Harnessing Frame Coherence in Graphics Applications for Energy-Efficient GPUs
- Author
-
Martí Anglada, Enrique de Lucas, Joan-Manuel Parcerisa, Juan L. Aragón, Antonio González, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Universitat Politècnica de Catalunya. ARCO - Microarquitectura i Compiladors
- Subjects
FOS: Computer and information sciences ,Informàtica::Infografia [Àrees temàtiques de la UPC] ,Energia -- Consum ,Computer Vision and Pattern Recognition (cs.CV) ,Computer Science - Computer Vision and Pattern Recognition ,GPU ,Fragment shading ,Unitats de processament gràfic ,Theoretical Computer Science ,Energy consumption ,Hardware and Architecture ,Hardware Architecture (cs.AR) ,Three-dimensional imaging ,Tile-based rendering ,Computer Science - Hardware Architecture ,Sampling ,Informàtica::Arquitectura de computadors [Àrees temàtiques de la UPC] ,Graphics processing units ,Software ,Imatgeria tridimensional ,Information Systems ,ComputingMethodologies_COMPUTERGRAPHICS - Abstract
In real-time rendering, a 3D scene is modelled with meshes of triangles that the GPU projects to the screen. They are discretized by sampling each triangle at regular space intervals to generate fragments, to which a shader program then adds texture and lighting effects. Realistic scenes require detailed geometric models, complex shaders, high-resolution displays and high screen refresh rates, all of which come at a great cost in compute time and energy. This cost is often dominated by the fragment shader, which runs for each sampled fragment. Conventional GPUs sample the triangles once per pixel; however, many screen regions contain low variation, produce identical fragments, and could be sampled at lower than pixel rate with no loss in quality. Additionally, since temporal frame coherence makes consecutive frames very similar, such variations are usually maintained from frame to frame. This work proposes Dynamic Sampling Rate (DSR), a novel hardware mechanism to reduce redundancy and improve energy efficiency in graphics applications. DSR analyzes the spatial frequencies of the scene once it has been rendered. Then, it leverages the temporal coherence of consecutive frames to decide, for each region of the screen, the lowest sampling rate to employ in the next frame that maintains image quality. We evaluate the performance of a state-of-the-art mobile GPU architecture extended with DSR for a wide variety of applications. Experimental results show that DSR is able to remove most of the redundancy inherent in the color computations at fragment granularity, which brings average speedups of 1.68x and energy savings of 40%. This work has been supported by the CoCoUnit ERC Advanced Grant of the EU’s Horizon 2020 program (Grant No. 833057), the Spanish State Research Agency (MCIN/AEI) under Grant PID2020-113172RB-I00, the ICREA Academia program, and the Generalitat de Catalunya under Grant FI-DGR 2016.
Funding was provided by Ministerio de Economía, Industria y Competitividad, Gobierno de España (Grant No. TIN2016-75344-R).
- Published
- 2022
- Full Text
- View/download PDF
3. Early Visibility Resolution for Removing Ineffectual Computations in the Graphics Pipeline
- Author
-
Enrique de Lucas, Juan L. Aragón, Antonio Gonzalez, Joan-Manuel Parcerisa, Marti Anglada, Universitat Politècnica de Catalunya. Doctorat en Arquitectura de Computadors, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Universitat Politècnica de Catalunya. ARCO - Microarquitectura i Compiladors
- Subjects
Computer science ,Computation ,Resolution (electron density) ,Visibility (geometry) ,ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION ,Energies::Eficiència energètica [Àrees temàtiques de la UPC] ,Infografia ,So, imatge i multimèdia::Creació multimèdia::Imatge digital [Àrees temàtiques de la UPC] ,Energy conservation ,Graphics pipeline ,Pipeline transport ,Computer graphics ,Energy efficiency ,Computer graphics (images) ,Tile-based rendering ,Visibility ,Energia -- Estalvi ,Tiled rendering ,ComputingMethodologies_COMPUTERGRAPHICS ,Efficient energy use - Abstract
GPUs' main workload is real-time image rendering. These applications take a description of an (animated) scene and produce the corresponding image(s). An image is rendered by computing the colors of all its pixels. Typically, multiple objects overlap at each pixel. Consequently, a significant amount of processing is devoted to objects that will not be visible in the final image, in spite of the widespread use of the Early Depth Test in modern GPUs, which attempts to discard computations related to occluded objects. Since animations are created by a sequence of similar images, visibility usually does not change much across consecutive frames. Based on this observation, we present Early Visibility Resolution (EVR), a mechanism that leverages the visibility information obtained in a frame to predict the visibility in the following one. Our proposal speculatively determines visibility much earlier in the pipeline than the Early Depth Test. We leverage this early visibility estimation to remove ineffectual computations at two different granularities: pixel level and tile level. Results show that such optimizations lead to 39% performance improvement and 43% energy savings for a set of commercial Android graphics applications running on state-of-the-art mobile GPUs.
- Published
- 2019
4. Rendering Elimination: Early Discard of Redundant Tiles in the Graphics Pipeline
- Author
-
Marti Anglada, Juan L. Aragón, Antonio González, Pedro Marcuello, Joan-Manuel Parcerisa, Enrique de Lucas, Universitat Politècnica de Catalunya. Doctorat en Arquitectura de Computadors, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Universitat Politècnica de Catalunya. ARCO - Microarquitectura i Compiladors
- Subjects
FOS: Computer and information sciences ,Speedup ,Computer science ,Memory bandwidth ,Energy consumption ,Parallel computing ,Supercomputers ,Graphics pipeline ,Rendering (computer graphics) ,Energy efficiency ,Supercomputadors ,Hardware Architecture (cs.AR) ,Tiled rendering ,High performance computing ,Graphics ,Computer Science - Hardware Architecture ,Informàtica::Arquitectura de computadors::Arquitectures paral·leles [Àrees temàtiques de la UPC] ,Càlcul intensiu (Informàtica) ,Efficient energy use ,Tile based rendering ,ComputingMethodologies_COMPUTERGRAPHICS - Abstract
GPUs are one of the most energy-consuming components for real-time rendering applications, since a large number of fragment shading computations and memory accesses are involved. Main memory bandwidth is especially taxing for battery-operated devices such as smartphones. Tile-Based Rendering GPUs divide the screen space into multiple tiles that are independently rendered in on-chip buffers, thus reducing memory bandwidth and energy consumption. We have observed that, in many animated graphics workloads, a large number of screen tiles have the same color across adjacent frames. In this paper, we propose Rendering Elimination (RE), a novel micro-architectural technique that accurately determines, before rasterization, whether a tile will be identical to the same tile in the preceding frame, by means of comparing signatures. Since RE identifies redundant tiles early in the graphics pipeline, it completely avoids the computation and memory accesses of the most power-consuming stages of the pipeline, which substantially reduces the execution time and the energy consumption of the GPU. For widely used Android applications, we show that RE achieves an average speedup of 1.74x and energy reduction of 43% for the GPU/Memory system, surpassing by far the benefits of Transaction Elimination, a state-of-the-art memory bandwidth reduction technique available in some commercial Tile-Based Rendering GPUs.
- Published
- 2018
- Full Text
- View/download PDF
5. Eliminating redundant fragment shader executions on a mobile GPU via hardware memoization
- Author
-
Polychronis Xekalakis, Jose-Maria Arnau, Joan-Manuel Parcerisa, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Universitat Politècnica de Catalunya. ARCO - Microarquitectura i Compiladors
- Subjects
Informàtica::Infografia [Àrees temàtiques de la UPC] ,Speedup ,Computer science ,Memoization ,ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION ,Software rendering ,General Medicine ,Parallel computing ,Animation ,Rendering (computer graphics) ,Redundancy ,Redundancy (engineering) ,Rendering (Computer graphics) ,Locality of reference ,Animació per ordinador ,Graphics processing units ,Shader - Abstract
Redundancy is at the heart of graphical applications. In fact, generating an animation typically involves a succession of extremely similar images. In terms of rendering these images, this behavior translates into the creation of many fragment programs with the exact same input data. We have measured this fragment redundancy for a set of commercial Android applications, and found that more than 40% of the fragments used in a frame have already been computed in a prior frame. In this paper we exploit this redundancy using fragment memoization. Unfortunately, this is not an easy task, as most of the redundancy exists across frames, rendering most HW-based schemes infeasible. We thus first take a step back and analyze the temporal locality of the redundant fragments, their complexity, and the number of inputs typically seen in fragment programs. The result of our analysis is a task-level memoization scheme that easily outperforms the current state of the art in low-power GPUs. More specifically, our experimental results show that our scheme is able to remove 59.7% of the redundant fragment computations on average. This translates into a significant speedup of 17.6% on average, while also improving the overall energy efficiency by 8.9% on average.
- Published
- 2014
6. Boosting mobile GPU performance with a decoupled access/execute fragment processor
- Author
-
Jose-Maria Arnau, Joan-Manuel Parcerisa, and Polychronis Xekalakis
- Subjects
Real-time computer graphics ,Computer architecture ,Computer science ,Multithreading ,Software rendering ,Decoupled architecture ,General Medicine ,Cache ,General-purpose computing on graphics processing units ,Texture memory ,CAS latency ,Efficient energy use ,Rendering (computer graphics) - Abstract
Smartphones represent one of the fastest growing markets, providing significant hardware/software improvements every few months. However, supporting these capabilities reduces the operating time per battery charge. The CPU/GPU component is left with only a shrinking fraction of the power budget, since most of the energy is consumed by the screen and the antenna. In this paper, we focus on improving the energy efficiency of the GPU, since graphical applications constitute an important part of the existing market. Moreover, the trend towards better screens will inevitably lead to a higher demand for improved graphics rendering. We show that the main bottleneck for these applications is the texture cache and that traditional techniques for hiding memory latency (prefetching, multithreading) do not work well or come at a high energy cost. We thus propose the migration of GPU designs towards the decoupled access-execute concept. Furthermore, we significantly reduce bandwidth usage in the decoupled architecture by exploiting inter-core data sharing. Using commercial Android applications, we show that the end design can achieve 93% of the performance of a heavily multithreaded GPU while providing energy savings of 34%.
- Published
- 2012
7. Leveraging Register Windows to Reduce Physical Registers to the Bare Minimum
- Author
-
Joan-Manuel Parcerisa, Antonio González, Eduardo Quinones, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Universitat Politècnica de Catalunya. ARCO - Microarquitectura i Compiladors
- Subjects
Instruction register ,Memory buffer register ,Computer science ,Informàtica::Enginyeria del software [Àrees temàtiques de la UPC] ,File organization (Computer science) ,Register file ,Parallel computing ,Fitxers informàtics -- Organització ,Theoretical Computer Science ,Early register release ,Memory address register ,Control register ,Hardware register ,Hardware_REGISTER-TRANSFER-LEVELIMPLEMENTATION ,Out-of-order execution ,Processor register ,Software architecture ,FLAGS register ,Register renaming ,Register window ,Stack register ,Microarchitecture ,Physical register file ,Computational Theory and Mathematics ,Hardware and Architecture ,Status register ,Operating system ,Programari -- Disseny ,Hardware_CONTROLSTRUCTURESANDMICROPROGRAMMING ,Memory data register ,Software ,Register windows ,Register allocation - Abstract
Register windows are an architectural technique that reduces the memory operations required to save and restore registers across procedure calls. Their effectiveness depends on the size of the register file. Register requirements normally increase with out-of-order execution, because registers are needed for the in-flight instructions in addition to the architectural ones. However, a large register file has an important cost in terms of area and power and may even affect the cycle time. In this paper, we propose a software/hardware early register release technique that leverages register windows to drastically reduce the register requirements and, hence, the register file cost. Contrary to the common belief that out-of-order processors with register windows would need a large physical register file, this paper shows that the physical register file size may be reduced to the bare minimum by using this novel microarchitecture. Moreover, our proposal has much lower hardware complexity than previous approaches, and requires minimal changes to a conventional register window scheme. Performance studies show that the proposed technique can reduce the number of physical registers to the number of logical registers plus one (the minimum number to guarantee forward progress) and still achieve almost the same performance as an unbounded register file.
- Published
- 2010
8. [Untitled]
- Author
-
Joan-Manuel Parcerisa, Ramon Canal, and Antonio González
- Subjects
Scheme (programming language) ,Multi-core processor ,Computer science ,Workload ,Parallel computing ,Theoretical Computer Science ,Microarchitecture ,Theory of computation ,Code (cryptography) ,Overhead (computing) ,Cluster analysis ,Software ,Information Systems - Abstract
Recent works show that delays introduced in the issue and bypass logic will become critical for wide-issue superscalar processors. One of the proposed solutions is clustering the processor core. Clustered architectures benefit from a less complex, partitioned processor core and thus incur less critical delays. In this paper, we propose a dynamic instruction steering logic for these clustered architectures that decides at decode time the cluster where each instruction is executed. The performance of clustered architectures depends on the inter-cluster communication overhead and the workload balance. We present a scheme that uses runtime information to optimize the trade-off between these figures. The evaluation shows that this scheme can achieve an average speed-up of 35% over a conventional 8-way issue (4 int + 4 fp) machine and that it outperforms other previous proposals, either static or dynamic.
- Published
- 2001
9. Neither more nor less: optimizing thread-level parallelism for GPGPUs
- Author
-
Polychronis Xekalakis, Jose-Maria Arnau, and Joan-Manuel Parcerisa
- Subjects
Memory management ,Computer science ,Bandwidth (computing) ,Task parallelism ,Mobile telephony ,Parallel computing ,Scheduling (computing) - Published
- 2013
10. Work in progress-improving feedback using an automatic assessment tool
- Author
-
Ruben Tous, C. Perez, Jordi Tubella, M. Fernandez, Joan-Manuel Parcerisa, D. López, Carlos Alvarez, Daniel Jiménez-González, Javier Alonso, and P. Barlet
- Subjects
Correctness ,Assembly language ,Computer science ,Work in process ,Test (assessment) ,Microarchitecture ,World Wide Web ,Debugging ,Web page ,ComputingMilieux_COMPUTERSANDEDUCATION ,Virtual learning environment ,Software engineering - Abstract
Computer science students in their freshman year usually develop assembler programs to learn processor architecture. Homework exercises are done on paper, while those in lab sessions are solved with the aid of programming tools. Students perceive theory and lab as different subjects, so they don't use lab tools to test their theory-solved problems. Moreover, during lab sessions, students often tend to ask for the teacher's guidance and advice instead of using the debugging tools, because these are new and unfriendly to them, and do not offer quick and clear feedback. In this paper we present an automatic and friendly assessment tool, SISA-EMU, with a novel feature: exercise-driven feedback with teacher's expertise. It provides correctness information and clues to help students solve their most common mistakes for each individual problem (rather than typical generic debug information) without the physical presence of a teacher. SISA-EMU is currently in a pre-deploy phase via a Moodle learning platform, and we will have the first evaluation results by the end of the current term.
- Published
- 2008
11. Selective predicate prediction for out-of-order processors
- Author
-
Antonio González, Eduardo Quiñones, and Joan-Manuel Parcerisa
- Subjects
Scheme (programming language) ,Out-of-order execution ,Transformation (function) ,Theoretical computer science ,Ideal (set theory) ,Computer science ,Branch misprediction ,Control (linguistics) ,Rename ,Predicate (grammar) - Abstract
If-conversion transforms control dependencies into data dependencies by using a predication mechanism. It is useful to eliminate hard-to-predict branches and to reduce the severe performance impact of branch mispredictions. However, the use of predicated execution in out-of-order processors has to deal with two problems: there can be multiple definitions of a single destination register at rename time, and instructions with a false predicate consume unnecessary resources. Predicting predicates is an effective approach to address both problems. However, predicting predicates that come from hard-to-predict branches is not beneficial in general, because this approach reverses the if-conversion transformation, losing its potential benefits. In this paper we propose a new scheme that dynamically selects which predicates are worth predicting and which ones are more effective in their if-converted form. We show that our approach significantly outperforms previously proposed schemes. Moreover, it performs within 5% of an ideal scheme with perfect predicate prediction.
- Published
- 2006
12. The latency hiding effectiveness of decoupled access/execute processors
- Author
-
Antonio González, Joan-Manuel Parcerisa, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Universitat Politècnica de Catalunya. ARCO - Microarquitectura i Compiladors
- Subjects
Cache storage ,Hardware_MEMORYSTRUCTURES ,Parallel processing (Electronic computers) ,Computer science ,Cache memory ,Processament en paral·lel (Ordinadors) ,Memòria cau ,Parallel computing ,EPIC ,CAS latency ,Issue logic ,Embedded system ,Memòria ràpida de treball (Informàtica) ,Computer architecture ,Cache ,Delays ,Latency (engineering) ,Architecture ,Speculation ,Informàtica::Arquitectura de computadors [Àrees temàtiques de la UPC] ,Compile time - Abstract
Several studies have demonstrated that out-of-order execution processors may not be the most adequate organization for wide-issue processors due to the increasing penalties that wire delays cause in the issue logic. The main target of out-of-order execution is to hide functional unit latencies and memory latency. However, the former can be quite effectively handled at compile time and this observation is one of the main arguments for the emerging EPIC architectures. In this paper, we demonstrate that a decoupled access/execute organization is very effective at hiding memory latency, even when it is very long. This paper presents a thorough evaluation of such processor organization. First, a generic decoupled access/execute architecture is defined and evaluated. Then the benefits of a lockup-free cache, control speculation and a store-load bypass mechanism under such an architecture are evaluated. Our analysis indicates that memory latency can be almost completely hidden by such techniques.
- Published
- 2002
13. The synergy of multithreading and access/execute decoupling
- Author
-
Joan-Manuel Parcerisa, Antonio González, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Universitat Politècnica de Catalunya. ARCO - Microarquitectura i Compiladors
- Subjects
Parallel processing (Electronic computers) ,Computer science ,Processament en paral·lel (Ordinadors) ,Processor scheduling ,Virtual computer systems ,Parallel computing ,Simultaneous multithreading ,Bottleneck ,CAS latency ,Microarchitecture ,Multi-threading ,Parallel architectures ,Multithreading ,Virtual machines ,Delays ,Informàtica::Arquitectura de computadors [Àrees temàtiques de la UPC] ,Critical path method ,Temporal multithreading ,Sistemes virtuals (Informàtica) ,Decoupling (electronics) - Abstract
This work presents and evaluates a novel processor microarchitecture which combines two paradigms: access/execute decoupling and simultaneous multithreading. We investigate how both techniques complement each other: while decoupling features an excellent memory latency hiding efficiency, multithreading supplies the in-order issue stage with enough ILP to hide the functional unit latencies. Its partitioned layout, together with its in-order issue policy makes it potentially less complex, in terms of critical path delays, than a centralized out-of-order design, to support future growths in issue-width and clock speed. The simulations show that by adding decoupling to a multithreaded architecture, its miss latency tolerance is sharply increased and in addition, it needs fewer threads to achieve maximum throughput, especially for a large miss latency. Fewer threads result in a hardware complexity reduction and lower demands on the memory system, which becomes a critical resource for large miss latencies, since bandwidth may become a bottleneck.
- Published
- 1999
14. Reducing wire delay penalty through value prediction
- Author
-
Antonio González, Joan-Manuel Parcerisa, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Universitat Politècnica de Catalunya. ARCO - Microarquitectura i Compiladors
- Subjects
Computer science ,Estructura lògica ,Real-time computing ,Value (computer science) ,Workstation clusters ,Workload ,Context (language use) ,Arquitectura d'ordinadors ,Reliability engineering ,Microarchitecture ,Logic synthesis ,Logic design ,Computer architecture ,Delays ,Informàtica::Arquitectura de computadors [Àrees temàtiques de la UPC] ,Degradation (telecommunications) - Abstract
In this paper we show that value prediction can be used to avoid the penalty of long wire delays by predicting the data that is communicated through these long wires and validating the prediction locally, where the value is produced. Only in the case of a misprediction is the long wire delay experienced. We apply this concept to a clustered microarchitecture in order to reduce inter-cluster communication. The predictability of values provides the dynamic instruction partitioning hardware with fewer constraints to optimize the trade-off between communication requirements and workload balance, which is the most critical issue of the partitioning scheme. We show that value prediction reduces the penalties caused by inter-cluster communication by 18% on average for a realistic implementation of a 4-cluster microarchitecture.
15. Improving latency tolerance of multithreading through decoupling
- Author
-
Joan-Manuel Parcerisa, Antonio González, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Universitat Politècnica de Catalunya. ARCO - Microarquitectura i Compiladors
- Subjects
Computer science ,Clock rate ,Processor scheduling ,Thread (computing) ,Parallel computing ,Simultaneous multithreading ,CAS latency ,Theoretical Computer Science ,Instruction set ,Super-threading ,Access/execute decoupling ,Hardware complexity ,Superscalar ,Simultaneous multithreading processors ,Temporal multithreading ,Informàtica::Arquitectura de computadors [Àrees temàtiques de la UPC] ,Parallel processing (Electronic computers) ,Processament en paral·lel (Ordinadors) ,Instruction-level parallelism ,Microarchitecture ,Computational Theory and Mathematics ,Hardware and Architecture ,Multithreading ,Embedded system ,Latency hiding ,Software - Abstract
The increasing hardware complexity of dynamically scheduled superscalar processors may compromise the scalability of this organization to make efficient use of future increases in transistor budget. SMT processors, designed over a superscalar core, are therefore directly affected by this problem. The article presents and evaluates a novel processor microarchitecture which combines two paradigms: simultaneous multithreading and access/execute decoupling. Since its decoupled units issue instructions in order, this architecture is significantly less complex, in terms of critical path delays, than a centralized out-of-order design, and it is more amenable to future growth in issue-width and clock speed. We investigate how both techniques complement each other. Since decoupling features an excellent memory latency hiding efficiency, the large amount of parallelism exploited by multithreading may be used to hide the latency of functional units and keep them fully utilized. The study shows that, by adding decoupling to a multithreaded architecture, fewer threads are needed to achieve maximum throughput. Therefore, in addition to the obvious hardware complexity reduction, it places lower demands on the memory system. The study also reveals that multithreading by itself exhibits little memory latency tolerance. Results suggest that most of the latency hiding effectiveness of SMT architectures comes from the dynamic scheduling. On the other hand, decoupling is very effective at hiding memory latency. An increase in the cache miss penalty from 1 to 32 cycles reduces the performance of a 4-context multithreaded decoupled processor by less than 2 percent. For the nondecoupled multithreaded processor, the loss of performance is about 23 percent.
16. Dynamic cluster assignment mechanisms
- Author
-
Joan-Manuel Parcerisa, Ramon Canal, Antonio González, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Universitat Politècnica de Catalunya. ARCO - Microarquitectura i Compiladors
- Subjects
Parallel processing (Electronic computers) ,Computer science ,Processament en paral·lel (Ordinadors) ,Workload ,Parallel computing ,Steering logic ,Microarchitecture ,Instruction set ,Read-write memory ,Clustered microarchitectures ,Dynamically scheduled processors ,Benchmark (computing) ,Code (cryptography) ,Resource allocation ,Dynamic code partitioning ,Informàtica::Arquitectura de computadors [Àrees temàtiques de la UPC] ,Integer (computer science) - Abstract
Clustered microarchitectures are an effective approach to reducing the penalties caused by wire delays inside a chip. Current superscalar processors have in fact a two-cluster microarchitecture with a naive code partitioning approach: integer instructions are allocated to one cluster and floating-point instructions to the other. This partitioning scheme is simple and results in no communication between the two clusters (other than through memory), but it is in general far from optimal because the workload is not evenly distributed most of the time. In fact, when the processor is running integer programs, the workload is extremely unbalanced, since the FP cluster is not used at all. In this work we investigate run-time mechanisms that dynamically distribute the instructions of a program among these two clusters. By optimizing the trade-off between inter-cluster communication penalty and workload balance, the proposed schemes can achieve an average speed-up of 36% for the SpecInt95 benchmark suite.
17. Visibility rendering order: Improving energy efficiency on mobile GPUs through frame coherence
- Author
-
Antonio González, Enrique de Lucas, Pedro Marcuello, Joan-Manuel Parcerisa, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Universitat Politècnica de Catalunya. ARCO - Microarquitectura i Compiladors
- Subjects
Graphics Pipeline ,Computer science ,Color computer graphics ,Rasterization ,GPU ,Rendering (computer graphics) ,Rendering ,Imatges -- Processament -- Tècniques digitals ,Rendering (Computer graphics) ,Tiled rendering ,Graphics ,Image processing -- Digital techniques ,ComputingMethodologies_COMPUTERGRAPHICS ,Tile based rendering ,Informàtica::Infografia [Àrees temàtiques de la UPC] ,Energy-efficiency ,Graphics pipeline ,Infografia en color ,Pixel shading ,Topological order ,Occlusion culling ,Tile based deferred rendering ,Computational Theory and Mathematics ,Computer engineering ,Hardware and Architecture ,Fragment processing ,Signal Processing ,Visibility ,Shading ,Graphics processing units ,Efficient energy use - Abstract
During real-time graphics rendering, objects are processed by the GPU in the order they are submitted by the CPU, and occluded surfaces are often processed even though they will end up not being part of the final image, thus wasting precious time and energy. To help discard occluded surfaces, most current GPUs include an Early-Depth test before the fragment processing stage. However, to be effective it requires that opaque objects be processed in a front-to-back order. Depth sorting and other occlusion culling techniques at the object level incur overheads that are only offset for applications having substantial depth and/or fragment shading complexity, which is often not the case in mobile workloads. We propose a novel architectural technique for mobile GPUs, Visibility Rendering Order (VRO), which reorders objects front-to-back entirely in hardware by exploiting the fact that the objects in animated graphics applications tend to keep their relative depth order across consecutive frames (temporal coherence). Since order relationships are already tested by the Depth Test, VRO incurs minimal energy overheads because it just requires adding a small hardware unit to capture that information and use it later to guide the rendering of the following frame. Moreover, unlike other approaches, this unit works in parallel with the graphics pipeline without any performance overhead. We illustrate the benefits of VRO using various unmodified commercial 3D applications, for which VRO achieves 27% speed-up and 14.8% energy reduction on average over a state-of-the-art mobile GPU.
18. Improving branch prediction and predicated execution in out-of-order processors
- Author
-
Antonio González, Eduardo Quinones, Joan-Manuel Parcerisa, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Universitat Politècnica de Catalunya. ARCO - Microarquitectura i Compiladors
- Subjects
Computer science ,Branch ,Parallel computing ,Out of order ,computer.software_genre ,Instruction set ,Branch predication ,Instruction sets ,Degradation ,Hardware ,Proposals ,Informàtica::Arquitectura de computadors [Àrees temàtiques de la UPC] ,Accuracy ,Compilers (Computer programs) ,Pipelines ,Out-of-order execution ,Compiladors (Programes d'ordinador) ,Registers ,Branch predictor ,Predicate (grammar) ,Costs ,Branch target predictor ,Computer aided instruction ,Compiler ,Hardware_CONTROLSTRUCTURESANDMICROPROGRAMMING ,computer ,Algorithm - Abstract
If-conversion is a compiler technique that reduces the misprediction penalties caused by hard-to-predict branches, transforming control dependencies into data dependencies. Although it is globally beneficial, it has a negative side effect: the removal of branches eliminates useful correlation information needed by conventional branch predictors, so the remaining branches may become harder to predict. However, in predicated ISAs with a compare-branch model, the correlation information resides not only in branches but also in the compare instructions that compute their guarding predicates. When a branch is removed, its correlation information is still available in its compare instruction. We propose a branch prediction scheme based on predicate prediction. It has three advantages: First, since the prediction is not done on a branch basis but on a predicate-define basis, branch removal after if-conversion does not lose any correlation information, so accuracy is not degraded. Second, the mechanism we propose permits using the computed value of the branch predicate when available, instead of the predicted value, thus effectively achieving 100% accuracy on such early-resolved branches. Third, as shown in previous work, selective predicate prediction is a very effective technique to implement if-conversion on out-of-order processors, since it avoids the problem of multiple register definitions and reduces the unnecessary resource consumption of nullified instructions. Hence, our approach enables a very efficient implementation of if-conversion for an out-of-order processor, with almost no additional hardware cost, because the same hardware is used to predict the predicates of if-converted code and to predict branches without accuracy degradation.
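The two key behaviors described above — indexing the predictor by the predicate-define (compare) instruction rather than the branch, and using the computed predicate value when it is already available — can be sketched as follows. This is an illustrative model only; the class name, table size, and 2-bit counter scheme are assumptions, not the paper's microarchitecture.

```python
class PredicatePredictor:
    """Sketch of branch prediction on a predicate-define basis.

    The counter table is indexed by the PC of the compare instruction that
    computes the predicate, so correlation survives even when if-conversion
    removes the branch itself. Once the predicate value has been computed,
    the real value is used instead of a prediction (early-resolved branches).
    """

    def __init__(self, size=1024):
        self.counters = [2] * size   # 2-bit saturating counters, weakly taken
        self.computed = {}           # define_pc -> resolved predicate value

    def predict(self, define_pc):
        if define_pc in self.computed:          # early-resolved: 100% accurate
            return self.computed[define_pc]
        return self.counters[define_pc % len(self.counters)] >= 2

    def resolve(self, define_pc, value):
        """Record the computed predicate and train the counter."""
        self.computed[define_pc] = value
        i = define_pc % len(self.counters)
        self.counters[i] = min(3, self.counters[i] + 1) if value \
            else max(0, self.counters[i] - 1)
```

Once `resolve` has run for a given predicate-define PC, `predict` returns the actual value rather than a guess, mirroring the paper's second advantage.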
19. On-chip interconnects and instruction steering schemes for clustered microarchitectures
- Author
-
Julio Sahuquillo, José Duato, Antonio González, Joan-Manuel Parcerisa, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Universitat Politècnica de Catalunya. ARCO - Microarquitectura i Compiladors
- Subjects
Computer science ,Multiprocessing ,Instruction set ,Superscalar ,Multiprocessors ,System on a chip ,Informàtica::Arquitectura de computadors [Àrees temàtiques de la UPC] ,Intercluster communication ,Interconnection ,business.industry ,Estructura lògica ,On-chip interconnects ,Complexity ,Multiprocessadors ,ComputerSystemsOrganization_PROCESSORARCHITECTURES ,Clustered microarchitecture ,Microarchitecture ,Logic synthesis ,Computational Theory and Mathematics ,Computer architecture ,Hardware and Architecture ,Embedded system ,Logic design ,Signal Processing ,Instruction steering ,business - Abstract
Clustering is an effective microarchitectural technique for reducing the impact of wire delays, the complexity, and the power requirements of microprocessors. In this work, we investigate the design of on-chip interconnection networks for clustered superscalar microarchitectures. This new class of interconnects has demands and characteristics different from traditional multiprocessor networks. In particular, in a clustered microarchitecture, a low intercluster communication latency is essential for high performance. We propose some point-to-point cluster interconnects and new improved instruction steering schemes. The results show that these point-to-point interconnects achieve much better performance than bus-based ones, and that the connectivity of the network together with effective steering schemes are key for high performance. We also show that these interconnects can be built with simple hardware and achieve a performance close to that of an idealized contention-free model.
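A common family of steering schemes in this design space is dependence-based: send an instruction to the cluster that produces most of its source operands, to avoid paying intercluster communication latency, unless the clusters become too imbalanced. The sketch below illustrates that trade-off; the function signature, the vote-counting heuristic, and the `imbalance_limit` threshold are assumptions made here, not the specific schemes evaluated in the paper.

```python
def steer(instr_srcs, producer_cluster, cluster_load, imbalance_limit=8):
    """Pick a cluster for an instruction in a clustered microarchitecture.

    instr_srcs: source register names of the instruction.
    producer_cluster: map from register name to the cluster producing it.
    cluster_load: pending-instruction count per cluster.
    Prefers the cluster producing most sources (fewer intercluster hops),
    but falls back to the least-loaded cluster if imbalance grows too large.
    """
    votes = {}
    for src in instr_srcs:
        c = producer_cluster.get(src)
        if c is not None:
            votes[c] = votes.get(c, 0) + 1
    least = min(range(len(cluster_load)), key=lambda c: cluster_load[c])
    if votes:
        # Most votes wins; ties broken toward the lighter-loaded cluster.
        best = max(votes, key=lambda c: (votes[c], -cluster_load[c]))
        if cluster_load[best] - cluster_load[least] <= imbalance_limit:
            return best
    return least
```

For example, an instruction reading two registers produced on cluster 0 and one on cluster 1 steers to cluster 0 when loads are balanced, while an instruction with no in-flight sources simply goes to the least-loaded cluster.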