318 results for "Xavier Martorell"
Search Results
302. Design and implementation of message-passing services for the Blue Gene/L supercomputer
- Author
-
C. Christopher Erway, Joseph D. Ratterman, Charles J. Archer, Burkhard Steinmacher-Burow, Brian Toonen, José E. Moreira, José G. Castaños, William Gropp, John A. Gunnels, Xavier Martorell, K. W. Pinnow, P. Heidelberger, and G. Almasi
- Subjects
Coprocessor, General Computer Science, Computer science, Node (networking), Message passing, Bandwidth (signal processing), Message Passing Interface, Parallel computing, Supercomputer, Mode (computer interface), Operating system, Massively parallel
- Abstract
The Blue Gene®/L (BG/L) supercomputer, with 65,536 dual-processor compute nodes, was designed from the ground up to support efficient execution of massively parallel message-passing programs. Part of this support is an optimized implementation of the Message Passing Interface (MPI), which leverages the hardware features of BG/L. MPI for BG/L is implemented on top of a more basic message-passing infrastructure called the message layer. This message layer can be used both to implement other higher-level libraries and directly by applications. MPI and the message layer are used in the two BG/L modes of operation: the coprocessor mode and the virtual node mode. Performance measurements show that our message-passing services deliver performance close to the hardware limits of the machine. They also show that dedicating one of the processors of a node to communication functions (coprocessor mode) greatly improves the message-passing bandwidth, whereas running two processes per compute node (virtual node mode) can have a positive impact on application performance.
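For readers outside MPI, the sketch below shows a standard point-to-point exchange in C, the kind of operation whose bandwidth such an implementation must optimize. This is plain, portable MPI, not the BG/L message layer API, and the message size is an arbitrary choice.

    /* Minimal MPI ping-pong sketch; illustrative only -- it exercises the
       point-to-point path that the BG/L message layer implements, without
       using the message layer API itself. Run with at least 2 ranks. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define MSG_SIZE (1 << 20)  /* 1 MiB payload, an arbitrary choice */

    int main(int argc, char **argv)
    {
        int rank;
        char *buf = calloc(MSG_SIZE, 1);

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("round trip complete\n");
        } else if (rank == 1) {
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        free(buf);
        return 0;
    }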
303. Running OpenMP applications efficiently on an everything-shared SDSM
- Author
-
Juan José Costa, Eduard Ayguadé, Jesús Labarta, Xavier Martorell, Toni Cortes, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions, and Universitat Politècnica de Catalunya. VIRTUOS - Virtualisation and Operating Systems
- Subjects
Computer Networks and Communications, Computer science, Distributed computing, Workstations, Parallel computing, Application software, Theoretical Computer Science, Hardware, Informàtica [Àrees temàtiques de la UPC], Artificial Intelligence, Distributed shared memory systems, Semantic memory, Computer architecture, Informàtica::Arquitectura de computadors::Arquitectures paral·leles [Àrees temàtiques de la UPC], Open systems, Distributed shared memory, Informàtica::Arquitectura de computadors::Arquitectures distribuïdes [Àrees temàtiques de la UPC], Software engineering, Message passing, Software distributed shared memory, OpenMP, Gestió de memòria (Informàtica), Shared memory, Memory management (Computer science), Runtime, Hardware and Architecture, Distributed algorithm, Scalability, Enginyeria de programari, Software
- Abstract
Traditional software distributed shared memory (SDSM) systems modify the semantics of a real hardware shared memory system by relaxing the coherence semantics and by limiting the memory regions that are actually shared. These semantic modifications are made to improve the performance of the applications using the system. We show that an SDSM system that behaves like a real shared memory system (without the aforementioned relaxations) can also be used to execute OpenMP applications and achieve speedups similar to those obtained by traditional SDSM systems. This performance can be achieved by encouraging cooperation between the SDSM and the OpenMP runtime instead of relaxing the semantics of the shared memory. In addition, techniques such as boundary alignment and page presending prove very useful for overcoming the limitations of current SDSM systems.
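The paper's premise is that ordinary shared-memory OpenMP code is the workload: on an everything-shared SDSM, a loop like the one below runs unmodified because the whole address space, not selected regions, is shared across nodes. A minimal illustration, not code from the paper.

    /* An ordinary OpenMP loop; on an everything-shared SDSM the same
       semantics hold because every page of the address space can be
       shared, so no source changes are needed. */
    #include <omp.h>
    #include <stdio.h>

    #define N 1000000
    static double a[N], b[N];

    int main(void)
    {
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            a[i] = 2.0 * b[i];   /* pages of a and b may live on any node */

        printf("computed with up to %d threads\n", omp_get_max_threads());
        return 0;
    }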
304. Hipster: hybrid task manager for latency-critical cloud workloads
- Author
-
Rajiv Nishtala, Paul M. Carpenter, Xavier Martorell, Vinicius Petrucci, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions, and Barcelona Supercomputing Center
- Subjects
Learning (artificial intelligence), Computació en núvol, Computer science, Cloud computing, Servers, Quality of service, Server, Frequency scaling, Informàtica::Arquitectura de computadors::Arquitectures paral·leles [Àrees temàtiques de la UPC], Resource management, Energy consumption, Throughput, Informació -- Sistemes d'emmagatzematge i recuperació, Power demand, Data center, High performance computing, Task manager, Heuristics, Càlcul intensiu (Informàtica), Computer network
- Abstract
In 2013, U.S. data centers accounted for 2.2% of the country's total electricity consumption, a figure that is projected to increase rapidly over the next decade. Many important workloads are interactive, and they demand strict levels of quality of service (QoS) to meet user expectations, making it challenging to reduce power consumption under increasing performance demands. This paper introduces Hipster, a technique that combines heuristics and reinforcement learning to manage latency-critical workloads. Hipster's goal is to improve resource efficiency in data centers while respecting the QoS of the latency-critical workloads. Hipster achieves its goal by exploiting heterogeneous multi-cores and dynamic voltage and frequency scaling (DVFS). To improve data center utilization and make the best use of the available resources, Hipster can dynamically assign remaining cores to batch workloads without violating the QoS constraints of the latency-critical workloads. We perform experiments on a 64-bit ARM big.LITTLE platform and show that, compared to prior work, Hipster improves the QoS guarantee for Web-Search from 80% to 96%, and for Memcached from 92% to 99%, while reducing energy consumption by up to 18%.
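A highly simplified sketch of the control loop the abstract describes, showing only the heuristic portion: measure tail latency, compare it with the QoS target, and step the big-core count and DVFS level up or down, freeing capacity for batch work when there is slack. All names, thresholds and sample latencies below are invented for illustration; the real system also uses reinforcement learning.

    /* Hypothetical sketch of a Hipster-style feedback loop: choose the
       next (core count, DVFS level) configuration from measured tail
       latency versus the QoS target. Everything here is invented. */
    #include <stdio.h>

    struct config { int big_cores; int dvfs_level; };

    static struct config step(struct config c, double latency_ms,
                              double qos_target_ms)
    {
        if (latency_ms > qos_target_ms) {
            /* QoS at risk: give the latency-critical job more capacity. */
            if (c.dvfs_level < 4) c.dvfs_level++;
            else if (c.big_cores < 4) c.big_cores++;
        } else if (latency_ms < 0.7 * qos_target_ms) {
            /* Comfortable slack: release capacity to batch jobs. */
            if (c.big_cores > 1) c.big_cores--;
            else if (c.dvfs_level > 0) c.dvfs_level--;
        }
        return c;
    }

    int main(void)
    {
        struct config c = { 2, 2 };
        double samples[] = { 5.0, 9.5, 12.0, 6.0, 3.0 };  /* fake latencies */
        for (int i = 0; i < 5; i++) {
            c = step(c, samples[i], 10.0);
            printf("t%d: %d big cores, DVFS level %d\n",
                   i, c.big_cores, c.dvfs_level);
        }
        return 0;
    }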
305. Barcelona OpenMP tasks suite: a set of benchmarks targeting the exploitation of task parallelism in OpenMP
- Author
-
Roger Ferrer, Alejandro Duran, Eduard Ayguadé, Xavier Teruel, Xavier Martorell, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions
- Subjects
Multi-core processor, Recursion, Parallel processing (Electronic computers), Data parallelism, Computer science, Task parallelism, Parallel computing, Computer architecture, Parallel processing (DSP implementation), Processament en paral·lel (Ordinadors) -- Arquitectura, Programming paradigm, Implicit parallelism, Instruction-level parallelism, Barcelona OpenMP Tasks Suite, Informàtica::Arquitectura de computadors [Àrees temàtiques de la UPC], Application program interfaces, OpenMP specifications
- Abstract
Traditional parallel applications have exploited regular parallelism, based on parallel loops; only a few applications exploit sections parallelism. With the release of the new OpenMP specification (3.0), this programming model supports tasking. Parallel tasks allow the exploitation of irregular parallelism, but there is a lack of benchmarks exploiting tasks in OpenMP. With current (and projected) multicore architectures offering many more alternatives for executing parallel applications than traditional SMP machines, this kind of parallelism is increasingly important, and so is the need for a set of benchmarks to evaluate it. In this paper, we motivate the need for such a benchmark suite, for irregular and/or recursive task parallelism. We present our proposal, the Barcelona OpenMP Tasks Suite (BOTS), a set of applications exploiting regular and irregular parallelism based on tasks. We present an overall evaluation of the BOTS benchmarks on an Altix system and discuss some of the experiments that can be done with the different compilation and runtime alternatives of the benchmarks.
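Recursive Fibonacci is one of the BOTS kernels; in OpenMP 3.0 task syntax, the irregular, recursive parallelism the suite targets looks like the following minimal sketch (not the BOTS source, which adds cut-off and granularity control).

    /* Recursive task parallelism in OpenMP 3.0 style, the pattern that
       BOTS benchmarks; a real implementation adds a depth cut-off. */
    #include <omp.h>
    #include <stdio.h>

    static long fib(int n)
    {
        long x, y;
        if (n < 2) return n;
        #pragma omp task shared(x)
        x = fib(n - 1);
        #pragma omp task shared(y)
        y = fib(n - 2);
        #pragma omp taskwait
        return x + y;
    }

    int main(void)
    {
        long r;
        #pragma omp parallel
        #pragma omp single
        r = fib(25);
        printf("fib(25) = %ld\n", r);  /* 75025 */
        return 0;
    }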
306. Performance-driven processor allocation
- Author
-
Julita Corbalan, Jesús Labarta, and Xavier Martorell
- Subjects
Computer science, Real-time computing, Runtime library, Processor scheduling, Workload, Multiprocessing, Multiprocessor scheduling, Scheduling (computing), Computational Theory and Mathematics, Hardware and Architecture, Embedded system, Signal Processing, Resource allocation, Algorithm design, Computer multitasking
- Abstract
In current multiprogrammed multiprocessor systems, taking into account the performance of parallel applications is critical for deciding an efficient processor allocation. In this paper, we present the performance-driven processor allocation policy (PDPA). PDPA is a new scheduling policy that implements a processor allocation policy and a multiprogramming-level policy, in a coordinated way, based on the measured application performance. With regard to processor allocation, PDPA is a dynamic policy that allocates to each application the maximum number of processors with which it reaches a given target efficiency. With regard to the multiprogramming level, PDPA allows the execution of a new application when free processors are available and the allocation of all running applications is stable, or when some applications show poor performance. Results demonstrate that PDPA automatically adjusts the processor allocation of parallel applications to reach the specified target efficiency, and that it adjusts the multiprogramming level to the workload characteristics. PDPA is able to adjust the processor allocation and the multiprogramming level without human intervention, a desirable property for self-configurable systems, resulting in better individual application response times.
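A minimal sketch of a PDPA-style allocation rule, with invented names and numbers: grow an application's allocation while its measured efficiency stays at or above the target, and shrink it otherwise.

    /* Illustrative sketch (names and numbers invented) of an
       efficiency-driven allocation rule in the spirit of PDPA. */
    #include <stdio.h>

    static int next_allocation(int current, double speedup,
                               double target_efficiency, int max_procs)
    {
        double efficiency = speedup / current;   /* speedup per processor */
        if (efficiency >= target_efficiency && current < max_procs)
            return current + 1;   /* still efficient: try one more processor */
        if (efficiency < target_efficiency && current > 1)
            return current - 1;   /* below target: release a processor */
        return current;           /* stable allocation */
    }

    int main(void)
    {
        /* An application achieving speedup 6.0 on 8 processors
           (efficiency 0.75) against a target efficiency of 0.8: */
        printf("next allocation: %d\n",
               next_allocation(8, 6.0, 0.8, 16));  /* prints 7 */
        return 0;
    }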
307. Characterizing and improving the performance of many-core task-based parallel programming runtimes
- Author
-
Jaume Bosch, Xubin Tan, Daniel Jiménez-González, Eduard Ayguadé, Carlos Alvarez, Xavier Martorell, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions, and Barcelona Supercomputing Center
- Subjects
Parallel computing, Organizations, Speedup, Correctness, Computer science, Distributed computing, Computation, Runtime verification, Message systems, Parallel programming, Computational modeling, Thread (computing), Ordinadors paral·lels, Many core, Runtime, Programming paradigm, Parallel processing, Informàtica::Arquitectura de computadors [Àrees temàtiques de la UPC]
- Abstract
Parallel task-based programming models like OpenMP support the declaration of task data dependences. This information is used to delay a task's execution until its data is available. The dependences between tasks are calculated at runtime using shared graphs that are updated concurrently by all threads. However, only one thread can modify the task graph at a time to ensure correctness; the others must wait before making their modifications. This waiting limits the application's parallelism and becomes critical in many-core systems. This paper characterizes this behavior, analyzing how it hinders performance, and presents an alternative organization suitable for the runtimes of task-based programming models. This organization allows the runtime structures to be managed asynchronously or synchronously, adapting the runtime to reduce the waste of computation resources and increase performance. Results show that the new runtime structure outperforms the peak speedup of the original runtime model when contention is high, and achieves similar or better performance for real applications. This work is partially supported by the European Union H2020 Research and Innovation Action through the Mont-Blanc 3 project (GA 671697) and HiPEAC (GA 687698), by the Spanish Government (projects SEV-2015-0493 and TIN2015-65316-P), and by the Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272).
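The task graph in question is built from dependence clauses such as the following (a minimal OpenMP example, not the paper's runtime): every task creation must insert the task into the shared graph, which is the serialization point the paper attacks.

    /* OpenMP task data dependences: the runtime derives the task graph
       from these in/out clauses, and concurrent updates to that shared
       graph are the contention point analyzed in the paper. */
    #include <stdio.h>

    int main(void)
    {
        int x = 0;
        #pragma omp parallel
        #pragma omp single
        {
            #pragma omp task depend(out: x)
            x = 1;                            /* producer task */
            #pragma omp task depend(in: x)
            printf("consumer sees %d\n", x);  /* ordered after the producer */
        }
        return 0;
    }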
308. Task-parallel reductions in OpenMP and OmpSs
- Author
-
Jesús Labarta, Sergi Mateo, Xavier Teruel, Jan Ciesko, Vicenç Beltran, Xavier Martorell, Rosa M. Badia, Eduard Ayguadé, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions, Ministerio de Economía y Competitividad (España), and Generalitat de Catalunya
- Subjects
Recursion, Parallel processing (Electronic computers), Computer science, Processament en paral·lel (Ordinadors), Parallel programming, OpenMP, Parallel computing, Task (project management), Reduction (complexity), OmpSs, Computer architecture, Parallel processing (DSP implementation), Scalability, Parallel programming model, Recursive algorithms, Task, Benchmark (computing), Informàtica::Arquitectura de computadors::Arquitectures paral·leles [Àrees temàtiques de la UPC], Reduction
- Abstract
The wide adoption of parallel processing hardware in mainstream computing, as well as the rising interest in efficient parallel programming within the developer community, increases the demand for programming-model support for common algorithmic patterns. In this work we present an extension to the OpenMP task construct to add support for reductions in while-loops and general recursive algorithms. Furthermore, we discuss implications on the OpenMP standard and present a prototype implementation in OmpSs. Benchmark results confirm the applicability of this approach and its scalability on current SMP systems. This work has been developed with the support of grant SEV-2011-00067 of the Severo Ochoa Program, awarded by the Spanish Government, by the Spanish Ministry of Science and Innovation (contracts TIN2012-34557 and CAC2007-00052), by the Generalitat de Catalunya (contract 2009-SGR-980) and by the Intel-BSC Exascale Lab collaboration project. The authors would also like to thank the OpenMP community for their substantial contribution to this work.
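This line of work prefigured the task reductions that later entered OpenMP 5.0; in today's standard syntax, the while-loop pattern the abstract mentions looks roughly like this (a sketch in OpenMP 5.0 syntax, not the OmpSs prototype).

    /* Task-parallel reduction over a while-loop iteration space, in the
       OpenMP 5.0 syntax (taskgroup task_reduction / in_reduction). */
    #include <stdio.h>

    int main(void)
    {
        int sum = 0;
        int data[100];
        for (int i = 0; i < 100; i++) data[i] = i;

        #pragma omp parallel
        #pragma omp single
        {
            #pragma omp taskgroup task_reduction(+: sum)
            {
                int i = 0;
                while (i < 100) {            /* irregular iteration space */
                    #pragma omp task in_reduction(+: sum) firstprivate(i)
                    sum += data[i];
                    i++;
                }
            }
        }
        printf("sum = %d\n", sum);  /* 4950 */
        return 0;
    }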
309. Implementing MPI on the BlueGene/L supercomputer
- Author
-
Almási, G., Archer, C., Castaños, J. G., Erway, C. C., Heidelberger, P., Xavier Martorell, Moreira, J. E., Pinnow, K., Ratterman, J., Smeds, N., Steinmacher-Burow, B., Gropp, W., and Toonen, B.
310. Evaluation of the Memory Page Migration Influence in the System Performance: The Case of the SGI O2000
- Author
-
Xavier Martorell, Julita Corbalan, and Jesús Labarta
- Subjects
Distributed shared memory, Flat memory model, Page fault, Computer science, Interleaved memory, Operating system, Uniform memory access, Registered memory, Distributed memory, Parallel computing, Memory map
- Abstract
Current shared-memory CC-NUMA multiprocessor architectures provide a global address space to applications in hardware. However, even though the memory is virtually shared, it is physically distributed. Since memory nodes are distributed across the system, the cost of a memory access depends on the distance between the node that accesses the data and the node that physically contains the data. To reduce the impact of a bad initial memory placement, some operating systems offer a dynamic memory migration mechanism. In this paper, we want to demonstrate that memory migration mechanisms are a useful approach, but that their performance depends more on related issues, such as processor scheduling, than on the mechanism itself. To show this, we evaluate the automatic memory migration mechanism provided by IRIX on Origin systems. We have evaluated several workloads of OpenMP applications under different system conditions, such as the processor scheduling policy and the system load. In particular, we have focused on the effects of the page migration mechanism on the CPU time consumed by each application, the processor allocation received, and the speedup, when applying performance-driven scheduling policies. Results show that, if the scheduler is memory-conscious, that is, if it keeps the system as stable as possible, the automatic memory page migration mechanism provided by IRIX improves the execution time of OpenMP applications. Experiments also show that the combination of performance-driven policies and the memory migration mechanism results in a system that can automatically evaluate and configure itself.
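Why placement matters: on most NUMA systems a page is placed on the node that first touches it, so how threads initialize data, and where the scheduler later runs them, determines how much work is left for page migration to fix. The example below is a generic first-touch illustration, not IRIX-specific code.

    /* First-touch placement: initializing in parallel makes each thread's
       pages land on (or near) the node where that thread runs. If the
       scheduler later moves threads, this is the placement that a page
       migration mechanism must correct. Generic example, not IRIX code. */
    #include <omp.h>
    #include <stdlib.h>

    #define N (1 << 22)

    int main(void)
    {
        double *a = malloc(N * sizeof(double));

        #pragma omp parallel for schedule(static)
        for (long i = 0; i < N; i++)
            a[i] = 0.0;          /* first touch decides each page's node */

        #pragma omp parallel for schedule(static)
        for (long i = 0; i < N; i++)
            a[i] = a[i] + 1.0;   /* same static schedule: mostly local pages */

        free(a);
        return 0;
    }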
311. Improving Gang scheduling through job performance analysis and malleability
- Author
-
Julita Corbalan, Jesús Labarta, and Xavier Martorell
- Subjects
Malleability, Exploit, Computer science, Job performance, Distributed computing, Programming paradigm, User requirements document, Gang scheduling, Scheduling (computing)
- Abstract
The OpenMP programming model provides parallel applications with a very important feature: job malleability. Job malleability is the capacity of an application to dynamically adapt its parallelism to the number of processors allocated to it. We believe that job malleability gives applications the flexibility a system needs to achieve its maximum performance. We also argue that a system has to take its decisions based not only on user requirements but also on run-time performance measurements, to ensure the efficient use of resources. Job malleability is the application characteristic that makes run-time performance analysis possible: without malleability, applications would not be able to adapt their parallelism to the system's decisions. To support these ideas, we present two new approaches that attack the two main problems of Gang Scheduling: the excessive number of time slots and fragmentation. Our first proposal is to apply a scheduling policy inside each time slot of Gang Scheduling that distributes processors among applications considering their efficiency, calculated from run-time measurements. We call this policy Performance-Driven Gang Scheduling. Our second approach is a new re-packing algorithm, Compress&Join, that exploits job malleability. This algorithm modifies the processor allocation of running applications to adapt it to the system's needs and to minimize fragmentation and the number of time slots. These proposals have been implemented on an SGI Origin 2000 with 64 processors. Results show the validity and benefit both of using run-time job performance analysis to decide the processor allocation and of using a flexible programming model that adapts applications to system decisions.
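In code, malleability reduces to the application re-querying its allocation and resizing its parallel regions accordingly; the sketch below illustrates the idea, with get_allocated_processors() standing in for whatever interface the scheduler exposes (a hypothetical name).

    /* Malleability as the paper uses the term: the application adapts its
       parallelism to whatever processor count the scheduler grants it. */
    #include <omp.h>
    #include <stdio.h>

    /* Hypothetical stand-in for the scheduler's allocation interface. */
    static int get_allocated_processors(void) { return 4; }

    int main(void)
    {
        for (int phase = 0; phase < 3; phase++) {
            omp_set_num_threads(get_allocated_processors());
            #pragma omp parallel
            {
                #pragma omp single
                printf("phase %d running on %d threads\n",
                       phase, omp_get_num_threads());
            }
        }
        return 0;
    }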
312. Combining static and dynamic data coalescing in Unified Parallel C
- Author
-
Michail Alvanos, Ettore Tiotto, Xavier Martorell, José Nelson Amaral, Montse Farreras, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions
- Subjects
Computer science, One-sided communication, Parallel computing, C (Llenguatge de programació), Unified Parallel C, Code (cryptography), Code generation, Partitioned global address space, Informàtica::Arquitectura de computadors::Arquitectures paral·leles [Àrees temàtiques de la UPC], Informàtica::Arquitectura de computadors::Arquitectures distribuïdes [Àrees temàtiques de la UPC], Distributed database, Parallel processing (Electronic computers), Dynamic data, Processament en paral·lel (Ordinadors), Unified parallel C, Supercomputer, Computational Theory and Mathematics, Hardware and Architecture, C (Computer program language), Signal Processing, Performance evaluation
- Abstract
Significant progress has been made in the development of programming languages and tools suitable for hybrid computer architectures that group several shared-memory multicores interconnected through a network. This paper addresses important limitations in code generation for partitioned global address space (PGAS) languages. These languages allow fine-grained communication and lead to programs that perform many fine-grained accesses to data. When the data is distributed to remote computing nodes, code transformations are required to prevent performance degradation. Until now, code transformations for PGAS programs have been restricted to cases where both the physical mapping of the data and the number of processing nodes are known at compilation time. In this paper, a novel application of the inspector-executor model overcomes these limitations and allows profitable code transformations, resulting in fewer and larger messages sent through the network, when neither the data mapping nor the number of processing nodes is known at compilation time. A performance evaluation reports both scaling and absolute performance numbers on up to 32,768 cores of a Power 775 supercomputer. This evaluation indicates that the compiler transformation yields speedups between 1.15× and 21× over a baseline, and that these automated transformations achieve up to 63 percent of the performance of the MPI versions.
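The inspector-executor model the paper applies can be illustrated in plain C: an inspector pass records which remote elements a loop will touch, a single coalesced transfer fetches them, and the executor runs on local copies. The remote array and fetch function below are local stand-ins invented for illustration; real UPC code would go through the PGAS runtime.

    /* Sketch of the inspector-executor idea: record the access pattern,
       fetch it in one coalesced message instead of many fine-grained
       ones, then execute on the local copies. All names are invented. */
    #include <stdio.h>

    #define N 16

    static int remote[N];                 /* pretend this lives off-node */

    static void coalesced_fetch(int *dst, const int *idx, int n)
    {
        for (int i = 0; i < n; i++)       /* one bulk transfer in reality */
            dst[i] = remote[idx[i]];
    }

    int main(void)
    {
        int idx[N], buf[N], sum = 0;

        for (int i = 0; i < N; i++) remote[i] = i * i;

        /* Inspector: record the access pattern without touching data. */
        int n = 0;
        for (int i = 0; i < N; i += 2) idx[n++] = i;

        /* Single coalesced transfer replaces n fine-grained accesses. */
        coalesced_fetch(buf, idx, n);

        /* Executor: run the loop on the prefetched local copies. */
        for (int i = 0; i < n; i++) sum += buf[i];

        printf("sum = %d\n", sum);
        return 0;
    }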
313. OmpSs@Zynq All-Programmable SoC Ecosystem
- Author
-
Antonio Filgueras, Kees Vissers, Jan Langer, Juanjo Noguera, Xavier Martorell, Eduard Gil, Daniel Jiménez-González, Carlos Alvarez, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, and Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions
- Subjects
Matrius de portes programables per l'usuari, Exploit, Computer science, Heterogeneous parallel programming model, Field programmable gate arrays, Chip, Task (computing), Resource (project management), Computer architecture, Embedded system, Programming paradigm, Field-programmable gate array, Task dataflow models, Informàtica::Arquitectura de computadors [Àrees temàtiques de la UPC], Automatic hardware generation
- Abstract
OmpSs is an OpenMP-like directive-based programming model that includes heterogeneous execution (MIC, GPU, SMP, etc.) and runtime management of task dependences. Indeed, OmpSs has largely influenced the recently released OpenMP 4.0 specification. The Zynq All-Programmable SoC combines the features of an SMP and an FPGA, and benefits from DLP, ILP and TLP parallelism to efficiently exploit new technology improvements and chip resource capacities. In this paper, we focus on programmability and heterogeneous execution support, presenting a successful combination of the OmpSs programming model and the Zynq All-Programmable SoC platforms.
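In OmpSs, targeting the FPGA fabric is a matter of annotating a task, as in the sketch below. The clause syntax is approximated from published OmpSs@FPGA examples and varies across OmpSs versions, and the code is compiled with the Mercurium toolchain rather than a stock C compiler, so treat it as a sketch only.

    /* Approximate OmpSs syntax (a sketch; clause details differ across
       versions): the same task annotation can target the Zynq FPGA
       fabric, with the toolchain generating the accelerator and the
       runtime handling data movement and task dependences. */
    #define BS 64

    #pragma omp target device(fpga) copy_deps
    #pragma omp task in([BS*BS]a, [BS*BS]b) inout([BS*BS]c)
    void matmul_block(const float *a, const float *b, float *c)
    {
        for (int i = 0; i < BS; i++)
            for (int j = 0; j < BS; j++)
                for (int k = 0; k < BS; k++)
                    c[i*BS + j] += a[i*BS + k] * b[k*BS + j];
    }

    /* Caller side: the annotated call is spawned as an FPGA task by the
       runtime and synchronized with taskwait. */
    void matmul(const float *a, const float *b, float *c)
    {
        matmul_block(a, b, c);
        #pragma omp taskwait
    }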
314. Runtime address space computation for SDSM systems
- Author
-
Balart, J., Gonzàlez, M., Xavier Martorell, Ayguadé, E., and Labarta, J.
315. 29th International Conference on Field Programmable Logic and Applications, FPL 2019, Barcelona, Spain, September 8-12, 2019
- Author
-
Ioannis Sourdis, Christos-Savvas Bouganis, Carlos Álvarez, Leonel Antonio Toledo Díaz, Pedro Valero-Lara, and Xavier Martorell
- Published
- 2019
316. Evolving OpenMP for Evolving Architectures - 14th International Workshop on OpenMP, IWOMP 2018, Barcelona, Spain, September 26-28, 2018, Proceedings
- Author
-
Bronis R. de Supinski, Pedro Valero-Lara, Xavier Martorell, Sergi Mateo Bellido, and Jesús Labarta
- Published
- 2018
317. Parallel scalability of face detection in heterogeneous multithreaded architectures
- Author
-
Oro García, David, Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions, Hernando Pericás, Francisco Javier, and Martorell Bofill, Xavier
- Abstract
Recently, facial recognition systems have become extremely popular and deployments of this technology are now ubiquitous. Applications ranging from access control to automated surveillance of video feeds rely on facial recognition for precisely identifying persons at multiple locations. Modern facial recognition software targeting surveillance applications typically needs to analyze video streams in order to identify faces in crowds in real time. The first analytical step in a facial recognition system is face detection, which mainly involves determining the precise coordinates and dimensions of all faces appearing in a given image or video frame, and constitutes the first major bottleneck in the pipeline. As opposed to other use cases, such as image classification, that usually work flawlessly with VGA images, surveillance applications require working with high or ultra-high-definition resolutions in order to locate and correctly identify people in crowds. Consequently, to maximize the chances of obtaining facial mugshots with enough quality and pixel density to enable accurate facial identification, it is essential to develop algorithms and heuristics capable of working with large images. The main challenge is to perform all required computations in just a few milliseconds, to avoid slowing down all subsequent stages of the facial recognition pipeline. In this thesis, we study several low-level parallelization techniques and kernels that efficiently solve the problem of face detection in a scalable manner on multithreaded data-parallel GPU architectures. The first part of the thesis covers a multilevel mechanism that exploits both coarse-grained and fine-grained parallelism, in combination with a smart usage of local on-die memories, to reduce GPU underutilization when evaluating boosted cascades of ensembles over high-definition videos. We demonstrate that our proposed parallelization strategy solves the problem of GPU underutilization and achieves a 5X speedup compared to methods relying on serialized kernel execution. The second part of the thesis presents a heuristic and a hybrid framework combining hand-crafted features with state-of-the-art convolutional neural networks to address the problem of real-time face detection in videos at ultra-high-definition resolutions (4K and 8K). The results obtained prove that our proposed heuristic is capable of achieving real-time throughput on challenging video datasets, combining binarized hand-crafted features to discard regions not containing faces with neural networks that further refine the underlying face detection process. The third part of the thesis presents a novel parallel non-maximum suppression (NMS) algorithm targeting the on-die GPU architectures included in modern SoCs. The contributed algorithm relies on a boolean matrix and parallel reductions to handle workloads featuring thousands of simultaneous detections in a given picture or video frame. Finally, we demonstrate, both formally and experimentally, that the execution time of our proposed parallel NMS algorithm scales linearly as the number of GPU cores increases.
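The boolean-matrix formulation of NMS mentioned in the abstract can be shown in a small serial C sketch: entry (i, j) records whether detection i suppresses detection j, and keeping a box reduces to an OR over its column. The thesis evaluates the matrix and the reductions in parallel on the GPU; the boxes, scores and threshold below are made up.

    /* Serial sketch of boolean-matrix NMS: mark pairs whose overlap
       exceeds a threshold, then keep a box only if no higher-scoring
       box suppresses it (a column-wise OR reduction). */
    #include <stdbool.h>
    #include <stdio.h>

    struct box { float x, y, w, h, score; };

    static float iou(struct box a, struct box b)
    {
        float x1 = a.x > b.x ? a.x : b.x;
        float y1 = a.y > b.y ? a.y : b.y;
        float x2 = (a.x + a.w) < (b.x + b.w) ? (a.x + a.w) : (b.x + b.w);
        float y2 = (a.y + a.h) < (b.y + b.h) ? (a.y + a.h) : (b.y + b.h);
        float iw = x2 > x1 ? x2 - x1 : 0, ih = y2 > y1 ? y2 - y1 : 0;
        float inter = iw * ih;
        return inter / (a.w * a.h + b.w * b.h - inter);
    }

    #define N 3

    int main(void)
    {
        struct box d[N] = { {0,0,10,10,0.9f}, {1,1,10,10,0.8f},
                            {50,50,8,8,0.7f} };
        bool suppressed[N] = { false };

        /* Boolean matrix entry (i, j): does box i suppress box j? */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                if (i != j && d[i].score > d[j].score &&
                    iou(d[i], d[j]) > 0.5f)
                    suppressed[j] = true;   /* column-wise OR reduction */

        for (int i = 0; i < N; i++)
            if (!suppressed[i])
                printf("keep box %d (score %.1f)\n", i, d[i].score);
        return 0;
    }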
- Published
- 2020
318. Energy optimising methodologies on heterogeneous data centres
- Author
-
Nishtala, Rajiv, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Martorell Bofill, Xavier, and Mosse, Daniel
- Subjects
Informàtica [Àrees temàtiques de la UPC], Energia -- Estalvi, Centres informàtics
- Abstract
In 2013, U.S. data centres accounted for 2.2% of the country's total electricity consumption, a figure that is projected to increase rapidly over the next decade. A significant proportion of the power consumed within a data centre is attributed to the servers, and a large percentage of that is wasted as workloads compete for shared resources. Many data centres host interactive workloads (e.g., web search or e-commerce), for which it is critical to meet user expectations, a requirement known as Quality of Service (QoS). There is also a desire to run both interactive and batch workloads on the same infrastructure to increase cluster utilisation and reduce operational costs and total energy consumption. Although much work has focused on the impact of shared resource contention, maintaining QoS for both interactive and batch workloads remains a major problem. The goal of this thesis is twofold. First, to investigate, via modelling, how and to what extent resource contention affects the throughput and power of batch workloads. Second, to introduce a scheduling approach that determines on the fly the best configuration to satisfy the QoS of latency-critical jobs on any architecture. To achieve these goals, we first propose a modelling technique to estimate server performance and power at runtime, called Runtime Estimation of Performance and Power (REPP). REPP's goal is to give administrators control over the power and performance of processors. REPP achieves this goal by estimating performance and power across multiple hardware settings (dynamic voltage and frequency scaling (DVFS) states, core consolidation and idle states) and dynamically applying these settings based on user-defined constraints. The hardware counters required to build the models are available across architectures, making the approach architecture-agnostic. We also argue that traditional modelling and scheduling strategies are ineffective for interactive workloads. To manage such workloads, we propose Hipster, which combines a heuristic and a reinforcement learning algorithm. Hipster's goal is to improve resource efficiency while respecting the QoS of interactive workloads. Hipster achieves its goal by exploring the multicore system and DVFS. To improve utilisation and make the best use of the available resources, Hipster can dynamically assign remaining cores to batch workloads without violating the QoS constraints of the interactive workloads. We implemented REPP and Hipster on real platforms, namely commercial 64-bit hardware (Intel Sandy Bridge and AMD Phenom II X4 B97) and experimental hardware (ARM big.LITTLE Juno R1). Extensive experimental results show that REPP successfully estimates the power and performance of several single-threaded and multiprogrammed workloads. The average errors on the Intel, AMD and ARM architectures are, respectively, 7.1%, 9.0% and 7.1% when predicting performance, and 8.1%, 6.5% and 6.0% when predicting power. Similarly, we show that, compared to prior work, Hipster improves the QoS guarantee for Web-Search from 80% to 96%, and for Memcached from 92% to 99%, while reducing energy consumption by up to 18% on the ARM architecture.
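A hypothetical sketch of what a REPP-style linear model looks like: per-interval hardware-counter rates combined with per-DVFS-state coefficients obtained by offline regression. The counters and coefficients below are invented; the thesis derives the real ones per architecture.

    /* Hypothetical sketch of a runtime power model in the spirit of
       REPP: a linear combination of counter rates with one coefficient
       row per DVFS state. All numbers and names are invented. */
    #include <stdio.h>

    struct counters { double ipc; double mem_rate; double util; };

    static double estimate_power(struct counters c, int dvfs_state)
    {
        /* One coefficient row per DVFS state (made-up values that a
           real system would fit by regression per architecture). */
        static const double coef[2][4] = {
            /* base, ipc,  mem,  util */
            { 10.0,  3.0,  5.0,  12.0 },   /* low frequency  */
            { 14.0,  4.5,  7.5,  20.0 },   /* high frequency */
        };
        const double *k = coef[dvfs_state];
        return k[0] + k[1] * c.ipc + k[2] * c.mem_rate + k[3] * c.util;
    }

    int main(void)
    {
        struct counters sample = { 1.2, 0.4, 0.8 };
        printf("estimated power: %.1f W (low), %.1f W (high)\n",
               estimate_power(sample, 0), estimate_power(sample, 1));
        return 0;
    }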
- Published
- 2017