22 results on '"Multi-threading"'
Search Results
2. Adaptive Manta Ray Foraging Optimizer for Determining Optimal Thread Count in Multi-threaded Applications.
- Author
-
MALAVE, SACHIN H. and SHINDE, SUBHASH K.
- Subjects
MOBULIDAE, MULTICORE processors, BIOLOGICALLY inspired computing, ENERGY consumption, PREDICTION models - Abstract
In high-performance computing, selecting the appropriate thread count has a significant impact on execution time and energy consumption. On multi-core processor systems, it is widely believed that for maximal speedup, the total number of threads should match the number of cores. Thread migration rate, cache miss rate, thread synchronisation, and context switching rate are all impacted by changes in thread count at the hardware and OS levels. As a result, it is extremely difficult to analyse these factors for relatively complex multi-threaded programs and determine the optimal number of threads. The method put forward in this study is an enhancement of the conventional Manta-Ray Foraging Optimization, a bio-inspired approach that has been applied to a number of numerical engineering problems. The proposed approach makes use of three foraging steps: chain, cyclone, and somersault. Using the well-known benchmark suite PARSEC (The Princeton Application Repository for Shared-Memory Computers), the suggested work is simulated on an NVIDIA-DGX Intel Xeon-E5 2698-v4 processor. The findings demonstrate that the new modified AMRFO-based prediction model can choose the appropriate number of threads with fairly minimal overheads when compared to the current method. [ABSTRACT FROM AUTHOR]
- Published
- 2022
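The brute-force baseline this abstract contrasts with — timing the program at several candidate thread counts and keeping the fastest — can be sketched as follows. This is a minimal illustration with a placeholder workload, not the paper's AMRFO predictor:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def kernel(n):
    # Dummy CPU-bound chunk standing in for one thread's share of real work
    s = 0
    for i in range(n):
        s += i * i
    return s

def run_with_threads(num_threads, total_work=200_000):
    # Split a fixed amount of work evenly across num_threads workers
    share = total_work // num_threads
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        list(pool.map(kernel, [share] * num_threads))

def best_thread_count(candidates=(1, 2, 4, 8)):
    # Time each candidate and keep the fastest; an optimizer such as the
    # paper's AMRFO would search this space far more cheaply
    timings = {}
    for t in candidates:
        start = time.perf_counter()
        run_with_threads(t)
        timings[t] = time.perf_counter() - start
    return min(timings, key=timings.get)
```

The point of the paper is precisely that such exhaustive timing sweeps are too expensive for complex programs, which motivates a predictive model instead.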
3. Parallel best-first search algorithms for planning problems on multi-core processors.
- Author
-
El Baz, Didier, Fakih, Bilal, Sanchez Nigenda, Romeo, and Boyer, Vincent
- Subjects
- *
SEARCH algorithms, *MULTICORE processors, *ARTIFICIAL intelligence, *PARALLEL algorithms, *INTERNATIONAL competition, *PARALLEL programming - Abstract
The multiplication of computing cores in modern processor units permits revisiting the design of classical algorithms to improve computational performance in complex application domains. Artificial Intelligence planning is one of those applications where large search spaces require intelligent and more exhaustive search control. In this paper, parallel planning algorithms, derived from best-first search, are proposed for shared memory architectures. The parallel algorithms, based on the asynchronous work pool paradigm, maintain good thread occupancy in multi-core CPUs. All algorithms use one ordered global list of states stored in shared memory from where they select nodes for expansion. A parallel best-first search algorithm that develops new states with depth equal to one is proposed first. Then, we propose an extension of this parallel algorithm that features a diversification strategy in order to escape local minima. We study and analyse a set of computational experiments for problems that come from the International Planning Competition and real-world industry applications. The empirical evaluation shows that the parallel algorithms solve most of the domains efficiently without incurring higher solution costs. For the problems with only partial results, we highlight the shortcomings of the proposed approaches and point to promising future directions. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
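The asynchronous work-pool scheme described above — worker threads sharing one ordered open list and expanding nodes to depth one — can be sketched as follows. This is a minimal illustration; `neighbors` and the heuristic `h` are assumed callables, and a real planner would add the paper's diversification strategy:

```python
import heapq
import threading

def parallel_best_first(start, goal, neighbors, h, num_workers=4):
    """Work-pool best-first search: one shared open list, many worker threads."""
    open_list = [(h(start), start)]
    seen = {start}
    lock = threading.Lock()
    found = threading.Event()

    def worker():
        while not found.is_set():
            with lock:
                if not open_list:
                    return                      # no work left for this thread
                _, node = heapq.heappop(open_list)
            if node == goal:
                found.set()
                return
            for nxt in neighbors(node):         # expand to depth one
                with lock:
                    if nxt not in seen:
                        seen.add(nxt)
                        heapq.heappush(open_list, (h(nxt), nxt))

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return found.is_set()
```

The single lock around the global list is the simplest correct choice; the contention it creates is exactly why such designs are evaluated carefully on multi-core CPUs.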
4. Efficient parallelisation of the packet classification algorithms on multi‐core central processing units using multi‐threading application program interfaces.
- Author
-
Abbasi, Mahdi and Rafiee, Milad
- Abstract
The categorisation of network packets according to multiple parameters such as sender and receiver addresses is called packet classification. Packet classification lies at the core of Software‐Defined Networking (SDN)‐based network applications. Due to the increasing speed of network traffic, there is an urgent need for packet classification at higher speeds. Although it is possible to accelerate packet classification algorithms through hardware implementation, this solution imposes high costs and offers limited development capacity. On the other hand, current software methods to solve this problem are relatively slow. A practical solution to this problem is to parallelise packet classification using multi‐core processors. In this study, the Thread, parallel patterns library (PPL), open multi‐processing (OpenMP), and threading building blocks (TBB) libraries are examined and implemented to parallelise three packet classification algorithms, i.e. tuple space search, tuple pruning search, and hierarchical tree. According to the results, the type of algorithm and rulesets may influence the performance of parallelisation libraries. In general, the TBB‐based method shows the best performance among parallelisation libraries due to its work‐stealing mechanism, and can accelerate the classification process up to 8.3 times on a system with a quad‐core processor. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
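A simplified data-parallel version of the idea — partitioning a packet batch across threads, each classifying its chunk — might look like this. The two-field prefix rules are hypothetical, and a plain linear search stands in for the paper's tuple-space and tree algorithms and the TBB/PPL/OpenMP libraries:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical two-field rules: (source prefix, destination prefix)
RULES = [("10.", "192.168."), ("10.", ""), ("", "")]

def classify(packet, rules=RULES):
    """Linear search: return the index of the first matching rule."""
    src, dst = packet
    for i, (sp, dp) in enumerate(rules):
        if src.startswith(sp) and dst.startswith(dp):
            return i
    return -1

def classify_parallel(packets, num_threads=4):
    """Split the packet batch into contiguous chunks, one per thread."""
    size = -(-len(packets) // num_threads)  # ceiling division
    chunks = [packets[i:i + size] for i in range(0, len(packets), size)]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        results = []
        for part in pool.map(lambda c: [classify(p) for p in c], chunks):
            results.extend(part)
    return results
```

Chunking keeps results in input order and gives each thread a batch large enough to amortise scheduling overhead, which is the same trade-off the compared libraries manage internally.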
5. Cooperative scheduling of multi-core and cloud resources: fine-grained offloading strategy for multithreaded applications.
- Author
-
Wang, Zhaoyang, Hao, Wanming, Yan, Lei, Han, Zhuo, and Yang, Shouyi
- Subjects
- *
CENTRAL processing units, *MOBILE computing, *ENERGY consumption, *SCHEDULING, *POWER aware computing, *MULTICORE processors, *COMPUTER scheduling - Abstract
Nowadays, advanced smart mobile devices are equipped with multi-core central processing units for handling multithreaded (MT) applications. However, existing research mainly uses single-thread (ST) computing to deal with applications, which limits the performance of mobile computing. To make full use of multi-core resources, this study proposes a fine-grained MT offloading strategy to solve the offloading problem of MT applications. The strategy jointly schedules cloud computing resources, as well as local multi-core computing and communication resources. More precisely, the authors first formulate the minimum energy consumption problem for ST offloading. Then, they prove that the problem is convex and solve it by a standard convex optimisation technique. Thirdly, they extend the optimisation goals from ST applications to MT applications, and design calculation rules for MT applications to reduce computing costs. Finally, based on these calculation rules and the optimal solution for ST offloading, they develop an MT offloading strategy to solve the computation offloading problem of MT applications. Simulation results show that the proposed fine-grained MT offloading strategy effectively reduces the minimum delay requirement of mobile computing. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
6. Synergy: An HW/SW Framework for High Throughput CNNs on Embedded Heterogeneous SoC.
- Author
-
ZHONG, GUANWEN, DUBEY, AKSHAT, TAN, CHENG, and MITRA, TULIKA
- Subjects
MULTICORE processors, SYSTEMS on a chip, WIRELESS Internet, INTERNET of things, ARM microprocessors, HETEROGENEOUS computing - Abstract
Convolutional Neural Networks (CNN) have been widely deployed in diverse application domains. There has been significant progress in accelerating both their training and inference using high-performance GPUs, FPGAs, and custom ASICs for datacenter-scale environments. The recent proliferation of mobile and Internet of Things (IoT) devices has necessitated real-time, energy-efficient deep neural network inference on embedded-class, resource-constrained platforms. In this context, we present Synergy, an automated, hardware-software co-designed, pipelined, high-throughput CNN inference framework on embedded heterogeneous system-on-chip (SoC) architectures (Xilinx Zynq). Synergy leverages, through multi-threading, all the available on-chip resources, which include the dual-core ARM processor along with the FPGA and the NEON Single-Instruction Multiple-Data (SIMD) engines as accelerators. Moreover, Synergy provides a unified abstraction of the heterogeneous accelerators (FPGA and NEON) and can adapt to different network configurations at runtime without changing the underlying hardware accelerator architecture by balancing workload across accelerators through work-stealing. Synergy achieves a 7.3X speedup, averaged across seven CNN models, over a well-optimized software-only solution. Synergy demonstrates substantially better throughput and energy-efficiency compared to contemporary CNN implementations on the same SoC architecture. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
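The work-stealing idea Synergy uses to balance load across accelerators can be sketched in miniature with per-worker deques. This is a generic illustration of the scheduling pattern, not Synergy's hardware/software implementation:

```python
import threading
from collections import deque

def run_with_stealing(tasks, num_workers=2):
    """Each worker owns a deque; an idle worker steals from a victim's tail."""
    queues = [deque() for _ in range(num_workers)]
    for i, task in enumerate(tasks):
        queues[i % num_workers].append(task)   # initial static split
    lock = threading.Lock()
    results = []

    def worker(wid):
        while True:
            job = None
            with lock:
                if queues[wid]:
                    job = queues[wid].popleft()   # own queue: take from front
                else:
                    for q in queues:              # otherwise steal from any victim
                        if q:
                            job = q.pop()         # victim queue: take from back
                            break
            if job is None:
                return                            # nothing left anywhere
            out = job()
            with lock:
                results.append(out)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Stealing from the opposite end of the victim's queue keeps the owner and the thief from contending for the same tasks, which is why the pattern adapts well to workers of unequal speed such as FPGA and NEON engines.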
7. Two-sided orthogonal reductions to condensed forms on asymmetric multicore processors.
- Author
-
Alonso, Pedro, Catalán, Sandra, Herrero, José R., Quintana-Ortí, Enrique S., and Rodríguez-Sánchez, Rafael
- Subjects
- *
MULTICORE processors, *KERNEL operating systems, *INFORMATION asymmetry, *SINGULAR value decomposition, *MATHEMATICAL simplification - Abstract
We investigate how to leverage the heterogeneous resources of an Asymmetric Multicore Processor (AMP) in order to deliver high performance in the reduction to condensed forms for the solution of dense eigenvalue and singular-value problems. The routines that realize this type of two-sided orthogonal reductions (TSOR) in LAPACK are especially challenging, since a significant fraction of their floating-point operations are cast in terms of memory-bound kernels while the remaining part corresponds to efficient compute-bound kernels. To deal with this scenario: (1) we leverage implementations of memory-bound and compute-bound kernels specifically tuned for AMPs; (2) we select the algorithmic block size for the TSOR routines via a practical model; and (3) we adjust the type and number of cores to use at each step of the reduction. Our experiments validate the model and assess the performance of our asymmetry-aware TSOR routines, using an ARMv7 big.LITTLE AMP, for three key operations: the reduction to tridiagonal form for symmetric eigenvalue problems, the reduction to Hessenberg form for non-symmetric eigenvalue problems, and the reduction to bidiagonal form for singular-value problems. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
8. An efficient approach for mining sequential patterns using multiple threads on very large databases.
- Author
-
Huynh, Bao, Trinh, Cuong, Huynh, Huy, Snasel, Vaclav, Van, Thien-Trang, and Vo, Bay
- Subjects
- *
SEQUENTIAL pattern mining, *DATA mining, *MULTICORE processors, *SIMULTANEOUS multithreading processors, *DATA structures - Abstract
Sequential pattern mining (SPM) plays an important role in data mining, with broad applications such as in financial markets, education, medicine, and prediction. Although there are many efficient algorithms for SPM, the mining time is still high, especially for mining sequential patterns from huge databases, which requires the use of a parallel technique. In this paper, we propose a parallel approach named MCM-SPADE (Multiple threads CM-SPADE), for use on a multi-core processor system as a multi-threading technique for SPM with very large databases, to enhance the performance of the previous methods SPADE and CM-SPADE. The proposed algorithm uses the vertical data format and a data structure named CMAP (Co-occurrence MAP) for storing co-occurrence information. Based on the CMAP data structure, the proposed algorithm performs early pruning of the candidates to reduce the search space, and it partitions the related tasks to each processor core by using the divide-and-conquer property. The proposed algorithm also uses dynamic scheduling to avoid task idling and achieve load balancing between processor cores. The experimental results show that MCM-SPADE attains good parallelization efficiency on various input databases. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
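The co-occurrence-map pruning described above can be sketched as follows. This is a simplified illustration of the CMAP idea only; MCM-SPADE's vertical data format, task partitioning, and dynamic scheduling are not reproduced:

```python
from collections import defaultdict

def build_cmap(sequences):
    """Co-occurrence map: item -> set of items appearing after it in some sequence."""
    cmap = defaultdict(set)
    for seq in sequences:
        for i, a in enumerate(seq):
            for b in seq[i + 1:]:
                cmap[a].add(b)
    return cmap

def prune_extensions(last_item, candidates, cmap):
    """Keep only candidates that ever follow last_item; others cannot extend it."""
    return [c for c in candidates if c in cmap[last_item]]
```

Because an item that never follows `last_item` in any sequence cannot appear in a frequent extension, this check discards candidates before any expensive support counting.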
9. Static scheduling of the LU factorization with look-ahead on asymmetric multicore processors.
- Author
-
Catalán, Sandra, Herrero, José R., Quintana-Ortí, Enrique S., and Rodríguez-Sánchez, Rafael
- Subjects
- *
LU factorization, *MULTICORE processors, *SYSTEMS on a chip, *PARALLEL programs (Computer programs), *COMPUTER systems - Abstract
We analyze the benefits of look-ahead in the parallel execution of the LU factorization with partial pivoting (LUpp) in two distinct “asymmetric” multicore scenarios. The first one corresponds to an actual hardware-asymmetric architecture such as the Samsung Exynos 5422 system-on-chip (SoC), equipped with an ARM big.LITTLE processor consisting of a quad-core Cortex-A15 cluster plus a quad-core Cortex-A7 cluster. For this scenario, we propose a careful mapping of the different types of tasks appearing in LUpp to the computational resources, in order to produce an efficient architecture-aware exploitation of the computational resources integrated in this SoC. The second asymmetric configuration appears in a hardware-symmetric multicore architecture where the cores can individually operate at different frequency levels. In this scenario, we show how to employ the frequency slack to accelerate the tasks in the critical path of LUpp in order to produce a faster global execution as well as lower energy consumption. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
10. P-HS-SFM: a parallel harmony search algorithm for the reproduction of experimental data in the continuous microscopic crowd dynamic models.
- Author
-
Jaber, Khalid Mohammad, Alia, Osama Moh’d, and Shuaib, Mohammed Mahmod
- Subjects
- *
SOCIAL forces, *SEARCH algorithms, *MULTICORE processors, *PARALLEL computers, *COMPUTER science - Abstract
Finding the optimal parameters that can reproduce experimental data (such as the velocity-density relation and the specific flow rate) is a very important component of the validation and calibration of microscopic crowd dynamic models. Heavy computational demand during parameter search is a known limitation that exists in a previously developed model known as the Harmony Search-Based Social Force Model (HS-SFM). In this paper, a parallel-based mechanism is proposed to reduce the computational time and memory resource utilisation required to find these parameters. More specifically, two MATLAB-based multicore techniques (parfor and create independent jobs) using shared memory are developed by taking advantage of the multithreading capabilities of parallel computing, resulting in a new framework called the Parallel Harmony Search-Based Social Force Model (P-HS-SFM). The experimental results show that the parfor-based P-HS-SFM achieved a better computational time of about 26 h, an efficiency improvement of 54% and a speedup factor of 2.196 times in comparison with the HS-SFM sequential processor. The performance of the P-HS-SFM using the create independent jobs approach is also comparable to parfor, with a computational time of 26.8 h, an efficiency improvement of about 30% and a speedup of 2.137 times. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
11. Exploiting nested task-parallelism in the H-LU factorization.
- Author
-
Carratalá-Sáez, Rocío, Christophersen, Sven, Aliaga, José I., Beltran, Vicenç, Börm, Steffen, and Quintana-Ortí, Enrique S.
- Subjects
LUTETIUM compounds, BOUNDARY element methods, DATA structures, MULTICORE processors - Abstract
• We parallelize the H-LU factorization as implemented in the sequential version of H2Lib, with problems involving low-rank blocks and real H-arithmetic. • We propose the use of an auxiliary "skeleton" array to identify task dependencies. • We leverage the new OmpSs-2 model, with explicit support for weak dependencies and early release to take advantage of fine-grained nested parallelism. We address the parallelization of the LU factorization of hierarchical matrices (H-matrices) arising from boundary element methods. Our approach exploits task-parallelism via the OmpSs programming model and runtime, which discovers the data-flow parallelism intrinsic to the operation at execution time, via the analysis of data dependencies based on the memory addresses of the tasks' operands. This is especially challenging for H-matrices, as the structures containing the data vary in dimension during the execution. We tackle this issue by decoupling the data structure from that used to detect dependencies. Furthermore, we leverage the support for weak operands and early release of dependencies, recently introduced in OmpSs-2, to accelerate the execution of parallel codes with nested task-parallelism and fine-grain tasks. As a result, we obtain a significant improvement in the parallel performance with respect to our previous work. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
12. Pedagogy and tools for teaching parallel computing at the sophomore undergraduate level.
- Author
-
Grossman, Max, Aziz, Maha, Chi, Heng, Tibrewal, Anant, Imam, Shams, and Sarkar, Vivek
- Subjects
- *
PARALLEL programming, *COLLEGE sophomores, *COMPUTER programming education in graduate schools, *MULTICORE processors, *UNIVERSITIES & colleges - Abstract
As the need for multicore-aware programmers rises in both science and industry, Computer Science departments in universities around the USA are having to rethink their parallel computing curriculum. At Rice University, this rethinking took the shape of COMP 322, an introductory parallel programming course that is required for all Bachelor's students. COMP 322 teaches students to reason about the behavior of parallel programs, educating them in both the high-level abstractions of task-parallel programming as well as the nitty-gritty details of working with threads in Java. In this paper, we detail the structure, principles, and experiences of COMP 322, gained from 6 years of teaching parallel programming to second-year undergraduates. We describe in detail two particularly useful tools that have been integrated into the curriculum: the HJlib parallel programming library and the Habanero Autograder for parallel programs. We present this work with the hope that it will help augment improvements to parallel computing education at other universities. [ABSTRACT FROM AUTHOR]
- Published
- 2017
- Full Text
- View/download PDF
13. Multi-threaded dense linear algebra libraries for low-power asymmetric multicore processors.
- Author
-
Catalán, Sandra, Herrero, José R., Igual, Francisco D., Rodríguez-Sánchez, Rafael, Quintana-Ortí, Enrique S., and Adeniyi-Jones, Chris
- Subjects
LINEAR algebra, MULTICORE processors, SIMULTANEOUS multithreading processors, HIGH performance computing, ENERGY consumption, COMPUTER software - Abstract
Dense linear algebra libraries, such as BLAS and LAPACK, provide a relevant collection of numerical tools for many scientific and engineering applications. While there exist high performance implementations of the BLAS (and LAPACK) functionality for many current multi-threaded architectures, the adaptation of these libraries for asymmetric multicore processors (AMPs) is still pending. In this paper we address this challenge by developing an asymmetry-aware implementation of the BLAS, based on the BLIS framework, and tailored for AMPs equipped with two types of cores: fast/power-hungry versus slow/energy-efficient. For this purpose, we integrate coarse-grain and fine-grain parallelization strategies into the library routines which, respectively, dynamically distribute the workload between the two core types and statically repartition this work among the cores of the same type. Our results on an ARM® big.LITTLE™ processor embedded in the Exynos 5422 SoC, using the asymmetry-aware version of the BLAS and a plain migration of the legacy version of LAPACK, experimentally assess the benefits, limitations, and potential of this approach from the perspectives of both throughput and energy efficiency. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
14. Architecture-aware configuration and scheduling of matrix multiplication on asymmetric multicore processors.
- Author
-
Catalán, Sandra, Igual, Francisco, Mayo, Rafael, Rodríguez-Sánchez, Rafael, and Quintana-Ortí, Enrique
- Subjects
- *
CONFIGURATIONS (Geometry), *SCHEDULING, *MATRIX multiplications, *ASYMMETRIC digital subscriber lines, *MULTICORE processors, *COMPUTER network resources - Abstract
Asymmetric multicore processors have recently emerged as an appealing technology for severely energy-constrained environments, especially in mobile appliances where heterogeneity in applications is mainstream. In addition, given the growing interest for low-power high performance computing, this type of architectures is also being investigated as a means to improve the throughput-per-Watt of complex scientific applications on clusters of commodity systems-on-chip. In this paper, we design and embed several architecture-aware optimizations into a multi-threaded general matrix multiplication (gemm), a key operation of the BLAS, in order to obtain a high performance implementation for ARM big.LITTLE AMPs. Our solution is based on the reference implementation of gemm in the BLIS library, and integrates a cache-aware configuration as well as asymmetric-static and dynamic scheduling strategies that carefully tune and distribute the operation's micro-kernels among the big and LITTLE cores of the target processor. The experimental results on a Samsung Exynos 5422, a system-on-chip with ARM Cortex-A15 and Cortex-A7 clusters that implements the big.LITTLE model, expose that our cache-aware versions of gemm with asymmetric scheduling attain important gains in performance with respect to its architecture-oblivious counterparts while exploiting all the resources of the AMP to deliver considerable energy efficiency. [ABSTRACT FROM AUTHOR]
- Published
- 2016
- Full Text
- View/download PDF
15. An iteration-based hybrid parallel algorithm for tridiagonal systems of equations on multi-core architectures.
- Author
-
Tang, Guangping, Yang, Wangdong, Li, Kenli, Ye, Yu, Xiao, Guoqing, and Li, Keqin
- Subjects
ITERATIVE methods (Mathematics), HYBRID systems, PARALLEL algorithms, MULTICORE processors, COMPUTER architecture - Abstract
An optimized parallel algorithm is proposed to solve the problem that occurs in the complicated backward substitution of cyclic reduction when solving tridiagonal linear systems. Adopting a hybrid parallel model, this algorithm combines the cyclic reduction method and the partition method. This hybrid algorithm has simpler backward substitution on parallel computers compared with the cyclic reduction method. In this paper, the operation count and execution time are obtained to evaluate and make comparison for these methods. On the basis of results of these measured parameters, the hybrid algorithm using the hybrid approach with a multi-threading implementation achieves better efficiency than the other parallel methods, that is, the cyclic reduction and the partition methods. In particular, the approach involved in this paper has the least scalar operation count and the shortest execution time on a multi-core computer when the size of equations meets some dimension threshold. The hybrid parallel algorithm improves the performance of the cyclic reduction and partition methods by 19.2% and 13.2%, respectively. In addition, by comparing the single-iteration and multi-iteration hybrid parallel algorithms, it is found that increasing the iteration steps of the cyclic reduction method does not affect the performance of the hybrid parallel algorithm very much. Copyright © 2015 John Wiley & Sons, Ltd. [ABSTRACT FROM AUTHOR]
- Published
- 2015
- Full Text
- View/download PDF
16. Intel Cilk Plus for complex parallel algorithms: “Enormous Fast Fourier Transforms” (EFFT) library.
- Author
-
Asai, Ryo and Vladimirov, Andrey
- Subjects
- *
INTEL microprocessors, *PARALLEL algorithms, *FAST Fourier transforms, *MULTICORE processors, *BANDWIDTHS - Abstract
In this paper we demonstrate the methodology for parallelizing the computation of large one-dimensional discrete fast Fourier transforms (DFFTs) on multi-core Intel Xeon processors. DFFTs based on the recursive Cooley–Tukey method have to control cache utilization, memory bandwidth and vector hardware usage, and at the same time scale across multiple threads or compute nodes. Our method builds on a single-threaded Intel Math Kernel Library (MKL) implementation of real-to-complex DFFT, and uses the Intel Cilk Plus framework for thread parallelism. We demonstrate the ability of Intel Cilk Plus to handle parallel recursion with nested loop-centric parallelism without tuning the code to the number of cores or cache metrics. The result of our work is a library called EFFT that performs 1D DFTs of size 2^N for N ≥ 21 faster than the corresponding Intel MKL parallel DFT implementation by up to 1.5×, and faster than FFTW by up to 2.5×. The code of EFFT is available for free download under the GPLv3 license. This work provides a new efficient DFFT implementation, and at the same time demonstrates an educational example of how computer science problems with complex parallel patterns can be optimized for high performance using the Intel Cilk Plus framework. [ABSTRACT FROM AUTHOR]
- Published
- 2015
- Full Text
- View/download PDF
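The spawn/sync pattern that Cilk Plus provides for parallel recursion can be imitated in a minimal sketch, with threads standing in for `cilk_spawn`/`cilk_sync`. A parallel sum replaces the FFT recursion, and none of EFFT's cache or vector tuning is attempted:

```python
import threading

def parallel_sum(data, depth=2):
    """Cilk-style parallel recursion: spawn one half, recurse on the other, join."""
    if depth == 0 or len(data) < 2:
        return sum(data)          # sequential base case
    mid = len(data) // 2
    box = {}

    def spawned():
        box["left"] = parallel_sum(data[:mid], depth - 1)

    t = threading.Thread(target=spawned)   # plays the role of cilk_spawn
    t.start()
    right = parallel_sum(data[mid:], depth - 1)
    t.join()                               # plays the role of cilk_sync
    return box["left"] + right
```

Capping the recursion depth bounds the number of threads, mirroring how Cilk's runtime (rather than the programmer) decides how much of the recursion tree actually runs in parallel.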
17. Parallelizing filter-and-verification based exact set similarity joins on multicores.
- Author
-
Fier, Fabian and Freytag, Johann-Christoph
- Subjects
- *
ALGORITHMS, *DATA structures, *SCALABILITY, *MULTICORE processors, *SPEED - Abstract
Set similarity join (SSJ) is a well studied problem with many algorithms proposed to speed up its performance. However, its scalability and performance are rarely discussed in modern multicore environments. Existing algorithms assume a single-threaded execution that leaves the abundant parallelism provided by modern machines unused, or use distributed setups that may not yield efficient runtimes and speedups that are proportional to the amount of hardware resources (e.g., CPU cores). In this paper, we focus on a widely-used family of SSJ algorithms that are based on the filter-and-verification paradigm, and study the potential of speeding them up in the context of multicore machines. We adapt state-of-the-art SSJ algorithms including PPJoin and AllPairs. Our experiments using 12 real-world datasets highlight important findings: (1) Using the exact number of hardware-provided hyperthreads leads to optimal runtimes for most experiments, (2) hand-crafted data structures do not always lead to better performance, and (3) PPJoin's position filter is more effective in the multithreaded case compared to the single-threaded execution. • Multi-threading has not yet been considered to speed up set similarity joins. • We propose a novel data-parallel set similarity join algorithm. • Multi-threading speeds up the set similarity join 2 to 10 times. • Implementation optimizations are not beneficial for the runtime. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
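The filter-and-verification paradigm with data-parallel verification can be sketched as follows. This is a simplified illustration: the filter is a plain inverted index rather than PPJoin's prefix and position filters, and Jaccard similarity is the assumed measure:

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def jaccard(a, b):
    return len(a & b) / len(a | b)

def candidate_pairs(sets):
    """Filter step: an inverted index yields pairs sharing at least one token."""
    index = defaultdict(list)
    cands = set()
    for i, s in enumerate(sets):
        for tok in s:
            for j in index[tok]:
                cands.add((j, i))
            index[tok].append(i)
    return sorted(cands)

def parallel_ssj(sets, threshold, num_threads=4):
    """Verification step, data-parallel: each thread checks a slice of candidates."""
    cands = candidate_pairs(sets)
    slices = [cands[i::num_threads] for i in range(num_threads)]

    def verify(chunk):
        return [(i, j) for i, j in chunk if jaccard(sets[i], sets[j]) >= threshold]

    out = []
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        for part in pool.map(verify, slices):
            out.extend(part)
    return sorted(out)
```

The verification step is embarrassingly parallel once the candidate list exists, which is why the paper's speedups track the number of hardware threads so closely.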
18. A comparative simulation study on the power-performance of multi-core architecture.
- Author
-
Saravanan, Vijayalakshmi, Anpalagan, Alagan, Kothari, D., Woungang, Isaac, and Obaidat, Mohammad
- Subjects
- *
COMPUTER architecture, *MULTICORE processors, *LAPTOP computers, *COMPUTER input-output equipment, *COMPUTATIONAL complexity - Abstract
Nowadays, multi-core processors are the main technology used in desktop PCs, laptop computers and mobile hardware platforms. As the number of cores on a chip keeps increasing, it adds complexity and has a growing impact on both the power and performance of a processor. In multi-processors, the number of cores and various parameters, such as issue-width, number of instructions and execution time, are key design factors to balance the amount of thread-level parallelism and instruction-level parallelism. In this paper, we perform a comprehensive simulation study that aims to find the optimum number of processor cores in desktop/laptop computing processor models with shallow pipeline depth. This paper also explores the trade-off between the number of cores and different parameters used in multi-processors in terms of power-performance gains and analyzes the impact of 3D stacking on the design of simultaneous multi-threading and chip multiprocessing. Our analysis shows that the optimum number of cores varies with different classes of workloads, namely: SPEC2000, SPEC2006 and MiBench. A simulation study is presented using architectures with shorter pipeline depth, showing that (1) the optimum number of cores for power-performance is 8, (2) the optimum number of threads is in the range [2, 4], and (3) beyond 32 cores, multi-core processors are no longer efficient in terms of performance benefits and overall power consumption. [ABSTRACT FROM AUTHOR]
- Published
- 2014
- Full Text
- View/download PDF
19. Multi-Threading and Suffix Grouping on Massive Multiple Pattern Matching Algorithm.
- Author
-
Oh, Doohwan and Ro, Won W.
- Subjects
- *
THREADS (Computer programs), *PATTERN recognition systems, *ALGORITHMS, *GROUP theory, *MULTICORE processors, *PERFORMANCE evaluation, *PARALLEL processing - Abstract
The widely used multiple pattern matching algorithms experience severe performance degradation when the number of patterns to match increases. In light of this fact, this paper presents a multi-threaded multiple pattern matching algorithm to overcome the performance degradation; this algorithm introduces two improvements on the original Wu–Manber algorithm. First, the proposed algorithm employs a multi-threaded execution model to parallelize the pattern matching operation on multi-core processors. Second, the patterns to be searched are distributed over multiple threads according to the pattern similarity. For this purpose, the proposed algorithm groups the target patterns on the basis of their suffixes and distributes the patterns over multiple threads. Through experiments and performance analysis, our algorithm shows a significant performance gain as compared with the original Wu–Manber algorithm and the previously proposed multi-threaded pattern matching on massive pattern sets of size exceeding 5000. The results obtained from the pattern matching operation using eight cores show much improved execution time, which is nearly 14.9 times faster on average than that of the conventional Wu–Manber algorithm. It is demonstrated that the proposed idea improves the overall performance by reducing the amount of workload on a single thread through multi-threading and an efficient data distribution policy. [ABSTRACT FROM PUBLISHER]
- Published
- 2012
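The suffix-grouping distribution described above can be sketched as follows. This is a minimal illustration: a naive substring scan stands in for the Wu–Manber matcher, and the suffix length is an assumed parameter:

```python
import threading
from collections import defaultdict

def group_by_suffix(patterns, suffix_len=2):
    """Cluster patterns that end in the same suffix block."""
    groups = defaultdict(list)
    for p in patterns:
        groups[p[-suffix_len:]].append(p)
    return list(groups.values())

def threaded_match(text, patterns, suffix_len=2):
    """One thread per suffix group; each scans the text for its own patterns."""
    hits = []
    lock = threading.Lock()

    def search(group):
        local = [p for p in group if p in text]   # naive scan in place of Wu-Manber
        with lock:
            hits.extend(local)

    threads = [threading.Thread(target=search, args=(g,))
               for g in group_by_suffix(patterns, suffix_len)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(hits)
```

Grouping by suffix keeps similar patterns on the same thread, so each thread's shift tables stay small and the per-thread workload shrinks as the pattern set grows.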
20. MetAlign 3.0: performance enhancement by efficient use of advances in computer hardware.
- Author
-
Lommen, Arjen and Kools, Harrie
- Subjects
- *
GAS chromatography, *MASS spectrometry, *LIQUID chromatography, *COMPUTER input-output equipment, *MULTICORE processors, *RANDOM access memory, *COMPUTER software - Abstract
A new, multi-threaded version of the GC-MS and LC-MS data processing software, metAlign, has been developed which is able to utilize multiple cores on one PC. This new version was tested using three different multi-core PCs with different operating systems. The performance of noise reduction, baseline correction and peak-picking was 8-19 fold faster compared to the previous version on a single core machine from 2008. The alignment was 5-10 fold faster. Factors influencing the performance enhancement are discussed. Our observations show that performance scales with the increase in processor core numbers we currently see in consumer PC hardware development. [ABSTRACT FROM AUTHOR]
- Published
- 2012
- Full Text
- View/download PDF
21. Proactive Use of Shared L3 Caches to Enhance Cache Communications in Multi-Core Processors.
- Author
-
Fide, S. and Jenks, S.
- Abstract
The software and hardware techniques to exploit the potential of multi-core processors are falling behind, even though the number of cores and cache levels per chip is increasing rapidly. There is no explicit communications support available, and hence inter-core communications depend on cache coherence protocols, resulting in demand-based cache line transfers with their inherent latency and overhead. In this paper, we present software controlled eviction (SCE) to improve the performance of multithreaded applications running on multi-core processors by moving shared data to shared cache levels before it is demanded from remote private caches. Simulation results show that SCE offers significant performance improvement (8-28%) and reduces L3 cache misses by 88-98%. [ABSTRACT FROM PUBLISHER]
- Published
- 2008
- Full Text
- View/download PDF
22. Exploiting nested task-parallelism in the H-LU factorization
- Author
-
Rocío Carratalá-Sáez, Sven Christophersen, Steffen Börm, Vicenç Beltran, Enrique S. Quintana-Ortí, and José Ignacio Aliaga
- Subjects
Boundary element methods (BEM), General Computer Science, Computer science, Task dependencies, Task parallelism, Parallel computing, Operand, Multi-threading, Memory address, Leverage (statistics), Multicore processors, Nested task-parallelism, Data structure, LU decomposition, Theoretical Computer Science, Modeling and Simulation, LU factorization, Data analysis, Programming paradigm, Hierarchical linear algebra - Abstract
We address the parallelization of the LU factorization of hierarchical matrices (H-matrices) arising from boundary element methods. Our approach exploits task-parallelism via the OmpSs programming model and runtime, which discovers the data-flow parallelism intrinsic to the operation at execution time, via the analysis of data dependencies based on the memory addresses of the tasks' operands. This is especially challenging for H-matrices, as the structures containing the data vary in dimension during the execution. We tackle this issue by decoupling the data structure from that used to detect dependencies. Furthermore, we leverage the support for weak operands and early release of dependencies, recently introduced in OmpSs-2, to accelerate the execution of parallel codes with nested task-parallelism and fine-grain tasks. As a result, we obtain a significant improvement in the parallel performance with respect to our previous work. The researchers from Universidad Jaume I (UJI) were supported by projects CICYT TIN2014-53495-R and TIN2017-82972-R of MINECO and FEDER; project UJI-B2017-46 of UJI; and the FPU program of MECD.
- Published
- 2019