Descriptor: "manycore" / Publisher: elsevier b.v. - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"manycore"' showing total 14 results

Start Over Descriptor "manycore" Publisher elsevier b.v.

14 results on '"manycore"'

1. tpSpMV: A two-phase large-scale sparse matrix-vector multiplication kernel for manycore architectures.

Author: Chen, Yuedan, Xiao, Guoqing, Wu, Fan, Tang, Zhuo, and Li, Keqin
Subjects: *NUMERICAL solutions for linear algebra, *MULTIPLICATION, *SPARSE matrices, *COMPUTING platforms, *DATA structures, *CREMATORIUMS, *DATA reduction
Abstract: • We propose tpSpMV to alleviate the three main difficulties in parallel SpMV on multicore and manycore architectures. • We propose the two-phase parallel execution technique for tpSpMV to overcome the computational scale limitation. • We respectively propose the adaptive partitioning methods and parallelization designs for tpSpMV to exploit the architectural advantages. • We design several optimizations for tpSpMV to improve bandwidth usage and optimize tpSpMV's performance. • Experimental results on the SW26010 CPU show that tpSpMV yields performance improvements over existing work. Sparse matrix-vector multiplication (SpMV) is one of the important subroutines in numerical linear algebras widely used in lots of large-scale applications. Accelerating SpMV on multicore and manycore architectures based on Compressed Sparse Row (CSR) format via row-wise parallelization is one of the most popular directions. However, there are three main challenges in optimizing parallel CSR-based SpMV: (a) limited local memory of each computing unit can be overwhelmed by assignments to long rows of large-scale sparse matrices; (b) irregular accesses to the input vector result in expensive memory access latency; (c) sparse data structure leads to low bandwidth usage. This paper proposes a two-phase large-scale SpMV, called tpSpMV, based on the memory structure and computing architecture of multicore and manycore architectures to alleviate the three main difficulties. First, we propose the two-phase parallel execution technique for tpSpMV that performs parallel CSR-based SpMV into two separate phases to overcome the computational scale limitation. Second, we respectively propose the adaptive partitioning methods and parallelization designs using the local memory caching technique for the two phases to exploit the architectural advantages of the high-performance computing platforms and alleviate the problem of high memory access latency. Third, we design several optimizations, such as data reduction, aligned memory accessing, and pipeline technique, to improve bandwidth usage and optimize tpSpMV's performance. Experimental results on SW26010 CPUs of the Sunway TaihuLight supercomputer prove that tpSpMV achieves up to 28.61 speedups and yields the performance improvement of 13.16% over the state-of-the-art work on average. [ABSTRACT FROM AUTHOR]
Published: 2020
Full Text: View/download PDF

2. A performance-aware yield analysis and optimization of manycore architectures.

Author: Lee, Jeong-Gun and Kwak, Sanghoon
Subjects: *SIMULATION methods & models, *NUMERICAL analysis, *MATHEMATICAL equivalence, *NANOSCIENCE, *NANOELECTROMECHANICAL systems
Abstract: In advanced nano-scale fabrication technology, maintaining both manufacturing yield and chip reliability is a challenging issue. Recent manycore processors provide inherent core redundancy and the redundancies can be exploited to resolve the yield and reliability problems. In this paper, we develop asymptotic analysis and simulation models to better understand performance and yield joint-characteristics in a manycore processor design. Our asymptotic model is built based on Amdahl’s law, Eble’s rule and statistical yield equations to derive the optimum number of cores with respect to “ performance-averaged yield ”. Our model can predict possible impacts of different manycore processor configurations and process technology parameters on the performance average yield for given degree of parallelism. Through the asymptotic analysis and optimization based on our model, we can observe an asymptotic relationship between design parameters such as “the number of cores” and “core area” of manycore architectures with regards to its performance and yield. [ABSTRACT FROM AUTHOR]
Published: 2016
Full Text: View/download PDF

3. The Impact of Traffic Localisation on the Performance of NoCs for Very Large Manycore Systems.

Author: Khanjari, Sharifa Al and Vanderbauwhede, Wim
Subjects: NETWORKS on a chip, SEMICONDUCTORS, MULTICORE processors, WIRELESS communications, TOPOLOGY
Abstract: The scaling of semiconductor technologies is leading to processors with increasing numbers of cores. The adoption of Networks-on-Chip (NoC) in manycore systems requires a shift in focus from computation to communication, as communication is fast becoming the dominant factor in processor performance. In large manycore systems, performance is predicated on the locality of communication. In this work, we investigate the performance of three NoC topologies for systems with thousands of processor cores under two types of localised traffic. We present latency and throughput results comparing fat quadtree, concentrated mesh and mesh topologies under different degrees of localisation. Our results, based on the ITRS physical data for 2023, show that the type and degree of localisation of traffic significantly affects the NoC performance, and that scale-invariant topologies perform worse than flat topologies. [ABSTRACT FROM AUTHOR]
Published: 2015
Full Text: View/download PDF

4. Towards an Automatic Co-generator for Manycores’ Architecture and Runtime: STHORM case-study.

Author: Bechara, Charly, Chehida, Karim Ben, and Thabet, Farhat
Subjects: COGENERATORS, COMPUTER architecture, RUN time systems (Computer science), ELECTRIC generators, COMPUTER simulation
Published: 2015
Full Text: View/download PDF

5. A portable platform for accelerated PIC codes and its application to GPUs using OpenACC.

Author: Hariri, F., Tran, T.M., Jocksch, A., Lanti, E., Progsch, J., Messmer, P., Brunner, S., Gheller, C., and Villard, L.
Subjects: *ACCELERATION (Mechanics), *GRAPHICS processing units, *CODING theory, *COMPUTER algorithms, *COMPUTER programming, *PROBLEM solving
Abstract: We present a portable platform, called PIC_ENGINE , for accelerating Particle-In-Cell (PIC) codes on heterogeneous many-core architectures such as Graphic Processing Units (GPUs). The aim of this development is efficient simulations on future exascale systems by allowing different parallelization strategies depending on the application problem and the specific architecture. To this end, this platform contains the basic steps of the PIC algorithm and has been designed as a test bed for different algorithmic options and data structures. Among the architectures that this engine can explore, particular attention is given here to systems equipped with GPUs. The study demonstrates that our portable PIC implementation based on the OpenACC programming model can achieve performance closely matching theoretical predictions. Using the Cray XC30 system, Piz Daint, at the Swiss National Supercomputing Centre (CSCS), we show that PIC_ENGINE running on an NVIDIA Kepler K20X GPU can outperform the one on an Intel Sandy bridge 8-core CPU by a factor of 3.4. [ABSTRACT FROM AUTHOR]
Published: 2016
Full Text: View/download PDF

6. FPGA prototyping of emerging manycore architectures for parallel programming research using Formic boards.

Author: Lyberis, Spyros, Kalokerinos, George, Lygerakis, Michalis, Papaefstathiou, Vassilis, Mavroidis, Iakovos, Katevenis, Manolis, Pnevmatikatos, Dionisios, and Nikolopoulos, Dimitrios S.
Subjects: *FIELD programmable gate arrays, *PROTOTYPES, *COMPUTER architecture, *PARALLEL programming, *PERFORMANCE evaluation, *NETWORKS on a chip
Abstract: Abstract: Performance evaluation of parallel software and architectural exploration of innovative hardware support face a common challenge with emerging manycore platforms: they are limited by the slow running time and the low accuracy of software simulators. Manycore FPGA prototypes are difficult to build, but they offer great rewards. Software running on such prototypes runs orders of magnitude faster than current simulators. Moreover, researchers gain significant architectural insight during the modeling process. We use the Formic FPGA prototyping board [1], which specifically targets scalable and cost-efficient multi-board prototyping, to build and test a 64-board model of a 512-core, MicroBlaze-based, non-coherent hardware prototype with a full network-on-chip in a 3D-mesh topology. We expand the hardware architecture to include the ARM Versatile Express platforms and build a 520-core heterogeneous prototype of 8 Cortex-A9 cores and 512 MicroBlaze cores. We then develop an MPI library for the prototype and evaluate it extensively using several bare-metal and MPI benchmarks. We find that our processor prototype is highly scalable, models faithfully single-chip multicore architectures, and is a very efficient platform for parallel programming research, being 50,000 times faster than software simulation. [Copyright &y& Elsevier]
Published: 2014
Full Text: View/download PDF

7. τ C: C with Process Network Extensions for Embedded Manycores.

Author: Goubier, Thierry, Couroussé, Damien, and Azaiez, Selma
Subjects: EMBEDDED computer systems, MULTICORE processors, COMPUTATIONAL complexity, MATHEMATICAL optimization, PARALLEL programming, COMPUTER architecture
Abstract: Abstract: Current and future embedded manycore systems bring complex and heterogeneous architectures with a large number of processing cores, making both parallel programming at this scale and understanding the architecture itself a daunting task. Process Networks and other dataflow based Models of Computation (MoC) are a good base to present a universal model of the underlying manycore architectures to the programmer. If a language exposes a simple to grasp MoC in a consistent way across architectures, the programmer can concentrate the efforts on optimizing the expression of parallelism in the application instead of porting code to a given system. In this paper, we present a process network extension to C called τ C and its mapping to both a POSIX target and the P2012/STHORM platform, and show how the language offers an architecture independent solution of this problem. [Copyright &y& Elsevier]
Published: 2014
Full Text: View/download PDF

8. PACHA: Low Cost Bare Metal Development for Shared Memory Manycore Accelerators.

Author: Aminot, Alexandre, Guerre, Alexandre, Peeters, Julien, and Lhuillier, Yves
Subjects: COMPUTER memory management, COMPUTER storage devices, COST effectiveness, MATERIALS science, APPLICATION software, LINUX operating systems
Abstract: Abstract: Today, efficiently implementing an application on shared memory manycore accelerators is a hard task. Linux eases the development, but is not adapted to exploit the maximum of the platform yet. Bare metal programming environments provided with accelerators give a low-overhead access to the platform. However, they increase the complexity of development, mainly due to 4 gaps: each accelerator has its own specific environment; bare metal environments are only supported on the hardware platform or on the proprietary simulator; they have seldom, if ever, execution debugging support; they do not have parallel programming models, and whenever they exist, they are platform-specific. In order to fill the gaps and thus, to lower the barrier to develop on bare metal on shared memory manycore accelerators, we present PACHA. It features two aspects: a low overhead, Platform Agnostic, Close-to-HArdware (PACHA) programming interface, which allows to handle only one version of the application code for all supported accelerators, and an easy-to-use multi-platform development environment, which abstracts the complexity of each accelerator's development environment. With a x86 support and a Linux compatibility, PACHA offers a functional simulator and all the Linux set of debugging tools. Further, based on the programming interface, three parallel execution models have been ported in order to facilitate the devel- opment and comparison of applications. PACHA is currently fully supported for x86 platforms, TILEPro64 and STHORM. A case study on a TILEPro64 is presented: the performance gain using PACHA rather than Linux with OpenMP or Pthread is about 1,8x to 4x, without increasing the development cost. [Copyright &y& Elsevier]
Published: 2013
Full Text: View/download PDF

9. Extended Cyclostatic Dataflow Program Compilation and Execution for an Integrated Manycore Processor.

Author: Aubry, Pascal, Beaucamps, Pierre-Edouard, Blanc, Frédéric, Bodin, Bruno, Carpov, Sergiu, Cudennec, Löıc, David, Vincent, Dore, Philippe, Dubrulle, Paul, Dinechin, Benôıt Dupont de, Galea, Franc,Cois, Goubier, Thierry, Harrand, Michel, Jones, Samuel, Lesage, Jean-Denis, Louise, Stéphane, Chaisemartin, Nicolas Morey, Nguyen, Thanh Hai, Raynaud, Xavier, and Sirdey, Renaud
Subjects: DATA flow computing, COMPUTER programming, EMBEDDED computer systems, INTEGRATED circuits, COMPUTATIONAL complexity, COMPUTER performance
Abstract: Abstract: The ever-growing number of cores in embedded chips emphasizes more than ever the complexity inherent to parallel pro- gramming. To solve these programmability issues, there is a renewed interest in the dataflow paradigm. In this context, we present a compilation toolchain for the ΣC language, which allows the hierarchical construction of stream applications and automatic mapping of this application to an embedded manycore target. As a demonstration of this toolchain, we present an implementation of a H.264 encoder and evaluate its performance on Kalray's embedded manycore MPPA chip. [Copyright &y& Elsevier]
Published: 2013
Full Text: View/download PDF

10. Gyrokinetic particle-in-cell optimization on emerging multi- and manycore platforms

Author: Madduri, Kamesh, Im, Eun-Jin, Ibrahim, Khaled Z., Williams, Samuel, Ethier, Stéphane, and Oliker, Leonid
Subjects: *NUMERICAL solutions to partial differential equations, *PARALLEL computers, *MULTICORE processors, *HIGH performance computing, *COMPUTER architecture, *GRAPHICS processing units, *MATHEMATICAL optimization, *MICROPROCESSORS
Abstract: Abstract: The next decade of high-performance computing (HPC) systems will see a rapid evolution and divergence of multi- and manycore architectures as power and cooling constraints limit increases in microprocessor clock speeds. Understanding efficient optimization methodologies on diverse multicore designs in the context of demanding numerical methods is one of the greatest challenges faced today by the HPC community. In this work, we examine the efficient multicore optimization of GTC, a petascale gyrokinetic toroidal fusion code for studying plasma microturbulence in tokamak devices. For GTC’s key computational components (charge deposition and particle push), we explore efficient parallelization strategies across a broad range of emerging multicore designs, including the recently-released Intel Nehalem-EX, the AMD Opteron Istanbul, and the highly multithreaded Sun UltraSparc T2+. We also present the first study on tuning gyrokinetic particle-in-cell (PIC) algorithms for graphics processors, using the NVIDIA C2050 (Fermi). Our work discusses several novel optimization approaches for gyrokinetic PIC, including mixed-precision computation, particle binning and decomposition strategies, grid replication, SIMDized atomic floating-point operations, and effective GPU texture memory utilization. Overall, we achieve significant performance improvements of 1.3–4.7× on these complex PIC kernels, despite the inherent challenges of data dependency and locality. Our work also points to several architectural and programming features that could significantly enhance PIC performance and productivity on next-generation architectures. [Copyright &y& Elsevier]
Published: 2011
Full Text: View/download PDF

11. THAMON: Thermal-aware High-performance Application Mapping onto Opto-electrical network-on-chip.

Author: Abdollahi, Meisam, Firouzabadi, Yasaman, Dehghani, Fatemeh, and Mohammadi, Siamak
Subjects: *MULTICASTING (Computer networks), *LINEAR programming, *NETWORK performance, *DATA corruption, *INTEGER programming, *HEURISTIC algorithms, *DATA transmission systems
Abstract: Nanophotonics as a promising candidate for future on-chip networks has several benefits such as bandwidth transparency, light-speed data rate and low power consumption. Despite significant features of optical on-chip data transmission, the challenges of thermal and process variations (PVs) threaten the photonic network-on-chip reliability. These variations may cause a notable drift in the resonant wavelength of micro-ring resonators as the basic element of optical on-chip network and lead to data corruption, bandwidth wastage and eventually reliability reduction. In this paper, after discussing the effects of temperature variation (TV) on the reliability of optical on-chip network, a thermal-aware and high performance application mapping platform called as THAMON is proposed for manycore systems based-on hybrid opto-electrical network-on-chip. An integer linear programming (ILP) problem, as well as four run-time heuristic algorithms, are suggested to increase the network performance along with maintaining the network reliability through thermal management. The experimental results show that ILP and our proposed most effective heuristic approach improve the performance of the system by about 24% and 16% compared to the base-line approach, and also meet the thermal limitations of the photonic network. [ABSTRACT FROM AUTHOR]
Published: 2021
Full Text: View/download PDF

12. An on-node scalable sparse incomplete LU factorization for a many-core iterative solver with Javelin.

Author: Booth, Joshua Dennis and Bolet, Gregory
Subjects: *LINEAR systems, *INTERIOR-point methods, *SYSTEMS on a chip, *LINEAR algebra, *SEMIMETALS, *MULTIPLICATION, *SOCIAL responsibility of business, *PERMUTATIONS
Abstract: We present a scalable incomplete LU factorization to be used as a preconditioner for solving sparse linear systems with iterative methods in the package called Javelin. Javelin allows for improved parallel factorization on shared-memory many-core systems by packaging the coefficient matrix into a format that allows for high performance sparse matrix-vector multiplication and sparse triangular solves with minimal overheads. The framework achieves these goals by using a collection of traditional permutations, point-to-point thread synchronizations, tasking, and segmented prefix scans in a conventional compressed sparse row (CSR) format. Moreover, this framework stresses the importance of co-designing dependent tasks, such as sparse factorization and triangular solves, on highly-threaded architectures. We compare our method to the past distributed methods for incomplete factorization (Aztec) and current multithreaded packages (WSMP) in order to demonstrate the importance of having highly threaded factorizations on many-core systems. Using these changes, traditional fill-in and drop tolerance methods can be used, while still being able to have observed speedups of up to ~ 42 × on 68 Intel Knights Landing cores and ~ 12 × on 14 Intel Haswell cores. Moreover, this work provides insight into how the new data-structure impacts iteration counts, and provides insight into future improvements, such as point to GPUs. [ABSTRACT FROM AUTHOR]
Published: 2020
Full Text: View/download PDF

13. A framework to generate domain-specific manycore architectures from dataflow programs.

Author: Savas, Süleyman, Ul-Abdin, Zain, and Nordström, Tomas
Subjects: *ARCHITECTURE, *COMPUTER architecture, *SOFTWARE development tools, *EMULATION software, *DESIGN software
Abstract: In the last 15 years we have seen, as a response to power and thermal limits for current chip technologies, an explosion in the use of multiple and even many computer cores on a single chip. But now, to further improve performance and energy efficiency, when there are potentially hundreds of computing cores on a chip, we see a need for a specialization of individual cores and the development of heterogeneous manycore computer architectures. However, developing such heterogeneous architectures is a significant challenge. Therefore, we propose a design method to generate domain specific manycore architectures based on RISC-V instruction set architecture and automate the main steps of this method with software tools. The design method allows generation of manycore architectures with different configurations including core augmentation through instruction extensions and custom accelerators. The method starts from developing applications in a high-level dataflow language and ends by generating synthesizable Verilog code and cycle accurate emulator for the generated architecture. We evaluate the design method and the software tools by generating several architectures specialized for two different applications and measure their performance and hardware resource usages. Our results show that the design method can be used to generate specialized manycore architectures targeting applications from different domains. The specialized architectures show at least 3 to 4 times better performance than the general purpose counterparts. In certain cases, replacing general purpose components with specialized components saves hardware resources. Automating the method increases the speed of architecture development and facilitates the design space exploration of manycore architectures. [ABSTRACT FROM AUTHOR]
Published: 2020
Full Text: View/download PDF

14. Architecture, Languages, Compilation and Hardware support for Emerging ManYcore systems (ALCHEMY): Preface.

Author: Sepúlveda, Johanna, Marangozova-Martin, Vania, and Castrillon, Jeronimo
Subjects: COMPUTER simulation, DATA libraries, BIG data, PROGRAMMING languages, COMPUTER input-output equipment, INTEGRATED circuits
Abstract: Manycore systems are one of the key enabler technologies for most of current computational paradigms, including Internet of things, data-centers on chip and big data processing. These paradigms are characterized by tight and demanding requirements such as code portability, dynamicity, high performance, usability, predictability, reliability, low power and security. This combination of requirements has led to heterogeneous manycore systems which are extremely challenging to design and to program. As a result, a large body of research has focused on development of languages, simulation environments and analysis tools that allow to model and predict the behavior of this type of systems from early design stages. ALCHEMY 2017 presents five research works addressing the challenges of code portability, high performance, usability, security and reliability in manycore systems. These are namely: 1. ” An OpenMP backend for the Sigma-C streaming language , addresses the software portability challenge of manycore architectures. It proposes an implementation of an OpenMP backend for the SigmaC language, a cycle-static data flow abstraction to program many-core embedded platforms. Its compilation scheme allows for utilization of future manycore embedded systems such as Kalray’s MPPA. 2. A multi-level optimization strategy to improve the performance of the stencil computation combines manual vectorization, space tiling and stencil composition for achieving high performance of stencil kernels on manycore systems. The evaluation with three compilers (Intel, Clang and GCC) and two target multi-core platforms (Intel Broadwell and Ivybridge) reports better results compared to the state of the art. 3. A Distributed Shared Memory Model and C++ Templated Meta-Programming Interface for the Epiphany RISC Array Processor addresses the usability challenge. It proposes techniques for data layout and parallel loop order abstraction as a parallel programming API targeting the Epiphany architecture. This results into a transparent distributed shared memory (DSM) model for Epiphany that eliminates the need to manage local data movement between cores. 4. Towards Protected MPSoC Communication for Information Protection against a Malicious NoC deals with vulnerabilities on Network-on-Chip (NoC). The authors propose a security protocol which allows the secure communication among the cores of the system, even in the presence of Trojan insertions at the NoC whose aim is to modify and steal data. 5. GPU-Accelerated Real-Time Path Planning and the Predictable Execution Model addresses the reliability challenge and tackles the important problem of ensuring reliable Worst Case Execution Time for Real-Time and Cyber Physical Systems. While considering heterogeneous (CPU/GPU), the idea is to separate memory and processor operations through Time-Division Multiplexing (TDM). [ABSTRACT FROM AUTHOR]
Published: 2017
Full Text: View/download PDF

Catalog

Books, media, physical & digital resources

See catalog results

Searchworks

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources

Refine your results

14 results on '"manycore"'

1. tpSpMV: A two-phase large-scale sparse matrix-vector multiplication kernel for manycore architectures.

2. A performance-aware yield analysis and optimization of manycore architectures.

3. The Impact of Traffic Localisation on the Performance of NoCs for Very Large Manycore Systems.

4. Towards an Automatic Co-generator for Manycores’ Architecture and Runtime: STHORM case-study.

5. A portable platform for accelerated PIC codes and its application to GPUs using OpenACC.

6. FPGA prototyping of emerging manycore architectures for parallel programming research using Formic boards.

7. τ C: C with Process Network Extensions for Embedded Manycores.

8. PACHA: Low Cost Bare Metal Development for Shared Memory Manycore Accelerators.

9. Extended Cyclostatic Dataflow Program Compilation and Execution for an Integrated Manycore Processor.

10. Gyrokinetic particle-in-cell optimization on emerging multi- and manycore platforms

11. THAMON: Thermal-aware High-performance Application Mapping onto Opto-electrical network-on-chip.

12. An on-node scalable sparse incomplete LU factorization for a many-core iterative solver with Javelin.

13. A framework to generate domain-specific manycore architectures from dataflow programs.

14. Architecture, Languages, Compilation and Hardware support for Emerging ManYcore systems (ALCHEMY): Preface.

Catalog

Searchworks

Select search scope, currently: Articles Catalog books, media & more in Jio Institute collections Articles journal articles & other e-resources

Search

Search Constraints

Refine your results

Search Limiters

Topic

Publication Year Range

Language

Publication Type

Journal

Database

14 results on '"manycore"'

Search Results

Catalog

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources