91 results for "Tianhe-2"
Search Results
2. Evaluation of the computational performance of the finite-volume atmospheric model of the IAP/LASG (FAMIL) on a high-performance computer
- Author
-
Jin-Xiao LI, Qing BAO, Yi-Min LIU, and Guo-Xiong WU
- Subjects
FAMIL, scalability, computational performance, Tianhe-2, Environmental sciences, GE1-350, Oceanography, GC1-1581
- Abstract
High computational performance is extremely important for climate system models, especially in ultra-high-resolution model development. In this study, the computational performance of the Finite-volume Atmospheric Model of the IAP/LASG (FAMIL) was comprehensively evaluated on Tianhe-2, which was the world's top-ranked supercomputer from June 2013 to May 2016. A standardized Atmospheric Model Intercomparison Project (AMIP) experiment was carried out that focused on the computational performance of each node as well as the simulated years per day (SYPD), the running-cost speedup, and the scalability of FAMIL. The results indicated that (1) based on five indexes (CPU usage, the percentage of CPU time spent in kernel mode and in message-passing waiting (CPU_SW), code vectorization (VEC), average Gflops (Gflops_AVE), and peak Gflops (Gflops_PK)), FAMIL shows excellent computational performance on every Tianhe-2 computing node; (2) considering the SYPD and cost speedup of FAMIL together, the optimal choice of Message Passing Interface (MPI) numbers of processors (MNPs) is 384 MNPs for C96 (100 km) and 1536 MNPs for C384 (25 km); and (3) FAMIL shows positive scalability as more threads are used to drive the model. Considering the fast network and the acceleration cards of the MIC architecture on Tianhe-2, there is still significant room to improve the computational performance of FAMIL. (See the metric definitions sketched after this entry.)
- Published
- 2017
- Full Text
- View/download PDF
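The throughput and scaling metrics named in entry 2 (SYPD, speedup, scalability) follow conventional definitions. As a reference, here is a minimal formulation using the standard textbook definitions; these equations are not taken from the paper itself:

```latex
% Simulated years per day: model years advanced per wall-clock day of computing
\mathrm{SYPD} = \frac{\text{simulated model years}}{\text{wall-clock days}}

% Speedup and parallel efficiency relative to a baseline of p_0 processors,
% where T(p) is the runtime on p processors (p_0 = 1 gives the classic forms)
S(p) = \frac{T(p_0)}{T(p)}, \qquad E(p) = \frac{p_0\,S(p)}{p}
```

"Positive scalability" in the abstract then corresponds to S(p) continuing to grow as p increases, even while E(p) drops below one.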
3. Performance Evaluation of HPGMG on Tianhe-2: Early Experience
- Author
-
Ao, Yulong, Liu, Yiqun, Yang, Chao, Liu, Fangfang, Zhang, Peng, Lu, Yutong, and Du, Yunfei; Wang, Guojun, Zomaya, Albert, Martinez, Gregorio, and Li, Kenli, editors
- Published
- 2015
- Full Text
- View/download PDF
4. FT-Offload: A Scalable Fault-Tolerance Programming Model on MIC Cluster
- Author
-
Chen, Cheng, Du, Yunfei, Xu, Zhen, and Yang, Canqun; Wang, Guojun, Zomaya, Albert, Martinez, Gregorio, and Li, Kenli, editors
- Published
- 2015
- Full Text
- View/download PDF
5. Large-Scale Neo-Heterogeneous Programming and Optimization of SNP Detection on Tianhe-2
- Author
-
Cui, Yingbo, Liao, Xiangke, Peng, Shaoliang, Lu, Yutong, Yang, Canqun, Wang, Bingqiang, and Wu, Chengkun; Kunkel, Julian M., and Ludwig, Thomas, editors
- Published
- 2015
- Full Text
- View/download PDF
6. Analyzing time-dimension communication characterizations for representative scientific applications on supercomputer systems.
- Author
-
Chen, Juan, Zhou, Wenhao, Dong, Yong, Wang, Zhiyuan, Cui, Chen, Wu, Feihao, Zhou, Enqiang, and Tang, Yuhua
- Abstract
Exascale computing is one of the major challenges of this decade, and several studies have shown that communication is becoming one of the bottlenecks for scaling parallel applications. Analyzing the characteristics of communication can effectively help improve the performance of scientific applications. In this paper, we focus on the statistical regularity of time-dimension communication characteristics for representative scientific applications on supercomputer systems, and then show that the distribution of communication-event intervals has a power-law decay, which is common in many natural phenomena and human activities. We verify that the distribution of communication-event intervals indeed decays as a power law on the Tianhe-2 supercomputer, as well as on six other parallel systems with three different network topologies and two routing policies. To study the power-law distribution quantitatively, we exploit two groups of statistics: burstiness vs. memory and periodicity vs. dispersion. Our results indicate that the communication events show a "strong-bursty and weak-memory" characteristic, and that the communication-event intervals show both periodicity and dispersion. Finally, our research provides insight into the relationship between communication optimizations and time-dimension communication characteristics. [ABSTRACT FROM AUTHOR] (The interval statistics are sketched after this entry.)
- Published
- 2019
- Full Text
- View/download PDF
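The "bursty vs. memory" statistics named in entry 6 are, in their standard formulation, the Goh-Barabási burstiness and memory coefficients of an inter-event interval sequence. The paper's exact definitions are not reproduced here, but a minimal sketch consistent with its terminology is:

```latex
% Power-law decay of the communication-interval distribution
P(\tau) \sim \tau^{-\alpha}

% Burstiness: m_\tau and \sigma_\tau are the mean and standard deviation of the
% intervals; B \to 1 is strongly bursty, B = 0 is Poisson-like, B \to -1 is periodic
B = \frac{\sigma_\tau - m_\tau}{\sigma_\tau + m_\tau}

% Memory: correlation of consecutive intervals; M near 0 means weak memory
M = \frac{1}{n-1} \sum_{i=1}^{n-1}
    \frac{(\tau_i - m_1)(\tau_{i+1} - m_2)}{\sigma_1 \sigma_2}
```

Under these definitions, the reported "strong-bursty and weak-memory" characteristic reads as B close to 1 together with M close to 0.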
7. Toward fault-tolerant hybrid programming over large-scale heterogeneous clusters via checkpointing/restart optimization.
- Author
-
Chen, Cheng, Du, Yunfei, Zuo, Ke, Fang, Jianbin, and Yang, Canqun
- Subjects
FAULT tolerance (Engineering), RELIABILITY in engineering, SCALABILITY, QUANTUM computing
- Abstract
Massively heterogeneous architectures are widely adopted in the design of modern peta-scale and future exa-scale systems. In such heterogeneous clusters, due to the increasing number of components involved, it is essential to enable fault tolerance to improve the reliability of the whole system. However, existing programming models for heterogeneous clusters (e.g., MPI + X) focus more on performance than on reliability. In this paper, we design and implement a fault-tolerance framework, based on the in-memory checkpointing technique, for hybrid programs that leverage heterogeneous hardware architectures. We provide new capabilities for programming heterogeneous applications that can greatly simplify the implementation of application-level checkpointing. We also optimize checkpoint saving and loading to increase scalability. We validate the effectiveness of the framework with various benchmarks and real-world applications on the Tianhe-2 supercomputer. Our experimental results show that our framework can improve the resilience of long-running applications and reduce checkpointing overhead. [ABSTRACT FROM AUTHOR] (A schematic of in-memory checkpointing follows this entry.)
- Published
- 2019
- Full Text
- View/download PDF
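To make the in-memory checkpointing idea in entry 7 concrete, here is a minimal sketch of application-level checkpointing with buddy replication over MPI. It illustrates only the general technique, not the paper's actual framework or API; the solver_state_t type and the pairing scheme are hypothetical.

```c
/* Sketch: each rank keeps a checkpoint of its state in RAM and mirrors it on
 * a buddy rank, so a restarted process can recover its partner's copy. */
#include <mpi.h>
#include <string.h>

typedef struct { int step; size_t n; double *u; } solver_state_t;  /* hypothetical */

static void checkpoint(const solver_state_t *s, double *local_ckpt,
                       double *buddy_ckpt, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    int buddy = rank ^ 1;                /* pair ranks (0,1), (2,3), ... */

    /* Local in-memory copy, then exchange copies with the buddy. */
    memcpy(local_ckpt, s->u, s->n * sizeof(double));
    if (buddy < size)
        MPI_Sendrecv(local_ckpt, (int)s->n, MPI_DOUBLE, buddy, 0,
                     buddy_ckpt, (int)s->n, MPI_DOUBLE, buddy, 0,
                     comm, MPI_STATUS_IGNORE);
}
```

A real framework would add asynchronous saving, checksumming, and a restart path that pulls the buddy's copy back; the sketch shows only the data flow.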
8. Parallel optimization of the Crystal-KMC on Tianhe-2
- Author
-
Jie Wu, Jianjiang Li, Peng Wei, Yang Yun, and Baixue Ji
- Subjects
Multidisciplinary, Software, Speedup, Coprocessor, Computer science, Message Passing Interface, Tianhe-2, Kinetic Monte Carlo, Bottleneck, Computational science
- Abstract
Kinetic Monte Carlo (KMC) is one of the most commonly used methods for simulating radiation damage in materials. Our team developed a parallel KMC software package named Crystal-KMC, which supports the Embedded Atom Method (EAM) potential and uses Message Passing Interface (MPI) technology to simulate vacancy transitions in copper (Cu) under neutron radiation. To make better use of the computing power of modern supercomputers, we developed a parallel efficiency optimization model for Crystal-KMC on Tianhe-2, enabling larger simulations of material damage under irradiation. First, we analyzed the performance bottlenecks of Crystal-KMC and used MIC offload statements to run the software's key modules on the MIC coprocessor. We then used OpenMP to parallelize Crystal-KMC, combining it with the existing MPI inter-process communication optimization to achieve hybrid parallel optimization. The experimental results show that in the single-node CPU and MIC collaborative parallel mode, the speedup of the calculation hotspot reaches 30.1, and the speedup of the overall software reaches 7.43. (The offload pattern is sketched after this entry.)
- Published
- 2021
- Full Text
- View/download PDF
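A minimal sketch of the CPU+MIC pattern described in entry 8: a compute hotspot is offloaded to the coprocessor with the Intel compiler's legacy offload pragma and parallelized there with OpenMP. The rate-summation kernel is a stand-in, not actual Crystal-KMC code; it assumes the Intel C compiler with offload support.

```c
/* Functions and data used inside an offload region must be marked for MIC. */
__attribute__((target(mic)))
double total_rate(const double *rates, long n)
{
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)   /* spread over MIC threads */
    for (long i = 0; i < n; ++i)
        sum += rates[i];
    return sum;
}

double total_rate_on_mic(const double *rates, long n)
{
    double sum;
    /* Ship the rate table to coprocessor 0, run the reduction there,
     * and copy the scalar result back. */
    #pragma offload target(mic:0) in(rates : length(n)) out(sum)
    sum = total_rate(rates, n);
    return sum;
}
```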
9. An Interface for Biomedical Big Data Processing on the Tianhe-2 Supercomputer.
- Author
-
Xi Yang, Chengkun Wu, Kai Lu, Lin Fang, Yong Zhang, Shengkang Li, Guixin Guo, and Yunfei Du
- Subjects
BIG data, MEDICAL databases, ELECTRONIC data processing, SUPERCOMPUTERS, GENOMICS
- Abstract
Big data, cloud computing, and high-performance computing (HPC) are on the verge of convergence. Cloud computing already plays an active part in big data processing with the help of big data frameworks like Hadoop and Spark. The recent upsurge of high-performance computing in China provides extra possibilities and capacity to address the challenges associated with big data. In this paper, we propose Orion, a big data interface on the Tianhe-2 supercomputer, to enable big data applications to run on Tianhe-2 via a single command or a shell script. Orion supports multiple users, and each user can launch multiple tasks. It minimizes the effort needed to initiate big data applications on the Tianhe-2 supercomputer via automated configuration. Orion follows the "allocate-when-needed" paradigm, and it avoids the idle occupation of computational resources. We tested the utility and performance of Orion using a big genomic dataset and achieved satisfactory performance on Tianhe-2 with very few modifications to existing applications implemented in Hadoop/Spark. In summary, Orion provides a practical and economical interface for big data processing on Tianhe-2. [ABSTRACT FROM AUTHOR]
- Published
- 2017
- Full Text
- View/download PDF
10. An approach to enhance the performance of large-scale structural analysis on CPU-MIC heterogeneous clusters.
- Author
-
Miao, Xinqiang, Jin, Xianlong, and Ding, Junhong
- Subjects
MODEL-integrated computing, MOTHERBOARDS, HETEROGENEOUS computing, COMPUTER software execution, MATHEMATICAL optimization
- Abstract
Clusters with a CPU-MIC heterogeneous architecture have become more popular in recent years. However, it is not easy to achieve good performance on such machines. The key challenge is the asymmetry within the clusters, arising from different kinds of execution units as well as different communication latencies. To improve the performance of large-scale structural analysis on CPU-MIC heterogeneous clusters, a multi-layer, multi-grain collaborative parallel computing approach is proposed in this paper. The proposed method combines the parallel algorithm with the hardware architecture of CPU-MIC heterogeneous clusters. By mapping computing tasks to the various hardware layers, it both resolves the load-balance problem between CPU and MIC devices and significantly reduces the communication overhead of the system. Numerical experiments conducted on the Tianhe-2 supercomputer show that the proposed method obtains better performance than the traditional approach, and a scalability investigation shows that it scales well with problem size. The findings of this paper can aid the parallel porting and performance optimization of other applications on CPU-MIC heterogeneous clusters. [ABSTRACT FROM AUTHOR]
- Published
- 2017
- Full Text
- View/download PDF
11. P-Hint-Hunt: a deep parallelized whole genome DNA methylation detection tool.
- Author
-
Shaoliang Peng, Shunyun Yang, Ming Gao, Xiangke Liao, Jie Liu, Canqun Yang, Chengkun Wu, and Wenqiang Yu
- Subjects
DNA methylation, CYTOSINE, EPIGENETICS, DYNAMIC programming, SUPERCOMPUTERS
- Abstract
Background: A growing number of studies use whole-genome DNA methylation detection, one of the most important parts of epigenetics research, to find significant relationships between DNA methylation and typical diseases such as cancers and diabetes. In many of those studies, mapping bisulfite-treated sequences to the whole genome has been the main method for studying DNA cytosine methylation. However, most of today's tools suffer from inaccuracy and long runtimes. Results: In our study, we designed a new DNA methylation prediction tool ("Hint-Hunt") to solve these problems. With an optimized alignment computation based on Smith-Waterman matrix dynamic programming, Hint-Hunt can analyze and predict DNA methylation status. But when Hint-Hunt predicts methylation status on large-scale datasets, speed and temporal-spatial efficiency remain problems. To resolve the cost of Smith-Waterman dynamic programming and the low temporal-spatial efficiency, we further designed a deeply parallelized whole-genome DNA methylation detection tool ("P-Hint-Hunt") for the Tianhe-2 (TH-2) supercomputer. Conclusions: To the best of our knowledge, P-Hint-Hunt is the first parallel DNA methylation detection tool with a high speedup for processing large-scale datasets, and it can run on both CPUs and Intel Xeon Phi coprocessors. Moreover, we deploy and evaluate Hint-Hunt and P-Hint-Hunt on the TH-2 supercomputer at different scales. The experimental results show that our tools eliminate the deviation caused by bisulfite treatment in the mapping procedure and that the multi-level parallel program yields a 48-fold speedup with 64 threads. P-Hint-Hunt gains a deep acceleration on the CPU and Intel Xeon Phi heterogeneous platform, exploiting the advantages of both multi-core CPUs and many-core Phi coprocessors. [ABSTRACT FROM AUTHOR] (The Smith-Waterman recurrence is given after this entry.)
- Published
- 2017
- Full Text
- View/download PDF
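For reference, the Smith-Waterman recurrence underlying the alignment step in entry 11, in its textbook form with a linear gap penalty g and substitution score s(a_i, b_j); the paper's exact scoring scheme is not specified in the abstract:

```latex
% Local alignment score matrix; boundary conditions H(i,0) = H(0,j) = 0
H(i,j) = \max
\begin{cases}
0 \\
H(i-1,\,j-1) + s(a_i, b_j) \\
H(i-1,\,j) - g \\
H(i,\,j-1) - g
\end{cases}
```

The zero branch is what makes the alignment local, and the row-by-row data dependence is what makes the dynamic program expensive and worth parallelizing.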
12. ParaBTM: A Parallel Processing Framework for Biomedical Text Mining on Supercomputers
- Author
-
Yuting Xing, Chengkun Wu, Xi Yang, Wei Wang, En Zhu, and Jianping Yin
- Subjects
biomedical text mining, big data, Tianhe-2, parallel computing, load balancing, Organic chemistry, QD241-441
- Abstract
A prevailing way of extracting valuable information from biomedical literature is to apply text mining methods to unstructured texts. However, the massive amount of literature that needs to be analyzed poses a big data challenge to the processing efficiency of text mining. In this paper, we address this challenge by introducing parallel processing on a supercomputer. We developed paraBTM, a runnable framework that enables parallel text mining on the Tianhe-2 supercomputer. It employs a low-cost yet effective load-balancing strategy to maximize the efficiency of parallel processing. We evaluated the performance of paraBTM on several datasets, using three types of named-entity recognition (NER) tasks as a demonstration. The results show that, in most cases, processing efficiency can be greatly improved with parallel processing, and the proposed load-balancing strategy is simple and effective. In addition, our framework can be readily applied to tasks of biomedical text mining other than NER. (One such simple strategy is sketched after this entry.)
- Published
- 2018
- Full Text
- View/download PDF
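One simple low-cost strategy in the spirit of entry 12 is the longest-processing-time (LPT) heuristic: sort work items (e.g., documents) by estimated cost and greedily give each to the least-loaded worker. This is an illustrative sketch, not paraBTM's actual strategy.

```c
/* Greedy LPT assignment: heaviest items first, each to the least-loaded worker. */
#include <stdlib.h>

static int cmp_desc(const void *a, const void *b)
{
    long x = *(const long *)a, y = *(const long *)b;
    return (x < y) - (x > y);          /* descending order */
}

void assign_lpt(long *cost, int n_items, long *load, int *owner, int n_workers)
{
    qsort(cost, n_items, sizeof(long), cmp_desc);
    for (int i = 0; i < n_items; ++i) {
        int best = 0;
        for (int w = 1; w < n_workers; ++w)   /* find least-loaded worker */
            if (load[w] < load[best]) best = w;
        load[best] += cost[i];
        owner[i] = best;                      /* item i (post-sort) goes to best */
    }
}
```

With document length as the cost estimate, this keeps per-worker totals within a small factor of optimal at negligible scheduling cost.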
13. Reducing Static Energy in Supercomputer Interconnection Networks Using Topology-Aware Partitioning.
- Author
-
Chen, Juan, Tang, Yuhua, Dong, Yong, Xue, Jingling, Wang, Zhiyuan, and Zhou, Wenhao
- Subjects
SUPERCOMPUTERS, INTEGRATED circuit interconnections, NETWORK routers, ENERGY management, RESOURCE allocation, ENERGY conservation
- Abstract
The key to reducing static energy in supercomputers is switching off their unused components. Routers are major components of a supercomputer, so whether routers can be switched off effectively has become central to static energy management in supercomputers. For many typical applications, the routers in a supercomputer exhibit low utilization, yet there has been no effective method to switch them off when they are idle. By analyzing router occupancy in time and space, we present, for the first time, a routing-policy-guided topology partitioning methodology to solve this problem. We propose topology partitioning methods for three commonly used topologies (mesh, torus, and fat-tree) equipped with the three most popular routing policies (deterministic routing, directionally adaptive routing, and fully adaptive routing). Based on these methods, we propose the key techniques required for topology-partitioning-based static energy management in supercomputer interconnection networks, switching off unused routers in both the time and space dimensions. Three topology-aware resource allocation algorithms have been developed to handle effectively the different job mixes running on a supercomputer. We validate the effectiveness of our methodology using Tianhe-2 and a simulator for the aforementioned topologies and routing policies. The energy savings achieved on a subsystem of Tianhe-2 range from 3.8 to 79.7 percent. This translates into a yearly energy cost reduction of up to half a million US dollars for Tianhe-2. [ABSTRACT FROM AUTHOR]
- Published
- 2016
- Full Text
- View/download PDF
14. Parallelizing and optimizing large-scale 3D multi-phase flow simulations on the Tianhe-2 supercomputer.
- Author
-
Li, Dali, Xu, Chuanfu, Wang, Yongxian, Song, Zhifang, Xiong, Min, Gao, Xiang, and Deng, Xiaogang
- Subjects
PARALLELIZING compilers, LARGE scale systems, THREE-dimensional modeling, MULTIPHASE flow, COMPUTER simulation, SUPERCOMPUTERS
- Abstract
The lattice Boltzmann method (LBM) is a widely used computational fluid dynamics method for flow problems with complex geometries and various boundary conditions. Large-scale LBM simulations with increasing resolution and extended temporal range require massive high-performance computing (HPC) resources, thus motivating us to port it onto modern many-core heterogeneous supercomputers like Tianhe-2. Although many-core accelerators such as graphics processing units and the Intel MIC have a dramatic advantage in floating-point performance and power efficiency over CPUs, they also pose a tough challenge for parallelizing and optimizing computational fluid dynamics codes on large-scale heterogeneous systems. In this paper, we parallelize and optimize the open-source 3D multi-phase LBM code openlbmflow on the Intel Xeon Phi (MIC) accelerated Tianhe-2 supercomputer using a hybrid and heterogeneous MPI+OpenMP+offload+SIMD (single instruction, multiple data) programming model. With cache blocking and a SIMD-friendly data-structure transformation, we dramatically improve the SIMD and cache efficiency of the single-thread performance on both the CPU and the Phi, achieving speedups of 7.9X and 8.8X, respectively, compared with the baseline code. To make the CPUs and Phi processors collaborate efficiently, we propose a load-balance scheme that distributes workloads among a node's two CPUs and three Phi processors, and we use an asynchronous model to overlap the collaborative computation and communication as far as possible. The collaborative approach with two CPUs and three Phi processors improves performance by around 3.2X compared with the CPU-only approach. Scalability tests show that openlbmflow can achieve a parallel efficiency of about 60% on 2048 nodes, with about 400K cores in total. To the best of our knowledge, this is the largest-scale CPU-MIC collaborative LBM simulation for 3D multi-phase flow problems. Copyright © 2015 John Wiley & Sons, Ltd. [ABSTRACT FROM AUTHOR] (The hybrid loop style is sketched after this entry.)
- Published
- 2016
- Full Text
- View/download PDF
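A minimal sketch of the MPI+OpenMP+SIMD style named in entry 14, applied to a generic 3D nearest-neighbor sweep with cache blocking; the loop body is a placeholder, not openlbmflow's actual LBM kernel, and each MPI rank would run this on its own subdomain.

```c
#define NX 256
#define NY 256
#define NZ 256
#define BY 16   /* y-block size chosen to keep the working set in cache; tunable */

void sweep(double (*restrict dst)[NY][NX], const double (*restrict src)[NY][NX])
{
    /* OpenMP threads span the cores of one CPU or coprocessor. */
    #pragma omp parallel for collapse(2) schedule(static)
    for (int z = 1; z < NZ - 1; ++z)
        for (int yb = 1; yb < NY - 1; yb += BY)          /* cache blocking in y */
            for (int y = yb; y < yb + BY && y < NY - 1; ++y)
                #pragma omp simd                          /* vectorize unit-stride x */
                for (int x = 1; x < NX - 1; ++x)
                    dst[z][y][x] = (src[z][y][x-1] + src[z][y][x+1] +
                                    src[z][y-1][x] + src[z][y+1][x] +
                                    src[z-1][y][x] + src[z+1][y][x]) / 6.0;
}
```

The same kernel can be compiled for the host and, with offload pragmas like those sketched after entry 8, for the Phi, which is the essence of the hybrid model.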
15. 623 Tflop/s HPCG run on Tianhe-2: Leveraging millions of hybrid cores.
- Author
-
Liu, Yiqun, Yang, Chao, Liu, Fangfang, Zhang, Xianyi, Lu, Yutong, Du, Yunfei, Yang, Canqun, Xie, Min, and Liao, Xiangke
- Subjects
ALGORITHM research, CONJUGATE gradient methods, DATA distribution, HETEROGENEOUS computing, ALGORITHMS
- Abstract
In this article, we present a new hybrid algorithm to enable and scale the high-performance conjugate gradients (HPCG) benchmark on large-scale heterogeneous systems such as Tianhe-2. Based on an inner-outer subdomain partitioning strategy, the data distribution between host and device can be balanced adaptively. The overhead of data movement, from both MPI communication and PCI-E transfer, can be significantly reduced by carefully rearranging and fusing operations. A variety of parallelization and optimization techniques for performance-critical kernels are exploited and analyzed to maximize the performance gain on both host and device. We carry out experiments on both a small heterogeneous computer and the world's largest one, Tianhe-2. On the small system, a thorough comparison and analysis is presented to select among different optimization choices. On Tianhe-2, the optimized implementation scales to the full-system level of 3.12 million heterogeneous cores, with an aggregate performance of 623 Tflop/s and a parallel efficiency of 81.2%. [ABSTRACT FROM AUTHOR] (The overlap pattern is sketched after this entry.)
- Published
- 2016
- Full Text
- View/download PDF
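The communication-cost reductions in entry 15 rest on overlapping data movement with computation. A minimal sketch of the standard pattern, with hypothetical update_interior/update_boundary routines standing in for the HPCG kernels:

```c
#include <mpi.h>

void update_interior(void);   /* hypothetical: needs no halo data */
void update_boundary(void);   /* hypothetical: consumes received halos */

void exchange_and_compute(double *halo_send, double *halo_recv, int halo_n,
                          int left, int right, MPI_Comm comm)
{
    MPI_Request req[4];
    /* Post non-blocking halo exchanges with both neighbors... */
    MPI_Irecv(halo_recv,          halo_n, MPI_DOUBLE, left,  0, comm, &req[0]);
    MPI_Irecv(halo_recv + halo_n, halo_n, MPI_DOUBLE, right, 1, comm, &req[1]);
    MPI_Isend(halo_send,          halo_n, MPI_DOUBLE, left,  1, comm, &req[2]);
    MPI_Isend(halo_send + halo_n, halo_n, MPI_DOUBLE, right, 0, comm, &req[3]);

    update_interior();        /* ...and compute the interior while they fly */

    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
    update_boundary();        /* finish once the halos have arrived */
}
```

The same idea applies to PCI-E transfers to the device: queue them asynchronously, compute on data already resident, then synchronize.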
16. Scaling up Hartree–Fock calculations on Tianhe-2.
- Author
-
Chow, Edmond, Liu, Xing, Misra, Sanchit, Dukhan, Marat, Smelyanskiy, Mikhail, Hammond, Jeff R., Du, Yunfei, Liao, Xiang-Ke, and Dubey, Pradeep
- Subjects
HARTREE-Fock approximation, QUANTUM chemistry, LOAD balancing (Computer networks), SUPERCOMPUTERS, COPROCESSORS
- Abstract
This paper presents a new optimized and scalable code for Hartree–Fock self-consistent field iterations. Goals of the code design include scalability to large numbers of nodes and the capability to use CPUs and Intel Xeon Phi coprocessors simultaneously. Issues we encountered as we optimized and scaled up the code on Tianhe-2 are described and addressed. A major issue is load balance, which is made challenging by integral screening. We describe a general framework for finding a well-balanced static partitioning of the load in the presence of screening; work stealing is then used to polish the load balance. Performance results are shown on the Stampede and Tianhe-2 supercomputers. Scalability is demonstrated on large simulations involving 2938 atoms and 27,394 basis functions, utilizing 8100 nodes of Tianhe-2. [ABSTRACT FROM AUTHOR] (A simplified dynamic scheme is sketched after this entry.)
- Published
- 2016
- Full Text
- View/download PDF
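Entry 16 polishes a static partition with work stealing. As a simplified stand-in for that machinery, here is a dynamic task-distribution sketch using an MPI-3 shared counter: rank 0 hosts the counter and every rank atomically grabs the next task index. The per-task routine is hypothetical.

```c
#include <mpi.h>

void process_task(long t);   /* hypothetical: e.g., one screened integral block */

void run_tasks(long n_tasks, MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);

    long *counter;
    MPI_Win win;
    MPI_Win_allocate(rank == 0 ? (MPI_Aint)sizeof(long) : 0, sizeof(long),
                     MPI_INFO_NULL, comm, &counter, &win);
    MPI_Win_fence(0, win);
    if (rank == 0) *counter = 0;          /* initialize the shared counter */
    MPI_Win_fence(0, win);

    const long one = 1;
    for (;;) {
        long next;
        MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, win);
        MPI_Fetch_and_op(&one, &next, MPI_LONG, 0, 0, MPI_SUM, win); /* next = counter++ */
        MPI_Win_unlock(0, win);
        if (next >= n_tasks) break;       /* no work left */
        process_task(next);
    }
    MPI_Win_free(&win);
}
```

True work stealing keeps per-rank task queues and steals from victims rather than funneling through one counter, but the atomic-grab pattern above conveys how idle ranks keep pulling work until the pool is empty.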
17. Design and Implementation of the Tianhe-2 Data Storage and Management System
- Author
-
Zhiguang Chen, Peng Cheng, and Yutong Lu
- Subjects
File system, Computer science, Data management, Big data, Data science, Exascale computing, Computer Science Applications, Theoretical Computer Science, Computational Theory and Mathematics, Hardware and Architecture, Middleware (distributed applications), Management system, Computer data storage, Tianhe-2, Software
- Abstract
With the convergence of high-performance computing (HPC), big data and artificial intelligence (AI), the HPC community is pushing for “triple use” systems to expedite scientific discoveries. However, supporting these converged applications on HPC systems presents formidable challenges in terms of storage and data management due to the explosive growth of scientific data and the fundamental differences in I/O characteristics among HPC, big data and AI workloads. In this paper, we discuss the driving force behind the converging trend, highlight three data management challenges, and summarize our efforts in addressing these data management challenges on a typical HPC system at the parallel file system, data management middleware, and user application levels. As HPC systems are approaching the border of exascale computing, this paper sheds light on how to enable application-driven data management as a preliminary step toward the deep convergence of exascale computing ecosystems, big data, and AI.
- Published
- 2020
- Full Text
- View/download PDF
18. A CPU/MIC Collaborated Parallel Framework for GROMACS on Tianhe-2 Supercomputer
- Author
-
Xiaoyu Zhang, Shunyun Yang, Wenhe Su, Xingming Zhao, Shaoliang Peng, Yingbo Cui, Tenglilang Zhang, and Weiguo Liu
- Subjects
Source code, Coprocessor, Computer science, Applied Mathematics, Parallel computing, Molecular Dynamics Simulation, Supercomputer, Computing Methodologies, Computational science, Software, Genetics, Tianhe-2, Central processing unit, Xeon Phi, Biotechnology
- Abstract
Molecular Dynamics (MD) is the simulation of the dynamic behavior of atoms and molecules. As the most popular software for molecular dynamics, GROMACS cannot work on large-scale data because of limited computing resources. In this paper, we propose a CPU and Intel® Xeon Phi Many Integrated Core (MIC) collaborative parallel framework to accelerate GROMACS using the offload mode on MIC coprocessors, with which the performance of GROMACS is improved significantly, especially on the Tianhe-2 supercomputer. Furthermore, we optimize GROMACS so that it can run on the CPU and MIC at the same time. In addition, we accelerate multi-node GROMACS so that it can be used in practice. Benchmarking on real data, our accelerated GROMACS performs very well and reduces computation time significantly. Source code: https://github.com/tianhe2/gromacs-mic.
- Published
- 2019
- Full Text
- View/download PDF
19. Communication-hiding programming for clusters with multi-coprocessor nodes.
- Author
-
Dong, Xinnan, Wen, Mei, Chai, Jun, Cai, Xing, Zhao, Mandan, and Zhang, Chunyuan
- Subjects
COMMUNICATION, COMPUTER programming, COPROCESSORS, COMPUTER software, MATHEMATICAL symmetry
- Abstract
Future exascale systems are expected to adopt compute nodes that incorporate many accelerators. To shed some light on the upcoming software challenge, this paper investigates the particular topic of programming clusters that have multiple Xeon Phi coprocessors in each compute node. A new offload approach is considered for intra-node communication, which combines Intel's APIs of the coprocessor offload infrastructure (COI) and the symmetric communication interface (SCIF) to achieve low latency. While the conventional pragma-based offload approach allows simpler programming, the COI-SCIF approach has three advantages: (1) lower overhead for launching offloaded code, (2) higher data-transfer bandwidths, and (3) more advanced asynchrony between computation and data movement. The low-level COI-SCIF approach is also shown to have benefits over the MPI-OpenMP counterpart, which belongs to the symmetric usage mode. Moreover, a hybrid programming strategy based on COI-SCIF is presented for joining the computational force of all CPUs and coprocessors while realizing communication hiding. All the programming approaches are tested with a real-world 3D application, for which the COI-SCIF-based approach shows a performance advantage on Tianhe-2. Copyright © 2015 John Wiley & Sons, Ltd. [ABSTRACT FROM AUTHOR]
- Published
- 2015
- Full Text
- View/download PDF
20. Ultra-Scalable CPU-MIC Acceleration of Mesoscale Atmospheric Modeling on Tianhe-2.
- Author
-
Xue, Wei, Yang, Chao, Fu, Haohuan, Wang, Xinliang, Xu, Yangtong, Liao, Junfeng, Gan, Lin, Lu, Yutong, Ranjan, Rajiv, and Wang, Lizhe
- Subjects
CENTRAL processing units, SCALABILITY, EULER method, DOMAIN decomposition methods, ALGORITHMS, ATMOSPHERIC models
- Abstract
In this work an ultra-scalable algorithm is designed and optimized to accelerate a 3D compressible Euler atmospheric model on the CPU-MIC hybrid system of Tianhe-2. We first reformulate the mesoscale model to avoid long-latency operations, and then employ carefully designed inter-node and intra-node domain decomposition algorithms to achieve balanced utilization of the different computing units. Proper communication-computation overlap and concurrent data transfer methods are utilized to reduce the cost of data movement at scale. A variety of optimization techniques on both the CPU side and the accelerator side are exploited to enhance the in-socket performance. The proposed hybrid algorithm successfully scales to 6,144 Tianhe-2 nodes with a nearly ideal weak-scaling efficiency, and achieves over 8 percent of the peak performance in double precision. This ultra-scalable hybrid algorithm may be of interest to the community for accelerating atmospheric models on increasingly heterogeneous supercomputers. [ABSTRACT FROM AUTHOR]
- Published
- 2015
- Full Text
- View/download PDF
21. Towards simulation of subcellular calcium dynamics at nanometre resolution.
- Author
-
Chai, Jun, Hake, Johan, Wu, Nan, Wen, Mei, Cai, Xing, Lines, Glenn T, Yang, Jing, Su, Huayou, Zhang, Chunyuan, and Liao, Xiangke
- Subjects
COMPUTER simulation, NANOTECHNOLOGY, CALCIUM, SUPERCOMPUTERS, HEART disease etiology
- Abstract
Numerical simulation of subcellular Ca2+ dynamics with a resolution down to one nanometre can be an important tool for discovering the physiological cause of many heart diseases. The requirement of enormous computational power, however, has made such simulations prohibitive so far. By using up to 12,288 Intel Xeon Phi 31S1P coprocessors on the new hybrid cluster Tianhe-2, at the time the number one supercomputer in the world, we have achieved 1.27 Pflop/s in double precision, which brings us much closer to nanometre resolution. This is the result of efficiently using the hardware on different levels: (1) a single Xeon Phi, (2) a single compute node that consists of a host and three coprocessors, and (3) a huge number of interconnected nodes. To overcome the challenge of programming Intel's new many-integrated core (MIC) architecture, we have adopted techniques such as vectorization, hierarchical data blocking, register data reuse, offloading computations to the coprocessors, and pipelining computations with intra-/inter-node communications. [ABSTRACT FROM AUTHOR]
- Published
- 2015
- Full Text
- View/download PDF
22. CPU-MIC Acceleration of Multiple-point Statistical Simulation on Tianhe-2
- Author
-
Jia Liu, Xiaogang Ma, Qiyu Chen, Gang Liu, and Zhesi Cui
- Subjects
Acceleration, Computer science, Hybrid system, Scalability, Tianhe-2, Solid modeling, Parallel computing, Supercomputer
- Abstract
Multiple-point statistics (MPS) has shown promise in representing heterogeneous phenomena in earth science. However, since MPS algorithms must scan the entire pattern library or training image for each unknown location, they incur severe computational cost. It is therefore difficult to achieve high-resolution simulation, especially 3D simulation of complex processes and phenomena. To characterize these phenomena in more detail, the size of the simulation grids used in numerical models has increased by many orders of magnitude in the past years. As cluster computers become widely available, parallel strategies are a natural way to increase the usable grid size and the complexity of the models. These strategies must profit from the possibilities offered by supercomputers with large numbers of processors. As one of the fastest supercomputers in the world, Tianhe-2 provides a CPU-MIC synergistic micro-heterogeneous architecture with rich computing resources and is widely recognized in the supercomputing field. This work designs and implements a parallel multiple-point statistical simulation based on the CPU-MIC hybrid system of the Tianhe-2 supercomputer. It aims to extend the size of simulation grids to billions of cells and to achieve fine characterization of complex structures and phenomena. A series of synthetic experiments were used to test the effectiveness of the proposed strategy.
- Published
- 2020
- Full Text
- View/download PDF
23. News Briefs.
- Subjects
MALWARE, BANDWIDTH research, GOOGLE Glass, WEB development
- Abstract
Topics include new technology that promises to revolutionize Web development, Hewlett-Packard's plan to reinvent the computer, a Microsoft antimalware operation that took innocent companies offline, a US Supreme Court decision that has sparked fear for cloud services' future, a malware campaign that threatened big companies in several countries, a new technology that "increases" wireless bandwidth, a report that indicates a slowdown in supercomputer performance growth, virtual reality coming to gaming, tech companies hardening their systems to fend off government spying, and movie theaters banning Google Glass. [ABSTRACT FROM PUBLISHER]
- Published
- 2014
- Full Text
- View/download PDF
24. Petascale scramjet combustion simulation on the Tianhe-2 heterogeneous supercomputer
- Author
-
Chuanfu Xu, Meifang Yang, Yonggang Che, and Yutong Lu
- Subjects
Coprocessor, Computer Networks and Communications, Computer science, Double-precision floating-point format, Symmetric multiprocessor system, Parallel computing, Supercomputer, Computer Graphics and Computer-Aided Design, Theoretical Computer Science, Petascale computing, Artificial Intelligence, Hardware and Architecture, Scalability, Tianhe-2, Central processing unit, Software
- Abstract
Combustion simulation is complex and computationally expensive, as it involves the integration of fundamental chemical kinetics and multidimensional Computational Fluid Dynamics (CFD) models. This paper presents our efforts in porting a real-world supersonic combustion simulation application to a heterogeneous architecture consisting of multi-core CPUs and Intel Many Integrated Core (MIC) coprocessors. Scalable OpenMP parallelization is added to make use of the large number of cores on the CPUs and MIC coprocessors. Single-thread performance optimizations are applied to improve computational efficiency. A CPU-MIC collaborative algorithm, along with a series of techniques to improve data-transfer efficiency and load balance, is employed. Performance evaluation is performed on the Tianhe-2 supercomputer. The results show that on a single node, the optimized CPU-only version is 8.33 times faster than the baseline version, and the CPU + MIC heterogeneous version is in turn 3.07 times faster than the optimized CPU-only version. The resulting code scales effectively to 5120 nodes (998,400 cores) on a mesh with 27.46 billion cells. Given that the total number of floating-point operations is reduced by about 10 times after our optimizations, the heterogeneous version still achieves a sustained double-precision floating-point performance of 0.46 Pflops on 5120 nodes. This demonstrates petascale heterogeneous computing capabilities for real-world supersonic combustion problems.
- Published
- 2018
- Full Text
- View/download PDF
25. Hybrid parallel framework for multiple-point geostatistics on Tianhe-2: A robust solution for large-scale simulation
- Author
-
Qiyu Chen, Gregoire Mariethoz, Xiaogang Ma, Zhesi Cui, and Gang Liu
- Subjects
Distributed computing, Tianhe-2, Geostatistics, Computers in Earth Sciences, Grid, Information Systems
- Abstract
Multiple-point geostatistical (MPS) simulation methods have attracted an enormous amount of attention in the earth and environmental sciences due to their ability to enhance the extraction and synthesis of heterogeneous patterns. To characterize the subtle features of heterogeneous structures and phenomena, large-scale and high-resolution simulations are often required, and the size of simulation grids has accordingly increased dramatically. Since MPS is a sequential process over each grid unit along a simulation path, it incurs severe computational cost. In this work, a new hybrid parallel framework is proposed for MPS simulation over large areas with enormous numbers of grid cells. Both inter-node-level and intra-node-level parallel strategies are combined in this framework. To maintain the quality of the realizations, we implement a conflict-control method adapted to the Monte Carlo process. An optimization method for the simulation information is also embedded to reduce the inter-node communication overhead. A series of synthetic tests were used to verify the validity and performance of the proposed hybrid parallel framework. The results corroborate that the proposed framework can efficiently achieve high-resolution reproduction and characterization of complex structures and phenomena in the earth sciences.
- Published
- 2021
- Full Text
- View/download PDF
26. The Performance Test and Optimization of Crystal-MD Program on Tianhe-2
- Author
-
Changjun Hu, Jianjiang Li, Kai Zhang, Peng Wei, and Jie Wang
- Subjects
Molecular dynamics, Computer science, Homogeneous, Tianhe-2, Supercomputer, Computational science
- Abstract
In virtual reactor research, the study of materials is one of the most significant issues, and Molecular Dynamics (MD) is a widely used method for studying material irradiation effects. In this paper, the existing Crystal-MD simulation program was first tested in homogeneous mode on the Tianhe-2 supercomputer; the program was then rewritten for the heterogeneous multi-core architecture of Tianhe-2 and tested again on the same platform. Our experimental results show that the homogeneous Crystal-MD simulation program scales well on Tianhe-2 and that, with the same number of nodes, the heterogeneous program is more efficient than the homogeneous one.
- Published
- 2019
- Full Text
- View/download PDF
27. High-Scalable Collaborated Parallel Framework for Large-Scale Molecular Dynamic Simulation on Tianhe-2 Supercomputer
- Author
-
Weiliang Zhu, Shaoliang Peng, Xiaoyu Zhang, Yutong Lu, Xiangke Liao, Dong Dong, Wenhe Su, Jie Liu, Canqun Yang, Kai Lu, and Dong-Qing Wei
- Subjects
Speedup, Computers, Applied Mathematics, Computational Biology, Molecular Dynamics Simulation, Supercomputer, Computational science, Acceleration, Software, Scalability, Genetics, Tianhe-2, Model building, Algorithms, Biotechnology
- Abstract
Molecular dynamics (MD) is a computer simulation method for studying the physical movements of atoms and molecules that provides detailed microscopic sampling at the molecular scale. With continuous efforts and improvements, MD simulation has gained popularity in materials science, biochemistry, and biophysics, with expanding application areas and data scales. Assisted Model Building with Energy Refinement (AMBER) is one of the most widely used software packages for conducting MD simulations. However, the speed of AMBER MD simulations of systems with millions of atoms at the microsecond scale still needs to be improved. In this paper, we propose a parallel acceleration strategy for AMBER on the Tianhe-2 supercomputer. The parallel optimization of AMBER is carried out on three different levels: fine-grained OpenMP parallelism on a single CPU, single-node CPU/MIC parallel optimization, and multi-node multi-MIC collaborative parallel acceleration. With the three-level parallel acceleration strategy above, we achieved speedups of up to 25-33 times compared with the original program.
- Published
- 2018
28. Teno: An Efficient High-Throughput Computing Job Execution Framework on Tianhe-2
- Author
-
Wei Yu, Yi-Xian Shen, Lin Li, Yun-Fei Du, Zhi-Guang Chen, and Yu-Tong Lu
- Subjects
Job scheduler, Computer science, Distributed computing, Supercomputer, Scheduling (computing), Software deployment, Tianhe-2, Resource management, High-throughput computing
- Abstract
Large-scale, loosely coupled applications cannot be run directly and efficiently on high-performance computing platforms, while deploying and maintaining separate high-performance and high-throughput computing systems wastes computing resources. To solve the problem that existing resource management systems cannot make high-throughput computing applications execute efficiently on high-performance computers, we propose, design, and implement a high-throughput computing job execution framework, Teno, that requires no modification of the existing Slurm configuration on Tianhe-2. It uses Slurm to implement fine-grained resource scheduling based on the idea of hierarchical scheduling, and it optimizes the traditional Master-Worker model (sketched after this entry), thereby speeding up high-throughput runs and increasing the effective utilization of cluster resources. Effective fault-tolerance mechanisms such as fault recovery and error retry are also implemented. Finally, we design various experiments on Tianhe-2 to test and evaluate the key factors in high-throughput computing with Teno, Slurm, and HTCondor, and we analyze in detail why Teno outperforms the other two.
- Published
- 2018
- Full Text
- View/download PDF
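A minimal sketch of the Master-Worker model that entry 28 builds on: the master hands out job IDs on demand, and workers request, execute, and ask again until told to stop. run_job() is a hypothetical stand-in for launching one high-throughput job.

```c
#include <mpi.h>

#define TAG_REQ  1
#define TAG_WORK 2

void run_job(int id);   /* hypothetical: execute one small job */

void master(int n_jobs, int n_workers, MPI_Comm comm)
{
    int next = 0, done = 0;
    while (done < n_workers) {
        int dummy;
        MPI_Status st;
        MPI_Recv(&dummy, 1, MPI_INT, MPI_ANY_SOURCE, TAG_REQ, comm, &st);
        int job = (next < n_jobs) ? next++ : -1;      /* -1 means: no work left */
        MPI_Send(&job, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK, comm);
        if (job < 0) ++done;                          /* that worker is finished */
    }
}

void worker(MPI_Comm comm)
{
    for (;;) {
        int ask = 0, job;
        MPI_Send(&ask, 1, MPI_INT, 0, TAG_REQ, comm);
        MPI_Recv(&job, 1, MPI_INT, 0, TAG_WORK, comm, MPI_STATUS_IGNORE);
        if (job < 0) break;
        run_job(job);
    }
}
```

Pull-based dispatch like this is naturally load-balanced, which is why frameworks such as Teno optimize rather than replace it, for example by batching requests and adding retry-on-failure around each job.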
29. Parallelizing and optimizing large-scale 3D multi-phase flow simulations on the Tianhe-2 supercomputer
- Author
-
Yongxian Wang, Chuanfu Xu, Xiaogang Deng, Xiang Gao, Zhifang Song, Min Xiong, and Dali Li
- Subjects
Computer Networks and Communications, Computer science, Graphics processing unit, Parallel computing, Computational fluid dynamics, Supercomputer, Computer Science Applications, Theoretical Computer Science, Computational Theory and Mathematics, Tianhe-2, Boundary value problem, Software, Xeon Phi
- Abstract
The lattice Boltzmann method (LBM) is a widely used computational fluid dynamics method for flow problems with complex geometries and various boundary conditions. Large-scale LBM simulations with increasing resolution and extended temporal range require massive high-performance computing (HPC) resources, thus motivating us to port it onto modern many-core heterogeneous supercomputers like Tianhe-2. Although many-core accelerators such as graphics processing units and the Intel MIC have a dramatic advantage in floating-point performance and power efficiency over CPUs, they also pose a tough challenge for parallelizing and optimizing computational fluid dynamics codes on large-scale heterogeneous systems.
- Published
- 2015
- Full Text
- View/download PDF
30. Ultra-Scalable CPU-MIC Acceleration of Mesoscale Atmospheric Modeling on Tianhe-2
- Author
-
Rajiv Ranjan, Lin Gan, Yangtong Xu, Wei Xue, Xinliang Wang, Lizhe Wang, Haohuan Fu, Yutong Lu, Junfeng Liao, and Chao Yang
- Subjects
Computer science, Domain decomposition methods, Parallel computing, Stencil, Hybrid algorithm, Theoretical Computer Science, Acceleration, Computational Theory and Mathematics, Hardware and Architecture, Hybrid system, Scalability, Tianhe-2, Central processing unit, Software
- Abstract
In this work an ultra-scalable algorithm is designed and optimized to accelerate a 3D compressible Euler atmospheric model on the CPU-MIC hybrid system of Tianhe-2. We first reformulate the mesoscale model to avoid long-latency operations, and then employ carefully designed inter-node and intra-node domain decomposition algorithms to achieve balanced utilization of the different computing units. Proper communication-computation overlap and concurrent data transfer methods are utilized to reduce the cost of data movement at scale. A variety of optimization techniques on both the CPU side and the accelerator side are exploited to enhance the in-socket performance. The proposed hybrid algorithm successfully scales to 6,144 Tianhe-2 nodes with a nearly ideal weak-scaling efficiency, and achieves over 8 percent of the peak performance in double precision. This ultra-scalable hybrid algorithm may be of interest to the community for accelerating atmospheric models on increasingly heterogeneous supercomputers.
- Published
- 2015
- Full Text
- View/download PDF
31. Communication-hiding programming for clusters with multi-coprocessor nodes
- Author
-
Chunyuan Zhang, Mandan Zhao, Jun Chai, Dong Xinnan, Xing Cai, and Mei Wen
- Subjects
Coprocessor, Computer Networks and Communications, Computer science, Node (networking), Parallel computing, Computer Science Applications, Theoretical Computer Science, Asynchrony (computer programming), Computational Theory and Mathematics, Tianhe-2, Overhead (computing), Latency (engineering), Software, Xeon Phi
- Abstract
Future exascale systems are expected to adopt compute nodes that incorporate many accelerators. To shed some light on the upcoming software challenge, this paper investigates the particular topic of programming clusters that have multiple Xeon Phi coprocessors in each compute node. A new offload approach is considered for intra-node communication, which combines Intel's APIs of the coprocessor offload infrastructure (COI) and the symmetric communication interface (SCIF) to achieve low latency. While the conventional pragma-based offload approach allows simpler programming, the COI-SCIF approach has three advantages: (1) lower overhead for launching offloaded code, (2) higher data-transfer bandwidths, and (3) more advanced asynchrony between computation and data movement. The low-level COI-SCIF approach is also shown to have benefits over the MPI-OpenMP counterpart, which belongs to the symmetric usage mode. Moreover, a hybrid programming strategy based on COI-SCIF is presented for joining the computational force of all CPUs and coprocessors while realizing communication hiding. All the programming approaches are tested with a real-world 3D application, for which the COI-SCIF-based approach shows a performance advantage on Tianhe-2. Copyright © 2015 John Wiley & Sons, Ltd.
- Published
- 2015
- Full Text
- View/download PDF
32. Parallel Implementation and Optimizations of Visibility Computing of 3D Scene on Tianhe-2 Supercomputer
- Author
-
Congpin Zhang, Changmao Wu, Xiaodong Wang, and Zhengwei Xu
- Subjects
Multi-core processor, Speedup, Computer science, Parallel algorithm, Supercomputer, Rendering (computer graphics), Computer graphics, Computer engineering, Scalability, Tianhe-2
- Abstract
Visibility computing is a basic problem in computer graphics and is often the bottleneck in realistic rendering algorithms. Common applications include determining the objects visible from a viewpoint, virtual reality, real-time simulation, and 3D interactive design. As a technique for accelerating rendering, visibility computing has gained great attention in recent years. Traditional visibility computing on single-processor machines has been unable to keep up with increasingly large-scale and complex scenes due to its lack of parallelism. However, designing parallel algorithms on a cluster faces many challenges, including imbalanced workloads among compute nodes, complicated mathematical models, and the diverse domain knowledge required. In this paper, we propose an efficient and highly scalable framework for visibility computing on the Tianhe-2 supercomputer. First, a new technique called hemispheric visibility computing is designed, which overcomes the visibility loss of the traditional perspective algorithm. Second, a distributed parallel algorithm for visibility computing is implemented, based on the master-worker architecture. Finally, we discuss the granularity of visibility computing and some optimization strategies for improving overall performance. Experiments on the Tianhe-2 supercomputer show that our distributed parallel visibility-computing framework reaches nearly linear speedup using up to 7680 CPU cores.
- Published
- 2018
- Full Text
- View/download PDF
33. ICON-MIC: Implementing a CPU/MIC Collaboration Parallel Framework for ICON on Tianhe-2 Supercomputer
- Author
-
Jingrong Zhang, Zhiyong Liu, Fei Sun, Xiaohua Wan, Lun Li, Zihao Wang, Yu Chen, and Fa Zhang
- Subjects
Electron Microscope Tomography, Fourier Analysis, Computer science, Fast Fourier transform, Parallel computing, Load balancing (computing), Supercomputer, Matrix multiplication, Computational Mathematics, Computational Theory and Mathematics, Modeling and Simulation, Scalability, Genetics, Tianhe-2, Image Processing, Computer-Assisted, Central processing unit, Molecular Biology, Xeon Phi, Software
- Abstract
Electron tomography (ET) is an important technique for studying the three-dimensional structures of biological ultrastructure. Recently, ET has reached sub-nanometer resolution for investigating the native and conformational dynamics of macromolecular complexes when combined with the sub-tomogram averaging approach. Due to the limited sampling angles, ET reconstruction typically suffers from the "missing wedge" problem. Using a validation procedure, iterative compressed-sensing optimized nonuniform fast Fourier transform (NUFFT) reconstruction (ICON) demonstrates its power in restoring validated missing information for low-signal-to-noise-ratio biological ET datasets. However, the huge computational demand has become a bottleneck for the application of ICON. In this work, we implemented a parallel acceleration of ICON on many-integrated-core (MIC) Xeon Phi cards, ICON-MIC, to address this demand. In this step, we parallelize the element-wise matrix operations and use efficient matrix summation to reduce the cost of matrix computation. We also developed parallel versions of NUFFT on MIC to achieve high acceleration of ICON via more efficient fast Fourier transform (FFT) calculation. We then proposed a hybrid task-allocation strategy (two-level load balancing) to improve the overall performance of ICON-MIC by making full use of the idle resources on the Tianhe-2 supercomputer. Experimental results on two different datasets show that ICON-MIC has high accuracy on biological specimens under different noise levels and a significant acceleration, up to 13.3x, compared with the CPU version. Further, ICON-MIC has good scalability efficiency and overall performance on the Tianhe-2 supercomputer.
- Published
- 2017
34. mD3DOCKxb: An Ultra-Scalable CPU-MIC Coordinated Virtual Screening Framework
- Author
-
Bertil Schmidt, Shunyun Yang, Weiliang Zhu, Shaoliang Peng, Wenhe Su, Kai Lu, Xiangke Liao, Kuan-Ching Li, Yutong Lu, Xiaoyu Zhang, Zhiqiang Zhang, and Dong Dong
- Subjects
Virtual screening, Multi-core processor, Coprocessor, Computer science, Parallel computing, Supercomputer, Embedded system, Scalability, Tianhe-2, Algorithm design, Massively parallel
- Abstract
Molecular docking is an important method in computational drug discovery. In large-scale virtual screening, millions of small drug-like molecules (chemical compounds) are compared against a designated target protein (receptor). Depending on the docking algorithm used for screening, this can take several weeks on conventional HPC systems. However, for certain applications, including large-scale screening tasks for newly emerging infectious diseases, such high runtimes can be prohibitive. In this paper, we investigate how the massively parallel neo-heterogeneous architecture of the Tianhe-2 supercomputer, consisting of thousands of nodes comprising CPUs and MIC coprocessors, can be used efficiently for virtual screening tasks. Our proposed approach is based on a coordinated parallel framework called mD3DOCKxb, in which CPUs collaborate with MICs to achieve high hardware utilization. mD3DOCKxb comprises a novel, efficient communication engine for dynamic task scheduling and load balancing between nodes in order to reduce communication and I/O latency. This results in a highly scalable implementation with a parallel efficiency of over 84% (strong scaling) when executing on 8,000 Tianhe-2 nodes comprising 192,000 CPU cores and 1,368,000 MIC cores.
- Published
- 2017
- Full Text
- View/download PDF
35. COPCOP: A Novel Algorithm and Parallel Optimization Framework for Co-Evolutionary Domain Detection
- Author
-
Shaoliang Peng, Xiaoyu Zhang, Benyun Shi, Xiangke Liao, Kenli Li, and Zhu Hao
- Subjects
Computer science, Applied Mathematics, Parallel optimization, Multiple methods, Supercomputer, Negative selection, Molecular level, Genetics, Tianhe-2, Algorithm, Selection (genetic algorithm), Biotechnology
- Abstract
Co-evolution exists ubiquitously in biological systems. At the molecular level, interacting proteins, such as ligands and their receptors and components of protein complexes, co-evolve to maintain their structural and functional interactions. Many proteins contain multiple functional domains interacting with different partners, making the co-evolution of interacting domains especially prominent. Multiple methods have been developed to predict interacting proteins, or domains within proteins, by detecting their co-variation. This strategy neglects the fact that interacting domains can be highly co-conserved due to their functional interactions. Here we report a novel algorithm, COPCOP, to detect signals of both co-positive selection (co-variation) and co-purifying selection (co-conservation). Results show that our algorithm performs well and outperforms the popular co-variation analysis program CAPS. We also design and implement a multi-level parallel acceleration strategy for COPCOP on the Tianhe-2 CPU-MIC heterogeneous supercomputer to meet the needs of large-scale co-evolutionary domain detection.
- Published
- 2020
- Full Text
- View/download PDF
36. MilkyWay-2 supercomputer: system and application
- Author
-
Yutong Lu, Xiangke Liao, Liquan Xiao, and Canqun Yang
- Subjects
TOP500, General Computer Science, Computer science, Supercomputer, Theoretical Computer Science, Instruction set, Software, Computer architecture, Tianhe-2, Operating system, Software system, Massively parallel, Graph500
- Abstract
On June 17, 2013, the MilkyWay-2 (Tianhe-2) supercomputer was crowned the fastest supercomputer in the world on the 41st TOP500 list. This paper provides an overview of the MilkyWay-2 project and describes the design of its hardware and software systems. The key architectural features of MilkyWay-2 are highlighted, including: neo-heterogeneous compute nodes integrating commodity off-the-shelf processors and accelerators that share a similar instruction set architecture; a powerful network that employs proprietary interconnection chips to support massively parallel message-passing communications; a proprietary 16-core processor designed for scientific computing; and efficient software stacks that provide a high-performance file system, an emerging programming model for heterogeneous systems, and intelligent system administration. We perform an extensive evaluation with wide-ranging applications, from the LINPACK and Graph500 benchmarks to massively parallel software deployed on the system.
- Published
- 2014
- Full Text
- View/download PDF
37. Evaluation of the computational performance of the finite-volume atmospheric model of the IAP/LASG (FAMIL) on a high-performance computer
- Author
-
Jinxiao Li, Yimin Liu, Qing Bao, and Guoxiong Wu
- Subjects
Atmospheric Science ,Speedup ,010504 meteorology & atmospheric sciences ,Computer science ,computational performance ,CPU time ,02 engineering and technology ,Parallel computing ,FAMIL ,Oceanography ,01 natural sciences ,lcsh:Oceanography ,0202 electrical engineering, electronic engineering, information engineering ,lcsh:GC1-1581 ,scalability ,lcsh:Environmental sciences ,0105 earth and related environmental sciences ,lcsh:GE1-350 ,Node (networking) ,Message passing ,Tianhe-2 ,Supercomputer ,FLOPS ,020202 computer hardware & architecture ,Scalability ,Vectorization (mathematics) - Abstract
High computational performance is extremely important for climate system models, especially in ultra-high-resolution model development. In this study, the computational performance of the Finite-volume Atmospheric Model of the IAP/LASG (FAMIL) was comprehensively evaluated on Tianhe-2, which was the world's top-ranked supercomputer from June 2013 to May 2016. A standardized Atmospheric Model Intercomparison Project (AMIP) type of experiment was carried out, focusing on the computational performance of each node as well as the simulated years per day (SYPD), the running-cost speedup, and the scalability of FAMIL. The results indicated that (1) based on five indexes (CPU usage; the percentage of CPU time taken by kernel mode and by message-passing waiting time (CPU_SW); code vectorization (VEC); average Gflops (Gflops_AVE); and peak Gflops (Gflops_PK)), FAMIL shows excellent computational performance on every Tianhe-2 computing node; (2) considering SYPD and the cost speedup of FAMIL together, the optimal choice of Message Passing Interface (MPI) numbers of processors (MNPs) appears when FAMIL uses 384 and 1536 MNPs for C96 (100 km) and C384 (25 km), respectively; and (3) FAMIL shows positive scalability as the number of threads driving the model increases. Considering the fast network and the MIC-architecture acceleration cards on Tianhe-2, there is still significant room to improve the computational performance of FAMIL. (A worked SYPD example follows this record.)
- Published
- 2017
- Full Text
- View/download PDF
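SYPD, the throughput metric used in the record above, is simply simulated years divided by wall-clock days. A worked example of the arithmetic, with invented numbers purely for illustration:

    // Simulated Years Per Day (SYPD): simulated time / wall-clock days.
    // The run length and timing below are invented for illustration only.
    #include <cstdio>

    int main() {
        double simulated_years = 1.0;       // length of the AMIP-style run
        double wallclock_hours = 4.8;       // measured time to complete it
        double sypd = simulated_years / (wallclock_hours / 24.0);
        // Cost speedup then asks whether doubling MNPs nearly doubles SYPD;
        // if not, the extra processors are not cost-effective.
        std::printf("SYPD = %.2f\n", sypd); // here: 5.00 simulated years/day
        return 0;
    }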
38. Accelerating Electron Tomography Reconstruction Algorithm ICON Using the Intel Xeon Phi Coprocessor on Tianhe-2 Supercomputer
- Author
-
Yu Chen, Fei Sun, Zhiyong Liu, Zihao Wang, Lun Li, Fa Zhang, Jingrong Zhang, and Xiaohua Wan
- Subjects
0301 basic medicine ,Coprocessor ,Computer science ,Pentium ,Hyper-threading ,02 engineering and technology ,Parallel computing ,021001 nanoscience & nanotechnology ,SAGA-220 ,Supercomputer ,03 medical and health sciences ,030104 developmental biology ,Tianhe-2 ,Itanium ,0210 nano-technology ,Xeon Phi - Abstract
Electron tomography (ET) is an important method for studying three-dimensional cell ultrastructure. Combining with a sub-volume averaging approach, ET provides new possibilities for investigating in situ macromolecular complexes in sub-nanometer resolution. Because of the limited sampling angles, ET reconstruction usually suffers from the ‘missing wedge’ problem. With a validation procedure, Iterative Compressed-sensing Optimized NUFFT reconstruction (ICON) demonstrates its power in the restoration of validated missing information for low SNR biological ET dataset. However, the huge computational demand has become a bottleneck for the application of ICON. In this work, we developed the strategies of parallelization for NUFFT and ICON, and then implemented them on a Xeon Phi 31SP coprocessor to generate the parallel program ICON-MIC. We also proposed a hybrid task allocation strategy and extended ICON-MIC on multiple Xeon Phi cards on Tianhe-2 supercomputer to generate program ICON-MULT-MIC. With high accuracy, ICON-MIC has a significant acceleration compared to the CPU version, up to 13.3x, and ICON-MULT-MIC has good weak and strong scalability efficiency on Tianhe-2 supercomputer.
- Published
- 2017
- Full Text
- View/download PDF
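In tomographic reconstruction, the 3D volume typically decomposes into independent 2D slices, which is what makes a multi-card extension natural. The helper below is a hedged sketch of one plausible static allocation (round-robin of slice indices across K coprocessors); the paper's actual hybrid allocation strategy is not detailed in the abstract.

    // Hedged sketch: statically distribute independent reconstruction slices
    // across K accelerator cards in round-robin order. Illustrative only;
    // not the paper's hybrid task allocation strategy.
    #include <vector>

    std::vector<std::vector<int>> assign_slices(int num_slices, int num_cards) {
        std::vector<std::vector<int>> plan(num_cards);
        for (int s = 0; s < num_slices; ++s)
            plan[s % num_cards].push_back(s);  // card (s mod K) gets slice s
        return plan;
    }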
39. Building the Supercomputer
- Author
-
Ashwin Pajankar
- Subjects
World Wide Web ,Computer science ,Tianhe-2 ,Home directory ,Wifi network ,Image editing ,computer.software_genre ,Supercomputer ,computer - Abstract
This chapter covers several basic but important aspects of Pixlr Editor that will be especially useful for those new to image editing.
- Published
- 2017
- Full Text
- View/download PDF
40. Accelerator-Centered Programming on Heterogeneous Systems
- Author
-
Canqun Yang, Cheng Chen, and Yunfei Du
- Subjects
020203 distributed computing ,TheoryofComputation_COMPUTATIONBYABSTRACTDEVICES ,Coprocessor ,Speedup ,Computer science ,020207 software engineering ,02 engineering and technology ,Parallel computing ,Supercomputer ,0202 electrical engineering, electronic engineering, information engineering ,Programming paradigm ,Tianhe-2 ,Central processing unit ,PCI Express ,Data transmission - Abstract
Parallel many-core devices contribute to heterogeneous architectures and achieve high computation throughput. Working as coprocessors connected to general-purpose CPUs via PCIe, these special-purpose cores usually serve as floating-point accelerators (ACCs). Popular programming models typically offload the compute-intensive parts to the accelerator and then aggregate results, which incurs a large amount of data transfer over PCIe. In this paper, we introduce an ACC-centered model to cope with the limited bandwidth of PCIe, increase performance, and reduce accelerator idle time. To realize data-near computing, our ACC-centered model keeps the program centered on the ACC and offloads the control-intensive parts to the CPU, so that both CPU and ACC contribute to performance according to their architectural strengths. Validation on the Tianhe-2 supercomputer shows that our ACC-centered LU implementation competes with the highly optimized Intel MKL hybrid implementation and achieves about a 5x speedup over the CPU version. (A reference LU sketch follows this record.)
- Published
- 2016
- Full Text
- View/download PDF
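The LU factorization named above is the classic dense kernel. As a point of reference, here is a minimal sequential LU with partial pivoting in textbook form, assuming a row-major n-by-n matrix; the paper's ACC-centered, blocked, heterogeneous version is far more elaborate and is not reproduced here.

    // Textbook LU factorization with partial pivoting (in-place, row-major).
    // A minimal sequential reference for the kernel discussed above; the
    // ACC-centered version in the paper blocks and distributes this work.
    #include <cmath>
    #include <utility>
    #include <vector>

    // Factor A (n x n) in place into L\U; piv records the row swaps.
    void lu_factor(std::vector<double>& A, int n, std::vector<int>& piv) {
        piv.resize(n);
        for (int k = 0; k < n; ++k) {
            int p = k;                           // find the pivot row
            for (int i = k + 1; i < n; ++i)
                if (std::fabs(A[i * n + k]) > std::fabs(A[p * n + k])) p = i;
            piv[k] = p;
            if (p != k)                          // swap rows k and p
                for (int j = 0; j < n; ++j) std::swap(A[k * n + j], A[p * n + j]);
            for (int i = k + 1; i < n; ++i) {    // eliminate below the pivot
                A[i * n + k] /= A[k * n + k];
                for (int j = k + 1; j < n; ++j)
                    A[i * n + j] -= A[i * n + k] * A[k * n + j];
            }
        }
    }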
41. Accelerating the Simulation of Thermal Convection in the Earth's Outer Core on Tianhe-2
- Author
-
Yunfei Du, Ligang Li, Changmao Wu, Yutong Lu, Leisheng Li, Haitao Zhao, Fangfang Liu, and Chao Yang
- Subjects
020203 distributed computing ,Computer simulation ,Xeon ,Preconditioner ,Computer science ,Linear system ,010103 numerical & computational mathematics ,02 engineering and technology ,Parallel computing ,Supercomputer ,01 natural sciences ,Outer core ,Computational science ,0202 electrical engineering, electronic engineering, information engineering ,Tianhe-2 ,Distributed memory ,0101 mathematics - Abstract
Numerical simulation of thermal convection in the Earth's outer core requires extreme-scale computing due to the large temporal and spatial disparities, extreme physical parameters, rapid rotation, and spherical geometry. In this work, numerical simulation of thermal convection in the Earth's outer core on CPU-MIC heterogeneous many-core systems is studied. First, starting from a legacy parallel code based on the PETSc software package, a simulation framework for CPU-MIC heterogeneous many-core systems was developed. Second, a sparse linear solver for CPU-MIC heterogeneous many-core systems, focused on solving the two linear systems of the simulation, is presented and optimized. Third, several computational kernels of the simulation, including sparse matrix-vector multiplication (SpMV) and a polynomial preconditioner, are implemented and optimized for distributed-memory Xeon Phi-accelerated systems. In addition, to reduce the cost of data movement, we apply methods that minimize memory access, PCI-E data transfer, and MPI communication. Finally, further optimizations are applied to the extended code. Experiments on the Tianhe-2 supercomputer show that, compared with the original code, our Xeon Phi-accelerated design delivers 6.93x and 6.00x speedups for a single MIC device and 64 MIC devices, respectively. (A minimal SpMV sketch follows this record.)
- Published
- 2016
- Full Text
- View/download PDF
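SpMV, one of the kernels named in the record above, is usually implemented over a compressed sparse row (CSR) matrix. A minimal OpenMP version follows; this is a generic sketch, and the paper's MIC-tuned kernel adds vectorization and data-movement optimizations not shown here.

    // Generic CSR sparse matrix-vector multiply y = A*x with OpenMP.
    // A minimal sketch of the SpMV kernel discussed above; MIC-tuned
    // versions add blocking, vectorization, and prefetching.
    #include <vector>

    struct CsrMatrix {
        int rows;
        std::vector<int> row_ptr;   // size rows+1: row starts in col_idx/val
        std::vector<int> col_idx;   // column index of each nonzero
        std::vector<double> val;    // value of each nonzero
    };

    void spmv(const CsrMatrix& A, const std::vector<double>& x,
              std::vector<double>& y) {
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < A.rows; ++i) {
            double sum = 0.0;
            for (int k = A.row_ptr[i]; k < A.row_ptr[i + 1]; ++k)
                sum += A.val[k] * x[A.col_idx[k]];
            y[i] = sum;             // rows are independent, so this is safe
        }
    }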
42. mAMBER: A CPU/MIC collaborated parallel framework for AMBER on Tianhe-2 supercomputer
- Author
-
Jie Liu, Weiliang Zhu, Yutong Lu, Kai Lu, Xiaoyu Zhang, Canqun Yang, Xiangke Liao, Shaoliang Peng, and Dong-Qing Wei
- Subjects
Source code ,Coprocessor ,Speedup ,010304 chemical physics ,business.industry ,Computer science ,media_common.quotation_subject ,Parallel computing ,010402 general chemistry ,Supercomputer ,01 natural sciences ,0104 chemical sciences ,Computational science ,Software ,0103 physical sciences ,Scalability ,Tianhe-2 ,Central processing unit ,business ,media_common - Abstract
Molecular dynamics (MD) is a computer simulation method for studying the physical movements of atoms and molecules that provides detailed microscopic sampling at the molecular scale. With continuous effort and improvement, MD simulation has gained popularity in materials science, biochemistry, and biophysics, with diverse application areas and expanding data scales. Assisted Model Building with Energy Refinement (AMBER) is one of the most widely used software packages for conducting MD simulations. However, the speed of AMBER MD simulations of systems with millions of atoms at the microsecond scale still needs to be improved. In this paper, we propose a parallel acceleration strategy for AMBER on the Tianhe-2 supercomputer. The parallel optimization of AMBER is carried out at three different levels: fine-grained OpenMP parallelism on a single MIC, single-node CPU/MIC collaborative optimization, and multi-node multi-MIC collaborative acceleration. With this three-level strategy, we achieved a highest speedup of 25-33 times over the original program. (A toy nonbonded-kernel sketch follows this record.) Source Code: https://github.com/tianhe2/mAMBER
- Published
- 2016
- Full Text
- View/download PDF
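The fine-grained OpenMP level in such strategies typically parallelizes the pairwise nonbonded loop. Below is a hedged, self-contained illustration using a bare all-pairs Lennard-Jones energy sum; it is a toy stand-in, since AMBER's actual kernels use neighbor lists, cutoffs, and PME electrostatics.

    // Toy nonbonded kernel: total Lennard-Jones energy over all pairs, with
    // an OpenMP reduction. Illustrates the fine-grained parallel level only;
    // real MD kernels use neighbor lists, cutoffs, and PME.
    #include <cmath>
    #include <vector>

    struct Vec3 { double x, y, z; };

    double lj_energy(const std::vector<Vec3>& pos, double eps, double sigma) {
        const int n = static_cast<int>(pos.size());
        double energy = 0.0;
        #pragma omp parallel for reduction(+ : energy) schedule(dynamic)
        for (int i = 0; i < n; ++i) {
            for (int j = i + 1; j < n; ++j) {
                double dx = pos[i].x - pos[j].x, dy = pos[i].y - pos[j].y,
                       dz = pos[i].z - pos[j].z;
                double r2 = dx * dx + dy * dy + dz * dz;
                double s6 = std::pow(sigma * sigma / r2, 3.0);
                energy += 4.0 * eps * (s6 * s6 - s6);  // 4e[(s/r)^12-(s/r)^6]
            }
        }
        return energy;
    }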
43. Enabling Tissue-Scale Cardiac Simulations Using Heterogeneous Computing on Tianhe-2
- Author
-
Namit Gaur, Johannes Langguth, Chunyuan Zhang, Qiang Lan, Mei Wen, and Xing Cai
- Subjects
0301 basic medicine ,03 medical and health sciences ,030104 developmental biology ,Computer science ,Calcium handling ,Scalability ,Tianhe-2 ,Symmetric multiprocessor system ,Parallel computing ,Load balancing (computing) ,Supercomputer ,Xeon Phi - Abstract
We develop a simulator for 3D tissue of the human cardiac ventricle with a physiologically realistic cell model and deploy it on the Tianhe-2 supercomputer. To attain the full performance of the heterogeneous CPU-Xeon Phi design, we use carefully optimized codes for both devices and combine them to obtain suitable load balancing. Using a large number of nodes, we are able to perform tissue-scale simulations of the electrical activity and calcium handling in millions of cells, at a level of detail that tracks the states of trillions of ryanodine receptors. We can thus simulate arrhythmogenic spiral waves and other complex arrhythmogenic patterns that arise from calcium-handling deficiencies in human cardiac ventricle tissue. Thanks to extensive code tuning and parallelization via OpenMP, MPI, and SCIF/COI, large-scale simulations of 10 heartbeats can be performed in a matter of hours. Test results indicate excellent scalability, paving the way for detailed whole-heart simulations on future generations of leadership-class supercomputers. (A load-splitting sketch follows this record.)
- Published
- 2016
- Full Text
- View/download PDF
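Suitable load balancing between the CPU and the Xeon Phi ultimately comes down to splitting cells in proportion to measured device throughput, so both devices finish a time step at about the same moment. A hedged sketch of that proportional split (illustrative only; the paper derives its balance from its own tuned kernels):

    // Hedged sketch: split N cells between CPU and coprocessor in proportion
    // to measured throughputs (cells/second), so both finish a step together.
    #include <utility>

    std::pair<int, int> split_cells(int n_cells, double cpu_rate, double phi_rate) {
        // Give the CPU its throughput share; the coprocessor takes the rest.
        int cpu_share = static_cast<int>(n_cells * cpu_rate / (cpu_rate + phi_rate));
        return {cpu_share, n_cells - cpu_share};
    }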
44. Evaluating and Optimizing Parallel LU-SGS for High-Order CFD Simulations on the Tianhe-2 Supercomputer
- Author
-
Dali Li, Chuanfu Xu, and Ningbo Guo
- Subjects
0209 industrial biotechnology ,Multi-core processor ,Xeon ,Computer science ,Pipeline (computing) ,02 engineering and technology ,Parallel computing ,Supercomputer ,01 natural sciences ,010305 fluids & plasmas ,Pipeline transport ,020901 industrial engineering & automation ,Data dependency ,0103 physical sciences ,Tianhe-2 ,Xeon Phi - Abstract
The inherent strong data dependency of LU-SGS poses tough challenges for shared-memory parallelization. The popular pipeline approach to parallel LU-SGS in CFD achieves impressive parallel scalability on earlier multi-core processors. However, recent experience shows that the scalability of pipeline LU-SGS drops dramatically on emerging many-core processors such as the Xeon Phi, due to high pipeline startup and draining overheads and severe load imbalance. We find that the increasingly large pipeline depth greatly hinders the applicability of pipeline LU-SGS in realistic parallel CFD simulations on many-core processors. To alleviate these performance issues, we propose an improved pipeline LU-SGS algorithm that organizes threads hierarchically using nested OpenMP, constructing a sub-pipeline within each original pipeline stage to further exploit LU-SGS's parallelism. We implement and evaluate it in our in-house high-order CFD software HOSTA on Xeon and Xeon Phi. For a 256x256x256 workload, the improved method achieves over 20% performance gains on Xeon Phi compared with the traditional pipeline approach, and a further 38% boost is observed on Xeon Phi when the dimension sizes are varied. Related problems in realistic CFD simulations, such as domain decomposition and algorithmic parameter tuning, are also discussed. In general, our work is applicable to all Gauss-Seidel-like methods with intrinsically strong data dependencies. (A wavefront-sweep sketch follows this record.)
- Published
- 2016
- Full Text
- View/download PDF
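The data dependency in a lower-triangular LU-SGS sweep is the classic Gauss-Seidel pattern: cell (i,j,k) needs its (i-1,j,k), (i,j-1,k), and (i,j,k-1) neighbors first. One standard way to expose the same parallelism, shown below as a hedged alternative to the paper's pipeline scheme, is a hyperplane (wavefront) sweep: all cells on a plane i+j+k = s depend only on planes with smaller s, so each plane can be processed in parallel. The update formula is a toy relaxation, not the CFD operator.

    // Hyperplane (wavefront) sweep for a Gauss-Seidel-like lower solve on a
    // 3D grid. A hedged alternative formulation to the pipeline scheme in
    // the record above; the update itself is a toy relaxation.
    #include <vector>

    inline int idx(int i, int j, int k, int ny, int nz) {
        return (i * ny + j) * nz + k;
    }

    void wavefront_sweep(std::vector<double>& u, int nx, int ny, int nz) {
        for (int s = 0; s <= nx + ny + nz - 3; ++s) {   // planes in order
            #pragma omp parallel for collapse(2) schedule(static)
            for (int i = 0; i < nx; ++i) {
                for (int j = 0; j < ny; ++j) {
                    int k = s - i - j;
                    if (k < 0 || k >= nz) continue;     // (i,j) not on plane s
                    double nb = 0.0;                    // lower-neighbor terms
                    if (i > 0) nb += u[idx(i - 1, j, k, ny, nz)];
                    if (j > 0) nb += u[idx(i, j - 1, k, ny, nz)];
                    if (k > 0) nb += u[idx(i, j, k - 1, ny, nz)];
                    u[idx(i, j, k, ny, nz)] = (u[idx(i, j, k, ny, nz)] + nb) / 4.0;
                }
            }
        }
    }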
45. A Hybrid MapReduce Implementation of PCA on Tianhe-2
- Author
-
Yutong Lu, Yili Qu, and Wei Yu
- Subjects
History ,Computer science ,Tianhe-2 ,Parallel computing ,Computer Science Applications ,Education - Published
- 2019
- Full Text
- View/download PDF
46. A Benchmark Test of Boson Sampling on Tianhe-2 Supercomputer
- Author
-
Yong Liu, Xuejun Yang, Yang Wang, Baida Zhang, Xian-Min Jin, Junjie Wu, and Huiquan Wang
- Subjects
FOS: Computer and information sciences ,Quantum Physics ,Multi-core processor ,Multidisciplinary ,Photon ,Computer science ,Quantum machine ,FOS: Physical sciences ,02 engineering and technology ,021001 nanoscience & nanotechnology ,Supercomputer ,01 natural sciences ,Upper and lower bounds ,Computational science ,Computer Science - Distributed, Parallel, and Cluster Computing ,0103 physical sciences ,Tianhe-2 ,Distributed, Parallel, and Cluster Computing (cs.DC) ,010306 general physics ,0210 nano-technology ,Quantum Physics (quant-ph) ,Quantum computer ,Boson - Abstract
Boson sampling, thought to be classically intractable, can be solved by a quantum machine composed merely of the generation, linear evolution, and detection of single photons. Such an analog quantum computer for this specific problem provides a shortcut to boosting the absolute computing power of quantum computers beyond that of classical ones. However, the capacity bound of classical computers for simulating boson sampling has not yet been identified. Here we simulate boson sampling on the Tianhe-2 supercomputer, which occupied first place in the world ranking six times from 2013 to 2016. We computed the permanent of the largest matrix using up to 312,000 CPU cores of Tianhe-2, and inferred from the current most efficient permanent-computing algorithms that an upper bound on the performance of Tianhe-2 is one 50-photon sample per ~100 min. In addition, we found a precision issue with one of the two permanent-computing algorithms. (A permanent-computing sketch follows this record.)
- Published
- 2016
- Full Text
- View/download PDF
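The classical-simulation cost above comes from computing matrix permanents, which is #P-hard; the fastest general algorithms (Ryser's and Balasubramanian-Bax-Franklin-Glynn) run in O(2^n * n) time with Gray-code updates. Below is a plain O(2^n * n^2) implementation of Ryser's formula, shown for real matrices for brevity (boson sampling uses complex-valued matrices, and tuned codes add the Gray-code optimization).

    // Ryser's formula for the matrix permanent, the kernel that dominates
    // classical boson-sampling simulation:
    //   perm(A) = sum over nonempty column subsets S of
    //             (-1)^(n-|S|) * prod_i (sum_{j in S} a_ij).
    // Plain O(2^n * n^2) version; uses a GCC/Clang popcount builtin.
    #include <cstdint>
    #include <vector>

    double permanent_ryser(const std::vector<std::vector<double>>& a) {
        const int n = static_cast<int>(a.size());
        double total = 0.0;
        for (uint64_t s = 1; s < (1ULL << n); ++s) {  // nonempty column subsets
            double prod = 1.0;
            for (int i = 0; i < n; ++i) {             // row sums over subset S
                double row = 0.0;
                for (int j = 0; j < n; ++j)
                    if (s & (1ULL << j)) row += a[i][j];
                prod *= row;
            }
            int bits = __builtin_popcountll(s);
            total += ((n - bits) % 2 == 0 ? 1.0 : -1.0) * prod;  // (-1)^(n-|S|)
        }
        return total;
    }

As a quick check, for a 2x2 matrix [[a,b],[c,d]] the subsets {1}, {2}, and {1,2} contribute -ac, -bd, and (a+b)(c+d), summing to ad+bc, the expected permanent.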
47. On Robust and Efficient Parallel Reservoir Simulation on Tianhe-2
- Author
-
Weicai Ye, Xiaozhe Hu, Changhe Qiao, Zheng Li, Zhifan Zhu, Zhenying Zheng, Hongxuan Zhang, Meipeng Zhi, Yuesheng Xu, Chunsheng Feng, Wenchao Guan, Jinchao Xu, Chen-Song Zhang, and Yongdong Zhang
- Subjects
Reservoir simulation ,Computer science ,Tianhe-2 ,Parallel computing - Abstract
Parallel reservoir simulators are now widely used given the availability of supercomputers, and modern massively parallel machines demonstrate great power for simulating large-scale reservoir models. However, improving the scalability and efficiency of fully implicit methods on emerging parallel architectures remains challenging. In this paper, we present a robust discretization together with a parallel linear solver algorithm, and we explore the parallel implementation on the world's fastest supercomputer, Tianhe-2. Starting with a general compositional model, we focus on the black-oil model and develop a Parallel eXtension Framework for parallelizing the serial simulator. A parallel preconditioner based on fast auxiliary space preconditioning (FASP) is applied to solve the Jacobian system arising from the fully implicit discretization. The parallel simulator was validated using large-scale black-oil benchmark problems, for which parallel scalability was tested. Giant reservoir models with over 100 million grid blocks were simulated within a few minutes, and the strong scalability of the AMG solver was tested with 1 billion unknowns. We also demonstrate parallelization and acceleration using Intel Xeon Phi coprocessors. Finally, the efficiency of the parallel simulator is illustrated on a giant reservoir using up to 10,000 cores, for which the CPU and communication times are summarized for the linear and nonlinear algorithms. (A Newton-loop skeleton follows this record.)
- Published
- 2015
- Full Text
- View/download PDF
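A fully implicit time step has a characteristic shape: a Newton loop that assembles the Jacobian system and hands it to the (here, FASP-preconditioned) linear solver. The skeleton below captures only that control flow, with the problem-specific pieces abstracted behind hypothetical callbacks; it is a structural sketch, not the paper's simulator.

    // Skeleton of one fully implicit time step: Newton's method around an
    // abstract residual and linear solve. Callback names are hypothetical;
    // the paper plugs a FASP-preconditioned parallel solver into solve_linear.
    #include <cmath>
    #include <functional>
    #include <vector>

    using Vec = std::vector<double>;

    bool implicit_time_step(
        Vec& state,
        const std::function<Vec(const Vec&)>& residual,                 // F(x)
        const std::function<Vec(const Vec&, const Vec&)>& solve_linear, // J dx = -F
        double tol, int max_newton) {
        for (int it = 0; it < max_newton; ++it) {
            Vec f = residual(state);
            double norm = 0.0;
            for (double v : f) norm += v * v;
            if (std::sqrt(norm) < tol) return true;   // converged
            Vec dx = solve_linear(state, f);          // inner Krylov+FASP solve
            for (size_t i = 0; i < state.size(); ++i) state[i] += dx[i];
        }
        return false;  // Newton failed: the caller would cut the time step
    }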
48. Time-Dimension Communication Characterization of Representative Scientific Applications on Tianhe-2
- Author
-
Juan Chen, Zhiyuan Wang, Liyang Xu, Wenhao Zhou, Yuhua Tang, and Xinhai Xu
- Subjects
Interconnection ,Computer science ,Multiple time dimensions ,Distributed computing ,Tianhe-2 ,Statistical dispersion ,Interval (mathematics) ,Supercomputer ,Exascale computing - Abstract
Exascale computing is one of the major challenges of this decade, and several studies have shown that communication is becoming one of the bottlenecks for scaling parallel applications. Characteristic analysis of communication is an important means of improving the performance of scientific applications. In this paper, we focus on statistical regularities in the time-dimension communication characteristics of representative scientific applications and find that the distribution of intervals between communication events has a power-law decay, a pattern widely observed in phenomena of scientific interest and in human activities. For a quantitative study of this power-law distribution, we compute two groups of typical measures: burstiness vs. memory, and periodicity vs. dispersion. Our analysis shows that the communication events exhibit a "strong-bursty and weak-memory" characteristic, and we also capture the periodicity and dispersion in the interval distribution. All of the quantitative results are verified with eight representative scientific applications on the Tianhe-2 supercomputer, which has a fat-tree-like interconnection network. Finally, our study provides insight into the relationship between communication optimization and time-dimension communication characteristics. (A sketch of the burstiness and memory measures follows this record.)
- Published
- 2015
- Full Text
- View/download PDF
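The burstiness/memory pair referenced above is conventionally quantified, following Goh and Barabási, as B = (sigma - mu)/(sigma + mu) over the inter-event intervals and M as the lag-one correlation between consecutive intervals; that the paper uses exactly these definitions is an assumption. A direct implementation (assuming at least two intervals):

    // Burstiness B = (sigma - mu)/(sigma + mu) and memory M = lag-1
    // correlation of inter-event intervals (Goh & Barabasi's measures,
    // assumed here to match the paper's "bursty vs. memory" pair).
    #include <cmath>
    #include <vector>

    static double mean(const std::vector<double>& v, size_t lo, size_t hi) {
        double s = 0.0;
        for (size_t i = lo; i < hi; ++i) s += v[i];
        return s / (hi - lo);
    }

    double burstiness(const std::vector<double>& tau) {
        double mu = mean(tau, 0, tau.size()), var = 0.0;
        for (double t : tau) var += (t - mu) * (t - mu);
        double sigma = std::sqrt(var / tau.size());
        return (sigma - mu) / (sigma + mu);  // -1 periodic, 0 Poisson, ->1 bursty
    }

    double memory_coeff(const std::vector<double>& tau) {
        size_t n = tau.size() - 1;           // pairs (tau_i, tau_{i+1})
        double m1 = mean(tau, 0, n), m2 = mean(tau, 1, n + 1);
        double num = 0.0, v1 = 0.0, v2 = 0.0;
        for (size_t i = 0; i < n; ++i) {
            num += (tau[i] - m1) * (tau[i + 1] - m2);
            v1 += (tau[i] - m1) * (tau[i] - m1);
            v2 += (tau[i + 1] - m2) * (tau[i + 1] - m2);
        }
        return num / std::sqrt(v1 * v2);     // >0: long follows long (memory)
    }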
49. A Method to Accelerate GROMACS in Offload Mode on Tianhe-2 Supercomputer
- Author
-
Chengkun Wu, Qian Chen, Weiliang Zhu, Shaoliang Peng, Haiqiang Wang, Xiaoqian Zhu, Huaiyu Yang, Jinan Wang, and Xin Liu
- Subjects
Software ,Coprocessor ,Xeon ,Computer science ,business.industry ,Synchronization (computer science) ,Tianhe-2 ,Context (language use) ,Parallel computing ,Reuse ,Supercomputer ,business - Abstract
Molecular dynamics (MD) is a computer simulation of the physical movements of atoms and molecules in the context of N-body simulation, and it is an important tool in the pharmaceutical industry. GROMACS, among the most popular software packages for MD, cannot perform satisfactorily at large scale because of the limits of computing resources. In this paper, we propose a method to accelerate GROMACS in offload mode. In this mode, GROMACS can be arranged efficiently across CPUs and Intel Xeon Phi Many Integrated Core (MIC) coprocessors at the same time, making full use of Tianhe-2 supercomputer resources. To improve the efficiency of GROMACS, we propose a series of techniques, including synchronization control, data reassembly, and array reuse. To the best of our knowledge, we are the first to accelerate GROMACS in offload mode on the MIC. (A schematic offload example follows this record.)
- Published
- 2015
- Full Text
- View/download PDF
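Offload mode on Tianhe-2's Xeon Phi coprocessors was typically expressed with the Intel compiler's offload pragmas (LEO). The fragment below shows the general shape of offloading a hot loop; it is a schematic example that requires the Intel C++ compiler's MIC offload support, and it is not GROMACS code.

    // Schematic Intel LEO offload of a hot loop to a Xeon Phi coprocessor.
    // Requires the Intel C++ compiler with MIC offload support; illustrative
    // only, not actual GROMACS code.
    #include <cstdio>

    __attribute__((target(mic)))
    void scale(double* y, const double* x, double a, int n) {
        #pragma omp parallel for         // runs on the coprocessor's many cores
        for (int i = 0; i < n; ++i) y[i] = a * x[i];
    }

    int main() {
        const int n = 1 << 20;
        double* x = new double[n];
        double* y = new double[n];
        for (int i = 0; i < n; ++i) x[i] = 1.0;

        // Copy x in, run on MIC card 0, copy y back; with asynchronous
        // variants, CPU work can overlap this region.
        #pragma offload target(mic:0) in(x : length(n)) out(y : length(n))
        scale(y, x, 2.0, n);

        std::printf("y[0] = %f\n", y[0]); // 2.0
        delete[] x;
        delete[] y;
        return 0;
    }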
50. Accelerating HPCG on Tianhe-2: A hybrid CPU-MIC algorithm
- Author
-
Yutong Lu, Xianyi Zhang, Fangfang Liu, Yiqun Liu, and Chao Yang
- Subjects
Coprocessor ,Xeon ,Computer science ,Conjugate gradient method ,Benchmark (computing) ,Tianhe-2 ,Symmetric multiprocessor system ,Multiplication ,Node (circuits) ,Parallel computing ,FLOPS ,Hybrid algorithm ,Xeon Phi - Abstract
In this paper, we propose a hybrid algorithm to enable and accelerate the High Performance Conjugate Gradient (HPCG) benchmark on a heterogeneous node with an arbitrary number of accelerators. In the hybrid algorithm, each subdomain is assigned to a node after a three-dimensional domain decomposition. The subdomain is further divided into several regular inner blocks and an outer part using a flexible inner-outer partitioning strategy. Each inner task is assigned to a MIC device, with its size adjustable to match the accelerator's computational power. The outer part is assigned to the CPU, and its boundary thickness is also adjustable to maintain load balance between the CPU and the MICs. By properly fusing computational kernels with the ones preceding them, we present an asynchronous data transfer scheme that better overlaps local computation with PCI-Express data transfer. All basic HPCG kernels, especially the time-consuming sparse matrix-vector multiplication (SpMV) and the symmetric Gauss-Seidel relaxation (SymGS), are extensively optimized for both CPU and MIC, at both the algorithmic and architectural levels. On a single node of Tianhe-2, which is composed of an Intel Xeon processor and three Intel Xeon Phi coprocessors, we successfully obtain an aggregated performance of 50.2 Gflops, around 1.5% of the peak performance. (A minimal SymGS sketch follows this record.)
- Published
- 2014
- Full Text
- View/download PDF
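SymGS, named above as one of the two dominant kernels, is a forward and then backward Gauss-Seidel sweep. A minimal sequential CSR version follows; it is kept sequential for clarity, since the row-order dependence it exhibits is exactly the obstacle the paper's optimizations work around.

    // Minimal symmetric Gauss-Seidel sweep (forward + backward) on a CSR
    // matrix, as used for smoothing in HPCG. Sequential for clarity: row i
    // must see the already-updated values of earlier rows.
    #include <vector>

    struct Csr {
        int n;
        std::vector<int> ptr, col;     // CSR structure
        std::vector<double> val, diag; // nonzeros and cached diagonal entries
    };

    void symgs(const Csr& A, const std::vector<double>& b, std::vector<double>& x) {
        for (int i = 0; i < A.n; ++i) {          // forward sweep
            double sum = b[i];
            for (int k = A.ptr[i]; k < A.ptr[i + 1]; ++k)
                sum -= A.val[k] * x[A.col[k]];   // includes the diagonal term
            x[i] += sum / A.diag[i];             // equivalent to the GS update
        }
        for (int i = A.n - 1; i >= 0; --i) {     // backward sweep
            double sum = b[i];
            for (int k = A.ptr[i]; k < A.ptr[i + 1]; ++k)
                sum -= A.val[k] * x[A.col[k]];
            x[i] += sum / A.diag[i];
        }
    }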