61 results on "Thomas Naughton"
Search Results
2. RADICAL-Pilot and PMIx/PRRTE: Executing Heterogeneous Workloads at Large Scale on Partitioned HPC Resources
- Author
-
Mikhail Titov, Matteo Turilli, Andre Merzky, Thomas Naughton, Wael Elwasif, and Shantenu Jha
- Published
- 2023
3. Towards a Standard Process Management Infrastructure for Workflows Using Python
- Author
-
Wael Elwasif, Thomas Naughton, and Matthew Baker
- Published
- 2023
4. Improved Security of Protected Health Information (PHI) through Software Development Life Cycle (SDLC) Processes.
- Author
-
Carolyn Fu, Thomas Naughton, William Simons, and Joanna Brownstein
- Published
- 2016
5. Scheduler: An Application for Supporting Streamlined Clinical Research Center Operations.
- Author
-
Joanna Brownstein, Carolyn Fu, Richie Siburian, Thomas Naughton, William Simons, Ankit Panchamia, Carl Woolf, Colette Hendricks, Seanne Falconer, and Douglas MacFadden
- Published
- 2016
6. Second Target Station Computer Science and Math Workshop Report
- Author
-
Mathieu Doucet, Steven Hartman, John Hetrick, Yaohua Liu, Thomas Naughton III, Thomas Proffen, Shuo Qian, and Jon Taylor
- Published
- 2022
7. INTERSECT Architecture Specification: System-of-systems Architecture (Version 0.5)
- Author
-
Swen Boehm, Thomas Naughton III, Suhas Somnath, Ben Mintz, Jack Lange, Scott Atchley, Rohit Srivastava, and Patrick Widener
- Published
- 2022
8. Emulation Framework for Distributed Large-Scale Systems Integration
- Author
-
Neena Imam, Nageswara S. V. Rao, Anees Al-Najjar, Thomas Naughton, and Seth Hitefield
- Published
- 2022
9. The INTERSECT Open Federated Architecture for the Laboratory of the Future
- Author
-
Christian Engelmann, Olga Kuchar, Swen Boehm, Michael J. Brim, Thomas Naughton, Suhas Somnath, Scott Atchley, Jack Lange, Ben Mintz, and Elke Arenholz
- Published
- 2022
10. Software Framework for Federated Science Instruments
- Author
-
Thomas Naughton, James Arthur Kohl, Jean-Christophe Bilheux, Wael R. Elwasif, Neena Imam, Swen Boehm, Lawrence Sorrillo, Jason Kincl, Satyabrata Sen, Hassina Z. Bilheux, Seth Hitefield, and Nageswara S. V. Rao
- Subjects
Scientific instrument, Guiding Principles, Computer science, Provisioning, Oak Ridge National Laboratory, Software framework, Software, Workflow, Systems engineering, Spallation Neutron Source
- Abstract
There is an unprecedented promise of enhanced capabilities for federations of leadership computing systems and experimental science facilities by leveraging software technologies for fast and efficient operations. These federations seek to unify different science instruments, both computing and experimental, to effectively support science users and operators in executing complex workflows. The FedScI project addresses the software challenges associated with the formation and operation of federated environments by leveraging recent advances in containerization of software and softwarization of hardware. We propose a software framework to streamline the federation's usage by science users and its provisioning and operations by facility providers. A distinguishing element of our work is the support for improved interaction between experimental devices, such as beam-line instruments, and more traditional high-performance computing resources, including compute, network, and storage systems. We present guiding principles for the software framework and highlight portions of a current prototype implementation. We describe our science use case involving neutron imaging beam-lines (SNAP/BL-3, Imaging/CG-1D) at the Spallation Neutron Source and High Flux Isotope Reactor facilities at Oak Ridge National Laboratory. Additionally, we detail plans for a more direct instrument interaction within a federated environment, which could enable more advanced workflows with feedback loops to shorten the time to science.
- Published
- 2020
11. Characterizing the Performance of Executing Many-tasks on Summit
- Author
-
Andre Merzky, Matteo Turilli, Thomas Naughton, Shantenu Jha, and Wael R. Elwasif
- Subjects
Summit, Computer science, Distributed computing, Supercomputer, Scheduling (computing), Runtime system, Scalability, Data analysis, Distributed, Parallel, and Cluster Computing (cs.DC), Resource utilization
- Abstract
Many scientific workloads are comprised of many tasks, where each task is an independent simulation or analysis of data. The execution of millions of tasks on heterogeneous HPC platforms requires scalable dynamic resource management and multi-level scheduling. RADICAL-Pilot (RP), an implementation of the Pilot abstraction, addresses these challenges and serves as an effective runtime system to execute workloads comprised of many tasks. In this paper, we characterize the performance of executing many tasks using RP when interfaced with JSM and PRRTE on Summit: RP is responsible for resource management and task scheduling on the acquired resources; JSM or PRRTE enact the placement and launching of scheduled tasks. Our experiments provide lower bounds on the performance of RP when integrated with JSM and PRRTE. Specifically, for workloads comprised of homogeneous single-core, 15-minute tasks we find that: PRRTE scales better than JSM for > O(1000) tasks; PRRTE overheads are negligible; and PRRTE supports optimizations that lower the impact of overheads and enable resource utilization of 63% when executing O(16K) one-core tasks over 404 compute nodes.
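As a rough sanity check on the utilization figure above, the arithmetic below assumes that O(16K) means 16,384 single-core tasks and that each Summit node exposes roughly 42 cores to applications; both are illustrative assumptions, not numbers taken from the paper.

```latex
\begin{align*}
\text{useful core-hours} &= 16{,}384 \times 0.25\,\mathrm{h} = 4{,}096 \\
\text{available cores}   &\approx 404 \times 42 = 16{,}968 \\
\text{implied makespan}  &\approx \frac{4{,}096}{0.63 \times 16{,}968}\,\mathrm{h} \approx 0.38\,\mathrm{h} \approx 23\,\mathrm{min}
\end{align*}
```

Under these assumptions, 63% utilization corresponds to the 15-minute tasks draining in roughly 23 minutes of wall time, with the remainder spent in launch and scheduling overheads.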
- Published
- 2019
12. Application health monitoring for extreme‐scale resiliency using cooperative fault management
- Author
-
Joshua Hursey, Thomas Naughton, Pratul K. Agarwal, Al Geist, Byung H. Park, and David E. Bernholdt
- Subjects
Computer Networks and Communications, Computer science, Distributed computing, Fault tolerance, Fault management, Software, Scalability, Error detection and correction, Computational steering
- Abstract
Resiliency is and will be a critical factor in determining scientific productivity on current and exascale supercomputers, and beyond. Applications oblivious to and incapable of handling transient soft and hard errors could waste supercomputing resources or, worse, yield misleading scientific insights. We introduce a novel application-driven silent error detection and recovery strategy based on application health monitoring. Our methodology uses application output that follows known patterns as indicators of an application's health, and knowledge that violation of these patterns could be an indication of faults. Information from system monitors that report hardware and software health status is used to corroborate faults. Collectively, this information is used by a fault coordinator agent to take preventive and corrective measures by applying computational steering to an application between checkpoints. This cooperative fault management system uses the Fault Tolerance Backplane as a communication channel. The benefits of this framework are demonstrated with two real application case studies, molecular dynamics and quantum chemistry simulations, on scalable clusters with simulated memory and I/O corruptions. The developed approach is general and can be easily applied to other applications.
- Published
- 2019
13. Reinforcement Learning-based Traffic Control to Optimize Energy Usage and Throughput (CRADA report)
- Author
-
Robert M. Patton, T Sean Oesch, Derek C. Rose, Steven R. Young, Ryan Tokola, Thomas Naughton, Matthew Eicholtz, Regina K. Ferrell, Wael R. Elwasif, and Thomas P. Karnowski
- Subjects
Computer science, Control, Reinforcement learning, Throughput, Energy, Reliability engineering
- Published
- 2019
14. Oak Ridge OpenSHMEM Benchmark Suite
- Author
-
Manjunath Gorentla Venkata, Matthew B. Baker, Thomas Naughton, Swaroop Pophale, Neena Imam, and Ferrol Aderholdt
- Subjects
Java, Computer science, Kernel, Suite, Snapshot (computer storage), Use case, Software engineering, Implementation, Porting
- Abstract
The assessment of application performance is a fundamental task in high-performance computing (HPC). The OpenSHMEM Benchmark (OSB) suite is a collection of micro-benchmarks and mini-applications/compute kernels that have been ported to use OpenSHMEM. Some, like the NPB OpenSHMEM benchmarks, have been published before while most others have been used for evaluations but never formally introduced or discussed. This suite puts them together and is useful for assessing the performance of different use cases of OpenSHMEM. This offers system implementers a useful means of measuring performance and assessing the effects of new features as well as implementation strategies. The suite is also useful for application developers to assess the performance of the growing number of OpenSHMEM implementations that are emerging. In this paper, we describe the current set of codes available within the OSB suite, how they are intended to be used, and, where possible, a snapshot of their behavior on one of the OpenSHMEM implementations available to us. We also include detailed descriptions of every benchmark and kernel, focusing on how OpenSHMEM was used. This includes details on the enhancements we made to the benchmarks to support multithreaded variants. We encourage the OpenSHMEM community to use, review, and provide feedback on the benchmarks.
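For readers unfamiliar with the style of codes the suite collects, the following is a minimal put-latency micro-benchmark in the spirit of the OSB codes described above; it is an illustrative sketch written against the standard OpenSHMEM API, not code taken from the suite itself.

```c
/* Minimal OpenSHMEM put-latency micro-benchmark sketch (illustrative only). */
#include <shmem.h>
#include <stdio.h>
#include <sys/time.h>

int main(void)
{
    shmem_init();
    int me = shmem_my_pe();
    int npes = shmem_n_pes();
    long *target = shmem_malloc(sizeof(long));  /* symmetric heap buffer */
    long value = 42;
    const int iters = 10000;

    shmem_barrier_all();
    if (me == 0 && npes > 1) {
        struct timeval t0, t1;
        gettimeofday(&t0, NULL);
        for (int i = 0; i < iters; i++) {
            shmem_long_put(target, &value, 1, 1);  /* put one long to PE 1 */
            shmem_quiet();                         /* wait for remote completion */
        }
        gettimeofday(&t1, NULL);
        double usec = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
        printf("average blocking put latency: %.2f us\n", usec / iters);
    }
    shmem_barrier_all();
    shmem_free(target);
    shmem_finalize();
    return 0;
}
```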
- Published
- 2019
15. A new deadlock resolution protocol and message matching algorithm for the extreme‐scale simulator
- Author
-
Thomas Naughton and Christian Engelmann
- Subjects
Computer Networks and Communications, Computer science, Embarrassingly parallel, Message Passing Interface, Parallel computing, Supercomputer, Software, Benchmark (computing), Performance prediction, Discrete event simulation, Simulation
- Abstract
Investigating the performance of parallel applications at scale on future high-performance computing (HPC) architectures and the performance impact of different HPC architecture choices is an important component of HPC hardware/software co-design. The Extreme-scale Simulator (xSim) is a simulation toolkit for investigating the performance of parallel applications at scale. xSim scales to millions of simulated Message Passing Interface (MPI) processes. The xSim toolkit strives to limit simulation overheads in order to maintain performance and productivity criteria. This paper documents two improvements to xSim: (1) a new deadlock resolution protocol to reduce the parallel discrete event simulation overhead and (2) a new simulated MPI message matching algorithm to reduce the oversubscription management cost. These enhancements resulted in significant performance improvements. The simulation overhead for running the NASA Advanced Supercomputing Parallel Benchmark suite dropped from 1,020% to 238% for the conjugate gradient benchmark and from 102% to 0% for the embarrassingly parallel benchmark. Additionally, the improvements were beneficial for reducing overheads in the highly accurate simulation mode of xSim, which is useful for resilience investigation studies for tracking intentional MPI process failures. In the highly accurate mode, the simulation overhead was reduced from 37,511% to 13,808% for conjugate gradient and from 3,332% to 204% for embarrassingly parallel. Copyright © 2016 John Wiley & Sons, Ltd.
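Assuming the overheads above are reported relative to native execution time (a conventional definition, not stated explicitly in the abstract), they translate into wall-time multipliers as follows:

```latex
\text{overhead} = \frac{T_{\mathrm{sim}} - T_{\mathrm{native}}}{T_{\mathrm{native}}} \times 100\%
\qquad\Longrightarrow\qquad
T_{\mathrm{sim}} = \left(1 + \frac{\text{overhead}}{100}\right) T_{\mathrm{native}}
```

so the drop from 1,020% to 238% for the conjugate gradient benchmark corresponds to the simulated run going from about 11.2 times to about 3.4 times the native wall time.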
- Published
- 2016
16. A survey of MPI usage in the US exascale computing project
- Author
-
Geoffroy Vallée, Swen Boehm, Howard Pritchard, George Bosilca, Thomas Naughton, Manjunath Gorentla Venkata, Ryan E. Grant, Martin Schulz, and David E. Bernholdt
- Subjects
Computer Networks and Communications, Computer science, Exascale computing, Computer Science Applications, Theoretical Computer Science, Computational Theory and Mathematics, Operating system, Software
- Published
- 2018
17. Epidemic failure detection and consensus for extreme parallelism
- Author
-
Giuseppe Di Fatta, Thomas Naughton, Amogh Katti, and Christian Engelmann
- Subjects
Consensus algorithm, Computer science, Distributed computing, Fault tolerance, Computing systems, Theoretical Computer Science, Hardware and Architecture, Gossip protocol, Software
- Abstract
Future extreme-scale high-performance computing systems will be required to work under frequent component failures. The MPI Forum's User Level Failure Mitigation proposal has introduced an operation, MPI_Comm_shrink, to synchronize the alive processes on the list of failed processes, so that applications can continue to execute even in the presence of failures by adopting algorithm-based fault tolerance techniques. This MPI_Comm_shrink operation requires a failure detection and consensus algorithm. This paper presents three novel failure detection and consensus algorithms using Gossiping. Stochastic pinging is used to quickly detect failures during the execution of the algorithm; failures are then disseminated to all the fault-free processes in the system, and consensus on the failures is detected using the three consensus techniques. The proposed algorithms were implemented and tested using the Extreme-scale Simulator. The results show that the stochastic pinging detects all the failures in the system. In all the algorithms, the number of Gossip cycles to achieve global consensus scales logarithmically with system size. The second algorithm also shows better scalability in terms of memory and network bandwidth usage and a perfect synchronization in achieving global consensus. The third approach is a three-phase distributed failure detection and consensus algorithm and provides consistency guarantees even in very large and extreme-scale systems while at the same time being memory and bandwidth efficient.
- Published
- 2018
18. A comparison of Amazon Web Services and Microsoft Azure cloud platforms for high performance computing
- Author
-
Neena Imam, Charlotte Kotas, and Thomas Naughton
- Subjects
Amazon Web Services, Random access memory, Computer science, Cloud computing, Supercomputer, Operating system
- Abstract
Advances in commercial cloud computing necessitate continual evaluation of the cloud's performance on a variety of applications. This work looks at compute-oriented instances from the Amazon Web Services and Microsoft Azure cloud platforms and evaluates them with several high-performance computing benchmarks, including HPCC and HPCG. These benchmarks illustrate that the most cost-competitive solution depends on the application to be run.
- Published
- 2018
19. Balancing Performance and Portability with Containers in HPC: An OpenSHMEM Example
- Author
-
Adam B. Simpson, Lawrence Sorrillo, Thomas Naughton, and Neena Imam
- Subjects
Computer science, Testbed, Software development, Supercomputer, Software portability, Software, Titan (supercomputer), Software deployment, Operating system, Graph500
- Abstract
There is a growing interest in using Linux containers to streamline software development and application deployment. A container enables the user to bundle the salient elements of the software stack from an application’s perspective. In this paper, we discuss initial experiences in using the Open MPI implementation of OpenSHMEM with containers on HPC resources. We provide a brief overview of two container runtimes, Docker & Singularity, highlighting elements that are of interest for HPC users. The Docker platform offers a rich set of services that are widely used in enterprise environments, whereas Singularity is an emerging container runtime that is specifically written for use on HPC systems. We describe our procedure for container assembly and deployment that strives to maintain the portability of the container-based application. We show performance results for the Graph500 benchmark running along the typical continuum of development testbed up to full production supercomputer (ORNL’s Titan). The results show consistent performance between the native and Singularity (container) tests. The results also showed an unexplained drop in performance when using the Cray Gemini network with Open MPI’s OpenSHMEM, which was unrelated to the container usage.
- Published
- 2018
20. A Cooperative Approach to Virtual Machine Based Fault Injection
- Author
-
Stephen L. Scott, Thomas Naughton, Geoffroy Vallée, Ferrol Aderholdt, and Christian Engelmann
- Subjects
Computer science, Distributed computing, Fault injection, Virtualization, Virtual machine introspection, System under test, Virtual machine, Benchmark (computing), Host (network)
- Abstract
Resilience investigations often employ fault injection (FI) tools to study the effects of simulated errors on a target system. It is important to keep the target system under test (SUT) isolated from the controlling environment in order to maintain control of the experiment. Virtual machines (VMs) have been used to aid these investigations due to the strong isolation properties of system-level virtualization. A key challenge in fault injection tools is to gain proper insight and context about the SUT. In VM-based FI tools, this challenge of target context is increased due to the separation between host and guest (VM). We discuss an approach to VM-based FI that leverages virtual machine introspection (VMI) methods to gain insight into the target's context running within the VM. The key to this environment is the ability to provide basic information to the FI system that can be used to create a map of the target environment. We describe a proof-of-concept implementation and a demonstration of its use to introduce simulated soft errors into an iterative solver benchmark running in user-space of a guest VM.
- Published
- 2017
21. Adding Fault Tolerance to NPB Benchmarks Using ULFM
- Author
-
Zachary W. Parchman, Stephen L. Scott, Christian Engelmann, David E. Bernholdt, Geoffroy Vallée, and Thomas Naughton
- Subjects
Distributed computing, Software fault tolerance, Benchmark (computing), Message Passing Interface, Fault tolerance, Fault injection, Resilience, Fault detection and isolation
- Abstract
In the world of high-performance computing, fault tolerance and application resilience are becoming some of the primary concerns because of increasing hardware failures and memory corruptions. While the research community has been investigating various options, from system-level solutions to application-level solutions, standards such as the Message Passing Interface (MPI) are also starting to include such capabilities. The current proposal for MPI fault tolerance is centered on the User-Level Failure Mitigation (ULFM) concept, which provides means for fault detection and recovery of the MPI layer. This approach does not address application-level recovery, which is currently left to application developers. In this work, we present a modification of some of the benchmarks of the NAS parallel benchmark (NPB) suite to include support for the ULFM capabilities as well as application-level strategies and mechanisms for application-level failure recovery. As such, we present: (i) an application-level library to "checkpoint" and restore data, (ii) extensions of NPB benchmarks for fault tolerance based on different strategies, (iii) a fault injection tool, and (iv) some preliminary results that show the impact of such fault tolerant strategies on the application execution.
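To make the recovery pattern concrete, the sketch below shows the general shape of a ULFM-style recovery loop using the MPIX_ extensions provided by the Open MPI ULFM prototype; it is a generic illustration under that assumption, not the NPB modifications or the checkpoint library described in the paper, and the checkpoint/restore steps are left as placeholder comments.

```c
/* Sketch of a ULFM-style recovery loop (illustrative; assumes the Open MPI
 * ULFM prototype, which provides <mpi-ext.h> and the MPIX_Comm_* calls). */
#include <mpi.h>
#include <mpi-ext.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Comm work;
    MPI_Comm_dup(MPI_COMM_WORLD, &work);
    /* Return errors to the caller instead of aborting on process failure. */
    MPI_Comm_set_errhandler(work, MPI_ERRORS_RETURN);

    for (int iter = 0; iter < 100; iter++) {
        /* ... checkpoint application data and the iteration count here
         *     (placeholder for an application-level checkpoint library) ... */

        int rc = MPI_Barrier(work);      /* stands in for one solver step */
        if (rc != MPI_SUCCESS) {
            /* A peer failed: make the failure visible to all survivors,
             * then shrink the communicator to the surviving processes. */
            MPIX_Comm_revoke(work);
            MPI_Comm shrunk;
            MPIX_Comm_shrink(work, &shrunk);
            MPI_Comm_free(&work);
            work = shrunk;
            MPI_Comm_set_errhandler(work, MPI_ERRORS_RETURN);

            /* ... restore data and iteration count from the last checkpoint
             *     here, which also resynchronizes the survivors ... */
        }
    }

    MPI_Comm_free(&work);
    MPI_Finalize();
    return 0;
}
```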
- Published
- 2016
22. Supporting the Development of Soft-Error Resilient Message Passing Applications Using Simulation
- Author
-
Christian Engelmann and Thomas Naughton
- Subjects
Correctness, Computer science, Message passing, Parallel computing, Fault injection, Data structure, Soft error, Overhead (computing), Computer hardware
- Abstract
Radiation-induced bit flip faults are of particular concern in extreme-scale high-performance computing systems. This paper presents a simulation-based tool that enables the development of soft-error resilient message passing applications by permitting the investigation of their correctness and performance under various fault conditions. The documented extensions to the Extreme-scale Simulator (xSim) enable the injection of bit flip faults at specific injection location(s) and fault activation time(s), while supporting a significant degree of configurability of the fault type. Experiments show that the simulation overhead with the new feature is ~2,325% for serial execution and ~1,730% at 128 MPI processes, both with very fine-grain fault injection. Fault injection experiments demonstrate the usefulness of the new feature by injecting bit flips in the input and output matrices of a matrix-matrix multiply application, revealing the vulnerability of data structures, masking, and error propagation. xSim is the very first simulation-based MPI performance tool that supports both the injection of process failures and bit flip faults.
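As a concrete illustration of the kind of fault being modeled, the sketch below flips a single bit in one matrix element; the injection site and bit position are arbitrary choices for the example, not the tool's actual configuration interface.

```c
/* Flip one bit of a double-precision matrix element (illustrative sketch). */
#include <stdint.h>
#include <string.h>
#include <stdio.h>

/* Flip bit 'bit' (0..63) of a double in place, type-punning via memcpy. */
static void flip_bit(double *x, int bit)
{
    uint64_t raw;
    memcpy(&raw, x, sizeof raw);
    raw ^= (uint64_t)1 << bit;
    memcpy(x, &raw, sizeof raw);
}

int main(void)
{
    double A[4][4] = {{0.0}};
    A[2][3] = 1.0;

    flip_bit(&A[2][3], 52);   /* flip the lowest exponent bit: 1.0 -> 0.5 */
    printf("corrupted element: %g\n", A[2][3]);

    /* A matrix-matrix multiply that reads A would now propagate this error,
     * which is the behavior the fault-injection experiments observe. */
    return 0;
}
```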
- Published
- 2016
23. Hyperspectral Aquatic Radiative Transfer Modeling Using a High-Performance Cluster Computing-Based Approach
- Author
-
Budhendra L. Bhaduri, Stephen L. Scott, Amy L King, Thomas Naughton, İnci Güneralp, and Anthony M. Filippi
- Subjects
Computer science, Computer cluster, Radiative transfer modeling, Remote sensing reflectance, Radiative transfer, Hyperspectral imaging, Bathymetry, Inversion, Computational science
- Abstract
For aquatic studies, radiative transfer (RT) modeling can be used to compute hyperspectral above-surface remote sensing reflectance that can be utilized for inverse model development. Inverse models can provide bathymetry and inherent- and bottom-optical property estimation. Because measured oceanic field/organic datasets are often spatio-temporally sparse, synthetic data generation is useful in yielding sufficiently large datasets for inversion model development; however, these forward-modeled data are computationally expensive and time-consuming to generate. This study establishes the magnitude of wall-clock-time savings achieved for performing large, aquatic RT batch-runs using parallel computing versus a sequential approach. Given 2,600 simulations and identical compute-node characteristics, the sequential architecture required ~100 hours until termination, whereas the parallel approach required only ~2.5 hours (42 compute nodes)—a 40x speed-up. Tools developed for this parallel execution are discussed.
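Using only the figures quoted in the abstract, and treating the batch of independent simulations as embarrassingly parallel across nodes, the reported savings correspond to:

```latex
S = \frac{T_{\mathrm{sequential}}}{T_{\mathrm{parallel}}} = \frac{100\ \mathrm{h}}{2.5\ \mathrm{h}} = 40,
\qquad
E = \frac{S}{42\ \text{nodes}} \approx 0.95
```

that is, roughly 95% parallel efficiency on the 42 compute nodes.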
- Published
- 2012
24. System-level virtualization research at Oak Ridge National Laboratory
- Author
-
Geoffroy Vallée, Christian Engelmann, Stephen L. Scott, Thomas Naughton, Anand Tikotekar, and Hong Ong
- Subjects
Application virtualization, Workstation, Computer Networks and Communications, Computer science, Full virtualization, Fault tolerance, Virtualization, Consolidation, Grid computing, Hardware and Architecture, Software
- Abstract
System-level virtualization is today enjoying a rebirth as a technique to effectively share what had been considered large computing resources, which subsequently faded from the spotlight as individual workstations gained in popularity with a "one machine-one user" approach. One reason for this resurgence is that the simple workstation has grown in capability to rival anything similar available in the past. Thus, computing centers are again looking at the price/performance benefit of sharing that single computing box via server consolidation. However, industry is only concentrating on the benefits of using virtualization for server consolidation (enterprise computing), whereas our interest is in leveraging virtualization to advance high-performance computing (HPC). While these two interests may appear to be orthogonal, one consolidating multiple applications and users on a single machine while the other requires all the power from many machines to be dedicated solely to its purpose, we propose that virtualization does provide attractive capabilities that may be exploited to the benefit of HPC interests. This raises two fundamental questions: is the concept of virtualization (a machine "sharing" technology) really suitable for HPC, and if so, how does one go about leveraging these virtualization capabilities for the benefit of HPC? To address these questions, this document presents ongoing studies on the usage of system-level virtualization in an HPC context. These studies include an analysis of the benefits of system-level virtualization for HPC, a presentation of research efforts based on virtualization for system availability, and a presentation of research efforts for the management of virtual systems. The basis for this document was the material presented by Stephen L. Scott at the Collaborative and Grid Computing Technologies meeting held in Cancun, Mexico on April 12-14, 2007.
- Published
- 2010
25. STCI
- Author
-
Geoffroy Vallée, Swen Bohm, Thomas Naughton, and David E. Bernholdt
- Subjects
Computer science, Distributed computing, Component, Scalability
- Published
- 2015
26. Scalable and Fault Tolerant Failure Detection and Consensus
- Author
-
Thomas Naughton, Giuseppe Di Fatta, Christian Engelmann, and Amogh Katti
- Subjects
Computer science, Distributed computing, Scalability, Bandwidth, Fault tolerance, Gossip protocol, Consensus algorithm, Synchronization
- Abstract
Future extreme-scale high-performance computing systems will be required to work under frequent component failures. The MPI Forum's User Level Failure Mitigation proposal has introduced an operation, MPI_Comm_shrink, to synchronize the alive processes on the list of failed processes, so that applications can continue to execute even in the presence of failures by adopting algorithm-based fault tolerance techniques. This MPI_Comm_shrink operation requires a fault tolerant failure detection and consensus algorithm. This paper presents and compares two novel failure detection and consensus algorithms. The proposed algorithms are based on Gossip protocols and are inherently fault-tolerant and scalable. The proposed algorithms were implemented and tested using the Extreme-scale Simulator. The results show that in both algorithms the number of Gossip cycles to achieve global consensus scales logarithmically with system size. The second algorithm also shows better scalability in terms of memory and network bandwidth usage and a perfect synchronization in achieving global consensus.
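To illustrate why gossip-based dissemination needs only a logarithmic number of cycles, the small standalone simulation below models plain push gossip of a single rumor; it is a didactic sketch of the scaling behavior, not the failure-detection or consensus protocols presented in the paper.

```c
/* Standalone push-gossip simulation: counts cycles until every process
 * is informed, illustrating the roughly log(N) growth with system size. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static int cycles_to_inform_all(int n, unsigned seed)
{
    srand(seed);
    char *informed = calloc(n, 1);
    informed[0] = 1;                     /* process 0 starts with the rumor */
    int count = 1, cycles = 0;

    while (count < n) {
        char *next = malloc(n);
        memcpy(next, informed, n);
        /* One gossip cycle: every informed process pushes to one random peer. */
        for (int i = 0; i < n; i++) {
            if (!informed[i])
                continue;
            int peer = rand() % n;
            if (!next[peer]) {
                next[peer] = 1;
                count++;
            }
        }
        free(informed);
        informed = next;
        cycles++;
    }
    free(informed);
    return cycles;
}

int main(void)
{
    for (int n = 256; n <= 16384; n *= 4)
        printf("N = %6d  cycles = %d\n", n, cycles_to_inform_all(n, 42u));
    return 0;
}
```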
- Published
- 2015
27. A Network Contention Model for the Extreme-scale Simulator
- Author
-
Thomas Naughton and Christian Engelmann
- Subjects
Software, Computer science, Distributed computing, Extreme scale, Communication bandwidth, Simulation, Network model
- Abstract
The Extreme-scale Simulator (xSim) is a performance investigation toolkit for high-performance computing (HPC) hardware/software co-design. It permits running an HPC application with millions of concurrent execution threads, while observing its performance in a simulated extreme-scale system. This paper details a newly developed network modeling feature for xSim, eliminating the shortcomings of the existing network modeling capabilities. The approach takes a different path for implementing network contention and bandwidth capacity modeling, using a less synchronous but sufficiently accurate model design. With the new network modeling feature, xSim is able to simulate on-chip and on-node networks with reasonable accuracy and overheads.
- Published
- 2015
28. Improving the Performance of the Extreme-Scale Simulator
- Author
-
Thomas Naughton and Christian Engelmann
- Subjects
Computer science, Embarrassingly parallel, Benchmark (computing), Message Passing Interface, Overhead (computing), Parallel computing, Performance improvement, Discrete event simulation, Supercomputer, Simulation
- Abstract
Investigating the performance of parallel applications at scale on future high-performance computing (HPC) architectures and the performance impact of different architecture choices is an important component of HPC hardware/software co-design. The Extreme-scale Simulator (xSim) is a simulation-based toolkit for investigating the performance of parallel applications at scale. xSim scales to millions of simulated Message Passing Interface (MPI) processes. The overhead introduced by a simulation tool is an important performance and productivity aspect. This paper documents two improvements to xSim: (1) a new deadlock resolution protocol to reduce the parallel discrete event simulation management overhead and (2) a new simulated MPI message matching algorithm to reduce the oversubscription management overhead. The results clearly show a significant performance improvement, such as reducing the simulation overhead for running the NAS Parallel Benchmark suite inside the simulator from 1,020% to 238% for the conjugate gradient (CG) benchmark and from 102% to 0% for the embarrassingly parallel (EP) benchmark, as well as from 37,511% to 13,808% for CG and from 3,332% to 204% for EP with accurate process failure simulation.
- Published
- 2014
29. Efficient Checkpointing of Virtual Machines Using Virtual Machine Introspection
- Author
-
Fang Han, Thomas Naughton, Stephen L. Scott, and Ferrol Aderholdt
- Subjects
Computer science, Cloud computing, Fault tolerance, Virtualization, File size, Virtual machine introspection, Virtual machine, Operating system, Latency
- Abstract
Cloud Computing environments rely heavily on system-level virtualization. This is due to the inherent benefits of virtualization including fault tolerance through checkpoint/restart (C/R) mechanisms. Because clouds are the abstraction of large datacenters and large datacenters have a higher potential for failure, it is imperative that a C/R mechanism for such an environment provide minimal latency as well as a small checkpoint file size. Recently, there has been much research into C/R with respect to virtual machines (VM) providing excellent solutions to reduce either checkpoint latency or checkpoint file size. However, these approaches do not provide both. This paper presents a method of checkpointing VMs by utilizing virtual machine introspection (VMI). Through the usage of VMI, we are able to determine which pages of memory within the guest are used or free and are better able to reduce the amount of pages written to disk during a checkpoint. We have validated this work by using various benchmarks to measure the latency along with the checkpoint size. With respect to checkpoint file size, our approach results in file sizes within 24% or less of the actual used memory within the guest. Additionally, the checkpoint latency of our approach is up to 52% faster than KVM's default method.
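The core idea of writing only in-use guest pages can be sketched independently of any hypervisor API. In the toy example below, the used-page bitmap is simply passed in as an argument and the on-disk format is a placeholder; obtaining that bitmap through introspection, and integrating with KVM's checkpoint path, is where the paper's actual work lies.

```c
/* Toy sketch: write only the guest pages marked as in-use to a checkpoint
 * file. The bitmap source and file format are placeholders, not KVM's or
 * the paper's mechanism. */
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

#define PAGE_SIZE 4096

static size_t checkpoint_used_pages(FILE *out, const uint8_t *guest_mem,
                                    const uint8_t *used_bitmap, size_t npages)
{
    size_t written = 0;
    for (size_t pfn = 0; pfn < npages; pfn++) {
        if (!(used_bitmap[pfn / 8] & (1u << (pfn % 8))))
            continue;                                /* skip free pages */
        fwrite(&pfn, sizeof pfn, 1, out);            /* record the frame number */
        fwrite(guest_mem + pfn * PAGE_SIZE, PAGE_SIZE, 1, out);
        written++;
    }
    return written;
}

int main(void)
{
    size_t npages = 16;
    uint8_t *mem = calloc(npages, PAGE_SIZE);        /* stand-in for guest RAM */
    uint8_t bitmap[2] = { 0x05, 0x00 };              /* pages 0 and 2 "in use" */

    FILE *out = fopen("ckpt.img", "wb");
    if (!out)
        return 1;
    size_t n = checkpoint_used_pages(out, mem, bitmap, npages);
    fclose(out);

    printf("wrote %zu of %zu pages\n", n, npages);
    free(mem);
    return 0;
}
```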
- Published
- 2014
30. Supporting the Development of Resilient Message Passing Applications Using Simulation
- Author
-
Thomas Naughton, Christian Engelmann, Geoffroy Vallée, and Swen Bohm
- Subjects
Concurrency control, Software, Computer science, Software fault tolerance, Message passing, Message Passing Interface, Fault tolerance, Parallel computing, Discrete event simulation, Supercomputer
- Abstract
An emerging aspect of high-performance computing (HPC) hardware/software co-design is investigating performance under failure. The work in this paper extends the Extreme-scale Simulator (xSim), which was designed for evaluating the performance of message passing interface (MPI) applications on future HPC architectures, with fault-tolerant MPI extensions proposed by the MPI Fault Tolerance Working Group. xSim permits running MPI applications with millions of concurrent MPI ranks, while observing application performance in a simulated extreme-scale system using a lightweight parallel discrete event simulation. The newly added features offer user-level failure mitigation (ULFM) extensions at the simulated MPI layer to support algorithm-based fault tolerance (ABFT). The presented solution permits investigating performance under failure and failure handling of ABFT solutions. The newly enhanced xSim is the very first performance tool that supports ULFM and ABFT.
- Published
- 2014
31. What Is the Right Balance for Performance and Isolation with Virtualization in HPC?
- Author
-
Stephen L. Scott, Thomas Naughton, Christian Engelmann, Geoffroy Vallée, Garry Smith, and Ferrol Aderholdt
- Subjects
Computer science, Full virtualization, Distributed computing, Hypervisor, Virtualization, Supercomputer, Virtual machine, Computer cluster, Operating system, Isolation, Resilience
- Abstract
The use of virtualization in high-performance computing (HPC) has been suggested as a means to provide tailored services and added functionality that many users expect from full-featured Linux cluster environments. While the use of virtual machines in HPC can offer several benefits, maintaining performance is a crucial factor. In some instances performance criteria are placed above isolation properties, and selective relaxation of isolation for performance is an important characteristic when considering resilience for HPC environments employing virtualization. In this paper we consider some of the factors associated with balancing performance and isolation in configurations that employ virtual machines. In this context, we propose a classification of errors based on the concept of "error zones", as well as a detailed analysis of the trade-offs between resilience and performance based on the level of isolation provided by virtualization solutions. Finally, results from a set of experiments that use different virtualization solutions are presented, allowing further elucidation of the topic.
- Published
- 2014
32. A Runtime Environment for Supporting Research in Resilient HPC System Software & Tools
- Author
-
Christian Engelmann, Thomas Naughton, Swen Bohm, and Geoffroy Vallée
- Subjects
Computer science, Software deployment, Distributed computing, Software fault tolerance, Software development, Software system, Software architecture, System software
- Abstract
The high-performance computing (HPC) community continues to increase the size and complexity of hardware platforms that support advanced scientific workloads. The runtime environment (RTE) is a crucial layer in the software stack for these large-scale systems. The RTE manages the interface between the operating system and the application running in parallel on the machine. The deployment of applications and tools on large-scale HPC computing systems requires the RTE to manage process creation in a scalable manner, support sparse connectivity, and provide fault tolerance. We have developed a new RTE that provides a basis for building distributed execution environments and developing tools for HPC to aid research in system software and resilience. This paper describes the software architecture of the Scalable runTime Component Infrastructure (STCI), which is intended to provide a complete infrastructure for scalable start-up and management of many processes in large-scale HPC systems. We highlight features of the current implementation, which is provided as a system library that allows developers to easily use and integrate STCI in their tools and/or applications. The motivation for this work has been to support ongoing research activities in fault-tolerance for large-scale systems. We discuss the advantages of the modular framework employed and describe two use cases that demonstrate its capabilities: (i) an alternate runtime for a Message Passing Interface (MPI) stack, and (ii) a distributed control and communication substrate for a fault-injection tool.
- Published
- 2013
33. Toward a Performance/Resilience Tool for Hardware/Software Co-design of High-Performance Computing Systems
- Author
-
Christian Engelmann and Thomas Naughton
- Subjects
Computer science, Message passing, Process (computing), Message Passing Interface, Thread (computing), Fault injection, Supercomputer, Concurrency control, Software, Multithreading, Operating system
- Abstract
xSim is a simulation-based performance investigation toolkit that permits running high-performance computing (HPC) applications in a controlled environment with millions of concurrent execution threads, while observing application performance in a simulated extreme-scale system for hardware/software co-design. The presented work details newly developed features for xSim that permit the injection of MPI process failures, the propagation/detection/notification of such failures within the simulation, and their handling using application-level checkpoint/restart. These new capabilities enable the observation of application behavior and performance under failure within a simulated future-generation HPC system using the most common fault handling technique.
- Published
- 2013
34. Architecture for the next generation system management tools
- Author
-
Thomas Naughton, Christine Morin, Geoffroy Vallée, Stephen L. Scott, Adrien Lebre, and Jérôme Gallard
- Subjects
Computer Networks and Communications, Computer science, Distributed computing, Distributed systems, Software, Virtualization, Systems management, Emulation, Virtual platform (VP), Virtual system environment (VSE), HPC system resource management, Hardware and Architecture, Virtual machine, Flexibility
- Abstract
To get more results or greater accuracy, computational scientists execute their applications on distributed computing platforms such as clusters, grids, and clouds. These platforms are different in terms of hardware and software resources as well as locality: some span across multiple sites and multiple administrative domains, whereas others are limited to a single site/domain. As a consequence, in order to scale their applications up, the scientists have to manage technical details for each target platform. From our point of view, this complexity should be hidden from the scientists, who, in most cases, would prefer to focus on their research rather than spending time dealing with platform configuration concerns. In this article, we advocate for a system management framework that aims to automatically set up the whole run-time environment according to the applications' needs. The main difference with regards to usual approaches is that they generally only focus on the software layer whereas we address both the hardware and the software expectations through a unique system. For each application, scientists describe their requirements through the definition of a virtual platform (VP) and a virtual system environment (VSE). Relying on the VP/VSE definitions, the framework is in charge of (i) the configuration of the physical infrastructure to satisfy the VP requirements, (ii) the set-up of the VP, and (iii) the customization of the execution environment (VSE) upon the former VP. We propose a new formalism that the system can rely upon to successfully perform each of these three steps without burdening the user with the specifics of the configuration for the physical resources, and system management tools. This formalism leverages Goldberg's theory for recursive virtual machines (Goldberg, 1973 [6]) by introducing new concepts based on system virtualization (identity, partitioning, aggregation) and emulation (simple, abstraction). This enables the definition of complex VP/VSE configurations without making assumptions about the hardware and the software resources. For each requirement, the system executes the corresponding operation with the appropriate management tool. As a proof of concept, we implemented a first prototype that currently interacts with several system management tools (e.g., OSCAR, the Grid'5000 toolkit, and XtreemOS) and that can be easily extended to integrate new resource brokers or cloud systems such as Nimbus, OpenNebula, or Eucalyptus, for instance.
- Published
- 2012
35. 5th Workshop on System-Level Virtualization for High Performance Computing (HPCVirt 2011)
- Author
-
Thomas Naughton, Stephen L. Scott, and Geoffroy Vallée
- Subjects
Computer science, Systems management, Operating system, Resource management, Virtualization, Supercomputer, Porting
- Abstract
The emergence of virtualization-enabled hardware, such as the latest generation AMD and Intel processors, has raised significant interest in the High Performance Computing (HPC) community. In particular, system-level virtualization provides an opportunity to advance the design and development of operating systems, programming environments, administration practices, and resource management tools. This leads to some potential research topics for HPC, such as failure tolerance, system management, and solutions for application porting to new HPC platforms. The workshop on System-level Virtualization for HPC (HPCVirt 2011) is intended to be a forum for the exchange of ideas and experiences on the use of virtualization technologies for HPC, the challenges and opportunities offered by the development of system-level virtualization solutions themselves, as well as case studies in the application of system-level virtualization in HPC.
- Published
- 2012
36. A Case for Virtual Machine Based Fault Injection in a High-Performance Computing Environment
- Author
-
Thomas Naughton, Geoffroy Vallée, Christian Engelmann, and Stephen L. Scott
- Subjects
Computer science, Full virtualization, Distributed computing, Message Passing Interface, Fault tolerance, Fault injection, Virtualization, System under test, Virtual machine, Resilience, System software
- Abstract
Large-scale computing platforms provide tremendous capabilities for scientific discovery. As applications and system software scale up to multi-petaflops and beyond to exascale platforms, the occurrence of failure will be much more common. This has given rise to a push in fault-tolerance and resilience research for high-performance computing (HPC) systems. This includes work on log analysis to identify types of failures, enhancements to the Message Passing Interface (MPI) to incorporate fault awareness, and a variety of fault tolerance mechanisms that span redundant computation, algorithm-based fault tolerance, and advanced checkpoint/restart techniques. While there is much work to be done on the FT/Resilience mechanisms for such large-scale systems, there is also a profound gap in the tools for experimentation. This gap is compounded by the fact that HPC environments have stringent performance requirements and are often highly customized. The tool chain for these systems is often tailored for the platform, and the operating environments typically contain many site/machine specific enhancements. Therefore, it is desirable to maintain a consistent execution environment to minimize end-user (scientist) interruption. The work on system-level virtualization for HPC systems offers a unique opportunity to maintain a consistent execution environment via a virtual machine (VM). Recent work on virtualization for HPC has shown that low-overhead, high performance systems can be realized [7, 15]. Virtualization also provides a clean abstraction for building experimental tools for investigation into the effects of failures in HPC and the related research on FT/Resilience mechanisms and policies. In this paper we discuss the motivation for tools to perform fault injection in an HPC context. We also present the design of a new fault injection framework that can leverage virtualization.
- Published
- 2012
37. Efficient Replication of Over 180 Genetic Associations with Self‐Reported Medical Data
- Author
-
Arnab B. Chowdry, Anne Wojcicki, J. Michael Macpherson, Joyce Y. Tung, Uta Francke, Amy K. Kiefer, Joanna Naughton, Chuong B. Do, Brian Thomas Naughton, David A. Hinds, and Nicholas Eriksson
- Subjects
Genomic data, Cohort, Replication (statistics), Medicine, Medical information, Odds ratio, Medical diagnosis, Demography
- Abstract
While the cost and speed of generating genomic data have come down dramatically in recent years, the slow pace of collecting medical data for large cohorts continues to hamper genetic research. Here we evaluate a novel online framework for amassing large amounts of medical information in a recontactable cohort by assessing our ability to replicate genetic associations using these data. Using web‐based questionnaires, we gathered self-reported data on 50 medical phenotypes from a generally unselected cohort of over 20,000 genotyped individuals. Of a list of genetic associations curated by NHGRI, we successfully replicated about 75% of the associations that we expected to (based on the number of cases in our cohort and reported odds ratios, and excluding a set of associations with contradictory published evidence). Altogether we replicated over 180 previously reported associations, including many for type 2 diabetes, prostate cancer, cholesterol levels, and multiple sclerosis. We found significant variation across categories of conditions in the percentage of expected associations that we were able to replicate, which may reflect systematic inflation of the effects in some initial reports, or differences across diseases in the likelihood of misdiagnosis or misreport. We also demonstrated that we could improve replication success by taking advantage of our recontactable cohort, offering more in‐depth questions to refine self‐reported diagnoses. Our data suggests that online collection of self‐reported data in a recontactable cohort may be a viable method for both broad and deep phenotyping in large populations.
- Published
- 2011
38. Realization of User Level Fault Tolerant Policy Management through a Holistic Approach for Fault Correlation
- Author
-
Pratul K. Agarwal, Al Geist, Jennifer L. Tippens, Thomas Naughton, Byung H. Park, and David E. Bernholdt
- Subjects
Computer science, Software fault tolerance, Distributed computing, Fault tolerance, User interface, Event correlation, Supercomputer, Scheduling (computing)
- Abstract
Many modern scientific applications, which are designed to utilize high performance parallel computers, occupy hundreds of thousands of computational cores running for days or even weeks. Since many scientists compete for resources, most supercomputing centers practice strict scheduling policies and perform meticulous accounting on their usage. Thus computing resources and time assigned to a user is considered invaluable. However, most applications are not well prepared for unforeseeable faults, still relying on primitive fault tolerance techniques. Considering that ever-plunging mean time to interrupt (MTTI) is making scientific applications more vulnerable to faults, it is increasingly important to provide users not only an improved fault tolerant environment, but also a framework to support their own fault tolerance policies so that their allocation times can be best utilized. This paper addresses a user level fault tolerance policy management based on a holistic approach to digest and correlate fault related information. It introduces simple semantics with which users express their policies on faults, and illustrates how event correlation techniques can be applied to manage and determine the most preferable user policies. The paper also discusses an implementation of the framework using open source software, and demonstrates, as an example, how a molecular dynamics simulation application running on the institutional cluster at Oak Ridge National Laboratory benefits from it.
- Published
- 2011
39. Efficient replication of over 180 genetic associations with self-reported medical data
- Author
-
Nicholas Eriksson, Brian Thomas Naughton, Arnab B. Chowdry, Anne Wojcicki, Amy K. Kiefer, Uta Francke, J. Michael Macpherson, Joyce Y. Tung, Joanna L. Mountain, David A. Hinds, and Chuong B. Do
- Subjects
Genotype, Genome-wide association study, Genomics, Human genetics, Cohort studies, Surveys and questionnaires, Odds ratio, Single nucleotide polymorphism, Genetic association studies, Genetic testing, Personalized medicine, Web-based applications, Computational biology
- Abstract
While the cost and speed of generating genomic data have come down dramatically in recent years, the slow pace of collecting medical data for large cohorts continues to hamper genetic research. Here we evaluate a novel online framework for obtaining large amounts of medical information from a recontactable cohort by assessing our ability to replicate genetic associations using these data. Using web-based questionnaires, we gathered self-reported data on 50 medical phenotypes from a generally unselected cohort of over 20,000 genotyped individuals. Of a list of genetic associations curated by NHGRI, we successfully replicated about 75% of the associations that we expected to (based on the number of cases in our cohort and reported odds ratios, and excluding a set of associations with contradictory published evidence). Altogether we replicated over 180 previously reported associations, including many for type 2 diabetes, prostate cancer, cholesterol levels, and multiple sclerosis. We found significant variation across categories of conditions in the percentage of expected associations that we were able to replicate, which may reflect systematic inflation of the effects in some initial reports, or differences across diseases in the likelihood of misdiagnosis or misreport. We also demonstrated that we could improve replication success by taking advantage of our recontactable cohort, offering more in-depth questions to refine self-reported diagnoses. Our data suggest that online collection of self-reported data from a recontactable cohort may be a viable method for both broad and deep phenotyping in large populations.
- Published
- 2011
40. A Log-Scaling Fault Tolerant Agreement Algorithm for a Fault Tolerant MPI
- Author
-
Geoffroy Vallée, Joshua Hursey, Thomas Naughton, and Richard L. Graham
- Subjects
Reduction (complexity) ,Computer science ,General protection fault ,Semantics (computer science) ,Software fault tolerance ,Distributed computing ,Scalability ,Fault tolerance ,Parallel computing ,Commit ,Scaling ,Algorithm - Abstract
The lack of fault tolerance is becoming a limiting factor for application scalability in HPC systems. The MPI standard does not provide standardized fault tolerance interfaces or semantics. The MPI Forum's Fault Tolerance Working Group is proposing a collective fault-tolerant agreement algorithm for the next MPI standard; such algorithms play a central role in many fault-tolerant applications. This paper combines a log-scaling two-phase-commit agreement algorithm with a reduction operation to provide the necessary functionality for the new collective without any additional messages. Error-handling mechanisms are described that preserve the fault tolerance properties while maintaining overall scalability.
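To illustrate the general pattern behind a log-scaling agreement, the toy sketch below combines per-process flags with a logical AND up a binary tree in O(log P) rounds and then broadcasts the root's decision. It is only a simulation of the basic idea; the paper's algorithm additionally has to survive process failures during the agreement itself, which this sketch does not attempt.

```python
# Toy tree-based agreement: AND-reduce local flags, then broadcast the result.
def tree_agree(local_flags):
    """Return the agreed decision as seen by every rank."""
    values = list(local_flags)
    p = len(values)
    # Reduction phase: in each round, ranks that are multiples of 2*step
    # absorb the partial result from the partner `step` ranks away.
    step = 1
    while step < p:
        for rank in range(0, p, 2 * step):
            partner = rank + step
            if partner < p:
                values[rank] = values[rank] and values[partner]
        step *= 2
    # Broadcast phase: rank 0 holds the agreed value; every rank adopts it.
    return [values[0]] * p

print(tree_agree([True, True, False, True]))  # all ranks learn "abort"
```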
- Published
- 2011
41. Consent and Internet-Enabled Human Genomics
- Author
-
Joyce Y. Tung, Itsik Pe'er, J. Michael Macpherson, Nicholas Eriksson, Anne Wojcicki, Serge Saxonov, Joanna L. Mountain, Brian Thomas Naughton, Lawrence S. Hon, and Linda Avey
- Subjects
Cancer Research ,Genotype ,lcsh:QH426-470 ,Science Policy ,Single-nucleotide polymorphism ,Genomics ,Genome-wide association study ,Biology ,Genetics and Genomics/Complex Traits ,03 medical and health sciences ,0302 clinical medicine ,Genotype-phenotype distinction ,Genetics and Genomics/Population Genetics ,Genetic variation ,Genetics ,Eye color ,Chromosomes, Human ,Humans ,Molecular Biology ,Genetics (clinical) ,Ecology, Evolution, Behavior and Systematics ,030304 developmental biology ,0303 health sciences ,Internet ,Models, Genetic ,Genetic Variation ,Genetics and Genomics ,lcsh:Genetics ,Variation (linguistics) ,Phenotype ,Editorial ,Evolutionary biology ,Identification (biology) ,030217 neurology & neurosurgery ,Research Article ,Genome-Wide Association Study ,Hair - Abstract
Despite the recent rapid growth in genome-wide data, much of human variation remains entirely unexplained. A significant challenge in the pursuit of the genetic basis for variation in common human traits is the efficient, coordinated collection of genotype and phenotype data. We have developed a novel research framework that facilitates the parallel study of a wide assortment of traits within a single cohort. The approach takes advantage of the interactivity of the Web both to gather data and to present genetic information to research participants, while taking care to correct for the population structure inherent to this study design. Here we report initial results from a participant-driven study of 22 traits. Replications of associations (in the genes OCA2, HERC2, SLC45A2, SLC24A4, IRF4, TYR, TYRP1, ASIP, and MC1R) for hair color, eye color, and freckling validate the Web-based, self-reporting paradigm. The identification of novel associations for hair morphology (rs17646946, near TCHH; rs7349332, near WNT10A; and rs1556547, near OFCC1), freckling (rs2153271, in BNC2), the ability to smell the methanethiol produced after eating asparagus (rs4481887, near OR2M7), and photic sneeze reflex (rs10427255, near ZEB2, and rs11856995, near NR2F2) illustrates the power of the approach. Author Summary: Twin studies have shown that many human physical characteristics, such as hair curl, earlobe shape, and pigmentation are at least partly heritable. In order to identify the genes involved in such traits, we administered Web-based surveys to the customer base of 23andMe, a personal genetics company. Upon completion of surveys, participants were able to see how their answers compared to those of other customers. Our examination of 22 different common traits in nearly 10,000 participants revealed associations among several single-nucleotide polymorphisms (SNPs, a type of common DNA sequence variation) and freckling, hair curl, asparagus anosmia (the inability to detect certain urinary metabolites produced after eating asparagus), and photic sneeze reflex (the tendency to sneeze when entering bright light). Additionally our analysis verified the association of a large number of previously identified genes with variation in hair color, eye color, and freckling. Our analysis not only identified new genetic associations, but also showed that our novel way of doing research—collecting self-reported data over the Web from involved participants who also receive interpretations of their genetic data—is a viable alternative to traditional methods.
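The abstract's note about correcting for the population structure inherent in a self-selected Web cohort can be illustrated with a small synthetic example: include leading principal components of the genotype matrix as covariates in the association regression. The sketch below uses invented data and numpy only; it is not the study's actual analysis pipeline.

```python
# Association test with principal components as ancestry covariates
# (synthetic data, illustrative only).
import numpy as np

rng = np.random.default_rng(0)
n, n_snps = 500, 200

genotypes = rng.binomial(2, 0.3, size=(n, n_snps)).astype(float)
phenotype = 0.4 * genotypes[:, 0] + rng.normal(size=n)  # SNP 0 is causal

# Principal components of the centered genotype matrix capture structure.
centered = genotypes - genotypes.mean(axis=0)
pcs = np.linalg.svd(centered, full_matrices=False)[0][:, :5]

def assoc_t_stat(snp: np.ndarray) -> float:
    """t statistic for the SNP term in phenotype ~ intercept + SNP + PCs."""
    X = np.column_stack([np.ones(n), snp, pcs])
    beta, *_ = np.linalg.lstsq(X, phenotype, rcond=None)
    resid = phenotype - X @ beta
    sigma2 = resid @ resid / (n - X.shape[1])
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return beta[1] / np.sqrt(cov[1, 1])

print(round(assoc_t_stat(genotypes[:, 0]), 1))  # causal SNP: large |t|
print(round(assoc_t_stat(genotypes[:, 1]), 1))  # null SNP: |t| near zero
```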
- Published
- 2010
42. Loadable Hypervisor Modules
- Author
-
Geoffroy Vallée, Stephen L. Scott, Thomas Naughton, and Ferrol Aderholdt
- Subjects
Software_OPERATINGSYSTEMS ,business.industry ,Computer science ,media_common.quotation_subject ,Hypervisor ,Tracing ,computer.software_genre ,Storage hypervisor ,Debugging ,Virtual machine ,Embedded system ,Operating system ,Instrumentation (computer programming) ,business ,computer ,media_common - Abstract
This paper discusses the implementation of a new hypervisor mechanism for loading dynamic shared objects (modules) at runtime. These loadable hypervisor modules (LHM) are modeled after the loadable kernel modules used in Linux. We detail the current LHM implementation based on the Xen hypervisor. Potential use cases for this LHM mechanism include dynamic hypervisor instrumentation for debug tracing or performance analysis. We discuss the initial LHM prototype and future plans.
- Published
- 2010
43. Nonparametric multivariate anomaly analysis in support of HPC resilience
- Author
-
Thomas Naughton, Christian Engelmann, Geoffroy Vallée, S.L. Scott, and George Ostrouchov
- Subjects
Root (linguistics) ,business.industry ,Computer science ,Distributed computing ,Nonparametric statistics ,Condition monitoring ,computer.software_genre ,Identification (information) ,Software ,Software fault tolerance ,Data mining ,Resilience (network) ,business ,Failure mode and effects analysis ,computer - Abstract
Large-scale computing systems provide great potential for scientific exploration. However, the complexity that accompanies these enormous machines raises challenges for both users and operators. The effective use of such systems is often hampered by failures encountered when running applications on systems containing tens of thousands of nodes and hundreds of thousands of compute cores capable of yielding petaflops of performance. In systems of this size, failure detection is complicated and root-cause diagnosis is difficult. This paper describes our recent work in the identification of anomalies in monitoring data and system logs to provide further insights into machine status, runtime behavior, failure modes and failure root causes. It discusses the details of an initial prototype that gathers the data and uses statistical techniques for analysis.
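One simple nonparametric way to flag anomalous nodes in multivariate monitoring data is sketched below: rank-transform each metric (so no distributional form is assumed) and score each observation by its distance to its k-th nearest neighbour. This is an illustrative technique in the same spirit as the abstract, not the paper's actual method.

```python
# Rank-based k-nearest-neighbour anomaly scoring on monitoring metrics.
import numpy as np

def knn_anomaly_scores(samples: np.ndarray, k: int = 5) -> np.ndarray:
    """samples: (n_observations, n_metrics) monitoring matrix."""
    # Rank-transform each metric column (nonparametric scaling to [0, 1]).
    ranks = samples.argsort(axis=0).argsort(axis=0).astype(float)
    ranks /= max(len(samples) - 1, 1)
    # Pairwise Euclidean distances on the ranked data.
    diff = ranks[:, None, :] - ranks[None, :, :]
    dists = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dists, np.inf)
    # Distance to the k-th nearest neighbour is the anomaly score.
    return np.sort(dists, axis=1)[:, k - 1]

rng = np.random.default_rng(1)
data = rng.normal(size=(200, 4))     # e.g. temperature, load, ECC counts, ...
data[0] = [8.0, 8.0, 8.0, 8.0]       # one clearly anomalous observation
scores = knn_anomaly_scores(data)
print(scores.argmax())               # observation 0 should stand out
```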
- Published
- 2009
44. Fault injection framework for system resilience evaluation
- Author
-
Christian Engelmann, Geoffroy Vallée, Stephen L. Scott, Wesley Bland, and Thomas Naughton
- Subjects
Basic premise ,Engineering ,Software fault ,business.industry ,Common method ,Fault injection ,business ,Masking (Electronic Health Record) ,Reliability engineering - Abstract
As high-performance computing (HPC) systems increase in size and complexity they become more difficult to manage. The enormous component counts associated with these large systems lead to significant challenges in system reliability and availability. This in turn is driving research into the resilience of large-scale systems, which seeks to curb the effects of increased failures at large scales by masking the inevitable faults in these systems. The basic premise is that failures must be accepted as a reality of large-scale systems and coped with accordingly through system resilience. A key component in the development and evaluation of system resilience techniques is having a means to conduct controlled experiments. A common method for performing such experiments is to generate synthetic faults and study the resulting effects. In this paper we discuss the motivation and our initial use of software fault injection to support the evaluation of resilience for HPC systems. We mention background and related work in the area and discuss the design of a tool to aid in fault injection experiments for both user-space (application-level) and system-level failures.
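The sketch below shows the flavor of a simple software fault injector covering both kinds of failure the abstract mentions: an application-level data fault (a bit flip in a buffer) and a process-level failure (a signal sent to the process). The helper names and the split of fault types are illustrative assumptions, not the tool's actual design.

```python
# Minimal sketch of a software fault injector: after a random delay, either
# flip one bit in an application buffer or signal the target process.
import os, random, signal, threading, time

def inject_after_delay(buffer: bytearray, max_delay: float = 5.0,
                       kill_probability: float = 0.2) -> threading.Thread:
    def _inject():
        time.sleep(random.uniform(0.0, max_delay))
        if random.random() < kill_probability:
            # System-level failure: terminate the (current) target process.
            os.kill(os.getpid(), signal.SIGTERM)
        else:
            # Application-level fault: flip one random bit in the buffer.
            i = random.randrange(len(buffer))
            buffer[i] ^= 1 << random.randrange(8)
    t = threading.Thread(target=_inject, daemon=True)
    t.start()
    return t

# Example: a toy "application" buffer that may be silently corrupted.
data = bytearray(b"\x00" * 1024)
inject_after_delay(data, max_delay=0.1, kill_probability=0.0)
time.sleep(0.2)
print("corrupted bytes:", sum(1 for b in data if b != 0))
```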
- Published
- 2009
45. Performance comparison of two virtual machine scenarios using an HPC application
- Author
-
Thomas Naughton, Christian Engelmann, Sadaf R. Alam, Hong Ong, Stephen L. Scott, Geoffroy Vallée, and Anand Tikotekar
- Subjects
Virtual finite-state machine ,Full virtualization ,Computer science ,Virtual machine ,Temporal isolation among virtual machines ,Operating system ,Hypervisor ,Interrupt ,computer.software_genre ,Massively parallel ,computer ,Context switch - Abstract
Obtaining a high ratio of flexibility to performance loss is a key challenge in today's HPC virtualization landscape. While extensive research has targeted extracting more performance from virtual machines, the question of whether novel virtual machine usage scenarios could yield a better flexibility-versus-performance trade-off has received less attention. In this paper we take a step forward by studying and comparing the performance implications of running the Large-scale Atomic/Molecular Massively Parallel Simulator (LAMMPS) application under two virtual machine configurations. The first configuration consists of two virtual machines per node with one application process per virtual machine; the second consists of one virtual machine per node with two processes per virtual machine. Xen is used as the hypervisor and standard Linux as the guest operating system. Our results show that the difference in overall performance impact on LAMMPS between the two configurations is around 3%. We also examine how each configuration differs in individual metrics such as CPU, I/O, memory, and interrupt/context-switch behavior.
- Published
- 2009
46. A tunable holistic resiliency approach for high-performance computing systems
- Author
-
Nichamon Naksinehaboon, Raja Nassar, Arun Babu Nagarajan, Thomas Naughton, George Ostrouchov, Christian Engelmann, Anand Tikotekar, Frank Mueller, Chokchai Leangsuksun, Chao Wang, Jyothish Varma, Stephen L. Scott, Geoffroy Vallée, and Mihaela Paun
- Subjects
Process (engineering) ,business.industry ,Computer science ,Reliability (computer networking) ,media_common.quotation_subject ,Preemption ,Fault tolerance ,Failure rate ,Context (language use) ,computer.software_genre ,Fault (power engineering) ,Computer Graphics and Computer-Aided Design ,Risk analysis (engineering) ,Virtual machine ,Embedded system ,Psychological resilience ,business ,computer ,Software ,media_common - Abstract
In order to address anticipated high failure rates, resiliency characteristics have become an urgent priority for next-generation extreme-scale high-performance computing (HPC) systems. This poster describes our past and ongoing efforts in novel fault resilience technologies for HPC. The presented work includes proactive fault resilience techniques, system and application reliability models and analyses, failure prediction, transparent process- and virtual-machine-level migration, and trade-off models for combining preemptive migration with checkpoint/restart. The poster summarizes our work and puts the individual technologies into context within a proposed holistic fault resilience framework.
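A back-of-the-envelope sketch of the kind of trade-off such models capture: Young's classic approximation for the optimal checkpoint interval, sqrt(2 * checkpoint_cost * MTBF), combined with the observation that proactive migration effectively lengthens the MTBF seen by reactive checkpointing by the fraction of failures it anticipates. This is an illustrative calculation under those stated assumptions, not the poster's actual model.

```python
# Young's approximation with an adjustment for failures avoided proactively.
import math

def optimal_checkpoint_interval(checkpoint_cost_s: float, mtbf_s: float,
                                fraction_failures_avoided: float = 0.0) -> float:
    effective_mtbf = mtbf_s / max(1.0 - fraction_failures_avoided, 1e-9)
    return math.sqrt(2.0 * checkpoint_cost_s * effective_mtbf)

# 10-minute checkpoints on a system with an 8-hour MTBF:
print(optimal_checkpoint_interval(600, 8 * 3600) / 60)       # ~98 minutes
# If proactive migration deflects 70% of failures, checkpoints can be rarer:
print(optimal_checkpoint_interval(600, 8 * 3600, 0.7) / 60)  # ~179 minutes
```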
- Published
- 2009
47. Proactive Fault Tolerance Using Preemptive Migration
- Author
-
Geoffroy Vallée, Christian Engelmann, Thomas Naughton, and Stephen L. Scott
- Subjects
Parallel processing (DSP implementation) ,System failure ,Computer science ,Node (networking) ,Reliability (computer networking) ,Distributed computing ,System recovery ,Fault tolerance ,Architecture ,Supercomputer - Abstract
Proactive fault tolerance (FT) in high-performance computing is a concept that prevents compute node failures from impacting running parallel applications by preemptively migrating application parts away from nodes that are about to fail. This paper provides a foundation for proactive FT by defining its architecture and classifying implementation options. This paper further relates prior work to the presented architecture and classification, and discusses the challenges ahead for needed supporting technologies.
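The control loop at the heart of this pattern can be sketched as follows: a monitor periodically evaluates per-node health indicators and, when a node looks likely to fail, asks the runtime to move the work hosted there before the failure occurs. All names here (health_reader, migrate) are placeholders for whatever prediction and migration mechanisms a real system provides.

```python
# Schematic proactive-FT monitor loop with pluggable health and migration hooks.
import time
from typing import Callable, Dict

def proactive_monitor(health_reader: Callable[[], Dict[str, float]],
                      migrate: Callable[[str], None],
                      threshold: float = 0.8,
                      poll_interval: float = 5.0,
                      rounds: int = 3) -> None:
    """health_reader returns a failure-likelihood score in [0, 1] per node."""
    evacuated = set()
    for _ in range(rounds):
        for node, risk in health_reader().items():
            if risk >= threshold and node not in evacuated:
                migrate(node)           # evacuate before the node fails
                evacuated.add(node)
        time.sleep(poll_interval)

# Toy usage with canned health readings and a logging "migration".
readings = iter([{"n1": 0.2, "n2": 0.3}, {"n1": 0.9, "n2": 0.3},
                 {"n1": 0.9, "n2": 0.4}])
proactive_monitor(lambda: next(readings),
                  lambda node: print(f"migrating work off {node}"),
                  poll_interval=0.0)
```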
- Published
- 2009
48. Proposal for Modifications to the OSCAR Architecture to Address Challenges in Distributed System Management
- Author
-
Thomas Naughton, S.L. Scott, and Geoffroy Vallée
- Subjects
business.industry ,Computer science ,Distributed computing ,Modular design ,computer.software_genre ,Systems analysis ,Software deployment ,Systems management ,User interface ,Architecture ,business ,Software architecture ,computer ,Graphical user interface - Abstract
OSCAR, a tool for the deployment and management of clusters, has historically provided a simple graphical user interface (GUI) that aims to hide the technical details associated with managing such distributed platforms. The OSCAR GUI followed a fairly monolithic architecture that was difficult to extend. The OSCAR developers have since made deep modifications to the overall architecture to make it much more modular, and as a result it is now fairly simple to support OSCAR on new Linux distributions. A few questions remain, however. Is the present OSCAR architecture suitable for addressing the current challenges of distributed system management? For instance, does OSCAR provide a tool set that answers the needs of system administrators? This document presents a critique of the present OSCAR architecture in order to identify these challenges and, based on this analysis, proposes a modified version of the architecture that emphasizes simplicity and incremental enhancement.
- Published
- 2008
49. Effects of virtualization on a scientific application running a hyperspectral radiative transfer code on virtual machines
- Author
-
Anand Tikotekar, Geoffroy Vallée, Stephen L. Scott, Christian Engelmann, Anthony M. Filippi, Thomas Naughton, and Hong Ong
- Subjects
Application virtualization ,Computer science ,Full virtualization ,Hardware virtualization ,Distributed computing ,Temporal isolation among virtual machines ,Thin provisioning ,Fault tolerance ,computer.software_genre ,Supercomputer ,Virtualization ,Virtual machine ,Operating system ,computer - Abstract
The topic of system-level virtualization has recently begun to receive interest for high performance computing (HPC). This is in part due to the isolation and encapsulation offered by the virtual machine. These traits enable applications to customize their environments and maintain consistent software configurations in their virtual domains. Additionally, there are mechanisms, such as live virtual machine migration, that can be used for fault tolerance. Given these attractive benefits of virtualization, a fundamental question arises: how does it affect my scientific application? We use this as the premise for our paper and observe a real-world scientific code running on a Xen virtual machine. We studied the effects of running a radiative transfer simulation, Hydrolight, on a virtual machine. We discuss our methodology and report observations regarding the use of virtualization with this application.
- Published
- 2008
50. A Framework for Proactive Fault Tolerance
- Author
-
Thomas Naughton, Geoffroy Vallée, Anand Tikotekar, Christian Engelmann, S.L. Scott, K. Charoenpornwattana, and Chokchai Leangsuksun
- Subjects
Computer science ,Software fault tolerance ,Distributed computing ,System recovery ,Fault tolerance ,Modular architecture ,Virtualization ,computer.software_genre ,Adaptation (computer science) ,computer - Abstract
Fault tolerance is a major concern for guaranteeing the availability of critical services as well as application execution. Traditional approaches to fault tolerance include checkpoint/restart or duplication. However, it is also possible to anticipate failures and proactively take action before they occur in order to minimize their impact on the system and on application execution. This document presents a proactive fault tolerance framework. The framework can use different proactive fault tolerance mechanisms, i.e., migration and pause/un-pause, and its modular architecture allows new proactive fault tolerance policies to be implemented. A first proactive fault tolerance policy has been implemented, and preliminary experiments based on system-level virtualization have been conducted and compared with results obtained by simulation.
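One way to picture such a modular policy layer is sketched below: the framework exposes fault-handling mechanisms (here migrate and pause/un-pause) behind a common interface, and pluggable policies decide which mechanism to apply for a given warning. Class and method names are invented for illustration and are not the framework's real API.

```python
# Hypothetical pluggable-policy layer choosing between FT mechanisms.
from abc import ABC, abstractmethod

class Mechanism(ABC):
    @abstractmethod
    def apply(self, vm: str) -> str: ...

class Migrate(Mechanism):
    def apply(self, vm: str) -> str:
        return f"live-migrating {vm} to a healthy node"

class PauseUnpause(Mechanism):
    def apply(self, vm: str) -> str:
        return f"pausing {vm} until the warning clears"

class Policy(ABC):
    @abstractmethod
    def choose(self, expected_outage_s: float) -> Mechanism: ...

class SimpleThresholdPolicy(Policy):
    """Migrate for long expected outages, pause for short transient ones."""
    def __init__(self, cutoff_s: float = 60.0):
        self.cutoff_s = cutoff_s
    def choose(self, expected_outage_s: float) -> Mechanism:
        return Migrate() if expected_outage_s > self.cutoff_s else PauseUnpause()

policy = SimpleThresholdPolicy()
print(policy.choose(expected_outage_s=10).apply("vm-7"))    # pause/un-pause
print(policy.choose(expected_outage_s=600).apply("vm-7"))   # migration
```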
- Published
- 2008