65 results for "Omer Subasi"
Search Results
52. Comparative analysis of soft-error detection strategies
- Author
-
Osman Unsal, Joseph Manzano, Sriram Krishnamoorthy, Gokcen Kestor, Burcu O. Mutlu, and Omer Subasi
- Subjects
020203 distributed computing ,Computer science ,Iterative method ,Detector ,02 engineering and technology ,Silent data corruption ,Outcome (probability) ,020202 computer hardware & architecture ,Comparative evaluation ,Soft error detection ,Soft error ,0202 electrical engineering, electronic engineering, information engineering ,Transient (oscillation) ,Algorithm
- Abstract
Undetected soft errors caused by transient bit flips can lead to silent data corruption (SDC), an undesirable outcome where invalid results pass for valid ones. This has motivated the design of soft error detectors to minimize SDCs. However, the detectors have been studied under different contexts, making comparative evaluation difficult. In this paper, we present the first comprehensive evaluation of four online soft error detection techniques in detecting the adverse impact of soft errors on iterative methods. We observe that, across five iterative methods, the detectors studied achieve high but not perfect detection rates. To understand the potential for improved detection, we evaluate a machine-learning-based detector that takes as its features the runtime features observed by the individual detectors to arrive at their conclusions. Our evaluation demonstrates improved but still far from perfect detection accuracy for the machine-learning-based detectors. This extensive evaluation demonstrates the need for designing error detectors to handle the evolutionary behavior exhibited by iterative solvers.
- Published
- 2018
53. On the theory of speculative checkpointing
- Author
-
Sriram Krishnamoorthy and Omer Subasi
- Subjects
020203 distributed computing ,Mathematical optimization ,Software_OPERATINGSYSTEMS ,Computer science ,Total cost ,Sampling (statistics) ,0102 computer and information sciences ,02 engineering and technology ,01 natural sciences ,010201 computation theory & mathematics ,0202 electrical engineering, electronic engineering, information engineering ,Overhead (computing) ,Probability distribution ,Speculation ,Energy (signal processing) ,Rollback ,Selection (genetic algorithm)
- Abstract
Collective checkpoint/rollback is the most popular approach for dealing with fail-stop errors on high-performance computing platforms. Prior work has focused on choosing checkpoint intervals that minimize the total cost of checkpoint/rollback. This work introduces the notion of speculative checkpointing, where we probabilistically skip some checkpoints. The careful selection of checkpoints either to be taken or skipped has the potential to reduce the total checkpoint/rollback overhead. We mathematically formulate the overall checkpoint/rollback cost in the presence of speculation. We consider the choice of speculation as a fixed probability or a probability distribution. We formulate two criteria to be minimized: total execution time and approximate total energy. We derive the criteria for beneficial speculative checkpointing for exponential and arbitrary failure distributions. Furthermore, we analyze the joint optimization of energy and time to express the trade-offs mathematically. We validate the formulations and evaluate various scenarios using discrete-event simulation. Experimental evaluation validates the models and demonstrates that employing speculation and choosing to speculate by sampling a distribution derived from the failure distribution achieves the best performance.
- Published
- 2018
54. Approximate Computing Techniques for Iterative Graph Algorithms
- Author
-
Mahantesh Halappanavar, Omer Subasi, Ananth Kalyanaraman, Daniel Chavarría-Miranda, Sriram Krishnamoorthy, and Ajay Panyala
- Subjects
Loop (graph theory) ,Theoretical computer science ,Computer science ,Perforation (oil well) ,Parallel algorithm ,010103 numerical & computational mathematics ,02 engineering and technology ,01 natural sciences ,Graph ,law.invention ,PageRank ,law ,020204 information systems ,Scalability ,Synchronization (computer science) ,0202 electrical engineering, electronic engineering, information engineering ,Graph coloring ,0101 mathematics ,Heuristics
- Abstract
Approximate computing enables processing of large-scale graphs by trading off quality for performance. Approximate computing techniques have become critical not only due to the emergence of parallel architectures but also due to the availability of large-scale datasets enabling data-driven discovery. Using two prototypical graph algorithms, PageRank and community detection, we present several approximate computing heuristics to scale the performance with minimal loss of accuracy. These heuristics include loop perforation, data caching, incomplete graph coloring and synchronization, and we evaluate their efficiency. We demonstrate performance improvements of up to 83% for PageRank and up to 450x for community detection, with low impact on accuracy for both algorithms. We expect the proposed approximate techniques will enable scalable graph analytics on data of importance to several applications in science and their subsequent adoption to scale similar graph algorithms.
- Published
- 2017
55. Automatic Risk-based Selective Redundancy for Fault-tolerant Task-parallel HPC Applications
- Author
-
Osman Unsal, Omer Subasi, and Sriram Krishnamoorthy
- Subjects
Dataflow ,Computer science ,Heuristic (computer science) ,Distributed computing ,Redundancy (engineering) ,Overhead (computing) ,Task parallelism ,Fault tolerance ,Heuristics ,Replication (computing)
- Abstract
Silent data corruption (SDC) and fail-stop errors are the most hazardous error types in high-performance computing (HPC) systems. In this study, we present an automatic, efficient and lightweight redundancy mechanism to mitigate both error types. We propose partial task-replication and checkpointing for task-parallel HPC applications to mitigate silent and fail-stop errors. To avoid the prohibitive costs of complete replication, we introduce a lightweight selective replication mechanism. Using a fully automatic and transparent heuristic, we identify and selectively replicate only the reliability-critical tasks based on a risk metric. Our approach detects and corrects around 70% of silent errors with only 5% average performance overhead. Additionally, the performance overhead of the heuristic itself is negligible.
- Published
- 2017
56. Toward a General Theory of Optimal Checkpoint Placement
- Author
-
Sriram Krishnamoorthy, Gokcen Kestor, and Omer Subasi
- Subjects
020203 distributed computing ,Exponential distribution ,Computer science ,Reliability (computer networking) ,0202 electrical engineering, electronic engineering, information engineering ,020201 artificial intelligence & image processing ,Failure rate ,02 engineering and technology ,Parallel computing ,Constant (mathematics) ,Algorithm
- Abstract
Checkpoint/restart has been widely used to cope with fail-stop errors. The checkpointing frequency is most often optimized by assuming an exponential failure distribution. However, field studies show that failures most often do not follow a constant-failure-rate exponential distribution. Therefore, the optimal checkpointing frequency should be computed and tuned considering the different distributions that failures follow. Moreover, due to operating system and input/output jitter and hybrid solutions that combine checkpointing with other techniques, such as data compression, checkpointing time can no longer be assumed constant. Thus, time-varying checkpointing time should be accounted for to realistically model the application execution. In this study, we develop a mathematical theory and model to optimize the checkpointing frequency with respect to arbitrary failure distributions while capturing time-dependent non-constant checkpointing time. We show that we can provide closed-form formulas for important failure distributions in most cases. By instantiating our model, we study and analyze 10 important failure distributions to obtain the optimal checkpointing frequency for these distributions. Experimental evaluation shows that our model is highly accurate and deviates from the simulations by less than 1% on average.
- Published
- 2017
57. Designing and modelling selective replication for fault-tolerant HPC applications
- Author
-
Osman Unsal, Omer Subasi, Ferad Zyulkyarov, Gulay Yalcin, Jesús Labarta, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions, and Barcelona Supercomputing Center
- Subjects
Reliability theory ,Computer science ,Distributed computing ,Reliability (computer networking) ,Markov process ,Computer crashes ,02 engineering and technology ,Fault-tolerant computing ,symbols.namesake ,Resource (project management) ,Software ,Mathematical model ,Hardware ,0202 electrical engineering, electronic engineering, information engineering ,Informàtica::Arquitectura de computadors [Àrees temàtiques de la UPC] ,020203 distributed computing ,Tolerància als errors (Informàtica) ,Parallel processing (Electronic computers) ,business.industry ,Markov processes ,Processament en paral·lel (Ordinadors) ,Fault tolerance ,Computational modeling ,Supercomputer ,Replication (computing) ,020202 computer hardware & architecture ,symbols ,business
- Abstract
Fail-stop errors and Silent Data Corruptions (SDCs) are the most common failure modes for High Performance Computing (HPC) applications. There are studies that address fail-stop errors and studies that address SDCs. However, few studies address both types of errors together. In this paper, we propose a software-based selective replication technique for HPC applications for both fail-stop errors and SDCs. Since complete replication of applications can be costly in terms of resources, we develop a runtime-based technique for selective replication. Selective replication provides an opportunity to meet HPC reliability targets while decreasing resource costs. Our technique is low-overhead, automatic and completely transparent to the user. This work is supported in part by the European Union Mont-blanc 2 Project (www.montblanc-project.eu), grant agreement no. 610402 and the FEDER funds under contract TIN2015-65316-P.
- Published
- 2017
58. Spatial support vector regression to detect silent errors in the exascale era
- Author
-
Leonardo Bautista-Gomez, Adrian Cristal, Sheng Di, Omer Subasi, Jesús Labarta, Franck Cappello, Osman Unsal, Prasanna Balaprakash, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Barcelona Supercomputing Center, and Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions
- Subjects
Detection sensitivity ,Fold (higher-order function) ,Computer science ,Errors ,Real-time computing ,Budget control ,Silent data corruptions ,Support vector regression (SVR) ,Increasing capacities ,02 engineering and technology ,State-of-the-art techniques ,Overhead (business) ,020204 information systems ,Computer cluster ,0202 electrical engineering, electronic engineering, information engineering ,Informàtica::Arquitectura de computadors [Àrees temàtiques de la UPC] ,020203 distributed computing ,Support vector machines ,Detector ,Fault tolerance ,High performance computing systems ,Distributed computer systems ,Support vector machine ,Benchmarking ,Exascale ,Snapshot (computer storage) ,False positive rate ,High performance computing ,Support vector machine regressions ,Cluster computing ,Càlcul intensiu (Informàtica)
- Abstract
As the exascale era approaches, the increasing capacity of high-performance computing (HPC) systems with targeted power and energy budget goals introduces significant challenges in reliability. Silent data corruptions (SDCs), or silent errors, are one of the major sources that corrupt the execution results of HPC applications without being detected. In this work, we explore a low-memory-overhead SDC detector, by leveraging epsilon-insensitive support vector machine regression, to detect SDCs that occur in HPC applications that can be characterized by an impact error bound. The key contributions are threefold. (1) Our design takes spatial features (i.e., neighbouring data values for each data point in a snapshot) into training data, such that little memory overhead (less than 1%) is introduced. (2) We provide an in-depth study on the detection ability and performance with different parameters, and we optimize the detection range carefully. (3) Experiments with eight real-world HPC applications show that our detector can achieve a detection sensitivity (i.e., recall) of up to 99% while incurring a false positive rate of less than 1% in most cases. Our detector incurs low performance overhead, 5% on average, for all benchmarks studied in the paper. Compared with other state-of-the-art techniques, our detector exhibits the best tradeoff considering the detection ability and overheads. This work was supported by the U.S. Department of Energy, Office of Science, Advanced Scientific Computing Research Program, under Contract DE-AC02-06CH11357, by FI-DGR 2013 scholarship, by HiPEAC PhD Collaboration Grant, the European Community’s Seventh Framework Programme [FP7/2007-2013] under the Mont-blanc 2 Project (www.montblanc-project.eu), grant agreement no. 610402, and TIN2015-65316-P.
- Published
- 2016
59. A runtime heuristic to selectively replicate tasks for application-specific reliability targets
- Author
-
Ferad Zyulkyarov, Gulay Yalcin, Jesús Labarta, Omer Subasi, Osman Unsal, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Barcelona Supercomputing Center, and Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions
- Subjects
Computer science ,Distributed computing ,Reliability (computer networking) ,Task parallelism ,02 engineering and technology ,computer.software_genre ,01 natural sciences ,0103 physical sciences ,0202 electrical engineering, electronic engineering, information engineering ,Informàtica::Arquitectura de computadors [Àrees temàtiques de la UPC] ,010302 applied physics ,020203 distributed computing ,Parallel processing (Electronic computers) ,Heuristic ,Selective replication ,Processament en paral·lel (Ordinadors) ,Dataflow programming ,HPC and exascale computing ,Replicate ,Supercomputer ,Replication (computing) ,Task (computing) ,Scalability ,Compiler ,computer
- Abstract
In this paper we propose a runtime-based selective task replication technique for task-parallel high performance computing applications. Our selective task replication technique is automatic and does not require modification/recompilation of OS, compiler or application code. Our heuristic, which we call App_FIT, selects tasks to replicate such that the specified reliability target for an application is achieved. In our experimental evaluation, we show that the App_FIT selective replication heuristic is low-overhead and highly scalable. In addition, results indicate that complete task replication is overkill for achieving reliability targets. We show that with App_FIT, we can tolerate pessimistic exascale error rates with only 53% of the tasks being replicated. This work was supported by FI-DGR 2013 scholarship and the European Community’s Seventh Framework Programme [FP7/2007-2013] under the Mont-blanc 2 Project (www.montblanc-project.eu), grant agreement no. 610402 and in part by the European Union (FEDER funds) under contract TIN2015-65316-P.
- Published
- 2016
60. Marriage Between Coordinated and Uncoordinated Checkpointing for the Exascale Era
- Author
-
Osman Unsal, Jesús Labarta, Ferad Zyulkyarov, and Omer Subasi
- Subjects
Unified system ,Computer science ,Distributed computing ,Data_FILES ,Programming paradigm ,Leverage (statistics) ,Fault tolerance ,Parallel computing
- Abstract
The state-of-the-art checkpointing techniques are projected to be prohibitively expensive in the Exascale era. These techniques are most often holistic in nature, which prevents them from leveraging the programming-model- and paradigm-specific advantages that would make them viable for the Exascale era. In this work, we present a unified non-hierarchical model to combine uncoordinated checkpointing with coordinated system-wide checkpointing to capitalize on programming-model-specific advantages. We develop closed-form formulas for performance improvement and the optimal checkpoint interval of the unified model in our analytical assessment. As an instantiation of our model, we propose to unify task-level checkpointing with a system-wide checkpointing scheme for task-parallel HPC applications. This instantiation has three distinct advantages: first, it reduces performance overheads by decreasing the frequency of checkpoints in the unified system; second, it features fast failure recovery by using in-memory task-local checkpoints instead of on-disk global checkpoints; and third, it does not compromise the high failure coverage typical of system-wide checkpointing.
- Published
- 2015
61. NanoCheckpoints: A Task-Based Asynchronous Dataflow Framework for Efficient and Scalable Checkpoint/Restart
- Author
-
Omer Subasi, Javier Arias, Osman Unsal, Jesus Labarta, and Adrian Cristal
- Subjects
Software ,Shared memory ,Asynchronous communication ,business.industry ,Computer science ,Dataflow ,Scalability ,Programming paradigm ,Leverage (statistics) ,Task parallelism ,Parallel computing ,business
- Abstract
In this paper, we present NanoCheckpoints, a lightweight software-based checkpoint/restart scheme for task-parallel HPC applications. We leverage OmpSs, a task-based OpenMP derivative programming model (PM), and its Nanos asynchronous dataflow runtime. NanoCheckpoints achieves minimal overheads by checkpointing only tasks' inputs, which are available for free in the OmpSs PM. We evaluate NanoCheckpoints with both pure task-parallel shared memory benchmarks (up to 16 cores) and hybrid OmpSs+MPI applications (up to 1024 cores). The results indicate that NanoCheckpoints has an average overhead of 3% for shared memory benchmarks. The dataflow semantics of Nanos, where both checkpointing and error recovery are asynchronous, allows NanoCheckpoints to scale at large core counts even when high error rates are present. For hybrid OmpSs+MPI benchmarks, NanoCheckpoints has very low overhead, on average 2%, and high scalability.
- Published
- 2015
62. Simplifying Linearizability Proofs with Reduction and Abstraction
- Author
-
Ali Sezgin, Shaz Qadeer, Serdar Tasiran, Omer Subasi, and Tayfun Elmas
- Subjects
Theoretical computer science ,Linearizability ,Computer science ,Programming language ,Concurrency ,Commit ,Construct (python library) ,Mathematical proof ,computer.software_genre ,Reduction (complexity) ,TheoryofComputation_LOGICSANDMEANINGSOFPROGRAMS ,Synchronization (computer science) ,computer ,Abstraction (linguistics)
- Abstract
The typical proof of linearizability establishes an abstraction map from the concurrent program to a sequential specification, and identifies the commit points of operations. If the concurrent program uses fine-grained concurrency and complex synchronization, constructing such a proof is difficult. We propose a sound proof system that significantly simplifies the reasoning about linearizability. Linearizability is proved by transforming an implementation into its specification within this proof system. The proof system combines reduction and abstraction, which increase the granularity of atomic actions, with variable introduction and hiding, which syntactically relate the representation of the implementation to that of the specification. We construct the abstraction map incrementally, and eliminate the need to reason about the location of commit points in the implementation. We have implemented our method in the QED verifier and demonstrated its effectiveness and practicality on several highly-concurrent examples from the literature.
- Published
- 2010
63. The Origin and Activities of the Mongol Neküderîs (Moğol Neküderîlerin Kökeni ve Faaliyetleri)
- Author
-
Ömer Subaşı
- Subjects
History of Civilization ,CB3-482
- Abstract
The Neküderîs were a nomadic community that lived in the territory of Afghanistan and, from the day they appeared on the stage of history, chose the eastern frontier of the Ilkhanid State as their field of activity. Having made a name for themselves in the middle of the 13th century through plundering raids carried out with a tümen of soldiers, the Neküderîs saw their name become synonymous with blood and tears in almost every city from Kirmân to Gazne. Although studies have been carried out on this community, which became the nightmare of the local population in the regions where it lived, the debate over its origin has continued for a long time. This study addresses the claims that the Neküderîs originated from the remnants of the armies of the Chagatai prince Neküder Oğul and the Ilkhanid ruler Ahmed Teküder, as well as the activities of Neküder Noyan after his arrival in Afghanistan. It then examines the political movements undertaken throughout Ilkhanid history by the Neküderî communities, which came to be known by the name of Neküder Noyan. In addition, the life of the Neküderîs around Herât, where they partly adopted a settled way of life, and in particular the duties they assumed among the Kert forces and their influence on the history of that dynasty, are examined in the light of sources written in the Ilkhanid lands, both general histories and histories of cities and regions.
- Published
- 2019
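64. Seljuk Traces in the South Caucasus in the 13th Century: 'The Institution of the Atabegate and the Atabegs' (XIII. Yüzyılda Güney Kafkasya’da Selçuklu İzleri: 'Atabeglik Müessesesi ve Atabegler')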
- Author
-
Ömer Subaşı
- Subjects
güney kafkasya ,selçuklu ,atabeg ,mkhargrdzeli ,mankaberdeli ,cakeli ,south caucasus ,seljuk ,History (General) ,D1-2009
- Abstract
After the death of Commander-in-Chief Zakaria Mkhargrdzeli in 1212, Queen Tamara of Georgia offered the post of commander-in-chief to his brother İvane Mkhargrdzeli. In response to this offer, İvane asked the Queen to grant him the title of atabeg, a Seljuk legacy used in almost all the states of the region. The Queen, quite surprised by the request for a title that had been unprecedented in Georgian history up to that day, ultimately agreed to grant İvane the title of atabeg. From then on this title, which would be much talked about and would generally survive by passing from father to son, began to appear in the political life of the South Caucasus. After the death of Atabeg Avak of the Mkhargrdzeli dynasty, members of the Mankaberdeli family became the representatives of the atabeg institution in the South Caucasus. The institution of the atabegate, which fell silent toward the end of the 13th century, came into use from the beginning of the 14th century in the Land of the Atabegs (Atabegler Yurdu) under the rule of the Cakeli family. Taking the sources of the period into account, this study examines the place, importance, and struggle for existence in the political life of the South Caucasus of the atabegate, one of the indispensable titles of the Seljuk state structure.
- Published
- 2017
65. CRC-based memory reliability for task-parallel HPC applications
- Author
-
Adrian Cristal, Omer Subasi, Jesús Labarta, Gulay Yalcin, Osman Unsal, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions, and Barcelona Supercomputing Center
- Subjects
Computer science ,Error correction capability ,Distributed computing ,Errors ,Memory reliability ,Task parallelism ,02 engineering and technology ,Mathematical analysis ,Hardware ,Hardware acceleration ,Cyclic redundancy check ,0202 electrical engineering, electronic engineering, information engineering ,Error correction ,Informàtica::Arquitectura de computadors::Arquitectures paral·leles [Àrees temàtiques de la UPC] ,Application programs ,020203 distributed computing ,Parallel processing (Electronic computers) ,Processament en paral·lel (Ordinadors) ,Fault tolerance ,Dataflow model ,Reliability ,Reconfigurable computing ,Reconfigurable hardware ,020202 computer hardware & architecture ,Software-based solutions ,Scalability ,Error detection and correction ,Data flow analysis
- Abstract
Memory reliability will be one of the major concerns for future HPC and Exascale systems. This concern is mostly attributed to the expected massive increase in memory capacity and the number of memory devices in Exascale systems. For memory systems, Error Correcting Codes (ECC) are the most commonly used mechanism. However, state-of-the-art hardware ECCs will not be sufficient in terms of error coverage for future computing systems, and stronger hardware ECCs providing more coverage have prohibitive costs in terms of area, power and latency. Software-based solutions are needed to cooperate with hardware. In this work, we propose a Cyclic Redundancy Check (CRC) based software mechanism for task-parallel HPC applications. Our mechanism incurs only 1.7% performance overhead with hardware acceleration while being highly scalable at large scale. Our mathematical analysis demonstrates the effectiveness of our scheme and its error coverage. Results show that our CRC-based mechanism reduces the memory vulnerability by 87% on average with up to 32-bit burst (consecutive) and 5-bit arbitrary error correction capability. This work was supported by FI-DGR 2013 scholarship and the European Community’s Seventh Framework Programme [FP7/2007-2013] under the Mont-blanc 2 Project (www.montblanc-project.eu), grant agreement no. 610402 and TIN2015-65316-P.