29 results for "Angelos Bilas"
Search Results
2. EVOLVE: Towards Converging Big-Data, High-Performance and Cloud-Computing Worlds
- Author
-
Achilleas Tzenetopoulos, Dimosthenis Masouros, Konstantina Koliogeorgi, Sotirios Xydis, Dimitrios Soudris, Antony Chazapis, Christos Kozanitis, Angelos Bilas, Christian Pinto, Huy-Nam Nguyen, Stelios Louloudakis, Georgios Gardikis, George Vamvakas, Michelle Aubrun, Christy Symeonidou, Vassilis Spitadakis, Konstantinos Xylogiannopoulos, Bernhard Peischl, Tahir Emre Kalayci, Alexander Stocker, and Jean-Thomas Acquaviva
- Published
- 2022
3. Skynet: Performance-driven Resource Management for Dynamic Workloads
- Author
-
Manolis Marazakis, Angelos Bilas, and Yannis Sfakianakis
- Subjects
Computer science, Control theory, Quality of service, Distributed computing, Skynet, Resource allocation, Throughput, Resource management, Cloud computing, Dynamic priority scheduling
- Abstract
A primary concern for cloud operators is to increase resource utilization while maintaining good performance for applications. This is particularly difficult to achieve for three reasons: users tend to overprovision applications, applications are diverse and dynamic, and their performance depends on multiple resources. In this paper, we present Skynet, an automated and adaptive cloud resource management approach that addresses all three concerns. Skynet uses performance level objectives (PLOs) to capture user intentions about required performance more accurately and to remove the user from the resource allocation loop. Then, Skynet estimates the resources required to achieve the target PLO. For this purpose, we employ a Proportional Integral Derivative (PID) controller per application and adjust its parameters on the fly. Finally, to capture the dependence of applications on different or multiple resources, Skynet extends the traditional one-dimensional PID controller to estimate CPU, memory, I/O throughput, and network throughput. Essentially, Skynet builds a model on-the-fly to map target PLOs to resources for each application, taking into account multiple resources and changing input load. We implement Skynet as an end-to-end, custom scheduler in Kubernetes and evaluate it using real workloads on both a private cluster and AWS. Skynet decreases PLO violations by more than 7.4x and increases resource utilization by more than 2x, compared to Kubernetes.
- Published
- 2021
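The multi-dimensional PID control loop described in the Skynet abstract above can be sketched as follows. This is a minimal, illustrative model under my own assumptions; the class, gains, and function names are mine, not the paper's, and the real system tunes its gains on the fly.

```python
class PIDController:
    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, error, dt=1.0):
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

def adjust_allocation(allocation, measured, plo_target, controllers):
    """Grow or shrink every resource dimension based on the PLO error.

    error > 0 means the measured metric (e.g. latency) exceeds the PLO
    target, so allocations grow; error < 0 shrinks them.
    """
    error = (measured - plo_target) / plo_target
    return {res: max(0.0, alloc * (1.0 + controllers[res].update(error)))
            for res, alloc in allocation.items()}

# One controller per resource dimension, mirroring the extension of the
# one-dimensional PID to CPU, memory, I/O throughput, and network throughput.
controllers = {r: PIDController(kp=0.5, ki=0.1, kd=0.05)
               for r in ("cpu", "memory", "io_bw", "net_bw")}
alloc = {"cpu": 2.0, "memory": 4.0, "io_bw": 100.0, "net_bw": 1000.0}
# Measured latency of 120 ms against a 100 ms PLO target: allocations grow.
alloc = adjust_allocation(alloc, measured=120.0, plo_target=100.0,
                          controllers=controllers)
```

Each iteration of such a loop maps the current PLO error to an allocation change, which is what lets the user state a performance target instead of a resource request.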
4. TReM: A Task Revocation Mechanism for GPUs
- Author
-
Angelos Bilas, Stelios Mavridis, Manos Pavlidakis, and Nikos Chrysos
- Subjects
Source code, Revocation, Computer science, Distributed computing, Preemption, Cloud computing, Supercomputer, Scheduling (computing), Task (computing), Overhead (computing)
- Abstract
GPUs in datacenters and cloud environments are mainly offered in a dedicated manner to applications, which leads to GPU under-utilization. Previous work has focused on increasing utilization by sharing GPUs across batch and user-facing tasks. With the presence of long-running tasks, scheduling approaches without GPU preemption fail to meet the SLA of user-facing tasks. Existing GPU preemption mechanisms introduce variable delays of up to several seconds, which is intolerable, or require kernel source code, which is not always available. In this paper, we design TReM, a GPU revocation mechanism that stops a task at any point in its execution. TReM has a constant latency of about 5ms to stop the currently executing kernel and about 17ms to start a new task. TReM does not store the state of the revoked kernel, to obviate transfer latencies. We design and implement two scheduling policies, Priority and Elastic, that prioritize user-facing over batch tasks and utilize TReM to improve SLAs for user-facing tasks. To evaluate TReM, we use a workload generator that creates workloads with different characteristics, based on real traces. TReM reduces SLA violations by up to 10% compared to baseline policies that do not use a revocation mechanism. TReM incurs negligible overhead for non-revoked tasks and wastes only 3% of computation due to revocations for the workloads we examine.
- Published
- 2020
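A Priority-style policy built on a revocation primitive like TReM's can be sketched as below. All API names here are assumptions for illustration; the point is only the decision rule: a user-facing task revokes a running batch task instead of queuing behind it, and the revoked task restarts from scratch later because no kernel state is saved.

```python
REVOKE_LATENCY_MS = 5 + 17  # ~5 ms to stop the kernel, ~17 ms to start the next task

class GPU:
    def __init__(self):
        self.running = None  # (task_name, kind) or None

    def submit(self, task, kind):
        self.running = (task, kind)

    def revoke(self):
        """Stop the current kernel without saving its state."""
        revoked = self.running
        self.running = None
        return revoked

def schedule(gpu, task, kind, requeue):
    """Priority policy: user-facing tasks preempt batch tasks via revocation."""
    if gpu.running is None:
        gpu.submit(task, kind)
        return 0  # no extra latency
    _, running_kind = gpu.running
    if kind == "user-facing" and running_kind == "batch":
        requeue.append(gpu.revoke())   # batch task will restart from scratch
        gpu.submit(task, kind)
        return REVOKE_LATENCY_MS
    requeue.append((task, kind))       # otherwise wait in the queue
    return None

requeue = []
gpu = GPU()
schedule(gpu, "training-job", "batch", requeue)
delay = schedule(gpu, "inference", "user-facing", requeue)  # revokes the batch task
```

The constant, small revocation latency is what makes this policy viable; preemption mechanisms with multi-second variable delays would break the user-facing SLA this rule is trying to protect.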
5. DyRAC: Cost-aware Resource Assignment and Provider Selection for Dynamic Cloud Workloads
- Author
-
Manolis Marazakis, Angelos Bilas, and Yannis Sfakianakis
- Subjects
Service (business), Computer science, Distributed computing, Workload, Cloud computing, Dynamic priority scheduling, Total cost of ownership, Cost reduction, Resource (project management), Software deployment
- Abstract
A primary concern for cloud users is how to minimize the total cost of ownership of cloud services. This is not trivial to achieve due to workload dynamics. Users need to select the number, size, type of VMs, and the provider to host their services based on available offerings. To avoid the complexity of re-configuring a cloud service, related work commonly approaches cost minimization as a packing problem that minimizes the resources allocated to services. However, this approach does not consider two problem dimensions that can further reduce cost: (1) provider selection and (2) VM sizing. In this paper, we explore a more direct approach to cost minimization by adjusting the type, number, size of VM instances, and the provider of a cloud service (i.e. a service deployment) at runtime. Our goal is to identify the limits in service cost reduction by online re-deployment of cloud services. For this purpose, we design DyRAC, an adaptive resource assignment mechanism for cloud services that, given the resource demands of a cloud service, estimates the most cost-efficient deployment. Our evaluation implements four different resource assignment policies to provide insight into how our approach works, using VM configurations of actual offerings from main providers (AWS, GCP, Azure). Our experiments show that DyRAC reduces cost by up to 33% compared to typical strategies.
- Published
- 2020
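The deployment-search idea behind DyRAC can be sketched as a small enumeration: given the service's current resource demand, consider each (provider, VM type, count) combination and keep the cheapest one that covers the demand. The providers, instance shapes, and prices below are made-up placeholders, not actual offerings, and a real policy would also weigh re-deployment cost.

```python
OFFERINGS = {  # provider -> vm_type -> (vcpus, mem_gb, hourly_price); illustrative only
    "ProviderA": {"small": (2, 4, 0.05), "large": (8, 16, 0.18)},
    "ProviderB": {"medium": (4, 8, 0.09), "xlarge": (16, 32, 0.40)},
}

def cheapest_deployment(cpu_demand, mem_demand, max_vms=16):
    """Return (cost, provider, vm_type, count) for the cheapest covering deployment."""
    best = None
    for provider, types in OFFERINGS.items():
        for vm_type, (vcpus, mem, price) in types.items():
            for count in range(1, max_vms + 1):
                if count * vcpus >= cpu_demand and count * mem >= mem_demand:
                    cost = count * price
                    if best is None or cost < best[0]:
                        best = (cost, provider, vm_type, count)
                    break  # more VMs of this type only cost more
    return best

# Re-evaluated at runtime as the workload's demand changes:
best = cheapest_deployment(cpu_demand=10, mem_demand=20)
```

Re-running this search whenever demand shifts is what distinguishes the approach from one-time packing: the type, number, size, and provider of the VMs can all change online.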
6. NanoStreams: Codesigned microservers for edge analytics in real time
- Author
-
Matthew Russell, Ahmad Hassan, Paul Barber, Ivor Spence, Umar Ibrahim Minhas, Giorgis Georgakoudis, Dimitrios S. Nikolopoulos, Heiner Giefers, Neil Horlock, Angelos Bilas, George Tzenakis, Peter Staar, Hans Vandierendonck, Costas Bekas, Colin Pattison, Murali Shyamsundar, Roger Woods, Stelios Kaloutsakis, Charles J. Gillan, and Richard Faloon
- Subjects
Computer science, Complex event processing, Instruction set, Modelling and Simulation, Server, Field-programmable gate array, Edge computing, Xeon, Provisioning, Computer architecture, Embedded system, System software
- Abstract
NanoStreams explores the design, implementation, and system software stack of micro-servers aimed at processing data in-situ and in real time. These micro-servers can serve the emerging Edge computing ecosystem, namely the provisioning of advanced computational, storage, and networking capability near data sources to achieve both low-latency event processing and high-throughput analytical processing, before considering off-loading some of this processing to high-capacity data centres. NanoStreams explores a scale-out micro-server architecture that can achieve QoS equivalent to that of conventional rack-mounted servers in high-capacity data centres, but with dramatically reduced form factors and power consumption. To this end, NanoStreams introduces novel solutions in programmable and configurable hardware accelerators, as well as the system software stack used to access, share, and program those accelerators. Our NanoStreams micro-server prototype has demonstrated 5.5x higher energy-efficiency than a standard Xeon server. Simulations of the micro-server's memory system, extended to leverage hybrid DDR/NVM main memory, indicated 5x higher energy-efficiency than a conventional DDR-based system.
- Published
- 2016
7. Reducing CPU and network overhead for small I/O requests in network storage protocols over raw Ethernet
- Author
-
Angelos Bilas and Pilar González-Férez
- Subjects
Ethernet, Computer science, Embedded system, Multipath I/O, InfiniBand, CPU time, IOPS, Communications protocol, ATA over Ethernet, Context switch
- Abstract
Small I/O requests are important for a large number of modern workloads in the data center. Traditionally, storage systems have been able to sustain only low I/O rates for small I/O operations because of hard disk drive (HDD) limitations: each spindle is capable of about 100–150 IOPS (I/O operations per second). Therefore, the host CPU processing capacity and network link throughput have been relatively abundant for providing these low rates. With new storage device technologies, such as NAND Flash Solid State Drives (SSDs) and non-volatile memory (NVM), it is becoming common to design storage systems that are able to support millions of small IOPS. At these rates, however, both the server CPU and the network protocol are emerging as the main bottlenecks for achieving high rates for small I/O requests. Most storage systems in datacenters deliver I/O operations over some network protocol. Although there has been extensive work on low-latency and high-throughput networks, such as InfiniBand, Ethernet has dominated the datacenter. In this work we examine how networked storage protocols over raw Ethernet can achieve low host CPU overhead and increased network link efficiency for small I/O requests. We first analyze in detail the latency and overhead of a networked storage protocol directly over Ethernet and point out the main inefficiencies. Then, we examine how storage protocols can take advantage of context switch elimination and adaptive batching to reduce CPU and network overhead. Our results show that raw Ethernet is appropriate for supporting fast storage systems. For 4kB requests we reduce server CPU overhead by up to 45% and improve link utilization by up to 56%, achieving more than 88% of the theoretical link throughput. Effectively, our techniques serve 56% more I/O operations over a 10Gbits/s link than a baseline protocol that does not include our optimizations, at the same CPU utilization.
Overall, to the best of our knowledge, this is the first work to present a system that is able to achieve 14μs host CPU overhead on both initiator and target for small networked I/Os over raw Ethernet without hardware support. In addition, our approach is able to achieve 287K 4kB IOPS out of the 315K IOPS that are theoretically possible over a 1.2GBytes/s link.
- Published
- 2015
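The adaptive-batching idea mentioned in the abstract above can be illustrated with a toy coalescer. The growth/shrink rule and thresholds here are my own simplification, not the paper's protocol: under load the sender waits for larger batches (amortizing per-frame processing), and when the link goes idle it falls back to small batches to keep latency low.

```python
class AdaptiveBatcher:
    def __init__(self, max_batch=8):
        self.max_batch = max_batch
        self.batch_target = 1   # requests to coalesce into one frame
        self.queue = []
        self.sent = []          # sizes of batches actually sent

    def submit(self, req):
        self.queue.append(req)
        if len(self.queue) >= self.batch_target:
            self._send()
            # Load is keeping up with the target: try a larger batch next time.
            self.batch_target = min(self.max_batch, self.batch_target * 2)

    def idle_timeout(self):
        # The link went quiet: flush what we have, fall back to small batches.
        self._send()
        self.batch_target = max(1, self.batch_target // 2)

    def _send(self):
        if self.queue:
            self.sent.append(len(self.queue))
            self.queue.clear()

b = AdaptiveBatcher()
for i in range(15):          # a burst of 15 small requests...
    b.submit(i)
b.idle_timeout()
# ...goes out in 4 frames instead of 15, amortizing per-frame overhead.
```
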
8. EUROSERVER: Energy Efficient Node for European Micro-Servers
- Author
-
Yves Durand, Paul M. Carpenter, Manolis Marazakis, Iakovos Mavroidis, Alexis Farcy, John Goodacre, Georgi Gaydadjiev, Manolis Katevenis, Stefano Adami, Emil Matus, Angelos Bilas, John Thomson, and Denis Dutoit
- Subjects
Computer science, Node (networking), Distributed computing, Cloud computing, Virtualization, Shared resource, Server, Systems architecture, Resource management, Data center, Computer network
- Abstract
EUROSERVER is a collaborative project that aims to dramatically improve data centre energy-efficiency, cost, and software efficiency. It is addressing these important challenges through the coordinated application of several key recent innovations: 64-bit ARM cores, 3D heterogeneous silicon-on-silicon integration, and fully-depleted silicon-on-insulator (FD SOI) process technology, together with new software techniques for efficient resource management, including resource sharing and workload isolation. We are pioneering a system architecture approach that allows specialized silicon devices to be built even for low-volume markets where NRE costs are currently prohibitive. The EUROSERVER device will embed multiple silicon "chiplets" on an active silicon interposer. Its system architecture is being driven by requirements from three use cases: data centres and cloud computing, telecom infrastructures, and high-end embedded systems. We will build two fully integrated full-system prototypes, based on a common micro-server board, and targeting embedded servers and enterprise servers.
- Published
- 2014
9. Jericho: Achieving scalability through optimal data placement on multicore systems
- Author
-
Angelos Bilas, Yannis Sfakianakis, Manolis Marazakis, Stelios Mavridis, and Anastasios Papagiannis
- Subjects
Computer science, Linux kernel, Memory bandwidth, IOPS, Thread (computing), Parallel computing, Server, Scalability, Operating system, Cache, DRAM
- Abstract
Achieving high I/O throughput on modern servers presents significant challenges. With increasing core counts, server memory architectures become less uniform, both in terms of latency as well as bandwidth. In particular, the bandwidth of the interconnect among NUMA nodes is limited compared to local memory bandwidth. Moreover, interconnect congestion and contention introduce additional latency on remote accesses. These challenges severely limit the maximum achievable storage throughput and IOPS rate. Therefore, data and thread placement are critical for data-intensive applications running on NUMA architectures. In this paper we present Jericho, a new I/O stack for the Linux kernel that improves affinity between application threads, kernel threads, and buffers in the storage I/O path. Jericho consists of a NUMA-aware filesystem and a DRAM cache organized in slices mapped to NUMA nodes. The Jericho filesystem implements our task placement policy by dynamically migrating application threads that issue I/Os based on the location of the corresponding I/O buffers. The Jericho DRAM I/O cache, a replacement for the Linux page-cache, splits buffer memory into slices and uses per-slice kernel I/O threads for I/O request processing. Our evaluation shows that running the FIO microbenchmark on a modern 64-core server with an unmodified Linux kernel results in only 5% of the memory accesses being served by local memory. With Jericho, more than 95% of accesses become local, with a corresponding 2x performance improvement.
- Published
- 2014
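The placement policy in the Jericho entry above can be reduced to a tiny model. This sketch is illustrative (the address-to-slice striping and the counters are my own assumptions): each I/O buffer maps to a cache slice owned by one NUMA node, and migrating the issuing thread to that node turns would-be remote accesses into local ones.

```python
NUM_SLICES = 4  # one DRAM-cache slice per NUMA node, illustrative

def slice_of(buffer_addr):
    """Map an I/O buffer to a cache slice (hence a NUMA node) by 4 KB page."""
    return (buffer_addr >> 12) % NUM_SLICES

def run(num_ios=100, migrate=True):
    """Count how many I/Os hit memory local to the issuing thread."""
    node = 0          # node the application thread currently runs on
    local = 0
    for page in range(num_ios):
        target = slice_of(page * 4096)   # node owning this I/O's buffer
        if migrate:
            node = target                # move the thread toward its data
        if node == target:
            local += 1
    return local

# Without migration only the accesses that happen to land on the thread's
# node are local; with migration every access becomes local.
pinned, migrated = run(migrate=False), run(migrate=True)
```
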
10. Tyche: An efficient Ethernet-based protocol for converged networked storage
- Author
-
Angelos Bilas and Pilar González-Férez
- Subjects
Ethernet, Storage area network, Direct-attached storage, Computer science, EMC Invista, Server, RDMA over Converged Ethernet, Converged storage, ATA over Ethernet, Computer network
- Abstract
Current technology trends for efficient use of infrastructures dictate that storage converges with computation by placing storage devices, such as NVM-based cards and drives, in the servers themselves. With converged storage, the role of the interconnect among servers becomes more important for achieving high I/O throughput. Given that Ethernet is emerging as the dominant technology for datacenters, it becomes imperative to examine how to reduce protocol overheads for accessing remote storage over Ethernet interconnects. In this paper we propose Tyche, a network storage protocol directly on top of Ethernet, which does not require any hardware support from the network interface. Therefore, Tyche can be deployed in existing infrastructures and can co-exist with other Ethernet-based protocols. Tyche presents remote storage as a local block device and can support any existing filesystem. At the heart of our approach are two main axes: reduction of host-level overheads, and scaling with the number of cores and network interfaces in a server. Both target achieving high I/O throughput in future servers. We reduce overheads via a copy-reduction technique, storage-specific packet processing, pre-allocation of memory, and RDMA-like operations that do not require hardware support. We transparently handle multiple NICs and offer improved scaling with the number of links and cores via reduced synchronization, proper packet queue design, and NUMA affinity management. Our results show that Tyche achieves scalable I/O throughput, up to 6.4 GB/s for reads and 6.8 GB/s for writes with six 10-GigE NICs. Our analysis shows that although multiple aspects of the protocol play a role in performance, NUMA affinity is particularly important. Compared to NBD, Tyche performs better by up to one order of magnitude.
- Published
- 2014
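The NUMA-affinity management that the Tyche abstract singles out can be sketched as a placement rule. The topology tables and function below are hypothetical, not from the paper: each NIC's packet queues, and the threads serving them, are assigned to cores on the NUMA node the NIC is attached to, keeping the I/O path off the inter-node interconnect.

```python
NIC_NODE = {"eth0": 0, "eth1": 0, "eth2": 1, "eth3": 1}   # NIC -> NUMA node (assumed)
CORES_PER_NODE = {0: [0, 1, 2, 3], 1: [4, 5, 6, 7]}       # node -> cores (assumed)

def assign_queues(nics, queues_per_nic=2):
    """Round-robin each NIC's queues over the cores of its own NUMA node."""
    plan = {}
    next_core = {node: 0 for node in CORES_PER_NODE}
    for nic in nics:
        node = NIC_NODE[nic]
        cores = CORES_PER_NODE[node]
        for q in range(queues_per_nic):
            core = cores[next_core[node] % len(cores)]
            next_core[node] += 1
            plan[(nic, q)] = (node, core)   # queue (nic, q) served on this core
    return plan

plan = assign_queues(["eth0", "eth1", "eth2", "eth3"])
```

A placement like this keeps packet buffers, queue state, and the serving thread on one node, which is the property the abstract credits for much of the protocol's scaling.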
11. Task-based parallel H.264 video encoding for explicit communication architectures
- Author
-
Dimitrios S. Nikolopoulos, Angelos Bilas, Michail Alvanos, and George Tzenakis
- Subjects
Speedup, Computer science, Data parallelism, Task parallelism, Parallel computing, Computer architecture, Motion estimation, Entropy (information theory), Granularity, Instruction-level parallelism, Encoder
- Abstract
Future multi-core processors will necessitate exploitation of fine-grain, architecture-independent parallelism from applications to utilize many cores with relatively small local memories. We use c264, an end-to-end H.264 video encoder for the Cell processor based on x264, to show that exploiting fine-grain parallelism remains challenging and requires significant advancement in runtime support. Our implementation of c264 achieves speedups between 4.7× and 8.6× on six synergistic processing elements (SPEs), compared to the serial version running on the power processing element (PPE). We find that the programming effort associated with efficient parallelization of c264 at fine granularity is highly non-trivial. Hand optimizations may improve performance significantly but are eventually limited by the code restructuring they require. We assess the complexity of exploiting fine-grain parallelism in realistic applications by identifying optimizations of c264 and the effort they require.
- Published
- 2011
12. Cloud-based synchronization of distributed file system hierarchies
- Author
-
Sandesh Uppoor, Angelos Bilas, and Michail D. Flouris
- Subjects
Computer science, Distributed computing, Replica, Distributed data store, Synchronization (computer science), Cloud computing, Data synchronization, The Internet, Distributed File System, File synchronization, Computer network
- Abstract
As the number of user-managed devices continues to increase, synchronizing multiple file hierarchies distributed over devices with ad hoc connectivity is becoming a significant problem. In this paper, we propose a new approach for efficient cloud-based synchronization of an arbitrary number of distributed file system hierarchies. Our approach combines the advantages of peer-to-peer synchronization with those of the cloud-based approach that stores a master replica online. In contrast to the latter, we do not assume storage of any user data in the cloud, and thus address the related capacity, cost, security, and privacy limitations. Finally, the proposed system performs data synchronization in a peer-to-peer manner, eliminating the cost and bandwidth concerns that arise in the "cloud master-replica" approach.
- Published
- 2010
13. Foreword
- Author
-
Dimitrios S. Nikolopoulos, Angelos Bilas, and Ricardo Bianchini
- Subjects
Physics, Cluster (physics), Astrophysics
- Published
- 2010
14. FLASH: Fine-Grained Localization in Wireless Sensor Networks Using Acoustic Sound Transmissions and High Precision Clock Synchronization
- Author
-
Angelos Bilas and Evangelos Mangas
- Subjects
Range (mathematics), Computer science, Real-time computing, Radio frequency, Interrupt, Telecommunications, Wireless sensor network, Synchronization, Clock synchronization
- Abstract
Sensor localization in wireless sensor networks is an important component of many applications. Previous work has demonstrated how localization can be achieved using various methods. In this paper we focus on achieving fine-grained localization that does not require external infrastructure, specialized hardware support, or excessive sensor resources. We use a real sensor network and provide measurements on the actual system. We adopt a localization approach that relies on acoustic sounds and clock synchronization. The contribution of our work is achieving consistent sound pulse detection at each sensor and precise range estimation using a high-precision clock synchronization implementation. We first describe our technique and then evaluate our approach using a real setup. Our results show that our approach achieves an average clock synchronization accuracy of 5μs. We verify this accuracy against an external global clock via an interrupt mechanism. Our sound detection technique is able to consistently identify sound pulses at distances of up to 10m in indoor environments. Combining the two techniques, we find that our localization method results in accurate range estimation with an average error of 11cm at distances up to 7m, and in consistent range estimation up to 10m, in various indoor environments.
- Published
- 2009
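The range-estimation step in the FLASH entry above follows directly from time of flight: with synchronized clocks, distance is the speed of sound times the emit-to-detect interval, and the synchronization error bounds one component of the range error. The constants and function names below are generic illustrations, not values from the paper.

```python
SPEED_OF_SOUND = 343.0  # m/s in air at ~20 °C (generic constant)

def estimate_range(t_emit, t_detect):
    """Distance from acoustic time of flight; timestamps in seconds,
    taken on clocks synchronized across the two sensors."""
    return SPEED_OF_SOUND * (t_detect - t_emit)

def sync_error_bound(clock_error_s):
    """Worst-case range error contributed by clock synchronization alone."""
    return SPEED_OF_SOUND * clock_error_s

d = estimate_range(t_emit=0.0, t_detect=0.0204)   # a pulse heard ~7 m away
err = sync_error_bound(5e-6)                      # 5 μs sync accuracy
```

With 5μs synchronization the clock contribution is under 2mm, which is consistent with the abstract's 11cm average error being dominated by sound-pulse detection rather than by synchronization.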
15. Providing security to the Desktop Data Grid
- Author
-
Angelos Bilas, Michail D. Flouris, Jesus Luna, and Manolis Marazakis
- Subjects
Cryptographic primitive, Data grid, Computer science, Distributed computing, Cryptographic protocol, Grid, Semantic grid, Grid computing, Backup, Storage security, Computer network
- Abstract
Volunteer computing is becoming a new paradigm not only for the computational grid, but also for institutions using production-level data grids, because of the enormous storage potential that may be achieved at low cost by using commodity hardware within their own computing premises. However, this novel "Desktop Data Grid" depends on a set of widely distributed and untrusted storage nodes, and therefore offers guarantees of neither availability nor protection for the stored data. These security challenges must be carefully managed before fully deploying desktop data grids in sensitive environments (such as eHealth) to cope with a broad range of storage needs, including backup and caching. In this paper we propose a cryptographic protocol able to fulfil the storage security requirements related to a generic desktop data grid scenario, which were identified by applying an analysis framework extended from our previous research on the data grid's storage services. The proposed protocol uses three basic mechanisms to accomplish its goal: (a) symmetric cryptography and hashing, (b) an information dispersal algorithm, and (c) the novel "quality of security" (QoSec) quantitative metric. Although the focus of this work is the associated protocol, we also present an early evaluation using an analytical model. Our results show a strong relationship between the assurance of the data at rest, the QoSec of the volunteer storage client, and the number of fragments required to rebuild the original file.
- Published
- 2008
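The information-dispersal ingredient of the protocol above can be illustrated with the simplest possible erasure code, as a stand-in for a general (m, n) IDA such as Rabin's: split the data into n-1 fragments plus one XOR parity fragment, so any n-1 of the n fragments stored on volunteer nodes suffice to rebuild the file. All names and the choice of code here are my own simplification, not the paper's scheme.

```python
from functools import reduce

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def disperse(data: bytes, n: int):
    """Split data into n-1 equal fragments plus one XOR parity fragment."""
    frag_len = -(-len(data) // (n - 1))                    # ceiling division
    padded = data.ljust(frag_len * (n - 1), b"\0")
    frags = [padded[i * frag_len:(i + 1) * frag_len] for i in range(n - 1)]
    frags.append(reduce(xor_bytes, frags))                 # parity fragment
    return frags

def rebuild(frags, lost_index, orig_len):
    """Recover the original data with fragment `lost_index` unavailable."""
    if lost_index < len(frags) - 1:                        # a data fragment is lost
        present = [f for i, f in enumerate(frags) if i != lost_index]
        missing = reduce(xor_bytes, present)               # XOR of the rest = lost one
        data_frags = frags[:lost_index] + [missing] + frags[lost_index + 1:-1]
    else:                                                  # only the parity is lost
        data_frags = frags[:-1]
    return b"".join(data_frags)[:orig_len]

secret = b"patient-record-0042"
frags = disperse(secret, n=4)    # each fragment goes to a different volunteer node
recovered = rebuild(frags, lost_index=1, orig_len=len(secret))
```

A real deployment would encrypt before dispersal and tolerate more than one lost node; the abstract's QoSec metric is precisely what would drive the choice of fragment count per node trust level.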
16. Exploiting spatial parallelism in Ethernet-based cluster interconnects
- Author
-
S. Passas, Angelos Bilas, G. Kotsis, and Sven Karlsson
- Subjects
Ethernet, Computer science, Network packet, Local area network, Throughput, Dynamic priority scheduling, Parallel computing, Polling, Scheduling (computing), Computer network
- Abstract
In this work we examine the implications of building a single logical link out of multiple physical links. We use MultiEdge to examine the throughput-CPU utilization tradeoffs and to examine how overheads and performance scale with the number and speed of links. We use low-level instrumentation to understand the associated overheads, we experiment with setups between 1 and 8 1-GBit/s links, and we contrast our results with a single 10-GBit/s link. We find that: (a) our base protocol achieves up to 65% of the nominal aggregate throughput; (b) replacing interrupts with polling significantly impacts only the multiple-link configurations, reaching 80% of nominal throughput; (c) the impact of copying on CPU overhead is significant, and removing copying results in up to 66% improvement in maximum throughput, reaching almost 100% of the nominal throughput; (d) scheduling packets over heterogeneous links requires simple but dynamic scheduling to account for different link speeds and varying load.
- Published
- 2008
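Finding (d) above, dynamic scheduling over heterogeneous links, can be sketched as a join-shortest-normalized-queue rule: send each packet on the link whose queue, scaled by link speed, will drain soonest. The structure below is my own illustration of that idea, not the paper's implementation; a static round-robin would instead overload the slow links.

```python
class Link:
    def __init__(self, name, gbit_per_s):
        self.name = name
        self.speed = gbit_per_s
        self.queued_bytes = 0

def pick_link(links, pkt_bytes):
    """Send on the link with the smallest completion time for this packet."""
    best = min(links, key=lambda l: (l.queued_bytes + pkt_bytes) / l.speed)
    best.queued_bytes += pkt_bytes
    return best

# One fast link bundled with four slow ones, as in a heterogeneous setup.
links = [Link("eth0", 10.0)] + [Link(f"eth{i}", 1.0) for i in range(1, 5)]
for _ in range(110):
    pick_link(links, pkt_bytes=1500)

shares = {l.name: l.queued_bytes for l in links}
# Traffic splits roughly in proportion to link speed: the 10-GBit/s link
# carries most of the bytes, while every 1-GBit/s link still gets some.
```
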
17. Workshop 9 introduction: The workshop on communication architecture for clusters - CAC 2008
- Author
-
Francisco J. Alfaro-Cortés, Angelos Bilas, José Duato, and Andrew Lumsdaine
- Published
- 2008
18. MultiEdge: An Edge-based Communication Subsystem for Scalable Commodity Servers
- Author
-
Angelos Bilas, G. Kotsis, Sven Karlsson, and S. Passas
- Subjects
Connection-oriented communication, Shared memory, Computer science, Server, Distributed computing, Scalability, Throughput, Communications protocol, Computer network
- Abstract
At the core of contemporary high-performance computer systems is the communication infrastructure. For this reason, there has been a lot of work on providing low-latency, high-bandwidth communication subsystems for clusters. In this paper, we introduce MultiEdge, a connection-oriented communication system designed for high-speed commodity hardware. MultiEdge provides support for end-to-end flow control, ordering, and reliable transmission. It transparently supports multiple physical links within a single connection. We use MultiEdge to examine the behavior of edge-based protocols using both micro-benchmarks and real-life shared memory applications. Our results show that MultiEdge is able to deliver about 88% of the nominal link throughput with a single 10-GBit/s link and more than 95% with multiple 1-GBit/s links. Our application results show that performing all of the communication protocol at the edge does not seem to cause any degradation in performance.
- Published
- 2007
19. Using Lightweight Transactions and Snapshots for Fault-Tolerant Services Based on Shared Storage Bricks
- Author
-
Michail D. Flouris, Angelos Bilas, and Renaud Lachaize
- Subjects
Computer science, Distributed computing, Journaling file system, Computer data storage, Disk array, Snapshot (computer storage), Fault tolerance, Application layer
- Abstract
To satisfy current and future application needs in a cost-effective manner, storage systems are evolving from monolithic disk arrays to networked storage architectures based on commodity components. So far, this architectural transition has mostly been envisioned as a way to scale capacity and performance. In this work we examine how the block-level interface exported by such networked storage systems can be extended to deal with reliability. Our goals are: (a) at the design level, to examine how strong reliability semantics can be offered at the block level; (b) at the implementation level, to examine the mechanisms required and how they may be provided in a modular and configurable manner. We first discuss how transactional-type semantics may be offered at the block level. We present a system design that uses the concept of atomic update intervals combined with existing block-level locking and snapshot mechanisms, in contrast to the more common journaling techniques. We discuss in detail the design of the associated mechanisms and the trade-offs and challenges of dividing the required functionality between the file system and the block-level storage. Our approach is based on a unified and thus non-redundant set of mechanisms for providing reliability at both the block and file level. Our design and implementation effectively provide a tunable, lightweight transaction mechanism to higher system and application layers. Finally, we describe how the associated protocols can be implemented in a modular way in a prototype storage system we are currently building. As our system is currently being implemented, we do not present performance results.
- Published
- 2006
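The "atomic update interval" concept above can be sketched with an in-memory block volume; the API names are invented for illustration, and a real system would use copy-on-write snapshots rather than a full copy. A snapshot is taken when the interval opens; if the interval aborts (e.g. a crash before commit), the volume rolls back to the snapshot, so all writes in the interval become visible entirely or not at all.

```python
class BlockVolume:
    def __init__(self, nblocks):
        self.blocks = [b""] * nblocks
        self.snapshot = None

    def begin_interval(self):
        self.snapshot = list(self.blocks)   # copy-on-write in a real system

    def write(self, bno, data):
        self.blocks[bno] = data

    def commit_interval(self):
        self.snapshot = None                # interval's writes are now durable

    def abort_interval(self):
        self.blocks = self.snapshot        # roll back to the interval's start
        self.snapshot = None

vol = BlockVolume(8)
vol.begin_interval()
vol.write(0, b"superblock v2")
vol.write(5, b"data")
vol.abort_interval()                        # failure before commit: both writes vanish
aborted_state = (vol.blocks[0], vol.blocks[5])

vol.begin_interval()
vol.write(0, b"superblock v2")
vol.write(5, b"data")
vol.commit_interval()                       # both writes become visible together
```

Because the snapshot and rollback live at the block level, a filesystem above gets all-or-nothing multi-block updates without maintaining its own journal, which is the division of labor the abstract argues for.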
20. Experiences from Debugging a PCIX-based RDMA-capable NIC
- Author
-
Vassilis Papaefstathiou, Angelos Bilas, Manolis Marazakis, and Giorgos Kalokairinos
- Subjects
Correctness, Remote direct memory access, Computer science, Event (computing), Network interface, Software, Debugging, Operating system, Interrupt, Implementation
- Abstract
Implementing and debugging high-performance network subsystems is a challenging task. In this paper, we present our experiences from developing and debugging a network interface card (NIC). Our NIC targets networked storage subsystems (Marazakis et al., 2006). For this purpose it mainly provides support for remote direct memory access (RDMA) writes, sender-side notification of RDMA write completion, and receiver-side interrupt generation. In our work we examine issues that arise during system implementation and debugging, both in terms of correctness and in terms of performance. We present an analysis of the individual problems we encounter and we discuss how we address each case. For most problems we encounter, it is not possible to rely on existing debugging tools. However, we find that most of the techniques we use in this process rely on collecting some form of event records from software or hardware components. We believe that such capabilities can be provided for independent hardware or software components in isolation, a fairly straightforward task, thus significantly simplifying the debugging process in complex systems of this nature.
- Published
- 2006
21. Behavior and performance of interactive multi-player game servers
- Author
-
Angelos Bilas, Andreas Moshovos, and A. Abdelkhalek
- Subjects
Sequential game, Computer science, Entertainment industry, Inter-process communication, Server farm, Server, Scalability, Operating system, Online transaction processing, Server-side, Computer network
- Abstract
With the recent explosion in deployment of services to large numbers of customers over the Internet, and in global services in general, issues related to the architecture of scalable servers are becoming increasingly important. However, our understanding of these types of applications is currently limited, especially of how well they scale to support large numbers of users. One such novel, commercial class of applications is interactive, multi-player game servers. Multi-player games are both an important class of commercial applications (in the entertainment industry) and valuable in understanding the architectural requirements of scalable services. They impose requirements on system performance, scalability, and availability, stressing multiple aspects of the system architecture (e.g., compute cycles and network I/O). Recently there has been a lot of interest in client-side issues with respect to games. However, there has been little or no work on the server side. In this paper we use a commercial game server to gain insight into this class of applications and the requirements they impose on modern architectures. We find that: (1) In terms of benchmarking methodology, interactive game servers are very different from scientific workloads. We propose a methodology that deals with the related issues in benchmarking this class of applications. Our methodology bears many similarities with methodologies used in benchmarking online transaction processing (OLTP) systems. (2) Current, sequential game servers can support at most up to a few tens of users (60–100) on existing processors. (3) The bottleneck in the server is both game-related and network-related processing (about 50–50). (4) Network bandwidth requirements are not an important issue for the numbers of players we are interested in. (5) The processor achieves a surprisingly low IPC of 0.416.
- Published
- 2005
22. Performance Evaluation of Commodity iSCSI-Based Storage Systems
- Author
-
Angelos Bilas, D. Xinidis, and Michail D. Flouris
- Subjects
Storage area network ,HyperSCSI ,Computer science ,Application server ,Operating system ,Local area network ,Linux kernel ,iSCSI ,Disk buffer ,computer.software_genre ,computer ,Host (network) - Abstract
iSCSI is proposed as a possible solution to building future storage systems. However, using iSCSI raises numerous questions about its implications on system performance. This lack of understanding of system I/O behavior in modern and future systems inhibits providing solutions at the architectural and system levels. Our main goals in this work are to understand the behavior of the application server (iSCSI initiator), to evaluate the overhead introduced by iSCSI compared to systems with directly-attached storage, and to provide insight about how future storage systems may be improved. We examine these questions in the context of commodity iSCSI systems that can benefit most from using iSCSI. We use commodity PCs with several disks as storage nodes and a Gigabit Ethernet network as the storage network. On the application server side we use a broad range of benchmarks and applications to evaluate the impact of iSCSI on application and server performance. We instrument the Linux kernel to provide detailed information about I/O activity and the various overheads of kernel I/O layers. Our analysis reveals how iSCSI affects application performance and shows that building next-generation, network-based I/O architectures requires optimizing I/O latency, reducing network and buffer cache related processing in the host CPU, and increasing the sheer network bandwidth to account for consolidation of different types of traffic.
- Published
- 2005
23. Violin: A Framework for Extensible Block-Level Storage
- Author
-
Angelos Bilas and Michail D. Flouris
- Subjects
Violin ,Metadata ,Application virtualization ,Computer science ,Metadata management ,Operating system ,Virtual device ,Storage virtualization ,Application software ,computer.software_genre ,Virtualization ,computer - Abstract
In this work we propose Violin, a virtualization framework that allows easy extensions of block-level storage stacks. Violin allows (i) developers to provide new virtualization functions and (ii) storage administrators to combine these functions in storage hierarchies with rich semantics. Violin makes it easy to develop such new functions by providing support for (i) hierarchy awareness and arbitrary mapping of blocks between virtual devices, (ii) explicit control over both the request and completion path of I/O requests, and (iii) persistent metadata management. To demonstrate the effectiveness of our approach we evaluate Violin in three ways: (i) we loosely compare the complexity of providing new virtual modules in Violin with the traditional approach of writing monolithic drivers. In many cases, adding new modules is a matter of recompiling existing user-level code that provides the required functionality. (ii) We show how simple modules in Violin can be combined in more complex hierarchies. We demonstrate hierarchies with advanced virtualization semantics that are difficult to implement with monolithic drivers. (iii) We use various benchmarks to examine the overheads introduced by Violin in the common I/O path. We find that Violin modules perform within 10% of the corresponding monolithic Linux drivers.
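The composable hierarchies the abstract describes can be pictured with a small sketch. This is an illustrative Python model, not Violin's actual (kernel-level) API: each layer gets a request-path hook (block remapping) and a completion-path hook, and layers compose by stacking. All class and method names here are hypothetical.

```python
# Hypothetical sketch of stackable block-level virtualization layers,
# in the spirit of the abstract: each virtual device can remap blocks
# on the request path and act on data on the completion path.

class VirtualDevice:
    """Base layer: remap, forward to the device below, post-process."""
    def __init__(self, below=None):
        self.below = below

    def map(self, block):                 # request-path hook
        return block

    def on_complete(self, block, data):   # completion-path hook
        return data

    def read(self, block):
        data = self.below.read(self.map(block))
        return self.on_complete(block, data)

class RamDisk(VirtualDevice):
    """Bottom of the hierarchy: a dict-backed block store."""
    def __init__(self, blocks):
        super().__init__()
        self.store = dict(blocks)
    def read(self, block):
        return self.store.get(block, b"\x00")

class Striping(VirtualDevice):
    """Remaps logical to physical blocks (toy 1:2 mapping)."""
    def map(self, block):
        return block * 2

class Verify(VirtualDevice):
    """Checks data on its way back up the stack."""
    def on_complete(self, block, data):
        assert data is not None, f"bad read at block {block}"
        return data

# Compose a hierarchy: Verify -> Striping -> RamDisk
disk = RamDisk({0: b"a", 2: b"b"})
stack = Verify(Striping(disk))
print(stack.read(1))   # logical block 1 maps to physical block 2
```

Adding a new "module" is just another subclass overriding one hook, which mirrors the abstract's point that extensions become small, reusable pieces rather than monolithic drivers.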
- Published
- 2005
24. CORMOS: a communication-oriented runtime system for sensor networks
- Author
-
J. Yannakopoulos and Angelos Bilas
- Subjects
Protocol stack ,Key distribution in wireless sensor networks ,Runtime system ,First-class citizen ,business.industry ,Computer science ,Network packet ,Concurrency ,Embedded system ,Modular design ,business ,Wireless sensor network - Abstract
Recently there has been a lot of activity in building sensor prototypes with processing and communication capabilities. Early efforts in this area focused on building the devices themselves and on understanding network issues. An issue that has not received as much attention is generic runtime system support. In this paper, we present CORMOS, a communication-oriented runtime system for sensor networks. CORMOS is tailored: (i) to provide easy-to-use abstractions and treat communication as a first-class citizen rather than an extension, (ii) to be highly modular with unified application and system interfaces, and (iii) to deal with sensor limitations on concurrency and memory. We describe the design of CORMOS, discuss various design alternatives, and provide a prototype implementation on a real system. We present preliminary results for resource requirements of CORMOS using a pair of sensor devices. We find that the runtime system and a simple network stack can fit in 5.5 KBytes of program memory, occupying about 130 Bytes of RAM. On the specific devices we use, the system is able to process events at a rate of 2500 events/sec. When communicating over the radio transceiver, CORMOS achieves a maximum rate of 20 packets/sec.
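The "communication as a first-class citizen, unified application and system interfaces" idea can be sketched as an event-driven runtime in which sending is simply posting an event to another module through the same dispatcher. This is an assumed illustration in Python, not CORMOS's actual (embedded C) interface; all names are hypothetical.

```python
# Hypothetical sketch of a communication-oriented, event-driven runtime:
# modules register handlers, a single run-to-completion scheduler
# dispatches queued events, and inter-module communication uses the
# same post() interface as everything else.
from collections import deque

class Runtime:
    def __init__(self):
        self.queue = deque()
        self.handlers = {}            # (module, event) -> callback

    def register(self, module, event, fn):
        self.handlers[(module, event)] = fn

    def post(self, module, event, payload=None):
        self.queue.append((module, event, payload))

    def run(self):
        # Run each handler to completion; concurrency needs stay tiny,
        # which suits memory-limited sensor devices.
        while self.queue:
            module, event, payload = self.queue.popleft()
            self.handlers[(module, event)](self, payload)

log = []
rt = Runtime()
# A "sample" event on the sensor module is forwarded to the radio module
# by posting another event -- communication via the unified interface.
rt.register("sensor", "sample", lambda rt, v: rt.post("radio", "send", v))
rt.register("radio", "send", lambda rt, v: log.append(v))
rt.post("sensor", "sample", 42)
rt.run()
print(log)   # [42]
```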
- Published
- 2005
25. Parallelization, optimization, and performance analysis of portfolio choice models
- Author
-
A. Abdelkhalek, Angelos Bilas, and Alexander Michaelides
- Subjects
Class (computer programming) ,Speedup ,Computer science ,Computation ,Benchmark (computing) ,Portfolio ,Parallel computing ,Sequential algorithm - Abstract
In this paper we show how applications in computational economics can take advantage of modern parallel architectures to reduce the computation time in a wide array of models that have been, to date, computationally intractable. The specific application we use computes the optimal consumption and portfolio choice policy rules over the life-cycle of the individual. Our goal is two-fold: (i) To understand the behavior of a class of emerging applications and provide an efficient parallel implementation and (ii) to introduce a new benchmark for parallel computer architectures from an emerging and important class of applications. We start from an existing sequential algorithm for solving a portfolio choice model. We present a number of optimizations that result in highly optimized sequential code. We then present a parallel version of the application. We find that: (i) Emerging applications in this area of computational economics exhibit adequate parallelism to achieve, after a number of optimization steps, almost linear speedup for system sizes up to 64 processors. (ii) The main challenges in dealing with applications in this area are computational imbalances introduced by algorithmic dependencies and the parallelization method and granularity. (iii) We present preliminary results for a problem that has not been, to the best of our knowledge, solved in the financial economics literature to date.
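The structure the abstract alludes to, backward induction with a sequential dependency across periods but independent grid points within a period, can be sketched with a toy life-cycle problem. This is an assumed illustration, not the paper's model or code: the grid, utility, return process, and consumption search below are all placeholder choices.

```python
# Hypothetical sketch of parallel backward induction for a toy
# life-cycle consumption problem: periods must be solved in order
# (the algorithmic dependency the abstract mentions), while the
# wealth-grid points within a period are independent and can be
# mapped to workers in parallel.
from concurrent.futures import ThreadPoolExecutor
import math

GRID = [1.0 + 0.5 * i for i in range(20)]   # wealth grid
RETURNS = [0.98, 1.02, 1.06]                # equally likely gross returns
BETA, T = 0.95, 5                           # discount factor, horizon

def u(c):
    return math.log(c)

def interp(values, w):
    """Piecewise-linear interpolation of next-period value on GRID."""
    if w <= GRID[0]:
        return values[0]
    if w >= GRID[-1]:
        return values[-1]
    for i in range(len(GRID) - 1):
        if GRID[i] <= w <= GRID[i + 1]:
            t = (w - GRID[i]) / (GRID[i + 1] - GRID[i])
            return (1 - t) * values[i] + t * values[i + 1]

def solve_point(w, v_next):
    """Best value at one grid point over a coarse consumption search."""
    best = -1e18
    for c in [w * f for f in (0.2, 0.4, 0.6, 0.8)]:
        ev = sum(interp(v_next, (w - c) * r) for r in RETURNS) / len(RETURNS)
        best = max(best, u(c) + BETA * ev)
    return best

v = [u(w) for w in GRID]                    # terminal period: consume all
for _ in range(T):                          # sequential across periods
    with ThreadPoolExecutor() as pool:      # parallel across the grid
        v = list(pool.map(lambda w: solve_point(w, v), GRID))
```

The per-point work here is uniform, but in richer models it is not, which is one source of the computational imbalance the abstract identifies.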
- Published
- 2001
26. Limits to the performance of software shared memory: a layered approach
- Author
-
Angelos Bilas, Dongming Jiang, Jaswinder Pal Singh, and Yuanyuan Zhou
- Subjects
Distributed shared memory ,Computer science ,business.industry ,Software development ,Software performance testing ,Application software ,computer.software_genre ,Application layer ,Computer architecture ,Shared memory ,Distributed memory ,Software system ,business ,computer - Abstract
Much research has been done in fast communication on clusters and in protocols for supporting software shared memory across them. However, the end performance of applications that were written for the more proven hardware-coherent shared memory is still not very good on these systems. Three major layers of software (and hardware) stand between the end user and parallel performance, each with its own functionality and performance characteristics. They include the communication layer, the software protocol layer that supports the programming model, and the application layer. These layers provide a useful framework to identify the key remaining limitations and bottlenecks in software shared memory systems, as well as the areas where optimization efforts might yield the greatest performance improvements. This paper performs such an integrated study, using this layered framework, for two types of software distributed shared memory systems: page-based shared virtual memory (SVM) and fine-grained software systems (FG). For the two system layers (communication and protocol), we focus on the performance costs of basic operations in the layers rather than on their functionalities. This is possible because their functionalities are now fairly mature. The less mature applications layer is treated through application restructuring. We examine the layers individually and in combination, understanding their implications for the two types of protocols and exposing the synergies among layers.
- Published
- 1999
27. User-Space Communication: A Quantitative Study
- Author
-
Cezary Dubnicki, Angelos Bilas, Koichi Konishi, Soichiro Araki, James Philbin, and Jan Sterling Edler
- Subjects
business.industry ,Computer science ,Models of communication ,Bandwidth (computing) ,User space ,Computer multitasking ,Myrinet ,Latency (engineering) ,Communications system ,business ,Bottleneck ,Computer network - Abstract
Powerful commodity systems and networks offer a promising direction for high performance computing because they are inexpensive and they closely track technology progress. However, high, raw-hardware performance is rarely delivered to the end user. Previous work has shown that the bottleneck in these architectures is the overheads imposed by the software communication layer. To reduce these overheads, researchers have proposed a number of user-space communication models. The common feature of these models is that applications have direct access to the network, bypassing the operating system in the common case and thus avoiding the cost of send/receive system calls. In this paper we examine five user-space communication layers, that represent different points in the configuration space: Generic AM, BIP-0.92, FM-2.02, PM-1.2, and VMMC-2. Although these systems support different communication paradigms and employ a variety of different implementation tradeoffs, we are able to quantitatively compare them on a single testbed consisting of a cluster of high-end PCs connected by a Myrinet network. We find that all five communication systems have very low latency for small messages, in the range of 5 to 17 μs. Not surprisingly, this range is strongly influenced by the functionality offered by each system. We are encouraged, however, to find that features such as protected and reliable communication at user level and multiprogramming can be provided at very low cost. Bandwidth, however, depends primarily on how data is transferred between host memory and the network. Most of the investigated libraries support zero-copy protocols for certain types of data transfers, but differ significantly in the bandwidth delivered to end users. The highest bandwidth, between 95 and 125 MBytes/s for long message transfers, is delivered by libraries that use DMA on both send and receive sides and avoid all data copies. Libraries that perform additional data copies or use programmed I/O to send data to the network achieve lower maximum bandwidth, in the range of 60–70 MBytes/s.
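A standard back-of-the-envelope model clarifies how the latency and bandwidth figures interact: with fixed per-message overhead L and asymptotic bandwidth B, transfer time is roughly T(n) = L + n/B, and effective bandwidth n/T(n) reaches half of B at the half-power message size n½ = L·B. The sketch below is an illustration using values in the range the abstract reports, not measurements from the paper.

```python
# Simple latency/bandwidth cost model (an illustration, not the
# paper's methodology): T(n) = L + n/B for an n-byte message.

def transfer_time_us(n_bytes, latency_us, bw_mb_s):
    # 1 MB/s == 1 byte/us, so bandwidth in MB/s is bytes per microsecond.
    return latency_us + n_bytes / bw_mb_s

def half_power_bytes(latency_us, bw_mb_s):
    # Message size at which effective bandwidth n/T(n) equals B/2.
    return latency_us * bw_mb_s

# A zero-copy, DMA-based library: ~10 us latency, ~120 MB/s peak.
print(half_power_bytes(10, 120))
# A copy-based library with the same latency but ~65 MB/s peak.
print(half_power_bytes(10, 65))
```

With these assumed numbers, a DMA-based library needs messages of about 1.2 KB before it delivers even half its peak bandwidth, which is why small-message latency and large-message bandwidth are reported separately.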
- Published
- 1998
28. FPGA acceleration in EVOLVE’s Converged Cloud-HPC Infrastructure
- Author
-
Fekhr Eddine Keddous, Antony Chazapis, Huy Nam Nguyen, Dimosthenis Masouros, Sotirios Xydis, Angelos Bilas, Jean-Thomas Acquaviva, Konstantina Koliogeorgi, Romain Hugues, Dimitrios Soudris, and Michelle Aubrun
- Subjects
User Friendly ,Software ,Computer architecture ,business.industry ,Interface (Java) ,Computer science ,Testbed ,Big data ,Cloud computing ,Usability ,business ,Field-programmable gate array - Abstract
The EVOLVE project aims to take important steps toward bringing together the Big Data, HPC, and Cloud domains in a single testbed and exposing its services through a user-friendly, transparent interface. The EVOLVE testbed is enhanced with acceleration capabilities by leveraging heterogeneous technologies, and it allows users to develop and deploy applications easily through Zeppelin notebooks.
- Full Text
- View/download PDF
29. The VINEYARD project: Versatile integrated accelerator-based heterogeneous data centres
- Author
-
Georgi Gaydadjiev, Dimitrios S. Nikolopoulos, Vasilis Spatadakis, Christos Strydis, Christoforos Kachris, Neil Morgan, Angelos Bilas, Dimitris Gardelis, Alexandre Almeida, Ricardo Jiménez-Peris, and Huy-Nam Nguyen
- Subjects
Multi-core processor ,business.industry ,Computer science ,Distributed computing ,Big data ,Cloud computing ,Symmetric multiprocessor system ,02 engineering and technology ,7. Clean energy ,020202 computer hardware & architecture ,020204 information systems ,Server ,Spark (mathematics) ,0202 electrical engineering, electronic engineering, information engineering ,Data center ,SDG 7 - Affordable and Clean Energy ,business ,Efficient energy use - Abstract
Emerging applications like cloud computing and big-data analytics have created the need for powerful data centers hosting hundreds of thousands of servers. Currently, data centers are based on general-purpose processors that provide high flexibility but lack the energy efficiency of customized accelerators. VINEYARD aims to develop novel servers based on programmable hardware accelerators. Furthermore, VINEYARD will develop an integrated framework that allows end users to seamlessly utilize these accelerators in heterogeneous computing systems through typical data-center programming frameworks (i.e., Spark). VINEYARD will foster the expansion of the soft-IP-core industry, currently limited to embedded systems, into the data-center market. VINEYARD plans to demonstrate the advantages of its approach in three real use cases: a) a bio-informatics application for high-accuracy brain modeling, b) two critical financial applications, and c) a big-data analysis application.
- Full Text
- View/download PDF