Adaptive Co-Scheduler for Highly Dynamic Resources

Authors :: Vedalaveni, Kartik
Publication Year :: 2013
Abstract: Many kinds of scientific applications run on high-throughput computational (HTC) grids. HTC may use clusters opportunistically, running on a given cluster only when it is otherwise idle. These widely dispersed HTC clusters are heterogeneous in capability and availability, but in aggregate they can provide significant computing power. The scientific algorithms run on them also vary greatly: some generate high rates of disk I/O and some need large amounts of RAM. Most schedulers consider only CPU availability, and unchecked demand on associated resources that are not managed by resource managers can cause several problems on the cluster. On the grid, different sites may run different schedulers, so we cannot rely on the features of one kind of scheduler to characterize the nature and features of every job. Most state-of-the-art schedulers do not take into account resources such as RAM, disk I/O, or network I/O. This is as true for local schedulers as it is for the grid. Often these schedulers must be extended to handle new or complex use cases, either by writing a plugin for an existing scheduler or by co-scheduling. A key issue arises when resources such as RAM, disk I/O, or network I/O are used in an unchecked manner and performance degrades as a result. Scheduling further jobs that claim a degraded resource could overwhelm it to the extent that the resource stops responding or the system crashes. We schedule based on the minimum turnaround time of the sites, which helps increase the throughput of the overall workload and also makes the Co-Scheduler a good load balancer in itself. The Co-Scheduler is driven by the observation that turnaround time increases once the number of concurrent jobs accessing these resources reaches a threshold value, which in turn causes degradation; this observation is the basis for this work.
As the number of entities concurrently using a resource increases, concurrent and unmanaged access to that resource must be monitored and scheduled to prevent degradation. These issues, which we encounter in practice at the Holland Computing Center (HCC), are the basis and motivation for tackling this problem and for developing an adaptive approach to scheduling. The co-scheduler must be aware of multi-resource degradation, balance load across multiple sites, run clusters at high efficiency, and share resources fluidly. An initial implementation tested at HCC will be evaluated and presented.
Adviser: David Swanson
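
The site-selection idea described in the abstract can be illustrated with a minimal sketch. This is not the thesis implementation. The `Site` class, the `max_concurrent` threshold, and the `pick_site` function are hypothetical names introduced only for illustration, and treating a site with no history as having zero turnaround is an assumption made so that unmeasured sites get tried first.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Site:
    """Hypothetical view of one grid site as seen by a co-scheduler."""
    name: str
    max_concurrent: int  # assumed concurrency threshold beyond which the site degrades
    running: int = 0     # jobs currently placed on this site
    turnarounds: List[float] = field(default_factory=list)  # observed turnaround times (s)

    def mean_turnaround(self) -> float:
        # Assumption: a site with no history is scored 0.0 so it is probed first.
        if not self.turnarounds:
            return 0.0
        return sum(self.turnarounds) / len(self.turnarounds)


def pick_site(sites: List[Site]) -> Optional[Site]:
    """Return the eligible site with the lowest mean turnaround, or None.

    A site is eligible only while its running job count is below its threshold,
    so a site that is already at its threshold receives no further jobs.
    """
    eligible = [s for s in sites if s.running < s.max_concurrent]
    if not eligible:
        return None  # every site is at its threshold; the caller should hold the job
    return min(eligible, key=lambda s: s.mean_turnaround())
```

In this sketch a site that has reached its threshold is skipped even if its past turnaround was the best, which reflects the abstract's point that adding jobs to an already saturated resource makes degradation worse.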