A study of the performance of novel storage-centric repairable codes.

Authors :: Datta, Anwitaman
Pamies-Juarez, Lluis
Oggier, Frédérique
Source :: Computing. Mar2016, Vol. 98 Issue 3, p319-341. 23p.
Publication Year :: 2016
Abstract: Erasure coding has become an integral part of the storage infrastructure in data-centers and cloud backends, since it provides significantly higher fault tolerance for substantially lower storage overhead compared to a naive approach like n-way replication. Fault tolerance refers to the ability to achieve very high availability despite (temporary) failures, but for long-term data durability, the redundancy provided by erasure coding needs to be replenished as storage nodes fail or are retired. Traditional erasure codes are not easily amenable to repairs, and their repair process is usually both expensive and slow. Consequently, in recent years, numerous novel codes tailor-made for distributed storage have been proposed to optimize the repair process. Broadly, most of these codes belong to one of the following two families: network coding inspired regenerating codes that aim at minimizing the per-repair traffic, and locally repairable codes (LRC) which minimize the number of nodes contacted per repair (which in turn leads to the reduction of repair traffic and latency). Existing studies of these codes however restrict themselves to the repair of individual data objects in isolation. They ignore many practical issues that a real system storing multiple objects needs to take into account. Our goal is to explore a subset of such issues, particularly pertaining to the scenario where multiple objects are stored in the system. We use a simulation-based approach, which models the network bottlenecks at the edges of a distributed storage system, and the nodes' load and (un)availability. Specifically, we abstract the key features of both regenerating codes and LRC, and examine the effect of data placement and the corresponding de/correlation of failures, and the competition for limited network resources when multiple objects need to be repaired simultaneously by exploring the interplay of code parameters and trade-offs of bandwidth usage and speed of repairs.
[ABSTRACT FROM AUTHOR]