The aim of this paper is to reduce the recovery time of a failed component in a distributed storage system and, as a result, to increase the availability of the entire cloud infrastructure. Using the proposed data placement method, we show that the recovery time of LRC codes can be reduced by 70%-80% on average, and by 160% for regenerating codes, compared to classical approaches. The paper also presents an efficient algorithm for scaling regenerating codes, which allows these codes to be used in dynamic clusters. In addition, we compare the recovery speed and redundancy of the above-mentioned codes with those of popular RAID levels.
The paper is addressed to the wider IT community, particularly data storage and cloud computing engineers, IT service staff, developers, and QA specialists. Its content is broadly applicable: it presents recommendations for improving data availability and recovery performance.
Recommended for listeners with mid-level expertise.
Evgenii Anastasiev
Software Developer in R&D, RAIDIX LLC
Linux kernel developer. His areas of interest are erasure coding and reliability theory. He currently researches and develops caching and deduplication algorithms for distributed storage systems.