Item Details

Fault-Tolerance in Coarse Grain Data Flow

NguyenTuong, Anh; Grimshaw, Andrew; Karpovich, John
Wide-area parallel processing systems will soon be available to researchers to solve a range of problems. It is certain that host failures and other faults will be an everyday occurrence in these systems. Unfortunately, contemporary parallel processing systems were not designed with fault-tolerance as a design objective. The data-flow model, long a mainstay of parallel processing, offers hope. The model's functional nature, which makes it so amenable to parallel processing, also facilitates straightforward fault-tolerant implementations. It is the combination of ease of parallelization and fault-tolerance that we feel will increase the importance of the model in the future, and lead to the widespread use of functional components. Using Mentat, an object-oriented, data-flow based, parallel processing system, we demonstrate two orthogonal methods of providing application fault-tolerance. The first method provides transparent replication of actors and requires modification to the existing Mentat run-time system. Providing direct support for replicating actors enables the programmer to easily build fault-tolerant applications regardless of the complexity of their data-flow graph representation. The second method - the checkboard method - is applicable to applications that contain independent and restartable computations such as "bag of tasks", Monte Carlo computations, and pipelines, and involves some simple restructuring of code. While these methods are presented separately, they could in fact be combined. For both methods, we present experimental data to illustrate the trade-offs between fault-tolerance, performance, and resource consumption.
University of Virginia, Department of Computer Science, 1995
Libra Open Repository
In Copyright

