The productivity of HPC systems is determined not only by their performance, but also by their reliability. The conventional method to limit the impact of failures is checkpointing...
This paper proposes a variation of the Byzantine generals problem (or Byzantine consensus). Each general has a set of good plans and a set of bad plans. The problem is to make all...
Miguel Correia, Alysson Neves Bessani, Paulo Ver&i...
In recent years, there have been a few proposals to add a small amount of trusted hardware at each replica in a Byzantine fault tolerant system to cut back replication factors. Th...
Allen Clement, Flavio Junqueira, Aniket Kate, Rodr...
This paper explores the concept of design diversity redundancy applied to mixed-signal (MS) circuit blocks, as a proposal to increase system reliability. Three different implement...
Dependable distributed systems are difficult to build. This is particularly true if they have dependability requirements that change during the execution of an application, and are...
Michel Cukier, Jennifer Ren, Chetan Sabnis, David ...