As technology scales ever further, device unreliability is creating excessive complexity for hardware to maintain the illusion of perfect operation. In this paper, we consider whe...
Marc de Kruijf, Shuou Nomura, Karthikeyan Sankaral...
Because of increasing hardware and software complexity, the running time of many computational science applications is now more than the mean-time-to-failure of highpeformance com...
Greg Bronevetsky, Daniel Marques, Keshav Pingali, ...
Transient faults due to particle strikes are a key challenge in microprocessor design. Driven by exponentially increasing transistor counts, per-chip faults are a growing burden. ...
Kristen R. Walcott, Greg Humphreys, Sudhanva Gurum...
For a space compactor, degradation of fault detection capability caused by the masking effects from unknown values is much more serious than that caused by error masking (i.e. ali...
Abstract. Current solutions for fault-tolerance in HPC systems focus on dealing with the result of a failure. However, most are unable to handle runtime system configuration change...