To provide high dependability in a multithreaded system despite hardware faults, the system must detect and correct errors in its shared memory system. Recent research has explore...
Most parallel machines, such as clusters, are spaceshared in order to isolate batch parallel applications from each other and optimize their performance. However, this leads to lo...
The productivity of HPC system is determined not only by their performance, but also by their reliability. The conventional method to limit the impact of failures is checkpointing...
There has recently been increasing interests in using system virtualization to improve the dependability of HPC cluster systems. However, it is not cost-free and may come with som...
Haibo Chen, Rong Chen, Fengzhe Zhang, Binyu Zang, ...
Traditional techniques for building dependable, highperformance distributed systems are too expensive for most non-critical systems, often causing dependability to be sidelined as...
Vikram S. Adve, Adnan Agbaria, Matti A. Hiltunen, ...