Due to modern technology trends such as decreasing feature sizes and lower voltage levels, fault tolerance is becoming increasingly important in computing systems. Shared memory i...
The advent of large-scale multi-hop wireless networks highlights problems of fault tolerance and scale in distributed systems, motivating designs that autonomously recover from tra...
Mobile agents are regarded as an attractive technology for developing distributed applications in mobile and ubiquitous computing environments. In this paper, we present ODDUGI, a...
CCS is a resource management system for parallel high-performance computers. At the user level, CCS provides vendor-independent access to parallel systems. At the system administr...
Execution of MPI applications on clusters and Grid deployments suffers from node and network failures, which motivates the use of fault-tolerant MPI implementations. Two category tec...