In this paper, we describe an algorithm and implementation of locality optimizations for architectures with instruction sets such as Intel’s SSE and Motorola’s AltiVec that su...
A bulk synchronous computation proceeds in phases that are separated by barrier synchronization. For dynamic bulk synchronous computations that exhibit varying phase-wise computat...
In contrast to the common belief that OpenMP requires data-parallel extensions to scale well on architectures with non-uniform memory access latency, recent work has shown that it ...
All-to-all personalized exchange is one of the most dense collective communication patterns and occurs in many important parallel computing/networking applications. In this paper,...
We present new architectural concepts for uniprocessor designs that conform to the data-driven computation paradigm. Usage of our D2 -CPU (Data-Driven processor) follows the natura...