Multi-core processors naturally exploit thread-level parallelism (TLP). However, extracting instruction-level parallelism (ILP) from individual applications or threads is still a ...
Single-threaded programming is already considered a complicated task. The move to multi-threaded programming only increases the complexity and cost involved in software developmen...
Matthew J. Bridges, Neil Vachharajani, Yun Zhang, ...
Many parallel applications from scientific computing use MPI global communication operations to collect or distribute data. Since the execution times of these communication opera...
Modern high speed interconnects such as Myrinet and Gigabit Ethernet have shifted the bottleneck in communication from the interconnect to the messaging software at the sending an...
We present a number of optimization techniques to compute prefix sums on linked lists and implement them on multithreaded GPUs using CUDA. Prefix computations on linked structures ...