Automated code generation and performance tuning techniques for concurrent architectures such as GPUs, Cell and FPGAs can provide integer factor speedups over multi-core processor...
In many scientific domains, experimental devices or simulation programs generate large volumes of data. The volumes of data may reach hundreds of terabytes and therefore it is imp...
Arie Shoshani, Luis M. Bernardo, Henrik Nordberg, ...
This paper presents the architecture and design of a high-performance asynchronous Huffman decoder for compressed-code embedded processors. In such processors, embedded programs a...
With the increased clock frequency of modern, high-performance processors over 500 MHz, in some cases, limiting the power dissipation has become the most stringent design target. ...
Luca Benini, Giovanni De Micheli, Alberto Macii, E...
Exploiting locality at run-time is a complementary approach to a compiler approach for those applications with dynamic memory access patterns. This paper proposes a memory-layout ...