Thu 16 Jun 2016 17:00 - 17:30 at Grand Ballroom San Rafael - Parallelism I Chair(s): Tony Hosking

Prefix sums are an important parallel primitive, especially in massively-parallel programs. This paper discusses two orthogonal generalizations thereof, which we call higher-order and tuple-based prefix sums. Moreover, it describes and evaluates SAM, a GPU-friendly algorithm for computing prefix sums and other scans that directly supports higher orders and tuple values. Its templated CUDA implementation unifies all of these computations in a single 100-statement kernel. SAM is communication-efficient in the sense that it minimizes main-memory accesses. When computing prefix sums of a million or more values, it outperforms Thrust and CUDPP on both a Titan X and a K40 GPU. On the Titan X, SAM reaches memory-copy speeds for large input sizes, which cannot be surpassed. SAM outperforms CUB, the currently fastest conventional prefix sum implementation, by up to a factor of 2.9 on eighth-order prefix sums and by up to a factor of 2.6 on eight-tuple prefix sums.
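The abstract names the two generalizations without defining them. The following is a minimal sequential sketch in Python, assuming that an order-k prefix sum means applying an ordinary inclusive prefix sum k times in succession, and that a tuple-based prefix sum treats the input as interleaved tuples of size s and sums each tuple component independently. Both definitions and all function names are assumptions made for illustration. This is a plain reference computation, not the SAM algorithm or its CUDA kernel.

```python
from itertools import accumulate


def prefix_sum(values):
    """Ordinary inclusive prefix sum: out[i] = values[0] + ... + values[i]."""
    return list(accumulate(values))


def higher_order_prefix_sum(values, order):
    """Assumed definition: apply the inclusive prefix sum `order` times."""
    result = list(values)
    for _ in range(order):
        result = list(accumulate(result))
    return result


def tuple_prefix_sum(values, tuple_size):
    """Assumed definition: the input holds interleaved tuples of `tuple_size`
    components, and each component is summed independently of the others."""
    result = list(values)
    # Each element adds the running sum of the same component, which sits
    # exactly one tuple (tuple_size positions) earlier in the array.
    for i in range(tuple_size, len(result)):
        result[i] += result[i - tuple_size]
    return result
```

Under these assumptions, `higher_order_prefix_sum(values, 1)` and `tuple_prefix_sum(values, 1)` both reduce to the ordinary prefix sum, which is consistent with the abstract's description of the two generalizations as orthogonal extensions of one primitive.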

Thu 16 Jun
Times are displayed in time zone: Tijuana, Baja California

17:00 - 18:00
Parallelism I (Research Papers) at Grand Ballroom San Rafael
Chair(s): Tony Hosking (Australian National University, Data61, and Purdue University)
Higher-Order and Tuple-Based Massively-Parallel Prefix Sums
Research Papers
Sepideh Maleki (Texas State University), Annie Yang (Texas State University), Martin Burtscher (Texas State University)
A Distributed OpenCL Framework using Redundant Computation and Data Replication
Research Papers
Junghyun Kim (Seoul National University), Gangwon Jo (Seoul National University), Jaehoon Jung (Seoul National University), Jungwon Kim (Oak Ridge National Laboratory), Jaejin Lee (Seoul National University)