Prefix sums are an important parallel primitive, especially in massively-parallel programs. This paper discusses two orthogonal generalizations thereof, which we call higher-order and tuple-based prefix sums. Moreover, it describes and evaluates SAM, a GPU-friendly algorithm for computing prefix sums and other scans that directly supports higher orders and tuple values. Its templated CUDA implementation unifies all of these computations in a single 100-statement kernel. SAM is communication-efficient in the sense that it minimizes main-memory accesses. When computing prefix sums of a million or more values, it outperforms Thrust and CUDPP on both a Titan X and a K40 GPU. On the Titan X, SAM reaches memory-copy speeds for large input sizes, which cannot be surpassed. SAM outperforms CUB, the currently fastest conventional prefix sum implementation, by up to a factor of 2.9 on eighth-order prefix sums and by up to a factor of 2.6 on eight-tuple prefix sums.
Thu 16 JunDisplayed time zone: Tijuana, Baja California change
17:00 - 18:00 | Parallelism IResearch Papers at Grand Ballroom San Rafael Chair(s): Tony Hosking Australian National University, Data61, and Purdue University | ||
17:00 30mTalk | Higher-Order and Tuple-Based Massively-Parallel Prefix Sums Research Papers Sepideh Maleki Texas State University, Annie Yang Texas State University, Martin Burtscher Texas State University Pre-print Media Attached | ||
17:30 30mTalk | A Distributed OpenCL Framework using Redundant Computation and Data Replication Research Papers Junghyun Kim Seoul National University, Gangwon Jo Seoul National University, Jaehoon Jung Seoul National University, Jungwon Kim Oak Ridge National Laboratory, Jaejin Lee Seoul National University Media Attached |