Prefix sums are an important parallel primitive, especially in massively parallel programs. This paper discusses two orthogonal generalizations thereof, which we call higher-order and tuple-based prefix sums. Moreover, it describes and evaluates SAM, a GPU-friendly algorithm for computing prefix sums and other scans that directly supports higher orders and tuple values. Its templated CUDA implementation unifies all of these computations in a single 100-statement kernel. SAM is communication-efficient in the sense that it minimizes main-memory accesses. When computing prefix sums of a million or more values, it outperforms Thrust and CUDPP on both a Titan X and a K40 GPU. On the Titan X, SAM reaches memory-copy speed for large input sizes, an upper bound that a prefix sum cannot surpass. SAM outperforms CUB, currently the fastest conventional prefix sum implementation, by up to a factor of 2.9 on eighth-order prefix sums and by up to a factor of 2.6 on eight-tuple prefix sums.
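The abstract names the two generalizations without defining them, so the following serial Python sketch illustrates one plausible reading. It assumes that an order-k prefix sum means applying an inclusive prefix sum k times, and that an s-tuple prefix sum means scanning each of the s interleaved components independently. These definitions and the function names are assumptions made for illustration; the sketch is not SAM and says nothing about how the CUDA kernel is organized.

```python
from itertools import accumulate


def prefix_sum(values):
    """Inclusive prefix sum: out[i] = values[0] + ... + values[i]."""
    return list(accumulate(values))


def higher_order_prefix_sum(values, order):
    """Assumed definition: apply the inclusive prefix sum `order` times."""
    out = list(values)
    for _ in range(order):
        out = prefix_sum(out)
    return out


def tuple_prefix_sum(values, tuple_size):
    """Assumed definition: the input holds interleaved tuples of
    `tuple_size` components, and each component is scanned separately,
    so out[i] = values[i] + out[i - tuple_size]."""
    out = list(values)
    for i in range(tuple_size, len(out)):
        out[i] += out[i - tuple_size]
    return out


if __name__ == "__main__":
    print(higher_order_prefix_sum([1, 1, 1, 1], 2))      # [1, 3, 6, 10]
    print(tuple_prefix_sum([1, 10, 2, 20, 3, 30], 2))    # [1, 10, 3, 30, 6, 60]
```

Under these assumptions, setting `order` to 1 or `tuple_size` to 1 reduces either function to the conventional prefix sum, which is why the two generalizations can be treated as orthogonal.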
Thu 16 Jun
17:00 - 17:30
Sepideh Maleki (Texas State University), Annie Yang (Texas State University), Martin Burtscher (Texas State University)
17:30 - 18:00
Junghyun Kim (Seoul National University), Gangwon Jo (Seoul National University), Jaehoon Jung (Seoul National University), Jungwon Kim (Oak Ridge National Laboratory), Jaejin Lee (Seoul National University)