Thu 16 Jun 2016 17:30 - 18:00 at Grand Ballroom San Rafael - Parallelism I Chair(s): Tony Hosking

Applications written solely in CUDA or OpenCL cannot execute on a cluster that runs multiple operating system instances. For this reason, many studies have extended these programming models to clusters. Most previous approaches share a common idea: designate a centralized host node and have it coordinate the other nodes for computation. However, the centralized host can become a significant performance bottleneck when the number of nodes is large. In this paper, we propose a scalable and distributed OpenCL framework for large-scale heterogeneous clusters. To overcome the limitations of the centralized approaches, it executes the host program of an OpenCL application on every node of the cluster, exploiting redundant computation with data replication. This significantly reduces inter-node communication and synchronization overhead. In addition, the proposed framework applies several optimization techniques, such as remote device virtualization and queueing optimization, to reduce the command delivery and enqueueing overhead. We also propose a new OpenCL API function to alleviate the command scheduling overhead. We show the effectiveness of the framework by evaluating it with a microbenchmark and eleven benchmark applications on a large-scale CPU cluster of 512 nodes and a medium-scale GPU cluster of 36 nodes.
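The contrast between a centralized host and a replicated host program can be illustrated with a toy model. The sketch below is not the framework's implementation and uses no OpenCL calls; the names `Command`, `host_program`, `owner`, `centralized`, and `replicated` are hypothetical, and the model assumes that the host program is deterministic and that devices are assigned to nodes in contiguous blocks. It counts only command-delivery messages and ignores the data transfers a real framework would still need.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Command:
    device: int  # global index of the target device (hypothetical numbering)
    name: str    # label for the kernel being enqueued


def host_program(num_devices, steps):
    """Deterministic host program: yields the same command stream wherever it runs."""
    for step in range(steps):
        for dev in range(num_devices):
            yield Command(device=dev, name=f"kernel_{step}")


def owner(device, devices_per_node):
    """Node that physically owns a device (assumed contiguous block assignment)."""
    return device // devices_per_node


def centralized(num_nodes, devices_per_node, steps):
    """Node 0 runs the host program and forwards every remote command over the network."""
    messages = 0
    executed = {node: [] for node in range(num_nodes)}
    for cmd in host_program(num_nodes * devices_per_node, steps):
        target = owner(cmd.device, devices_per_node)
        if target != 0:
            messages += 1  # one command-delivery message per remote command
        executed[target].append(cmd)
    return executed, messages


def replicated(num_nodes, devices_per_node, steps):
    """Every node runs the whole host program and keeps only commands for its own devices."""
    messages = 0  # no command is delivered between nodes in this model
    executed = {}
    for node in range(num_nodes):
        stream = host_program(num_nodes * devices_per_node, steps)
        executed[node] = [c for c in stream if owner(c.device, devices_per_node) == node]
    return executed, messages


if __name__ == "__main__":
    central_exec, central_msgs = centralized(num_nodes=4, devices_per_node=2, steps=3)
    repl_exec, repl_msgs = replicated(num_nodes=4, devices_per_node=2, steps=3)
    print("same work per node:", central_exec == repl_exec)
    print("command messages (centralized):", central_msgs)
    print("command messages (replicated):", repl_msgs)
```

In this model both schemes execute the same commands on the same nodes, but the centralized variant sends one message per remote command while the replicated variant sends none, at the cost of every node repeating the host-side work.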