A Distributed OpenCL Framework using Redundant Computation and Data Replication
Applications written solely in CUDA or OpenCL cannot execute on a cluster that runs multiple operating system instances. For this reason, many studies have extended these programming models to clusters. Most previous approaches share a common idea: designate a centralized host and have it coordinate the other nodes for computation. However, the centralized host can become a significant performance bottleneck when the number of nodes is large. In this paper, we propose a scalable and distributed OpenCL framework for large-scale heterogeneous clusters. To overcome the limitations of the centralized approaches, the framework executes the host program of an OpenCL application on every node of the cluster, exploiting redundant computation and data replication. This significantly reduces inter-node communication and synchronization overhead. In addition, the framework applies several optimization techniques, such as remote device virtualization and queueing optimization, to reduce the overhead of command delivery and enqueueing. We also propose a new OpenCL API function to alleviate the command scheduling overhead. We demonstrate the effectiveness of the framework by evaluating it with a microbenchmark and eleven benchmark applications on a large-scale CPU cluster of 512 nodes and a medium-scale GPU cluster of 36 nodes.
Thu 16 Jun
|17:00 - 17:30|
|17:30 - 18:00|