Using Grand Central Dispatch With OpenCL

Developers already use Grand Central Dispatch (GCD) queues to implement concurrency in their applications. OS X v10.7 adds the ability to enqueue work coded as OpenCL kernels to GCD queues backed by OpenCL compute devices.

You can use GCD with OS X v10.7 OpenCL to:

Discovering Available Compute Devices

OpenCL kernels assume a Single instruction, Multiple Data (SIMD) parallel model of computation. This means (roughly) that you have a large amount of data divided into chunks, and you want the kernel to perform the same computation on each chunk. Some SIMD algorithms will execute better on a CPU rather than on a GPU, or on one GPU rather than another, depending on many factors. Tools in OS X version 7 and later facilitate discovery of the types of devices that are available to process data.

A context is needed to share memory objects between devices. If you use The OS X v10.7 gcl_ APIs, you can just retrieve and use the default global context; no context creation is needed.

An OpenCL context is similar to an OpenGL sharegroup. A sharegroup is a set of tools that allow blocks of memory to be accessed by both a GPU and a CPU. See “OpenCL/ OpenGL Interoperation: Data Sharing.”

When you retrieve the default global context in OS X v10.7 OpenCL, you can find out about the environment in which OpenCL kernels execute. The context includes the set of devices, the memory accessible to those devices, and one or more queues used to schedule execution of one or more kernels.

From the context, you can discover the types of devices in the system and can obtain recommendations as to the optimal configuration for running a kernel. Your application can call on GCD to create a queue for a particular type of device or to create a queue for a specific device.

  1. Call the gcl_get_context function to get the "global" OpenCL context that OS X v10.7 creates for you.

  2. Call the clGetDeviceIds( ... ) function (an API in the OpenCL standard API), specifying the context you just obtained as the context parameter. This call will return a list of the IDs of the OpenCL devices attached.

  3. When you have the IDs of the devices in the context, you can call the clGetDeviceInfo() function for each of the devices to obtain information about the device. The sample code in Listing 5-1 requests the vendor (the manufacturer) and the device name. You could also use the clGetDeviceInfo() function to request more technical information like the number of compute cores, the cache line size and so on. The types of information you can obtain are described in the OpenCL 1.1 specification. You can choose to send different types of work to a device depending upon its characteristics and capabilities.

Enqueueing A Kernel To A Dispatch Queue

You must use an OpenCL-compatible dispatch queue for your OpenCL work. You can create a queue for a particular device in the system or you can create a queue for a particular type of device. You can enqueue as many kernels on each queue as you choose. You can create as many different queues as you would like:

Both of these methods are illustrated in the sample code. See Listing 5-1.

Once you have created a queue, you can enqueue as many kernels onto that queue as necessary. Or, you can create additional queues with different characteristics.

For more information about Grand Central Dispatch queues, see Concurrency Programming Guide: Dispatch Queues.

Determining the Characteristics Of A Kernel On A Device

To obtain information specific to a kernel/device pair, including how much private and local memory the kernel will consume (on that device), as well as the workgroup size OpenCL thinks will be most optimal for execution, call the gcl_get_kernel_block_workgroup_info function. This information is useful when you are tuning performance for a particular device or debugging performance issues.

Obtaining the Kernel’s Workgroup Size

To find out what OpenCL thinks is the best workgroup size for executing a kernel on a particular device, call the gcl_get_kernel_block_workgroup_info function. You can use this value as the cl_ndrange.local_work_size for a kernel on a particular device.

In Listing 5-1, notice that we first execute this method in a block on a dispatch queue we've created with OpenCL requesting the local memory size:

gcl_get_kernel_block_workgroup_info(
                             square_kernel,
                             CL_KERNEL_LOCAL_MEM_SIZE,
                             sizeof(local_memsize),
                             &local_memsize, NULL);

Then, in Listing 5-2, we call the gcl_get_kernel_block_workgroup_info function to ask OpenCL to return what it considers to be the optimal workgroup size for this kernel, on this device:

gcl_get_kernel_block_workgroup_info(
                             square_kernel,                // this kernel
                             CL_KERNEL_WORK_GROUP_SIZE,
                             sizeof(workgroup_size), &workgroup_size, NULL);
                             fprintf(stdout, "Workgroup size: %ld\n",
                                     workgroup_size);

Finally, we call the gcl_get_kernel_block_workgroup_info function to once more to ask OpenCL for a workgroup size multiple. This is a performance hint based on the capabilities of the underlying device:

gcl_get_kernel_block_workgroup_info(
                             square_kernel, // this kernel
                             CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                             sizeof(preferred_workgroup_size_multiple),
                             &preferred_workgroup_size_multiple, NULL);

You can now use these workgroup values to craft an appropriate cl_ndrange structure to use in launching your kernel.

cl_ndrange range = {
        1,                  // The number of dimensions to use.
 
        {0, 0, 0},          // The offset in each dimension.  We want to
                                // process ALL of our data, so this is 0 for
                                // our test case.
                                // Always pass an offset for each of the
                                // three dimensions even though the workgroup
                                // may have fewer than three dimensions.
 
        {NUM_VALUES, 0, 0},     // The global range -- this is how many items
                                // IN TOTAL in each dimension you want to
                                // process.
                                // Always pass the global range for each of the
                                // three dimensions even though the workgroup
                                // may have fewer than three dimensions.
 
        {workgroup_size, 0, 0 } // The local size of each workgroup.  This
                                // determines the number of workitems per
                                // workgroup.  It indirectly affects the
                                // number of workgroups, since the global
                                // size / local size yields the number of
                                // workgroups.  So in our test case, we will
                                // have NUM_VALUE/workgroup_size workgroups.
                                // Always pass the workgroup size for each of the
                                // three dimensions even though the workgroup
                                // may have fewer than three dimensions.
};

Sample Code: Creating a Dispatch Queue

Listing 5-1 demonstrates how to get the global OpenCL context, and how to ask that context about the devices it contains. It also shows how to create a dispatch queue by asking for a device type (CPU or GPU), and by specifying the queue's OpenCL device directly.

Listing 5-2 shows how to obtain workgroup information -- useful for obtaining peak performance -- from the kernel block.

Listing 5-1  Creating a dispatch queue

 
#include <stdio.h>
 
// Include OpenCL/opencl.h to include everything you need for OpenCL
//development on OS X v10.7.
#include <OpenCL/opencl.h>
 
// In this example, mykernel.cl.h is the header file that contains our kernel block
// declaration.
// This header file is generated by Xcode.
#include "mykernel.cl.h"
 
static void print_device_info(cl_device_id device) {
    char name[128];
    char vendor[128];
 
    clGetDeviceInfo(device, CL_DEVICE_NAME, 128, name, NULL);
    clGetDeviceInfo(device, CL_DEVICE_VENDOR, 128, vendor, NULL);
    fprintf(stdout, "%s : %s\n", vendor, name);
}
 
#pragma mark -
#pragma mark Hello World - Sample 1
 
// Demonstrates how to get the global OpenCL context, and how to ask that
// context about the devices it contains.  It also shows how
// to create a dispatch queue by asking for a device type (CPU or GPU) and
// by specifying the queue's OpenCL device directly.
 
static void hello_world_sample1 ()
{
    int i;
 
    // Ask for the global OpenCL context:
    // Note: If you will not be enqueing to a specific device, you do not need
    // to retrieve the context.
 
    cl_context context = gcl_get_context();
 
    // Query this context to see what kinds of devices are available to us.
 
    size_t length;
    cl_device_id devices[8];
    clGetContextInfo(
         context, CL_CONTEXT_DEVICES, sizeof(devices), devices, &length);
 
    // Walk over these devices, printing out some basic information.  We could
    // query any of the information available about the device here.
 
    fprintf(stdout, "The following devices are available for use:\n");
    int num_devices = (int)(length / sizeof(cl_device_id));
    for (i = 0; i < num_devices; i++) {
        print_device_info(devices[i]);
    }
 
    // To do any work, you need to create a dispatch queue associated
    // with some OpenCL device.  You can either let the system give you
    // a GPU -- perhaps the only GPU -- or the CPU device.  Or, you can
    // create a dispatch queue with a cl_device_id you specify.  This
    // device id comes from the OpenCL context, as above.  Below are three
    // examples.
 
    // 1. Ask for a GPU-based dispatch queue; notice that we do not provide a
    // device id - we let the system give us the most capable GPU.
 
    dispatch_queue_t gpu_queue =
       gcl_create_dispatch_queue(CL_DEVICE_TYPE_GPU, NULL);
 
    // Get the device from the queue, so we can ask OpenCL questions about it.
    // Note that we check to make sure there WAS an OpenCL-capable GPU in the
    // system by checking against a NULL return value.
 
    if (gpu_queue != NULL) {
 
        cl_device_id gpu_device =
          gcl_get_device_id_with_dispatch_queue(gpu_queue);
        fprintf(stdout, "\nAsking for CL_DEVICE_TYPE_GPU gives us:\n");
        print_device_info(gpu_device);
 
    } else {
        fprintf(stdout, "\nYour system does not contain an OpenCL-compatible "
                "GPU\n.");
    }
 
    // 2. Let's try the same thing for CL_DEVICE_TYPE_CPU.  All Macintosh
    // systems will have a CPU OpenCL device, so we don't have to worry about
    // checking for NULL, as we did in the case of a GPU.
 
    dispatch_queue_t cpu_queue =
        gcl_create_dispatch_queue(CL_DEVICE_TYPE_CPU, NULL);
    cl_device_id cpu_device = gcl_get_device_id_with_dispatch_queue(cpu_queue);
    fprintf(stdout, "\nAsking for CL_DEVICE_TYPE_CPU gives us:\n");
    print_device_info(cpu_device);
 
    // 3. Or perhaps you are in a situation where you want a specific device
    // from the list of devices you found on the context.
    // Notice the difference here:
    // We pass CL_DEVICE_TYPE_USE_ID and a device_id.  We'll just use the
    // first device on the context from above, whatever that might be.
 
    dispatch_queue_t custom_queue =
        gcl_create_dispatch_queue(CL_DEVICE_TYPE_USE_ID, devices[0]);
    cl_device_id custom_device =
        gcl_get_device_id_with_dispatch_queue(custom_queue);
    fprintf(stdout,
       "\nAsking for CL_DEVICE_TYPE_USE_ID and our own device gives us:\n");
    print_device_info(custom_device);
 
    // Now we could use any of these dispatch queues to run some kernels!
 
    // Use the GCD API to free your queues.
 
    dispatch_release(custom_queue);
    dispatch_release(cpu_queue);
 
    if (gpu_queue != NULL) dispatch_release(gpu_queue);
}
 

Listing 5-2  Obtaining workgroup information

#pragma mark -
#pragma mark Hello World - Sample 2
 
// This listing shows how to obtain workgroup info –
// useful for obtaining peak performance - from the kernel block.
 
static void hello_world_sample2() {
 
    // Get a queue backed by a GPU for running our squaring kernel.
    dispatch_queue_t queue =
       gcl_create_dispatch_queue(CL_DEVICE_TYPE_GPU, NULL);
 
    // Did we get a GPU?  If not, fall back to the CPU device.
    if (queue == NULL) {
        gcl_create_dispatch_queue(CL_DEVICE_TYPE_GPU, NULL);
    }
 
    // In any case, print out the device we're using:
 
    fprintf(stdout, "\nExamining workgroup info for square_kernel on device ");
    print_device_info(gcl_get_device_id_with_dispatch_queue(queue));
 
    // Now find out what OpenCL thinks is the best workgroup size for
    // executing this kernel on this particular device.  Notice that we have
    // to execute this method in a block, on a dispatch queue we've created
    // with OpenCL.
 
    dispatch_sync(queue,
                  ^{
                      size_t wgs, preferred_wgs_multiple;
                      cl_ulong local_memsize, private_memsize;
 
                      // The next two calls give us information about how much
                      // memory, local and private, is used by the kernel on this
                      // particular device.
                      gcl_get_kernel_block_workgroup_info(square_kernel,
                                CL_KERNEL_LOCAL_MEM_SIZE,
                                sizeof(local_memsize),
                                &local_memsize, NULL);
                      fprintf(stdout, "Local memory size: %lld\n", local_memsize);
 
                      gcl_get_kernel_block_workgroup_info(square_kernel,
                                CL_KERNEL_PRIVATE_MEM_SIZE,
                                sizeof(private_memsize),
                                &private_memsize, NULL);
                      fprintf(stdout, "Private memory size: %lld\n", private_memsize);
 
                       // Here we ask OpenCL what it considers the optimal workgroup
                       // size for this kernel on this device.
                       gcl_get_kernel_block_workgroup_info(square_kernel,
                               CL_KERNEL_WORK_GROUP_SIZE,
                               sizeof(wgs), &wgs, NULL);
                       fprintf(stdout, "Workgroup size: %ld\n", wgs);
 
                       // Finally, we can ask OpenCL for a workgroup size multiple.
                       // This is a performance hint.
                       gcl_get_kernel_block_workgroup_info(square_kernel,
                               CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                               sizeof(preferred_wgs_multiple),
                               &preferred_wgs_multiple, NULL);
                       fprintf(stdout, "Preferred workgroup size multiple: %ld\n",
                               preferred_wgs_multiple);
 
                      // You could now use these workgroup values to craft an
                      // appropriate cl_ndrange structure for use in launching your kernel.
 
                });
 
    dispatch_release(queue);
}
 
 
int main(int argc, const char* argv[]) {
    hello_world_sample1();
    hello_world_sample2();
}

Did this document help you? Yes It's good, but... Not helpful...