How the Kernel Interacts With Data in OS X OpenCL

There are two parts of every OpenCL program. The part that runs on the device is called the kernel; the part that creates memory objects, then configures and calls the kernel is called the host and usually runs on the CPU. A kernel is essentially a function written in the OpenCL language that enables it to be compiled for execution on any device that supports OpenCL. The kernel is the only way the host can call a function that will run on a device. When the host invokes a kernel, many work items start running on the device. Each work item runs the code of the kernel, but works on a different part of the dataset. The kernel manages work items by accessing them using their IDs using functions such as get_global_id(…) and get_local_id(…). Although kernels are enqueued for execution by host applications written in C, C++, or Objective-C, a kernel must be compiled separately to be customized for the device on which it is going to run.

Interacting with kernels is easier using tools provided by OS X than it is using standard OpenCL. As of OS X v10.7, you can include OpenCL kernels as resources in Xcode projects and compile them along with the rest of your application. Also as of OS X v10.7, the host can invoke kernels by passing them parameters just as if they were typical functions (see Passing Data To a Kernel); it is no longer necessary to explicitly set kernel arguments using special OpenCL APIs.

Accessing Objects From a Kernel

In order for a device to actually process data, you have to make the data available to the work items that execute on the device.

To pass data from the host to a compute kernel:

Specifying How To Divide Up A Dataset

When you write a kernel in OpenCL, you are writing the code that each work item will execute-instructions on how to process one portion of your overall dataset. By launching many work items, each of which operates on just a small portion of the data, you end up processing the whole data set. The ndrange structure is used to specify how data is assigned to work items.

The n-dimensional range (cl_ndrange) structure you pass to the kernel consists of the following fields:

Passing Data To a Kernel

Xcode uses your kernel code to automatically generate the kernel function prototype in the kernel header file. To pass data to a kernel, pass the memory objects as parameters (just as you would pass parameters to any other function) when you call the kernel from your host code. OpenCL kernel arguments can be scoped with a local or global qualifier, designating the memory storage for these arguments. This means that, as of OS X v10.7, kernel parameters declared with the local or __local address qualifier are declared as size_t in the block declaration of the kernel.

For example, if a kernel has an argument declared with the local address qualifier:

kernel void foo(
                global float *a,
                local float *shared);  // This kernel parameter is of type
                                       // local float; will be size_t in the
                                       // kernel block

The compiler generates the following extern declaration of this kernel block:

extern void (^foo_kernel)(
                  const cl_ndrange *ndrange,
                  float *a,
                  size_t shared       // In the generated declaration,
                                      // local float is declared as size_t
             );

By associating your buffer objects with specific kernel arguments, you make it possible to process your data using a kernel function. For example, in Example: Allocating, Using, and Releasing Buffer Objects, notice how the code sample treats the input data pointer much as you would treat a pointer in C. In this example, the input data is an array of float values, and you can process each element of the float array by indexing into the pointer.

Retrieving Results From a Kernel

If the kernel will be returning results in a buffer, call a function such as gcl_memcpy(…) while inside a block on a given queue.

To make sure that the results are all accessible to the host before you continue, use dispatch_sync or wait using another synchronization method.

If the kernel will be returning results in a buffer, call the dispatch_sync function like this:

dispatch_sync(queue,
              ^{
                 gcl_memcpy(ptr_c,
                            device_c,
                            num_floats * sizeof(float));
               });

If the kernel will be returning results in an image, call the dispatch_sync function like this:

dispatch_sync(queue,^{
                         size_t origin = {0,0,0};
                         size_t region = {512, 512, 1};
                         gcl_copy_image_to_ptr(
                                             results_ptr,
                                             image,
                                             origin,
                                             region);
                       });

This will copy the bytes for 512 x 512 pixels from the image to the buffer specified by the results_ptr parameter.