Basic Programming Sample

This chapter provides a tour through the code of a simple OpenCL application that performs calculations on a test data set. The code in Listing 2-2 calls the kernel defined in Listing 2-1. The kernel squares each value. Once the kernel completes its work, the host validates that every value was processed by the kernel.

Basic Kernel Code Sample

Listing 2-1 is example kernel code. See to download the project. See How the Kernel Interacts With Data in OS X OpenCL for more information about passing parameters to the kernel and retrieving information from the kernel.

Listing 2-1  Kernel code sample

// Simple OpenCL kernel that squares an input array.
// This code is stored in a file called mykernel.cl.
// You can name your kernel file as you would name any other
// file.  Use .cl as the file extension for all kernel
// source files.
 
// Kernel block.                                      //   1
kernel void square(                                   //   2
                   global float* input,               //   3
                   global float* output)
{
    size_t i = get_global_id(0);
    output[i] = input[i] * input[i];
}

Notes:

  1. Wrap your kernel code into a kernel block:

    kernel void kernelName(
                              global float* inputParameterName,
                              global float* [anotherInputParameter],
                                 …,
                              global float* outputParameterName)
    {
                  ...
    }
  2. Kernels always return void.

  3. Pass parameters to the kernel just as you would pass them to any other function.

Basic Host Code Sample

Listing 2-2 is example code that would run on a host. It calls a kernel to square a set of values, then tests to ensure that the kernel processed all the data. Each numbered line in the listing is described in more detail following the listing. See to download the project.

Listing 2-2  Host code sample

#include <stdio.h>
#include <stdlib.h>
 
// This include pulls in everything you need to develop with OpenCL in OS X.
#include <OpenCL/opencl.h>
 
// Include the header file generated by Xcode.  This header file contains the
//  kernel block declaration.                                             // 1
#include "mykernel.cl.h"
 
// Hard-coded number of values to test, for convenience.
#define NUM_VALUES 1024
 
// A utility function that checks that our kernel execution performs the
// requested work over the entire range of data.
static int validate(cl_float* input, cl_float* output) {
    int i;
    for (i = 0; i < NUM_VALUES; i++) {
 
        // The kernel was supposed to square each value.
        if ( output[i] != (input[i] * input[i]) ) {
            fprintf(stdout,
                    "Error: Element %d did not match expected output.\n", i);
            fprintf(stdout,
                    "       Saw %1.4f, expected %1.4f\n", output[i],
                            input[i] * input[i]);
            fflush(stdout);
            return 0;
        }
    }
    return 1;
}
 
int main (int argc, const char * argv[]) {
    int i;
    char name[128];
 
    // First, try to obtain a dispatch queue that can send work to the
    // GPU in our system.                                             // 2
    dispatch_queue_t queue =
               gcl_create_dispatch_queue(CL_DEVICE_TYPE_GPU, NULL);
 
    // In the event that our system does NOT have an OpenCL-compatible GPU,
    // we can use the OpenCL CPU compute device instead.
    if (queue == NULL) {
        queue = gcl_create_dispatch_queue(CL_DEVICE_TYPE_CPU, NULL);
    }
 
    // This is not required, but let's print out the name of the device
    // we are using to do work.  We could use the same function,
    // clGetDeviceInfo, to obtain all manner of information about the device.
    cl_device_id gpu = gcl_get_device_id_with_dispatch_queue(queue);
    clGetDeviceInfo(gpu, CL_DEVICE_NAME, 128, name, NULL);
    fprintf(stdout, "Created a dispatch queue using the %s\n", name);
 
    // Here we hardcode some test data.
    // Normally, when this application is running for real, data would come from
    // some REAL source, such as a camera, a sensor, or some compiled collection
    // of statistics—it just depends on the problem you want to solve.
    float* test_in = (float*)malloc(sizeof(cl_float) * NUM_VALUES);
    for (i = 0; i < NUM_VALUES; i++) {
        test_in[i] = (cl_float)i;
    }
 
    // Once the computation using CL is done, will have to read the results
    // back into our application's memory space.  Allocate some space for that.
    float* test_out = (float*)malloc(sizeof(cl_float) * NUM_VALUES);
 
    // The test kernel takes two parameters: an input float array and an
    // output float array.  We can't send the application's buffers above, since
    // our CL device operates on its own memory space.  Therefore, we allocate
    // OpenCL memory for doing the work.  Notice that for the input array,
    // we specify CL_MEM_COPY_HOST_PTR and provide the fake input data we
    // created above.  This tells OpenCL to copy the data into its memory
    // space before it executes the kernel.                               // 3
    void* mem_in  = gcl_malloc(sizeof(cl_float) * NUM_VALUES, test_in,
                               CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR);
 
    // The output array is not initalized; we're going to fill it up when
    // we execute our kernel.                                             // 4
    void* mem_out =
           gcl_malloc(sizeof(cl_float) * NUM_VALUES, NULL, CL_MEM_WRITE_ONLY);
 
    // Dispatch the kernel block using one of the dispatch_ commands and the
    // queue created earlier.                                            // 5
 
    dispatch_sync(queue, ^{
        // Although we could pass NULL as the workgroup size, which would tell
        // OpenCL to pick the one it thinks is best, we can also ask
        // OpenCL for the suggested size, and pass it ourselves.
        size_t wgs;
        gcl_get_kernel_block_workgroup_info(square_kernel,
                                            CL_KERNEL_WORK_GROUP_SIZE,
                                            sizeof(wgs), &wgs, NULL);
 
        // The N-Dimensional Range over which we'd like to execute our
        // kernel.  In this case, we're operating on a 1D buffer, so
        // it makes sense that the range is 1D.
        cl_ndrange range = {                                              // 6
            1,                     // The number of dimensions to use.
 
            {0, 0, 0},             // The offset in each dimension.  To specify
                                   // that all the data is processed, this is 0
                                   // in the test case.                   // 7
 
            {NUM_VALUES, 0, 0},    // The global range—this is how many items
                                   // IN TOTAL in each dimension you want to
                                   // process.
 
            {wgs, 0, 0}            // The local size of each workgroup.  This
                                   // determines the number of work items per
                                   // workgroup.  It indirectly affects the
                                   // number of workgroups, since the global
                                   // size / local size yields the number of
                                   // workgroups.  In this test case, there are
                                   // NUM_VALUE / wgs workgroups.
        };
        // Calling the kernel is easy; simply call it like a function,
        // passing the ndrange as the first parameter, followed by the expected
        // kernel parameters.  Note that we case the 'void*' here to the
        // expected OpenCL types.  Remember, a 'float' in the
        // kernel, is a 'cl_float' from the application's perspective.   // 8
 
        square_kernel(&range,(cl_float*)mem_in, (cl_float*)mem_out);
 
        // Getting data out of the device's memory space is also easy;
        // use gcl_memcpy.  In this case, gcl_memcpy takes the output
        // computed by the kernel and copies it over to the
        // application's memory space.                                   // 9
 
        gcl_memcpy(test_out, mem_out, sizeof(cl_float) * NUM_VALUES);
 
    });
 
 
    // Check to see if the kernel did what it was supposed to:
    if ( validate(test_in, test_out)) {
        fprintf(stdout, "All values were properly squared.\n");
    }
 
    // Don't forget to free up the CL device's memory when you're done. // 10
    gcl_free(mem_in);
    gcl_free(mem_out);
 
    // And the same goes for system memory, as usual.
    free(test_in);
    free(test_out);
 
    // Finally, release your queue just as you would any GCD queue.    // 11
    dispatch_release(queue);
}
 

Notes:

  1. Include the header file that contains the kernel block declaration. The name of the header file for a .cl file will be the name of the .cl file with .h appended to it. For example, if the .cl file is named mykernel.cl, the header file you must include will be mykernel.cl.h.

  2. Call gcl_create_dispatch_queue to create the dispatch queue.

  3. Create memory objects to hold input and output data and write input data to the input objects. Allocate an array on the OpenCL device from which to read kernel results back into host memory. Use gcl_malloc and make sure to use the OpenCL size of the datatype being returned. For example, write gcl_malloc(sizeof(cl_float) * NUM_VALUES. Because the CL device operates on its own memory space, allocate OpenCL memory for the input data upon which the kernel will work. Specify CL_MEM_COPY_HOST_PTR to tell OpenCL to copy over the input data from host memory into its memory space before it executes the kernel.

  4. Allocate OpenCL memory in which the kernel will store its results.

  5. Dispatch your kernel block using one of the dispatch commands and the queue you created above. In your dispatch call, you can specify workgroup parameters.

  6. Describe the data parallel range (the ndrange) over which to execute the kernel in the cl_ndrange structure.

    OpenCL always executes kernels in a data parallel fashion—that is, instances of the same kernel (work items) execute on different portions of the total data set. Each work item is responsible for executing the kernel once and operating on its assigned portion of the data set.

    You use the cl_ndrange field to specify how the workgroups are to be organized. For more information, see Specifying How To Divide Up A Dataset.

  7. Always pass an offset for each of three dimensions even though the workgroup may have fewer than three dimensions. See Specifying How To Divide Up A Dataset for more information.

  8. Call the kernel as you would call a function. Pass the ndrange as the first parameter, followed by the expected kernel parameters. Case the void* types to the expected OpenCL types. Remember, if you use float in your kernel, that's a cl_float from the application's perspective. The call to the kernel will look something like this:

    kernelName(
           &ndrange,
           (cl_datatype*)inputArray,
           (cl_datatype*)outputArray);
  9. Retrieve the data from the OpenCL device's memory space with gcl_memcpy. The output computed by the kernel is copied over to the host application's memory space.

  10. Free OpenCL memory objects.

  11. Call dispatch_release(...) on the dispatch queue you created with gcl_create_dispatch_queue(...) once you are done with it.