Identifying Parallelizable Routines

This chapter provides a basic introduction to designing a program so that it can be implemented as an OpenCL kernel and called from an OpenCL host. For specifics about design patterns that may positively or negatively affect application performance, see Tuning Performance On the GPU and Improving Performance On the CPU.

The first step in using OpenCL to improve your application’s performance is to identify which portions of your application are suited to parallelization. In a general application, you can spawn a separate thread for a task as long as the functions in the thread are re-entrant and you are careful about how you synchronize data access. To achieve the level of parallelism for which OpenCL is ideal, however, it is much more important for the work items to be independent of each other. Although work items in the same workgroup can share local data, they execute asynchronously, so no work item’s calculations can depend on the results of another work item’s calculations. Parallelization works only when the tasks that you run in parallel do not depend on each other.
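As a simple illustration of this independence, consider a kernel in which every work item reads one input element and writes one output element (the kernel name and buffers below are hypothetical, not part of this chapter’s example). No work item ever needs another work item’s result:

__kernel void square(__global const float *input, __global float *output)
{
    // Each work item reads and writes only the element at its own global ID,
    // so its calculation never depends on another work item's result.
    size_t i = get_global_id(0);
    output[i] = input[i] * input[i];
}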

For example, assume that you are writing a simple application that keeps track of the grades for students in a class. This application consists of two main tasks:

  1. Compute the final grade for each student. Assume that a student’s final grade is the average of all the grades that student received.

  2. Obtain a class average by averaging the final grades of all students.

You cannot perform these two tasks in parallel because they are not independent of each other: to calculate the class average, you must first calculate the final grade for each student.

Although you cannot perform task 1 and task 2 simultaneously, there is still an opportunity for parallelization here.

The pseudocode in Listing 4-1 proceeds through each student in the class, one by one, calculating the average of that student’s grades and caching it in the student object. Although this example computes each grade average sequentially, there is no reason the grade averages for all the students couldn’t be calculated at the same time. Because one student’s grades do not affect another student’s grades, you can calculate each final grade independently and simultaneously instead of looping through the same set of instructions for each student in turn. This is the idea behind data parallelism.

Listing 4-1  Pseudocode that computes the final grade for each student

 
// Assume 'class' is a collection of 'student' objects.
foreach(student in class)
{
    // Assume getGrades() returns a collection of integer grades.
    grades = student.getGrades();
 
    sum = 0;
    count = 0;
 
    // Iterate through each grade, adding it to sum.
    foreach(grade in grades)
    {
        sum += grade;
        count++;
    }
 
    // Cache the average grade in the student object.
    student.averageGrade = sum / count;
}

Data parallelism consists of taking a single task (in this case, calculating a student’s average grade) and repeating it over multiple sets of data. Because no student’s grades affect any other student’s grades, you can calculate each student’s average in parallel.

To express this programmatically, first separate your task (calculating a student’s final grade) from your data (the list of students in the class). Listing 4-2 shows how you can isolate the final-grading task.

Listing 4-2  The isolated grade average task

task calculateAverageGradeForStudent( student )
{
    // Assume getGrades() returns a collection of integer grades.
    grades = student.getGrades();
 
    sum = 0;
    count = 0;
 
    // Iterate through each grade, adding it to sum.
    foreach(grade in grades)
    {
        sum += grade;
        count++;
    }
 
    // Store the average grade in the student object.
    student.averageGrade = sum / count;
}

Now that you have the task isolated, you need to apply it to the individual students in the class in parallel. Because OpenCL has native support for parallel computing, you can rewrite the task shown in Listing 4-2 as an OpenCL kernel function. Using the OpenCL framework API, you can then enqueue this kernel to run on a device, where each compute unit applies an instance of the kernel (that is, a work item) to a different set of data.
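For instance, a kernel version of the final-grading task might look like the following sketch. The flattened buffer layout and the parameter names are assumptions made here for illustration: the host is assumed to pack every student’s grades into a single grades buffer and to record each student’s starting offset and grade count.

__kernel void calculateAverageGrades(__global const int *grades,        // all grades, packed student by student
                                     __global const int *gradeOffsets,  // index of each student's first grade
                                     __global const int *gradeCounts,   // number of grades for each student
                                     __global float *averageGrades)     // one result per student
{
    // Each work item handles the student whose index matches its global ID.
    size_t student = get_global_id(0);

    int offset = gradeOffsets[student];
    int count  = gradeCounts[student];

    // Sum that student's grades, just as the pseudocode does.
    int sum = 0;
    for (int i = 0; i < count; i++) {
        sum += grades[offset + i];
    }

    // Store the average grade in the output buffer.
    averageGrades[student] = (float)sum / (float)count;
}

With a layout like this, the host could enqueue the kernel with clEnqueueNDRangeKernel, specifying a global work size equal to the number of students, so that each work item computes the average for exactly one student.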

The challenge in parallelizing your application is to identify the tasks that you can distribute across multiple compute units. Sometimes, as in this example, the identification is relatively trivial and requires few algorithmic changes. Other times, it might require designing a new algorithm from scratch that lends itself more readily to parallelization. Although there is no universal rule for parallelizing your application, there are a few tips you can keep in mind: