Improving Performance On the CPU

When optimizing code to run on a GPU or a CPU, it is important to take into consideration the strengths and limitations of the device you are writing for. This chapter focuses on optimizing for the CPU. CPUs have fewer processing elements than GPUs, but more memory (both a large cache and a much larger amount of RAM). CPU memory access is fastest when data is in cache.

The chapter describes how to benchmark the speed of OpenCL code running on a CPU and how to set performance objectives. It provides tips for writing efficient OpenCL code. It also walks through an iterative process in which a simple image filter application is tuned for best performance on a CPU.

See Tuning Performance On the GPU for a description of how to optimize code that will run on GPUs.

Before Optimizing Code

Before you optimize code:

  1. Decide whether the code really needs to be optimized. Optimization can take significant time and effort. Weigh the costs and benefits of optimization before starting any optimization effort.

  2. Estimate optimal performance. Run some simple kernels on your CPU device to estimate its capabilities. You can use the techniques described in Measuring Performance On Devices to measure how long kernel code takes to run. See Generating the Compute/Memory Access Peak Benchmark for examples of code you can use to test memory access speed and processing speed.

  3. Generate or collect sample data to feed through each iteration of optimization. Run the sample data through the unoptimized original code and save the results. Then run each major iteration of the optimized code against the same data and compare the results to the original results to ensure your output has not been corrupted by the changed code. (A minimal comparison sketch follows this list.)
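
The comparison itself can be as simple as checking each output value against the saved reference within a small tolerance. Here is a minimal host-side sketch in C; the function name and tolerance handling are illustrative, not part of the original example:

#include <math.h>
#include <stdio.h>

/* Compare an optimized result against the saved reference output.
   Returns the number of mismatches; the tolerance is illustrative. */
static int compareResults(const float *reference, const float *candidate,
                          size_t count, float tolerance) {
    int mismatches = 0;
    for (size_t i = 0; i < count; i++) {
        if (fabsf(reference[i] - candidate[i]) > tolerance) {
            if (mismatches < 10)  /* Print only the first few differences */
                printf("Mismatch at %zu: %g vs %g\n",
                       i, reference[i], candidate[i]);
            mismatches++;
        }
    }
    return mismatches;
}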

Reducing Overhead

The sections that follow describe some general principles you can follow to improve the efficiency of OpenCL code intended to run on a CPU.

Estimating Optimal Performance

Before optimizing code, it is best to know what kind of performance is achievable. See Measuring Performance On Devices for information about how to set a timer to measure the speed at which a kernel runs.

The main factor determining the execution speed of an OpenCL kernel is memory usage; this is the case for both CPU and GPU devices. Benchmarking the speed of the kernel function in Listing 15-1 provides a way to estimate the memory speed of an OpenCL device.

Listing 15-1  Kernel for estimating optimal memory access speed

kernel void copyBuffer(global const float * in, global float * out) {
  int i = get_global_id(0);
  out[i] = in[i];  // R+W one float
}

This asymptotic (maximum) memory speed can be used to estimate the speed of a memory-bound algorithm on large data.
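
A host-side routine to measure this might look like the following sketch. It uses standard OpenCL event profiling; the function name is illustrative, the queue is assumed to have been created with CL_QUEUE_PROFILING_ENABLE, and the kernel arguments are assumed to be set already:

#include <OpenCL/opencl.h>

/* Measure the memory bandwidth of the copyBuffer kernel over n floats.
   Returns bytes per second. Error checking omitted for brevity. */
static double measureCopyBandwidth(cl_command_queue queue,
                                   cl_kernel copyKernel, size_t n) {
    cl_event event;
    size_t globalSize = n;
    clEnqueueNDRangeKernel(queue, copyKernel, 1, NULL, &globalSize, NULL,
                           0, NULL, &event);
    clWaitForEvents(1, &event);

    cl_ulong start, end;
    clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START,
                            sizeof(start), &start, NULL);
    clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END,
                            sizeof(end), &end, NULL);
    clReleaseEvent(event);
    /* Timestamps are in nanoseconds; each work item moves 8 bytes (R+W). */
    return (8.0 * (double)n) / ((double)(end - start) * 1e-9);
}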

Take, for example, the box average kernel shown in Listing 15-2, before any optimization. This kernel accepts a single-channel floating-point image as input and computes a single-channel floating-point image in which each output pixel (x,y) is the average value of all pixels in a square box centered at (x,y). A w by h image is stored in a buffer float * A, where pixel (x,y) is stored in A[x+w*y].

Listing 15-2  The boxAvg kernel, first version

constant int RANGE = 2;
kernel void boxAvg1(int w, int h, global const float * in, global float * out) {
  int x = get_global_id(0);  // Pixel to process is (x,y)
  int y = get_global_id(1);
  float sumA = 0.0f;         // Sum of pixel values
  float sum1 = 0.0f;         // Number of pixels
  for (int dy=-RANGE;dy<=RANGE;dy++)
    for (int dx=-RANGE;dx<=RANGE;dx++) {
      int xx = x + dx;
      int yy = y + dy;
      // Accumulate if inside image
      if (xx>=0 && xx<w && yy>=0 && yy<h) { sumA += in[xx + w*yy]; sum1 += 1.0f; }
    }
  out[x+w*y] = sumA/sum1;
}

For each pixel, we read 25 input values (RANGE=2) and write one output value. That’s 26*4 = 104 bytes of memory traffic per pixel. The maximum speed of this kernel should be near (36e9 B/s) / (104 B/pix) = 346 MP/s on the GPU, and 12e9/104 = 115 MP/s on the CPU. The actual numbers are 129 MP/s and 103 MP/s, respectively.

The following two sections show how to tune the code to compute boxAvg on a CPU. CPU and GPU tuning have a lot in common, but at some point the optimization techniques differ, and reaching the best performance usually requires two versions of the kernel—one for the CPU and one for the GPU.

Tuning OpenCL Code For the CPU

To make the best use of the full computing and memory potential of a modern CPU, tune your kernels iteratively, measuring the effect of each change, as the following example illustrates.

Example: Tuning a Kernel To Optimize Performance On a CPU

In the example that follows, we tune our sample boxAvg kernel in several ways and check how each modification affects performance on the CPU:

  1. We divide the computation into two passes. See Dividing Kernel Computation Into Two Passes.

  2. We modify the horizontal pass to compute one row per work item instead of one single pixel. See Optimizing the Horizontal Pass.

  3. We modify the algorithm to read fewer values per pixel and to incrementally update the sum rather than computing it each time. See Listing 15-5.

  4. We modify the horizontal pass by moving the division and conditionals out of the inner loop. See Listing 15-6.

  5. We modify the vertical pass to combine rows; each work item computes a block of rows. See Optimizing the Vertical Pass.

  6. We ensure that the image width (w) is a multiple of four so we can use faster 16-byte I/O functions on float4 data. See Listing 15-8.

  7. We ensure that the code works for any image width. See Listing 15-9.

  8. We fuse the kernels. See Putting the Horizontal and Vertical Passes Together.

Dividing Kernel Computation Into Two Passes

The computational work of the boxAvg kernel can be broken into two passes. The first pass will compute the horizontal average of the input pixels; the second pass will compute a vertical average of the horizontal averages:

Listing 15-3  The boxAvg kernel in two passes

// Horizontal pass v1. Global work size: w x h
kernel void boxAvgH1(int w,int h,global const float * in,global float * out) {
  int x = get_global_id(0);  // Pixel to process is (x,y)
  int y = get_global_id(1);
  float sumA = 0.0f;         // Sum of pixel values
  float sum1 = 0.0f;         // Number of pixels
  for (int dx=-RANGE;dx<=RANGE;dx++) {
    int xx = x + dx;
    // Accumulate if inside image
    if (xx>=0 && xx<w) { sumA += in[xx+w*y]; sum1 += 1.0f; }
  }
  out[x+w*y] = sumA/sum1;
}
 
// Vertical pass v1. Global work size: w x h
kernel void boxAvgV1(int w,int h,global const float * in,global float * out) {
  int x = get_global_id(0);  // pixel to process is (x,y)
  int y = get_global_id(1);
  float sumA = 0.0f;         // sum of pixel values
  float sum1 = 0.0f;         // number of pixels
  for (int dy=-RANGE;dy<=RANGE;dy++) {
    int yy = y + dy;
    // Accumulate if inside image
    if (yy>=0 && yy<h) { sumA += in[x + w*yy]; sum1 += 1.0f; }
  }
  out[x+w*y] = sumA/sum1;
}
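
On the host, the two passes might be chained through an intermediate buffer along these lines (a sketch; the variable names are illustrative and error checking is omitted):

/* Names (queue, kernelH1, kernelV1, inBuffer, tmpBuffer, outBuffer, w, h)
   are assumed to be set up elsewhere; w and h are cl_int. */
size_t globalSize[2] = { (size_t)w, (size_t)h };

clSetKernelArg(kernelH1, 0, sizeof(cl_int), &w);
clSetKernelArg(kernelH1, 1, sizeof(cl_int), &h);
clSetKernelArg(kernelH1, 2, sizeof(cl_mem), &inBuffer);
clSetKernelArg(kernelH1, 3, sizeof(cl_mem), &tmpBuffer);   /* Horizontal averages */
clEnqueueNDRangeKernel(queue, kernelH1, 2, NULL, globalSize, NULL, 0, NULL, NULL);

clSetKernelArg(kernelV1, 0, sizeof(cl_int), &w);
clSetKernelArg(kernelV1, 1, sizeof(cl_int), &h);
clSetKernelArg(kernelV1, 2, sizeof(cl_mem), &tmpBuffer);
clSetKernelArg(kernelV1, 3, sizeof(cl_mem), &outBuffer);
clEnqueueNDRangeKernel(queue, kernelV1, 2, NULL, globalSize, NULL, 0, NULL, NULL);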

In both cases, memory transfers take 24 B/pix (5 reads + 1 write of 4 bytes each), and the estimated speed of each pass is 500 MP/s for the CPU. Actual values are shown in the right column of Table 15-1:

Table 15-1  Comparing estimated and actual memory transfer speeds

Kernel       Estimate     Actual
boxAvgH1     500 MP/s     523 MP/s
boxAvgV1     500 MP/s     563 MP/s

We can see some effects of the cache hierarchy. The copyBuffer value of 12 GB/s corresponds to external memory speed. In the case of boxAvg, we reuse each input value five times, and the data is found in the cache, giving speeds better than our estimate based on external memory speed only.

We can compute an absolute upper bound: we need to load each input value once from external memory, and store each output value once, and we assume that the subsequent input accesses are in cache and instantaneous. That’s only 8 B/pix, giving an upper bound of 1,500 MP/s (the speed of image copy).

Optimizing the Horizontal Pass

To take better advantage of the cache, we can have each work item process several pixels instead of only one pixel. Since we need only a few threads to saturate the CPU (this is not the case for GPUs), we can actually have only one work item per row (for the horizontal pass) or per column (for the vertical pass), like this:

Listing 15-4  Modify the horizontal pass to compute one row per work item instead of one pixel

// Horizontal pass v2. Global work size: H
kernel void boxAvgH2(int w,int h,global const float * in,global float * out)
{
  int y = get_global_id(0);       // Row to process is y
  // Process the row, loop on all pixels
  for (int x=0; x<w; x++) {
    float sumA = 0.0f;            // Sum of pixel values
    float sum1 = 0.0f;            // Number of pixels
    for (int dx=-RANGE;dx<=RANGE;dx++) {
      int xx = x + dx;
      // Accumulate if inside image
      if (xx>=0 && xx<w) { sumA += in[xx+w*y]; sum1 += 1.0f; }
    }
    out[x+w*y] = sumA/sum1;      // Store output value (x,y)
  }
}
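
The corresponding vertical variant, boxAvgV2, is measured in Table 15-2 but not listed in this chapter; a sketch of it, following the same conventions (one work item per column), might look like this:

// Vertical pass v2 (sketch). Global work size: W
kernel void boxAvgV2(int w,int h,global const float * in,global float * out)
{
  int x = get_global_id(0);       // Column to process is x
  // Process the column, loop on all pixels
  for (int y=0; y<h; y++) {
    float sumA = 0.0f;            // Sum of pixel values
    float sum1 = 0.0f;            // Number of pixels
    for (int dy=-RANGE;dy<=RANGE;dy++) {
      int yy = y + dy;
      // Accumulate if inside image
      if (yy>=0 && yy<h) { sumA += in[x+w*yy]; sum1 += 1.0f; }
    }
    out[x+w*y] = sumA/sum1;       // Store output value (x,y)
  }
}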

As shown in Table 15-2, for the horizontal pass we see an increased benefit from using fast cache memory. The input values used in one iteration of the x loop are reused immediately in the next iteration, and will be in L1 or L2 cache. For the vertical pass, on the other hand, we now have different threads processing different columns at the same time. Each iteration of the y loop now accesses a different cache line. This slows execution speed.

Table 15-2  Comparing optimal and actual speeds of work item row and column processing

Kernel       Upper Bound    Actual
boxAvgH2     1500 MP/s      931 MP/s
boxAvgV2     1500 MP/s      456 MP/s

In this version of the code, the autovectorizer can be of real help. Let’s look at the execution times for different workgroup sizes, corresponding more or less to the block size used to vectorize the code. More precisely, the kernel code is vectorized up to the actual vector size (float4 for SSE, or float8 for AVX), and then when the workgroup size increases, several vector variants are executed together.

For example, Table 15-3 shows that with a workgroup size of 4, we execute float4 variants of the kernel, merging together 4 work items, and with a workgroup size of 32 we would execute 8 of these float4 variants.

Table 15-3  Effect of work group size on execution time

Workgroup size    Execution time (ms)
1                 260
2                 143
4                 37
8                 45
16                56
32                66
64                66
128               67

The best performance was achieved using the float4 vectorized variant; it is 7x faster than the scalar variant.
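
To run such an experiment, the host passes an explicit local work size when enqueueing the kernel. A minimal sketch (variable names are illustrative; the queue is assumed to be profiling-enabled as before):

/* Time boxAvgH2 for several workgroup sizes. In OpenCL 1.x the global
   size must be divisible by the local size, which holds here for h=4096. */
size_t localSizes[] = { 1, 2, 4, 8, 16, 32, 64, 128 };
for (int i = 0; i < 8; i++) {
    size_t globalSize = h;                  /* One work item per row */
    size_t localSize = localSizes[i];
    cl_event event;
    clEnqueueNDRangeKernel(queue, kernelH2, 1, NULL, &globalSize, &localSize,
                           0, NULL, &event);
    clWaitForEvents(1, &event);
    /* Read CL_PROFILING_COMMAND_START/END from the event, as shown
       earlier, to get the execution time for this local size. */
    clReleaseEvent(event);
}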

In addition to exploiting the higher bandwidth of vector instructions, the autovectorizer effectively adds blocking (or tiling) to the outer scheduling loop in the framework, which can be used to better match the CPU cache structure.

We can make the horizontal kernel better. Instead of computing the sum of 2*RANGE+1 values at each iteration of the x loop, we can simply update the sum between two consecutive iterations, as illustrated in Listing 15-5:

Listing 15-5  Modify the algorithm to read fewer values per pixel and to incrementally update the sum

// Horizontal pass v3. Global work size: H
kernel void boxAvgH3(int w,int h,global const float * in,global float * out) {
  int y = get_global_id(0);                             // Row to process is y
  global const float * inRow = in + y*w;                // Beginning of input row
  float sumA = 0.0f;                                    // Sum of values in moving segment
  float sum1 = 0.0f;                                    // Number of values in moving segment
  // Initialize the first sums with segment 0..RANGE-1
  for (int x=0; x<RANGE; x++) { sumA += inRow[x]; sum1 += 1.0f; }
 
  for (int x=0; x<w; x++) {
    // Here, sums are set to segment x-RANGE-1..x+RANGE-1
    // update them to x-RANGE..x+RANGE.
    // Remove x-RANGE-1.
    if (x-RANGE-1 >= 0) { sumA -= inRow[x-RANGE-1]; sum1 -= 1.0f; }
    if (x+RANGE < w) { sumA += inRow[x+RANGE]; sum1 += 1.0f; } // insert x+RANGE
    // Store current value
    out[x+w*y] = sumA/sum1;
  }
}

This variant reduces the number of memory accesses from 6 to 3 per iteration of the x loop. The execution speed of this variant is now 1366 MP/s (and only 822 MP/s without the autovectorizer). This is 91% of our upper bound.

We can move the conditionals and the division out of the x loop by splitting it into three parts:

Listing 15-6  Modify the horizontal pass by moving division and conditionals out of the inner loop

// Horizontal pass v4. Global work size: H
kernel void boxAvgH4(int w, int h, global const float * in, global float * out) {
  int y = get_global_id(0);                            // Row to process is y
  global const float * inRow = in + y*w;               // Beginning of input row
  global float * outRow = out + y*w;                   // Beginning of output row
  float sumA = 0.0f;
  float sum1 = 0.0f;
 
  // Left border
  int x = -RANGE;
  for (; x<=RANGE; x++) {
    // Here, sumA corresponds to segment 0..x+RANGE-1, update to 0..x+RANGE.
    sumA += inRow[x+RANGE]; sum1 += 1.0f;
    if (x >= 0) outRow[x] = sumA/sum1;
  }
  // x is RANGE+1 here
 
  // Internal pixels
  float k = 1.0f/(float)(2*RANGE+1);                  // Constant weight for internal pixels
  for (; x+RANGE<w; x++) {
    // Here, sumA corresponds to segment x-RANGE-1..x+RANGE-1,
    // Update to x-RANGE..x+RANGE.
    sumA -= inRow[x-RANGE-1];
    sumA += inRow[x+RANGE];
    outRow[x] = sumA * k;
  }
  // x is w-RANGE here
 
  // Right border
  for (; x < w; x++) {
    // Here, sumA corresponds to segment x-RANGE-1..w-1, update to x-RANGE..w-1.
    sumA -= inRow[x - RANGE - 1]; sum1 -= 1.0f;
    outRow[x] = sumA/sum1;
  }
  // x is w here, and we are done
}

This variant runs at 1457 MP/s, or 97% of the upper bound.

Optimizing the Vertical Pass

Now we optimize the vertical kernel. To make better use of the cache, we process entire rows together: each work item is responsible for computing one block of rows. The kernel in Listing 15-7 is meant to be enqueued with a small number of work items, one per CPU thread, which is enough.

Listing 15-7  Modify vertical pass to combine rows; each work item computes a block of rows

// Vertical pass v3. Global work size: >= number of CPU cores.
kernel void boxAvgV3(int w,int h,global const float * in,global float * out) {
  // Number of rows to process in each work item (rounded up)
  int rowsPerItem = (h+get_global_size(0)-1)/get_global_size(0);
  int y0 = rowsPerItem * get_global_id(0);         // Our range of rows is y0..y1-1
  int y1 = min(h, y0 + rowsPerItem);
  for (int y=y0; y<y1; y++) {
    // Accumulate into row y all rows in y-RANGE..y+RANGE intersected with 0..h-1
    int ya0 = max(0, y-RANGE);
    int ya1 = min(h, y+RANGE+1);
    float k = 1.0f/(float)(ya1-ya0);              // 1/(number of rows)
    global float * outRow = out + w*y;            // Output row
    for (int x=0; x<w; x++) outRow[x] = 0.0f;     // Zero output row
    for (int ya=ya0; ya<ya1; ya++) {
      global const float * inRow = in + w*ya;     // Input row
      for (int x=0; x<w; x++) outRow[x] += k * inRow[x];
    }
  }
}
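
On the host, a suitable global size can be derived from the number of compute units the device reports. A minimal sketch (names illustrative; argument setup and error checking omitted):

/* Launch boxAvgV3 with roughly one work item per CPU core. */
cl_uint computeUnits;
clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                sizeof(computeUnits), &computeUnits, NULL);
size_t globalSize = computeUnits;
clEnqueueNDRangeKernel(queue, kernelV3, 1, NULL, &globalSize, NULL,
                       0, NULL, NULL);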

This kernel is executed at 891 MP/s on our test machine. If we can ensure that the image width is always a multiple of four, we can vectorize all x loops to float4:

Listing 15-8  Ensure the image width is always a multiple of 4

    global float4 * outRow = (global float4 *)(out + w*y);          // Output row
    for (int x=0; x<w/4; x++) outRow[x] = 0.0f;                     // Zero output row
    for (int ya=ya0; ya<ya1; ya++) {
      global const float4 * inRow = (global float4 *)(in + w*ya);   // Input row
      for (int x=0; x<w/4; x++) outRow[x] += k * inRow[x];
    }

This code runs at 1442 MP/s, 96% of the upper bound. When accessing data through a pointer to float4, the compiler assumes the pointer is aligned to multiples of the size of the type, here 16 bytes, and generates an efficient aligned move instruction (movaps).

This code is not safe, since we can only assume in and out are four-byte aligned (the size of a float). A safer variant that works for any image width would look something like this:

Listing 15-9  A safer variant that will work for any image width

// Zero output row using aligned memory access
global float * outRow = out + w*y;
int x = 0;
// Iterate by 1 until 16-byte aligned
for (; x<w && ((intptr_t)&outRow[x] & 15); x++) outRow[x] = 0.0f;
// Iterate by 4 on aligned address while we can
// outRow4 is 16-byte aligned
global float4 * outRow4 = (global float4 *)(outRow + x);
for (; x+4<=w; x+=4,outRow4++) outRow4[0] = 0.0f;
// Iterate by 1 until w is reached
for (; x<w; x++) outRow[x] = 0.0f;

We now have near-optimal versions of both horizontal and vertical kernels. We can call them in sequence to compute the result, and run at a speed near 700 MP/s.

Putting the Horizontal and Vertical Passes Together

We can now fuse the two kernels together, and loop over the entire input image only once. This is illustrated in Listing 15-10:

Listing 15-10  Fused kernel

// Fused H+V variant. Global work size: any, >= number of CPU cores.
// aux[w*get_global_size(0)] is temporary storage, 1 row for each work item.
kernel void boxAvg2(int w, int h, global const float * in,
                       global float * out, global float * aux) {
  // Number of rows to process in each work item (rounded up)
  int rowsPerItem = (h+get_global_size(0)-1)/get_global_size(0);
  int y0 = rowsPerItem * get_global_id(0);          // Our range of rows is y0..y1-1
  int y1 = y0 + rowsPerItem;
  aux += get_global_id(0) * w;                      // Point to our work item’s row of temporary storage
  float k = 1.0f/(float)(2*RANGE+1);                // Constant weight for internal pixels
 
  // Process our rows. We need to process extra RANGE rows before and after.
  for (int y=y0-RANGE; y<y1+RANGE; y++) {
    // Zero new row entering in our update scope before we start
    // accumulating values in it.
    if (y+RANGE < min(h, y1)) {
      global float4 * outRow4 = (global float4 *)(out + w*(y + RANGE));
      for (int x=0; x<(w/4); x++) outRow4[x] = 0.0f;
    }
    if (y<0 || y>=h) continue; // Out of range
    // Compute the horizontal pass on input row y:
    // the boxAvgH4 code goes here, writing its output
    // into aux[0..w-1] instead of the output image.
    // Accumulate this row on output rows y-RANGE..y+RANGE
    for (int dy=-RANGE; dy<=RANGE; dy++) {
      int yy = y + dy;
      if (yy < max(0, y0) || yy >= min(h, y1)) continue; // Out of range
      // Get number of rows accumulated in row YY, to get the weight
      int nr = 1 + min(h-1, yy+RANGE)-max(0, yy-RANGE);
      float u = 1.0f/(float)nr;
      // Accumulate AUX in row YY
      global float4 * outRow4 = (global float4 *)(out + w*yy);
      global float4 * aux4 = (global float4 *)(aux);
      for (int x=0; x<(w/4); x++) outRow4[x] += u * aux4[x];
    }
  }
}

This fused version runs at 1166 MP/s. Some rows will be processed twice, since we have to compute horizontal filters on rows y0-RANGE to y1+RANGE-1 to update output rows y0 to y1-1. At any given time during the execution, we will access one row in aux, one input row, and 2*RANGE+1 output rows. For a 4096x4096 image, each row is 16 KiB, and all 7 rows fit in L2 cache.
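
Note that aux must provide one row of w floats per work item, so the host sizes it from the chosen global work size. A minimal sketch (names illustrative; error checking omitted):

/* Allocate one scratch row per work item for the fused kernel. */
size_t globalSize = computeUnits;   /* One work item per core, as before */
cl_mem auxBuffer = clCreateBuffer(context, CL_MEM_READ_WRITE,
                                  sizeof(cl_float) * w * globalSize,
                                  NULL, NULL);
clSetKernelArg(fusedKernel, 4, sizeof(cl_mem), &auxBuffer);  /* 5th argument: aux */
clEnqueueNDRangeKernel(queue, fusedKernel, 1, NULL, &globalSize, NULL,
                       0, NULL, NULL);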