Retired Document
Important: OpenCL was deprecated in macOS 10.14. To create high-performance code on GPUs, use the Metal framework instead. See Metal.
Improving Performance On the CPU
When optimizing code to run on a GPU or a CPU, it is important to take into consideration the strengths and limitations of the device you are writing for. This chapter focuses on optimizing for the CPU. CPUs have fewer processing elements and more memory (both a large cache and a much larger amount of RAM) than GPUs, which have more processing elements and comparatively less memory. CPU memory access is fastest when data is in cache.
The chapter describes how to benchmark the speed of OpenCL code running on a CPU and how to set performance objectives. It provides tips for writing efficient OpenCL code. It also provides an example of an iterative process in which performance of a simple image filter application is tuned for best performance on a CPU.
See Tuning Performance On the GPU for a description of how to optimize code that will run on GPUs.
Before Optimizing Code
Before you optimize code:
Decide whether the code really needs to be optimized. Optimization can take significant time and effort. Weigh the costs and benefits of optimization before starting any optimization effort.
Estimate optimal performance. Run some simple kernels on your CPU device to estimate its capabilities. You can use the techniques described in Measuring Performance On Devices to measure how long kernel code takes to run. See Generating the Compute/Memory Access Peak Benchmark for examples of code you can use to test memory access speed and processing speed.
Generate or collect sample data to feed through each iteration of optimization. Run the unoptimized original code through the sample code and save the results. Then run each major iteration of the optimized code against the same data and compare the results to the original results to ensure your output has not been corrupted by the changed code.
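One lightweight way to apply this is a tolerance-based diff between the saved reference output and each optimized iteration. The helper below is a sketch of our own (the function name and tolerance policy are not part of OpenCL); a small per-element tolerance allows for the floating-point reordering that vectorization introduces:

```c
#include <math.h>
#include <stddef.h>

/* Compare an optimized kernel's output against the saved reference
 * output, allowing a small per-element tolerance for floating-point
 * reordering introduced by optimization. Returns the number of
 * elements that differ by more than tol. */
size_t count_mismatches(const float *reference, const float *optimized,
                        size_t n, float tol) {
    size_t bad = 0;
    for (size_t i = 0; i < n; i++) {
        if (fabsf(reference[i] - optimized[i]) > tol) bad++;
    }
    return bad;
}
```

A nonzero return after an optimization pass indicates the changed code corrupted the output.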
Reducing Overhead
Here are some general principles you can follow to improve the efficiency of OpenCL code intended to run on a CPU:
Choose an efficient algorithm.
OpenCL can take advantage of all the devices in the system, but only if the algorithms in your program are written to allow parallel processing.
Consider the following when choosing an algorithm:
When sending work to a CPU, which typically has fewer cores than a GPU, it is important to match the number of work items to the number of threads the CPU can effectively support.
OpenCL is most efficient when working with large datasets. If possible, select an algorithm that works on large chunks of data or merge several smaller tasks into one.
Building an OpenCL program is computationally expensive and should ideally occur only once in a process.
Be sure to take advantage of tools in OS X v10.7 or later that allow you to compile once and then run many times. If you instead choose to compile a kernel during runtime, you will need to execute that kernel many times to amortize the cost of compiling it. You can save the binary after the first time the program is run and reuse the compiled code on subsequent invocations, but be prepared to recompile the kernel if the build fails because of an OpenCL revision or a change in the hardware of the host machine.
You can also use bitcode generated by the OpenCL compiler instead of source code; if you do this, compilation speed will be much faster and you won’t have to ship source code with your application.
Allocating and freeing OpenCL resources (memory objects, kernels, and so on) takes time. Reuse these objects whenever possible instead of releasing them and recreating them repeatedly. Note, however, that image objects can be reused only if they are the same size and pixel format as needed by the new image.
Experiment with your code to find the kernel size that works best.
Using smaller kernels can be efficient because each tiny kernel uses minimal resources, and breaking a job into many small kernels can allow for the creation of very large and efficient workgroups. On the other hand, starting each kernel takes between 10 and 100 μs. When each kernel exits, the results must be stored in global memory. Because reading and writing to global memory is expensive, concatenating many small kernels into one large kernel may save considerable overhead.
To determine the ideal kernel size for your application, experiment with your code to find the kernel size that provides optimal performance.
Use OpenCL’s built-in functions whenever possible.
Optimal code is generated for these functions.
Take advantage of the memory subsystem of the device.
When writing for the CPU, take advantage of the memory subsystem: reuse data while it’s still in L1 or L2 cache. To achieve this, use loop blocking and access memory in a cache-friendly pattern.
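Loop blocking can be illustrated with a cache-blocked matrix transpose in plain C. The tile size below is an assumption to tune against your cache sizes, not a prescribed value:

```c
#include <stddef.h>

#define BLOCK 32  /* tile edge; choose so a few tiles fit in L1 cache */

/* Transpose an n x n matrix of floats tile by tile, so that both the
 * reads from src and the writes to dst stay within cache-sized tiles
 * instead of striding across the whole matrix. */
void transpose_blocked(const float *src, float *dst, size_t n) {
    for (size_t i0 = 0; i0 < n; i0 += BLOCK)
        for (size_t j0 = 0; j0 < n; j0 += BLOCK) {
            size_t imax = i0 + BLOCK < n ? i0 + BLOCK : n;
            size_t jmax = j0 + BLOCK < n ? j0 + BLOCK : n;
            for (size_t i = i0; i < imax; i++)
                for (size_t j = j0; j < jmax; j++)
                    dst[j * n + i] = src[i * n + j];
        }
}
```

The unblocked version touches a new cache line on every write; the blocked version reuses each loaded line BLOCK times before it is evicted.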
Avoid divergent execution.
The CPU predicts the result of conditional jump instructions (corresponding to if, for, while, and so on) and starts processing the selected branch before knowing the effective result of the test. If the prediction is wrong, the entire pipeline needs to be flushed, and you lose cycles. If possible, use conditional assignment instead.
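As an illustration, here is a clamp written with conditional assignment instead of branches, in plain C (OpenCL C also offers the equivalent select and clamp built-ins):

```c
/* Branchy version: the CPU must predict both comparisons. */
float clamp_branchy(float v, float lo, float hi) {
    if (v < lo) return lo;
    if (v > hi) return hi;
    return v;
}

/* Branchless version: conditional assignment that the compiler can
 * turn into conditional-move (or SIMD blend) instructions. */
float clamp_branchless(float v, float lo, float hi) {
    v = (v < lo) ? lo : v;
    v = (v > hi) ? hi : v;
    return v;
}
```

Both return the same values; the branchless form avoids mispredicted jumps in data-dependent inner loops.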
Write simple scalar code first. The compiler and the autovectorizer work best on scalar code and can generate near-optimal code with no effort required from you. If the autovectorizer provides sub-optimal results, add vectors to the code by hand.
Use the -cl-denorms-are-zero option in clBuildProgram, unless you need to use denormals (denormals are very small numbers with a slightly different floating-point representation). Denormal handling can be extremely slow (100x slower) and can lead to puzzling benchmark results.
Avoid using images. CPUs are not optimized for graphics processing: they provide no hardware acceleration for images, and image access is slower than the equivalent buffer access.
CPU access to global memory is not as expensive as GPU access to global memory.
Estimating Optimal Performance
Before optimizing code, it is best to know what kind of performance is achievable. See Measuring Performance On Devices for information about how to set a timer to measure the speed at which a kernel runs.
The main factor determining the execution speed of an OpenCL kernel is memory usage; this is the case for both CPU and GPU devices. Benchmarking the speed of the kernel function in Listing 15-1 provides a way to estimate the memory speed of an OpenCL device.
Listing 15-1 Kernel for estimating optimal memory access speed
kernel void copyBuffer(global const float * in, global float * out) {
    int i = get_global_id(0);
    out[i] = in[i]; // R+W one float
}
The asymptotic (maximum) memory speed measured this way can be used to estimate the speed of a memory-bound algorithm on large data.
Take, for example, the box average kernel before it has been optimized, shown in Listing 15-2. This kernel accepts a single-channel floating-point image as input and computes a single-channel floating-point image where each output pixel (x,y) is the average value of all pixels in a square box centered at (x,y). A w by h image is stored in a buffer float * A, where pixel (x,y) is stored in A[x+w*y].
Listing 15-2 The boxAvg kernel, first version
constant int RANGE = 2;

kernel void boxAvg1(int w, int h, global const float * in, global float * out) {
    int x = get_global_id(0); // Pixel to process is (x,y)
    int y = get_global_id(1);
    float sumA = 0.0f; // Sum of pixel values
    float sum1 = 0.0f; // Number of pixels
    for (int dy=-RANGE; dy<=RANGE; dy++)
        for (int dx=-RANGE; dx<=RANGE; dx++) {
            int xx = x + dx;
            int yy = y + dy;
            // Accumulate if inside image
            if (xx>=0 && xx<w && yy>=0 && yy<h) { sumA += in[xx + w*yy]; sum1 += 1.0f; }
        }
    out[x+w*y] = sumA/sum1;
}
For each pixel, we compute the average of 25 input values (RANGE=2). That’s 26*4=104 bytes per pixel. The maximum speed of this kernel should be near (36e9 B/s) / (104 B/pix) = 346 MP/s on the GPU, and 12e9/104 = 115 MP/s for the CPU. The actual numbers are 129 MP/s and 103 MP/s respectively.
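A plain-C port of the boxAvg1 kernel is a convenient reference implementation for the correctness checks recommended in Before Optimizing Code; the function name below is our own:

```c
#define RANGE 2

/* Plain-C port of the boxAvg1 kernel of Listing 15-2: each output
 * pixel (x,y) is the average of the input pixels in a
 * (2*RANGE+1)^2 box clipped to the image bounds. Useful as a
 * reference to validate optimized variants against. */
void box_avg_ref(int w, int h, const float *in, float *out) {
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++) {
            float sumA = 0.0f, sum1 = 0.0f;
            for (int dy = -RANGE; dy <= RANGE; dy++)
                for (int dx = -RANGE; dx <= RANGE; dx++) {
                    int xx = x + dx, yy = y + dy;
                    if (xx >= 0 && xx < w && yy >= 0 && yy < h) {
                        sumA += in[xx + w * yy];
                        sum1 += 1.0f;
                    }
                }
            out[x + w * y] = sumA / sum1;
        }
}
```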
The following two sections show how to tune the code to compute boxAvg on a CPU. CPU and GPU tuning have a lot in common, but at some point the optimization techniques differ, and reaching the best performance usually requires two versions of the kernel—one for the CPU and one for the GPU.
Tuning OpenCL Code For the CPU
To make the best use of the full computing and memory potential of a modern CPU:
Try the autovectorizer first.
The autovectorizer can transform scalar OpenCL code into suitable vector code automatically. See Autovectorizer. The autovectorizer works on a restricted class of code, and in some cases you may find it necessary to vectorize the code by hand. But the first step is to try scalar code and let the autovectorizer vectorize for you. Vectorize manually only if really needed.
Use all cores of the processor.
OpenCL automatically ensures that all processor cores are used. Work items are scheduled in different tasks submitted to Grand Central Dispatch (GCD) and then executed on all available CPU cores. All that’s needed is to have enough work items to run on all threads.
Use the whole width (16 bytes for SSE or 32 bytes for AVX) of the SIMD execution units.
Use the memory cache hierarchy efficiently.
OpenCL execution speed on CPUs is related to the optimal usage of the various levels of data cache. Tuning the code to match the cache levels requires effort, but may provide substantial acceleration.
Each core of the CPU (the values shown correspond to the Intel Core i7 CPU found in our test machine) has a bank of registers (typically a few hundreds of bytes, 1 cycle of latency), an L1 data cache (32 KB, 3 cycles), and an L2 cache (256 KB, 12 cycles). All cores of the CPU share a large L3 cache (6 MB, 40 cycles). External memory access has a latency in the ~250 cycles range. Each core has several hardware prefetcher units able to identify regular data streams, and move the data closer to the core before it is needed (prefetch).
Efficient OpenCL kernels should take advantage of this architecture by adopting regular memory access patterns and reusing the data before it is evicted from the cache. This is usually done through the use of several levels of loops (blocking) matching the various cache levels. The OpenCL framework, executing work items sequentially on several threads, provides the highest loop level.
Example: Tuning a Kernel To Optimize Performance On a CPU
In the example that follows, we tune our sample boxAvg kernel in several ways and check how each modification affects performance on the CPU:
We divide the computation into two passes. See Dividing Kernel Computation Into Two Passes.
We modify the horizontal pass to compute one row per work item instead of one single pixel. See Optimizing the Horizontal Pass.
We modify the algorithm to read fewer values per pixel and to incrementally update the sum rather than computing it each time. See Listing 15-5.
We modify the horizontal pass by moving the division and conditionals out of the inner loop. See Listing 15-6.
We modify the vertical pass to combine rows; each work item computes a block of rows. See Optimizing the Vertical Pass.
We ensure that the image width (w) is a multiple of four so we can use faster 16-byte I/O functions on float4 data. See Listing 15-8.
We ensure that the code works for any image width. See Listing 15-9.
We fuse the kernels. See Putting the Horizontal and Vertical Passes Together.
Dividing Kernel Computation Into Two Passes
The computational work of the boxAvg kernel can be broken into two passes. The first pass will compute the horizontal average of the input pixels; the second pass will compute a vertical average of the horizontal averages:
Listing 15-3 The boxAvg kernel in two passes
// Horizontal pass v1. Global work size: w x h
kernel void boxAvgH1(int w, int h, global const float * in, global float * out) {
    int x = get_global_id(0); // Pixel to process is (x,y)
    int y = get_global_id(1);
    float sumA = 0.0f; // Sum of pixel values
    float sum1 = 0.0f; // Number of pixels
    for (int dx=-RANGE; dx<=RANGE; dx++) {
        int xx = x + dx;
        // Accumulate if inside image
        if (xx>=0 && xx<w) { sumA += in[xx+w*y]; sum1 += 1.0f; }
    }
    out[x+w*y] = sumA/sum1;
}

// Vertical pass v1. Global work size: w x h
kernel void boxAvgV1(int w, int h, global const float * in, global float * out) {
    int x = get_global_id(0); // Pixel to process is (x,y)
    int y = get_global_id(1);
    float sumA = 0.0f; // Sum of pixel values
    float sum1 = 0.0f; // Number of pixels
    for (int dy=-RANGE; dy<=RANGE; dy++) {
        int yy = y + dy;
        // Accumulate if inside image
        if (yy>=0 && yy<h) { sumA += in[x + w*yy]; sum1 += 1.0f; }
    }
    out[x+w*y] = sumA/sum1;
}
In both cases, memory transfers take 24 B/pix (5 read + 1 write), and the estimated speed of each pass is 500 MP/s for the CPU. Actual values are shown in the right column of Table 15-1:
Kernel | Estimate | Actual
---|---|---
boxAvgH1 | 500 MP/s | 523 MP/s
boxAvgV1 | 500 MP/s | 563 MP/s
We can see some effects of the cache hierarchy. The copyBuffer value of 12 GB/s corresponds to external memory speed. In the case of boxAvg, we reuse each input value five times, and the data is found in the cache, giving speeds better than our estimate based on external memory speed only.
We can compute an absolute upper bound: we need to load each input value once from external memory, and store each output value once, and we assume that the subsequent input accesses are in cache and instantaneous. That’s only 8 B/pix, giving an upper bound of 1,500 MP/s (the speed of image copy).
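The two-pass decomposition of Listing 15-3 matches the one-pass kernel exactly (up to floating-point rounding): the clipped averaging box is always an axis-aligned rectangle, so the horizontal pixel count depends only on x, and the vertical average of horizontal averages equals the full box average. The plain-C ports below (function names ours) make this easy to verify:

```c
#define RANGE 2

/* One-pass box average (port of Listing 15-2). */
void box_1pass(int w, int h, const float *in, float *out) {
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++) {
            float sumA = 0.0f, sum1 = 0.0f;
            for (int dy = -RANGE; dy <= RANGE; dy++)
                for (int dx = -RANGE; dx <= RANGE; dx++) {
                    int xx = x + dx, yy = y + dy;
                    if (xx >= 0 && xx < w && yy >= 0 && yy < h) {
                        sumA += in[xx + w * yy]; sum1 += 1.0f;
                    }
                }
            out[x + w * y] = sumA / sum1;
        }
}

/* Two-pass box average (port of Listing 15-3): horizontal averages
 * into tmp, then vertical averages of tmp into out. */
void box_2pass(int w, int h, const float *in, float *tmp, float *out) {
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++) {
            float sumA = 0.0f, sum1 = 0.0f;
            for (int dx = -RANGE; dx <= RANGE; dx++) {
                int xx = x + dx;
                if (xx >= 0 && xx < w) { sumA += in[xx + w * y]; sum1 += 1.0f; }
            }
            tmp[x + w * y] = sumA / sum1;
        }
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++) {
            float sumA = 0.0f, sum1 = 0.0f;
            for (int dy = -RANGE; dy <= RANGE; dy++) {
                int yy = y + dy;
                if (yy >= 0 && yy < h) { sumA += tmp[x + w * yy]; sum1 += 1.0f; }
            }
            out[x + w * y] = sumA / sum1;
        }
}
```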
Optimizing the Horizontal Pass
To take better advantage of the cache, we can have each work item process several pixels instead of only one pixel. Since we need only a few threads to saturate the CPU (this is not the case for GPUs), we can actually have only one work item per row (for the horizontal pass) or per column (for the vertical pass), like this:
Listing 15-4 Modify the horizontal pass to compute one row per work item instead of one pixel
// Horizontal pass v2. Global work size: H
kernel void boxAvgH2(int w, int h, global const float * in, global float * out)
{
    int y = get_global_id(0); // Row to process is y
    // Process the row, loop on all pixels
    for (int x=0; x<w; x++) {
        float sumA = 0.0f; // Sum of pixel values
        float sum1 = 0.0f; // Number of pixels
        for (int dx=-RANGE; dx<=RANGE; dx++) {
            int xx = x + dx;
            // Accumulate if inside image
            if (xx>=0 && xx<w) { sumA += in[xx+w*y]; sum1 += 1.0f; }
        }
        out[x+w*y] = sumA/sum1; // Store output value (x,y)
    }
}
As shown in Table 15-2, for the horizontal pass we see an increased benefit from using fast cache memory. The input values used in one iteration of the x loop are reused immediately in the next iteration, and will be in L1 or L2 cache. For the vertical pass, on the other hand, we now have different threads processing different columns at the same time. Each iteration of the y loop now accesses a different cache line. This slows execution speed.
Kernel | Upper Bound | Actual
---|---|---
boxAvgH2 | 1500 MP/s | 931 MP/s
boxAvgV2 | 1500 MP/s | 456 MP/s
In this version of the code, the autovectorizer can be of real help. Let’s look at the execution times for different workgroup sizes, corresponding more or less to the block size used to vectorize the code. More precisely, the kernel code is vectorized up to the actual vector size (float4 for SSE, or float8 for AVX), and then when the workgroup size increases, several vector variants are executed together.
For example, Table 15-3 shows that with a workgroup size of 4, we execute float4 variants of the kernel, merging together 4 work items, and with a workgroup size of 32 we would execute 8 of these float4 variants.
Workgroup size | Execution time (ms)
---|---
1 | 260
2 | 143
4 | 37
8 | 45
16 | 56
32 | 66
64 | 66
128 | 67
The best performance was achieved using the float4 vectorized variant; it is 7x faster than the scalar variant.
In addition to giving vector instructions more bandwidth, the autovectorizer virtually adds blocking (or tiling) to the outer loop (the scheduling loop in the framework) and it can be used to better match the CPU cache structure.
We can make the horizontal kernel better. Instead of computing the sum of 2*RANGE+1 values at each iteration of the x loop, we can simply update the sum between two consecutive iterations, as illustrated in Listing 15-5:
Listing 15-5 Modify the algorithm to read fewer values per pixel and to incrementally update the sum
// Horizontal pass v3. Global work size: H
kernel void boxAvgH3(int w, int h, global const float * in, global float * out) {
    int y = get_global_id(0); // Row to process is y
    global const float * inRow = in + y*w; // Beginning of input row
    float sumA = 0.0f; // Sum of values in moving segment
    float sum1 = 0.0f; // Number of values in moving segment
    // Initialize the first sums with segment 0..RANGE-1
    for (int x=0; x<RANGE; x++) { sumA += inRow[x]; sum1 += 1.0f; }
    for (int x=0; x<w; x++) {
        // Here, sums are set to segment x-RANGE-1..x+RANGE-1;
        // update them to x-RANGE..x+RANGE.
        // Remove x-RANGE-1.
        if (x-RANGE-1 >= 0) { sumA -= inRow[x-RANGE-1]; sum1 -= 1.0f; }
        if (x+RANGE < w) { sumA += inRow[x+RANGE]; sum1 += 1.0f; } // Insert x+RANGE
        // Store current value
        out[x+w*y] = sumA/sum1;
    }
}
This variant reduces the number of memory accesses from 6 to 3 per iteration of the x loop. The execution speed of this variant is now 1366 MP/s (and only 822 MP/s without the autovectorizer). This is 91% of our upper bound.
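The running-sum idea of Listing 15-5, ported to plain C for a single row (function name ours), is easy to verify against the naive per-pixel sum:

```c
#define RANGE 2

/* Incremental horizontal box average for one row (port of Listing
 * 15-5): the moving-window sum is updated with at most one subtract
 * and one add per pixel instead of being recomputed from 2*RANGE+1
 * loads at every position. */
void box_h_incremental(int w, const float *inRow, float *outRow) {
    float sumA = 0.0f, sum1 = 0.0f;
    /* Initialize the sums with segment 0..RANGE-1 */
    for (int x = 0; x < RANGE; x++) { sumA += inRow[x]; sum1 += 1.0f; }
    for (int x = 0; x < w; x++) {
        /* Slide window from x-RANGE-1..x+RANGE-1 to x-RANGE..x+RANGE */
        if (x - RANGE - 1 >= 0) { sumA -= inRow[x - RANGE - 1]; sum1 -= 1.0f; }
        if (x + RANGE < w)      { sumA += inRow[x + RANGE];     sum1 += 1.0f; }
        outRow[x] = sumA / sum1;
    }
}
```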
We can move the conditionals and the division out of the x loop by splitting it into three parts:
Listing 15-6 Modify the horizontal pass by moving division and conditionals out of the inner loop
// Horizontal pass v4. Global work size: H
kernel void boxAvgH4(int w, int h, global const float * in, global float * out) {
    int y = get_global_id(0); // Row to process is y
    global const float * inRow = in + y*w; // Beginning of input row
    global float * outRow = out + y*w; // Beginning of output row
    float sumA = 0.0f;
    float sum1 = 0.0f;
    // Left border
    int x = -RANGE;
    for (; x<=RANGE; x++) {
        // Here, sumA corresponds to segment 0..x+RANGE-1; update to 0..x+RANGE.
        sumA += inRow[x+RANGE]; sum1 += 1.0f;
        if (x >= 0) outRow[x] = sumA/sum1;
    }
    // x is RANGE+1 here
    // Internal pixels
    float k = 1.0f/(float)(2*RANGE+1); // Constant weight for internal pixels
    for (; x+RANGE<w; x++) {
        // Here, sumA corresponds to segment x-RANGE-1..x+RANGE-1;
        // update to x-RANGE..x+RANGE.
        sumA -= inRow[x-RANGE-1];
        sumA += inRow[x+RANGE];
        outRow[x] = sumA * k;
    }
    // x is w-RANGE here
    // Right border
    for (; x < w; x++) {
        // Here, sumA corresponds to segment x-RANGE-1..w-1; update to x-RANGE..w-1.
        sumA -= inRow[x - RANGE - 1]; sum1 -= 1.0f;
        outRow[x] = sumA/sum1;
    }
    // x is w here, and we are done
}
This variant runs at 1457 MP/s, or 97% of the upper bound.
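Translated to plain C (function name ours, and assuming w >= 2*RANGE+1 as the kernel itself does), the three-region structure of Listing 15-6 can be checked against a naive window average:

```c
#define RANGE 2

/* Port of Listing 15-6: the row is processed in three regions (left
 * border, interior, right border) so the interior loop carries no
 * conditionals and replaces the division by a multiplication with the
 * precomputed constant 1/(2*RANGE+1). Assumes w >= 2*RANGE+1. */
void box_h_split(int w, const float *inRow, float *outRow) {
    float sumA = 0.0f, sum1 = 0.0f;
    int x = -RANGE;
    /* Left border: grow the window, divide by the running count */
    for (; x <= RANGE; x++) {
        sumA += inRow[x + RANGE]; sum1 += 1.0f;
        if (x >= 0) outRow[x] = sumA / sum1;
    }
    /* Interior: full window, constant weight, no conditionals */
    float k = 1.0f / (float)(2 * RANGE + 1);
    for (; x + RANGE < w; x++) {
        sumA -= inRow[x - RANGE - 1];
        sumA += inRow[x + RANGE];
        outRow[x] = sumA * k;
    }
    /* Right border: shrink the window, divide by the running count */
    for (; x < w; x++) {
        sumA -= inRow[x - RANGE - 1]; sum1 -= 1.0f;
        outRow[x] = sumA / sum1;
    }
}
```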
Optimizing the Vertical Pass
Now we optimize the vertical kernel. To make better use of the cache, we can combine entire rows together. Each work item will be responsible for computing one block of rows. The code in Listing 15-7 enqueues a small number of work items: one per CPU thread, which is enough.
Listing 15-7 Modify vertical pass to combine rows; each work item computes a block of rows
// Vertical pass v3. Global work size: >= number of CPU cores.
kernel void boxAvgV3(int w, int h, global const float * in, global float * out) {
    // Number of rows to process in each work item (rounded up)
    int rowsPerItem = (h+get_global_size(0)-1)/get_global_size(0);
    int y0 = rowsPerItem * get_global_id(0); // Update the range y0..y1-1
    int y1 = min(h, y0 + rowsPerItem);
    for (int y=y0; y<y1; y++) {
        // Accumulate into row y all rows in y-RANGE..y+RANGE intersected with 0..h-1
        int ya0 = max(0, y-RANGE);
        int ya1 = min(h, y+RANGE+1);
        float k = 1.0f/(float)(ya1-ya0); // 1/(number of rows)
        global float * outRow = out + w*y; // Output row
        for (int x=0; x<w; x++) outRow[x] = 0.0f; // Zero output row
        for (int ya=ya0; ya<ya1; ya++) {
            global const float * inRow = in + w*ya; // Input row
            for (int x=0; x<w; x++) outRow[x] += k * inRow[x];
        }
    }
}
This kernel is executed at 891 MP/s on our test machine. If we can ensure that the image width is always a multiple of four, we can vectorize all x loops to float4:
Listing 15-8 Ensure the image width is always a multiple of 4
…
global float4 * outRow4 = (global float4 *)(out + w*y); // Output row
for (int x=0; x<w/4; x++) outRow4[x] = 0.0f;
for (int ya=ya0; ya<ya1; ya++) {
    global const float4 * inRow4 = (global float4 *)(in + w*ya); // Input row
    for (int x=0; x<w/4; x++) outRow4[x] += k * inRow4[x];
}
…
This code runs at 1442 MP/s, 96% of the upper bound. When accessing data through a pointer to float4, the compiler assumes the pointer is aligned to multiples of the size of the type, here 16 bytes, and generates an efficient aligned move instruction (movaps).
This code is not safe, since we can only assume in and out are four-byte aligned (the size of float). A safer variant that will work for any image width would look something like:
Listing 15-9 A safer variant that will work for any image width
// Zero output row using aligned memory access
global float * outRow = out + w*y;
int x = 0;
// Iterate by 1 until 16-byte aligned
for (; x<w && ((intptr_t)&outRow[x] & 15); x++) outRow[x] = 0.0f;
// Iterate by 4 on aligned addresses while we can
// outRow4 is 16-byte aligned
global float4 * outRow4 = (global float4 *)(outRow + x);
for (; x+4<=w; x+=4,outRow4++) outRow4[0] = 0.0f;
// Iterate by 1 until w is reached
for (; x<w; x++) outRow[x] = 0.0f;
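The same peel-until-aligned pattern can be sketched and tested in plain C; the function name and the use of 16-byte memset stores as a stand-in for aligned float4 stores are our own choices:

```c
#include <stdint.h>
#include <string.h>

/* Zero w floats starting at row: scalar stores until the address is
 * 16-byte aligned, then 16-byte chunks, then a scalar epilogue.
 * Mirrors the structure of the any-width OpenCL variant. */
void zero_row_aligned(float *row, int w) {
    int x = 0;
    /* Scalar prologue: advance until 16-byte aligned */
    for (; x < w && ((uintptr_t)&row[x] & 15); x++) row[x] = 0.0f;
    /* Aligned body: 4 floats (16 bytes) at a time */
    for (; x + 4 <= w; x += 4) memset(&row[x], 0, 4 * sizeof(float));
    /* Scalar epilogue: finish the remaining 0..3 floats */
    for (; x < w; x++) row[x] = 0.0f;
}
```

The function is correct for any starting alignment and any width; only the proportion of work done in the fast aligned body changes.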
We now have near-optimal versions of both horizontal and vertical kernels. We can call them in sequence to compute the result, and run at a speed near 700 MP/s.
Putting the Horizontal and Vertical Passes Together
We can now fuse the two kernels together, and loop over the entire input image only once. This is illustrated in Listing 15-10:
Listing 15-10 Fused kernel
// Fused H+V variant. Global work size: any, >= number of CPU cores.
// AUX[w*global_size(0)] is temporary storage, 1 row for each work item.
kernel void boxAvg2(int w, int h, global const float * in,
                    global float * out, global float * aux) {
    // Number of rows to process in each work item (rounded up)
    int rowsPerItem = (h+get_global_size(0)-1)/get_global_size(0);
    int y0 = rowsPerItem * get_global_id(0); // Update the range y0..y1-1
    int y1 = y0 + rowsPerItem;
    aux += get_global_id(0) * w; // Point to our work item’s row of temporary storage
    float k = 1.0f/(float)(2*RANGE+1); // Constant weight for internal pixels
    // Process our rows. We need to process extra RANGE rows before and after.
    for (int y=y0-RANGE; y<y1+RANGE; y++) {
        // Zero a new row entering our update scope before we start
        // accumulating values in it.
        if (y+RANGE < min(h, y1)) {
            global float4 * outRow4 = (global float4 *)(out + w*(y + RANGE));
            for (int x=0; x<(w/4); x++) outRow4[x] = 0.0f;
        }
        if (y<0 || y>=h) continue; // Out of range
        // Compute the horizontal pass into AUX:
        // the boxAvgH4 code goes here on input row y,
        // with the output stored in AUX.
        // Accumulate this row on output rows y-RANGE..y+RANGE
        for (int dy=-RANGE; dy<=RANGE; dy++) {
            int yy = y + dy;
            if (yy < max(0, y0) || yy >= min(h, y1)) continue; // Out of range
            // Get number of rows accumulated in row yy, to get the weight
            int nr = 1 + min(h-1, yy+RANGE) - max(0, yy-RANGE);
            float u = 1.0f/(float)nr;
            // Accumulate AUX in row yy
            global float4 * outRow4 = (global float4 *)(out + w*yy);
            global float4 * aux4 = (global float4 *)(aux);
            for (int x=0; x<(w/4); x++) outRow4[x] += u * aux4[x];
        }
    }
}
This fused version runs at 1166 MP/s. Some rows will be processed twice, since we have to compute horizontal filters on rows y0-RANGE to y1+RANGE-1 to update output rows y0 to y1-1. At any given time during the execution, we will access one row in aux, one input row, and 2*RANGE+1 output rows. For a 4096x4096 image, each row is 16 KiB, and all 7 rows fit in L2 cache.
Copyright © 2018 Apple Inc. All Rights Reserved. Terms of Use | Privacy Policy | Updated: 2018-06-04