Autovectorizer
The autovectorizer detects operations in a scalar program that can be run in parallel and converts them into vector operations that can be handled efficiently by today's CPUs.
The autovectorizer frees you to write simple scalar code. It vectorizes that code for you to maximize its performance on the CPU, while the same code still runs on the GPU.
Features
- Runs by default when compiling to the CPU.
- Packs work items together.
- Generates a loop over the entire workgroup.
- Can improve performance by a factor of up to the vector width of the CPU without additional effort.
- Allows you to write one scalar kernel that runs on CPU or GPU.
Without the Autovectorizer
The issue is that a GPU processes scalar data efficiently, but a CPU needs vectorized data to keep it fully busy. Without the autovectorizer, you therefore either have to write multiple device-specific kernels that all perform the same function, or accept reduced performance.
OpenCL sees devices as having a number of compute cores and within them a number of processing elements. When scalar code runs on the CPU, it will run on each core but will not take advantage of the vector unit.
For example, on an SSE4 machine, scalar code runs in one lane of the vector unit when it could be running in four lanes. The monitor would report that the CPU is completely busy because all the cores are running, but the CPU is actually using only a quarter of its vector width.
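The "quarter of its vector width" figure follows from simple arithmetic: a 128-bit SSE register holds four 32-bit floats, so one active lane is 1/4 of the unit. The sketch below only spells out that calculation; the function names are hypothetical and are not part of OpenCL.

```c
#include <assert.h>

/* Number of elements that fit in one vector register. */
unsigned lanes_per_vector(unsigned vector_bits, unsigned element_bits)
{
    assert(element_bits > 0 && vector_bits % element_bits == 0);
    return vector_bits / element_bits;
}

/* Fraction of the vector unit doing useful work when only
   `active_lanes` of the available lanes carry data. */
double vector_utilization(unsigned active_lanes,
                          unsigned vector_bits, unsigned element_bits)
{
    unsigned lanes = lanes_per_vector(vector_bits, element_bits);
    assert(active_lanes <= lanes);
    return (double)active_lanes / (double)lanes;
}
```

With 128-bit registers and 32-bit floats, `lanes_per_vector(128, 32)` is 4, and scalar code that occupies a single lane gives `vector_utilization(1, 128, 32)` of 0.25.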
If you pass simple floats into a kernel:
Listing 10-1 Passing single floats into a kernel
kernel void add_arrays(global float* a, global float* b, global float* c)
{
    size_t i = get_global_id(0);
    c[i] = a[i] + b[i];
}
The kernel performs a scalar addition, operating on one data element at a time. If you send this scalar kernel to both the CPU and the GPU, the GPU becomes fully engaged in processing the data. On the CPU, although all the cores are busy, only one quarter of the vector width of the processing element in each core is used.
If you instead pass float4* parameters to the kernel, the addition becomes a vector addition. The kernel is now specialized for the CPU: it extracts as much work as possible from the CPU but leaves the GPU idle.
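To make the hand-vectorized variant concrete, here is a minimal host-side model in plain C. It is a sketch, not OpenCL C: `float4` is a stand-in built on the GCC/Clang `vector_size` extension, and `add_arrays_float4` and `run_add_arrays_float4` are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Host-side stand-in for OpenCL's float4 (GCC/Clang vector extension). */
typedef float float4 __attribute__((vector_size(16)));

/* Model of the hand-vectorized kernel body: one work item adds four floats. */
void add_arrays_float4(const float4 *a, const float4 *b, float4 *c, size_t i)
{
    c[i] = a[i] + b[i];   /* element-wise add across all four lanes */
}

/* Model of the runtime invoking the kernel once per work item. */
void run_add_arrays_float4(const float4 *a, const float4 *b, float4 *c,
                           size_t global_size)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < global_size; ++i)
        add_arrays_float4(a, b, c, i);
}
```

In an actual OpenCL kernel the parameters would be declared `global float4*` and `i` would come from `get_global_id(0)`. Because each work item now covers four floats, the global size is a quarter of what the scalar kernel needs for the same data.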
In other words, without the autovectorizer, you would have to write multiple device-specific, non-scalar kernels, one for each type of device.
Writing Optimal Code For the CPU: Let the autovectorizer do the work for you
Do
- Write one simple (non-vectorized) kernel that can run on any device.
Don’t
- Write device-specific optimizations.
- Write control flow that depends on the work item ID, where you can avoid it. (If such control flow occurs in many places in the code, it is likely to prevent autovectorization from succeeding.)
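As an illustration of ID-dependent control flow, the sketch below models two kernel bodies in plain C, with the work item ID passed in as `i`. All names are hypothetical. Whether a given compiler vectorizes either form is not guaranteed; the point is only to show the kind of branch to look for and one possible way to restate it.

```c
#include <assert.h>
#include <stddef.h>

/* ID-dependent control flow: neighbouring work items take different
   branches, so they do not all execute the same statement. */
void scale_divergent(const float *in, float *out, size_t i)
{
    if (i % 2 == 0)
        out[i] = in[i] * 2.0f;
    else
        out[i] = in[i] * 0.5f;
}

/* Same result, restated so every work item runs the same multiply and
   only the factor (a data value) differs. Compilers can typically lower
   a value-only conditional like this to a select rather than a branch. */
void scale_uniform(const float *in, float *out, size_t i)
{
    float factor = (i % 2 == 0) ? 2.0f : 0.5f;
    out[i] = in[i] * factor;
}

/* Model of the runtime invoking a kernel body once per work item. */
typedef void (*item_fn)(const float *in, float *out, size_t i);

void run_items(item_fn fn, const float *in, float *out, size_t n)
{
    assert(fn != NULL && in != NULL && out != NULL);
    for (size_t i = 0; i < n; ++i)
        fn(in, out, i);
}
```

Both bodies produce identical output for the same input, so the restated form can replace the branching one without changing results.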
What the autovectorizer does
- Runs by default whenever compiling kernels to a CPU.
- Packs work items together into vector instructions.
- Allows the workgroup size to be increased if autovectorization succeeds.
- Achieves performance improvements of up to the vector width of the CPU without additional effort on your part.
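The packing of work items can be pictured with a small host-side model in plain C. This is a conceptual sketch only, not the autovectorizer's actual output: it assumes a vector width of four floats (as on a 128-bit SSE unit) and a workgroup size that is a multiple of four, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define VECTOR_WIDTH 4  /* assumed: four float lanes per vector register */

/* The scalar kernel body from Listing 10-1, as plain C for one work item. */
void add_arrays_item(const float *a, const float *b, float *c, size_t i)
{
    c[i] = a[i] + b[i];
}

/* Conceptual model of a loop over the whole workgroup: each outer trip
   covers VECTOR_WIDTH adjacent work items, and the inner loop is the
   part that would map onto a single vector instruction. */
void add_arrays_workgroup(const float *a, const float *b, float *c,
                          size_t group_start, size_t group_size)
{
    /* Simplification: no remainder handling. */
    assert(group_size % VECTOR_WIDTH == 0);
    for (size_t base = 0; base < group_size; base += VECTOR_WIDTH) {
        for (size_t lane = 0; lane < VECTOR_WIDTH; ++lane)
            add_arrays_item(a, b, c, group_start + base + lane);
    }
}
```

With a workgroup of eight items, the outer loop runs twice instead of the kernel body being invoked eight separate times, which is where the potential speedup of up to the vector width comes from.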
Vectorization Example
Xcode
Setting | Type | Default | Command Line Flag
Auto-vectorizer | Boolean | | If this is set to
© 2012 Apple Inc. All Rights Reserved. (Last updated: 2012-07-23)