ReadMe.txt
### OpenCL Matrix Transpose Example ### |
=========================================================================== |
DESCRIPTION: |
This example shows how to efficiently transpose an M x N matrix whose |
dimensions are powers of two on GPU architectures that require specific |
memory addressing to avoid memory bank conflicts. |
Naively transposing large power-of-two matrices can easily cause bank |
conflicts, which can severely affect performance. |
With appropriate padding and choice of local block size, good performance |
can be ensured. |
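
To illustrate why a power-of-two row pitch is a problem, the following C
sketch counts how many accesses land on the same bank when a group of
work-items reads one column of a row-major array. The bank model
(bank = element index modulo the number of banks) and the function name
column_access_conflicts are assumptions made for illustration only; the real
number of banks and the mapping depend on the GPU.

```c
#include <assert.h>

enum { MAX_BANKS = 64 };  /* upper bound for this sketch only */

/* Worst-case number of accesses that hit a single bank when `rows`
   work-items each read one element from the same column of a row-major
   array whose rows are `row_pitch` elements apart.
   ASSUMPTION: bank = element index % num_banks (simplified model). */
int column_access_conflicts(int row_pitch, int rows, int num_banks)
{
    int hits[MAX_BANKS] = {0};
    int worst = 0;

    assert(num_banks > 0 && num_banks <= MAX_BANKS);
    for (int r = 0; r < rows; ++r) {
        int bank = (r * row_pitch) % num_banks;  /* element (r, 0) */
        if (++hits[bank] > worst)
            worst = hits[bank];
    }
    return worst;
}
```

Under this model, a pitch of 32 with 32 banks sends all 32 column accesses to
one bank, while a pitch of 33 (one column of padding) spreads them over 32
different banks.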
In this example, 64 work-items are issued per work-group, which operate on |
small 32x2 sections to fill a 32x32 sub-matrix (over 8 iterations). |
The final 32x32 sub-matrix is transposed locally using local memory |
with one column of padding to avoid bank conflicts. Performing the transpose |
in local memory allows the reads and writes to global memory to be coalesced. |
The extra column padding is used to offset the write addresses, so that |
they don't conflict with the read requests. |
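
The tile scheme described above can be sketched on the host in plain C. This
is not the code in transpose_kernel.cl: the function name transpose_tiled and
the sequential loops are assumptions made for illustration, and on a CPU the
extra column has no performance effect. It only mirrors the indexing, with a
32x33 tile standing in for the padded local memory.

```c
#include <assert.h>

#define TILE 32  /* sub-matrix edge, as in the description above */

/* Transpose a row-major `height` x `width` matrix `in` into the row-major
   `width` x `height` matrix `out`, one 32x32 tile at a time.
   ASSUMPTION: both dimensions are multiples of TILE. */
void transpose_tiled(float *out, const float *in, int width, int height)
{
    float tile[TILE][TILE + 1];  /* one extra column of padding */

    assert(width % TILE == 0 && height % TILE == 0);
    for (int by = 0; by < height; by += TILE) {
        for (int bx = 0; bx < width; bx += TILE) {
            /* Read the tile row by row (sequential input addresses). */
            for (int y = 0; y < TILE; ++y)
                for (int x = 0; x < TILE; ++x)
                    tile[y][x] = in[(by + y) * width + (bx + x)];

            /* Write the tile row by row (sequential output addresses),
               reading the staged tile column-wise. */
            for (int y = 0; y < TILE; ++y)
                for (int x = 0; x < TILE; ++x)
                    out[(bx + y) * height + (by + x)] = tile[x][y];
        }
    }
}
```

Both the read loop and the write loop walk global addresses sequentially; only
the staged tile is accessed column-wise, which is where the padding matters on
a GPU.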
Using a padding of 32 (or any odd multiple of GROUP_DIMX = 32) ensures that |
the reads and writes for each element in global memory will be offset and |
not operate on the same memory bank/channel/port. |
This is important for the global memory write operations, since the |
column-major indices are non-sequential and can cause global memory bank |
conflicts. |
Global memory read requests will operate on sequential indices for the |
row-major elements, and will not conflict. |
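
The effect of the global padding on addressing can be sketched as follows.
The helper names and the "window" model are hypothetical and serve only to
show the arithmetic: the output row pitch becomes the matrix height plus the
padding, so the start of each output row is shifted relative to the previous
one instead of always falling on the same power-of-two boundary.

```c
#include <assert.h>
#include <stddef.h>

/* Row pitch (in elements) of the padded, transposed output. */
size_t padded_pitch(size_t height, size_t padding)
{
    return height + padding;
}

/* Index in the padded output of input element (in_row, in_col). */
size_t padded_output_index(size_t in_row, size_t in_col,
                           size_t height, size_t padding)
{
    assert(in_row < height);
    return in_col * padded_pitch(height, padding) + in_row;
}

/* Offset of the start of an output row inside a window of `window`
   elements. ASSUMPTION: the window size is an illustrative stand-in for
   whatever bank/channel/port interleaving the hardware uses. */
size_t row_start_in_window(size_t out_row, size_t pitch, size_t window)
{
    assert(window > 0);
    return (out_row * pitch) % window;
}
```

With an illustrative height of 4096 and no padding, every output row starts
at offset 0 of a 4096-element window. With a padding of 32, consecutive rows
start at offsets 0, 32, 64, and so on. Note that the output buffer must then
hold width * (height + padding) elements.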
Note that the .cl compute kernel file(s) are loaded and compiled at |
runtime. The example source assumes that these files are in the same |
path as the built executable. |
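
Loading the kernel source at runtime might look like the sketch below. The
helper name load_text_file is hypothetical and is not taken from transpose.c.
It reads a whole file into a NUL-terminated buffer, which could then be
passed to clCreateProgramWithSource. Error handling is kept minimal.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Read the file at `path` into a newly allocated, NUL-terminated buffer.
   Returns NULL on failure. The caller frees the result with free().
   If `length_out` is not NULL, it receives the length without the NUL. */
char *load_text_file(const char *path, size_t *length_out)
{
    FILE *file;
    long size;
    char *text;

    assert(path != NULL);
    file = fopen(path, "rb");
    if (file == NULL)
        return NULL;

    /* Determine the file size by seeking to the end. */
    if (fseek(file, 0, SEEK_END) != 0 || (size = ftell(file)) < 0) {
        fclose(file);
        return NULL;
    }
    rewind(file);

    text = malloc((size_t)size + 1);
    if (text == NULL || fread(text, 1, (size_t)size, file) != (size_t)size) {
        free(text);
        fclose(file);
        return NULL;
    }
    fclose(file);

    text[size] = '\0';
    if (length_out != NULL)
        *length_out = (size_t)size;
    return text;
}
```

A relative path such as "transpose_kernel.cl" is resolved against the current
working directory, which is one reason to launch the executable from the
directory that contains the kernel file.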
For simplicity, this example is intended to be run from the command line. |
If run from within Xcode, open the Run Log (Command-Shift-R) to see the |
output. Alternatively, launch the application from within a Terminal.app |
session. |
=========================================================================== |
BUILD REQUIREMENTS: |
Mac OS X v10.6 or later |
=========================================================================== |
RUNTIME REQUIREMENTS: |
Mac OS X v10.6 or later |
To use the GPU as a compute device, use one of the following devices: |
- MacBook Pro w/NVIDIA GeForce 8600M |
- Mac Pro w/NVIDIA GeForce 8800GT |
=========================================================================== |
PACKAGING LIST: |
ReadMe.txt |
transpose.c |
transpose.xcodeproj |
transpose_kernel.cl |
=========================================================================== |
CHANGES FROM PREVIOUS VERSIONS: |
Version 1.0 |
- First version. |
=========================================================================== |
Copyright (C) 2008 Apple Inc. All rights reserved. |
Updated: 2009-05-13