
ReadMe.txt

### OpenCL Matrix Transpose Example ###

===========================================================================

DESCRIPTION:

This example shows how to efficiently transpose an M x N matrix, where M and
N are powers of two, on GPU architectures which require specific memory
addressing to avoid memory bank conflicts.

Transposing large power-of-two matrices naively can easily cause bank
conflicts, which can severely affect performance.

With appropriate padding and choice of local block size, good performance
can be ensured.
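The effect of padding on bank conflicts can be illustrated with a small C
sketch. It assumes a simplified bank model in which word address w maps to
bank (w % num_banks); this model and the function name distinct_banks are
illustrative assumptions, not part of the example source or a description of
any particular GPU.

```c
#include <assert.h>

#define MAX_BANKS 64   /* illustrative upper bound for this sketch */

/* Assumed model: word address w lives in bank (w % num_banks).
   Returns how many distinct banks are hit by `count` accesses whose
   word addresses are 0, stride, 2*stride, ... */
int distinct_banks(int stride, int count, int num_banks)
{
    int used[MAX_BANKS] = {0};
    int distinct = 0;

    assert(num_banks > 0 && num_banks <= MAX_BANKS);
    assert(stride >= 0 && count >= 0);

    for (int i = 0; i < count; ++i) {
        int bank = (i * stride) % num_banks;
        if (!used[bank]) {
            used[bank] = 1;
            ++distinct;
        }
    }
    return distinct;
}
```

Under this model, 32 accesses with a stride of 32 words all land in a single
bank when there are 32 banks, while a stride of 33 words (one word of padding
per row) spreads them over all 32 banks.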

In this example, 64 work items are issued per work-group, and each work-group
operates on small 32 x 2 sections to fill a 32 x 32 sub-matrix (over 8
iterations). The final 32 x 32 sub-matrix is transposed locally using local
memory with one column of padding to avoid bank conflicts. Performing the
transpose in local memory allows the reads and writes to global memory to be
coalesced. The extra column of padding offsets the write addresses so that
they don't conflict with the read requests.
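The data flow through the padded tile can be sketched on the CPU as follows.
This is a sequential emulation, not the kernel shipped in transpose_kernel.cl:
the function name transpose_tiled and the constants TILE_DIM and TILE_PITCH
are assumptions made for illustration, and the loops stand in for the work
items that would run in parallel on the GPU.

```c
#include <assert.h>
#include <stddef.h>

#define TILE_DIM   32               /* edge of the 32 x 32 sub-matrix      */
#define TILE_PITCH (TILE_DIM + 1)   /* row pitch with one padding column   */

/* Transposes a row-major matrix with `width` columns and `height` rows into
   `out`, which has `height` columns and `width` rows. Both dimensions are
   assumed to be multiples of TILE_DIM. */
void transpose_tiled(float *out, const float *in, int width, int height)
{
    float tile[TILE_DIM * TILE_PITCH];   /* stands in for local memory */

    assert(width % TILE_DIM == 0 && height % TILE_DIM == 0);

    for (int by = 0; by < height; by += TILE_DIM) {
        for (int bx = 0; bx < width; bx += TILE_DIM) {
            /* Load: consecutive x reads consecutive input addresses. */
            for (int y = 0; y < TILE_DIM; ++y)
                for (int x = 0; x < TILE_DIM; ++x)
                    tile[y * TILE_PITCH + x] =
                        in[(size_t)(by + y) * width + (bx + x)];

            /* Store: consecutive x writes consecutive output addresses,
               while the tile is read column-wise with a stride of
               TILE_PITCH (33) words instead of 32. */
            for (int y = 0; y < TILE_DIM; ++y)
                for (int x = 0; x < TILE_DIM; ++x)
                    out[(size_t)(bx + y) * height + (by + x)] =
                        tile[x * TILE_PITCH + y];
        }
    }
}
```

Both the load and the store walk global memory along rows, which is what
allows coalescing on the GPU; only the accesses to the tile are strided, and
the padding column gives those a stride of 33 words rather than 32.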

Using a padding of 32 (or any odd multiple of GROUP_DIMX = 32) ensures that
the reads and writes for each element in global memory will be offset and
not operate on the same memory bank/channel/port. This is important for the
global memory write operations, since the column-major indices are
non-sequential and can cause global memory bank conflicts. Global memory read
requests will operate on sequential indices for the row-major elements, and
will not conflict.
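One way to see why an odd multiple matters is a small C sketch of a
hypothetical channel model. The model (memory interleaved across channels in
fixed-size chunks), the function names, and the numbers used in the usage
note (4096-word rows, 8 channels, 32-word chunks) are all illustrative
assumptions and do not describe any particular GPU or the example source.

```c
#include <assert.h>

#define MAX_CHANNELS 64   /* illustrative upper bound for this sketch */

/* Assumed model: global memory is interleaved across num_channels in
   chunks of chunk_words consecutive words. */
int channel_of(long word_address, int chunk_words, int num_channels)
{
    return (int)((word_address / chunk_words) % num_channels);
}

/* Counts the distinct channels hit by the first word of each of `rows`
   consecutive rows, when rows start pitch_words apart. */
int distinct_row_channels(int rows, int pitch_words,
                          int chunk_words, int num_channels)
{
    int used[MAX_CHANNELS] = {0};
    int distinct = 0;

    assert(chunk_words > 0);
    assert(num_channels > 0 && num_channels <= MAX_CHANNELS);

    for (int row = 0; row < rows; ++row) {
        int ch = channel_of((long)row * pitch_words,
                            chunk_words, num_channels);
        if (!used[ch]) {
            used[ch] = 1;
            ++distinct;
        }
    }
    return distinct;
}
```

Under these assumptions, 8 rows with an unpadded pitch of 4096 words all
start in the same channel. A pitch of 4096 + 32 or 4096 + 96 (odd multiples
of 32) spreads the rows over all 8 channels, while 4096 + 64 (an even
multiple) reaches only 4 of them.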

Note that the .cl compute kernel file(s) are loaded and compiled at runtime.
The example source assumes that these files are in the same directory as the
built executable.
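Loading a kernel file at runtime amounts to reading it into a string before
handing it to the OpenCL compiler. The helper below is a minimal sketch in
standard C; the name load_text_file is hypothetical and is not taken from
transpose.c. The OpenCL calls that would normally follow
(clCreateProgramWithSource and clBuildProgram) are left out so that the
sketch builds without the OpenCL framework.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Reads an entire file into a newly allocated, NUL-terminated buffer.
   On success returns the buffer (the caller must free it) and, if
   length_out is not NULL, stores the number of bytes read.
   Returns NULL if the file cannot be opened or read. */
char *load_text_file(const char *path, size_t *length_out)
{
    FILE *fp;
    long size;
    char *text;
    size_t got;

    assert(path != NULL);

    fp = fopen(path, "rb");
    if (fp == NULL)
        return NULL;

    /* Determine the file size by seeking to the end. */
    if (fseek(fp, 0, SEEK_END) != 0 || (size = ftell(fp)) < 0) {
        fclose(fp);
        return NULL;
    }
    rewind(fp);

    text = malloc((size_t)size + 1);
    if (text == NULL) {
        fclose(fp);
        return NULL;
    }

    got = fread(text, 1, (size_t)size, fp);
    fclose(fp);
    if (got != (size_t)size) {
        free(text);
        return NULL;
    }

    text[got] = '\0';
    if (length_out != NULL)
        *length_out = got;
    return text;
}
```

A relative path such as "transpose_kernel.cl" is resolved against the current
working directory, so a helper like this only finds the file when the
executable is launched from the directory that contains it.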

For simplicity, this example is intended to be run from the command line.
If run from within Xcode, open the Run Log (Command-Shift-R) to see the
output. Alternatively, run the application from within a Terminal.app
session to launch from the command line.

===========================================================================

BUILD REQUIREMENTS:

Mac OS X v10.6 or later

===========================================================================

RUNTIME REQUIREMENTS:

Mac OS X v10.6 or later

To use the GPU as a compute device, use one of the following devices:

- MacBook Pro w/NVIDIA GeForce 8600M

- Mac Pro w/NVIDIA GeForce 8800GT

===========================================================================

PACKAGING LIST:

ReadMe.txt

transpose.c

transpose.xcodeproj

transpose_kernel.cl

===========================================================================

CHANGES FROM PREVIOUS VERSIONS:

Version 1.0

- First version.

===========================================================================

Copyright (C) 2008 Apple Inc. All rights reserved.
