ReadMe.txt

### OpenCL Matrix Transpose Example ###
 
===========================================================================
DESCRIPTION:
 
This example shows how to efficiently transpose an M x N matrix whose
dimensions are powers of two on GPU architectures that require specific
memory addressing patterns to avoid memory bank conflicts.
 
Naively transposing large power-of-two matrices can easily cause bank
conflicts, which can severely degrade performance.
 
With appropriate padding and a suitable local block size, these conflicts
can be avoided and good performance maintained.
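
The effect of a power-of-two stride can be illustrated with a small host-side
model. This is a sketch only: it assumes 32 banks and that consecutive 32-bit
words map to consecutive banks, which may not match a given GPU. NUM_BANKS and
worst_column_conflict() are hypothetical names and are not part of the example
source.

```c
#include <assert.h>

/* Assumed, illustrative value: the number of banks varies by GPU. */
#define NUM_BANKS 32

/*
 * Returns the largest number of work-items that land on the same bank when
 * `count` work-items each access one element of a single column in a
 * row-major tile whose rows are `row_stride` elements apart.
 * A result of 1 means the access is conflict-free in this simple model.
 */
static int worst_column_conflict(int count, int row_stride)
{
    int hits[NUM_BANKS] = {0};
    int worst = 0;

    assert(count > 0 && row_stride > 0);

    for (int row = 0; row < count; ++row) {
        /* Word index of the element, taken modulo the bank count. */
        int bank = (row * row_stride) % NUM_BANKS;
        hits[bank] += 1;
        if (hits[bank] > worst)
            worst = hits[bank];
    }
    return worst;
}
```

Under these assumptions, a column access with a row stride of 32 places all
32 work-items on one bank, while a row stride of 33 spreads them across 32
different banks.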
 
In this example, 64 work-items are issued per work-group. Each work-group
operates on small 32x2 sections to fill a 32x32 sub-matrix (over 8 iterations).
The final 32x32 sub-matrix is transposed locally using local memory
with one column of padding to avoid bank conflicts. Performing the transpose
in local memory allows the reads and writes to global memory to be coalesced.
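
The tile-based scheme described above can be sketched on the CPU as follows.
This is not the kernel shipped with the example. It is a minimal emulation in
which a stack array stands in for local memory, and the names TILE_DIM,
TILE_PITCH and transpose_tiled() are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define TILE_DIM   32               /* tile edge, as described in the text */
#define TILE_PITCH (TILE_DIM + 1)   /* one extra column of padding */

/*
 * Transposes the row-major `rows` x `cols` matrix `in` into the row-major
 * `cols` x `rows` matrix `out`, one TILE_DIM x TILE_DIM tile at a time.
 * Both dimensions must be multiples of TILE_DIM.
 */
static void transpose_tiled(float *out, const float *in,
                            size_t rows, size_t cols)
{
    float tile[TILE_DIM * TILE_PITCH];  /* stands in for __local memory */

    assert(rows % TILE_DIM == 0 && cols % TILE_DIM == 0);

    for (size_t ty = 0; ty < rows; ty += TILE_DIM) {
        for (size_t tx = 0; tx < cols; tx += TILE_DIM) {
            /* Load: walk the input tile row by row (sequential addresses). */
            for (size_t y = 0; y < TILE_DIM; ++y)
                for (size_t x = 0; x < TILE_DIM; ++x)
                    tile[y * TILE_PITCH + x] = in[(ty + y) * cols + (tx + x)];

            /* Store: walk the output tile row by row as well, so global
               writes stay sequential; the padded tile is read by column. */
            for (size_t y = 0; y < TILE_DIM; ++y)
                for (size_t x = 0; x < TILE_DIM; ++x)
                    out[(tx + y) * rows + (ty + x)] = tile[x * TILE_PITCH + y];
        }
    }
}
```

Both global accesses walk along rows. Only the tile is read by column, and
that is where the extra column in TILE_PITCH matters on hardware with banked
local memory.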
 
The extra column padding is used to offset the write addresses, so that
they don't conflict with the read requests. 
 
Using a padding of 32 (or any odd multiple of GROUP_DIMX = 32) ensures that
the reads and writes for each element in global memory will be offset and 
not operate on the same memory bank/channel/port.  
 
This is important for the global memory write operations, since the
column-major indices are non-sequential and can cause global memory bank
conflicts.
 
Global memory read requests will operate on sequential indices for the 
row-major elements, and will not conflict.
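
One way to picture the global memory padding is as extra, unused elements at
the end of each output row. The sketch below assumes that layout; the text
does not spell it out. PADDING takes the value quoted above, while
padded_pitch() and padded_index() are hypothetical helpers that are not taken
from the example source.

```c
#include <assert.h>
#include <stddef.h>

#define PADDING 32   /* value quoted in the text */

/* Row pitch, in elements, of an output matrix whose rows hold
   `row_len` values followed by PADDING unused elements. */
static size_t padded_pitch(size_t row_len)
{
    return row_len + PADDING;
}

/* Linear index of element (row, col) in the padded output buffer. */
static size_t padded_index(size_t row, size_t col, size_t row_len)
{
    assert(col < row_len);
    return row * padded_pitch(row_len) + col;
}
```

With this layout, an output matrix with 4096 values per row has a pitch of
4128 elements, so successive rows no longer start at a power-of-two distance
from one another. The buffer must be allocated with padded_pitch(row_len)
elements per row.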
 
Note that the .cl compute kernel file(s) are loaded and compiled at
runtime.  The example source assumes that these files are in the same 
path as the built executable.
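
Loading a kernel file at runtime might look like the sketch below. It covers
only the file-reading step in portable C, and load_text_file() is a
hypothetical helper that may differ from the code in transpose.c.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Reads the whole file at `path` into a NUL-terminated heap buffer.
 * Returns NULL on failure. The caller must free() the result.
 * If `length_out` is not NULL, it receives the number of bytes read.
 */
static char *load_text_file(const char *path, size_t *length_out)
{
    FILE *fp;
    long size;
    char *text;

    assert(path != NULL);

    fp = fopen(path, "rb");
    if (fp == NULL)
        return NULL;

    /* Determine the file size by seeking to the end and back. */
    if (fseek(fp, 0, SEEK_END) != 0 ||
        (size = ftell(fp)) < 0 ||
        fseek(fp, 0, SEEK_SET) != 0) {
        fclose(fp);
        return NULL;
    }

    text = malloc((size_t)size + 1);
    if (text != NULL) {
        size_t got = fread(text, 1, (size_t)size, fp);
        text[got] = '\0';
        if (length_out != NULL)
            *length_out = got;
    }
    fclose(fp);
    return text;
}
```

The returned string can then be handed to clCreateProgramWithSource() and
compiled with clBuildProgram(). A relative path such as "transpose_kernel.cl"
resolves against the current working directory, so this approach relies on
the application being launched from the directory that holds the kernel file.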
 
For simplicity, this example is intended to be run from the command line.
If run from within Xcode, open the Run Log (Command-Shift-R) to see the
output.  Alternatively, launch the application from the command line in a
Terminal.app session.
 
===========================================================================
BUILD REQUIREMENTS:
 
Mac OS X v10.6 or later
 
===========================================================================
RUNTIME REQUIREMENTS:
 
Mac OS X v10.6 or later
 
To use the GPU as a compute device, use one of the following devices:
- MacBook Pro w/NVIDIA GeForce 8600M
- Mac Pro w/NVIDIA GeForce 8800GT
 
 
===========================================================================
PACKAGING LIST:
 
ReadMe.txt
transpose.c
transpose.xcodeproj
transpose_kernel.cl
 
===========================================================================
CHANGES FROM PREVIOUS VERSIONS:
 
Version 1.0
- First version.
 
===========================================================================
Copyright (C) 2008 Apple Inc. All rights reserved.