Compute kernels on beta 5 (and 6) are totally broken (radar 22029422)

The simplest compute kernels run faster on beta 5 (near OpenCL performance), but there are many problems that break all our "serious" compute kernels... It will be really difficult for us to isolate every issue in a kernel to send a radar, as there are too many, but we will try.


For example:

- A threadgroup_barrier on threadgroup memory seems to reset the threadgroup buffers.

- A conditional test on a threadgroup buffer component inside a for loop is not executed.


Everything we had done that worked on beta 4 no longer works 😢

Below are two sample kernels that no longer work on beta 5 (but were working on beta 4)...


As a reference point, this one works:

kernel void kernelWorking(const device float *inFloat [[ buffer(0) ]],
                          device int *outCount [[ buffer(1) ]],
                          constant kernelIssuesUniforms &uniforms [[ buffer(2) ]],
                          
                          threadgroup float *localFloat [[ threadgroup(0) ]],
                          threadgroup int *localCount [[ threadgroup(1) ]],
                          uint globalId [[ thread_position_in_grid ]],
                          uint localId [[ thread_position_in_threadgroup ]],
                          uint localSize [[ threads_per_threadgroup ]])
{
  if (globalId >= uniforms._workSize)
    return;
  localFloat[localId] = inFloat[globalId];
  localCount[localId] = 0;

  threadgroup_barrier(mem_flags::mem_threadgroup);

  if (localFloat[localId] < 1000)
  {
    ++(localCount[localId]);
  }
  outCount[globalId] = localCount[localId];
}


This one doesn't work because of the second barrier (outCount is always filled with 0):


kernel void kernelIssue0(const device float *inFloat [[ buffer(0) ]],
                         device int *outCount [[ buffer(1) ]],
                   
                         constant kernelIssuesUniforms &uniforms [[ buffer(2) ]],
                   
                         threadgroup float *localFloat [[ threadgroup(0) ]],
                         threadgroup int *localCount [[ threadgroup(1) ]],
                   
                         uint globalId [[ thread_position_in_grid ]],
                         uint localId [[ thread_position_in_threadgroup ]],
                         uint localSize [[ threads_per_threadgroup ]])
{
  if (globalId >= uniforms._workSize)
    return;

  localFloat[localId] = inFloat[globalId];
  localCount[localId] = 0;

  threadgroup_barrier(mem_flags::mem_threadgroup);

  if (localFloat[localId] < 1000)
  {
    ++(localCount[localId]);
  }

  threadgroup_barrier(mem_flags::mem_threadgroup);

  outCount[globalId] = localCount[localId];
}
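In case it helps others hitting the same thing, here is a workaround sketch I came up with (hypothetical only, I have not been able to verify it on beta 5 / D500): cache the per-thread result in a private variable before the second barrier, so nothing re-reads threadgroup memory after the barrier fires.

```metal
kernel void kernelIssue0Workaround(const device float *inFloat [[ buffer(0) ]],
                                   device int *outCount [[ buffer(1) ]],
                                   constant kernelIssuesUniforms &uniforms [[ buffer(2) ]],
                                   threadgroup float *localFloat [[ threadgroup(0) ]],
                                   threadgroup int *localCount [[ threadgroup(1) ]],
                                   uint globalId [[ thread_position_in_grid ]],
                                   uint localId [[ thread_position_in_threadgroup ]])
{
  if (globalId >= uniforms._workSize)
    return;

  localFloat[localId] = inFloat[globalId];
  localCount[localId] = 0;

  threadgroup_barrier(mem_flags::mem_threadgroup);

  // Keep the result in a thread-private variable...
  int count = 0;
  if (localFloat[localId] < 1000)
    ++count;
  localCount[localId] = count;

  threadgroup_barrier(mem_flags::mem_threadgroup);

  // ...so the final write does not read threadgroup memory after the barrier.
  outCount[globalId] = count;
}
```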


This one wraps the conditional of the working sample inside a for loop and doesn't work, producing only 0 in the outCount buffer:


kernel void kernelIssue1(const device float *inFloat [[ buffer(0) ]],
                         device int *outCount [[ buffer(1) ]],
                   
                         constant kernelIssuesUniforms &uniforms [[ buffer(2) ]],
                   
                         threadgroup float *localFloat [[ threadgroup(0) ]],
                         threadgroup int *localCount [[ threadgroup(1) ]],
                   
                         uint globalId [[ thread_position_in_grid ]],
                         uint localId [[ thread_position_in_threadgroup ]])
{
  if (globalId >= uniforms._workSize)
    return;

  localFloat[localId] = inFloat[globalId];
  localCount[localId] = 0;

  threadgroup_barrier(mem_flags::mem_threadgroup);

  for (int i = 0; i < 16; i++)
  {
    if (localFloat[localId] < 1000)
    {
      ++(localCount[localId]);
    }
  }

  outCount[globalId] = localCount[localId];
}
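If the miscompile involves the read-modify-write of threadgroup memory inside the loop, a variant that accumulates in a thread-private counter and writes back once might dodge it. Again, a sketch only, I haven't verified this on the D500; this would replace everything after the barrier in kernelIssue1:

```metal
  // Accumulate privately instead of doing ++localCount[localId] in the loop.
  int count = 0;
  for (int i = 0; i < 16; i++)
  {
    if (localFloat[localId] < 1000)
    {
      ++count;
    }
  }
  // Single write-back to threadgroup and device memory.
  localCount[localId] = count;
  outCount[globalId] = count;
```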



I set up a little Xcode project with these samples and reported it (radar 22029422).

How the **** can these kinds of bugs pass your validation process? Beta 4, with its assortment of issues, was able to execute these kernels…


Don’t you have some moderately complex compute kernels that can help you, or do you cross your fingers waiting for our feedback? 👿


Here at Iluac we have a bunch of kernels, translated from OpenCL, that can vigorously shake your drivers… And seriously, since I already test and debug El Capitan as a part-time job for nothing, hire me for three months and I’ll help you ship decent drivers (yeah… you know, I’m French and supposedly arrogant)! And because I love Apple, I’ll do it for an honest salary, a business class seat, unrestricted access to your Red Bull storehouse, and a ticket for WWDC ’16! 😁

iPerKard, thanks for the report. I'm looking into what changed in Beta 5.

Again, it seems this is specifically an AMD driver issue, because these problems occur only on our Mac Pro (D500) and not on our iMac with an Nvidia GT 750M... I have updated the radar...


PLEASE, TELL US WHAT THE PROBLEM IS WITH AMD GPUs??? 😕


Will we ever compile and run kernels on these GPUs? It is unfair to leave us in the dark... Will we get the same poor support as with OpenCL on these GPUs? Do you have a plan to fix this?

Still broken on beta 6...

Thanks for the update iPerKard. This should be addressed in an upcoming beta.
