Metal Matrix Multiplication reference code

Hi,

I have just started experimenting with Metal. I really just want to offload a few matrix multiplications to the GPU.


So, I downloaded this perfectly applicable example code and ran it: https://developer.apple.com/library/ios/samplecode/MetalPartialSumsCompute/Introduction/Intro.html#//apple_ref/doc/uid/TP40015013-Intro-DontLinkElementID_2


I can run it on my iPhone 6, but there are several issues:

  1. It crashes sometimes (EXC_BAD_ACCESS (code=1), and other malloc errors). That's a bit worrying for reference code.
  2. It says I am getting approximately 0.04 "gflops/sec"[sic] using either CPU (Accelerate) or GPU (Metal). I believe I should be getting 3+ GFLOPS from the CPU? (http://www.anandtech.com/show/8554/the-iphone-6-review/3) I'm not sure what to expect from the GPU/Metal but I am assuming a considerably higher number.


Here's example output:

2015-12-02 14:48:27.772 MetalMatrixMultiply[5477:2056717] >> [12] Matrix Dimensions: A = [1248 x 1137], B = [1137 x 2004], C = [1248 x 2004], lda = 1248, ldb = 2008, ldc = 2008*

[12] Accelerate 0.029273 gflops/sec, Metal 0.056429 gflops/sec, Accelerate 19428.553542 millisec, Metal 10078.565833 millisec, Diff 2.669654e-08

My understanding is that a [1000x1000] matrix times a [1000x1000] matrix should be approximately 1 GFLOP (n^3 multiply-adds), so a 20 second runtime implies ~0.05 GFLOPS. (It's more complicated than that, but it's the right order of magnitude... https://devtalk.nvidia.com/default/topic/482834/how-to-compute-gflops-for-gemm-blas/)


Does anyone have any experience with this code? I didn't change it at all, so I assume this behavior is unexpected?

Thanks for any hints on this...

Brian

I reimplemented my own matrix multiplication kernel to compare. My implementation is a pretty minimal for loop, comparable to this: https://github.com/CNugteren/myGEMM/blob/master/extra/minimal.cpp.
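In CPU terms, the kernel I wrote is essentially the naive triple loop below (a sketch in plain C++ for illustration; the names, row-major layout, and vector types are my assumptions, not taken from my Metal kernel or the sample code):

```cpp
#include <vector>

// Naive GEMM, comparable to myGEMM's minimal.cpp: C = A * B, where
// A is M x K, B is K x N, and C is M x N, all stored row-major.
void gemm_naive(int M, int N, int K,
                const std::vector<float>& A,
                const std::vector<float>& B,
                std::vector<float>& C) {
    for (int m = 0; m < M; ++m) {
        for (int n = 0; n < N; ++n) {
            float acc = 0.0f;
            for (int k = 0; k < K; ++k) {
                acc += A[m * K + k] * B[k * N + n];  // one multiply-add per step
            }
            C[m * N + n] = acc;
        }
    }
}
```

The inner loop does M*N*K multiply-adds total, which is the operation count I use for the GFLOPS numbers below.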

I get very similar performance to the reference code above: a 1024x1024 * 1024x2048 multiplication (i.e., approximately the same size as the matrix in my example above) takes 8 seconds, which comes out to 0.5 GFLOPS.

(Note, I don't know why the reference code says 0.05 GFLOPS and my calculation is 0.5 GFLOPS when they both take the same amount of time for the same sized matrices. My calculation is simply 1024x1024x2048x2 / (8x1e9).)


I am still very puzzled about this. 0.5 GFLOPS seems too low by a couple of orders of magnitude, but I don't see how a kernel could be much more efficient than a simple for loop. I am hoping I am missing something obvious here...
