Hi,
I have just started experimenting with Metal. I really just want to offload a few matrix multiplications to the GPU.
So, I downloaded this perfectly applicable example code and ran it: https://developer.apple.com/library/ios/samplecode/MetalPartialSumsCompute/Introduction/Intro.html#//apple_ref/doc/uid/TP40015013-Intro-DontLinkElementID_2
I can run it on my iPhone 6, but there are several issues:
- It crashes sometimes (EXC_BAD_ACCESS (code=1), and other malloc errors). That's a bit worrying for reference code.
- It says I am getting approximately 0.04 "gflops/sec"[sic] using either CPU (Accelerate) or GPU (Metal). I believe I should be getting 3+ GFLOPS from the CPU? (http://www.anandtech.com/show/8554/the-iphone-6-review/3) I'm not sure what to expect from the GPU/Metal but I am assuming a considerably higher number.
Here's example output:
2015-12-02 14:48:27.772 MetalMatrixMultiply[5477:2056717] >> [12] Matrix Dimensions: A = [1248 x 1137], B = [1137 x 2004], C = [1248 x 2004], lda = 1248, ldb = 2008, ldc = 2008*
[12] Accelerate 0.029273 gflops/sec, Metal 0.056429 gflops/sec, Accelerate 19428.553542 millisec, Metal 10078.565833 millisec, Diff 2.669654e-08
My understanding is that multiplying a [1000x1000] matrix by a [1000x1000] matrix is approximately 1 GFLOP of work (n^3 multiply-adds), so a 20-second runtime implies ~0.05 GFLOPS. (It's more complicated than that, since each multiply-add is usually counted as two operations, but it's the right order of magnitude... https://devtalk.nvidia.com/default/topic/482834/how-to-compute-gflops-for-gemm-blas/)
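As a rough sanity check on that arithmetic, here is a small Python sketch (not part of the sample code) that recomputes the rate from the dimensions and timings in the log output above. It assumes the common convention of 2*m*n*k floating-point operations for an [m x k] times [k x n] product; the function name gemm_gflops is just illustrative.

```python
# Dimensions and timings copied from the log output above.
m, k, n = 1248, 1137, 2004      # A is [m x k], B is [k x n], C is [m x n]
accelerate_ms = 19428.553542
metal_ms = 10078.565833


def gemm_gflops(m, n, k, millisec):
    """GFLOPS for C = A * B, assuming 2*m*n*k floating-point operations."""
    flops = 2 * m * n * k       # one multiply and one add per inner-product term
    seconds = millisec / 1000.0
    return flops / seconds / 1e9


accelerate_gflops = gemm_gflops(m, n, k, accelerate_ms)
metal_gflops = gemm_gflops(m, n, k, metal_ms)

print(f"Accelerate: {accelerate_gflops:.3f} GFLOPS")
print(f"Metal:      {metal_gflops:.3f} GFLOPS")
```

Under that assumption the logged run works out to roughly 0.29 GFLOPS for Accelerate and 0.56 GFLOPS for Metal, about ten times the figures printed in the log, but still far below the 3+ GFLOPS I expected from the CPU.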
Does anyone have any experience with this code? I didn't change any part of the code, so I am assuming this behavior is unexpected?
Thanks for any hints on this...
Brian