Performance of MPSMatrixMultiplication relative to cblas_sgemm

I've been experimenting with the MPSMatrixMultiplication kernel from Metal Performance Shaders, but I am not able to get even close to BLAS (cblas_sgemm) performance.


Is this performance of MPSMatrixMultiplication expected?

A side note: it took me about 10 min to compose the full message (much longer than what I originally posted) and then another 30-40 min to deal with the spam filter (which is NOT doing its job). Hence the cryptic initial message...


The experiment was run on an iPhone 6 with iOS 10, using Objective-C, with these parameters:

params.numRowsA = 5;

params.numColsA = 350;

params.numRowsB = 3500;

params.numColsB = 350;

params.numRowsC = 5;

params.numColsC = 3500;

params.alpha = 1;

params.beta = 1;

params.transA = NO;

params.transB = YES;
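
To spell out what these parameters describe: the operation is C = alpha * A * B^T + beta * C, with A being 5x350, B being 3500x350 (used transposed) and C being 5x3500. Since beta is 1, the existing contents of C are added to the product, so C has to be initialized before the call. Below is a minimal plain-C sketch of that operation, assuming row-major storage. The name ref_sgemm_nt is a made-up helper; it is a naive triple loop meant only for checking the output of either backend, not a description of how Accelerate or MPS compute the product.

```c
#include <assert.h>
#include <stddef.h>

/* Naive reference for C = alpha * A * B^T + beta * C, row-major.
 * A is m x k, B is n x k (used transposed), C is m x n.
 * Hypothetical helper for checking results only; it is not how
 * Accelerate or MPS implement the multiplication. */
static void ref_sgemm_nt(size_t m, size_t n, size_t k,
                         float alpha, const float *A, const float *B,
                         float beta, float *C)
{
    assert(A != NULL && B != NULL && C != NULL);
    for (size_t i = 0; i < m; ++i) {
        for (size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            /* Row i of A dotted with row j of B (i.e. column j of B^T). */
            for (size_t p = 0; p < k; ++p) {
                acc += A[i * k + p] * B[j * k + p];
            }
            C[i * n + j] = alpha * acc + beta * C[i * n + j];
        }
    }
}
```

For the first test case this would be called as ref_sgemm_nt(5, 3500, 350, 1.0f, A, B, 1.0f, C). Under the same row-major assumption, the matching Accelerate call would be cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasTrans, 5, 3500, 350, 1.0f, A, 350, B, 350, 1.0f, C, 3500).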


On average MPS takes about 20 ms for this operation, whereas BLAS takes about 7 ms. The majority of the MPS time is spent in the computation itself (i.e. not in preparing the buffers, etc.), as can be seen from the log of one of the iterations (times are cumulative):

0.517964ms (since start) to allocate matrices

0.835955ms (since start) to allocate kernel and buffers

1.121998ms (since start) to encode to command buffer

1.312971ms (since start) to commit

20.446956ms (since start) to complete
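
Because the log is cumulative, the per-stage cost is the difference between consecutive entries: about 0.52 ms for the matrices, 0.32 ms for the kernel and buffers, 0.29 ms for encoding, 0.19 ms for the commit, and about 19.13 ms between commit and completion. That last interval presumably includes GPU scheduling as well as the kernel execution itself. A small sketch of the subtraction, where stage_durations is a made-up helper name:

```c
#include <assert.h>
#include <stddef.h>

/* Convert cumulative timestamps (ms since start) into per-stage
 * durations: out[0] = t[0], out[i] = t[i] - t[i-1].
 * Hypothetical helper, only to illustrate the arithmetic. */
static void stage_durations(const double *cumulative, double *out, size_t n)
{
    assert(cumulative != NULL && out != NULL);
    for (size_t i = 0; i < n; ++i) {
        out[i] = (i == 0) ? cumulative[0]
                          : cumulative[i] - cumulative[i - 1];
    }
}
```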


I ran this in a loop (100 iterations) and, aside from the first iteration, which takes longer, performance is consistent across iterations 2..100. Each iteration allocates its own buffers and matrices, to rule out any caching effects.


If I test with larger matrices, e.g.

params.numRowsA = 1024;

params.numColsA = 1024;

params.numRowsB = 350;

params.numColsB = 1024;

params.numRowsC = 1024;

params.numColsC = 350;


MPS takes around 82 ms and BLAS around 30 ms.
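
To put both cases on a common scale, the timings can be turned into effective throughput using the usual 2 * M * N * K flop count for a matrix product. The small case is 2 * 5 * 3500 * 350 = 12,250,000 flops, which works out to roughly 0.61 GFLOP/s for MPS (20 ms) and 1.75 GFLOP/s for BLAS (7 ms). The larger case is 2 * 1024 * 350 * 1024 = 734,003,200 flops, i.e. roughly 8.95 GFLOP/s for MPS (82 ms) and 24.5 GFLOP/s for BLAS (30 ms). So BLAS is ahead by a factor of about 2.7 to 2.9 in both cases. A rough sketch of the calculation, with made-up helper names and the average timings above as input:

```c
#include <assert.h>

/* Flop count for an (m x k) by (k x n) product: one multiply and
 * one add per inner-loop step, so 2 * m * n * k in total. */
static double gemm_flops(double m, double n, double k)
{
    assert(m > 0 && n > 0 && k > 0);
    return 2.0 * m * n * k;
}

/* Effective throughput in GFLOP/s for a wall time given in ms. */
static double gflops(double flops, double ms)
{
    assert(ms > 0);
    return flops / (ms * 1e-3) / 1e9;
}
```

These figures only rescale the measured wall times; they say nothing about where the time goes inside either implementation.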


Is this performance of MPSMatrixMultiplication expected?
