I have had similar experiences with compute kernels. They seem to be generally slower on most (if not all) devices I use, namely both GPUs of a 2012 rMBP (NVIDIA and Intel) as well as the iPad Pro 12.9. But the devil is in the details:
1) Some tasks (for example scatter, or communication between workers) are awkward or impossible to do via shaders and the render pipeline. Compute is more natural here, and can be faster, too.
2) I am under the impression that the iPad Pro can execute compute tasks in parallel with render tasks. So by moving part of the work from an overloaded renderer into a compute kernel, one can still get better overall performance.
3) It is harder to write (and schedule) an optimal compute kernel than its shader counterpart. One is basically left with many hard decisions, like threadgroup sizes. Even more importantly, the best choices may vary between devices. I believe that serious performance work in CUDA or OpenCL frequently involves writing "autotuning" code.
4) A naive compute kernel (for example a simple 1-to-1 mapping between data size and number of threads/threadgroups) is usually less efficient than a similar kernel that processes several items (say 2, 4 or 8) per thread.
5) On the other hand, the GPU and/or driver understands rendering better, so it may help with rendering in ways it can't with compute. You then get device-specific optimization "for free". I guess this is even more important on TBDRs, with their specific tiling requirements.
6) I'd say that buffers are sometimes overused in compute kernels. Rendering usually uses textures, and their caches help a lot.
7) The hierarchical structure of rendering (vertex, then fragment shaders) sometimes allows for huge optimizations by simply moving "common" calculations from fragment to vertex shaders. This depends on the particulars, but it is usually a big win.
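To make point 4 a bit more concrete, here is a rough CPU-side sketch in C++ of the dispatch arithmetic when each thread handles several items. It is not Metal code: the function names, the "multiply by two" work, and the sequential loop standing in for a GPU dispatch are all made up for illustration.

```cpp
#include <cstddef>
#include <vector>

// All names here are hypothetical; this only models the indexing on the CPU.

// Threads needed when each thread handles `itemsPerThread` items (rounded up).
std::size_t threadCount(std::size_t itemCount, std::size_t itemsPerThread) {
    return (itemCount + itemsPerThread - 1) / itemsPerThread;
}

// Threadgroups needed to cover `threads` threads (rounded up).
std::size_t threadgroupCount(std::size_t threads, std::size_t threadsPerGroup) {
    return (threads + threadsPerGroup - 1) / threadsPerGroup;
}

// Body of one "thread": handles up to `itemsPerThread` consecutive items.
void kernelBody(std::size_t threadId, std::size_t itemsPerThread,
                const std::vector<float>& in, std::vector<float>& out) {
    const std::size_t first = threadId * itemsPerThread;
    for (std::size_t i = 0; i < itemsPerThread; ++i) {
        const std::size_t idx = first + i;
        if (idx >= in.size()) return;  // tail guard for the last (or padding) thread
        out[idx] = in[idx] * 2.0f;     // placeholder work
    }
}

// Sequential stand-in for a dispatch of whole threadgroups.
std::vector<float> runCoarsened(const std::vector<float>& in,
                                std::size_t itemsPerThread,
                                std::size_t threadsPerGroup) {
    std::vector<float> out(in.size(), 0.0f);
    const std::size_t threads = threadCount(in.size(), itemsPerThread);
    const std::size_t groups = threadgroupCount(threads, threadsPerGroup);
    // Whole groups are launched, so some threads may have nothing to do.
    for (std::size_t t = 0; t < groups * threadsPerGroup; ++t) {
        kernelBody(t, itemsPerThread, in, out);
    }
    return out;
}
```

For 10 items with 4 items per thread and 2 threads per group, this gives 3 threads and 2 threadgroups, and the one padding thread exits at the tail guard. Whether consecutive or strided items per thread work better depends on the device's memory access behaviour, so treat the layout as one more thing to measure.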
And in your particular example above, the number of vertex shader invocations is very small, so it is unlikely to have a performance impact.
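Coming back to point 3, the "autotuning" idea can be sketched roughly as follows, again in plain C++. This is only an assumption about how such code might look, not something taken from a real framework: `pickThreadgroupSize` and the `measure` callback are hypothetical, and in real code `measure` would wrap an actual GPU dispatch plus timing.

```cpp
#include <cstddef>
#include <functional>
#include <limits>
#include <vector>

// Hypothetical helper: returns the candidate threadgroup size with the lowest
// measured cost, or 0 if there are no candidates.
// `measure` is assumed to run the kernel with the given size and return the
// elapsed time in any consistent unit.
std::size_t pickThreadgroupSize(const std::vector<std::size_t>& candidates,
                                const std::function<double(std::size_t)>& measure,
                                int repeats = 3) {
    std::size_t best = 0;
    double bestCost = std::numeric_limits<double>::infinity();
    for (std::size_t size : candidates) {
        // Keep the minimum over a few runs to reduce timing noise.
        double cost = std::numeric_limits<double>::infinity();
        for (int r = 0; r < repeats; ++r) {
            const double c = measure(size);
            if (c < cost) cost = c;
        }
        if (cost < bestCost) {
            bestCost = cost;
            best = size;
        }
    }
    return best;
}
```

One would run this once per device (and probably per kernel), then cache the result, since the best size on one GPU need not be the best on another.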
Best regards
Michal
PS. What I wrote above is based on my Metal experience. I also have experience with OpenCL/OpenGL on various other devices (older MBPs, standalone PC GPUs), and most of what I wrote above holds there, too. I remember writing CL kernels that were faster than GL shaders, but the kernels were very painstakingly optimized AND GL had performance penalties when switching render targets, which hampered the performance of GL shaders doing compute work.