I have encountered a strange performance issue.
For the same binary running a CPU-intensive NEON task, the iPhone4s performs better than the iPhone7Plus.
At first I guessed it might be caused by the different CPU architectures: the iPhone4s uses armv7 and the iPhone7Plus uses arm64.
But when I tested the same binary on an iPhone6, it performed normally, like the iPhone4s and far better than the iPhone7Plus, so arm64 itself is not the problem.
Then I started to think that maybe the NEON ASM code is not optimized for the Apple A10 and is hitting alignment or cache-miss issues.
After many tests, I found that the iPhone7 performs well for short periods of time, and finally I spotted that when I set a breakpoint and then continue executing, the performance improves greatly.
Now I suspect this issue may be caused by the iPhone7's asymmetric multi-core CPU and the OS task scheduling algorithm.
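One way to make the "fast only for short periods" behaviour visible is to time the work in fixed-size windows and log each window separately, so a sudden speed change (for example after resuming from a breakpoint) shows up in the numbers. Below is a minimal sketch of that idea, not the code I actually ran: `run_workload_once` is a hypothetical placeholder for the real NEON routine, and the window sizes are arbitrary assumptions.

```c
#define _POSIX_C_SOURCE 199309L
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical stand-in for the real NEON kernel; swap in the actual workload. */
static volatile uint32_t sink;
static void run_workload_once(void) {
    uint32_t acc = 0;
    for (uint32_t i = 0; i < 100000u; ++i) {
        acc += i * 2654435761u;
    }
    sink = acc; /* keep the result observable so the loop is not discarded */
}

/* Monotonic wall-clock time in seconds. */
static double now_seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Run `windows` windows of `iters_per_window` calls to `work`,
   storing the elapsed seconds of each window in out[]. */
static void time_windows(void (*work)(void), int windows,
                         int iters_per_window, double *out) {
    for (int w = 0; w < windows; ++w) {
        double start = now_seconds();
        for (int i = 0; i < iters_per_window; ++i) {
            work();
        }
        out[w] = now_seconds() - start;
    }
}

/* Print one line per window so slow and fast phases can be compared. */
static void print_windows(const double *secs, int windows) {
    for (int w = 0; w < windows; ++w) {
        printf("window %d: %.3f ms\n", w, secs[w] * 1e3);
    }
}
```

Calling `time_windows(run_workload_once, 50, 100, secs)` followed by `print_windows(secs, 50)` on each device would give a per-window trace. If the scheduling guess is right, I would expect the iPhone7Plus trace to jump between two distinct timing levels rather than stay flat, but that is only a hypothesis.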
Here is the first thread I posted to the webm forum, for reference: