Hi, I am porting some applications to M1 that make extensive use of vDSP. In many cases I found only a minimal speed-up, which I put down to Rosetta doing a good job of translating SSE instructions into equivalent Neon instructions in the vDSP library.

To understand this better, I started profiling various areas of the code and found situations where translated code runs faster than native code. Native code is often similar in speed or faster, as expected, but there are a notable number of cases where it is not. This is not what I expected.

I include a sample below to show a somewhat contrived and trivial routine exhibiting the effect. I built it using Xcode 12.5.1 in Release with an 11.3 deployment target. The Mac is running macOS 11.6. On my M1 Mac mini the Rosetta build takes around 900-1000 µs to run to completion; the native build takes around 1500-1600 µs. I can make various adjustments to the data size or types of vDSP operations used to find scen