How to profile/avoid neon data stalls on iPhones

Hi,

I've spent a lot of time unsuccessfully trying to figure out what I need to do to avoid stalls in performance-critical pieces of my NEON code. For example, here's a screenshot of the CPU profiler results:


This is the function I wrote for testing the issue:


//CheckSolid_mem(unsigned char const* src, unsigned int* dst, int count):
function CheckSolid_mem, export=1
        add             r12, r2, #1                  @ r12 = count + 1, loop counter
0:
        sub             r12, r12, #1                 @ decrement counter
        mov             r2, r0                       @ reset src pointer every iteration
        mov             r3, r1                       @ reset dst pointer every iteration
        cmp             r12, #0                      @ sets the flags used by bgt at the bottom
        vld1.32         {d16, d17, d18, d19}, [r2]!  @ load eight 32-bit ints
        vld1.32         {d20, d21, d22, d23}, [r2]!  @ load eight more
        vdup.32         q12, d16[0]                  @ broadcast the first element
        vceq.i32        q8, q8, q12                  @ compare every lane against it
        vceq.i32        q9, q9, q12
        vceq.i32        q10, q10, q12
        vceq.i32        q11, q11, q12
        vand            q8, q8, q9                   @ fold the comparison masks together
        vand            q10, q10, q11
        vand            q8, q8, q10
        vand            d16, d16, d17
        vpmin.u32       d16, d16, d16                @ d16 = all-equal mask (all ones or all zeros)
        vbic.i16        d24, #0x7                    @ a few bitwise ops on the broadcast value
        vbic.i16        d24, #0x700
        vorr.i32        d24, #0x2000000
        vand            d16, d16, d24                @ mask & modified value
        vst1.32         {d16}, [r3]!                 @ the store that eats 50% of the runtime
        bgt             0b
        mov             r0, r2
        bx              lr
endfunc


In short, CheckSolid_mem reads sixteen 32-bit ints, checks whether they are all equal (building a mask of all zeros or all ones in d16 based on that equality), does a few bitwise ops on d24, and then ANDs the mask in d16 with the result in d24. The final result is stored to the memory pointed to by the dst pointer. This entire block is repeated count times. For testing I run it with count = 100M or so, and I get that profile picture.

As you can see, and as commonly happens in my NEON code, I get a stall on the `vst1.32 {d16}, [r3]!` line. That instruction alone takes 50% of the entire runtime of the function. This is something I've come to expect in NEON code when storing the value of a NEON register to memory or to an ARM register. I want to understand why exactly this happens and what I have to do to avoid/mitigate the issue. Normally I know what to do in such cases, but on iPhones I don't get why it never works: no matter what I try and no matter how I reshuffle my code, I always get these stalls in places where they simply kill the performance of my code.
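Roughly, in intrinsics form, one iteration does something like the following (my own back-translation from the assembly above, not the C code I actually ship, so treat the details and names as approximate):

#include <arm_neon.h>
#include <stdint.h>

// Approximate C equivalent of one CheckSolid_mem iteration (reconstructed by
// hand from the assembly; names are mine).
static void check_solid_once(const unsigned char *src, unsigned int *dst)
{
    const uint32_t *s = (const uint32_t *)src;

    uint32x4_t a = vld1q_u32(s + 0);                        // d16/d17
    uint32x4_t b = vld1q_u32(s + 4);                        // d18/d19
    uint32x4_t c = vld1q_u32(s + 8);                        // d20/d21
    uint32x4_t d = vld1q_u32(s + 12);                       // d22/d23

    uint32x4_t first = vdupq_lane_u32(vget_low_u32(a), 0);  // vdup.32 q12, d16[0]

    uint32x4_t m = vceqq_u32(a, first);                     // per-lane "equals first element"
    m = vandq_u32(m, vceqq_u32(b, first));
    m = vandq_u32(m, vceqq_u32(c, first));
    m = vandq_u32(m, vceqq_u32(d, first));

    uint32x2_t mask = vand_u32(vget_low_u32(m), vget_high_u32(m));
    mask = vpmin_u32(mask, mask);                           // all equal -> all ones, else zeros

    uint32x2_t v = vget_low_u32(first);                     // the d24 manipulation
    v = vbic_u32(v, vdup_n_u32(0x00070007));                // vbic.i16 #0x7
    v = vbic_u32(v, vdup_n_u32(0x07000700));                // vbic.i16 #0x700
    v = vorr_u32(v, vdup_n_u32(0x02000000));                // vorr.i32 #0x2000000

    vst1_u32((uint32_t *)dst, vand_u32(mask, v));           // the store that shows the stall
}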

From my understanding, the code shown has one obvious issue: the result in d16 isn't immediately available, so there is some latency added before it can be stored. If I replace the stored register with d2 (which isn't used anywhere in the function, so it is ready immediately), that line takes roughly 200x instead of 255x; so, supposedly, the stall from writing a NEON register to memory accounts for about 200x in this function.


Normally there shouldn't be any stall from writing a NEON register to memory. All the specs list vst1 as an opcode that takes just a few cycles, and in my case I use the simplest form, which takes 1 cycle. Now, if I then tried to access that memory through an ARM register, I would expect to see the stall on reading that memory, since it should take 10-20 cycles before that piece of memory is available on the ARM side. Something similar goes for moving results from NEON registers to ARM registers: `vmov.32 r3, d16[0]` takes a cycle, but if I then try to access r3 I get the same epic stall. In short, accessing on the ARM side any data that originated on the NEON side will stall.

This has been my understanding of NEON for many years. To avoid stalls on NEON->ARM transfers, you run some other unrelated code for 10-20 more cycles, and by then the data is ready on the ARM side and can be accessed without the epic stall that takes 50% of the function's runtime.
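To make concrete what I mean, here's the usual trick sketched in intrinsics (my own made-up helper names, and keeping in mind that with intrinsics the compiler does the final scheduling anyway; in hand-written assembly the same idea means placing the vmov many instructions before the first use of the destination ARM register):

#include <arm_neon.h>
#include <stdint.h>

// Hypothetical helper: reduce one 4-lane block to a scalar.  The final
// vget_lane_u32 is the NEON -> ARM transfer whose latency we want to hide.
static inline uint32_t block_max(const uint32_t *p)
{
    uint32x4_t v = vld1q_u32(p);
    uint32x2_t m = vpmax_u32(vget_low_u32(v), vget_high_u32(v));
    m = vpmax_u32(m, m);
    return vget_lane_u32(m, 0);
}

// Software-pipelined consumer: the scalar produced for one block is read only
// after the NEON work for the next block has already been issued.
static uint32_t max_of_blocks(const uint32_t *src, int nblocks)
{
    uint32_t best = 0;
    uint32_t pending = block_max(src);             // transfer for block 0 starts here

    for (int i = 1; i < nblocks; i++) {
        uint32_t next = block_max(src + 4 * i);    // independent work covers the latency
        if (pending > best)                        // consume the previous block's result late
            best = pending;
        pending = next;
    }
    return (pending > best) ? pending : best;
}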


So... apparently there is something wrong with either the profiler, my understanding of NEON, or even Apple's chips, but I cannot figure out what I could possibly do to avoid these stalls. I tried inserting some 50 nop instructions before storing the NEON register, and some 50 nop instructions after storing it: the timing changes, but no matter what, this specific instruction always takes an epically huge amount of time compared to all other instructions.

Can some engineers from Apple clarify what's going on with Apple's chips and why I cannot avoid these stalls? I've spent weeks trying different approaches without any success. People experienced in this type of high-performance optimization strongly advise me against using iPhones for this kind of profiling work, saying I will never get correct results, or what I would normally get from ARM chips. But my target software mainly runs on iPhones, so I would really like to understand what is going on.


I use an iPhone 6 for development. I build this test code in 32-bit ARM mode.

There was a screenshot of the profiler window which got omitted. Here's the link.

Also, I use the CPU sampling profiler, and somehow cannot figure out how to use the counters profiler, which should supposedly give me better insights. When I use the counters profiler I don't get the same line-by-line printout.

I can't speak to the CPU architecture questions, but I can tell you about the profiler. The profiler uses a timer interrupt, and the PC that's sampled is the instruction that needs to be executed next to continue. Unfortunately that means the behavior depends on how the processor has fused the instructions and which instructions it considers interruptible. It's totally possible for the processor to advance the architectural PC even though the instructions are still in flight, so the instruction in the profile with all the weight may also include the weight of the prior instructions. It could also be a dependency issue, or a memory latency issue, or any of the other concerns you listed above.


The performance-counter interrupts (rather than timer interrupts) have a similar problem. The PC doesn't have to be pointing at the instruction that actually caused the interrupt, so even switching to counters could give a similar result. Generally when I'm looking at profiles I tend to look at the instructions before and after the heaviest one. We unfortunately just don't have the state and models it would require to get a closer approximation.

As I understand it, the latest ARM chips have out-of-order execution, so even the arm.com docs say it's not easy to predict exact timing. Also, as you suggest, in the profiler results I see that lots of instructions are batched (or executed in parallel?) and don't have any timing info in the picture. Also, in the case of NEON the PC doesn't necessarily reflect the position of execution, since NEON has its own execution queue; that would also explain why moves from NEON to ARM state take a long time: the NEON instructions are actually executed later, from their own queue. That was my vague understanding of how it works.

In any case, a correct understanding of what's going on should make it possible to write NEON code that doesn't have these epic stalls that I see. In the example code posted I tried to write some simple NEON code and then reorder instructions, or do something else, so that there would be no stall on the NEON->ARM or NEON->RAM move. But it seems like this is impossible on iPhones: no matter what I do, I always get that epic stall. What's even more surprising is that I get the problem with NEON->RAM, which shouldn't have these issues (unless you try to access that written-to memory on the ARM side).

Basically I've spent a lot of time experimenting and talking to different people who should know this better, and it seems that my last resort is to ask for the information at the source: would anybody at Apple be able to explain what's going on and what needs to be done, in general, to avoid this type of problem?

> would anybody at Apple be able to explain what's going on and what needs to be done, in general, to avoid this type of problem?


Might want to consider burning a support ticket w/DTS via the Member Center. Be sure to have a sample project that demonstrates...


Good luck.

Sometimes this can happen normally when the buffer you are writing to either isn't resident in the cache, or isn't resident in memory at all because it was recently allocated and needs to be zero-filled by the kernel after the pages are made resident (and probably other pages evicted). In the latter case, you are seeing hundreds of stores complete normally, and then in the one loop iteration that hits a new page, the store takes a very long time due to the VM trap. You are more likely to see this in situ in an app than in some tight benchmark loop that is reusing the same memory over and over. If these effects are the ones causing trouble for you, calling memset on the buffer before you write to it should move the stall into memset, assuming the buffer is not so large that it doesn't fit in the cache, causing other problems. This won't make things go any faster, but you'll at least see your code running as you thought it would, which lets you rule out instruction selection as the problem and go see about reusing memory more effectively.
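Something along these lines before the timed region would do it (a sketch only; the buffer size and names are just for illustration):

#include <stdlib.h>
#include <string.h>

// Touch every page of the destination up front so the zero-fill / page-in
// cost lands here rather than inside the vst1 in the timed loop.
unsigned int *alloc_warm_dst(size_t bytes)
{
    unsigned int *dst = malloc(bytes);
    if (dst)
        memset(dst, 0, bytes);    // forces the kernel to make the pages resident now
    return dst;
}

Then run CheckSolid_mem against that warmed buffer and see whether the weight on the store drops.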

> As I understand it, the latest ARM chips have out-of-order execution, so even the arm.com docs say it's not easy to predict exact timing.

Out-of-order execution, but in-order completion, so that program order is observed to occur as written and we don't get unpredictable results. What usually (not always) happens with relatively straight-line, non-branchy code on out-of-order machines is that the out-of-order buffers fill up with work as intended, and the pipeline is thereafter limited by the rate at which instructions can retire. New instructions cannot enter the pipeline until others retire to make room. Instructions cannot retire if their work is not done, and the program counter where the samples land is not updated until the instruction retires, so the pattern of long instructions showing up as stalls in traces reasserts itself even with out-of-order execution. Most vector loops have this behavior. If the machine is able to retire some number of instructions per cycle, that is the pattern you will see.

For example, a set of Intel processors at one point could retire 4 instructions per cycle, and if you had well-tuned code with no microcode expansion or long instructions like division going on, you'd see the samples landing every 4 instructions. When you are not seeing instructions retiring at 4 per cycle, but perhaps only managing one or two, that is an indication of either a stall or an instruction that was broken down into many operations as microcode. Microcode can both load a pile of unobservable operations into the pipeline, slowing things down, and introduce decoder stalls that prevent other instructions from being decoded. So I try to avoid microcoded instructions in my own code. You can look up the work of Agner Fog online for a list of Intel microcoded instructions. I am not aware of a similar effort for Apple Silicon.

That said, sometimes you can just tell. If an instruction is doing multiple very different things, such as a store with update -- we have an address calculation, a data store, and an update of a general-purpose register -- then it is a very tempting target for a hardware engineer to involve different ALUs for each part, which means the instruction will have to be microcoded. Sometimes machines have magical boundaries that optimize instructions cracked into two micro-ops but not three or more (for example), so depending on the microarchitecture, some limited microcoding can be okay, but really complex ops with lots of micro-operations are much more likely to be bad news. You see a lot of this in older ISAs like arm32 / Thumb, and not so much in newer ones like arm64, which emphasize RISC-style instructions over complex ones.

If I didn't have specs on the microarchitecture, I might try removing the update on the store and seeing if that makes any difference. You can also try compiling the code in C and seeing what the compiler does. The compiler writers at Apple do have specs on the hardware and tune the compiler accordingly, so if the compiler is not emitting a store with update, that may be a sign that either the compiler writer missed an important and obvious optimization, or the compiler writer knows something you don't and the instruction is bad news! The former outcome might have been more common 20 years ago due to manpower constraints, but these days I'd bet on the back-end engineer knowing what he is doing and having had the time to do the right thing. There are, of course, many, many right things to do.
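For the "compile it in C" experiment, a minimal loop like the one below (my own example, just for illustration) is enough; build it for 32-bit ARM with optimization and -S, then check whether the vst1 the compiler emits uses the post-increment writeback form ([rN]!) or a plain store followed by a separate add.

#include <arm_neon.h>
#include <stdint.h>

// Minimal store loop to feed the compiler, so you can inspect the generated
// assembly for store-with-update vs. store plus a separate pointer add.
void copy_blocks(const uint32_t *src, uint32_t *dst, int count)
{
    for (int i = 0; i < count; i++) {
        uint32x4_t v = vld1q_u32(src + 4 * i);
        vst1q_u32(dst + 4 * i, v);
    }
}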