Which instructions are included in the hardware event `INST_SIMD_ALU`?

I asked this on StackOverflow too, but did not get a response. Copying verbatim (images might not work as expected).

Short question: which instructions, other than floating-point arithmetic instructions like fmul, fadd, fdiv, etc., are counted under the hardware event INST_SIMD_ALU in Xcode Instruments? Alternatively, how can I count the number of floating-point operations in a program using CPU counters?

I want to measure/estimate the FLOP count of my program and thought that CPU counters might be a good tool for this. The closest hardware event mnemonic that I could find is INST_SIMD_ALU, whose description reads:

Retired non-load/store Advanced SIMD and FP unit instructions

So, as a sanity check, I wrote a tiny Swift snippet with an ostensibly predictable FLOP count.

let iterCount = 1_000_000_000
var x = 3.1415926
let a = 2.3e1
let ainv = 1 / a  // avoid inf
for _ in 1...iterCount {
    x *= a
    x += 1.0
    x -= 6.1
    x *= ainv
}

So, I expect around 4 * iterCount = 4e9 FLOPs. But on running this under CPU Counters with the event INST_SIMD_ALU, I get a count of 5e9, which is one more counted instruction per loop iteration than expected. See the screenshot below; dumbLoop is the name of the function that I wrapped the code in.

https://i.stack.imgur.com/WQGG3.png
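To make the mismatch explicit, here is a minimal sketch of the arithmetic behind the expected and observed counts above. The observed value is the rounded 5e9 figure from the text, not the exact counter reading, and the variable names are only illustrative.

```swift
// Rounded figures from the post; the exact reading is in the screenshot.
let iterCount = 1_000_000_000
let flopsPerIteration = 4                 // fmul, fadd, fadd, fmul in the loop body
let expectedCount = flopsPerIteration * iterCount
let observedCount = 5_000_000_000         // approximate INST_SIMD_ALU count

// Counted instructions per loop iteration, and the excess over the expectation.
let observedPerIteration = Double(observedCount) / Double(iterCount)
let extraPerIteration = observedPerIteration - Double(flopsPerIteration)

print("expected: \(expectedCount), observed: \(observedCount)")
print("per iteration: \(observedPerIteration), extra: \(extraPerIteration)")
```

Dividing by the iteration count is what suggests that exactly one additional instruction per iteration is being counted, rather than some fixed overhead outside the loop.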

Here is the assembly for the loop

+0x3c	fmul                d0, d0, d1   <----------------------------------
+0x40	fadd                d0, d0, d2                                      |
+0x44	fmov                d4, x10                                         | 
+0x48	fadd                d0, d0, d4                                      |
+0x4c	fmul                d0, d0, d3                                      |
+0x50	subs                x9, x9, #0x1                                    |
+0x54	b.ne                "specialized dumbLoop(_:initialValue:)+0x3c" ---

Since the event counts non-load/store instructions, I assumed it shouldn't be counting fmov and b.ne. That leaves subs, which is an integer subtraction instruction used for decrementing the loop counter. So, I ran two more "tests" to see if the one extra count comes from subs.

On running it again under CPU Counters with the hardware event INST_INT_ALU, I found a count of one billion, which matches the number of loop decrements.

https://i.stack.imgur.com/q79jo.png

Just to be sure, I unrolled the loop by a factor of 4, so that the number of loop decrements becomes 250 million from one billion.

let iterCount = 1_000_000_000
var x = 3.1415926
let a = 2.3e1
let ainv = 1 / a  // avoid inf
let n = iterCount / 4  // integer division; iterCount is already an Int
for _ in 1...n {
    x *= a
    x += 1.0
    x -= 6.1
    x *= ainv
    x *= a
    x += 1.0
    x -= 6.1
    x *= ainv
    x *= a
    x += 1.0
    x -= 6.1
    x *= ainv
    x *= a
    x += 1.0
    x -= 6.1
    x *= ainv
}
print(x)

And it adds up: there are around 250 million integer ALU instructions, and the total ALU instruction count is 4.23 billion, somewhat short of the expected 4.25 billion.

https://i.stack.imgur.com/AEsB6.png

So, at the moment, if I want to count the FLOPs in my program, one estimate I can use is INST_SIMD_ALU - INST_INT_ALU. But is this description complete, or are there any other instructions that I might spuriously count as floating-point operations? Is there a better way to count the number of FLOPs?
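For concreteness, here is a small sketch of that estimate applied to the rounded counts from the two runs above. The helper name estimatedFlops is hypothetical, and the subtraction only encodes my working hypothesis that integer ALU instructions are included in INST_SIMD_ALU; it is not confirmed counter behavior.

```swift
// Hypothetical helper: FLOP estimate as INST_SIMD_ALU minus INST_INT_ALU.
// This assumes the working hypothesis from the post, which is unverified.
func estimatedFlops(simdALU: Int, intALU: Int) -> Int {
    return simdALU - intALU
}

// Rounded counter values quoted in the post.
let rolledEstimate = estimatedFlops(simdALU: 5_000_000_000, intALU: 1_000_000_000)
let unrolledEstimate = estimatedFlops(simdALU: 4_230_000_000, intALU: 250_000_000)

print("rolled loop estimate:   \(rolledEstimate)")
print("unrolled loop estimate: \(unrolledEstimate)")
```

With these rounded inputs the rolled loop gives exactly 4e9, while the unrolled loop comes out slightly under, which is the same small shortfall noted above.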