I asked this on StackOverflow too, but did not get a response. Copying verbatim (images might not work as expected).
Short question: which instructions, other than floating point arithmetic instructions like fmul, fadd, fdiv, etc., are counted under the hardware event INST_SIMD_ALU in Xcode Instruments? Alternatively, how can I count the number of floating point operations in a program using CPU counters?
I want to measure or estimate the FLOP count of my program and thought that CPU counters might be a good tool for this. The closest hardware event mnemonic that I could find is INST_SIMD_ALU, whose description reads:
Retired non-load/store Advanced SIMD and FP unit instructions
So, as a sanity check, I wrote a tiny Swift snippet with an ostensibly predictable FLOP count.
let iterCount = 1_000_000_000
var x = 3.1415926
let a = 2.3e1
let ainv = 1 / a // avoid inf
for _ in 1...iterCount {
    x *= a
    x += 1.0
    x -= 6.1
    x *= ainv
}
So, I expect there to be around 4 * iterCount = 4e9 FLOPs. But on running this under CPU Counters with the event INST_SIMD_ALU, I get a count of 5e9, one more per loop iteration than expected. See the screenshot below; dumbLoop is the name of the function that I wrapped the code in.
https://i.stack.imgur.com/WQGG3.png
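The mismatch can be stated per iteration. The following is a minimal sketch of that arithmetic, assuming the rounded reading of 5e9 quoted above; `eventsPerIteration` is a hypothetical helper name and not part of Instruments.

```swift
// Hypothetical helper: average number of counted events per loop iteration.
func eventsPerIteration(count: Double, iterations: Int) -> Double {
    return count / Double(iterations)
}

let iterCount = 1_000_000_000
let expectedFlops = 4 * iterCount   // four arithmetic statements per iteration
let observedSimdAlu = 5.0e9         // rounded INST_SIMD_ALU reading quoted in the text

print(expectedFlops)                                                      // 4000000000
print(eventsPerIteration(count: observedSimdAlu, iterations: iterCount))  // 5.0
```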
Here is the assembly for the loop:
+0x3c fmul d0, d0, d1 <----------------------------------
+0x40 fadd d0, d0, d2 |
+0x44 fmov d4, x10 |
+0x48 fadd d0, d0, d4 |
+0x4c fmul d0, d0, d3 |
+0x50 subs x9, x9, #0x1 |
+0x54 b.ne "specialized dumbLoop(_:initialValue:)+0x3c" ---
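As an aid for comparing the listing with the counter readings, here is a small sketch that tallies the mnemonics in one loop iteration. The mnemonic list is transcribed by hand from the disassembly above, and `tally` is a hypothetical helper. It only counts occurrences and makes no claim about which hardware event each instruction increments.

```swift
// Mnemonics of one loop iteration, copied in order from the disassembly.
let loopBody = ["fmul", "fadd", "fmov", "fadd", "fmul", "subs", "b.ne"]

// Count how many times each mnemonic appears.
func tally(_ mnemonics: [String]) -> [String: Int] {
    var counts: [String: Int] = [:]
    for mnemonic in mnemonics {
        counts[mnemonic, default: 0] += 1
    }
    return counts
}

let counts = tally(loopBody)
// Print in alphabetical order so the output is stable.
for (mnemonic, n) in counts.sorted(by: { $0.key < $1.key }) {
    print("\(mnemonic): \(n)")
}
```

Each per-iteration count, multiplied by one billion iterations, gives the number of retired instructions of that kind to compare against a counter.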
Since the event counts only non-load/store Advanced SIMD and FP instructions, I assumed it would not count fmov or b.ne. That leaves subs, an integer subtraction instruction used for decrementing the loop counter. So, I ran two more "tests" to see if the one extra count comes from subs.
On running it again under CPU Counters with the hardware event INST_INT_ALU, I found a count of one billion, which matches the number of loop decrements.
https://i.stack.imgur.com/q79jo.png
Just to be sure, I unrolled the loop by a factor of 4, so that the number of loop decrements drops from one billion to 250 million.
let iterCount = 1_000_000_000
var x = 3.1415926
let a = 2.3e1
let ainv = 1 / a // avoid inf
let n = iterCount / 4 // loop trips after unrolling by 4
for _ in 1...n {
    x *= a
    x += 1.0
    x -= 6.1
    x *= ainv
    x *= a
    x += 1.0
    x -= 6.1
    x *= ainv
    x *= a
    x += 1.0
    x -= 6.1
    x *= ainv
    x *= a
    x += 1.0
    x -= 6.1
    x *= ainv
}
print(x)
And it adds up: there are around 250 million integer ALU instructions, and the total ALU instruction count is 4.23 billion, somewhat short of the expected 4.25 billion (4 billion floating point operations plus 250 million loop decrements).
https://i.stack.imgur.com/AEsB6.png
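The expected figure for the unrolled loop can be worked out as follows. This is a sketch under the assumption made in this post that each loop decrement is counted alongside the floating point operations; the observed value is the rounded 4.23 billion reading, and all variable names are illustrative.

```swift
let iterCount = 1_000_000_000
let unrollFactor = 4
let loopTrips = iterCount / unrollFactor   // 250 million loop decrements
let flops = 4 * iterCount                  // unrolling does not change the arithmetic
let expectedTotal = flops + loopTrips      // 4.25 billion under the stated assumption
let observedTotal = 4_230_000_000          // rounded reading from the screenshot

print(expectedTotal)                       // 4250000000
print(expectedTotal - observedTotal)       // 20000000
```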
So, at the moment, if I want to count the FLOPs in my program, one estimate I can use is INST_SIMD_ALU - INST_INT_ALU. But is this description complete, or are there any other instructions that I might spuriously count as floating point operations? Is there a better way to count the number of FLOPs?
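For concreteness, the estimate above can be written as a small helper. This is only a sketch: `CounterReadings` and `estimatedFlops` are hypothetical names, the values are the rounded readings from the first (non-unrolled) run, and the subtraction rests on the unverified assumption that the extra count comes from integer ALU instructions.

```swift
// Hypothetical container for two counter readings copied from Instruments.
struct CounterReadings {
    let instSimdAlu: Int   // INST_SIMD_ALU count
    let instIntAlu: Int    // INST_INT_ALU count
}

// Estimate from the text: subtract the integer ALU count from the SIMD/FP count.
func estimatedFlops(_ readings: CounterReadings) -> Int {
    return readings.instSimdAlu - readings.instIntAlu
}

// Rounded readings for the first loop: 5e9 and 1e9.
let firstRun = CounterReadings(instSimdAlu: 5_000_000_000, instIntAlu: 1_000_000_000)
print(estimatedFlops(firstRun))   // 4000000000
```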