Core ML: Sharp increase in load time when quantising from 16 to 8 bits

Hi,

I have a BERT-large model that has been quantised to 16 bits. According to the performance report, the model takes about 500 ms to load.

If I quantise the model further to 8 bits, the load time increases drastically to almost 2500 ms, a fivefold increase.
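For reference, the 8-bit step I mean looks roughly like this with coremltools' linear weight quantiser (a sketch; the model paths are placeholders, and coremltools 7+ with an mlprogram package is assumed):

```python
import coremltools as ct
import coremltools.optimize.coreml as cto

# Load the existing 16-bit model (path is a placeholder)
model = ct.models.MLModel("BertLarge_fp16.mlpackage")

# Linear symmetric 8-bit quantisation applied to all weights
op_config = cto.OpLinearQuantizerConfig(mode="linear_symmetric", dtype="int8")
config = cto.OptimizationConfig(global_config=op_config)
model_int8 = cto.linear_quantize_weights(model, config=config)

model_int8.save("BertLarge_int8.mlpackage")
```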

Does anyone have an explanation for this? If anything, I would have expected the load time to decrease, since the further quantisation halves the model size. I have repeated the measurements several times and always get the same result. My only guess is that it is related to the compute precision, which is limited to 16 bits on the CPU, GPU and Neural Engine, so the 8-bit weights would presumably have to be converted back to 16 bits while the model is being loaded.
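The numbers above come from the performance report; as a sanity check, the load can also be timed directly with coremltools (a sketch reusing the placeholder paths from above; note that the first load of each package includes compilation, so absolute numbers will differ from the on-device report):

```python
import time
import coremltools as ct

for path in ("BertLarge_fp16.mlpackage", "BertLarge_int8.mlpackage"):
    start = time.perf_counter()
    # Loading triggers compilation for the selected compute units;
    # ComputeUnit.ALL lets Core ML use CPU, GPU and the Neural Engine.
    ct.models.MLModel(path, compute_units=ct.ComputeUnit.ALL)
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"{path}: {elapsed_ms:.0f} ms")
```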
