I am trying to benchmark whether the Qwen3 1.7B model can run on an iPhone SE 3 (4 GB RAM).
My core problem: even with weight quantization, the model fails to load into memory on the SE 3.
What I've tried:
I am converting a PyTorch model to the Core ML format using coremltools. I have tried the following combinations of weight quantization and context length:
- 8 bit + 1024
- 8 bit + 2048
- 4 bit + 1024
- 4 bit + 2048
All of the above use dynamic input shapes with a default shape of [1, 1], in the hope that memory for the whole context length does not get allocated up front.
- The 4-bit model is approximately 865 MB on disk
- The 8-bit model is approximately 1.7 GB on disk
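Those disk sizes are consistent with the raw weight math (a rough estimate that treats the model as ~1.7B uniformly quantized parameters and ignores per-block scales and any layers kept in fp16), which is why I expected the weights themselves to fit comfortably under 2 GB:

```python
# Rough weight-size estimate for a ~1.7B-parameter model.
# Ignores quantization scale/zero-point overhead and layers kept in
# higher precision, so the real files come out slightly larger.
params = 1.7e9

int4_bytes = params * 4 / 8  # 4 bits per weight
int8_bytes = params * 8 / 8  # 8 bits per weight

print(f"int4: {int4_bytes / 1e6:.0f} MB")  # ~850 MB, close to the 865 MB file
print(f"int8: {int8_bytes / 1e9:.1f} GB")  # ~1.7 GB, matching the 8-bit file
```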
During load:
- With int4 quantization, memory spikes sharply during the initial load. Could this be because many operations are converted to int8 or fp16, since Core ML does not execute operations natively on int4?
- With int8, the profiler shows memory staying well under 2 GB (only around 900 MB), but the model still fails to load with the following error. (2 GB is roughly the limit at which jetsam kills an app on the iPhone SE 3.)
E5RT: Error(s) occurred compiling MIL to BNNS graph:
[CreateBnnsGraphProgramFromMIL]: BNNS Graph Compile:
failed to preallocate file with error: No space left on device
for path: /var/mobile/Containers/Data/Application/
5B8BB7D2-06A6-4BAE-A042-407B6D805E7C/Library/Caches
/com.tss.qwen3-coreml/
com.apple.e5rt.e5bundlecache/
23A341/<long key>.tmp.12586_4362093968.bundle/
H14.bundle/main/main_bnns/bnns_program.bnnsir
Some online sources have suggested activation quantization, but I am unsure whether that would have any impact on loading, since the spike occurs during load rather than inference.
Inspecting the model spec also suggests that no dequantization is happening (e.g., from int4 to fp16).
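The dequantization check can be done by dumping the spec to text and counting op names (a sketch; the op names are the ones coremltools typically emits for compressed weights, and `spec_text` would come from something like `str(ct.models.MLModel("Qwen3.mlpackage").get_spec())`):

```python
def count_quant_ops(spec_text: str) -> dict:
    """Count weight-compression / dequantize op names in a textual
    dump of a Core ML mlprogram spec."""
    op_names = [
        "constexpr_affine_dequantize",      # weight-only quantization (iOS 16/17)
        "constexpr_blockwise_shift_scale",  # block-wise quantization (iOS 18)
        "dequantize",                       # activation dequantize op
    ]
    counts = {name: spec_text.count(name) for name in op_names}
    # Substring counting: "constexpr_affine_dequantize" also contains
    # "dequantize", so subtract those hits from the plain-op count.
    counts["dequantize"] -= counts["constexpr_affine_dequantize"]
    return counts

# Usage (assumes coremltools is installed and the package path is valid):
# import coremltools as ct
# spec_text = str(ct.models.MLModel("Qwen3.mlpackage").get_spec())
# print(count_quant_ops(spec_text))
```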
So I have a couple of questions:
- Has anyone faced similar issues?
- What could cause the temporary memory spike during load?
- What approaches can be adopted to deal with this issue?
Any help would be greatly appreciated. Thank you.