I am trying to benchmark whether the Qwen3 1.7B model can run on an iPhone SE 3 (4 GB RAM).
My core problem: even with weight quantization, the model fails to load into memory on the SE 3.
What I've tried:
I am converting a PyTorch model to the Core ML format using coremltools. I have tried the following combinations of weight quantization and context length:
- 8 bit + 1024
- 8 bit + 2048
- 4 bit + 1024
- 4 bit + 2048
All of the above use dynamic input shapes with a default shape of [1, 1], in the hope that memory for the whole context length does not get allocated up front.
- The 4-bit model is approximately 865 MB on disk
- The 8-bit model is approximately 1.7 GB on disk
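Those disk sizes are consistent with the raw weight math (a rough estimate that treats the model as ~1.7B uniformly quantized parameters and ignores per-block scales and any layers kept in fp16), which is why I expected the weights themselves to fit comfortably under 2 GB:

```python
# Rough weight-size estimate for a ~1.7B-parameter model.
# Ignores quantization scale/zero-point overhead and layers kept in
# higher precision, so the real files come out slightly larger.
params = 1.7e9

int4_bytes = params * 4 / 8  # 4 bits per weight
int8_bytes = params * 8 / 8  # 8 bits per weight

print(f"int4: {int4_bytes / 1e6:.0f} MB")  # ~850 MB, close to the 865 MB file
print(f"int8: {int8_bytes / 1e9:.1f} GB")  # ~1.7 GB, matching the 8-bit file
```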
During load:
- With int4 quantization, memory spikes sharply during the initial load. Could this be because many operations are converted to int8 or fp16, since Core ML does not execute operations natively on int4?
- With int8, the profiler shows memory staying well under 2 GB (only around 900 MB), but the model still fails to load with the following error. (2 GB is roughly the limit at which jetsam kills an app on the iPhone SE 3.)
E5RT: Error(s) occurred compiling MIL to BNNS graph:
[CreateBnnsGraphProgramFromMIL]: BNNS Graph Compile:
failed to preallocate file with error: No space left on device
for path: /var/mobile/Containers/Data/Application/
5B8BB7D2-06A6-4BAE-A042-407B6D805E7C/Library/Caches
/com.tss.qwen3-coreml/
com.apple.e5rt.e5bundlecache/
23A341/<long key>.tmp.12586_4362093968.bundle/
H14.bundle/main/main_bnns/bnns_program.bnnsir
Some online sources have suggested activation quantization, but I am unsure whether that would have any impact on loading, since the spike occurs during load rather than inference.
Inspecting the model spec also suggests that no dequantization is happening (e.g., from int4 to fp16).
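The dequantization check can be done by dumping the spec to text and counting op names (a sketch; the op names are the ones coremltools typically emits for compressed weights, and `spec_text` would come from something like `str(ct.models.MLModel("Qwen3.mlpackage").get_spec())`):

```python
def count_quant_ops(spec_text: str) -> dict:
    """Count weight-compression / dequantize op names in a textual
    dump of a Core ML mlprogram spec."""
    op_names = [
        "constexpr_affine_dequantize",      # weight-only quantization (iOS 16/17)
        "constexpr_blockwise_shift_scale",  # block-wise quantization (iOS 18)
        "dequantize",                       # activation dequantize op
    ]
    counts = {name: spec_text.count(name) for name in op_names}
    # Substring counting: "constexpr_affine_dequantize" also contains
    # "dequantize", so subtract those hits from the plain-op count.
    counts["dequantize"] -= counts["constexpr_affine_dequantize"]
    return counts

# Usage (assumes coremltools is installed and the package path is valid):
# import coremltools as ct
# spec_text = str(ct.models.MLModel("Qwen3.mlpackage").get_spec())
# print(count_quant_ops(spec_text))
```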
So I have a couple of questions:
- Has anyone faced similar issues?
- What could cause the temporary memory spike during load?
- What approaches can be adopted to deal with this issue?
Any help would be greatly appreciated. Thank you.