Running large models like Qwen3 1.7B on devices with limited resources, such as the iPhone SE 3 with 4 GB of RAM, is indeed challenging. Let's address your questions and explore potential approaches to mitigate the issues you're facing.

Observations and Possible Causes

Memory Spike During Load:

Dynamic Shape Usage: While dynamic shapes can help manage memory allocation during inference by not preallocating for the full context length, the initial loading process often requires more fixed memory to parse the model, set up internal structures, and potentially convert certain operations to formats supported by the hardware.

Intermediate Conversion: Core ML might internally convert some operations from INT4 to INT8 or FP16 if the underlying hardware (BNNS/ANE) doesn't natively support INT4 operations. This can cause temporary spikes in memory usage during loading.

Memory Limitations: The iPhone SE 3's 4 GB RAM is quite limited for large models, especially if additional overhead from the operatin
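
To put rough numbers on the load-time spike described above, here is a minimal back-of-the-envelope sketch in Python. It assumes about 1.7 billion weights (taken from the model name), ignores quantization scales, activations, and the KV cache, and treats the worst case as a full INT4 copy and a full converted copy sitting in memory at the same time. Whether Core ML actually converts all weights at once is not confirmed here; the function names are hypothetical and only illustrate the arithmetic.

```python
# Rough weight-memory estimate for a ~1.7B-parameter model.
# All figures are approximations; PARAMS is an assumption based on the model name.

PARAMS = 1.7e9  # assumed parameter count

BITS_PER_WEIGHT = {"INT4": 4, "INT8": 8, "FP16": 16}


def weight_bytes(params: float, bits: int) -> float:
    """Bytes needed to store `params` weights at `bits` bits each."""
    return params * bits / 8


def peak_bytes_during_conversion(params: float, src_bits: int, dst_bits: int) -> float:
    """Hypothetical worst case: source and converted copies coexist in memory."""
    return weight_bytes(params, src_bits) + weight_bytes(params, dst_bits)


def to_gib(num_bytes: float) -> float:
    """Convert bytes to GiB (2**30 bytes)."""
    return num_bytes / 2**30


if __name__ == "__main__":
    for name, bits in BITS_PER_WEIGHT.items():
        print(f"{name} weights: {to_gib(weight_bytes(PARAMS, bits)):.2f} GiB")

    # Assumed worst case: every INT4 weight is expanded to FP16 during load.
    peak = peak_bytes_during_conversion(PARAMS, 4, 16)
    print(f"INT4 + FP16 copies together: {to_gib(peak):.2f} GiB")
```

Under these assumptions, the INT4 weights alone fit comfortably in 4 GB, but a full FP16 expansion would approach the device's total RAM, which is consistent with a crash during load rather than during inference.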
Topic:
Machine Learning & AI
SubTopic:
Core ML