Metal recommendedMaxWorkingSetSize vs actual RAM on iPhone (LLM load fails)

Context

I’m deploying large language models on iPhone using llama.cpp. A new iPhone Air (12 GB RAM) reports a Metal MTLDevice.recommendedMaxWorkingSetSize of 8,192 MiB (8 GiB), and my attempt to load Llama-2-13B Q4_K (~7.32 GB of weights) fails during model initialization.

Environment

Device: iPhone Air (12 GB RAM)

iOS: 26

Xcode: 26.0.1

Build: llama.cpp with the Metal backend enabled

App runs on device (not Simulator)

What I’m seeing

MTLCreateSystemDefaultDevice()?.recommendedMaxWorkingSetSize reports 8,192 MiB.

Loading Llama-2-13B Q4_K (7.32 GB) does not complete. The logs indicate memory pressure and allocation issues, consistent with the 8 GiB working-set guidance.

Smaller models (e.g., 7B/8B with similar quantization) load and run; 8B Q4_K decodes at around 9 tokens/second.
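
For anyone who wants to compare numbers on other devices, here is a minimal sketch of how these values can be read. The helper names mib and logMemoryBudget are hypothetical and not part of llama.cpp. It assumes a deployment target where MTLDevice.recommendedMaxWorkingSetSize is available on iOS, and it also prints os_proc_available_memory(), which reports how much more memory the process may use before hitting its limit.

```swift
import Foundation
import Metal
import os

/// Converts a byte count to MiB (1 MiB = 1,048,576 bytes).
/// Hypothetical helper, used only for readable log output.
func mib(_ bytes: UInt64) -> Double {
    Double(bytes) / 1_048_576.0
}

/// Prints the Metal working-set guidance next to other memory figures.
/// Hypothetical helper; call it once before attempting to load a model.
func logMemoryBudget() {
    guard let device = MTLCreateSystemDefaultDevice() else {
        print("No Metal device available")
        return
    }
    // Metal's recommended upper bound for resident GPU resources, in bytes.
    print("recommendedMaxWorkingSetSize: \(mib(device.recommendedMaxWorkingSetSize)) MiB")
    // Bytes this process has currently allocated through this device.
    print("currentAllocatedSize: \(mib(UInt64(device.currentAllocatedSize))) MiB")
    // Physical RAM as reported to the process.
    print("physicalMemory: \(mib(ProcessInfo.processInfo.physicalMemory)) MiB")
    #if os(iOS)
    // Remaining memory the process may use before reaching its limit.
    print("os_proc_available_memory: \(mib(UInt64(os_proc_available_memory()))) MiB")
    #endif
}
```

Calling logMemoryBudget() just before and just after the model load attempt should show whether the process limit or the Metal guidance is reached first.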

Questions

Is 8,192 MiB an expected recommendedMaxWorkingSetSize on a 12 GB iPhone?

What values should I expect on other 2025 devices, including iPhone 17 (8 GB RAM) and iPhone 17 Pro (12 GB RAM)?

Is the limit strictly enforced for Metal allocations (heaps/buffers), or is it advisory guidance for best performance/eviction behavior?

Can a process practically exceed this for long-lived buffers without immediate Jetsam risk?

Any guidance for LLM scenarios near the limit?
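
To make "near the limit" concrete, here is a rough sketch of the arithmetic, under stated assumptions. The weights figure treats 7.32 GB as decimal gigabytes. The KV cache formula assumes an f16 cache and standard multi-head attention (no grouped-query attention), with Llama-2-13B's 40 layers and 5,120 embedding width. The 2,048-token context is a hypothetical setting, and compute/scratch buffers are ignored. MemoryEstimate and kvCacheBytes are illustrative names, not llama.cpp APIs.

```swift
/// Hypothetical container for a rough memory estimate, all values in bytes.
struct MemoryEstimate {
    let weightsBytes: UInt64
    let kvCacheBytes: UInt64
    let budgetBytes: UInt64

    /// Weights plus KV cache; compute buffers are deliberately ignored.
    var totalBytes: UInt64 { weightsBytes + kvCacheBytes }
    /// Positive when under budget, negative when over.
    var headroomBytes: Int64 { Int64(budgetBytes) - Int64(totalBytes) }
    var fits: Bool { headroomBytes >= 0 }
}

/// KV cache size for multi-head attention without GQA:
/// 2 (K and V) x layers x embedding width x context tokens x bytes per element.
func kvCacheBytes(layers: UInt64, embeddingDim: UInt64,
                  contextTokens: UInt64, bytesPerElement: UInt64) -> UInt64 {
    2 * layers * embeddingDim * contextTokens * bytesPerElement
}

// Llama-2-13B shape; the 2,048-token context is an assumed setting.
let kv = kvCacheBytes(layers: 40, embeddingDim: 5120,
                      contextTokens: 2048, bytesPerElement: 2)

let estimate = MemoryEstimate(
    weightsBytes: 7_320_000_000,            // 7.32 GB, read as decimal GB (assumption)
    kvCacheBytes: kv,
    budgetBytes: 8192 * 1_048_576           // 8,192 MiB working-set guidance
)

print("KV cache bytes: \(kv)")
print("Total bytes: \(estimate.totalBytes)")
print("Headroom bytes: \(estimate.headroomBytes)")
print("Fits within guidance: \(estimate.fits)")
```

Under these assumptions the weights alone fit under 8,192 MiB, but weights plus a 2,048-token f16 KV cache do not, which is why the question of enforced versus advisory matters here.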
