Sharing a Swift port of Gemma 4 for mlx-swift-lm — feedback welcome

Hi all,

I've been working on a pure-Swift port of Google's Gemma 4 text decoder
that plugs into mlx-swift-lm as a sidecar model registration. I'm sharing
it here in case anyone else has hit the same wall I did, and to get feedback
from the MLX team and the community before I propose anything upstream.

Repo: https://github.com/yejingyang8963-byte/Swift-gemma4-core

Why

As of mlx-swift-lm 2.31.x, Gemma 4 isn't supported out of the box.
The obvious workaround — reusing the Gemma 3 text implementation with a patched config — fails at weight load because Gemma 4 differs from
Gemma 3 in several structural places. The chat-template path through
swift-jinja 1.x also silently corrupts the prompt, so the model loads
but generates incoherent text.

What's in the package

  • A from-scratch Swift implementation of the Gemma 4 decoder
    (Configuration, Layers, Attention, MLP, RoPE, DecoderLayer)
  • Per-Layer Embedding (PLE) support — the shared embedding table that
    feeds every decoder layer through a gated MLP as a third residual
  • KV sharing across the back half of the decoder, threaded through the
    forward pass via a donor table with a single global rope offset
  • A custom Gemma4ProportionalRoPE class for the partial-rotation rope
    type that initializeRope doesn't currently recognize
  • A chat-template bypass that builds the prompt as a literal string
    with the correct turn markers and encodes via tokenizer.encode(text:),
    matching Python mlx-lm's apply_chat_template byte-for-byte
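The PLE residual described above can be sketched in plain Swift. The struct
name, the scalar gate, and the plain scaled add are my simplifications for
illustration; the package's real implementation runs a gated MLP over
MLXArrays:

```swift
import Foundation

// Sketch of the Per-Layer Embedding third residual. Names (`PLEGate`,
// `scale`) and the scalar gating are illustrative assumptions, not the
// package's API; the real path gates through a small MLP per layer.
struct PLEGate {
    let scale: Double  // stands in for the learned gate output

    // Combine the layer's residual stream with the shared per-layer
    // embedding as an extra residual term.
    func apply(hidden: [Double], pleEmbedding: [Double]) -> [Double] {
        precondition(hidden.count == pleEmbedding.count)
        return zip(hidden, pleEmbedding).map { h, p in
            h + scale * p  // gated-MLP projection collapsed to a scaled add
        }
    }
}
```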
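The KV-sharing donor table reduces to a layer-index mapping in the forward
pass. The pairing scheme below (donor = layer minus half the depth) is an
assumption made for the sketch; the actual table is read from the checkpoint
config:

```swift
// Sketch of KV sharing across the back half of the decoder: each layer in
// the second half reuses the KV cache produced by a "donor" layer in the
// first half, so only the front half allocates caches. The specific
// pairing below is an illustrative assumption.
func kvDonor(forLayer layer: Int, numLayers: Int) -> Int? {
    let firstSharedLayer = numLayers / 2
    guard layer >= firstSharedLayer else {
        return nil  // front-half layers own their KV caches
    }
    return layer - firstSharedLayer
}
```

Because all shared layers read a cache written at the donor's position, a
single global rope offset is enough to keep positions consistent, which is
why the package threads one offset through the whole forward pass.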
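The partial-rotation rope type that `Gemma4ProportionalRoPE` handles can be
sketched as: rotate only the first `rotaryDims` of each head dimension and
pass the rest through unchanged. The pair layout and the base of 10,000 are
common RoPE conventions assumed here, not values read from the config:

```swift
import Foundation

// Sketch of partial-rotation RoPE on one head vector. Only indices
// < rotaryDims are rotated, in (x[i], x[i + rotaryDims/2]) pairs;
// indices >= rotaryDims pass through untouched.
func partialRoPE(_ x: [Double], position: Int, rotaryDims: Int,
                 base: Double = 10_000) -> [Double] {
    var out = x
    let half = rotaryDims / 2
    for i in 0..<half {
        let theta = Double(position)
            / pow(base, Double(2 * i) / Double(rotaryDims))
        let (c, s) = (cos(theta), sin(theta))
        let (a, b) = (x[i], x[i + half])
        out[i] = a * c - b * s
        out[i + half] = a * s + b * c
    }
    return out
}
```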
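The chat-template bypass amounts to assembling the prompt string by hand.
The sketch below uses the published Gemma turn markers; whether Gemma 4
keeps them unchanged, and whether `<bos>` belongs in the string or is added
by the tokenizer, are assumptions for illustration:

```swift
// Sketch of the chat-template bypass: build the prompt as a literal string
// with Gemma-style turn markers, then hand it to tokenizer.encode(text:)
// instead of going through swift-jinja. Marker strings follow the published
// Gemma chat format; the <bos> placement is an assumption here.
struct ChatMessage {
    let role: String   // "user" or "model"
    let content: String
}

func gemmaPrompt(messages: [ChatMessage]) -> String {
    var prompt = "<bos>"
    for m in messages {
        prompt += "<start_of_turn>\(m.role)\n\(m.content)<end_of_turn>\n"
    }
    prompt += "<start_of_turn>model\n"  // open the model turn for generation
    return prompt
}
```

The point of doing it this way is determinism: the output string can be
diffed byte-for-byte against what Python mlx-lm's `apply_chat_template`
produces, which is how the swift-jinja corruption was isolated in the first
place.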

Measured on iPhone (A-series, 7.4 GB RAM)

Model: mlx-community/gemma-4-e2b-it-4bit

  • Warm load: ~6 s
  • Memory after load: 341–392 MB
  • Time to first token (end-to-end, 333-token system prompt): 2.82 s
  • Generation throughput: 12–14 tok/s

What I'd love feedback on

  1. Is the sidecar registration pattern the right way to extend
    mlx-swift-lm with new model families, or is there a more
    idiomatic path I missed?
  2. The chat-template bypass works but feels like a workaround. Is the
    right long-term fix in swift-jinja, in the tokenizer, or somewhere
    else entirely?
  3. Anyone running into the same PLE / KV-sharing issues on other
    Gemma-family checkpoints? I'd like to make sure the implementation
    generalizes beyond E2B before tagging a 0.2.0.

Happy to open a PR against mlx-swift-lm if the maintainers think any
of this belongs upstream. Thanks for reading.
