gemma-3-4b-it-qat-4bit-mobile

Paper: On-Device Multimodal LLM Optimization: Fitting Gemma 3 into 2 GB

An aggressively optimized build of gemma-3-4b-it-qat-4bit for iPhone and iPad (8 GB RAM). It shrinks the model from 2.8 GB to 2.1 GB, splits the weights so text-only use can lazily skip the vision tower, and lowers both runtime memory and thermal output.

Optimizations Applied

| Step | Optimization | Effect |
|------|--------------|--------|
| 1 | Vocabulary pruning (262K → 144K tokens) | -170 MB disk, token_map remapping |
| 2 | Vision fc2 bf16 → 4-bit (pad 4304 → 4352) | -191 MB disk |
| 3 | Remove text layers 31, 32, 33 (34 → 31 layers) | -159 MB disk, faster inference |
| 4 | Image resolution 896 → 672 | ~3x less vision attention compute |
| 6 | MLP neuron pruning (layers 14-30, -25%) | -188 MB disk, faster MLP forward |
| 7 | Weight split (language + vision) | Text-only: skip 231 MB vision weights |
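
As a rough cross-check of the vocabulary-pruning figure, the sketch below estimates the saved embedding storage. It assumes the embedding table is stored at 4 bits with an fp16 scale and fp16 bias per 64-weight group (about 4.5 bits per weight); the actual on-disk layout may differ.

```python
# Back-of-the-envelope estimate of the disk saved by vocabulary pruning.
# Assumption: 4-bit weights with one fp16 scale and one fp16 bias per
# group of 64 (~4.5 bits per stored weight); actual layout may differ.
pruned_tokens = 262_208 - 144_257        # embedding rows removed
hidden_size = 2560
bits_per_weight = 4 + (16 + 16) / 64     # value bits + per-group overhead

saved_mb = pruned_tokens * hidden_size * bits_per_weight / 8 / 1e6
print(f"~{saved_mb:.0f} MB saved")       # ~170 MB, matching the table
```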

Architecture

Text model:
  vocab_size: 262,208 (token_map → 144,257 compact embeddings)
  hidden_size: 2560
  intermediate_size: 10240 (layers 0-13) / 7680 (layers 14-30)
  num_hidden_layers: 31
  num_attention_heads: 8 (GQA, 4 KV heads)
  head_dim: 256
  quantization: 4-bit, group_size=64
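
The per-layer MLP widths are not uniform; the exact 31-element array ships in config.json as text_config.per_layer_intermediate_sizes. Purely as a consistency check of the numbers above (not the authoritative config), it should look like:

```python
# Layers 0-13 keep the full MLP width; layers 14-30 are pruned by 25%.
per_layer_intermediate_sizes = [10240] * 14 + [7680] * 17

assert len(per_layer_intermediate_sizes) == 31   # num_hidden_layers
assert 7680 == 10240 * 3 // 4                    # 25% neuron pruning
```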

Vision model (SigLIP):
  hidden_size: 1152
  intermediate_size: 4352 (padded from 4304, fc2 4-bit quantized)
  num_hidden_layers: 27
  image_size: 672
  patch_size: 14
  mm_tokens_per_image: 144
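
Two small consistency checks on the vision numbers. Both rest on assumptions not spelled out above: that fc2 is padded so its input width divides the 4-bit quantization group size of 64, and that the multi-modal projector average-pools patches 4x4 as in the base Gemma 3 setup.

```python
# Assumed reason for the fc2 padding: a 4-bit group size of 64 requires the
# quantized dimension to be a multiple of 64.
assert 4304 % 64 != 0     # original SigLIP fc2 width is not a multiple of 64
assert 4352 % 64 == 0     # padded width is (68 groups of 64)

# Assumed 4x4 average pooling in the projector (as in base Gemma 3, where
# 896/14 = 64 patches per side -> 64*64/16 = 256 image tokens).
patches_per_side = 672 // 14                 # 48 at the reduced resolution
assert (patches_per_side // 4) ** 2 == 144   # mm_tokens_per_image
```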

Model Files

| File | Size | Description |
|------|------|-------------|
| language_model.safetensors | 1.9 GB | Text model weights (embeddings + 31 transformer layers) |
| vision_model.safetensors | 231 MB | Vision tower + multi-modal projector |
| model.safetensors.index.json | - | Weight file index for split loading |
| config.json | - | Model configuration with vocab_pruning and per_layer_intermediate_sizes |

Requirements

This model uses a token_map and per-layer intermediate sizes. The inference engine must handle three things (a sketch follows the list):

  1. Token map: Read vocab_pruning.compact_vocab_size (144,257) from config.json. Initialize embedding with compact size. Load language_model.model.embed_tokens.token_map (int32[262208]) and remap: embedding(token_map[input_ids]).

  2. Per-layer MLP sizes: Read text_config.per_layer_intermediate_sizes (31-element array). Initialize each transformer block's MLP with the corresponding intermediate size instead of the global intermediate_size.

  3. Split weights: Load via model.safetensors.index.json which maps weight keys to language_model.safetensors or vision_model.safetensors. For text-only inference, only language_model.safetensors is needed.
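
A minimal Python sketch of the three requirements above, shown for illustration only (the shipped runtime is the Swift CLI below). It assumes the safetensors and numpy packages, a standard Hugging Face-style index file with a weight_map field, and that the token_map tensor lives in language_model.safetensors under the key quoted in item 1; the local path is hypothetical.

```python
import json
import numpy as np
from safetensors.numpy import load_file

model_dir = "gemma-3-4b-it-qat-4bit-mobile"   # hypothetical local path

# 1. Token map: embeddings are stored compactly, so full-vocab token ids
#    must be remapped before the embedding lookup.
config = json.load(open(f"{model_dir}/config.json"))
compact_vocab = config["vocab_pruning"]["compact_vocab_size"]   # 144,257

lang = load_file(f"{model_dir}/language_model.safetensors")
token_map = lang["language_model.model.embed_tokens.token_map"]  # int32[262208]

def remap(input_ids: np.ndarray) -> np.ndarray:
    # Rows of the compact embedding table, one per original token id.
    return token_map[input_ids]

# 2. Per-layer MLP sizes: build each block's MLP from this array instead of
#    a single global intermediate_size.
mlp_sizes = config["text_config"]["per_layer_intermediate_sizes"]  # 31 entries

# 3. Split weights: the index maps each weight key to its shard, so a
#    text-only run never has to open vision_model.safetensors.
index = json.load(open(f"{model_dir}/model.safetensors.index.json"))
text_keys = [k for k, shard in index["weight_map"].items()
             if shard == "language_model.safetensors"]
```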

Usage

Swift (swift-gemma-cli)

A native Swift CLI for running this model on Apple Silicon, with full support for token_map and per-layer intermediate sizes.

git clone https://github.com/AtomGradient/swift-gemma-cli.git
cd swift-gemma-cli
swift build -c release

# Text generation (loads only language_model.safetensors ~1.9 GB)
swift run -c release gemma-cli <model-path> \
  --prompt "Hello, how are you?" --max-tokens 100 --temperature 0.0

# Image understanding (loads both files ~2.1 GB)
swift run -c release gemma-cli <model-path> \
  --image photo.jpg \
  --prompt "Describe this image in detail." --max-tokens 200 --temperature 0.0

Benchmarks (Apple Silicon)

| Metric | Original | This Model |
|--------|----------|------------|
| Disk size | 2.8 GB | 2.1 GB (1.9 GB + 231 MB) |
| Peak memory (text) | 2910 MB | 2231 MB |
| Peak memory (image) | ~5500 MB | 4358 MB |
| Prompt speed (text) | 109 t/s | 120 t/s |
| Generation speed (text) | 90 t/s | 110 t/s |
| Prompt speed (image) | 54 t/s | 184 t/s |
| Generation speed (image) | 27 t/s | 104 t/s |
| Image understanding | Correct | Correct |
| Text quality | Perfect | Minor degradation on contractions |

Quality Notes

  • Image understanding is fully preserved: correctly identifies objects, colors, composition
  • Text generation has minor quality degradation on English contractions (e.g., "I's" instead of "I'm") due to layer removal and neuron pruning — this is a trade-off for the 25% size reduction and significant speed improvement
  • For higher text quality with the same image quality, use gemma-3-4b-it-qat-4bit-lite (2.3 GB, no neuron pruning)

Base Model

gemma-3-4b-it-qat-4bit

License

Same as the base model. See Gemma Terms of Use.
