gemma-3-4b-it-qat-4bit-mobile

Paper: On-Device Multimodal LLM Optimization: Fitting Gemma 3 into 2 GB

An aggressively optimized build of gemma-3-4b-it-qat-4bit for iPhone and iPad (8 GB RAM). It shrinks the model from 2.8 GB to 2.1 GB, splits the weights so text-only use can lazily skip the vision tower, and lowers both runtime memory and thermal output.

Optimizations Applied

| Step | Optimization | Effect |
|------|--------------|--------|
| 1 | Vocabulary pruning (262K → 144K tokens) | -170 MB disk, token_map remapping |
| 2 | Vision fc2 bf16 → 4-bit (pad 4304 → 4352) | -191 MB disk |
| 3 | Remove text layers 31, 32, 33 (34 → 31 layers) | -159 MB disk, faster inference |
| 4 | Image resolution 896 → 672 | ~3x less vision attention compute |
| 6 | MLP neuron pruning (layers 14-30, -25%) | -188 MB disk, faster MLP forward |
| 7 | Weight split (language + vision) | Text-only: skip 231 MB vision weights |
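
As a rough cross-check of the vocabulary-pruning figure, the sketch below estimates the saved embedding storage. It assumes the embedding table is stored at 4 bits with an fp16 scale and fp16 bias per 64-weight group (about 4.5 bits per weight); the actual on-disk layout may differ.

```python
# Back-of-the-envelope estimate of the disk saved by vocabulary pruning.
# Assumption: 4-bit weights with one fp16 scale and one fp16 bias per
# group of 64 (~4.5 bits per stored weight); actual layout may differ.
pruned_tokens = 262_208 - 144_257        # embedding rows removed
hidden_size = 2560
bits_per_weight = 4 + (16 + 16) / 64     # value bits + per-group overhead

saved_mb = pruned_tokens * hidden_size * bits_per_weight / 8 / 1e6
print(f"~{saved_mb:.0f} MB saved")       # ~170 MB, matching the table
```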

Architecture

Text model:
  vocab_size: 262,208 (token_map → 144,257 compact embeddings)
  hidden_size: 2560
  intermediate_size: 10240 (layers 0-13) / 7680 (layers 14-30)
  num_hidden_layers: 31
  num_attention_heads: 8 (GQA, 4 KV heads)
  head_dim: 256
  quantization: 4-bit, group_size=64
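
The per-layer MLP widths are not uniform; the exact 31-element array ships in config.json as text_config.per_layer_intermediate_sizes. Purely as a consistency check of the numbers above (not the authoritative config), it should look like:

```python
# Layers 0-13 keep the full MLP width; layers 14-30 are pruned by 25%.
per_layer_intermediate_sizes = [10240] * 14 + [7680] * 17

assert len(per_layer_intermediate_sizes) == 31   # num_hidden_layers
assert 7680 == 10240 * 3 // 4                    # 25% neuron pruning
```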

Vision model (SigLIP):
  hidden_size: 1152
  intermediate_size: 4352 (padded from 4304, fc2 4-bit quantized)
  num_hidden_layers: 27
  image_size: 672
  patch_size: 14
  mm_tokens_per_image: 144
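
Two small consistency checks on the vision numbers. Both rest on assumptions not spelled out above: that fc2 is padded so its input width divides the 4-bit quantization group size of 64, and that the multi-modal projector average-pools patches 4x4 as in the base Gemma 3 setup.

```python
# Assumed reason for the fc2 padding: a 4-bit group size of 64 requires the
# quantized dimension to be a multiple of 64.
assert 4304 % 64 != 0     # original SigLIP fc2 width is not a multiple of 64
assert 4352 % 64 == 0     # padded width is (68 groups of 64)

# Assumed 4x4 average pooling in the projector (as in base Gemma 3, where
# 896/14 = 64 patches per side -> 64*64/16 = 256 image tokens).
patches_per_side = 672 // 14                 # 48 at the reduced resolution
assert (patches_per_side // 4) ** 2 == 144   # mm_tokens_per_image
```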

Model Files

| File | Size | Description |
|------|------|-------------|
| language_model.safetensors | 1.9 GB | Text model weights (embeddings + 31 transformer layers) |
| vision_model.safetensors | 231 MB | Vision tower + multi-modal projector |
| model.safetensors.index.json | - | Weight file index for split loading |
| config.json | - | Model configuration with vocab_pruning and per_layer_intermediate_sizes |

Requirements

This model uses a token_map and per-layer intermediate sizes. The inference engine must handle three things (a sketch follows the list):

  1. Token map: Read vocab_pruning.compact_vocab_size (144,257) from config.json. Initialize embedding with compact size. Load language_model.model.embed_tokens.token_map (int32[262208]) and remap: embedding(token_map[input_ids]).

  2. Per-layer MLP sizes: Read text_config.per_layer_intermediate_sizes (31-element array). Initialize each transformer block's MLP with the corresponding intermediate size instead of the global intermediate_size.

  3. Split weights: Load via model.safetensors.index.json which maps weight keys to language_model.safetensors or vision_model.safetensors. For text-only inference, only language_model.safetensors is needed.
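
A minimal Python sketch of the three requirements above, shown for illustration only (the shipped runtime is the Swift CLI below). It assumes the safetensors and numpy packages, a standard Hugging Face-style index file with a weight_map field, and that the token_map tensor lives in language_model.safetensors under the key quoted in item 1; the local path is hypothetical.

```python
import json
import numpy as np
from safetensors.numpy import load_file

model_dir = "gemma-3-4b-it-qat-4bit-mobile"   # hypothetical local path

# 1. Token map: embeddings are stored compactly, so full-vocab token ids
#    must be remapped before the embedding lookup.
config = json.load(open(f"{model_dir}/config.json"))
compact_vocab = config["vocab_pruning"]["compact_vocab_size"]   # 144,257

lang = load_file(f"{model_dir}/language_model.safetensors")
token_map = lang["language_model.model.embed_tokens.token_map"]  # int32[262208]

def remap(input_ids: np.ndarray) -> np.ndarray:
    # Rows of the compact embedding table, one per original token id.
    return token_map[input_ids]

# 2. Per-layer MLP sizes: build each block's MLP from this array instead of
#    a single global intermediate_size.
mlp_sizes = config["text_config"]["per_layer_intermediate_sizes"]  # 31 entries

# 3. Split weights: the index maps each weight key to its shard, so a
#    text-only run never has to open vision_model.safetensors.
index = json.load(open(f"{model_dir}/model.safetensors.index.json"))
text_keys = [k for k, shard in index["weight_map"].items()
             if shard == "language_model.safetensors"]
```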

Usage

Swift (swift-gemma-cli)

A native Swift CLI for running this model on Apple Silicon, with full support for token_map and per-layer intermediate sizes.

git clone https://github.com/AtomGradient/swift-gemma-cli.git
cd swift-gemma-cli
swift build -c release

# Text generation (loads only language_model.safetensors ~1.9 GB)
swift run -c release gemma-cli <model-path> \
  --prompt "Hello, how are you?" --max-tokens 100 --temperature 0.0

# Image understanding (loads both files ~2.1 GB)
swift run -c release gemma-cli <model-path> \
  --image photo.jpg \
  --prompt "Describe this image in detail." --max-tokens 200 --temperature 0.0

Benchmarks (Apple Silicon)

| Metric | Original | This Model |
|--------|----------|------------|
| Disk size | 2.8 GB | 2.1 GB (1.9 GB + 231 MB) |
| Peak memory (text) | 2910 MB | 2231 MB |
| Peak memory (image) | ~5500 MB | 4358 MB |
| Prompt speed (text) | 109 t/s | 120 t/s |
| Generation speed (text) | 90 t/s | 110 t/s |
| Prompt speed (image) | 54 t/s | 184 t/s |
| Generation speed (image) | 27 t/s | 104 t/s |
| Image understanding | Correct | Correct |
| Text quality | Perfect | Minor degradation on contractions |

Quality Notes

  • Image understanding is fully preserved: correctly identifies objects, colors, composition
  • Text generation has minor quality degradation on English contractions (e.g., "I's" instead of "I'm") due to layer removal and neuron pruning — this is a trade-off for the 25% size reduction and significant speed improvement
  • For higher text quality with the same image quality, use gemma-3-4b-it-qat-4bit-lite (2.3 GB, no neuron pruning)

Base Model

gemma-3-4b-it-qat-4bit

License

Same as the base model. See Gemma Terms of Use.
