# gemma-3-4b-it-qat-4bit-mobile

Paper: On-Device Multimodal LLM Optimization: Fitting Gemma 3 into 2 GB

An aggressively optimized version of gemma-3-4b-it-qat-4bit for iPhone/iPad (8 GB RAM). It reduces model size from 2.8 GB to 2.1 GB, splits the weights so text-only use can lazily skip the vision tower, and runs with significantly lower runtime memory and reduced thermal output.
## Optimizations Applied
| Step | Optimization | Effect |
|---|---|---|
| 1 | Vocabulary pruning, 262K → 144K tokens | -170 MB disk; token_map remapping |
| 2 | Vision fc2 bf16 → 4-bit (padded 4304 → 4352) | -191 MB disk |
| 3 | Remove text layers 31, 32, 33 (34 → 31 layers) | -159 MB disk; faster inference |
| 4 | Image resolution 896 → 672 | ~3x less vision attention compute (see the check below) |
| 5 | MLP neuron pruning (layers 14-30, -25%) | -188 MB disk; faster MLP forward |
| 6 | Weight split (language + vision) | Text-only: skip 231 MB of vision weights |
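The ~3x figure in step 4 is easy to verify: with 14x14 patches, the vision tower's token count scales with (image_size / 14)^2, and self-attention cost scales roughly with the square of the token count. A quick back-of-the-envelope check in plain Swift:

```swift
// Patch tokens per image at each resolution (SigLIP patch_size = 14).
let tokens896 = (896 / 14) * (896 / 14)   // 64 * 64 = 4096 tokens
let tokens672 = (672 / 14) * (672 / 14)   // 48 * 48 = 2304 tokens

// Self-attention is O(n^2) in token count, so compare squared token counts.
let ratio = Double(tokens896 * tokens896) / Double(tokens672 * tokens672)
print(ratio)  // ≈ 3.16 -> "~3x less vision attention compute"
```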
## Architecture

Text model:

```
vocab_size: 262,208 (token_map → 144,257 compact embeddings)
hidden_size: 2560
intermediate_size: 10240 (layers 0-13) / 7680 (layers 14-30)
num_hidden_layers: 31
num_attention_heads: 8 (GQA, 4 KV heads)
head_dim: 256
quantization: 4-bit, group_size=64
```

Vision model (SigLIP):

```
hidden_size: 1152
intermediate_size: 4352 (padded from 4304, fc2 4-bit quantized)
num_hidden_layers: 27
image_size: 672
patch_size: 14
mm_tokens_per_image: 144
```
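Both non-standard architecture features (the pruned vocabulary and the per-layer MLP sizes) are exposed in `config.json` under the key paths listed in the Requirements section below. A minimal Swift sketch of decoding just those two fields with Foundation's `JSONDecoder`, assuming no other fields are needed:

```swift
import Foundation

// Decodes only the two non-standard fields this model adds to config.json:
// vocab_pruning.compact_vocab_size and text_config.per_layer_intermediate_sizes.
struct VocabPruning: Codable { let compact_vocab_size: Int }           // 144_257
struct TextConfig: Codable { let per_layer_intermediate_sizes: [Int] } // 31 entries
struct ModelConfig: Codable {
    let vocab_pruning: VocabPruning
    let text_config: TextConfig
}

do {
    let data = try Data(contentsOf: URL(fileURLWithPath: "config.json"))
    let config = try JSONDecoder().decode(ModelConfig.self, from: data)
    print(config.vocab_pruning.compact_vocab_size)                // 144257
    print(config.text_config.per_layer_intermediate_sizes.count)  // 31
} catch {
    print("failed to read config: \(error)")
}
```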
## Model Files

| File | Size | Description |
|---|---|---|
| `language_model.safetensors` | 1.9 GB | Text model weights (embeddings + 31 transformer layers) |
| `vision_model.safetensors` | 231 MB | Vision tower + multi-modal projector |
| `model.safetensors.index.json` | - | Weight file index for split loading (see the sketch below) |
| `config.json` | - | Model configuration with `vocab_pruning` and `per_layer_intermediate_sizes` |
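`model.safetensors.index.json` uses the standard safetensors split-checkpoint layout: a `weight_map` dictionary from tensor key to file name. A minimal sketch of filtering that map for text-only loading (the struct below models only the `weight_map` field):

```swift
import Foundation

// Standard split-checkpoint index: {"weight_map": {"<tensor key>": "<file>"}}.
struct WeightIndex: Codable { let weight_map: [String: String] }

do {
    let url = URL(fileURLWithPath: "model.safetensors.index.json")
    let index = try JSONDecoder().decode(WeightIndex.self, from: Data(contentsOf: url))

    // Text-only inference: keep the keys stored in language_model.safetensors
    // and never open the 231 MB vision file.
    let textKeys = index.weight_map
        .filter { $0.value == "language_model.safetensors" }
        .map { $0.key }
    print("language-model tensors:", textKeys.count)
} catch {
    print("failed to read index: \(error)")
}
```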
## Requirements

This model uses a token_map and per-layer intermediate sizes. The inference engine must handle three things (a sketch of the first two follows this list):

1. **Token map:** Read `vocab_pruning.compact_vocab_size` (144,257) from `config.json`. Initialize the embedding with the compact size. Load `language_model.model.embed_tokens.token_map` (`int32[262208]`) and remap: `embedding(token_map[input_ids])`.
2. **Per-layer MLP sizes:** Read `text_config.per_layer_intermediate_sizes` (a 31-element array). Initialize each transformer block's MLP with the corresponding intermediate size instead of the global `intermediate_size`.
3. **Split weights:** Load via `model.safetensors.index.json`, which maps each weight key to `language_model.safetensors` or `vision_model.safetensors`. For text-only inference, only `language_model.safetensors` is needed.
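A framework-agnostic Swift sketch of the first two requirements. Here `tokenMap` stands in for the loaded `language_model.model.embed_tokens.token_map` tensor, and `MLPShape` is a hypothetical stand-in for whatever block constructor the engine actually uses:

```swift
// 1. Token map: original ids (0..<262_208) index into token_map, which
//    yields rows of the 144,257-row compact embedding table.
func remapTokens(_ inputIds: [Int32], tokenMap: [Int32]) -> [Int32] {
    inputIds.map { tokenMap[Int($0)] }   // i.e. embedding(token_map[input_ids])
}

// 2. Per-layer MLP sizes: give each of the 31 blocks its own intermediate
//    size instead of a single global intermediate_size.
struct MLPShape { let hidden: Int; let intermediate: Int }

func mlpShapes(hiddenSize: Int, perLayerSizes: [Int]) -> [MLPShape] {
    perLayerSizes.map { MLPShape(hidden: hiddenSize, intermediate: $0) }
}

// For this model: layers 0-13 use 10240, layers 14-30 use 7680.
let sizes = Array(repeating: 10_240, count: 14) + Array(repeating: 7_680, count: 17)
let shapes = mlpShapes(hiddenSize: 2_560, perLayerSizes: sizes)
assert(shapes.count == 31)
```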
## Usage

### Swift (swift-gemma-cli)

A native Swift CLI for running this model on Apple Silicon, with full support for the token_map and per-layer intermediate sizes.
```bash
git clone https://github.com/AtomGradient/swift-gemma-cli.git
cd swift-gemma-cli
swift build -c release

# Text generation (loads only language_model.safetensors, ~1.9 GB)
swift run -c release gemma-cli <model-path> \
  --prompt "Hello, how are you?" --max-tokens 100 --temperature 0.0

# Image understanding (loads both files, ~2.1 GB)
swift run -c release gemma-cli <model-path> \
  --image photo.jpg \
  --prompt "Describe this image in detail." --max-tokens 200 --temperature 0.0
```
## Benchmarks (Apple Silicon)
| Metric | Original | This Model |
|---|---|---|
| Disk size | 2.8 GB | 2.1 GB (1.9 GB + 231 MB) |
| Peak memory (text) | 2910 MB | 2231 MB |
| Peak memory (image) | ~5500 MB | 4358 MB |
| Prompt speed (text) | 109 t/s | 120 t/s |
| Generation speed (text) | 90 t/s | 110 t/s |
| Prompt speed (image) | 54 t/s | 184 t/s |
| Generation speed (image) | 27 t/s | 104 t/s |
| Image understanding | Correct | Correct |
| Text quality | Perfect | Minor degradation on contractions |
## Quality Notes
- Image understanding is fully preserved: the model correctly identifies objects, colors, and composition
- Text generation shows minor degradation on English contractions (e.g., "I's" instead of "I'm") caused by the layer removal and neuron pruning; this is the trade-off for the 25% size reduction and the significant speed improvement
- For higher text quality with the same image quality, use `gemma-3-4b-it-qat-4bit-lite` (2.3 GB, no neuron pruning)
## Base Model

gemma-3-4b-it-qat-4bit

## License

Same as the base model. See the Gemma Terms of Use.