# Step-3.5-128B-A11B

## Technical Specifications
| Attribute | Detail |
|---|---|
| Base Model | Step-3.5-Flash |
| Architecture | Sparse Mixture-of-Experts (SMoE) |
| Model Type | Causal Language Model |
| Total Parameters | 128B |
| Active Parameters | 11B (per token) |
| Compression Ratio | 40% (Expert Pruning) |
| Pruning Strategy | Router-weighted Expert Activation Pruning (REAP) |
| Calibration Set | lkevincc0/glm47-math-code-calibration-1024 |
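For a quick sense of the sparsity implied by the table, the active-parameter fraction follows directly from the listed sizes. This is only a back-of-the-envelope sketch; the parameter counts are taken from the table above:

```python
# Back-of-the-envelope check of the sparsity implied by the spec table.
total_params = 128e9   # total parameters (128B)
active_params = 11e9   # active parameters per token (11B)

active_fraction = active_params / total_params
print(f"Active fraction per token: {active_fraction:.1%}")  # ~8.6%
```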
## Local Deployment
Step-3.5-Flash is optimized for local inference and supports industry-standard backends, including vLLM, SGLang, Hugging Face Transformers, and llama.cpp.
### vLLM
We recommend using the latest nightly build of vLLM.
1. Install vLLM
```bash
# via Docker
docker pull vllm/vllm-openai:nightly

# or via pip (nightly wheels)
pip install -U vllm --pre \
    --index-url https://pypi.org/simple \
    --extra-index-url https://wheels.vllm.ai/nightly
```
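To confirm the nightly wheel installed correctly before launching the server, a quick import-and-version check is enough (plain Python, no project-specific assumptions):

```python
# Sanity check: confirm vLLM imports and report the installed version.
import vllm

print(vllm.__version__)
```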
2. Launch the server
Note: Full MTP3 support is not yet available in vLLM. We are actively working on a Pull Request to integrate this feature, which is expected to significantly enhance decoding performance.
- For the FP8 model:

```bash
vllm serve <MODEL_PATH_OR_HF_ID> \
    --served-model-name step3p5-flash \
    --tensor-parallel-size 8 \
    --enable-expert-parallel \
    --disable-cascade-attn \
    --reasoning-parser step3p5 \
    --enable-auto-tool-choice \
    --tool-call-parser step3p5 \
    --hf-overrides '{"num_nextn_predict_layers": 1}' \
    --speculative-config '{"method": "step3p5_mtp", "num_speculative_tokens": 1}' \
    --trust-remote-code \
    --quantization fp8
```
- For the BF16 model:

```bash
vllm serve <MODEL_PATH_OR_HF_ID> \
    --served-model-name step3p5-flash \
    --tensor-parallel-size 8 \
    --enable-expert-parallel \
    --disable-cascade-attn \
    --reasoning-parser step3p5 \
    --enable-auto-tool-choice \
    --tool-call-parser step3p5 \
    --hf-overrides '{"num_nextn_predict_layers": 1}' \
    --speculative-config '{"method": "step3p5_mtp", "num_speculative_tokens": 1}' \
    --trust-remote-code
```
You can also refer to the Step-3.5-Flash recipe.
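Once the server is up, vLLM exposes an OpenAI-compatible API at `http://<host>:8000/v1` by default. Below is a minimal sketch of a chat request using the official `openai` Python client; the prompt and `max_tokens` value are illustrative, and `step3p5-flash` matches the `--served-model-name` used above:

```python
# Minimal chat-completion request against the local vLLM server.
from openai import OpenAI

# The api_key is unused for a local server, but the client requires a value.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="step3p5-flash",  # must match --served-model-name
    messages=[{"role": "user", "content": "Briefly explain what a Mixture-of-Experts model is."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```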
### SGLang
1. Install SGLang
```bash
# via Docker
docker pull lmsysorg/sglang:dev-pr-18084

# or from source (pip)
pip install "sglang[all] @ git+https://github.com/sgl-project/sglang.git"
```
2. Launch the server
- For the BF16 model:

```bash
sglang serve --model-path <MODEL_PATH_OR_HF_ID> \
    --served-model-name step3p5-flash \
    --tp-size 8 \
    --tool-call-parser step3p5 \
    --reasoning-parser step3p5 \
    --speculative-algorithm EAGLE \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4 \
    --enable-multi-layer-eagle \
    --host 0.0.0.0 \
    --port 8000
```
- For the FP8 model:

```bash
sglang serve --model-path <MODEL_PATH_OR_HF_ID> \
    --served-model-name step3p5-flash \
    --tp-size 8 \
    --ep-size 8 \
    --tool-call-parser step3p5 \
    --reasoning-parser step3p5 \
    --speculative-algorithm EAGLE \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4 \
    --enable-multi-layer-eagle \
    --host 0.0.0.0 \
    --port 8000
```
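SGLang likewise serves an OpenAI-compatible HTTP API on the configured host and port. The sketch below sends a raw chat-completions request with `requests`; the payload fields mirror the OpenAI schema, and the prompt is illustrative:

```python
# Minimal raw HTTP chat-completion request against the local SGLang server.
import requests

payload = {
    "model": "step3p5-flash",  # must match --served-model-name
    "messages": [{"role": "user", "content": "Summarize expert pruning in one sentence."}],
    "max_tokens": 128,
}
resp = requests.post("http://localhost:8000/v1/chat/completions", json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```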
## Reference
- Hugging Face: stepfun-ai/Step-3.5-Flash
- Optimization Tech: CerebrasResearch/reap