MiniMax-M2.5 REAP-19 (19% Pruned)

License: MIT · Base Model: MiniMaxAI/MiniMax-M2.5 · Pruning Method: REAP

Support This Work

Pruning large MoE models requires substantial GPU resources (multi-H100 clusters). If you find these models useful, consider buying me a coffee to help offset rental costs and enable further releases. Your support makes this work possible!

Overview

This repository contains a REAP-pruned variant of the MiniMax-M2.5 Mixture-of-Experts (MoE) language model with 19% of experts removed while maintaining strong performance.

REAP (Router-weighted Expert Activation Pruning) is a structured pruning technique that identifies and removes under-utilized experts based on their activation patterns. This achieves:

  • Reduced model size and memory footprint
  • Faster inference and lower cost
  • Maintained active parameters per token
  • Full compatibility with HuggingFace Transformers

REAP Variant Selection

Choose the variant that best fits your deployment constraints:

Model     Pruned   Kept   Size Reduction   Performance Trade-off
REAP-19   19%      81%    Moderate         Small
REAP-29   29%      71%    Significant      Moderate
REAP-39   39%      61%    Large            Noticeable
REAP-50   50%      50%    Very Large       Significant


Quick Start

from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "Akicou/MiniMax-M2-5-REAP-19"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True
)

prompt = "Explain quantum entanglement in simple terms:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Memory-Efficient Loading

For systems with limited GPU memory:

# 8-bit quantization (requires the bitsandbytes package)
import torch
from transformers import BitsAndBytesConfig

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    trust_remote_code=True
)

# 4-bit quantization
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    quantization_config=quantization_config,
    trust_remote_code=True
)
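
To sanity-check how much memory the quantized model actually occupies after loading, the Transformers get_memory_footprint() helper reports the size in bytes:

# Report the loaded model's memory footprint
print(f"Model footprint: {model.get_memory_footprint() / 1e9:.1f} GB")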

Quantized GGUF Versions

Quantized GGUF variants optimized for llama.cpp, Ollama, and similar backends are in preparation in collaboration with mradermacher. Planned formats include Q4_K_M, Q5_K_M, Q6_K, and Q8_0.
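
Once the GGUF files are available they should work with standard llama.cpp tooling. A minimal sketch using llama-cpp-python, assuming a Q4_K_M file (the file name below is a placeholder, not a published artifact):

# Hypothetical GGUF usage; the file name is a placeholder until the quants are released
from llama_cpp import Llama

llm = Llama(
    model_path="MiniMax-M2-5-REAP-19.Q4_K_M.gguf",  # placeholder path
    n_ctx=4096,        # context window
    n_gpu_layers=-1    # offload all layers to GPU if available
)
out = llm("Explain quantum entanglement in simple terms:", max_tokens=256)
print(out["choices"][0]["text"])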

🔬 Pruning Methodology

REAP Framework

Pruning was performed using the REAP framework (implementation: Akicou/reap) with the following configuration:

Calibration Settings:

  • Dataset: Mixed-domain calibration corpus (150 samples per category)
  • Distance Metric: Cosine similarity
  • Loading Precision: 4-bit for memory efficiency during pruning
  • Selection Strategy: Router activation frequency analysis

Process:

  1. Collect expert activation statistics across calibration dataset
  2. Compute similarity scores between experts
  3. Identify and rank experts by utilization
  4. Prune lowest-activated experts while maintaining coverage
  5. Validate structural integrity and export pruned model

For full pruning commands, hyperparameters, and reproducibility details, see the Akicou/reap repository.
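
As a rough illustration only (this is not the REAP codebase; the expert count, ratio, and data below are made up), ranking experts by how often the router actually selects them could look like this:

# Illustrative sketch of activation-frequency expert ranking (not the REAP implementation)
import torch

def rank_experts_by_usage(router_logits: torch.Tensor, top_k: int = 2) -> torch.Tensor:
    """router_logits: [num_tokens, num_experts] collected over the calibration set."""
    # Experts the router actually routes each token to
    selected = router_logits.topk(top_k, dim=-1).indices          # [num_tokens, top_k]
    counts = torch.bincount(selected.flatten(), minlength=router_logits.shape[-1])
    return counts.argsort()                                       # least-used experts first

num_experts = 64                                  # hypothetical expert count
calib_logits = torch.randn(10_000, num_experts)   # stand-in for collected router logits
n_prune = int(0.19 * num_experts)                 # 19% of experts
prune_ids = rank_experts_by_usage(calib_logits)[:n_prune]
print("Experts to drop:", prune_ids.tolist())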

⚖️ Performance Characteristics

What Changes:

  • ✅ Reduced model size (fewer total experts)
  • ✅ Faster inference (less expert routing overhead)
  • ✅ Lower memory requirements
  • ⚠️ Slight reduction in capability on edge cases

What Stays the Same:

  • ✅ Active parameters per token (same compute per inference)
  • ✅ Model architecture and API compatibility
  • ✅ Tokenizer and input/output formats
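
As a back-of-the-envelope illustration (the expert counts below are hypothetical, not the real MiniMax-M2.5 configuration), pruning shrinks the parameters you must store while top-k routing keeps per-token compute constant:

# Hypothetical numbers, for illustration only
experts_total     = 64                          # experts per MoE layer before pruning
experts_kept      = int(experts_total * 0.81)   # after 19% pruning
params_per_expert = 1.0                         # arbitrary unit
top_k             = 2                           # experts activated per token

print(experts_total * params_per_expert)   # stored before pruning: 64.0
print(experts_kept * params_per_expert)    # stored after pruning:  51.0
print(top_k * params_per_expert)           # active per token: 2.0 (unchanged)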

Trade-offs: These models exchange a small amount of capability for significantly improved efficiency. Higher pruning rates (roughly 30% and above) may show more noticeable quality differences on complex or specialized tasks.

Note: Formal benchmarks are not provided due to resource constraints. Community evaluation contributions are welcome!

🛠️ Use Cases

Ideal for:

  • 🏠 Running large language models on consumer GPUs
  • 💻 Local development and testing
  • 🌐 Edge deployment and on-device inference
  • 💰 Cost-sensitive production environments
  • 🔬 Research on efficient model architectures

Consider the full model if:

  • You have abundant GPU resources
  • Maximum quality is critical
  • Working on highly specialized domains

📚 Citation

If you use these pruned models in your research or applications, please cite both the original REAP paper and the base model:

REAP Citation

@article{lasby2025reap,
  title={REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author={Lasby, Mike and Lazarevich, Ivan and Sinnadurai, Nish and Lie, Sean and Ioannou, Yani and Thangarasa, Vithursan},
  journal={arXiv preprint arXiv:2510.13999},
  year={2025}
}

Base Model Citation

@misc{minimax2025m25,
  title={MiniMax-M2.5: A State-of-the-Art Mixture-of-Experts Language Model},
  author={MiniMaxAI},
  year={2025},
  howpublished={\url{https://huggingface.co/MiniMaxAI/MiniMax-M2.5}}
}

🙏 Acknowledgments

  • Original Model: MiniMaxAI for developing MiniMax-M2.5
  • REAP Framework: Cerebras Research for the pruning methodology
  • Community: HuggingFace and the open-source AI community

💖 Support This Work

Pruning large MoE models requires substantial computational resources (multi-GPU H100 clusters). If you find these models useful:

  • Buy me a coffee to help offset GPU rental costs
  • ⭐ Star the GitHub repository
  • 📢 Share with others who might benefit
  • 🐛 Report issues and contribute improvements

Your support enables continued development and release of efficient model variants!

📞 Contact & Feedback

  • Issues & Requests: Open an issue on GitHub
  • Discussions: Use the HuggingFace Community tab above
  • Custom Pruning: Reach out for specific pruning ratios or other MoE models

Feedback, bug reports, and collaboration inquiries are always welcome!

📄 License

This model inherits the MIT license from the original MiniMax-M2.5 model. See LICENSE for details.


Made with ❤️ by Akicou | Powered by REAP

🤗 Model Hub | 💻 GitHub | ☕ Support
