gemma-3-27b-it-abliterated-FP8
gemma-3-27b-it-abliterated-FP8 is an FP8-Dynamic compressed variant of Maxime Labonne’s gemma-3-27b-it-abliterated model. This version applies FP8 dynamic quantization while preserving the layerwise abliteration technique that minimizes refusal behavior across Gemma 3’s deep architecture. The result is a highly capable 27B instruction-tuned model with improved hardware efficiency and reduced memory footprint.
The checkpoint uses FP8 (8-bit floating point) quantization of both weights and activations (W8A8), a format that is hardware-accelerated on supported GPUs. Compression follows a W8A8 FP8-dynamic recipe: weight scales are fixed at quantization time, while activation scales are computed on the fly during inference rather than calibrated in advance.
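For reference, FP8-dynamic checkpoints of this kind are commonly produced with the llm-compressor library. The snippet below is a minimal, assumed sketch of such a recipe, not the exact procedure used for this checkpoint; the ignore list, the source model argument, and the output directory are illustrative, and import paths can differ across library versions.

from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# Quantize all Linear layers to FP8 (E4M3) weights with per-token dynamic
# activation scales; keep the output head (and typically the vision tower) in BF16.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head"],
)

oneshot(
    model="mlabonne/gemma-3-27b-it-abliterated",   # BF16 source model
    recipe=recipe,
    output_dir="gemma-3-27b-it-abliterated-FP8",   # assumed output path
)

Because the scheme is dynamic, no calibration dataset is required: activation scales are derived per token at inference time.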
Model Overview
mlabonne/gemma-3-27b-it-abliterated is an experimental uncensored 27B-parameter instruction-tuned language model derived from Google’s gemma-3-27b-it.
It introduces a novel layerwise abliteration technique (see the conceptual sketch after this list) that:
- Independently computes refusal directions from hidden states in each of the model’s 60+ layers
- Targets key projection modules such as o_proj (attention output) and down_proj, along with other feed-forward components
- Applies a 1.5× refusal weight scaling to eliminate safety refusals
- Preserves >90% acceptance rate and coherent generation capabilities
- Outperforms traditional residual stream–based removal methods on Gemma 3’s resilient architecture
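To make the bullets above concrete, here is a rough conceptual sketch of a per-layer abliteration step: estimate a refusal direction from the difference of mean hidden states on harmful versus harmless prompts, then subtract a scaled projection of that direction from a layer's output-side weights (o_proj, down_proj). All names are hypothetical; this is not the author's exact code.

import torch

def refusal_direction(harmful_hidden: torch.Tensor, harmless_hidden: torch.Tensor) -> torch.Tensor:
    # Per-layer refusal direction: difference of mean hidden states
    # (harmful minus harmless), normalized to unit length.
    direction = harmful_hidden.mean(dim=0) - harmless_hidden.mean(dim=0)
    return direction / direction.norm()

def ablate_weight(weight: torch.Tensor, direction: torch.Tensor, scale: float = 1.5) -> torch.Tensor:
    # Remove the component of the layer's output that writes along the refusal
    # direction: W <- W - scale * (d d^T) W. A scale above 1.0 over-corrects the
    # component, matching the 1.5x refusal weight scaling described above.
    return weight - scale * torch.outer(direction, direction) @ weight

# Hypothetical per-layer usage:
# d = refusal_direction(h_harmful[layer_idx], h_harmless[layer_idx])
# layer.self_attn.o_proj.weight.data = ablate_weight(layer.self_attn.o_proj.weight.data, d)
# layer.mlp.down_proj.weight.data = ablate_weight(layer.mlp.down_proj.weight.data, d)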
FP8-Dynamic Compression
This FP8 edition:
- Mixes BF16 and FP8 (E4M3) tensor formats: quantized weights are stored in FP8 while unquantized tensors remain in BF16
- Applies dynamic FP8 quantization for improved inference throughput
- Reduces VRAM consumption significantly compared to full BF16
- Maintains strong generation quality and reasoning stability
Designed for deployment on Hopper and compatible GPU architectures supporting FP8.
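For serving, the checkpoint can be loaded by engines that understand compressed-tensors FP8, such as vLLM. The following is a hedged sketch only; max_model_len, tensor_parallel_size, and the sampling settings are placeholders to adapt to your hardware.

from vllm import LLM, SamplingParams

llm = LLM(
    model="prithivMLmods/gemma-3-27b-it-abliterated-FP8",
    max_model_len=8192,       # placeholder; raise or lower to fit your VRAM
    tensor_parallel_size=1,   # increase when sharding across multiple GPUs
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Summarize FP8 dynamic quantization in two sentences."], params)
print(outputs[0].outputs[0].text)

Full W8A8 FP8 acceleration still requires an FP8-capable architecture such as Hopper, per the deployment note above.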
Quick Start with Transformers
from transformers import AutoProcessor, Gemma3ForConditionalGeneration
from PIL import Image
import requests
import torch

model_id = "prithivMLmods/gemma-3-27b-it-abliterated-FP8"

model = Gemma3ForConditionalGeneration.from_pretrained(
    model_id, device_map="auto"
).eval()

processor = AutoProcessor.from_pretrained(model_id)

messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a helpful assistant."}]
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
            {"type": "text", "text": "Describe this image in detail."}
        ]
    }
]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
).to(model.device, dtype=torch.bfloat16)

input_len = inputs["input_ids"].shape[-1]

with torch.inference_mode():
    generation = model.generate(**inputs, max_new_tokens=100, do_sample=False)
    generation = generation[0][input_len:]

decoded = processor.decode(generation, skip_special_tokens=True)
print(decoded)
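With do_sample=False the example decodes greedily and deterministically; for longer or more varied outputs, enabling sampling (do_sample=True with a moderate temperature) and raising the 100-token max_new_tokens placeholder are reasonable starting points.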
Intended Use
- Behavioral research and refusal-mechanism analysis
- High-capacity instruction-following experiments
- Long-form reasoning and detailed generation tasks
- Research on quantization effects in large uncensored LLMs
Limitations & Risks
Critical Note: This model minimizes built-in refusal mechanisms.
- May generate explicit or controversial outputs
- Requires responsible and ethical use
- FP8 requires compatible GPU architectures
- 27B parameter size still demands substantial VRAM even with compression
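On the VRAM point, a rough estimate (not a measured figure): 27B parameters at about 1 byte each in FP8 is roughly 27 GB for the weights alone, versus roughly 54 GB in BF16, before activations, the KV cache, and runtime overhead are added.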
Model tree for prithivMLmods/gemma-3-27b-it-abliterated-FP8
- Base model: google/gemma-3-27b-pt