
gemma-3-27b-it-abliterated-FP8

gemma-3-27b-it-abliterated-FP8 is an FP8-Dynamic compressed variant of Maxime Labonne’s gemma-3-27b-it-abliterated model. This version applies FP8 dynamic quantization while preserving the layerwise abliteration technique that minimizes refusal behavior across Gemma 3’s deep architecture. The result is a highly capable 27B instruction-tuned model with improved hardware efficiency and reduced memory footprint.

FP8 (8-bit floating point) quantization of both weights and activations (W8A8), hardware-accelerated on FP8-capable GPUs. The FP8-dynamic recipe quantizes weights offline and computes activation scales dynamically at inference time, so no calibration dataset is required.
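
For reference, the sketch below shows how an FP8-dynamic recipe of this kind is typically produced with llm-compressor. It is a sketch under assumptions, not the exact conversion script for this checkpoint: the output directory name and the ignore patterns for lm_head and the vision modules are illustrative, and the import location of oneshot differs between llm-compressor versions.

from transformers import AutoProcessor, Gemma3ForConditionalGeneration
from llmcompressor import oneshot  # older versions: from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

SOURCE_ID = "mlabonne/gemma-3-27b-it-abliterated"   # abliterated BF16 source model
OUT_DIR = "gemma-3-27b-it-abliterated-FP8"          # illustrative output path

model = Gemma3ForConditionalGeneration.from_pretrained(
    SOURCE_ID, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(SOURCE_ID)

# FP8_DYNAMIC: FP8 weights quantized offline, FP8 activation scales computed per token
# at inference time. The ignore patterns below are assumptions for this sketch.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["re:.*lm_head", "re:vision_tower.*", "re:multi_modal_projector.*"],
)

oneshot(model=model, recipe=recipe)                  # data-free; no calibration set needed
model.save_pretrained(OUT_DIR, save_compressed=True)
processor.save_pretrained(OUT_DIR)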

Model Overview

mlabonne/gemma-3-27b-it-abliterated is an experimental uncensored 27B-parameter instruction-tuned language model derived from Google’s gemma-3-27b-it.

It introduces a novel layerwise abliteration technique (sketched in code after this list) that:

  • Independently computes refusal directions from hidden states in each of the model’s 60+ layers
  • Targets the projection modules that write into the residual stream, such as o_proj in the attention blocks and down_proj in the feed-forward blocks
  • Applies a 1.5× refusal weight scaling to eliminate safety refusals
  • Preserves >90% acceptance rate and coherent generation capabilities
  • Outperforms traditional residual stream–based removal methods on Gemma 3’s resilient architecture
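
A minimal, hypothetical sketch of the per-layer operation described above (the helper names are illustrative, not mlabonne's actual code): a refusal direction is estimated from hidden states at a given layer, then the targeted projection weights are orthogonalized against it with the 1.5× over-ablation factor.

import torch

def refusal_direction(harmful_hidden: torch.Tensor, harmless_hidden: torch.Tensor) -> torch.Tensor:
    # harmful_hidden / harmless_hidden: (num_prompts, hidden_size) activations at one layer
    d = harmful_hidden.mean(dim=0) - harmless_hidden.mean(dim=0)
    return d / d.norm()

def ablate_projection(weight: torch.Tensor, refusal_dir: torch.Tensor, scale: float = 1.5) -> torch.Tensor:
    # weight: (hidden_size, in_features) matrix that writes into the residual stream
    #         (e.g. o_proj or down_proj); refusal_dir: (hidden_size,) unit vector.
    # Remove the component of the weight's output space that lies along the refusal
    # direction, over-ablated by `scale` (the card cites 1.5x).
    return weight - scale * torch.outer(refusal_dir, refusal_dir @ weight)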

FP8-Dynamic Compression

This FP8 edition:

  • Uses BF16 · FP8 (F8_E4M3) precision formats
  • Applies dynamic FP8 quantization for improved inference throughput
  • Reduces VRAM consumption significantly compared to full BF16
  • Maintains strong generation quality and reasoning stability

Designed for deployment on Hopper-class and other FP8-capable GPU architectures.
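
As an example, a minimal text-only serving sketch with vLLM (assumes a vLLM build with compressed-tensors/FP8 support and an FP8-capable GPU; the context length, sampling settings, and prompt are illustrative):

from vllm import LLM, SamplingParams

# vLLM picks up the FP8 quantization config from the checkpoint automatically.
llm = LLM(model="prithivMLmods/gemma-3-27b-it-abliterated-FP8", max_model_len=8192)
params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=256)

outputs = llm.generate(["Explain FP8 dynamic quantization in one paragraph."], params)
print(outputs[0].outputs[0].text)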

Quick Start with Transformers

from transformers import AutoProcessor, Gemma3ForConditionalGeneration
import torch

model_id = "prithivMLmods/gemma-3-27b-it-abliterated-FP8"

model = Gemma3ForConditionalGeneration.from_pretrained(
    model_id, device_map="auto"
).eval()

processor = AutoProcessor.from_pretrained(model_id)

messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a helpful assistant."}]
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
            {"type": "text", "text": "Describe this image in detail."}
        ]
    }
]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
).to(model.device, dtype=torch.bfloat16)

input_len = inputs["input_ids"].shape[-1]

with torch.inference_mode():
    generation = model.generate(**inputs, max_new_tokens=100, do_sample=False)
    generation = generation[0][input_len:]

decoded = processor.decode(generation, skip_special_tokens=True)
print(decoded)
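
The same pipeline also handles text-only prompts; a minimal variation (the prompt string is just an example) drops the image entry from the user turn:

messages = [
    {"role": "user", "content": [{"type": "text", "text": "Summarize the layerwise abliteration technique in two sentences."}]}
]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
).to(model.device)

with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=200)

print(processor.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))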

Intended Use

  • Behavioral research and refusal-mechanism analysis
  • High-capacity instruction-following experiments
  • Long-form reasoning and detailed generation tasks
  • Research on quantization effects in large uncensored LLMs

Limitations & Risks

Critical Note: This model minimizes built-in refusal mechanisms.

  • May generate explicit or controversial outputs
  • Requires responsible and ethical use
  • FP8 requires compatible GPU architectures
  • 27B parameter size still demands substantial VRAM even with compression (see the rough estimate below)
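
A rough weights-only estimate (KV cache, activations, and runtime overhead are extra; the parameter count is approximate):

params = 27e9                                          # approximate parameter count
print(f"BF16 weights ≈ {params * 2 / 1e9:.0f} GB")     # 2 bytes/param ≈ 54 GB
print(f"FP8 weights  ≈ {params * 1 / 1e9:.0f} GB")     # 1 byte/param  ≈ 27 GB
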
Safetensors · Model size: 27B params · Tensor types: BF16, F8_E4M3