# voxtral-model-q8

This model is a quantized version of Mistral's Voxtral model (https://huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-2602) and is intended for use with the voxtral.c implementation originally created by antirez: https://github.com/antirez/voxtral.c

Int8 quantization is currently experimental and is supported in this fork: https://github.com/drbh/voxtral.c
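
For context, below is a minimal sketch of how symmetric per-row int8 weight quantization typically works: each row of fp32 weights is stored as int8 values plus one fp32 scale, and is dequantized on the fly inside the matmul. The function names and storage layout here are illustrative assumptions, not the fork's actual API.

```c
/* Illustrative sketch of symmetric int8 quantization; names and layout
 * are assumptions, not necessarily what drbh/voxtral.c does. */
#include <math.h>
#include <stdint.h>

/* Quantize n fp32 weights to int8; returns the scale to store with them. */
static float quantize_q8(const float *w, int8_t *q, int n) {
    float amax = 0.0f;
    for (int i = 0; i < n; i++) {
        float a = fabsf(w[i]);
        if (a > amax) amax = a;
    }
    float scale = amax / 127.0f;  /* map [-amax, amax] onto [-127, 127] */
    float inv = scale > 0.0f ? 1.0f / scale : 0.0f;
    for (int i = 0; i < n; i++)
        q[i] = (int8_t)lrintf(w[i] * inv);
    return scale;
}

/* Dot product against quantized weights: w[i] is recovered as q[i] * scale. */
static float dot_q8(const int8_t *q, float scale, const float *x, int n) {
    float acc = 0.0f;
    for (int i = 0; i < n; i++)
        acc += (float)q[i] * x[i];
    return acc * scale;
}
```

Keeping one scale per row (or per block) localizes the rounding error, which is why q8 quantization typically costs little accuracy while roughly halving memory relative to fp16.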

## Usage

```
git clone https://github.com/drbh/voxtral.c
cd voxtral.c
make mps
./download_q8_model.sh
./voxtral -d voxtral-model-q8 -i samples/jfk.wav
# Loading weights...
# Metal GPU: 4061.2 MB
# Model loaded.
# Audio: 176000 samples (11.0 seconds)
# And so, my fellow Americans, ask not what your country can do for you. Ask what you can do for your country.
# Encoder: 1496 mel -> 187 tokens (657 ms)
# Decoder: 26 text tokens (149 steps) in 3548 ms (prefill 436 ms + 21.0 ms/step)
```
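
The per-step figure in the decoder line appears to be derived from the totals: (3548 ms − 436 ms prefill) / 149 decode steps ≈ 20.9 ms per step, consistent with the reported 21.0 ms/step.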

Or stream from your microphone:

```
./voxtral -d voxtral-model-q8 --from-mic
# Loading weights...
# Metal GPU: 4061.2 MB
# Model loaded.
# Listening (Ctrl+C to stop)...
# Check, check. Is the microphone streaming working? Okay, cool.^C
# Stopping...

# Encoder: 1562 mel -> 195 tokens (832 ms)
# Decoder: 14 text tokens (157 steps) in 3951 ms (prefill 487 ms + 22.2 ms/step)
```