Flagging Audio Transcripts Offline with Whisper + gpt-oss-safeguard-20B

This tutorial shows how to build a fully self-hosted transcript flagging pipeline using:

  • Whisper for speech → text

  • gpt-oss-safeguard-20B for policy-based transcript screening

No proprietary APIs. No cloud dependency. Fully reproducible.


Architecture Overview

Audio file
   ↓
Whisper (speech → text)
   ↓
Transcript normalization / chunking
   ↓
gpt-oss-safeguard-20B (policy in system, text in user)
   ↓
Structured flags (JSON)

Requirements

Hardware (realistic)

  • 20B model:

    • 24–48 GB VRAM recommended (or a quantized load; see the optional 4-bit example in Step 3)

    • Single GPU is fine

  • 120B is intentionally not used here (H100 / cloud required)

Software

pip install torch transformers accelerate bitsandbytes librosa soundfile

Step 1: Transcribe Audio with Whisper (Transformers)

Load Whisper

import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import librosa

device = "cuda" if torch.cuda.is_available() else "cpu"

model_id = "openai/whisper-large-v3"

processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id).to(device)

Transcribe Audio

def transcribe_audio(path):
    # Load the audio and resample to 16 kHz, the rate Whisper expects
    audio, sr = librosa.load(path, sr=16000)

    # Convert the waveform into log-mel input features
    inputs = processor(
        audio,
        sampling_rate=16000,
        return_tensors="pt"
    ).to(device)

    with torch.no_grad():
        predicted_ids = model.generate(**inputs)

    # Decode the generated token IDs back into text
    text = processor.batch_decode(
        predicted_ids,
        skip_special_tokens=True
    )[0]

    return text

transcript = transcribe_audio("example.wav")
print(transcript)
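
Whisper's encoder works on 30-second windows, so the function above is best suited to short clips. For longer recordings, one option is the transformers ASR pipeline with chunked inference; this is a sketch, and the chunk_length_s and batch_size values are assumptions to tune for your hardware.

from transformers import pipeline

# Chunked long-form transcription: the pipeline splits the audio into
# overlapping 30-second windows and stitches the text back together.
asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    chunk_length_s=30,   # window length in seconds
    batch_size=8,        # tune for your GPU memory
    device=device
)

long_audio, _ = librosa.load("long_example.wav", sr=16000)
long_transcript = asr(long_audio)["text"]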

Step 2: Normalize & Chunk the Transcript

Flagging works best on bounded chunks, so split the transcript into sentence-aligned pieces of roughly 1,200 characters before classification.

def chunk_text(text, max_chars=1200):
    # Split on sentence boundaries and pack sentences into chunks
    # of at most roughly max_chars characters.
    chunks = []
    current = ""

    for sentence in text.split(". "):
        if len(current) + len(sentence) < max_chars:
            current += sentence + ". "
        else:
            if current:                # avoid emitting an empty first chunk
                chunks.append(current.strip())
            current = sentence + ". "

    if current:
        chunks.append(current.strip())

    return chunks

chunks = chunk_text(transcript)

Step 3: Load gpt-oss-safeguard-20B (Chat-Style)

We use it exactly like a chat model:

  • System → policy

  • User → transcript text

from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "openai/gpt-oss-safeguard-20b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",     # spread layers across the available GPU(s) and CPU
    torch_dtype="auto"     # load in the precision the checkpoint was saved in
)
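
If VRAM is tight, the bitsandbytes package from the requirements can provide an optional 4-bit load. This is a sketch under the assumption that the checkpoint loads cleanly through bitsandbytes on your setup; the compute dtype here is an assumption you may need to adjust.

from transformers import BitsAndBytesConfig

# Optional: 4-bit quantized load to fit the 20B model on a smaller GPU.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=bnb_config
)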

Step 4: Define the Safety Policy (System Message)

This is the entire point of safeguard models: the policy you supply in the system message at inference time, not any fine-tuning, defines what counts as a violation.

POLICY = """
You are a safety classification system.

Analyze the provided text and determine whether it violates any of the
following categories:

- Violence or threats
- Harassment or hate speech
- Self-harm or suicide signals
- Sexual content
- Illegal activity
- Extremism

Return ONLY valid JSON in the following format:

{
  "flags": [string],
  "severity": "low" | "medium" | "high" | "none",
  "confidence": number between 0 and 1,
  "rationale": string
}
"""

Step 5: Run Safeguard Classification

import json

def run_safeguard(text):
    messages = [
        {"role": "system", "content": POLICY},
        {"role": "user", "content": text}
    ]

    input_ids = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,   # append the assistant turn marker
        return_tensors="pt"
    ).to(model.device)

    with torch.no_grad():
        output = model.generate(
            input_ids,
            max_new_tokens=300,
            do_sample=False           # deterministic decoding for classification
        )

    # Decode only the newly generated tokens
    decoded = tokenizer.decode(
        output[0][input_ids.shape[-1]:],
        skip_special_tokens=True
    )

    # The model should return JSON, but trim any stray text around it
    start, end = decoded.find("{"), decoded.rfind("}")
    if start == -1 or end == -1:
        raise ValueError(f"No JSON object found in model output: {decoded!r}")
    return json.loads(decoded[start:end + 1])

Step 6: Flag the Entire Transcript

results = []

for i, chunk in enumerate(chunks):
    result = run_safeguard(chunk)
    result["chunk_id"] = i
    results.append(result)

for r in results:
    print(r)
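
In practice a single malformed response should not abort the whole run, and keeping the source text alongside each result makes human review easier. A variant of the same loop (the fallback record here is just one reasonable choice):

results = []

for i, chunk in enumerate(chunks):
    try:
        result = run_safeguard(chunk)
    except ValueError:   # covers json.JSONDecodeError as well
        # Record the failure instead of crashing the batch
        result = {"flags": [], "severity": "none",
                  "confidence": 0.0, "rationale": "unparseable model output"}
    result["chunk_id"] = i
    result["text"] = chunk   # keep the original text for reviewers
    results.append(result)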

Step 7: Aggregate Results

def aggregate(results):
    order = ["none", "low", "medium", "high"]
    highest = "none"
    flags = set()

    for r in results:
        flags.update(r.get("flags", []))
        severity = r.get("severity", "none")
        # Ignore unexpected severity values rather than crashing
        if severity in order and order.index(severity) > order.index(highest):
            highest = severity

    return {
        "overall_severity": highest,
        "flags": sorted(flags),
        "requires_review": highest in ("medium", "high")
    }

summary = aggregate(results)
print(summary)
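
To tie the steps together, a minimal end-to-end driver might look like the following. flag_audio and the report filename are names invented for this sketch, and the report layout is just one reasonable choice.

def flag_audio(path):
    # Audio → transcript → chunks → per-chunk flags → aggregate summary
    transcript = transcribe_audio(path)
    chunks = chunk_text(transcript)

    results = []
    for i, chunk in enumerate(chunks):
        result = run_safeguard(chunk)
        result["chunk_id"] = i
        results.append(result)

    return {"summary": aggregate(results), "chunks": results}

report = flag_audio("example.wav")

# Persist the full report for later human review
with open("flag_report.json", "w") as f:
    json.dump(report, f, indent=2)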

Why This Uses the 20B Model

  • Largest model most people can actually run

  • Strong enough for screening and triage

  • Ideal for first-pass moderation

  • Scales up cleanly to 120B later if infrastructure allows


Limitations (Important)

  • Whisper transcription errors can propagate

  • Safeguard models are conservative

  • False positives happen

  • This is not a replacement for human review

Treat this as signal generation, not judgment.


When to Use This Pipeline

✔ Internal moderation
✔ Media review
✔ Compliance pre-screening
✔ Research & auditing
✔ Privacy-sensitive environments


Final Takeaway

This pipeline demonstrates how to build a fully open, self-hosted safety system:

  • MIT-licensed transcription

  • Apache-licensed safeguard model

  • Reproducible on realistic hardware

  • No vendor lock-in

