
Output Safety Classifier


Role-Based, Chain-of-Thought

Prompt

**Role:** Trust & Safety ML engineer.

**Context:** We need to classify LLM outputs as safe or unsafe before returning them to users. The model's own refusal behavior cannot be the only safeguard.

**Task:** Design the classifier:
1. Output categories (forbidden / sensitive / safe).
2. Classifier choice (rules / ML model / LLM-as-judge).
3. Training data (positive + negative examples).
4. False-positive / false-negative tradeoff.
5. Latency budget.
6. Calibration with human review.
7. Action on flagged outputs (block, modify, log, escalate).
8. Evaluation rubric.

**Constraints:**
- p95 classifier latency ≤ 50ms.
- False-negative rate on critical categories ≤ 0.5%.
- All flags reviewable in an audit log.

**Output format:** Architecture + training-data spec + evaluation plan.
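
The two numeric constraints above lend themselves to an automated check in the evaluation plan. The following is a minimal sketch, not a definitive implementation: the `EvalRecord` shape, the function names, and the nearest-rank percentile method are all assumptions made for illustration, and the audit-log constraint is not covered.

```python
from dataclasses import dataclass

# Thresholds copied from the Constraints list in the prompt.
P95_LATENCY_BUDGET_MS = 50.0
MAX_CRITICAL_FNR = 0.005  # 0.5%


@dataclass
class EvalRecord:
    """Hypothetical shape of one labeled evaluation example."""
    latency_ms: float   # measured classifier latency for this output
    is_critical: bool   # human label: output belongs to a critical category
    flagged: bool       # classifier decision: output was flagged


def p95(values):
    """Nearest-rank 95th percentile (an assumed convention; others exist)."""
    if not values:
        raise ValueError("p95 requires at least one value")
    ordered = sorted(values)
    # Integer ceiling of 0.95 * n, avoiding floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def critical_false_negative_rate(records):
    """Share of critical outputs that the classifier failed to flag."""
    critical = [r for r in records if r.is_critical]
    if not critical:
        return 0.0
    missed = sum(1 for r in critical if not r.flagged)
    return missed / len(critical)


def check_constraints(records):
    """Compare measured latency and miss rate against the stated limits."""
    latency = p95([r.latency_ms for r in records])
    fnr = critical_false_negative_rate(records)
    return {
        "p95_latency_ms": latency,
        "critical_fnr": fnr,
        "latency_ok": latency <= P95_LATENCY_BUDGET_MS,
        "fnr_ok": fnr <= MAX_CRITICAL_FNR,
    }
```

Note that confirming a 0.5% miss rate with any confidence needs a large number of human-labeled critical examples, which ties this check to the calibration and training-data items in the task list.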

Recommended models

claude, gpt-4o
