A/B Harness for Prompts

Role-Based, Chain-of-Thought

Prompt

**Role:** Experimentation engineer specializing in LLM products.

**Context:** The team wants to A/B test prompt variants on production traffic. Current state: no harness, no statistical rigor.

**Task:** Build the A/B harness:
1. Randomization unit (user / session / query) — tradeoff stated.
2. Traffic split mechanism.
3. Primary metric (operationalized — not "quality" but "ratio of outputs that pass the LLM-judge rubric").
4. Sample size calculation: baseline rate, target effect size (minimum detectable effect), significance level, power 80%, days needed at current traffic.
5. Guardrails (cost, latency, refusal rate) that auto-roll-back if violated.
6. Pre-registration: decision rules before data collection starts.
7. Decision rule at end of test: win / lose / inconclusive.
8. Readout format.
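
As an illustration of what items 2 and 4 might produce, here is a minimal sketch, not a definitive implementation. It assumes a binary primary metric (pass/fail on the judge rubric), a two-sided two-proportion z-test, and hash-based bucketing. The function names, the experiment name, and the example numbers are all hypothetical.

```python
import hashlib
import math
from statistics import NormalDist


def assign_variant(unit_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically map a randomization unit (user, session, or query id) to an arm.

    Hashing the experiment name together with the unit id keeps assignment
    stable across requests and independent across experiments.
    """
    digest = hashlib.sha256(f"{experiment}:{unit_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform value in [0, 1)
    return "treatment" if bucket < treatment_share else "control"


def sample_size_per_arm(baseline: float, mde_abs: float,
                        alpha: float = 0.05, power: float = 0.80) -> int:
    """Units needed per arm to detect an absolute lift of mde_abs in a pass rate.

    Uses the standard normal-approximation formula for comparing two proportions.
    """
    p1, p2 = baseline, baseline + mde_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / mde_abs ** 2)


def days_needed(n_per_arm: int, daily_units: int, treatment_share: float = 0.5) -> int:
    """Days until the smaller arm reaches n_per_arm, given eligible units per day."""
    smaller_arm_daily = daily_units * min(treatment_share, 1 - treatment_share)
    return math.ceil(n_per_arm / smaller_arm_daily)


if __name__ == "__main__":
    # Hypothetical inputs: 70% baseline pass rate, +3 points target lift, 1000 users/day.
    n = sample_size_per_arm(baseline=0.70, mde_abs=0.03)
    print(n, days_needed(n, daily_units=1000))
    print(assign_variant("user-123", "prompt_v2_vs_v1"))
```

Hashing on the randomization unit means a user who returns tomorrow sees the same variant, which matters when the unit is user or session rather than query.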

**Constraints:**
- ONE primary metric.
- Guardrails trigger automatic rollback BEFORE the experiment hurts revenue.
- Pre-register or don't run.
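
One way the guardrail constraint could be expressed is sketched below. It assumes each guardrail metric is "higher is worse" (cost, latency, refusal rate) and that rollback fires when treatment exceeds control by more than a pre-registered relative margin. The metric names, thresholds, and the `Guardrail` class are hypothetical, and the actual rollback action is left to the deployment system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Guardrail:
    metric: str                   # key into the per-arm metrics dict
    max_relative_increase: float  # 0.15 means treatment may be at most 15% above control
    min_samples: int              # do not evaluate before this many treatment units


def violated_guardrails(guardrails, control, treatment, n_treatment):
    """Return the names of guardrail metrics that treatment has breached."""
    breaches = []
    for g in guardrails:
        if n_treatment < g.min_samples:
            continue  # too little data to judge this guardrail yet
        limit = control[g.metric] * (1 + g.max_relative_increase)
        if treatment[g.metric] > limit:
            breaches.append(g.metric)
    return breaches


# Hypothetical thresholds, fixed before the experiment starts.
GUARDRAILS = [
    Guardrail("cost_usd", 0.10, 200),
    Guardrail("latency_p95_ms", 0.15, 200),
    Guardrail("refusal_rate", 0.25, 200),
]

if __name__ == "__main__":
    control = {"cost_usd": 0.010, "latency_p95_ms": 1200, "refusal_rate": 0.02}
    treatment = {"cost_usd": 0.0105, "latency_p95_ms": 1500, "refusal_rate": 0.02}
    breaches = violated_guardrails(GUARDRAILS, control, treatment, n_treatment=500)
    if breaches:
        print("roll back treatment, breached:", breaches)
```

Running this check on a short interval, rather than only at the end of the test, is what lets rollback happen before the damage accumulates.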

**Output format:** Harness spec + sample experiment config + decision matrix.
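
To make the requested output concrete, the following is a rough sketch of a sample experiment config and an end-of-test decision rule. It assumes a binary pass/fail primary metric and a Wald confidence interval on the difference in pass rates. Every key and value in `PREREGISTRATION` is a hypothetical placeholder, and `decide` is an illustrative helper rather than a prescribed method.

```python
import math
from statistics import NormalDist

# Hypothetical pre-registered plan, frozen before data collection starts.
PREREGISTRATION = {
    "experiment": "prompt_v2_vs_v1",
    "randomization_unit": "user",
    "treatment_share": 0.5,
    "primary_metric": "judge_pass_rate",
    "alpha": 0.05,
    "power": 0.80,
    "min_samples_per_arm": 3500,
}


def decide(pass_c: int, n_c: int, pass_t: int, n_t: int,
           plan: dict = PREREGISTRATION, guardrail_breached: bool = False) -> str:
    """Return 'win', 'lose', or 'inconclusive' for treatment versus control."""
    if guardrail_breached:
        return "lose"  # a guardrail breach overrides any primary-metric gain
    if min(n_c, n_t) < plan["min_samples_per_arm"]:
        return "inconclusive"  # underpowered: no early calls on partial data
    p_c, p_t = pass_c / n_c, pass_t / n_t
    diff = p_t - p_c
    se = math.sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    z = NormalDist().inv_cdf(1 - plan["alpha"] / 2)
    lower, upper = diff - z * se, diff + z * se
    if lower > 0:
        return "win"   # whole interval above zero
    if upper < 0:
        return "lose"  # whole interval below zero
    return "inconclusive"


if __name__ == "__main__":
    print(decide(pass_c=2450, n_c=3500, pass_t=2625, n_t=3500))
```

The three return values map directly onto the rows of a decision matrix: ship on "win", revert on "lose", and either extend under a newly pre-registered plan or keep control on "inconclusive".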

Recommended models

claude, gpt-4o
