AI Engineering · 5.0 · 50 ratings
A/B Harness for Prompts
Role-Based · Chain-of-Thought
Prompt
**Role:** Experimentation engineer applied to LLM products.

**Context:** Team wants to A/B test prompt variants on production traffic. Current state: no harness, no statistical rigor.

**Task:** Build the A/B harness:

1. Randomization unit (user / session / query), with the tradeoff stated.
2. Traffic split mechanism.
3. Primary metric (operationalized: not "quality" but "ratio of outputs that pass the LLM-judge rubric").
4. Sample size calculation: target effect size, baseline, power 80%, days needed.
5. Guardrails (cost, latency, refusal rate) that auto-roll-back if violated.
6. Pre-registration: decision rules before data collection starts.
7. Decision rule at end of test: win / lose / inconclusive.
8. Readout format.

**Constraints:**

- ONE primary metric.
- Guardrails auto-rollback BEFORE the experiment hurts revenue.
- Pre-register or don't run.

**Output format:** Harness spec + sample experiment config + decision matrix.
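As a rough illustration of what items 2, 4, and 7 could look like in practice, here is a minimal Python sketch. Everything in it is an assumption made for illustration, not part of the prompt: the function names (`assign_variant`, `sample_size_per_arm`, `days_needed`, `decide`), the hash-based bucketing, the normal-approximation formula for a two-proportion test, and the confidence-interval decision rule. A real harness would pre-register its own choices.

```python
import hashlib
import math
from statistics import NormalDist


def assign_variant(unit_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically bucket a randomization unit (user, session, or query id).

    Hypothetical helper: hashing the experiment name together with the unit id
    keeps assignment sticky per unit and independent across experiments.
    """
    digest = hashlib.sha256(f"{experiment}:{unit_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # integer in 0..9999
    return "treatment" if bucket < round(treatment_share * 10_000) else "control"


def sample_size_per_arm(baseline: float, mde_abs: float,
                        alpha: float = 0.05, power: float = 0.80) -> int:
    """Units needed per arm for a two-sided two-proportion z-test.

    Assumes the primary metric is a pass rate (e.g. share of outputs passing
    the LLM-judge rubric) and uses the usual normal approximation.
    """
    p1, p2 = baseline, baseline + mde_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / mde_abs ** 2)


def days_needed(n_per_arm: int, eligible_units_per_day: int,
                treatment_share: float = 0.5) -> int:
    """Days until the smaller arm reaches n_per_arm, given daily eligible units."""
    smaller_arm_per_day = eligible_units_per_day * min(treatment_share, 1 - treatment_share)
    return math.ceil(n_per_arm / smaller_arm_per_day)


def decide(pass_c: int, n_c: int, pass_t: int, n_t: int, alpha: float = 0.05) -> str:
    """Assumed decision rule: call it by where the CI of the pass-rate difference falls."""
    p_c, p_t = pass_c / n_c, pass_t / n_t
    se = math.sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    low, high = (p_t - p_c) - z * se, (p_t - p_c) + z * se
    if low > 0:
        return "win"
    if high < 0:
        return "lose"
    return "inconclusive"


if __name__ == "__main__":
    n = sample_size_per_arm(baseline=0.70, mde_abs=0.03)
    print("per-arm sample size:", n)
    print("days at 1000 eligible units/day:", days_needed(n, 1000))
    print("variant for user-42:", assign_variant("user-42", "prompt-v2-test"))
```

Hashing on the randomization unit means the tradeoff in item 1 shows up directly in what is passed as `unit_id`: a user id gives consistent experience per user but fewer independent units, while a query id gives more units at the cost of users seeing both variants. The guardrail auto-rollback in item 5 is not covered here.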
Recommended models
claude, gpt-4o