Evals Harness Design for [DOMAIN]

Role-Based, Chain-of-Thought

Prompt

**Role:** AI engineer who has built eval suites that have caught 30+ production regressions before they shipped. You believe vibes-based "this looks fine" testing is the leading cause of LLM products that silently degrade.

**Context:** A team wants to ship an LLM feature in [DOMAIN] but doesn't have a structured evals harness. Their current "testing" is the PM eyeballing 5 outputs in a sprint review.

**Task:** Design a complete evals harness:
1. Ground-truth construction: who labels, how many examples, how disagreements get resolved, what's the gold standard.
2. Eval dimensions for this domain (each operationalized — not "quality" but "% of outputs that cite at least one verifiable source").
3. Per-dimension grader: string match, LLM-as-judge, or human; calibrated how.
4. CI integration: when evals run (per PR, nightly, pre-release), what thresholds gate deploys.
5. Drift detection: which prompt changes and model upgrades trigger a re-eval.
6. Cost: tokens per eval run, dollars per nightly suite.
7. Regression budget per dimension.

**Constraints:**
- LLM-as-judge graders MUST be calibrated against human ratings (κ ≥ 0.7 or they're invalid).
- Every threshold has a justification.
- Include a "what we won't test" list — be honest about coverage gaps.

**Output format:** Markdown spec ≤1200 words, plus a sample eval YAML config.
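
The calibration constraint above (κ ≥ 0.7) can be checked with a few lines of code. The sketch below assumes κ means Cohen's kappa between one human rater and one LLM judge labeling the same items. The function name `cohens_kappa`, the constant `KAPPA_THRESHOLD`, and the sample labels are all hypothetical and only illustrate the check.

```python
from collections import Counter

# Threshold taken from the prompt's constraint; the constant name is hypothetical.
KAPPA_THRESHOLD = 0.7


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who labeled the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must label the same non-empty set of items.")
    n = len(rater_a)
    # Observed agreement: fraction of items where the two labels match.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label for every item.
        return 1.0
    return (observed - expected) / (1 - expected)


# Illustrative labels only, not real data: the judge disagrees on the first item.
human_labels = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "pass", "fail", "fail"]
judge_labels = ["fail", "pass", "fail", "fail", "pass", "fail", "pass", "pass", "fail", "fail"]

kappa = cohens_kappa(human_labels, judge_labels)
judge_is_valid = kappa >= KAPPA_THRESHOLD
print(f"kappa={kappa:.2f}, judge valid: {judge_is_valid}")
```

Raw percent agreement would overstate reliability when one label dominates, which is why the constraint is stated in terms of κ rather than accuracy against the human labels.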

Recommended models

claude, gpt-4o