AI Engineering · 5.0 · 50 ratings

Evals Harness Design for [Domain]

Role-Based · Chain-of-Thought

Prompt

**Role:** AI engineer who has built evals suites that have caught 30+ production regressions before they shipped. You believe vibes-based "this looks fine" testing is the leading cause of LLM products that silently degrade.

**Context:** A team wants to ship an LLM feature in [DOMAIN] but doesn't have a structured evals harness. Their current "testing" is the PM eyeballing 5 outputs in a sprint review.

**Task:** Design a complete evals harness:
1. Ground-truth construction: who labels, how many examples, how disagreements get resolved, what's the gold standard.
2. Eval dimensions for this domain (each operationalized — not "quality" but "% of outputs that cite at least one verifiable source").
3. Per-dimension grader: string match, LLM-as-judge, or human; calibrated how.
4. CI integration: when evals run (per PR, nightly, pre-release), what thresholds gate deploys.
5. Drift detection: when prompt changes / model upgrades trigger re-eval.
6. Cost: tokens per eval run, dollars per nightly suite.
7. Regression budget per dimension.

**Constraints:**
- LLM-as-judge graders MUST be calibrated against human ratings (κ ≥ 0.7 or they're invalid).
- Every threshold has a justification.
- Include a "what we won't test" list — be honest about coverage gaps.

**Output format:** Markdown spec ≤1200 words, plus a sample eval YAML config.
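
The κ ≥ 0.7 constraint above can be checked mechanically once the same examples have been labeled by both a human and the LLM judge. The following is a minimal sketch, assuming κ means Cohen's kappa over categorical labels (the prompt does not say which agreement statistic is intended). The function names and the sample labels are hypothetical and exist only for illustration.

```python
from collections import Counter

KAPPA_THRESHOLD = 0.7  # taken from the prompt's constraint


def cohen_kappa(human, judge):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    # Observed agreement: fraction of items where both raters gave the same label.
    p_o = sum(h == j for h, j in zip(human, judge)) / n
    # Chance agreement, from each rater's marginal label frequencies.
    h_counts, j_counts = Counter(human), Counter(judge)
    p_e = sum(h_counts[label] * j_counts[label] for label in h_counts) / (n * n)
    if p_e == 1.0:
        # Both raters used one identical label throughout; kappa is undefined.
        raise ValueError("kappa is undefined when only one label is ever used")
    return (p_o - p_e) / (1 - p_e)


def judge_is_valid(human, judge, threshold=KAPPA_THRESHOLD):
    """True if the LLM judge agrees with human ratings at or above the threshold."""
    return cohen_kappa(human, judge) >= threshold


# Hypothetical labels for ten calibration examples (not real data).
human_labels = ["pass", "pass", "fail", "pass", "fail",
                "fail", "pass", "pass", "fail", "pass"]
judge_labels = ["pass", "pass", "fail", "pass", "fail",
                "pass", "pass", "pass", "fail", "pass"]

kappa = cohen_kappa(human_labels, judge_labels)
print(f"kappa = {kappa:.3f}, judge valid: {judge_is_valid(human_labels, judge_labels)}")
```

Ten examples is far too few for a real calibration set; the sample only shows the arithmetic. Raw agreement here is 90%, but kappa is lower because it discounts the agreement two raters would reach by chance.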

How to use this prompt

1. Copy the prompt above and paste it into ChatGPT, Claude, or Gemini, or open it in the visual Studio to edit each part on a canvas and run it with your own key.
2. Replace any bracketed placeholders with your specifics. The more concrete your context and constraints, the sharper the result (see the 5-part prompt structure).
3. Run it, then refine. Ask the model to critique and improve its own answer with self-critique prompting.
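
When refining the model's answer, it can help to hold its proposed deploy thresholds, regression budgets, and cost figures against something concrete. The sketch below is one possible shape for that check, not a prescribed implementation. Every name (`Dimension`, `gate`, `run_cost_usd`) and every number, including the per-token prices, is a hypothetical placeholder to be replaced with your own values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dimension:
    name: str
    baseline: float           # score of the last released version, in [0, 1]
    floor: float              # absolute threshold that gates a deploy
    regression_budget: float  # largest tolerated drop from baseline


def gate(dimensions, scores):
    """Return a list of failure messages; an empty list means the gate passes."""
    failures = []
    for d in dimensions:
        score = scores[d.name]
        if score < d.floor:
            failures.append(f"{d.name}: {score:.3f} is below floor {d.floor:.3f}")
        elif d.baseline - score > d.regression_budget:
            failures.append(
                f"{d.name}: dropped {d.baseline - score:.3f} from baseline, "
                f"budget is {d.regression_budget:.3f}"
            )
    return failures


def run_cost_usd(n_examples, input_tokens, output_tokens,
                 usd_per_m_input, usd_per_m_output):
    """Estimated dollars for one eval run, given average tokens per example."""
    per_example = (input_tokens * usd_per_m_input
                   + output_tokens * usd_per_m_output) / 1_000_000
    return n_examples * per_example


# Hypothetical dimensions and scores, for illustration only.
dims = [
    Dimension("citation_rate", baseline=0.92, floor=0.85, regression_budget=0.02),
    Dimension("format_valid", baseline=0.99, floor=0.97, regression_budget=0.01),
]
candidate_scores = {"citation_rate": 0.89, "format_valid": 0.99}

failures = gate(dims, candidate_scores)
for message in failures:
    print("FAIL", message)

# Hypothetical prices in USD per million tokens; substitute your provider's rates.
nightly_cost = run_cost_usd(500, 1200, 300, usd_per_m_input=3.0, usd_per_m_output=15.0)
print(f"estimated nightly cost: ${nightly_cost:.2f}")
```

Keeping the absolute floor separate from the regression budget means a candidate can fail the gate while still above the floor: in the sample, `citation_rate` clears 0.85 but has fallen further from baseline than its budget allows.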

Techniques in this prompt

Role-Based

Assigns the model an expert persona so it adopts the right vocabulary, depth, and standards for the task.

Chain-of-Thought

Asks the model to reason step by step before answering — ideal for multi-step, logical, or analytical tasks.

Recommended models

claude, gpt-4o
