AI Engineering · 5.0 · 50 ratings
Evals Harness Design for [Domain]
Role-Based · Chain-of-Thought
Prompt
**Role:** AI engineer who has built evals suites that have caught 30+ production regressions before they shipped. You believe vibes-based "this looks fine" testing is the leading cause of LLM products that silently degrade.

**Context:** A team wants to ship an LLM feature in [DOMAIN] but doesn't have a structured evals harness. Their current "testing" is the PM eyeballing 5 outputs in a sprint review.

**Task:** Design a complete evals harness:

1. Ground-truth construction: who labels, how many examples, how disagreements get resolved, what's the gold standard.
2. Eval dimensions for this domain (each operationalized: not "quality" but "% of outputs that cite at least one verifiable source").
3. Per-dimension grader: string match, LLM-as-judge, or human; calibrated how.
4. CI integration: when evals run (per PR, nightly, pre-release), what thresholds gate deploys.
5. Drift detection: when prompt changes / model upgrades trigger re-eval.
6. Cost: tokens per eval run, dollars per nightly suite.
7. Regression budget per dimension.

**Constraints:**

- LLM-as-judge graders MUST be calibrated against human ratings (κ ≥ 0.7 or they're invalid).
- Every threshold has a justification.
- Include a "what we won't test" list; be honest about coverage gaps.

**Output format:** Markdown spec ≤1200 words, plus a sample eval YAML config.
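
The calibration constraint above (κ ≥ 0.7) can be checked with a few lines of code. The following is a minimal sketch, assuming κ refers to Cohen's kappa computed between one human rater and the LLM judge on the same examples; the prompt does not say which agreement statistic is intended. The function names `cohens_kappa` and `judge_is_calibrated` and the sample labels are hypothetical, chosen only for illustration.

```python
from collections import Counter


def cohens_kappa(human, judge):
    """Cohen's kappa between two equal-length lists of categorical labels."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(human)
    # Observed agreement: fraction of items where both raters chose the same label.
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Chance agreement: sum over labels of the product of each rater's marginal rate.
    human_counts = Counter(human)
    judge_counts = Counter(judge)
    expected = sum(human_counts[k] * judge_counts[k] for k in human_counts) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label everywhere; kappa is undefined,
        # so this sketch treats it as perfect agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


def judge_is_calibrated(human, judge, min_kappa=0.7):
    """Apply the prompt's rule: a judge below the kappa floor is invalid."""
    return cohens_kappa(human, judge) >= min_kappa


# Hypothetical labels for four outputs on one eval dimension.
human_labels = ["pass", "pass", "fail", "fail"]
judge_labels = ["pass", "pass", "fail", "pass"]
kappa = cohens_kappa(human_labels, judge_labels)
print(f"kappa={kappa:.2f} calibrated={judge_is_calibrated(human_labels, judge_labels)}")
```

In this toy example, raw agreement is 75% but kappa is only 0.5, so the judge would fail the 0.7 floor. This gap is why the constraint is stated in terms of κ and not simple accuracy. A real calibration set would need far more than four examples.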
Recommended models
claude, gpt-4o