Shipping evals before prompts
A practical checklist for moving LLM features from demo to dependable product behavior.
When teams rush straight to prompt tweaks, they optimize the wrong surface. The durable move is to define success with examples first, then let prompts, retrieval, and tooling chase that bar.
Start with failure modes
List the ways the feature can hurt users: wrong numbers, missing citations, unsafe instructions. Turn each into a negative test you can run in CI.
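As a rough sketch of how those failure modes could become CI checks, each one can be written as a predicate over the model output. Everything named here (`NegativeCheck`, `findFailures`, the `[1]`-style citation format, the blocklist phrases) is an illustrative assumption, and the heuristics are deliberately crude.

```typescript
// Hypothetical sketch: one predicate per failure mode.
// `fails` returns true when the output exhibits the failure.
type NegativeCheck = {
  name: string;
  fails: (output: string, source: string) => boolean;
};

// Assumption: citations look like bracketed numbers, e.g. [1].
const CITATION = /\[\d+\]/;

const negativeChecks: NegativeCheck[] = [
  {
    // Wrong numbers: every number in the output should appear in the source text.
    // Crude substring match; a real check would parse units and formats.
    name: "wrong-number",
    fails: (output, source) => {
      const withoutCitations = output.replace(/\[\d+\]/g, "");
      const nums: string[] = withoutCitations.match(/\d+(?:\.\d+)?/g) ?? [];
      return nums.some((n) => !source.includes(n));
    },
  },
  {
    // Missing citations: the output carries no citation marker at all.
    name: "missing-citation",
    fails: (output) => !CITATION.test(output),
  },
  {
    // Unsafe instructions: a small, hand-maintained phrase blocklist (assumed).
    name: "unsafe-instruction",
    fails: (output) => /bypass authentication|disable (the )?safety/i.test(output),
  },
];

// Returns the names of every failure mode the output triggers.
function findFailures(output: string, source: string): string[] {
  return negativeChecks
    .filter((check) => check.fails(output, source))
    .map((check) => check.name);
}
```

A CI job could call `findFailures` on each model output and fail the build whenever the returned list is non-empty.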
Build a golden set
Curate 30–200 representative inputs with expected properties (citations present, tone, structured fields). Version them like code.
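A golden set along these lines might be stored as a plain data file in the repo, with a small checker that reports which expected properties an output violates. The `GoldenCase` shape, the `version` field, the sample cases, and `checkCase` are assumptions made for illustration. Tone is left out because it is harder to check mechanically.

```typescript
// Hypothetical golden-set entry: an input plus the properties its output must have.
type GoldenCase = {
  id: string;
  input: string;
  expect: {
    citations?: boolean;       // output must contain at least one [n] marker (assumed format)
    requiredFields?: string[]; // top-level keys expected when the output is JSON
  };
};

// Assumption: the set lives in version control and `version` is bumped on every edit.
const goldenSet: { version: number; cases: GoldenCase[] } = {
  version: 1,
  cases: [
    { id: "refund-window", input: "What is the refund window?", expect: { citations: true } },
    {
      id: "invoice-extract",
      input: "Extract the invoice total and currency.",
      expect: { requiredFields: ["total", "currency"] },
    },
  ],
};

// Returns the list of expected properties that the output violates.
function checkCase(c: GoldenCase, output: string): string[] {
  const violations: string[] = [];
  if (c.expect.citations && !/\[\d+\]/.test(output)) {
    violations.push("citations");
  }
  if (c.expect.requiredFields) {
    let parsed: unknown;
    try {
      parsed = JSON.parse(output);
    } catch {
      parsed = undefined; // non-JSON output: every required field counts as missing
    }
    const obj: Record<string, unknown> =
      typeof parsed === "object" && parsed !== null ? (parsed as Record<string, unknown>) : {};
    for (const field of c.expect.requiredFields) {
      if (!(field in obj)) violations.push(`field:${field}`);
    }
  }
  return violations;
}
```

Checking properties rather than exact strings keeps the set stable when harmless wording changes, so a diff to the golden set signals a real change in expectations.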
Measure, then edit
Keep a prompt or architecture change only when it improves the numbers on the eval dashboard. Show latency and cost next to quality so tradeoffs stay honest.
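One way to make that rule mechanical is a small gate that compares a candidate run against the baseline. The `RunSummary` fields, the explicit `Budget`, and the `shouldShip` name are assumptions for this sketch. In particular, treating latency and cost as hard budgets is one possible policy, not the only one.

```typescript
// Hypothetical summary of one eval run over the golden set.
type RunSummary = {
  passRate: number;       // fraction of cases passing, from 0 to 1
  p95LatencyMs: number;   // 95th percentile latency in milliseconds
  costPerCallUsd: number; // average cost per call in US dollars
};

// Assumed limits the team agrees on up front.
type Budget = {
  maxP95LatencyMs: number;
  maxCostPerCallUsd: number;
};

// Accept a change only if quality improves and latency and cost stay within budget.
function shouldShip(
  baseline: RunSummary,
  candidate: RunSummary,
  budget: Budget,
): { ship: boolean; reasons: string[] } {
  const reasons: string[] = [];
  if (candidate.passRate <= baseline.passRate) {
    reasons.push("pass rate did not improve");
  }
  if (candidate.p95LatencyMs > budget.maxP95LatencyMs) {
    reasons.push("latency over budget");
  }
  if (candidate.costPerCallUsd > budget.maxCostPerCallUsd) {
    reasons.push("cost over budget");
  }
  return { ship: reasons.length === 0, reasons };
}
```

Returning the reasons alongside the verdict keeps the tradeoff visible: a rejected change shows whether it lost on quality, latency, or cost.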
// One eval case: an input and the property its output is expected to satisfy.
type EvalCase = {
  id: string;
  input: string;
  expect: "citations" | "refusal" | "json-schema";
};
What we ship with Base66
We help teams wire retrieval, guardrails, logging, and dashboards so improvements are visible, not guessed.