AI Engineering

Shipping evals before prompts

A practical checklist for moving LLM features from demo to dependable product behavior.

2025-11-18

When teams rush straight to prompt tweaks, they optimize the wrong surface. The durable move is to define success in examples first, then let prompts, retrieval, and tooling chase that bar.

Start with failure modes

List the ways the feature can hurt users: wrong numbers, missing citations, unsafe instructions. Turn each into a negative test you can run in CI.
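One way to make a failure mode runnable is to express it as a predicate over the model output. The sketch below is illustrative only: `FailureCheck`, `failureChecks`, and `findViolations` are hypothetical names, and it assumes citations appear as bracketed numbers like `[1]` and that unsafe content can be caught with a placeholder pattern.

```typescript
// Hypothetical shape: each failure mode becomes a predicate that
// returns true when the output exhibits that failure.
type FailureCheck = {
  name: string;
  violates: (output: string) => boolean;
};

const failureChecks: FailureCheck[] = [
  {
    name: "missing-citation",
    // Assumption: citations are rendered as bracketed numbers such as [1].
    violates: (output) => !/\[\d+\]/.test(output),
  },
  {
    name: "unsafe-instruction",
    // Placeholder pattern; a real deny-list would come from a policy review.
    violates: (output) => /disable the safety/i.test(output),
  },
];

// Returns the names of every failure mode the output triggers.
function findViolations(output: string, checks: FailureCheck[]): string[] {
  return checks.filter((c) => c.violates(output)).map((c) => c.name);
}
```

A CI job could call `findViolations` on each recorded output and fail the build when the returned list is not empty.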

Build a golden set

Curate 30–200 representative inputs with expected properties (citations present, tone, structured fields). Version them like code.
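If the golden set lives in the repo, a small validator can guard it the same way a linter guards code. This is a rough sketch under assumptions: the `GoldenSet` shape, its `version` field, and `validateGoldenSet` are made up for illustration, and the 30 to 200 bounds simply mirror the guideline above.

```typescript
// Hypothetical golden-set file shape; field names are illustrative.
type GoldenCase = { id: string; input: string; expected: string[] };
type GoldenSet = { version: string; cases: GoldenCase[] };

// Returns a list of problems; an empty list means the set looks usable.
function validateGoldenSet(set: GoldenSet, min = 30, max = 200): string[] {
  const problems: string[] = [];
  if (set.cases.length < min || set.cases.length > max) {
    problems.push(`expected ${min}-${max} cases, found ${set.cases.length}`);
  }
  const seen = new Set<string>();
  for (const c of set.cases) {
    // Duplicate ids make results impossible to compare across runs.
    if (seen.has(c.id)) problems.push(`duplicate id: ${c.id}`);
    seen.add(c.id);
    if (c.expected.length === 0) problems.push(`no expected properties: ${c.id}`);
  }
  return problems;
}
```

Running this check on every commit that touches the set keeps accidental duplicates and empty expectations out of the baseline.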

Measure, then edit

Keep a prompt or architecture change only when it improves the numbers on the eval dashboard. Track latency and cost next to quality so tradeoffs stay honest.
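The rule above can be written down as a gate that compares a candidate run to the current baseline. Treat this as a sketch, not a prescription: `RunSummary`, `Budget`, and `shouldShip` are hypothetical names, and the choice of pass rate, p95 latency, and cost per call as the three metrics is an assumption.

```typescript
// Hypothetical summary of one eval run.
type RunSummary = {
  passRate: number; // fraction of eval cases passed, 0..1
  p95LatencyMs: number;
  costPerCallUsd: number;
};

// Allowed relative regressions, e.g. 0.1 means "up to 10% worse".
type Budget = { maxLatencyRegression: number; maxCostRegression: number };

// A change ships only if quality improves and latency and cost stay in budget.
function shouldShip(baseline: RunSummary, candidate: RunSummary, budget: Budget): boolean {
  const qualityImproved = candidate.passRate > baseline.passRate;
  const latencyOk =
    candidate.p95LatencyMs <= baseline.p95LatencyMs * (1 + budget.maxLatencyRegression);
  const costOk =
    candidate.costPerCallUsd <= baseline.costPerCallUsd * (1 + budget.maxCostRegression);
  return qualityImproved && latencyOk && costOk;
}
```

Putting all three checks in one function keeps a quality win from quietly hiding a latency or cost regression.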

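// One eval case: an input plus the property its output must satisfy.
type EvalCase = {
  id: string; // stable identifier, so results can be compared across runs
  input: string; // the input sent to the feature
  // The property to check on the output.
  expect: "citations" | "refusal" | "json-schema";
};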
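To show how a case like this might be graded, here is a minimal sketch. The `passes` and `passRate` helpers are hypothetical, and the checks are deliberately crude assumptions: bracketed numbers count as citations, a few stock phrases count as a refusal, and "json-schema" is reduced to "parses as a JSON object" because real schema validation would need a validator library.

```typescript
type EvalCase = {
  id: string;
  input: string;
  expect: "citations" | "refusal" | "json-schema";
};

// Crude, assumption-laden graders for each expectation kind.
function passes(c: EvalCase, output: string): boolean {
  switch (c.expect) {
    case "citations":
      // Assumption: citations look like [1], [2], ...
      return /\[\d+\]/.test(output);
    case "refusal":
      // Assumption: refusals use one of these stock phrases.
      return /\b(cannot|can't|unable to)\b/i.test(output);
    case "json-schema":
      // Stand-in for schema validation: only checks that a JSON object parses.
      try {
        const parsed: unknown = JSON.parse(output);
        return typeof parsed === "object" && parsed !== null;
      } catch {
        return false;
      }
  }
}

// Fraction of cases whose recorded output passes; missing outputs count as failures.
function passRate(cases: EvalCase[], outputs: Map<string, string>): number {
  if (cases.length === 0) return 0;
  const passed = cases.filter((c) => passes(c, outputs.get(c.id) ?? "")).length;
  return passed / cases.length;
}
```

The resulting pass rate is the kind of number that can sit on a dashboard next to latency and cost.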

What we ship with Base66

We help teams wire retrieval, guardrails, logging, and dashboards so improvements are visible, not guessed.