LLM Evals Became the New Unit Test • Rutvik Acharya

A lot of teams discovered the same uncomfortable truth: an LLM demo can look great and still be impossible to ship.

The first prompt works. The happy path works. The founder demo works. Then someone asks the model a slightly different question, uploads a weird document, or tries a real customer workflow, and the system quietly starts making things up.

That is why evaluation became one of the most important LLM engineering skills of the year. Not leaderboards. Not academic benchmarks. Practical evals: small, targeted tests that tell you whether your application is getting better or worse.

Why Traditional Metrics Were Not Enough

NLP already had metrics before LLMs became mainstream: BLEU for translation, ROUGE for summarization, F1 for classification, MRR for retrieval.

Those metrics are still useful in narrow settings, but LLM applications introduced messier outputs:

The answer can be phrased many valid ways
The model might be partially correct
The model might use the right source but draw the wrong conclusion
The failure might be about tone, safety, formatting, or missing context

A chatbot that answers “yes” instead of “no” because it missed a negation is not a small formatting error. A support assistant that invents a refund policy is not captured by BLEU.

The evaluation problem moved from “is this string close to the reference?” to “did the system behave correctly for this task?”

The Evals Mindset

The useful mental shift was to treat prompts and retrieval pipelines like code. If changing code requires tests, changing prompts should require tests too.

For an LLM app, an eval set usually contains:

Input: the user question or task
Context: documents, retrieved chunks, user metadata, or tool results
Expected behavior: what a good answer must include or avoid
Scoring method: exact check, human label, model-graded rubric, or task-specific script
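As a rough sketch, those four parts map onto a small record type. The class name `EvalCase`, its field names, and the example scorer below are illustrative assumptions, not a standard schema:

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class EvalCase:
    """One eval example. All names here are hypothetical."""
    case_id: str
    input: str                                        # the user question or task
    context: List[str] = field(default_factory=list)  # retrieved chunks, tool results, ...
    expected_behavior: str = ""                       # what a good answer must include or avoid
    # Scoring method: any callable that takes (output, case) and returns pass/fail.
    scorer: Optional[Callable[[str, "EvalCase"], bool]] = None


def mentions_thirty_days(output: str, case: EvalCase) -> bool:
    # Simplest possible scorer: an exact substring check.
    return "30 days" in output


case = EvalCase(
    case_id="refund_001",
    input="What is the max refund window for annual plans?",
    context=["Annual plans can be refunded within 30 days of purchase."],
    expected_behavior="Must state the 30-day window.",
    scorer=mentions_thirty_days,
)
```

Keeping the scorer on the case lets one suite mix exact checks, scripts, and model-graded rubrics.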

The goal is not to create a perfect benchmark. The goal is to catch regressions.

Good evals are usually local, opinionated, and boring. They encode what your product needs, not what a public leaderboard rewards.

For example, a legal assistant and a marketing copy assistant might both use the same underlying model, but their evals should look completely different. The legal assistant should punish unsupported claims, missed negations, and missing citations. The marketing assistant may care more about brand voice, banned phrases, length, and preserving required claims.

That means an eval suite is part of the product specification. It makes fuzzy requirements concrete.

Six Evals That Actually Helped

1. Golden Answer Tests

The simplest eval is a set of examples with expected answers. This works best when the output is short and factual.

{
  "question": "What is the max refund window for annual plans?",
  "expected_contains": ["30 days"],
  "expected_not_contains": ["60 days", "any time"]
}


For many customer support and internal knowledge-base assistants, a few hundred of these tests catch a surprising number of failures.

The trick is to avoid making the match too brittle. Instead of requiring one exact response, check for required facts, prohibited claims, citation presence, or JSON validity.
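A minimal sketch of such a non-brittle check, assuming test cases shaped like the JSON example above. The function name `check_golden` and the case-insensitive substring matching are choices made for this illustration:

```python
from typing import List


def check_golden(output: str, case: dict) -> List[str]:
    """Return a list of failure messages; an empty list means the case passed."""
    text = output.lower()
    failures = []
    # Every required fact must appear somewhere in the answer.
    for phrase in case.get("expected_contains", []):
        if phrase.lower() not in text:
            failures.append(f"missing required phrase: {phrase!r}")
    # No prohibited claim may appear anywhere in the answer.
    for phrase in case.get("expected_not_contains", []):
        if phrase.lower() in text:
            failures.append(f"contains prohibited phrase: {phrase!r}")
    return failures


case = {
    "question": "What is the max refund window for annual plans?",
    "expected_contains": ["30 days"],
    "expected_not_contains": ["60 days", "any time"],
}
print(check_golden("Annual plans can be refunded within 30 days.", case))
```

Substring matching has its own blind spots ("130 days" contains "30 days"), so high-stakes facts may need a stricter pattern or a task-specific script.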

2. Retrieval Evals

For RAG systems, generation quality often hides retrieval problems. The model may answer poorly because it never saw the right document.

Retrieval evals isolate a narrower question from answer quality:

Did the retriever return the right evidence?

Typical metrics include recall@k and MRR. If the correct document is not in the top 5 or top 10 results, the generator has already lost.

Question: "How do we rotate production API keys?"
Expected document: security/runbooks/api-key-rotation.md
Pass condition: expected document appears in top 5 results


This is one of the highest leverage debugging techniques. Before rewriting prompts, check whether the right context is even present.
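For reference, recall@k and MRR can each be computed in a few lines. This sketch assumes the retriever returns an ordered list of document IDs; the function names are illustrative:

```python
from typing import List, Sequence, Tuple


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Sequence[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("each query needs at least one relevant document")
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked_ids: Sequence[str], relevant_ids: Sequence[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    relevant = set(relevant_ids)
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: List[Tuple[Sequence[str], Sequence[str]]]) -> float:
    """Average reciprocal rank over (ranked_ids, relevant_ids) pairs."""
    return sum(reciprocal_rank(ranked, rel) for ranked, rel in runs) / len(runs)


ranked = ["onboarding.md", "security/runbooks/api-key-rotation.md", "faq.md"]
expected = ["security/runbooks/api-key-rotation.md"]
passed = recall_at_k(ranked, expected, k=5) == 1.0  # the pass condition from the example
```

With a single expected document per question, recall@k reduces to the pass/fail check shown above, while MRR also rewards ranking that document higher.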

3. LLM-as-Judge Rubrics

LLM-as-judge evaluation became popular because many outputs are hard to score with exact matching. A judge model can compare an answer against a rubric:

Is the answer grounded in the supplied context?
Does it refuse when the answer is not available?
Does it include the required citation?
Does it avoid unsupported policy claims?

This is not perfect. Judge models can be biased, inconsistent, and overly forgiving. But for development loops, they are often good enough to flag suspicious changes.

The key is to make the rubric concrete.

Score 1 if the answer says the policy is not present in the provided context.
Score 0 if the answer states a policy that the provided context does not support.


Vague criteria like “quality” or “helpfulness” are too soft to be useful.

How To Write A Good Judge Prompt

Model-graded evals live or die by the rubric. A vague judge prompt produces noisy scores that feel scientific but do not help you make decisions.

A better judge prompt is specific about role, evidence, scoring, and output format:

You are grading whether an answer is grounded in the provided context.
Return JSON only.

Score 1 if every factual claim in the answer is directly supported by the context.
Score 0 if the answer includes any factual claim not supported by the context.
Ignore writing style. Do not reward longer answers.


Then include examples of passing and failing answers. This is the same idea as prompt engineering, but applied to the evaluator.
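One way to wire this up is sketched below. The call to the judge model is deliberately left out because it depends on your provider. The `{"score": ..., "reason": ...}` reply shape is an assumption for this example, since the rubric above only says "Return JSON only":

```python
import json

RUBRIC = (
    "You are grading whether an answer is grounded in the provided context.\n"
    "Return JSON only.\n\n"
    "Score 1 if every factual claim in the answer is directly supported by the context.\n"
    "Score 0 if the answer includes any factual claim not supported by the context.\n"
    "Ignore writing style. Do not reward longer answers."
)

# Assumed reply format; adjust to whatever your judge prompt actually specifies.
OUTPUT_FORMAT = 'Respond with JSON of the form {"score": 0 or 1, "reason": "<one sentence>"}.'


def build_judge_prompt(context: str, answer: str) -> str:
    """Assemble the rubric, the reply format, and the material to grade."""
    return "\n\n".join([RUBRIC, OUTPUT_FORMAT, "Context:\n" + context, "Answer:\n" + answer])


def parse_judge_output(raw: str) -> int:
    """Parse the judge's reply strictly; raise ValueError on anything unexpected."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"judge did not return JSON: {raw!r}") from exc
    if not isinstance(data, dict):
        raise ValueError(f"judge reply is not a JSON object: {raw!r}")
    score = data.get("score")
    # Accept only the integers 0 and 1 (not booleans, floats, or strings).
    if type(score) is not int or score not in (0, 1):
        raise ValueError(f"judge returned an invalid score: {score!r}")
    return score
```

Parsing strictly matters: a judge reply that cannot be parsed should surface as an error, not silently count as a pass or a fail.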

There are still limits. Judge models may favor longer answers, agree with confident wording, or miss domain-specific mistakes. That is why model-graded evals should be calibrated against human review. Take 50 examples, score them manually, compare your scores with the judge's, and study the cases where the two disagree.
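The calibration step can be as simple as the sketch below, which assumes both the human and the judge produce a 0/1 score per example ID. The function name and the returned keys are illustrative:

```python
from typing import Dict


def judge_agreement(human: Dict[str, int], judge: Dict[str, int]) -> dict:
    """Compare 0/1 scores from a human reviewer and a judge model on shared examples."""
    shared = sorted(set(human) & set(judge))
    if not shared:
        raise ValueError("no examples were scored by both the human and the judge")
    disagreements = [cid for cid in shared if human[cid] != judge[cid]]
    return {
        "agreement": 1 - len(disagreements) / len(shared),
        # Judge passed an answer the human failed: the "overly forgiving" direction.
        "false_passes": [cid for cid in disagreements if judge[cid] == 1],
        # Judge failed an answer the human passed.
        "false_fails": [cid for cid in disagreements if judge[cid] == 0],
    }


human_scores = {"a": 1, "b": 0, "c": 1, "d": 0}
judge_scores = {"a": 1, "b": 1, "c": 1, "d": 0}
report = judge_agreement(human_scores, judge_scores)
```

Reading the `false_passes` examples by hand is usually more informative than the agreement number itself, because they show which kind of mistake the rubric lets through.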

4. Format and Tool-Use Tests

A lot of production LLM failures are not semantic. They are structural.

The model returns invalid JSON. It calls the wrong tool. It forgets a required field. It mixes user-facing prose into a machine-readable response.

These are the easiest evals to automate:

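import json

ALLOWED_ACTIONS = {"refund", "escalate", "answer"}

def test_valid_json(output: str):
    # json.loads raises json.JSONDecodeError on malformed output, which fails the test.
    parsed = json.loads(output)
    # The response must be a JSON object that names one of the allowed actions.
    assert isinstance(parsed, dict)
    assert "action" in parsed
    assert parsed["action"] in ALLOWED_ACTIONS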


If a workflow depends on structured output, test the structure before worrying about style.
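The same idea extends to tool calls with required fields. In this sketch the `arguments` key and the per-action field names are hypothetical, so substitute your real tool schema:

```python
import json
from typing import List

# Hypothetical schema: the argument names each action requires.
REQUIRED_FIELDS = {
    "refund": {"order_id", "amount"},
    "escalate": {"reason"},
    "answer": {"text"},
}


def validate_tool_call(output: str) -> List[str]:
    """Return a list of structural problems; an empty list means the call is valid."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    if not isinstance(parsed, dict):
        return ["output is not a JSON object"]
    action = parsed.get("action")
    if not isinstance(action, str) or action not in REQUIRED_FIELDS:
        return [f"unknown action: {action!r}"]
    args = parsed.get("arguments", {})
    if not isinstance(args, dict):
        return ["arguments is not a JSON object"]
    # Report every missing field, sorted so failures are stable across runs.
    return [f"missing field: {name}" for name in sorted(REQUIRED_FIELDS[action] - set(args))]
```

Returning messages instead of raising makes it easy to count failure types across a whole eval run, for example how often prose leaks into a machine-readable response.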

5. Refusal and Abstention Tests

For many applications, the most important answer is “I do not know.”

RAG systems should refuse when the answer is not in the provided context. Agents should refuse to call tools with missing required fields. Medical, legal, and financial assistants need especially clear boundaries.

{
  "question": "Can I expense a first-class international flight?",
  "context": "The travel policy only covers economy and premium economy.",
  "expected_behavior": "Say first-class travel is not covered. Do not invent exceptions."
}


This type of eval catches a subtle but dangerous failure: the model tries to be helpful by filling gaps. In many products, unsupported helpfulness is worse than a polite refusal.

6. Regression Tests From Real Incidents

Every bad production answer is a gift, assuming you capture it. Add the input, retrieved context, output, and expected behavior to the eval set.

Over time, this creates a suite of “things we never want to break again.” It will be uneven and strange, because real users are uneven and strange. That is exactly why it is useful.

What To Put In The First Eval Set

The best examples come from actual failures. Synthetic examples are fine for bootstrapping, but real logs contain the edge cases you did not imagine.

Good candidates:

Questions the system answered incorrectly
Ambiguous requests
Missing-context cases where the model should refuse
Negation-heavy questions
Inputs with exact IDs, product names, or dates
Adversarial requests that try to override instructions
Long documents with one relevant paragraph

Avoid stuffing the eval set with only easy examples. The point is not to get a high score. The point is to create a tripwire for regressions.

It also helps to separate evals by risk. A formatting failure and a factual safety failure should not be averaged into one vague score.

I like using buckets:

Critical: privacy leaks, unsafe advice, unsupported policy claims
Correctness: wrong answer, missed constraint, bad citation
Reliability: invalid JSON, wrong tool call, timeout
Quality: tone, verbosity, formatting, helpfulness

This prevents a system from looking good because it improved style while quietly getting worse at factuality.
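A sketch of per-bucket reporting, assuming each eval result is tagged with one of the four buckets above. The result shape and the "any critical failure blocks the release" rule are assumptions for illustration:

```python
from collections import Counter
from typing import Dict, List

BUCKETS = ["critical", "correctness", "reliability", "quality"]


def summarize_by_bucket(results: List[dict]) -> Dict[str, dict]:
    """Count totals and failures per bucket without averaging across buckets."""
    totals, failures = Counter(), Counter()
    for result in results:
        totals[result["bucket"]] += 1
        if not result["passed"]:
            failures[result["bucket"]] += 1
    return {b: {"total": totals[b], "failed": failures[b]} for b in BUCKETS}


def should_block_release(summary: Dict[str, dict]) -> bool:
    # Assumed policy: one critical failure is enough to stop a release.
    return summary["critical"]["failed"] > 0


results = [
    {"id": "privacy_002", "bucket": "critical", "passed": False},
    {"id": "refund_014", "bucket": "correctness", "passed": True},
    {"id": "tone_007", "bucket": "quality", "passed": True},
    {"id": "tone_008", "bucket": "quality", "passed": True},
]
summary = summarize_by_bucket(results)
```

Here a single averaged score would be 75%, which hides the fact that the only failure is a critical one.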

Dataset Hygiene

Eval sets rot. Products change, policies change, documents move, and user behavior shifts.

A healthy eval process treats the dataset as a maintained artifact:

Remove examples that no longer reflect the product
Update expected answers when policy changes
Keep training data and eval data separate
Track which model and prompt produced each failure
Version the eval set alongside prompt and retrieval changes

The separation from training data matters. If you fine-tune on your eval examples, the score becomes less meaningful. The model may memorize the examples without improving on the broader behavior.

For a small team, the simplest version is a checked-in JSONL file:

{"id":"refund_014","input":"Can I get a refund after 45 days?","expected":"must say the standard window is 30 days"}
{"id":"policy_missing_003","input":"What is the Brazil contractor policy?","expected":"must say the provided context does not contain the answer"}


This is not fancy, but it makes prompt changes reviewable.
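Loading such a file takes only a few lines. The sketch below assumes the three keys shown in the example records. It also rejects duplicate IDs so that failures can be tracked by ID across versions:

```python
import json
from typing import Dict, Iterable

REQUIRED_KEYS = {"id", "input", "expected"}


def load_eval_cases(lines: Iterable[str]) -> Dict[str, dict]:
    """Parse JSONL lines into {id: case}, failing loudly on malformed records."""
    cases: Dict[str, dict] = {}
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        case = json.loads(line)
        missing = REQUIRED_KEYS - set(case)
        if missing:
            raise ValueError(f"line {line_number}: missing keys {sorted(missing)}")
        if case["id"] in cases:
            raise ValueError(f"line {line_number}: duplicate id {case['id']!r}")
        cases[case["id"]] = case
    return cases


# Works with any iterable of lines, including an open file handle, e.g.
#   with open("evals.jsonl") as f:   # hypothetical file name
#       cases = load_eval_cases(f)
sample = [
    '{"id":"refund_014","input":"Can I get a refund after 45 days?","expected":"must say the standard window is 30 days"}',
    "",
    '{"id":"policy_missing_003","input":"What is the Brazil contractor policy?","expected":"must say the provided context does not contain the answer"}',
]
cases = load_eval_cases(sample)
```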

The Development Loop

The better LLM teams were using a loop that looked like this:

Save failing examples from real usage
Add them to an eval set
Change the prompt, retrieval strategy, model, or tool schema
Run evals before shipping
Review failures manually
Promote important failures into permanent tests

This feels slower at first. Then it becomes liberating. You can change the system with confidence because you have a way to measure the blast radius.

The most useful metric is not just the overall score. It is the diff:

baseline: prompt_v7 + model_a
candidate: prompt_v8 + model_a

critical failures: 2 -> 0
retrieval recall@5: 84% -> 88%
JSON validity: 99% -> 99%
average answer length: 210 tokens -> 340 tokens


That last line may look harmless, but it can matter. Longer answers cost more, take longer, and may bury the useful part. Evals should measure the behaviors you care about operationally, not only correctness.
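Producing a diff like the one above takes very little code. This sketch assumes each run is summarized as a flat dictionary of numeric metrics. The metric names are illustrative and the values are copied from the example:

```python
from typing import Dict, Tuple


def diff_metrics(baseline: Dict[str, float], candidate: Dict[str, float]) -> Dict[str, Tuple]:
    """Return {metric: (old, new, delta)} for metrics present in both runs."""
    return {
        name: (baseline[name], candidate[name], candidate[name] - baseline[name])
        for name in baseline
        if name in candidate
    }


baseline = {"critical_failures": 2, "recall_at_5": 0.84, "json_validity": 0.99, "avg_answer_tokens": 210}
candidate = {"critical_failures": 0, "recall_at_5": 0.88, "json_validity": 0.99, "avg_answer_tokens": 340}

diff = diff_metrics(baseline, candidate)
for name, (old, new, delta) in diff.items():
    # "+g" prints a signed delta without trailing floating-point noise.
    print(f"{name}: {old} -> {new} ({delta:+g})")
```

Whether a given delta is an improvement depends on the metric (fewer failures is better, higher recall is better), so the judgment stays with the reviewer.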

The Takeaway

Prompt engineering got the attention. Evaluation made it real engineering.

The teams that moved fastest were not the teams with the fanciest prompts. They were the teams that could tell when a prompt change broke refunds, when a chunking change hurt retrieval, and when a model upgrade improved style but damaged factuality.

For LLM apps, evals are the new unit tests. They are imperfect, annoying, and absolutely necessary.