llms · evaluation · Python

A small harness for LLM regression tests

Designing an evaluation loop that treats prompts, datasets, and model settings as versioned artifacts.

2026-01-27 · 6 min read

LLM evaluation should feel closer to database migration testing than dashboard vibes. A prompt change is a code change, and a model setting change is a runtime change.
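As a sketch of what "versioned artifact" can mean in practice, the run configuration below pins the prompt, dataset, and model settings explicitly; the RunConfig name and its fields are illustrative assumptions, not an existing schema.

type RunConfig = {
  promptVersion: string;    // e.g. a git SHA or tag for the prompt template
  datasetVersion: string;   // which frozen snapshot of test cases was used
  model: string;            // model identifier, treated as part of the runtime
  temperature: number;      // runtime setting, recorded so reruns are reproducible
  maxOutputTokens: number;  // runtime setting, recorded alongside results
};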

score = accuracy - \lambda \cdot latency_{p95}
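To make that tradeoff concrete, a scoring helper might look like the sketch below, where accuracy is a fraction in [0, 1], latencies are per-case measurements in milliseconds, and lambda is whatever penalty weight you choose (it also absorbs the unit mismatch). percentile and scoreRun are hypothetical helpers, not part of any library.

// Nearest-rank p95 over the observed latencies.
function percentile(values: number[], p: number): number {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, Math.min(sorted.length - 1, rank - 1))];
}

// Combined score: accuracy minus a latency penalty, matching the formula above.
function scoreRun(accuracy: number, latenciesMs: number[], lambda: number): number {
  return accuracy - lambda * percentile(latenciesMs, 95);
}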

The harness I like for early projects has four tables: test cases, expected behaviors, model runs, and human adjudications. That shape keeps the system honest without pretending every answer has a single canonical string.

type EvalCase = {
  id: string;                      // stable identifier so runs can be diffed across versions
  prompt: string;                  // the exact prompt text, versioned with the code
  rubric: string[];                // graded criteria a response must satisfy
  expectedFailureModes: string[];  // known ways the model tends to go wrong on this case
};
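The other three tables can be sketched the same way; the ExpectedBehavior, ModelRun, and Adjudication shapes below are illustrative assumptions, not a fixed schema.

type ExpectedBehavior = {
  caseId: string;          // foreign key into EvalCase
  description: string;     // behavior a good response should exhibit
  severityIfMissed: "low" | "medium" | "high";
};

type ModelRun = {
  id: string;
  caseId: string;          // foreign key into EvalCase
  promptVersion: string;   // which versioned prompt produced this output
  model: string;
  output: string;
  latencyMs: number;
};

type Adjudication = {
  runId: string;           // which ModelRun was judged
  rubricItem: string;      // which criterion from the case rubric
  pass: boolean;
  note: string;            // free-form reasoning from the human adjudicator
};

Recording adjudications per rubric item, rather than as a single pass/fail, is what lets disagreements surface without forcing every answer into one canonical string.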
Related project

See the retrieval evaluation lab for a project-shaped version of this workflow.