llms · evaluation · Python
A small harness for LLM regression tests
Designing an evaluation loop that treats prompts, datasets, and model settings as versioned artifacts.
2026-01-27 · 6 min read
LLM evaluation should feel closer to database migration testing than to dashboard vibes. A prompt change is a code change, and a model-settings change is a runtime change.
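If prompts, datasets, and model settings really are versioned artifacts, every run should record which version of each it was executed against. A minimal sketch of that record in TypeScript, with field names that are mine rather than anything from this harness:

```ts
// Every run pins the exact versions it was executed against, so a regression
// can be attributed to a prompt edit, a dataset edit, or a settings change.
// Field names here are illustrative, not a fixed schema.
type RunConfig = {
  promptVersion: string;   // git SHA or content hash of the prompt template
  datasetVersion: string;  // version tag of the eval case set
  model: string;           // e.g. a dated model snapshot name
  temperature: number;
  maxOutputTokens: number;
};
```

Two runs with identical configs should be directly comparable; two runs that differ only in `promptVersion` tell you the prompt edit caused the diff.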
The harness I like for early projects has four tables: test cases, expected behaviors, model runs, and human adjudications. That shape keeps the system honest without pretending every answer has a single canonical string.
```ts
// One eval case: the prompt, the rubric it is graded against, and the
// failure modes we already expect to see.
type EvalCase = {
  id: string;
  prompt: string;
  rubric: string[];
  expectedFailureModes: string[];
};
```

See the retrieval evaluation lab for a project-shaped version of this workflow.
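`EvalCase` covers the first two tables, cases and expected behaviors. The other two, model runs and human adjudications, might be shaped roughly like this; the fields are my guess at a minimal schema, not the harness's actual one:

```ts
// Sketch of the remaining two tables; field names are illustrative.
type ModelRun = {
  id: string;
  caseId: string;           // points back to an EvalCase
  promptVersion: string;    // which prompt artifact produced this output
  modelSettings: Record<string, unknown>;
  output: string;
  createdAt: string;        // ISO timestamp
};

type Adjudication = {
  id: string;
  runId: string;            // points back to a ModelRun
  verdict: "pass" | "fail" | "unsure";
  notes: string;            // which rubric items or failure modes applied
};
```

Keeping adjudications in their own table is what avoids pretending every answer has a single canonical string: one run can accumulate several human verdicts without the run record ever changing.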