Evals, Scores, and Prompt Receipts


Back in the OpenRouter phase I caught a generated test bug by hand - the model wrote a comment correctly diagnosing that execFile rejects, then wrote code that ignored its own comment. I only spotted it because I was reading closely. That doesn’t scale. Phases 37 and 38 of kodr are about not having to squint: put a number on model output, and keep a receipt for every prompt.

kodr eval (phase 37)

kodr eval --suite path runs a suite of cases against a model and scores them. A suite is plain JSON: a name, a description, and cases, where each case has a prompt and a list of assertions. Three assertion types, all checked against the structured proposal the model returns:

  • files_exist - the listed paths show up in the proposal’s files or patches.
  • content_matches - a file’s content matches a regex.
  • tests_pass - write the proposal’s files to a temp dir and actually run the test command there, via the same verification runner the real flow uses.

A case scores passCount / totalCount, 0 to 1. That tests_pass assertion catches the execFile-comment bug class directly: if the generated tests are broken, they fail in the temp dir, and the assertion returns false. No squinting required.
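To make the scoring concrete, here is a minimal sketch covering only the two assertions that need no subprocess. The field names (type, paths, path, pattern, files) are assumptions for illustration, not kodr's actual schema, and it ignores patches for brevity.

```javascript
// Hypothetical shapes: proposal.files is [{ path, content }];
// each assertion carries a `type` plus type-specific fields.
const checks = {
  files_exist: (a, proposal) =>
    a.paths.every((p) => proposal.files.some((f) => f.path === p)),
  content_matches: (a, proposal) => {
    const file = proposal.files.find((f) => f.path === a.path);
    return file !== undefined && new RegExp(a.pattern).test(file.content);
  },
};

// Score = passCount / totalCount, in the range 0 to 1.
function scoreCase(testCase, proposal) {
  const total = testCase.assertions.length;
  if (total === 0) return 0;
  const passed = testCase.assertions.filter((a) => {
    const check = checks[a.type];
    return check ? check(a, proposal) : false; // unknown types count as failures
  }).length;
  return passed / total;
}

const exampleCase = {
  prompt: "Build a todo CLI",
  assertions: [
    { type: "files_exist", paths: ["todo.mjs"] },
    { type: "content_matches", path: "todo.mjs", pattern: "process\\.argv" },
  ],
};
const proposal = { files: [{ path: "todo.mjs", content: "console.log('todo')" }] };
console.log(scoreCase(exampleCase, proposal)); // 0.5
```

The file exists but never touches process.argv, so one of two assertions holds. An empty proposal scores 0 against the same case.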

The gnarly bit: node --test inside node --test

The eval’s own tests run under node --test. The tests_pass assertion then spawns another node --test in a temp dir. Node 24 added a guard that detects recursive test-runner usage and silently skips file discovery - so the inner run exited 0 even when the tests it was meant to run had failed. Which means the test “fails when the generated tests contain a bug” would itself always pass. A test that can’t fail is worse than no test.

The fix lives in the verification runner: strip NODE_TEST_CONTEXT and NODE_CHANNEL_FD from the child environment before spawning. Those are the vars that whisper “you are a test worker” to Node, and removing them lets the grandchild run stand on its own.
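A sketch of what that stripping could look like. The helper name childEnv and the call site in the comment are hypothetical, not lifted from the kodr source; only the two variable names come from the fix itself.

```javascript
// Variables that mark a process as a node --test worker.
const TEST_WORKER_VARS = ["NODE_TEST_CONTEXT", "NODE_CHANNEL_FD"];

// Hypothetical helper: copy the parent environment minus those variables.
// The parent's own environment object is left untouched.
function childEnv(parentEnv = process.env) {
  const env = { ...parentEnv };
  for (const name of TEST_WORKER_VARS) delete env[name];
  return env;
}

// Assumed call site: pass the cleaned env when spawning the inner run, e.g.
//   execFile("node", ["--test"], { cwd: tempDir, env: childEnv() }, callback);
```

Copying rather than mutating process.env matters here: the outer test run still needs its own worker variables intact.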

Both live runs scored zero, and that’s the point

I ran the todo-cli suite against a local qwen twice. Once the model saw the workspace already had the files and returned an empty proposal (nothing to check, 0/4). Once it returned a messages array in the wrong schema and extractProposal threw (recorded as a completion error, 0/4). Both zeros are the eval doing its job - two failure modes that would otherwise need a human reading transcripts, surfaced automatically. A score of 0 means no assertion held (in both runs here, because there was no usable output), 0.5 means “half the assertions held”, 1.0 means “all green”. It’s a number you can trend across model versions.

Prompt versioning (phase 38)

Evals give you a score. Phase 38 gives you the thread back to what produced it. Before this, a run’s summary.json had no link to the prompt text behind it. Now every run carries a promptId, resolved in priority order:

--prompt-id slug      → use it directly
--prompt-file path    → slug derived from the filename
-p "inline prompt"    → first 8 hex characters of the SHA-256 of the text
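
The priority order above could be sketched like this. The option names and the rule for deriving a slug from a filename (basename minus extension) are assumptions; only the order and the 8-character SHA-256 prefix come from the actual behavior.

```javascript
import { createHash } from "node:crypto";
import { basename, extname } from "node:path";

// Hypothetical resolver: explicit id, then file slug, then content hash.
function resolvePromptId({ promptId, promptFile, prompt }) {
  if (promptId) return promptId;
  if (promptFile) return basename(promptFile, extname(promptFile));
  if (prompt !== undefined) {
    // First 8 hex characters of the SHA-256 of the prompt text.
    return createHash("sha256").update(prompt, "utf8").digest("hex").slice(0, 8);
  }
  throw new Error("no prompt source given");
}

console.log(resolvePromptId({ prompt: "abc" })); // ba7816bf
console.log(resolvePromptId({ promptFile: "prompts/todo-cli.md" })); // todo-cli
```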

The content hash is deterministic - same prompt, same id; a whitespace tweak, a new id. Files in the prompts/ stash auto-link to their runs by filename. Then kodr prompt-history <id> scans .kodr/runs/, finds every run sharing that id, and prints them chronologically with model, ok/fail, and the eval score if one was recorded:

2026-05-28T10:12:00.000Z  qwen/qwen3.6-35b-a3b  [ok]
2026-05-28T10:18:00.000Z  qwen/qwen3.6-35b-a3b  [ok]

Two small modules, kept apart on purpose: prompt-id.mjs is pure and synchronous (just hashing and string-munging), run-history.mjs is all async I/O. Neither drags the other’s concerns along.
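
For a feel of the history scan, here is a rough sketch with the same pure/async split. It assumes each run is a directory under .kodr/runs/ holding a summary.json with promptId, startedAt, model, ok, and an optional score; those field names, the [fail] label, and the score=… suffix are guesses for illustration.

```javascript
import { readdir, readFile } from "node:fs/promises";
import { join } from "node:path";

// Pure: filter by id, sort chronologically, format one line per run.
function formatHistory(summaries, id) {
  return summaries
    .filter((s) => s.promptId === id)
    // ISO-8601 UTC timestamps sort correctly as plain strings.
    .sort((a, b) => (a.startedAt < b.startedAt ? -1 : a.startedAt > b.startedAt ? 1 : 0))
    .map((s) => {
      const status = s.ok ? "[ok]" : "[fail]";
      const score = typeof s.score === "number" ? `  score=${s.score}` : "";
      return `${s.startedAt}  ${s.model}  ${status}${score}`;
    });
}

// Async I/O: load every readable summary.json under runsDir.
async function loadSummaries(runsDir = ".kodr/runs") {
  const entries = await readdir(runsDir, { withFileTypes: true });
  const summaries = [];
  for (const entry of entries) {
    if (!entry.isDirectory()) continue;
    try {
      const text = await readFile(join(runsDir, entry.name, "summary.json"), "utf8");
      summaries.push(JSON.parse(text));
    } catch {
      // Skip runs with a missing or unparsable summary.
    }
  }
  return summaries;
}
```

Usage would be along the lines of formatHistory(await loadSummaries(), id).forEach((line) => console.log(line)).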

Put together, 37 and 38 are the boring infrastructure that makes “is this prompt better than the last one?” a question with an answer instead of a vibe. You can’t improve what you won’t measure, and you can’t measure what you can’t trace.

Links: