Exercising the Harness


By phase 14 the harness had all its primitives: extraction, context, safe writes, verification, tools, healing, cycles. The next four phases are less about adding parts and more about poking the thing until it complains. So I’m batching them into one post.

Phase 15: an install, basically

I’ll be quick about this one because it is quick. npm run install-local writes a small shell shim to ~/.local/bin/kodr so I can run it from any directory instead of typing a path. There are --dir and --name flags if you want it somewhere else or called something else. The tests install into a temp bin and check --version. That’s the whole phase. Ergonomics, not architecture.
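For the curious, a shim installer of this kind can be sketched in a few lines. This is a minimal illustration, not the actual kodr script: the installShim name, the entry parameter, and the exact shim contents are assumptions; only the default location and the --dir/--name idea come from the description above.

```javascript
import fs from "node:fs";
import os from "node:os";
import path from "node:path";

// Hypothetical helper: `dir` and `name` mirror the --dir and --name options,
// `entry` is the absolute path of the CLI entry script (an assumption).
function installShim({
  entry,
  dir = path.join(os.homedir(), ".local", "bin"),
  name = "kodr",
}) {
  fs.mkdirSync(dir, { recursive: true });
  const target = path.join(dir, name);
  // exec replaces the shell with node; "$@" forwards every argument unchanged.
  const script = `#!/bin/sh\nexec node "${entry}" "$@"\n`;
  fs.writeFileSync(target, script);
  // Set the executable bits explicitly rather than relying on the creation mode.
  fs.chmodSync(target, 0o755);
  return target;
}
```

A test can then point dir at a temporary directory, run the resulting file with --version, and never touch the real ~/.local/bin.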

Phase 16: replay and comparison

This one earns its keep. The repo exists to learn about local models, and you can’t learn much from a run you’ve already thrown away.

So two things:

  • kodr replay <run-dir> parses the saved artifacts from a previous run, no model call needed. The run already wrote everything to disk - replay just reads it back.
  • Model comparison runs the same prompt across a list of model ids and writes a summary to .kodr/comparison.json, with metadata appended to process/experiments.jsonl.
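Replay is the simpler of the two to picture. A minimal sketch, assuming a run directory that holds a single JSON artifact (the run.json file name and the replayRun function are hypothetical, not kodr's real layout), with explicit errors for a missing or unparseable file:

```javascript
import fs from "node:fs";
import path from "node:path";

// Assumption: each run directory contains one JSON artifact named run.json.
function replayRun(runDir, artifactName = "run.json") {
  const file = path.join(runDir, artifactName);
  let raw;
  try {
    raw = fs.readFileSync(file, "utf8");
  } catch (err) {
    // Translate the common case into a readable message; rethrow anything else.
    if (err.code === "ENOENT") {
      throw new Error(`replay: missing artifact ${file}`);
    }
    throw err;
  }
  try {
    // No model call anywhere: the run is reconstructed purely from disk.
    return JSON.parse(raw);
  } catch {
    throw new Error(`replay: corrupt artifact ${file} (invalid JSON)`);
  }
}
```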

The point is being able to argue about model behaviour from saved artifacts instead of from memory. “Qwen did better than Llama on this” is a much better sentence when there’s a JSON file behind it. And because comparison reuses one prompt across models, the model is the only thing I deliberately vary - which is what makes it a comparison.

The tests do this without touching a real model: replay is parsed from fixtures, and the comparison runs at least two fake models. Same trick as the recorder from way back in phase 3 - fake the model, test everything around it.
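The fake-model trick works because the model call is just a function passed in. Here is a rough sketch of that shape; the compareModels name, the summary fields, and the synchronous runModel signature are all assumptions for illustration (a real model call would be async), while the two output paths are the ones named above.

```javascript
import fs from "node:fs";
import path from "node:path";

// runModel is injected, so a test can pass a fake instead of calling a real model.
function compareModels({ prompt, modelIds, runModel, outDir }) {
  const results = modelIds.map((modelId) => {
    const started = Date.now();
    const output = runModel(modelId, prompt); // same prompt for every model
    return { modelId, output, durationMs: Date.now() - started };
  });
  const summary = { prompt, results };

  fs.mkdirSync(path.join(outDir, ".kodr"), { recursive: true });
  fs.writeFileSync(
    path.join(outDir, ".kodr", "comparison.json"),
    JSON.stringify(summary, null, 2),
  );

  // JSONL: one JSON object per line, appended so earlier experiments survive.
  fs.mkdirSync(path.join(outDir, "process"), { recursive: true });
  const meta = { kind: "comparison", models: modelIds, at: new Date().toISOString() };
  fs.appendFileSync(
    path.join(outDir, "process", "experiments.jsonl"),
    JSON.stringify(meta) + "\n",
  );
  return summary;
}
```

In a test, runModel can be as small as (modelId, prompt) => `${modelId} saw: ${prompt}`, which is enough to check the files and the summary shape.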

Phase 17: the part where I review my own work

Fourteen phases of building primitives in isolation, each one neat on its own. Phase 17 was a focused security review of the whole thing, and - surprise - the gaps were all at the seams.

The findings, roughly:

  • Reads could escape the workspace. read_file, --prompt-file, and replay paths now reuse the same workspace jail the safe-writes gate already used. And existing symlink targets are rejected, so a read or write can’t follow a link out of the repo. The jail was always there for writes; it just wasn’t applied everywhere it needed to be.
  • Healing applied repairs too eagerly. One-shot healing is now dry-run by default and only verifies repaired files after an explicit apply. The one repair is still the limit; it just no longer happens behind your back.
  • Skills bypassed the context budget. Markdown skills now have per-skill and total byte caps, and loaded skills get delimited as untrusted workspace Markdown. A SKILL.md is still just text the model might have written.
  • fetch_url trusted DNS. The bounded fetch tool now rejects resolved private and local addresses, not just obvious ones, and caps response bodies. Blocking localhost is easy; blocking a hostname that resolves to 127.0.0.1 is the actual job.
  • Replay errors were raw. Missing and corrupt artifacts now report explicit errors instead of a stack trace.
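To make the fetch_url finding concrete: the check has to run on the addresses a hostname resolves to, not on the hostname string. The sketch below uses Node's net.BlockList for the range matching. The function names and the exact set of blocked ranges are my assumptions for illustration, and the list is not exhaustive.

```javascript
import net from "node:net";

// Ranges treated as private or local in this sketch (an illustrative, incomplete list).
const blocked = new net.BlockList();
blocked.addSubnet("0.0.0.0", 8, "ipv4");
blocked.addSubnet("10.0.0.0", 8, "ipv4");
blocked.addSubnet("127.0.0.0", 8, "ipv4"); // loopback
blocked.addSubnet("169.254.0.0", 16, "ipv4"); // link-local
blocked.addSubnet("172.16.0.0", 12, "ipv4");
blocked.addSubnet("192.168.0.0", 16, "ipv4");
blocked.addAddress("::1", "ipv6"); // IPv6 loopback
blocked.addSubnet("fc00::", 7, "ipv6"); // unique local
blocked.addSubnet("fe80::", 10, "ipv6"); // link-local

function isBlockedAddress(address) {
  const family = net.isIP(address); // 4, 6, or 0 when not an IP literal
  if (family === 0) return true; // fail closed on anything unparseable
  return blocked.check(address, family === 6 ? "ipv6" : "ipv4");
}

// Hypothetical gate: call with every address the hostname resolved to.
function assertPublicAddresses(hostname, resolvedAddresses) {
  if (resolvedAddresses.length === 0) {
    throw new Error(`fetch_url: ${hostname} did not resolve`);
  }
  for (const address of resolvedAddresses) {
    if (isBlockedAddress(address)) {
      throw new Error(`fetch_url: ${hostname} resolves to blocked address ${address}`);
    }
  }
}
```

The addresses would come from something like dns.promises.lookup(hostname, { all: true }). Checking every returned address, rather than only the first, is the part that is easy to get wrong.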

There’s also a chunk of plain honesty in here. The verification runner’s results now state the trust boundary out loud: commands are allowlisted and shell-free, but the npm scripts they run are trusted workspace code. And some later phases produced library primitives, not finished CLI commands - so the help text now says “implemented primitive” instead of dangling a command that doesn’t exist. Lying to yourself in your own --help output is a bad habit.

None of this was new features. It was reading my own roadmap with a suspicious eye and finding that “correct in isolation” and “correct wired together” are different claims.

Phase 18: build something and watch it break

This is the one I actually enjoyed, and the one with a point worth dwelling on.

An eval measures a model against a fixed answer key. Phase 18 is not that. The idea here is to exercise the system: pick a real app, have Kodr build it end to end, and see what falls over. Not “did the model score 8/10” but “what breaks when you actually try to use this for the thing it’s for”.

I sketched a candidate list - CLI todo app, Markdown blog generator, Express notes API, CSV expense analyzer, SQLite habit tracker, Markdown search, React Kanban board - and picked the CLI todo app first because it’s small enough to eyeball but still hits multi-file generation, persistence, argument parsing, and tests.

Then I let it run, and the breaking started:

  1. The app failed its own tests. The generated store tried to create the JSON file path as a directory, and the generated tests expected list() to return data while the implementation just logged it. A useful failure - that’s a real bug a real person would write.
  2. But it also exposed a harness gap. Generated examples live in subdirectories, and the verification runner could only run from the repo root. So Kodr grew --test-cwd <path>, which jails that path inside the workspace and runs the allowlisted commands from there. The app’s bug pointed at one of my bugs.
  3. Second run, new failure. Subproject verification now worked, and promptly caught the model using CommonJS require() in an ESM test file. Two fixes fell out: the prompt now explicitly forbids CommonJS globals, and Kodr now marks the whole run as failed when verification fails - so a caller doesn’t have to go spelunking in nested test output to notice the generation needs repair.
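The two harness fixes from that list, a jailed test directory and a run status that follows verification, fit together roughly like this. It is a sketch under assumptions, not kodr's code: jailPath and runVerification are hypothetical names, the allowlist is keyed by the joined command line for brevity, and the jail resolves symlinks and checks where they land, which is a slightly different policy from rejecting symlinks outright.

```javascript
import { spawnSync } from "node:child_process";
import fs from "node:fs";
import path from "node:path";

// Resolve `candidate` against the workspace and refuse anything outside it.
// realpathSync follows symlinks (and throws if the path does not exist),
// so a link that points out of the repo fails the check as well.
function jailPath(workspace, candidate) {
  const root = fs.realpathSync(workspace);
  const resolved = fs.realpathSync(path.resolve(root, candidate));
  const rel = path.relative(root, resolved);
  if (rel === ".." || rel.startsWith(".." + path.sep) || path.isAbsolute(rel)) {
    throw new Error(`path escapes workspace: ${candidate}`);
  }
  return resolved;
}

// commands: array of [executable, ...args]; allowlist: Set of joined command lines.
function runVerification({ workspace, testCwd = ".", commands, allowlist }) {
  const cwd = jailPath(workspace, testCwd);
  const results = commands.map(([cmd, ...args]) => {
    const key = [cmd, ...args].join(" ");
    if (!allowlist.has(key)) {
      return { command: key, ok: false, reason: "not allowlisted" };
    }
    // shell: false means the arguments are never interpreted by a shell.
    const child = spawnSync(cmd, args, { cwd, shell: false, encoding: "utf8" });
    return { command: key, ok: child.status === 0, exitCode: child.status };
  });
  // One top-level status, so callers need not dig through nested output.
  const status = results.every((r) => r.ok) ? "passed" : "failed";
  return { status, cwd, results };
}
```

A caller only has to look at status; the per-command results are there for whoever does the repair.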

That loop - build, observe a failure, work out whether it’s the app’s fault or the harness’s, fix whichever it is - is the entire value of phase 18. Two runs of one small app turned up a path-handling bug, a missing verification feature, a prompt weakness, and a run-status bug. You don’t find those by adding more primitives. You find them by trying to use the ones you have.

The finished app lives in examples/todo-cli: ESM, Node built-ins, JSON persistence, positional CLI commands, native node:test coverage. Generated by the harness, debugged by the harness, and now a fixture I can re-run whenever I change something underneath it.
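For a sense of scale, the two bugs from the first run (treating the JSON file path as a directory, and a list() that logged instead of returning) sit in a store about this size. What follows is my own illustrative sketch of the fixed shape, not the generated code in examples/todo-cli; createStore and the todo fields are assumptions.

```javascript
import fs from "node:fs";
import path from "node:path";

function createStore(file) {
  function load() {
    if (!fs.existsSync(file)) return [];
    return JSON.parse(fs.readFileSync(file, "utf8"));
  }

  function save(todos) {
    // Create the parent directory, not the file path itself.
    fs.mkdirSync(path.dirname(file), { recursive: true });
    fs.writeFileSync(file, JSON.stringify(todos, null, 2));
  }

  return {
    add(title) {
      const todos = load();
      const todo = { id: todos.length + 1, title, done: false };
      save([...todos, todo]);
      return todo;
    },
    // Return the data; printing belongs to the CLI layer, which keeps list() testable.
    list() {
      return load();
    },
  };
}
```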

The shape of these four

Install, replay, harden, exercise. After fourteen phases of making things, this batch was about living with them - running the harness conveniently, learning from saved runs, distrusting my own seams, and dogfooding the whole stack on a real app until it told me where it hurt. Less satisfying to write up than a shiny new primitive. Considerably more honest.
