kodr Edits kodr


The most honest test of a coding harness is whether it can edit itself. Phases 54-56 of kodr are three “self-dev” trials: ask the local model (qwen/qwen3.6-35b-a3b) to modify kodr’s own source, and treat every failure as a harness bug rather than a model excuse. The results improved across the three trials, but the credit belongs to the prompts and the harness, not the model.

The trials

Phase 54 (the rough one). First attempt to have the model edit kodr. It hallucinated a whole file and tried to apply it over the real one, and its test patch guessed at assertion names and indentation that didn’t exist. The trial did yield one genuinely useful flag, --protect-existing, which blocks files[] overwrites of existing paths. That catches the worst-case failure mode, “hallucinate a full file, splat it over the real one”, before any write lands.
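A minimal sketch of what a --protect-existing style guard might look like. The function name, the shape of the files[] entries, the paths, and the injected exists callback are all assumptions for illustration, not kodr’s actual code:

```javascript
// Hypothetical sketch: split proposed files[] entries into ones that are
// safe to write and ones that would overwrite something already there.
// `exists` is injected so the guard can be exercised without touching disk.
function protectExisting(files, exists) {
  const allowed = [];
  const blocked = [];
  for (const file of files) {
    if (exists(file.path)) {
      blocked.push(file); // would splat an existing file: refuse
    } else {
      allowed.push(file); // genuinely new path: fine to create
    }
  }
  return { allowed, blocked };
}

// Example: the model proposes a "new" app.mjs that already exists.
const onDisk = new Set(["src/app.mjs"]); // hypothetical path
const proposal = [
  { path: "src/app.mjs", content: "// hallucinated full file" },
  { path: "src/registry.mjs", content: "export const registry = [];" },
];
const result = protectExisting(proposal, (p) => onDisk.has(p));
console.log(result.blocked.map((f) => f.path).join(", ")); // src/app.mjs
```

The point of the design is that the decision happens on the whole proposal before a single write, so a blocked entry can fail the run instead of half-applying it.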

Phase 55 (kodr registry). Harder: a multi-file change wiring a new command through dispatch. Three obstacles surfaced before a run could finish cleanly. First, the test command node --test <file> wasn’t allowlisted by the verification runner. Second, writes were applied before the test command was validated, so an invalid command left a corrupted app.mjs that needed git checkout to restore. Third, a leftover untracked file from the failed run then tripped --protect-existing and caused doubled patches. With a clean tree, the final run applied 4 changes (three patches plus the test) and passed. The model nailed all three patches because the prompt handed it the exact literal lines to match. No guessing.
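Why exact literal lines help can be sketched with a strict search-and-replace applier. This assumes patches are search/replace pairs; the function name and the sample source are hypothetical, not kodr’s real patch format:

```javascript
// Hypothetical sketch: apply a patch only if the search string occurs
// exactly once, byte for byte (indentation included).
function applyExactPatch(source, search, replacement) {
  if (search.length === 0) {
    throw new Error("search string is empty");
  }
  const first = source.indexOf(search);
  if (first === -1) {
    throw new Error("search string not found");
  }
  if (source.indexOf(search, first + 1) !== -1) {
    throw new Error("search string is ambiguous");
  }
  // Slice instead of String.prototype.replace so "$&" and friends in the
  // replacement are inserted literally.
  return source.slice(0, first) + replacement + source.slice(first + search.length);
}

const source = 'switch (cmd) {\n  case "inspect":\n    return inspect(args);\n}\n';
const patched = applyExactPatch(
  source,
  '  case "inspect":\n',
  '  case "registry":\n    return registry(args);\n  case "inspect":\n',
);
console.log(patched.includes('case "registry"')); // true
```

A guessed search string with the wrong indentation simply throws, which is a much better outcome than a fuzzy match landing in the wrong place.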

Phase 56 (--languages filter). Five patches across two files, no new files. The harness fixes from 55 landed first: pre-flight test-command validation (catch a bad --test before any write) and a git-aware --protect-existing (use git ls-files so untracked leftovers don’t block new files). All five patches applied. The run still needed one manual fix, and it’s the interesting one.
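Both fixes amount to checking everything before anything is written. A rough sketch under stated assumptions: the allowlist shape, function names, and paths are hypothetical, and the git ls-files output is passed in as a string (in practice it could come from something like execFileSync("git", ["ls-files"]) in node:child_process):

```javascript
// Hypothetical allowlist: each entry is a fixed argv prefix that must be
// followed by at least one more argument (the test file).
const ALLOWED_TEST_PREFIXES = [["node", "--test"]];

function isAllowedTestCommand(argv) {
  return ALLOWED_TEST_PREFIXES.some(
    (prefix) =>
      argv.length > prefix.length && prefix.every((tok, i) => argv[i] === tok),
  );
}

// `git ls-files` prints one tracked path per line.
function parseTrackedFiles(lsFilesOutput) {
  return new Set(lsFilesOutput.split("\n").filter((line) => line.length > 0));
}

// Run every check up front; the caller writes only when `ok` is true.
function preflight({ testCommand, files, tracked }) {
  const errors = [];
  if (!isAllowedTestCommand(testCommand)) {
    errors.push(`test command not allowlisted: ${testCommand.join(" ")}`);
  }
  for (const file of files) {
    // Only tracked paths are protected; an untracked leftover from a
    // failed run does not block creating the file again.
    if (tracked.has(file.path)) {
      errors.push(`refusing to overwrite tracked file: ${file.path}`);
    }
  }
  return { ok: errors.length === 0, errors };
}

const tracked = parseTrackedFiles("src/app.mjs\ntest/app.test.mjs\n");
const report = preflight({
  testCommand: ["node", "--test", "test/registry.test.mjs"],
  files: [{ path: "src/registry.mjs" }],
  tracked,
});
console.log(report.ok); // true
```

Because preflight only returns a report, an invalid test command can no longer leave a half-written app.mjs behind.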

The two-location bug

app.mjs has a structural split for every value-consuming flag: one place, the dispatch list, marks the flag as “consumes the next token”, and a separate assignValue switch routes that value to the right field. The model added --languages to assignValue but missed the dispatch list, so the parser treated it as an unknown boolean flag and threw Unknown option: --languages. One-line fix.
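The split can be reproduced in a few lines. This is an illustrative parser, not kodr’s app.mjs: VALUE_FLAGS stands in for the dispatch list, and every name other than assignValue and the two flags is an assumption:

```javascript
// Location 1 (stand-in for the dispatch list): flags that consume the next token.
const VALUE_FLAGS = new Set(["--languages"]);
const BOOLEAN_FLAGS = new Set(["--protect-existing"]);

// Location 2: route a consumed value to the right field.
function assignValue(options, flag, value) {
  switch (flag) {
    case "--languages":
      options.languages = value.split(",");
      break;
    default:
      throw new Error(`No assignment for ${flag}`);
  }
}

function parseArgs(argv, valueFlags = VALUE_FLAGS) {
  const options = {};
  for (let i = 0; i < argv.length; i++) {
    const token = argv[i];
    if (valueFlags.has(token)) {
      if (i + 1 >= argv.length) throw new Error(`Missing value for ${token}`);
      assignValue(options, token, argv[++i]);
    } else if (BOOLEAN_FLAGS.has(token)) {
      options.protectExisting = true;
    } else {
      throw new Error(`Unknown option: ${token}`);
    }
  }
  return options;
}

// Both locations updated: the flag parses.
console.log(parseArgs(["--languages", "js,ts"]).languages.join("+")); // js+ts

// Only assignValue updated (empty dispatch list): the phase 56 failure.
try {
  parseArgs(["--languages", "js,ts"], new Set());
} catch (err) {
  console.log(err.message); // Unknown option: --languages
}
```

Nothing in assignValue is wrong in the failing case, which is why the bug is easy to miss when only that function is in view.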

That’s the recurring shape: the model reliably patches one of two coupled locations and misses the other. The fix isn’t more model - it’s naming both in the prompt: “this flag takes a value; add it to both the dispatch list and the assignValue switch.” Coupled edit sites are exactly where a small model needs to be told where to look.

The scorecard

| Phase | Task | Model output | Manual fixes | Root cause |
|-------|------|--------------|--------------|------------|
| 54 | Add fields to inspectWorkspace | 1 correct | 2 | No file content in context; hallucinated a whole file |
| 55 | kodr registry command | 3 patches + test | 0 (after fixes) | Allowlist gap; stale leftover file |
| 56 | --languages filter | 5 correct | 1 | Missed one of two coupled locations |

By phase 56: zero context failures, zero hallucinations, correct multi-file reasoning. The remaining gap is purely “which spots in a big file must stay in sync” - a prompting problem, not a reasoning one. What works now: --inspect-context puts the right function bodies in front of the model, exact search strings kill indentation guessing, git-aware --protect-existing tells new files from tracked ones, and pre-flight validation catches bad test args before anything touches disk. Every one of those is a fix a failed self-dev run dragged out of the harness. The model editing the tool is how the tool gets better.
