The most honest test of a coding harness is whether it can edit itself. Phases 54-56 of kodr are three “self-dev” trials: ask the local model (qwen/qwen3.6-35b-a3b) to modify kodr’s own source, and treat every failure as a harness bug rather than a model excuse. Results improved across the three, but the real improvement was in the prompts and the harness, not the model.
The trials
Phase 54 (the rough one). First attempt to have the model edit kodr. It hallucinated a whole file and tried to apply it over the real one, and its test patch guessed assertion names and indentation that didn’t exist. The run did yield one genuinely useful harness flag, --protect-existing, which blocks files[] overwrites of existing paths and so catches the worst-case “hallucinate a full file, splat it over the real one” mode before any write lands.
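The guard itself is small. A minimal sketch of the idea, assuming a files[] entry carries `path` and `content` fields and that the existence check and the writer are injected; the function names and shapes are illustrative, not kodr's actual code:

```javascript
// Illustrative sketch only: function names and the { path, content } shape
// are assumptions, not kodr's real implementation.

// Return the paths in files[] that would overwrite something already there.
// `exists` is injected (fs.existsSync would do in a real CLI) so the check
// can be exercised without touching disk.
function findBlockedOverwrites(files, exists) {
  return files.filter((file) => exists(file.path)).map((file) => file.path);
}

// Apply a batch of whole-file writes, refusing the entire batch if any
// entry targets an existing path while the guard is on.
function applyFiles(files, { protectExisting, exists, write }) {
  if (protectExisting) {
    const blocked = findBlockedOverwrites(files, exists);
    if (blocked.length > 0) {
      throw new Error(`--protect-existing: refusing to overwrite ${blocked.join(", ")}`);
    }
  }
  for (const file of files) {
    write(file.path, file.content);
  }
}
```

Checking the whole batch before the first write means a hallucinated app.mjs cannot land alongside legitimate new files.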
Phase 55 (kodr registry). Harder: a multi-file change wiring a new command through dispatch. Three harness obstacles surfaced before a run could pass: the test command node --test <file> wasn’t allowlisted by the verification runner; writes were applied before the test command was validated, so an invalid command left a corrupted app.mjs that needed git checkout to restore; and a leftover untracked file from the failed run then tripped --protect-existing and caused doubled patches. With a clean tree, the final run applied 4 changes (three patches plus the test) and passed, and the model got all three patches right because the prompt handed it the exact literal lines to match. No guessing.
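Why exact literal lines matter is easier to see in code. A minimal sketch, assuming a patch is a `{ search, replace }` pair applied by exact string match; both the shape and the "must match exactly once" rule are assumptions for illustration, not kodr's confirmed patch format:

```javascript
// Illustrative sketch: the { search, replace } shape and the uniqueness
// rule are assumptions, not kodr's confirmed patch format.
function applyExactPatch(source, patch) {
  if (patch.search.length === 0) {
    throw new Error("search string is empty");
  }
  const first = source.indexOf(patch.search);
  if (first === -1) {
    // A guessed indentation or assertion name ends up here.
    throw new Error("search string not found");
  }
  if (source.indexOf(patch.search, first + 1) !== -1) {
    throw new Error("search string matches more than once");
  }
  return source.slice(0, first) + patch.replace + source.slice(first + patch.search.length);
}

// Hypothetical file content with 4-space indentation.
const source = 'switch (command) {\n    case "run":\n        return run();\n}\n';

// Search text copied verbatim from the file: applies cleanly.
const patched = applyExactPatch(source, {
  search: '    case "run":\n',
  replace: '    case "registry":\n        return registry();\n    case "run":\n',
});
```

A search string that guesses tab indentation for the same line would throw "search string not found" instead of silently editing the wrong place.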
Phase 56 (--languages filter). Five patches across two files, no new files. The harness fixes from 55 landed first: pre-flight test-command validation (catch a bad --test before any write) and a git-aware --protect-existing (use git ls-files so untracked leftovers don’t block new files). All five patches applied. One failure - and it’s the interesting one.
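Both fixes come down to ordering: validate everything, then write. A minimal sketch under stated assumptions: the allowlist rule, the function names, and the option names are hypothetical, and the tracked-path set is assumed to come from the output of `git ls-files`:

```javascript
// Illustrative sketch: the allowlist shape, function names, and option
// names are assumptions, not kodr's real code.

// Allowlist check for the verification command, e.g. `node --test <file>`.
function isAllowedTestCommand(argv) {
  return (
    argv.length === 3 &&
    argv[0] === "node" &&
    argv[1] === "--test" &&
    !argv[2].startsWith("-")
  );
}

// Turn `git ls-files` output (one path per line) into a set of tracked paths.
// Simplification: assumes plain paths; `git ls-files -z` with a split on
// "\0" is the robust form for unusual file names.
function parseTrackedPaths(lsFilesOutput) {
  return new Set(lsFilesOutput.split("\n").filter((line) => line.length > 0));
}

// Validate everything first, write last.
function applyChange({ files, testArgv, trackedPaths, write }) {
  if (!isAllowedTestCommand(testArgv)) {
    throw new Error(`test command not allowed: ${testArgv.join(" ")}`);
  }
  // Only tracked files count as "existing"; untracked leftovers do not block.
  const blocked = files.filter((file) => trackedPaths.has(file.path));
  if (blocked.length > 0) {
    throw new Error(`--protect-existing: ${blocked.map((file) => file.path).join(", ")}`);
  }
  for (const file of files) {
    write(file.path, file.content);
  }
}
```

Because the command check runs before the write loop, a bad test argument fails the run while the working tree is still clean, so there is nothing to restore with git checkout.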
The two-location bug
app.mjs has a structural split for every value-consuming flag: one place, the dispatch list, marks the flag as “consumes the next token”, and a separate assignValue switch routes that value to the right field. The model added --languages to assignValue but missed the dispatch list, so the parser treated it as an unknown boolean flag and threw Unknown option: --languages. One-line fix.
That’s the recurring shape: the model reliably patches one of two coupled locations and misses the other. The fix isn’t more model - it’s naming both in the prompt: “this flag takes a value; add it to both the dispatch list and the assignValue switch.” Coupled edit sites are exactly where a small model needs to be told where to look.
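The coupling can be reproduced in a few lines. A minimal sketch that mirrors the structure described above, not the real src/app.mjs: the parser body, the set standing in for the dispatch list, and the example value are all assumptions, and boolean flags are left out for brevity:

```javascript
// Illustrative sketch: function bodies and the example value are
// assumptions that mirror the described structure, not the real app.mjs.

// Location 2: route a consumed value to the right option field.
function assignValue(options, flag, value) {
  switch (flag) {
    case "--languages":
      options.languages = value;
      return;
    default:
      throw new Error(`No assignValue case for ${flag}`);
  }
}

// `valueFlags` stands in for location 1: the flags that consume the next token.
function parseArgs(argv, valueFlags) {
  const options = {};
  for (let i = 0; i < argv.length; i += 1) {
    const token = argv[i];
    if (valueFlags.has(token)) {
      const value = argv[i + 1];
      if (value === undefined) throw new Error(`Missing value for ${token}`);
      assignValue(options, token, value);
      i += 1; // skip the value that was just consumed
    } else {
      throw new Error(`Unknown option: ${token}`);
    }
  }
  return options;
}

// Both locations updated: the flag parses.
const ok = parseArgs(["--languages", "js"], new Set(["--languages"]));

// Only assignValue updated (the phase 56 failure shape): the parser never
// learns that the flag takes a value, so it rejects it as unknown.
let failure = null;
try {
  parseArgs(["--languages", "js"], new Set());
} catch (error) {
  failure = error.message;
}
```

The assignValue case is correct in both calls; only the missing entry in the value-flag set separates a working flag from the error, which is why the prompt has to name both sites.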
The scorecard
| Phase | Task | Model output | Manual fixes | Root cause |
|---|---|---|---|---|
| 54 | Add fields to inspectWorkspace | 1 correct | 2 | No file content in context; hallucinated a whole file |
| 55 | kodr registry command | 3 patches + test | 0 (after harness fixes) | Allowlist gap; writes before validation; stale leftover file |
| 56 | --languages filter | 5 correct | 1 | Missed one of two coupled locations |
By phase 56: zero context failures, zero hallucinations, correct multi-file reasoning. The remaining gap is purely “which spots in a big file must stay in sync”, which is a prompting problem, not a reasoning one.
What works now: --inspect-context puts the right function bodies in front of the model, exact search strings kill indentation guessing, git-aware --protect-existing tells new files from tracked ones, and pre-flight validation catches bad test args before anything touches disk. Every one of those is a fix a failed self-dev run dragged out of the harness. The model editing the tool is how the tool gets better.
Links:
- Phase docs: 55-self-dev-registry-command, 56-self-dev-language-filter
- Kodr blog: 55, 56
- The CLI surface: src/app.mjs