This is where kodr sits today - phase 100, a tidy round number that is absolutely not an ending. These two phases are a fitting place for the blog to catch up to the present: one cleans up a lie that had sat in the codebase since phase 53, and the other turns the project’s central question into a measurable score.
Phase 99: speak the protocol the tools actually speak
The external inspector registry had carried phantom entries since phase 53. Four language-server names sat there claiming --json flags that do not exist: commands such as gopls --json, rust-analyzer --json, and typescript-language-server --json are not real interfaces. Phase 53 used fake commands in tests to validate the registry shape; the real tools were never checked. The auto-inspection defaults made it worse - every bare run now probed those commands, so on any machine with gopls installed, kodr spawned real binaries with bogus arguments on every prompt. Wasted spawns at best, multi-second stalls at worst.
Phase 99 fixes it properly: not by hunting for better flags (there are none) but by speaking the protocol these tools genuinely implement - Language Server Protocol over stdio. Language servers aren’t one-shot JSON CLIs; they’re long-running processes exchanging Content-Length-framed JSON-RPC, with a real lifecycle (initialize → initialized → didOpen/symbols/diagnostics → shutdown/exit). src/lsp-client.mjs implements that from scratch on Node 24 builtins, with no npm dependencies: a framing decoder that tolerates split and coalesced messages, a per-request timeout and a per-run budget, and graceful degradation, where partial results are kept and anything missing falls back to the built-in index. The four invented CLI entries become real LSP invocations (gopls, pyright-langserver --stdio, rust-analyzer, typescript-language-server --stdio), the default registry ships zero CLI entries, and a hygiene test makes the invented-flags regression impossible to quietly reintroduce.
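To make "tolerates split and coalesced messages" concrete, here is a minimal sketch of a Content-Length framing decoder. It is not the code in src/lsp-client.mjs; the names FrameDecoder and encodeFrame are hypothetical, and the only things it assumes are Node's global Buffer and the LSP base-protocol framing (a Content-Length header, a blank line, then that many bytes of JSON).

```javascript
// Hypothetical sketch of LSP framing, not the actual src/lsp-client.mjs.
class FrameDecoder {
  constructor() {
    this.buffer = Buffer.alloc(0); // bytes received but not yet decoded
  }

  // Feed one stdout chunk; returns every message that chunk completes.
  push(chunk) {
    this.buffer = Buffer.concat([this.buffer, chunk]);
    const messages = [];
    for (;;) {
      const headerEnd = this.buffer.indexOf("\r\n\r\n");
      if (headerEnd === -1) break; // header still split across chunks
      const header = this.buffer.subarray(0, headerEnd).toString("ascii");
      const match = /^Content-Length:\s*(\d+)\s*$/im.exec(header);
      if (!match) throw new Error("frame is missing a Content-Length header");
      const length = Number(match[1]); // length is in bytes, not characters
      const bodyStart = headerEnd + 4;
      if (this.buffer.length < bodyStart + length) break; // body still incomplete
      const body = this.buffer.subarray(bodyStart, bodyStart + length);
      messages.push(JSON.parse(body.toString("utf8")));
      // Keep whatever follows: it may be the start of the next frame.
      this.buffer = this.buffer.subarray(bodyStart + length);
    }
    return messages;
  }
}

// Encoder for the other direction (and handy for exercising the decoder).
function encodeFrame(message) {
  const body = Buffer.from(JSON.stringify(message), "utf8");
  const header = Buffer.from(`Content-Length: ${body.length}\r\n\r\n`, "ascii");
  return Buffer.concat([header, body]);
}
```

The loop is what handles both cases: it breaks early when a frame is only partly there, and keeps going when one chunk carries several frames. Working on Buffers rather than strings matters because Content-Length counts bytes, so a non-ASCII character in the payload would throw off a string-length comparison.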
It’s off by default, and for a sharp reason: enabling LSP can execute repository code - rust-analyzer runs build scripts and proc-macros, gopls invokes the go toolchain. So --lsp is opt-in, config can name allowed servers but arbitrary command strings are rejected by name (a cloned repo’s config must never choose what binary kodr runs), and model output never reaches the LSP process - the adapter is driven off the file walker, not model instructions. A bare run with no --lsp spawns nothing. The bogus spawns are gone and locked out.
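The opt-in rule can be sketched as a lookup against a fixed table, so that config only ever selects a name and never supplies a command. This is an illustration under assumptions: KNOWN_SERVERS and resolveServers are hypothetical names, and keying the table by command name is my choice here, not necessarily how kodr's registry is laid out.

```javascript
// Hypothetical allowlist: the four invocations named in the post.
const KNOWN_SERVERS = new Map([
  ["gopls", { command: "gopls", args: [] }],
  ["pyright-langserver", { command: "pyright-langserver", args: ["--stdio"] }],
  ["rust-analyzer", { command: "rust-analyzer", args: [] }],
  ["typescript-language-server", { command: "typescript-language-server", args: ["--stdio"] }],
]);

// lspEnabled stands in for the --lsp flag; requestedNames come from config.
function resolveServers(lspEnabled, requestedNames) {
  if (!lspEnabled) return []; // a bare run spawns nothing
  return requestedNames.map((name) => {
    const entry = KNOWN_SERVERS.get(name);
    if (!entry) {
      // Anything not in the table is rejected, including full command strings.
      throw new Error(`unknown LSP server name: ${JSON.stringify(name)}`);
    }
    return { name, command: entry.command, args: [...entry.args] };
  });
}
```

Because the command and arguments come only from the table, a cloned repo's config can at most pick among binaries the harness already knows about.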
Phase 100: make “can it edit?” a number
Every eval up to now was greenfield - empty directory, model generates, harness checks what appeared. But the real question for a coding harness is whether it can edit an existing codebase, and that had only ever been tested by hand. Phase 100 ships a brownfield edit eval suite: eight fixture repos, each a small broken codebase with one planted defect across JS, TS, Python, Go, and Rust - a function returning a + b + 1, a rename that must propagate, a missing CLI flag, a stale test. Every fixture’s test fails before the fix and passes after a correct edit, and expectFailingBaseline verifies that invariant before any model turn - if a fixture starts passing unexpectedly, it’s flagged invalid.
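The failing-baseline invariant is simple enough to sketch. The version below assumes a test runner and a model turn that are passed in as functions; runCase, runTests, and runModelTurn are hypothetical names for illustration, and only the idea (check the fixture fails before spending a model turn on it) comes from the suite itself.

```javascript
// Hypothetical sketch of the failing-baseline guard.
// runTests(fixture) is assumed to return { passed: boolean }.
function runCase(fixture, runTests, runModelTurn) {
  const before = runTests(fixture);
  if (before.passed) {
    // The planted defect is not detected, so any later "pass" would be meaningless.
    return { status: "invalid", reason: `${fixture.name}: tests already pass before the fix` };
  }
  runModelTurn(fixture); // only reached when the baseline fails as expected
  const after = runTests(fixture);
  return { status: after.passed ? "pass" : "fail" };
}
```

Checking the baseline first keeps a broken fixture from quietly inflating the score.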
The new assertion types check workspace state, not just the proposal: file_modified/file_unchanged via SHA-256 baselines, files_absent (guarding the Nemotron pattern of creating utils.mjs at the root instead of editing tests/utils.mjs), and content_absent (confirming old identifiers were renamed away). Crucially, cases run the full pipeline via runPrompt - tools, apply, verification, heal - and score the workspace on disk, not the raw proposal object. Toolchain probing skips cases needing python3/go/cargo on machines without them, and --record appends to an append-only JSONL so every run accumulates a comparison history.
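Here is a rough sketch of how those four assertion types could be evaluated against a SHA-256 baseline. To stay self-contained it treats the workspace as an in-memory Map of path to content rather than reading from disk, and the assertion field names (type, path, paths, text) are assumptions, not the eval runner's actual schema.

```javascript
import { createHash } from "node:crypto";

function sha256(content) {
  return createHash("sha256").update(content).digest("hex");
}

// Record a hash per file before the model turn (files: Map<path, content>).
function snapshot(files) {
  return new Map([...files].map(([path, content]) => [path, sha256(content)]));
}

// Evaluate one assertion against the workspace as it looks after the run.
function checkAssertion(assertion, baseline, files) {
  switch (assertion.type) {
    case "file_modified": // must still exist and differ from the baseline
      return files.has(assertion.path) &&
        baseline.get(assertion.path) !== sha256(files.get(assertion.path));
    case "file_unchanged":
      return files.has(assertion.path) &&
        baseline.get(assertion.path) === sha256(files.get(assertion.path));
    case "files_absent": // e.g. no stray utils.mjs created at the root
      return assertion.paths.every((path) => !files.has(path));
    case "content_absent": // e.g. the old identifier is gone
      return files.has(assertion.path) &&
        !files.get(assertion.path).includes(assertion.text);
    default:
      throw new Error(`unknown assertion type: ${assertion.type}`);
  }
}
```

Treating a deleted file as a failed file_modified is a choice made in this sketch; the point is that every check looks at the final workspace state rather than at what the model said it would do.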
The failures it shook out are the kind I’ve come to enjoy: a circular import between app.mjs and eval-runner.mjs (fixed by dependency-injecting runPrompt), and a streaming-vs-fake-server mismatch where stream: 'auto' routed to SSE, which the fake server doesn’t speak, so the model “ran” but wrote nothing and every file_modified assertion failed. And my favourite: the version check wouldn’t accept 0.0.100 because roadmapVersion’s regex matched exactly two digits - \d{2} - and so silently skipped phase 100. A round number the code itself couldn’t count to.
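The two-digit bug is easy to reproduce in isolation. The patterns below are stand-ins, since the real roadmapVersion regex isn't shown here; I'm assuming an anchored 0.0.N shape purely for illustration.

```javascript
// Hypothetical stand-ins for the before and after patterns.
const twoDigitsOnly = /^0\.0\.(\d{2})$/; // accepts phases 10 to 99 only
const anyWidth = /^0\.0\.(\d+)$/;        // accepts any number of digits

function phaseOf(version, pattern) {
  const match = pattern.exec(version);
  return match ? Number(match[1]) : null; // null means "not a roadmap version"
}
```

With the first pattern, phaseOf("0.0.99", twoDigitsOnly) gives 99 but phaseOf("0.0.100", twoDigitsOnly) gives null; swapping the quantifier to \d+ makes both parse.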
Onward
That’s the blog caught up - one hundred phases, from a CLI skeleton and a probe to a multi-agent, sandboxed, self-evaluating harness that can edit a brownfield repo and measure whether it did it right. The roadmap isn’t empty; it never is. But it’s a good place to stand and look back at the climb. Thanks for reading along. There’s always more to do.
Links:
- Phase docs: 99-optional-lsp-adapter, 100-brownfield-edit-eval-suite
- Kodr blog: 99, 100
- The LSP client: src/lsp-client.mjs and the eval runner: src/eval-runner.mjs