Five Bugs, a Real Diff, and the Token Bill

A maintenance batch, mostly. No big feature - but phase 39 of kodr is the most instructive kind of phase, because four of the five bugs it fixes were sitting behind a passing test suite. Then phase 40 makes the thing you review readable, and phase 41 finally shows you the bill.

Phase 39: five bugs behind green tests

Every one of these is plausible, not contrived. That’s the lesson.

1. The SSRF guard had a hole the size of a redirect. The fetch_url tool already did the careful thing - parse the URL, reject loopback and private literals, resolve the hostname. But fetch() follows redirects by default, so all that validation only covered the URL we were handed. The model passes a clean public URL, the server answers 302 Location: http://169.254.169.254/latest/meta-data/, and fetch dutifully follows it into the cloud metadata endpoint. Classic redirect bypass. The fix is blunt and correct: redirect: 'manual', reject any 3xx. For a local-first tool there’s no reason to follow redirects at all - refusing them is the smaller surface.
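To make fix 1 concrete, here is a minimal sketch of the redirect refusal. It is not kodr's actual fetch_url code: the helper name fetchWithoutRedirects and the injectable fetchImpl parameter are assumptions for illustration, and it presumes the URL has already passed the loopback and private-address checks. It relies on Node's built-in fetch, which hands back the 3xx response itself when redirect is 'manual'.

```javascript
// Hypothetical helper (not kodr's real code). Assumes the caller has already
// validated the URL against loopback and private addresses.
async function fetchWithoutRedirects(url, fetchImpl = fetch) {
  // 'manual' makes fetch return the redirect response instead of following it.
  const res = await fetchImpl(url, { redirect: 'manual' });

  // Node exposes the real 3xx status; browsers surface manual redirects as
  // opaque responses with status 0, so check for both.
  const isRedirect =
    res.type === 'opaqueredirect' || (res.status >= 300 && res.status < 400);
  if (isRedirect) {
    const target = res.headers.get('location') ?? '(no Location header)';
    throw new Error(`Refusing redirect (${res.status}) to ${target}`);
  }
  return res;
}
```

Because the redirect target is never requested, it never needs to be validated, which is the "smaller surface" argument in code form.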

2. --stream silently threw away tool calls. The streaming branch stitched together delta.content and ignored delta.tool_calls, then hard-coded finish_reason: 'stop'. So kodr run --tools --stream looked fine but the model could never actually call a tool - the harness saw “stop”, treated the empty text as the answer, and moved on. The silent kind of broken. The streamed tool-call protocol is fiddly (first fragment has the id and name, later fragments append argument chunks by index), so the reader now accumulates fragments properly and reports tool_calls when any were collected.
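The fragment accumulation in fix 2 could look roughly like this. It is a sketch under assumptions: accumulateStream and its return shape are hypothetical names, and the chunk layout follows the OpenAI-compatible chat completions stream (choices[0].delta.tool_calls with an index per fragment). It takes already-parsed chunks, so the SSE reading itself is out of scope here.

```javascript
// Hypothetical sketch (not kodr's real reader). Input: parsed stream chunks
// in the OpenAI-compatible chat completions format.
function accumulateStream(chunks) {
  let content = '';
  const toolCalls = []; // slot per tool-call index, filled as fragments arrive
  let finishReason = null;

  for (const chunk of chunks) {
    const choice = chunk.choices?.[0];
    if (!choice) continue; // e.g. a usage-only chunk with empty choices
    const delta = choice.delta ?? {};

    if (delta.content) content += delta.content;

    for (const frag of delta.tool_calls ?? []) {
      // The first fragment for an index carries id and name; later fragments
      // only append argument text.
      const call = (toolCalls[frag.index] ??= {
        id: '',
        type: 'function',
        function: { name: '', arguments: '' },
      });
      if (frag.id) call.id = frag.id;
      if (frag.function?.name) call.function.name = frag.function.name;
      if (frag.function?.arguments) call.function.arguments += frag.function.arguments;
    }

    if (choice.finish_reason) finishReason = choice.finish_reason;
  }

  const calls = toolCalls.filter(Boolean);
  return {
    content,
    tool_calls: calls,
    // Report tool_calls whenever any were collected, never a hard-coded 'stop'.
    finish_reason: calls.length > 0 ? 'tool_calls' : (finishReason ?? 'stop'),
  };
}
```

The argument string is only valid JSON once the last fragment has arrived, so parsing it belongs after the loop, not inside it.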

3. kodr run demanded /models even when you named the model. Every run called listModels() - GET /models - then promptly used the default anyway. Point kodr at a minimal OpenAI-compatible server that doesn’t implement /models (plenty of llama.cpp setups don’t) and every run died before sending a single prompt. Fix: only discover a model when you don’t already have one.
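The shape of fix 3 is small enough to sketch in a few lines. The names here are assumptions: resolveModel is hypothetical, and the client is assumed to expose a listModels() that resolves to an array of model ids.

```javascript
// Hypothetical sketch (not kodr's real code). `client.listModels()` is
// assumed to resolve to an array of model id strings.
async function resolveModel(requestedModel, client) {
  // A model is already known: never touch GET /models.
  if (requestedModel) return requestedModel;

  // Discovery is a fallback, so only servers that need it must support it.
  const models = await client.listModels();
  if (models.length === 0) {
    throw new Error('No model given and the server lists none');
  }
  return models[0];
}
```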

4. parseArgs rejected legitimate values. The guard if (!value || value.startsWith('--')) was meant to catch a missing value, but it also rejected -p "" and -p "--literal text". For a tool whose main input is free-form prose, refusing dash-prefixed text is a real limit. The right check is just “is there a next token at all”. The tradeoff - kodr run -p --json now reads --json as the prompt - is the correct one for a prompt-driven CLI.
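The corrected check from fix 4 can be sketched like this. takeValue is a hypothetical helper, not kodr's actual parseArgs; it only illustrates the "is there a next token at all" test.

```javascript
// Hypothetical sketch (not kodr's real parseArgs): read the value that
// follows the flag at position i.
function takeValue(argv, i, flag) {
  const value = argv[i + 1];
  // Only a missing token is an error. An empty string and dash-prefixed
  // text are both legitimate prompt values.
  if (value === undefined) {
    throw new Error(`${flag} requires a value`);
  }
  return value;
}
```

With this check, takeValue(['-p', '--json'], 0, '-p') returns '--json' as the prompt, which is exactly the tradeoff described above.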

5. Streaming discarded usage. The SSE reader never captured the usage chunk, so streamed runs reported zero tokens and couldn’t enforce --max-tokens. Servers only emit stream usage if you ask, so kodr now sends stream_options: { include_usage: true } and carries the final usage onto the response. Which sets up phase 41.
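A rough sketch of both halves of fix 5: asking for usage, then catching it. buildStreamRequest and extractUsage are hypothetical names, and the input is assumed to be SSE lines that have already been split. On OpenAI-compatible servers the usage object typically arrives in a final chunk whose choices array is empty.

```javascript
// Hypothetical sketch (not kodr's real client code).

// Usage is only streamed when the request explicitly asks for it.
function buildStreamRequest(model, messages) {
  return {
    model,
    messages,
    stream: true,
    stream_options: { include_usage: true },
  };
}

// Scan already-split SSE lines and keep the last usage object seen.
function extractUsage(sseLines) {
  let usage = null;
  for (const line of sseLines) {
    if (!line.startsWith('data:')) continue;
    const payload = line.slice('data:'.length).trim();
    if (payload === '' || payload === '[DONE]') continue;
    const chunk = JSON.parse(payload);
    if (chunk.usage) usage = chunk.usage; // null on ordinary content chunks
  }
  return usage; // stays null when the server never sent usage
}
```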

The takeaway keeps repeating in this project: tests prove the paths you thought of; the bugs live in the paths you didn’t. A review that asks “what if the other side misbehaves” found all five in an afternoon.

A bonus finding: the harness was right, the persona was wrong

Exercising the streaming changes, I asked kodr to generate an Express notes API against a local model. It streamed perfectly, captured 22k tokens of usage (zero before fix 5), skipped /models, stayed in dry-run - and returned files: [] with a note about “reading roadmap.md to identify the first unchecked phase.” The model wasn’t writing the app. It was role-playing me. Run inside the kodr repo, the context pack handed it AGENTS.md, and it adopted the maintainer persona and tried to follow my workflow. Re-run from a clean temp dir, it produced the real thing: six files, all 8 HTTP tests passing. A self-hosting agent repo is a hostile context for generation, because its own process docs read as instructions. Generation belongs in a clean workspace.

Phase 40: a diff worth reading

The old makeDiff emitted a pseudo-diff - every old line as -, every new line as +. For a one-line change in a 200-line file, that’s 400 lines of noise hiding the actual change. And since dry-run is the default, the diff is exactly what you review before approving a write. So it became a real line-level unified diff, with @@ hunk headers and space-prefixed context:

--- math.mjs
+++ math.mjs
@@ -1,6 +1,10 @@
-function add(a,b){
-  return a+b
+function add(a, b) {
+  return a + b;
 }
 
-export {add}
+function sub(a, b) {
+  return a - b;
+}
 
+export { add, sub }

Zero dependencies, on purpose - kodr stays Node-built-ins-only. It’s a ~30-line LCS line diff plus the genuinely fiddly hunk assembly (grouping nearby changes, padding with three lines of context, counting spans). I picked a plain LCS table over git’s default Myers algorithm deliberately: Myers reaches an equally minimal diff with far less time and memory on large inputs, but it’s a lot more code, for a benefit a learning repo never feels. There’s a hard bound too - a plain LCS table grows with the product of the two line counts, so over 2,000 lines on either side, it falls back to the old whole-file dump rather than risk an OOM. A named limit beats a silent crash.
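For a feel of the LCS half, here is a minimal sketch under assumptions. lineDiff and MAX_LINES are illustrative names, the table-based approach is my guess at what "plain LCS" means here, and the hunk assembly (grouping, context padding, @@ headers) is left out entirely. It returns prefixed lines, or null when either side is over the bound so the caller can fall back.

```javascript
// Hypothetical sketch (not kodr's real makeDiff): table-based LCS line diff.
const MAX_LINES = 2000;

function lineDiff(oldLines, newLines) {
  const n = oldLines.length;
  const m = newLines.length;
  // The table is (n + 1) x (m + 1), so bound the input before allocating it.
  if (n > MAX_LINES || m > MAX_LINES) return null;

  // lcs[i][j] = length of the LCS of oldLines[i..] and newLines[j..].
  const lcs = Array.from({ length: n + 1 }, () => new Uint16Array(m + 1));
  for (let i = n - 1; i >= 0; i--) {
    for (let j = m - 1; j >= 0; j--) {
      lcs[i][j] = oldLines[i] === newLines[j]
        ? lcs[i + 1][j + 1] + 1
        : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
    }
  }

  // Walk the table from the top, emitting context, deletions, and additions.
  const ops = [];
  let i = 0;
  let j = 0;
  while (i < n && j < m) {
    if (oldLines[i] === newLines[j]) {
      ops.push(' ' + oldLines[i]);
      i++;
      j++;
    } else if (lcs[i + 1][j] >= lcs[i][j + 1]) {
      ops.push('-' + oldLines[i++]);
    } else {
      ops.push('+' + newLines[j++]);
    }
  }
  while (i < n) ops.push('-' + oldLines[i++]);
  while (j < m) ops.push('+' + newLines[j++]);
  return ops;
}
```

Even at the 2,000-line bound, a Uint16Array table of this shape stays around 8 MB, which is why a fixed limit is a workable guard here.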

Phase 41: show me the bill

Loop budgets have tracked tokens since phase 33, and 39 made streaming capture them - but none of it was visible. Phase 41 puts usage where you actually look: a structured usage field in summary.json, a real breakdown in the CLI (Tokens: 1,234 (prompt 900 / completion 334) Cost: $0.0021), and tokens=N in prompt-history.

One small design call I like: when the server sends no usage, the field is null, not a zero-filled object. Zero would read as “this run genuinely used zero tokens”; null is the honest “no data”. Different shapes for different truths.
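In code, the null-versus-zero distinction might look like this sketch. The function names, the summary field names, and the "n/a" fallback text are all assumptions, not kodr's actual summary.json schema or CLI output. Cost is left out because pricing is not covered here.

```javascript
// Hypothetical sketch (not kodr's real reporting code).

// Turn a server usage object into a summary, or null when there is no data.
function summarizeUsage(usage) {
  if (!usage) return null; // "no data" must not look like "zero tokens"
  const prompt = usage.prompt_tokens ?? 0;
  const completion = usage.completion_tokens ?? 0;
  return {
    prompt,
    completion,
    total: usage.total_tokens ?? prompt + completion,
  };
}

// Render the CLI line; the null case gets its own wording.
function formatUsage(summary) {
  if (summary === null) return 'Tokens: n/a';
  const n = (x) => x.toLocaleString('en-US');
  return `Tokens: ${n(summary.total)} (prompt ${n(summary.prompt)} / completion ${n(summary.completion)})`;
}
```

A consumer of the summary then has to handle null explicitly, which is the point: it cannot silently add a missing run into a total as if it cost nothing.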

Links:

Phase docs: 39-network-and-streaming-hardening, 40-real-unified-diffs, 41-token-usage-reporting
The diff and write gate: src/safe-writes.mjs
Streaming and usage capture: src/model-client.mjs
Budget and usage totals: src/loop-budgets.mjs