Probing Local Models and Building the Test Rig

Continuing from Learning by Commit. The kodr repo is a zero-dependency AI code harness I’m building to learn by doing. Last post was the intro - here’s where it actually starts doing something.

Phase 02: Poke the model

Before anything useful, you need proof the thing talks back.

kodr probe is a small connectivity check. It calls GET /models under --base-url, picks a model, fires a tiny POST /chat/completions, and exits. Flags: --base-url, --model, --api-key, --timeout-ms, --json. Every run writes artifacts to .kodr/runs/<timestamp>/ because trusting terminal output alone is naïve - models are weird and you want receipts.
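Roughly, the two probe steps look like this. It's a minimal sketch, not the actual kodr source: the helper names pickModel and buildProbeRequest, the "first listed model" fallback, the "ping" prompt and the max_tokens value are all my assumptions for illustration. The { data: [{ id }] } shape is the usual OpenAI-style model list.

```javascript
// Hypothetical helper: choose which model to probe.
// Assumption: an explicit --model wins, otherwise take the first model listed.
function pickModel(modelsBody, requested) {
  const ids = (modelsBody.data || []).map((m) => m.id);
  if (requested) {
    if (!ids.includes(requested)) {
      throw new Error(`model not found: ${requested}`);
    }
    return requested;
  }
  if (ids.length === 0) {
    throw new Error("server reported no models");
  }
  return ids[0];
}

// Hypothetical helper: describe the tiny POST /chat/completions call.
// Returns a plain object so it can be written to an artifact before sending.
function buildProbeRequest(baseUrl, model, apiKey) {
  const headers = { "content-type": "application/json" };
  if (apiKey) headers.authorization = `Bearer ${apiKey}`;
  return {
    method: "POST",
    // Strip trailing slashes so the path joins cleanly.
    url: baseUrl.replace(/\/+$/, "") + "/chat/completions",
    headers,
    body: {
      model,
      messages: [{ role: "user", content: "ping" }],
      max_tokens: 8, // assumption: keep the probe cheap
    },
  };
}
```

Building the request as plain data first means the same object can be sent and saved into the run directory, so the receipt is exactly what went over the wire.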

Smoke-tested against LM Studio with nvidia/nemotron-3-nano-omni loaded. It passed. Artifacts written. Moving on.

The forced discipline here is the artifact-first approach. It sounds over-engineered for a probe command, but local models are unpredictable - you need to be able to diff a working run against a broken one without squinting at logs.

Phase files: phases/02-lmstudio-probe.md, blog/02-lmstudio-probe.md

Phase 03: Stop calling LM Studio in tests

The probe test used a lightweight fake server - a node:http thing that implements just enough of the OpenAI API to let tests run without a real model. Phase 03 made that reusable infrastructure.

The fake server implements:

GET /v1/models
POST /v1/chat/completions
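The routing for those two endpoints is small enough to sketch as a pure function. This is an illustration rather than kodr's actual fake server: route, the canned model id "fake-model" and the canned "pong" reply are hypothetical, and the node:http listener that would call it is left out.

```javascript
// Hypothetical routing core for a fake OpenAI-style server.
// Takes the request method, path and parsed JSON body; returns { status, body }.
function route(method, path, body) {
  if (method === "GET" && path === "/v1/models") {
    return {
      status: 200,
      body: { object: "list", data: [{ id: "fake-model", object: "model" }] },
    };
  }
  if (method === "POST" && path === "/v1/chat/completions") {
    return {
      status: 200,
      body: {
        object: "chat.completion",
        // Echo the requested model so tests can assert on it.
        model: (body && body.model) || "fake-model",
        choices: [
          {
            index: 0,
            message: { role: "assistant", content: "pong" },
            finish_reason: "stop",
          },
        ],
      },
    };
  }
  // Anything else is outside the implemented surface.
  return {
    status: 404,
    body: { error: { message: `no route for ${method} ${path}` } },
  };
}
```

Keeping the routing separate from the listener means most of the fake can be exercised without opening a port.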

It also ships with a recorder. Every request logs method, URL, redacted headers, parsed body, response status, response body, and timing. Tests get concrete evidence about both sides of the exchange - not just “it worked” but “here’s exactly what was sent and what came back.”

The recorder redacts authorization so you can assert header behaviour without leaking secrets into test output. Small thing, but it matters when you’re sharing examples.
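A recorder along those lines could look like the sketch below. The names createRecorder and redactHeaders, the entry field names and the "[REDACTED]" placeholder are assumptions of mine, not kodr's real API; the recorded fields just mirror the list above.

```javascript
// Hypothetical redaction helper: copy headers, masking sensitive values.
function redactHeaders(headers) {
  const out = {};
  for (const [name, value] of Object.entries(headers)) {
    // Header names are case-insensitive, so compare in lower case.
    out[name] = name.toLowerCase() === "authorization" ? "[REDACTED]" : value;
  }
  return out;
}

// Hypothetical recorder: keeps one entry per request/response exchange.
function createRecorder() {
  const entries = [];
  return {
    entries,
    record(request, response, durationMs) {
      entries.push({
        method: request.method,
        url: request.url,
        headers: redactHeaders(request.headers), // secrets never reach the log
        body: request.body,
        status: response.status,
        responseBody: response.body,
        durationMs,
      });
    },
  };
}
```

Masking with a fixed placeholder, instead of dropping the header, lets a test assert that authorization was sent without the key ever appearing in output.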

The framing here: model calls fail in ways that are easy to misattribute. Is it a CLI bug? An endpoint shape mismatch? Model weirdness? Caller parsing? The recorder removes ambiguity because you have the whole conversation written down.

Phase files: phases/03-fake-model-server-and-recorder.md, blog/03-fake-model-server-and-recorder.md

Phase 04: `kodr run`

With connectivity proven and tests not requiring LM Studio, kodr run was safe to add.

It takes a prompt via -p/--prompt or --prompt-file and writes four artifacts per run:

prompt.md - exact input
response.md - stitched assistant response
summary.json - stable metadata (no timestamps, so you can diff two runs)
raw-response.json - every raw completion body

The summary omits timestamps deliberately. You want to compare runs by model, prompt size, response size, and finish reasons - not wonder whether a one-second timestamp difference means anything.
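Here's a sketch of what a timestamp-free summary could look like. The field names and the choice to measure size in characters are assumptions; the real summary.json may differ. The point is only that nothing time-dependent goes in.

```javascript
// Hypothetical summary builder. Field names are assumptions; nothing
// time-dependent (timestamps, durations) is included in the object.
function buildSummary(model, prompt, responseText, rawResponses) {
  return {
    model,
    promptChars: prompt.length,
    responseChars: responseText.length,
    completions: rawResponses.length,
    // One finish reason per raw completion body, in call order.
    finishReasons: rawResponses.map((r) => r.choices[0].finish_reason),
  };
}

// Serialize with fixed indentation so the file diffs cleanly line by line.
function summaryJson(summary) {
  return JSON.stringify(summary, null, 2) + "\n";
}
```

Two runs with the same model, prompt and output then produce byte-identical summaries, so any diff between them points at a real change.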

There’s also auto-continuation: if finish_reason is "length", kodr asks the model to continue and stitches the next assistant turn directly onto the previous response text. Raw artifacts preserve every response body so the stitching can be audited.
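The continuation loop can be sketched like this. It's a minimal illustration under stated assumptions, not the kodr implementation: runWithContinuation is a made-up name, complete(messages) stands in for one POST /chat/completions call, the "Continue" wording and the maxContinuations cap are my guesses, and it's synchronous only to keep the sketch short.

```javascript
// Hypothetical continuation loop. `complete(messages)` stands in for one
// POST /chat/completions call and returns the parsed response body.
// A real client would await an HTTP call here instead.
function runWithContinuation(complete, prompt, maxContinuations = 3) {
  const messages = [{ role: "user", content: prompt }];
  const rawResponses = [];
  let text = "";

  for (let i = 0; i <= maxContinuations; i++) {
    const body = complete(messages);
    rawResponses.push(body); // keep every raw body so stitching can be audited

    const choice = body.choices[0];
    const piece = choice.message.content || "";
    text += piece; // stitch directly, with no separator

    if (choice.finish_reason !== "length") break;

    // Assumption: continue by replaying the partial answer and asking for more.
    messages.push({ role: "assistant", content: piece });
    messages.push({ role: "user", content: "Continue exactly where you left off." });
  }

  return { text, rawResponses };
}
```

The cap matters with local models: one that always stops on "length" would otherwise loop until something else gave out.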

This is the first piece that feels like a real harness rather than a probe. You can now point it at a prompt file and get reproducible, inspectable output.

Phase files: phases/04-prompt-runs-and-artifacts.md, blog/04-prompt-runs-and-artifacts.md

Repo: paulkohler/kodr

Next up: defensive JSON extraction - because local models only give you clean JSON while you're watching. Once you turn away…
