Knowing the Model, Shrinking the Conversation


Two phases of kodr that pair up nicely: one teaches the harness what a given model can actually handle, and the other uses that to stop a long session from drowning a small model.

Phase 69: model configuration is harness behaviour

Kodr’s model defaults were scattered across the CLI - the default LM Studio URL, the default Qwen model, the long local timeout, separate assumptions about tool support and JSON behaviour. Fine with one main local model; fragile the moment runs started mixing Qwen, Nemotron, OpenRouter planners, and model-specific context windows. Phase 69 gathers it into a model profile registry.

Built-in profiles cover the default qwen/qwen3.6-35b-a3b, nvidia/nemotron-3-nano-omni, and Ollama/OpenRouter wildcards; projects override via .kodr/model-profiles.json, or you point KODR_MODEL_PROFILES at a file. A profile records model id, provider, base URL, context window, completion reserve, timeout, native tool-call support, and the recommended response-envelope mode. Kodr attaches the active profile to run summaries and subagent metadata, so a failed run can be inspected with the same capability context that shaped it. Two defaults moved behind the profile: timeouts now come from the active profile unless --timeout-ms is set, and session compaction defaults derive from context window minus completion reserve. The change to packing stayed deliberately conservative - profiles can reduce the cap for small windows, but the full token-budget assembly was its own phase, so this one didn’t balloon into a context-packer rewrite.

The lesson worth keeping: model configuration is harness behaviour, not a cosmetic setting. Local models differ enough that context budget, timeout, tool support, and output expectations have to be explicit and artifacted - otherwise you’re debugging a failure without knowing the constraints it ran under.

Phase 70: compaction without lying

Session continuation sent the complete prior conversation back every turn - simple and faithful, and eventually unusable for a small local model. Phase 70 adds deterministic compaction. When a continued transcript exceeds the character budget, kodr keeps the frozen system prompt and the newest user-led turns, and replaces the older ones with an extractive summary pulled from existing artifacts: user intent, constraints, changed file paths, remaining tasks, verification failures, key tool output, decisions.

It uses characters, not claimed tokens - there’s no provider-neutral tokenizer, so the default is an honest 48,000-char budget that --session-context-chars makes explicit and testable. The artifact split is the careful bit:

  • conversation.json - what’s actually sent to the model.
  • conversation-raw.json - the complete, untouched chain.
  • session-summary.json - the extractive summary and compaction metadata.

Future continuations prefer the raw transcript, so kodr never progressively summarizes an already-summarized conversation - the lossy version is for the model, the full version stays for browsing and debugging. Two safety details I want to call out. The summary is injected as explicitly untrusted historical user context, never a system message - prior user, assistant, tool, and artifact text must not get promoted to higher instruction priority just because they got compacted. And kodr never truncates the current user turn to fit; if the frozen prompt plus the live request are themselves over budget, it records overflowChars so the breach is visible rather than silently chopping what you just asked.

Links: