Ask a local model for JSON and you will get JSON. Eventually. First you have to dig it out of the prose it wrapped around it, peel off a markdown fence, and fix the raw newline it stuffed inside a string value. JSON.parse does not care about your feelings here.
This is phase 05 of kodr, my zero-dependency learning harness. The rule from the start has been “treat model output as untrusted”, and nowhere does that bite harder than parsing. So before connecting any of this to file writes, I built a pure extractor module - no model calls, no filesystem, just text in and a value out.
Yes, structured output is a thing these days, but support differs from model to model. This is a way to make sure we can still handle the ones that only clear a very low bar.
What actually breaks
Remember that we are working with small models here, not a frontier model. The fun part of this phase was that I did not invent the failure cases. I captured them. Phase 04 already saves raw model output as artifacts, so the test fixtures are real things qwen handed me during smoke runs:
- prose before and after the JSON (“Sure! Here is the JSON you asked for:”)
- the whole thing fenced in a ```json block
- braces inside string values, which trip up any naive bracket counter
- raw newlines and tabs inside JSON strings, which are illegal but extremely common
- escaped markdown backticks that the model helpfully mangled
Each of those is now a test. The point of the test suite is not to prove the parser works - it is to preserve the exact shapes of garbage the parser is meant to survive.
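To make those shapes concrete, here is a small sketch that runs a few of them straight through JSON.parse. The fixture strings are illustrative stand-ins written for this post, not the captured artifacts, and rejectedByJsonParse is a hypothetical helper name.

```javascript
// Illustrative stand-ins for the failure shapes above (not the real artifacts).
const fixtures = {
  proseWrapped: 'Sure! Here is the JSON you asked for:\n{"ok": true}\nHope that helps!',
  // Three backticks are built with repeat() so this snippet never contains a literal fence.
  fenced: "`".repeat(3) + 'json\n{"ok": true}\n' + "`".repeat(3),
  // The \n here is a real newline character inside the JSON string value.
  rawNewline: '{"note": "line one\nline two"}',
  // The text contains a backslash followed by a backtick, which is not a JSON escape.
  escapedBacktick: '{"cmd": "run \\`npm test\\`"}',
};

// Returns true when JSON.parse refuses the text outright.
function rejectedByJsonParse(text) {
  try {
    JSON.parse(text);
    return false;
  } catch {
    return true;
  }
}

for (const [name, text] of Object.entries(fixtures)) {
  console.log(name, rejectedByJsonParse(text) ? "rejected" : "parsed");
}
```

The braces-inside-strings case is left out on purpose: JSON.parse copes with it fine, and it only hurts once you start counting brackets yourself.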
The approach
Two strategies, tried in order:
- Fenced blocks first. If there is a ```json (or bare ```) fence, grab what is inside. Models reach for fences constantly, and when they do, it is the cleanest signal you will get.
- Brace-walk the raw text. Find the first { or [, then walk forward tracking string state so braces inside quoted values do not throw off the depth count. Escapes respected, quotes respected.
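The fenced-block strategy can be sketched with a single regular expression. This is a minimal illustration, not the code in the extractor module: extractFencedBlocks, FENCE and FENCE_RE are hypothetical names, and it assumes the opening fence carries either a json tag or no tag at all.

```javascript
// Build the fence marker without typing three literal backticks in a row.
const FENCE = "`".repeat(3);

// Opening fence with an optional "json" tag, then the body captured lazily
// up to the next fence marker. Case-insensitive so "JSON" also matches.
const FENCE_RE = new RegExp(
  FENCE + "(?:json)?[ \\t]*\\r?\\n([\\s\\S]*?)" + FENCE,
  "gi"
);

// Returns the trimmed body of every fenced block, in document order.
function extractFencedBlocks(text) {
  const bodies = [];
  for (const match of text.matchAll(FENCE_RE)) {
    const body = match[1].trim();
    if (body.length > 0) bodies.push(body);
  }
  return bodies;
}

const sample =
  "Sure! Here is the JSON you asked for:\n" +
  FENCE + 'json\n{"a": 1}\n' + FENCE +
  "\nDone.";
console.log(extractFencedBlocks(sample)); // [ '{"a": 1}' ]
```

A fence with some other language tag can confuse this pattern, which is acceptable here because every candidate still has to survive JSON.parse before it counts.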
Then, before handing any candidate to JSON.parse, run a repair pass: raw newlines, carriage returns and tabs inside strings get escaped properly, and escaped backticks get converted back to literal ones. Try to parse. If it throws, fall through to the next candidate and collect the error. Only if everything fails do you raise.
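A rough sketch of that repair-then-parse loop follows. The names repairJsonText and parseFirstValid are hypothetical, and the sketch assumes the mangled backticks only show up inside string values.

```javascript
// Escapes raw newlines, carriage returns and tabs found inside JSON strings,
// and turns a backslash-backtick pair back into a literal backtick.
// Whitespace outside strings is left alone, since it is legal there.
function repairJsonText(text) {
  let out = "";
  let inString = false;
  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    if (!inString) {
      if (ch === '"') inString = true;
      out += ch;
    } else if (ch === "\\") {
      const next = text[i + 1];
      if (next === "`") {
        out += "`"; // not a valid JSON escape, keep only the backtick
      } else {
        out += ch + (next ?? ""); // copy any other escape pair untouched
      }
      i++; // the escaped character has been consumed
    } else if (ch === '"') {
      inString = false;
      out += ch;
    } else if (ch === "\n") {
      out += "\\n";
    } else if (ch === "\r") {
      out += "\\r";
    } else if (ch === "\t") {
      out += "\\t";
    } else {
      out += ch;
    }
  }
  return out;
}

// Tries each candidate in order, collecting errors. Throws only if all fail.
function parseFirstValid(candidates) {
  const errors = [];
  for (const candidate of candidates) {
    try {
      return JSON.parse(repairJsonText(candidate));
    } catch (err) {
      errors.push(err.message);
    }
  }
  throw new Error("No parseable JSON candidate: " + errors.join(" | "));
}

console.log(parseFirstValid(["not json", '{"note": "line one\nline two"}']));
```

Collecting the error messages rather than dropping them means the final failure says why each candidate was rejected.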
The shape that mattered most was the string-state tracking in the brace walk. A plain “count the braces” loop looks correct until a model returns {"note": "use the }} syntax"} and your depth counter goes haywire. You have to know when you are inside a string.
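Here is a minimal sketch of a brace walk that tracks string state. findBalancedSpan is a hypothetical name, and the sketch keeps one depth counter for both bracket types, leaving mismatched pairs for JSON.parse to reject later.

```javascript
// Returns the first balanced {...} or [...] span in the text,
// or null if there is no opening bracket or it never closes.
function findBalancedSpan(text) {
  const start = text.search(/[{[]/);
  if (start === -1) return null;

  let depth = 0;
  let inString = false;
  for (let i = start; i < text.length; i++) {
    const ch = text[i];
    if (inString) {
      if (ch === "\\") {
        i++; // skip the escaped character, so \" does not end the string
      } else if (ch === '"') {
        inString = false;
      }
      // every other character inside a string is ignored, braces included
    } else if (ch === '"') {
      inString = true;
    } else if (ch === "{" || ch === "[") {
      depth++;
    } else if (ch === "}" || ch === "]") {
      depth--;
      if (depth === 0) return text.slice(start, i + 1);
    }
  }
  return null; // ran out of text before the first bracket was closed
}

const reply = 'Sure: {"note": "use the }} syntax"} thanks!';
console.log(findBalancedSpan(reply)); // {"note": "use the }} syntax"}
```

Drop the inString branch and the same input returns at the first } inside the string, which is exactly the haywire depth count described above.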
Keeping it boring
The discipline that made this phase easy was not doing the tempting thing. No applying proposals, no dry-run logic, no writes - that all belongs to a later phase. This module takes a string and returns a value or throws. That is the entire contract. A pure function with a pile of nasty fixtures is about as pleasant as parsing untrusted text ever gets.
Links:
- Phase doc: phases/05-defensive-json-extraction.md
- Kodr post: blog/05-defensive-json-extraction.md
- The extractor: src/json-extractor.mjs