Structured Outputs Made LLMs Easier To Ship
Why schemas, tool calls, and constrained decoding changed how production LLM apps are built.
The biggest gap between an LLM demo and an LLM product is usually not intelligence. It is reliability.
A demo can return a paragraph. A product often needs an object:
{
"intent": "refund_request",
"priority": "medium",
"needs_human": false
}
That object might route a support ticket, call an API, update a CRM, or trigger a workflow. If the model returns almost-valid JSON, adds a friendly sentence before the object, or forgets a required field, the downstream system breaks.
Structured outputs changed the practical interface to LLMs. Instead of treating the model as a text box, you treat it as a component that must satisfy a contract.
The Problem With “Just Return JSON”
Early LLM applications often used prompts like this:
Return valid JSON and nothing else.
That works until it does not.
Common failures:
- A trailing comma makes the JSON invalid
- The model wraps the JSON in Markdown
- A field is missing
- A string appears where a number is required
- The model invents a new enum value
- The answer includes helpful prose outside the object
You can patch around this with parsing retries and regex cleanup, but that is a fragile foundation. The more important the workflow, the less acceptable “probably valid” becomes.
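To make these failures concrete, here is a small sketch that feeds a few hypothetical model replies to a strict JSON parser. The reply strings are invented for illustration; only the valid one parses.

```python
import json

FENCE = "`" * 3  # Markdown code fence, built here to keep this example readable

# Hypothetical raw model replies illustrating some of the failure modes above.
replies = {
    "valid": '{"intent": "refund_request", "priority": "medium", "needs_human": false}',
    "trailing_comma": '{"intent": "refund_request", "priority": "medium",}',
    "markdown_wrapped": FENCE + 'json\n{"intent": "refund_request"}\n' + FENCE,
    "prose_prefix": 'Sure! Here is the JSON: {"intent": "refund_request"}',
}


def try_parse(raw: str):
    """Return the parsed object, or None if the reply is not a single JSON document."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return None


# Map each reply name to whether it parsed cleanly.
results = {name: try_parse(raw) is not None for name, raw in replies.items()}
print(results)
```

Parsing only catches syntax problems. A reply that parses can still be missing a field or carry an invented enum value, which is why a schema is the next step.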
Schemas Create A Contract
The first improvement is to define the output shape as a schema. Even if the model is still generating text, the application has something concrete to validate.
{
"type": "object",
"required": ["intent", "priority", "needs_human"],
"properties": {
"intent": {
"type": "string",
"enum": ["refund_request", "bug_report", "sales_question", "other"]
},
"priority": {
"type": "string",
"enum": ["low", "medium", "high"]
},
"needs_human": {
"type": "boolean"
}
}
}
This does two useful things. It tells the model what shape to produce, and it gives your code a strict way to reject bad output.
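As a rough illustration of the "reject bad output" half, the sketch below checks an object against the schema above using only the standard library. The `validate` helper is hypothetical and supports just the keywords this schema uses (`type`, `required`, `properties`, `enum`); a real application would more likely rely on a full JSON Schema validator.

```python
SCHEMA = {
    "type": "object",
    "required": ["intent", "priority", "needs_human"],
    "properties": {
        "intent": {
            "type": "string",
            "enum": ["refund_request", "bug_report", "sales_question", "other"],
        },
        "priority": {"type": "string", "enum": ["low", "medium", "high"]},
        "needs_human": {"type": "boolean"},
    },
}

# Map the JSON Schema type names used here to Python types.
TYPE_MAP = {"string": str, "boolean": bool, "object": dict}


def validate(obj, schema) -> list:
    """Return a list of error strings; an empty list means the object is valid.

    Illustrative subset only: handles 'type', 'enum', 'required', 'properties'.
    """
    if not isinstance(obj, TYPE_MAP[schema["type"]]):
        return [f"expected {schema['type']}, got {type(obj).__name__}"]
    errors = []
    if "enum" in schema and obj not in schema["enum"]:
        errors.append(f"{obj!r} is not one of {schema['enum']}")
    for name in schema.get("required", []):
        if name not in obj:
            errors.append(f"missing required field: {name}")
    for name, sub_schema in schema.get("properties", {}).items():
        if name in obj:
            # Prefix nested errors with the field name so they are actionable.
            errors += [f"{name}: {e}" for e in validate(obj[name], sub_schema)]
    return errors


good = {"intent": "refund_request", "priority": "medium", "needs_human": False}
bad = {"intent": "refund_request", "priority": "urgent"}
print(validate(good, SCHEMA))
print(validate(bad, SCHEMA))
```

Returning a list of messages rather than a boolean is a deliberate choice here: the same messages can be logged or sent back to the model for a repair attempt.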
Tool Calls Are Structured Outputs With Intent
Tool calling is the same idea applied to actions. The model does not merely answer. It chooses a function and supplies arguments.
{
"name": "create_refund_case",
"arguments": {
"order_id": "ORD-9812",
"reason": "duplicate charge",
"priority": "medium"
}
}
This is a major architectural shift. The LLM becomes a planner or router, while deterministic code performs the real action.
That separation matters:
- The model decides what should happen
- The application validates whether it is allowed
- The tool executes with normal logging, permissions, and retries
- The final response explains the result to the user
The model should not be trusted to “do” the side effect directly. It should propose a structured action that code can inspect.
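One way to picture that separation is a small dispatcher that inspects the proposed call before anything runs. This is a sketch under assumptions: the registry layout, the `allowed` permission set, and the stubbed `create_refund_case` handler are all hypothetical.

```python
def create_refund_case(order_id: str, reason: str, priority: str) -> dict:
    # Stand-in for the real side effect (API call, database write, and so on).
    return {"status": "created", "order_id": order_id, "priority": priority}


# Hypothetical tool registry: tool name -> (handler, required argument names).
TOOLS = {
    "create_refund_case": (create_refund_case, {"order_id", "reason", "priority"}),
}


def dispatch(call: dict, allowed: set) -> dict:
    """Validate a model-proposed tool call, then execute it in normal code."""
    name = call.get("name")
    args = call.get("arguments")
    if name not in TOOLS:
        return {"status": "rejected", "error": f"unknown tool: {name}"}
    if name not in allowed:
        return {"status": "rejected", "error": f"tool not permitted: {name}"}
    handler, required = TOOLS[name]
    if not isinstance(args, dict) or set(args) != required:
        return {"status": "rejected", "error": f"arguments must be exactly {sorted(required)}"}
    return handler(**args)


proposed = {
    "name": "create_refund_case",
    "arguments": {"order_id": "ORD-9812", "reason": "duplicate charge", "priority": "medium"},
}
print(dispatch(proposed, allowed={"create_refund_case"}))
print(dispatch(proposed, allowed=set()))
```

The model's output never reaches the handler unless the name is registered, the caller is permitted to use it, and the argument names match exactly.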
Separate Extraction From Decision-Making
One design mistake is asking a single model call to extract facts and make a business decision at the same time.
For example:
{
"refund_approved": true,
"reason": "Customer is within the refund window"
}
This looks convenient, but it hides two different tasks:
- Extract the purchase date, plan type, and requested refund reason
- Apply the refund policy to those extracted facts
The second step should often be deterministic code. If the policy says refunds are allowed for 30 days, the application can calculate that. The model should extract the date; code should compare it.
A safer design:
{
"purchase_date": "March 14",
"plan_type": "annual",
"refund_reason": "duplicate charge",
"missing_fields": []
}
Then the application decides eligibility. This reduces the model’s authority and makes failures easier to debug.
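The deterministic half might look like the sketch below. It assumes the extraction prompt asks for dates in ISO 8601 form (YYYY-MM-DD); a value without a year, such as "March 14", cannot be compared and is sent to a human here. The function name, return shape, and example dates are hypothetical.

```python
from datetime import date

REFUND_WINDOW_DAYS = 30  # the 30-day policy from the example above


def refund_decision(extracted: dict, today: date) -> dict:
    """Apply the refund policy to model-extracted facts.

    The model only supplies facts; eligibility is computed here.
    """
    if extracted.get("missing_fields"):
        return {"eligible": None, "needs_human": True}
    try:
        purchased = date.fromisoformat(extracted["purchase_date"])
    except (KeyError, TypeError, ValueError):
        # Missing or unparseable date: do not guess, escalate instead.
        return {"eligible": None, "needs_human": True}
    age_days = (today - purchased).days
    eligible = 0 <= age_days <= REFUND_WINDOW_DAYS
    return {"eligible": eligible, "needs_human": False}


facts = {
    "purchase_date": "2024-03-14",  # hypothetical ISO-formatted extraction
    "plan_type": "annual",
    "refund_reason": "duplicate charge",
    "missing_fields": [],
}
print(refund_decision(facts, today=date(2024, 4, 1)))  # 18 days after purchase
print(refund_decision(facts, today=date(2024, 5, 1)))  # 48 days after purchase
```

Because the policy lives in ordinary code, changing the window is a one-line edit with a unit test, not a prompt change.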
Constrained Decoding Reduces The Surface Area
Validation catches bad outputs after generation. Constrained decoding tries to prevent impossible outputs during generation.
If the output must be valid JSON, the decoder can restrict the next token choices to tokens that keep the JSON valid. If a field is an enum, the model can be constrained to one of the allowed values.
This does not make the model semantically correct. It can still choose the wrong intent or fill a field with the wrong value. But it removes an entire class of syntax failures.
That is a good trade: let the model make judgment calls, but do not let it break the parser.
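The enum case can be shown with a toy, character-level decoder. This is only a sketch of the idea: real systems mask scores over the tokenizer's vocabulary using a grammar or automaton, and the `toy_score` function below is a made-up stand-in for model logits.

```python
ALLOWED = ["low", "medium", "high"]


def allowed_next_chars(prefix: str, choices: list) -> set:
    """Characters that keep `prefix` a prefix of at least one allowed value."""
    return {
        c[len(prefix)]
        for c in choices
        if c.startswith(prefix) and len(c) > len(prefix)
    }


def constrained_decode(score, choices: list) -> str:
    """Greedily decode one enum value.

    `score(prefix, char)` stands in for the model's preference for `char`.
    Only characters that can still lead to an allowed value are considered.
    Stops at the first complete allowed value.
    """
    out = ""
    while out not in choices:
        candidates = sorted(allowed_next_chars(out, choices))
        out += max(candidates, key=lambda ch: score(out, ch))
    return out


# Toy preferences: unconstrained, this "model" would start writing "urgent".
PREFS = {"u": 0.9, "m": 0.6, "h": 0.3, "l": 0.1}


def toy_score(prefix: str, ch: str) -> float:
    return PREFS.get(ch, 0.0)


print(constrained_decode(toy_score, ALLOWED))
```

The decoder cannot emit "urgent" because "u" is never a candidate, so it falls back to its best permitted option. Whether that option is the right one is still a question for evals, not for the decoder.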
The Retry Loop Still Matters
Even with schemas, your application needs a recovery path.
A practical loop:
- Ask the model for structured output
- Validate against the schema
- If validation fails, send the error back to the model once
- Validate again
- If it still fails, route to fallback logic
The validation error should be specific:
The field priority must be one of: low, medium, high.
You returned: urgent.
Return the corrected JSON object only.
This is much better than retrying the same prompt and hoping. The model gets actionable feedback, and the application keeps control.
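The loop above can be sketched as follows. `call_model` is a hypothetical callable that takes a prompt string and returns the model's raw text; the validation is reduced to the `priority` check so the example stays short.

```python
import json

PRIORITIES = ["low", "medium", "high"]


def validate(raw: str):
    """Return (obj, None) on success or (None, error_message) on failure."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, f"The output was not valid JSON: {exc.msg}."
    if not isinstance(obj, dict):
        return None, "The output must be a JSON object."
    if obj.get("priority") not in PRIORITIES:
        return None, (
            f"The field priority must be one of: {', '.join(PRIORITIES)}.\n"
            f"You returned: {obj.get('priority')}."
        )
    return obj, None


def get_structured(call_model, prompt: str, fallback: dict, max_repairs: int = 1) -> dict:
    """Ask once, allow a bounded number of repair attempts, then fall back."""
    raw = call_model(prompt)
    for attempt in range(max_repairs + 1):
        obj, error = validate(raw)
        if error is None:
            return obj
        if attempt < max_repairs:
            # Send the specific validation error back instead of retrying blindly.
            raw = call_model(f"{prompt}\n\n{error}\nReturn the corrected JSON object only.")
    return fallback


# Fake model for illustration: wrong enum value first, corrected on the repair attempt.
replies = iter(['{"priority": "urgent"}', '{"priority": "high"}'])
result = get_structured(lambda p: next(replies), "Classify this ticket.", fallback={"needs_human": True})
print(result)
```

Capping repairs at one keeps latency and cost predictable; anything that still fails goes to the fallback path rather than looping.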
Designing Better Schemas
Bad schemas make models worse. If the schema is too broad, the model has too much room to improvise. If it is too narrow, every edge case becomes an error.
Useful patterns:
- Prefer enums for closed categories
- Use null explicitly when a value may be missing
- Separate confidence from the actual decision
- Include reasoning_summary only if a human will inspect it
- Keep user-facing text separate from machine fields
For example, do not overload one field:
{
"status": "maybe refund but ask support"
}
Use separate fields:
{
"intent": "refund_request",
"eligible": null,
"needs_human": true,
"missing_information": ["purchase_date"]
}
The second object is easier to validate, route, and test.
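To show why, here is a sketch of routing code written against the separate fields. The queue names are hypothetical; the point is that every branch reads a typed field and none of them parses free text.

```python
def route(ticket: dict) -> str:
    """Pick a queue from machine fields only (queue names are illustrative)."""
    if ticket["needs_human"] or ticket["missing_information"]:
        return "human_review"
    if ticket["intent"] == "refund_request" and ticket["eligible"] is True:
        return "auto_refund"
    return "standard_queue"


ticket = {
    "intent": "refund_request",
    "eligible": None,
    "needs_human": True,
    "missing_information": ["purchase_date"],
}
print(route(ticket))
```

The overloaded "status" string from the first object offers no equivalent: any routing on it would have to guess at meaning from wording.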
Where Structured Outputs Fail
Structured outputs are not magic. They can create a false sense of reliability because the object looks clean.
A perfectly valid object can still be wrong:
{
"intent": "refund_request",
"priority": "low",
"needs_human": false
}
If the user is angry about a duplicate enterprise charge, that priority may be dangerously wrong.
You still need evals:
- Does the model choose the right intent?
- Does it ask for missing information?
- Does it refuse unsafe actions?
- Does it call tools only when required arguments exist?
- Does it preserve exact IDs and dates?
The structure makes failures easier to detect, but it does not remove the need to measure them.
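A minimal eval harness for the first kind of check might look like this. Everything here is an assumption for illustration: the two labeled cases are invented, and `stub_classify` is a deliberately naive stand-in for a real model call.

```python
FIELDS = ["intent", "priority", "needs_human"]


def evaluate(classify, cases: list) -> dict:
    """Return per-field accuracy of `classify` over (text, expected) pairs."""
    correct = {field: 0 for field in FIELDS}
    for text, expected in cases:
        got = classify(text)
        for field in FIELDS:
            correct[field] += int(got.get(field) == expected[field])
    return {field: correct[field] / len(cases) for field in FIELDS}


# Hypothetical labeled cases; a real suite would be much larger.
CASES = [
    (
        "I was charged twice on our enterprise plan!",
        {"intent": "refund_request", "priority": "high", "needs_human": True},
    ),
    (
        "How do I export my data?",
        {"intent": "other", "priority": "low", "needs_human": False},
    ),
]


def stub_classify(text: str) -> dict:
    # Naive stand-in for the model: always valid, not always right.
    intent = "refund_request" if "charged" in text else "other"
    return {"intent": intent, "priority": "low", "needs_human": False}


scores = evaluate(stub_classify, CASES)
print(scores)
```

Every output of the stub passes schema validation, yet the per-field scores expose that it under-prioritizes the duplicate-charge ticket. Scoring field by field is what the structure makes possible.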
The Takeaway
Structured outputs made LLM applications feel less like prompt demos and more like software systems.
The core idea is simple: text is for humans, structure is for machines. When an LLM needs to drive code, give it a schema, validate aggressively, constrain where possible, and keep business rules outside the model.
That is how you turn a probabilistic model into a component that can live inside a deterministic application.