Two Paths to Structured Data
Vibe-articled by Opus 4.5
Same problem, two approaches. Not sure which wins yet.
The problem: turn messy bank statement PDFs into structured JSON. Every transaction captured, every memo character-perfect, balances that reconcile. Sounds trivial until you're staring at 47 different PDF layouts.
Approach 1: Raw LLM Power
Give a frontier model the PDF, a schema, let it rip:
# The "just ask nicely" approach
def system_prompt
<<~PROMPT
parse this document perfectly confirming to the schema
and not missing a single character.
PROMPT
end
service = llm_service.with_schema(LINES_SCHEMA).with_temperature(0.15)
service.add_message role: :user, content: RubyLLM::Content.new(nil, pdf_attachment)
reply = service.simply_complete
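LINES_SCHEMA isn't shown above; for orientation, a pared-down version might look something like the following (field names are illustrative guesses, and this assumes with_schema accepts a plain JSON Schema hash):

# Illustrative only: a minimal JSON Schema for the structured output.
# The real LINES_SCHEMA presumably has more fields and stricter constraints.
LINES_SCHEMA = {
  type: 'object',
  required: %w[start_balance end_balance transactions],
  properties: {
    start_balance: { type: 'number' },
    end_balance: { type: 'number' },
    transactions: {
      type: 'array',
      items: {
        type: 'object',
        required: %w[date amount memo],
        properties: {
          date: { type: 'string', format: 'date' },
          amount: { type: 'number' },
          memo: { type: 'string' }
        }
      }
    }
  }
}.freeze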
Betting on raw capability. Gemini 2.5 Pro sees the PDF, understands the schema, outputs JSON directly. Token by token. One shot.
Verification is external - reconcile the math, check start_balance + sum(transactions) = end_balance, retry if not:
transitions do
  auto from: :parsing, to: :reconciling
  auto from: :reconciling, to: :verifying, if: :tallies_match?
  auto from: :reconciling, to: :parsing, if: :retries_available? # try again
end
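The appeal is that tallies_match? is deterministic and cheap. A minimal sketch of what it might look like, assuming the parsed payload is a plain hash and amounts are handled as BigDecimal (the real step code isn't shown in the post, and parsed_payload is a hypothetical accessor):

require 'bigdecimal'
require 'bigdecimal/util'

# Deterministic reconciliation: the statement has to balance to the penny.
# `parsed_payload` is a hypothetical accessor for the model's parsed JSON.
def tallies_match?
  start_balance = parsed_payload['start_balance'].to_s.to_d
  end_balance   = parsed_payload['end_balance'].to_s.to_d
  total         = parsed_payload['transactions'].sum { |txn| txn['amount'].to_s.to_d }

  start_balance + total == end_balance
end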
Works often. But failures are weird - truncated memos, hallucinated dates. The model generates tokens, not logic.
Approach 2: JIT Code Generation
Instead of outputting the answer, ask the model to write a program that outputs the answer:
# The "write code, then run it" approach
messages: [
  {
    role: 'user',
    content: [
      { type: 'container_upload', file_id: pdf_file_id },
      { type: 'container_upload', file_id: schema_file_id },
      { type: 'text', text: <<~PROMPT
        Please use the PDF skill to analyze the document, then write
        a Python script using code execution to:
        1. Read and extract all transaction data from the PDF
        2. Parse dates, amounts, and memos carefully
        3. Generate JSON that conforms exactly to the schema
        4. Save the final JSON to 'output.json'
        Iterate on your Python script until it produces perfect output.
      PROMPT
      }
    ]
  }
]
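The pdf_file_id and schema_file_id referenced above come from uploading both files first. A hedged sketch of that step, assuming the Ruby SDK exposes the beta Files API with an upload method analogous to the Python client's (method name, return shape, and file names are all assumptions; check the SDK docs):

require 'pathname'

# Assumption: client.beta.files.upload mirrors the Python SDK's Files API call;
# the Files API is in beta, so an extra beta flag/header may also be required.
pdf_file_id    = client.beta.files.upload(file: Pathname('statement.pdf')).id
schema_file_id = client.beta.files.upload(file: Pathname('lines_schema.json')).id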
The model writes Python, runs it, sees the output, fixes bugs, runs again. Code execution happens in a sandboxed container with PDF-parsing libraries available. The model acts as a coding agent for this task.
# Handle pause_turn: Claude iterates on its script
while current_response.stop_reason == 'pause_turn' && continue_count < max_continues
  messages << { role: 'assistant', content: current_response.content }
  current_response = client.beta.messages.create(
    model: MODEL,
    container: container_id, # same execution environment
    tools: [{ type: 'code_execution_20250825', name: 'code_execution' }],
    messages: messages
  )
  continue_count += 1 # without this the continue cap never kicks in
end
Exploiting the Verifier-Generator Gap
LLMs are better at verifying than generating. Both approaches exploit this, differently.
Raw LLM: external verification loop. Generate, verify with deterministic checks, retry on failure. Model never sees its mistakes.
Code execution: internal verification loop. Generate code, run it, see output, fix the script. Model becomes both generator and verifier.
Generating JSON token by token means predicting characters without seeing the whole structure. There's no going back to fix anything. But writing code lets the model:
- Write a parser that handles edge cases explicitly
- Run the code and see actual output
- Debug based on concrete errors
- Iterate until the output matches expectations
The Tradeoffs
| | Raw LLM | Code Execution |
|---|---|---|
| Speed | Fast (single call) | Slow (iterations) |
| Cost | Cheaper per attempt | More expensive |
| Reliability | Brittle on edge cases | More deterministic |
| Failure mode | Silent (wrong data) | Loud (exceptions) |
| Self-correction | No | Yes |
What I'm Actually Seeing
Both agents share the same verification pipeline. The difference is how they get to the first draft.
Code execution feels more robust. Failures are understandable (PDF library can't handle format). Raw LLM failures are mysterious (why truncate that memo?).
But code execution is more complex. More API calls, more state, more orchestration failure modes.
The Bigger Picture
This maps to a broader pattern. The excitement around terminal computer use isn't just about bash access - it's about letting models write and execute code as a thinking tool.
- Give every agent run an associated container runtime
- Give said agent tools: read file, edit file, run bash command (sketched below)
- Let the agent go wild - chaining tools, generating code JIT
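A minimal sketch of those tool definitions in the generic name/description/input_schema format most LLM APIs use (everything here is illustrative, not from the post):

# Illustrative tool specs: each tool is a name, a description, and a JSON
# Schema for its input. The agent loop decides when and how to call them.
AGENT_TOOLS = [
  {
    name: 'read_file',
    description: 'Read a file from the agent container and return its contents.',
    input_schema: {
      type: 'object',
      properties: { path: { type: 'string' } },
      required: %w[path]
    }
  },
  {
    name: 'edit_file',
    description: 'Replace a string in a file inside the agent container.',
    input_schema: {
      type: 'object',
      properties: {
        path: { type: 'string' },
        old_text: { type: 'string' },
        new_text: { type: 'string' }
      },
      required: %w[path old_text new_text]
    }
  },
  {
    name: 'run_bash',
    description: 'Run a shell command inside the agent container, return stdout/stderr.',
    input_schema: {
      type: 'object',
      properties: { command: { type: 'string' } },
      required: %w[command]
    }
  }
].freeze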
Agent-directed loops + terminal + coding agent = broad coverage. Models have enormous skill collections in distribution.
For structured extraction: trust the model to output correct JSON directly, or trust it to write correct code that outputs JSON?
Betting on the latter. Models are better at recognizing correct output than producing it directly. Code execution lets them iterate toward correctness.
The Punchline
Both approaches use the same verification harness:
agent_pattern do
  steps initial: :idle do
    step :parsing, Accounting::Steps::ParsingStep # or CodeExecutionParsingStep
    step :reconciling, Accounting::Steps::ReconcileStep
    step :verifying, Accounting::Steps::VerifyStep
    step :done
  end
end
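The post doesn't show how the parsing step gets swapped; conceptually it's a single strategy choice up front, something like this hypothetical sketch (the selection flag and wiring are my assumptions; the step classes are the ones above):

# Hypothetical: choose the first-draft strategy once, keep the rest of the
# harness identical for both approaches.
parsing_step =
  if ENV.fetch('EXTRACTION_STRATEGY', 'raw_llm') == 'code_execution'
    Accounting::Steps::CodeExecutionParsingStep
  else
    Accounting::Steps::ParsingStep
  end

agent_pattern do
  steps initial: :idle do
    step :parsing, parsing_step # everything downstream stays the same
    step :reconciling, Accounting::Steps::ReconcileStep
    step :verifying, Accounting::Steps::VerifyStep
    step :done
  end
end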
Philosophical difference: "model is smart enough to get it right" vs. "model is smart enough to write code that gets it right."
A lack of 9's propagates through longer trajectories. Verify rigorously. The question is whether you verify the model's output or the output of the model's code.
Don't know which wins yet. Know which one lets me sleep better.
