Two Paths to Structured Data
Vibe-articled by Opus 4.5
Same problem, two approaches. Not sure which wins yet.
The problem: turn messy bank statement PDFs into structured JSON. Every transaction captured, every memo character-perfect, balances that reconcile. Sounds trivial until you're staring at 47 different PDF layouts.
Approach 1: Raw LLM Power
Give a frontier model the PDF, a schema, let it rip:
# The "just ask nicely" approach
def system_prompt
<<~PROMPT
parse this document perfectly confirming to the schema
and not missing a single character.
PROMPT
end
service = llm_service.with_schema(LINES_SCHEMA).with_temperature(0.15)
service.add_message role: :user, content: RubyLLM::Content.new(nil, pdf_attachment)
reply = service.simply_complete
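LINES_SCHEMA isn't shown above; for orientation, a pared-down version might look something like the following (field names are illustrative guesses, and this assumes with_schema accepts a plain JSON Schema hash):

# Illustrative only: a minimal JSON Schema for the structured output.
# The real LINES_SCHEMA presumably has more fields and stricter constraints.
LINES_SCHEMA = {
  type: 'object',
  required: %w[start_balance end_balance transactions],
  properties: {
    start_balance: { type: 'number' },
    end_balance: { type: 'number' },
    transactions: {
      type: 'array',
      items: {
        type: 'object',
        required: %w[date amount memo],
        properties: {
          date: { type: 'string', format: 'date' },
          amount: { type: 'number' },
          memo: { type: 'string' }
        }
      }
    }
  }
}.freeze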
Betting on raw capability. Gemini 2.5 Pro sees the PDF, understands the schema, outputs JSON directly. Token by token. One shot.
Verification is external - reconcile the math, check start_balance + sum(transactions) = end_balance, retry if not:
transitions do
  auto from: :parsing, to: :reconciling
  auto from: :reconciling, to: :verifying, if: :tallies_match?
  auto from: :reconciling, to: :parsing, if: :retries_available? # try again
end
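The appeal is that tallies_match? is deterministic and cheap. A minimal sketch of what it might look like, assuming the parsed payload is a plain hash and amounts are handled as BigDecimal (the real step code isn't shown in the post, and parsed_payload is a hypothetical accessor):

require 'bigdecimal'
require 'bigdecimal/util'

# Deterministic reconciliation: the statement has to balance to the penny.
# `parsed_payload` is a hypothetical accessor for the model's parsed JSON.
def tallies_match?
  start_balance = parsed_payload['start_balance'].to_s.to_d
  end_balance   = parsed_payload['end_balance'].to_s.to_d
  total         = parsed_payload['transactions'].sum { |txn| txn['amount'].to_s.to_d }

  start_balance + total == end_balance
end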
Works often. But failures are weird - truncated memos, hallucinated dates. The model generates tokens, not logic.
Approach 2: JIT Code Generation
Instead of outputting the answer, ask the model to write a program that outputs the answer:
# The "write code, then run it" approach
messages: [
  {
    role: 'user',
    content: [
      { type: 'container_upload', file_id: pdf_file_id },
      { type: 'container_upload', file_id: schema_file_id },
      { type: 'text', text: <<~PROMPT
        Please use the PDF skill to analyze the document, then write
        a Python script using code execution to:
        1. Read and extract all transaction data from the PDF
        2. Parse dates, amounts, and memos carefully
        3. Generate JSON that conforms exactly to the schema
        4. Save the final JSON to 'output.json'
        Iterate on your Python script until it produces perfect output.
      PROMPT
      }
    ]
  }
]
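The pdf_file_id and schema_file_id referenced above come from uploading both files first. A hedged sketch of that step, assuming the Ruby SDK exposes the beta Files API with an upload method analogous to the Python client's (method name, return shape, and file names are all assumptions; check the SDK docs):

require 'pathname'

# Assumption: client.beta.files.upload mirrors the Python SDK's Files API call;
# the Files API is in beta, so an extra beta flag/header may also be required.
pdf_file_id    = client.beta.files.upload(file: Pathname('statement.pdf')).id
schema_file_id = client.beta.files.upload(file: Pathname('lines_schema.json')).id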
The model writes Python, runs it, sees the output, fixes bugs, runs again. Code execution happens in a sandboxed container with PDF-parsing libraries available. The model acts as a coding agent for this task.
# Handle pause_turn: Claude iterates on its script
while current_response.stop_reason == 'pause_turn' && continue_count < max_continues
  messages << { role: 'assistant', content: current_response.content }
  current_response = client.beta.messages.create(
    model: MODEL,
    container: container_id, # same execution environment
    tools: [{ type: 'code_execution_20250825', name: 'code_execution' }],
    messages: messages
  )
  continue_count += 1 # without this the continue cap never kicks in
end
Exploiting the Verifier-Generator Gap
LLMs are better at verifying than generating. Both approaches exploit this, differently.
Raw LLM: external verification loop. Generate, verify with deterministic checks, retry on failure. Model never sees its mistakes.
Code execution: internal verification loop. Generate code, run it, see output, fix the script. Model becomes both generator and verifier.
Generating JSON token by token means predicting characters without seeing the whole structure. There's no going back to fix anything. But writing code lets the model:
- Write a parser that handles edge cases explicitly
- Run the code and see actual output
- Debug based on concrete errors
- Iterate until the output matches expectations
The Tradeoffs
| | Raw LLM | Code Execution |
|---|---|---|
| Speed | Fast (single call) | Slow (iterations) |
| Cost | Cheaper per attempt | More expensive |
| Reliability | Brittle on edge cases | More deterministic |
| Failure mode | Silent (wrong data) | Loud (exceptions) |
| Self-correction | No | Yes |
What I'm Actually Seeing
Both agents share the same verification pipeline. The difference is how they get to the first draft.
Code execution feels more robust. Failures are understandable (PDF library can't handle format). Raw LLM failures are mysterious (why truncate that memo?).
But code execution is more complex. More API calls, more state, more orchestration failure modes.
The Bigger Picture
This maps to a broader pattern. The excitement around terminal computer use isn't just about bash access - it's about letting models write and execute code as a thinking tool.
- Give every agent run an associated container runtime
- Give said agent tools: read file, edit file, run bash command (sketched below)
- Let the agent go wild - chaining tools, generating code JIT
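A minimal sketch of those tool definitions in the generic name/description/input_schema format most LLM APIs use (everything here is illustrative, not from the post):

# Illustrative tool specs: each tool is a name, a description, and a JSON
# Schema for its input. The agent loop decides when and how to call them.
AGENT_TOOLS = [
  {
    name: 'read_file',
    description: 'Read a file from the agent container and return its contents.',
    input_schema: {
      type: 'object',
      properties: { path: { type: 'string' } },
      required: %w[path]
    }
  },
  {
    name: 'edit_file',
    description: 'Replace a string in a file inside the agent container.',
    input_schema: {
      type: 'object',
      properties: {
        path: { type: 'string' },
        old_text: { type: 'string' },
        new_text: { type: 'string' }
      },
      required: %w[path old_text new_text]
    }
  },
  {
    name: 'run_bash',
    description: 'Run a shell command inside the agent container, return stdout/stderr.',
    input_schema: {
      type: 'object',
      properties: { command: { type: 'string' } },
      required: %w[command]
    }
  }
].freeze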
Agent-directed loops + terminal + coding agent = broad coverage. Models have enormous skill collections in distribution.
For structured extraction: trust the model to output correct JSON directly, or trust it to write correct code that outputs JSON?
Betting on the latter. Models are better at recognizing correct output than producing it directly. Code execution lets them iterate toward correctness.
The Punchline
Both approaches use the same verification harness:
agent_pattern do
  steps initial: :idle do
    step :parsing, Accounting::Steps::ParsingStep # or CodeExecutionParsingStep
    step :reconciling, Accounting::Steps::ReconcileStep
    step :verifying, Accounting::Steps::VerifyStep
    step :done
  end
end
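The post doesn't show how the parsing step gets swapped; conceptually it's a single strategy choice up front, something like this hypothetical sketch (the selection flag and wiring are my assumptions; the step classes are the ones above):

# Hypothetical: choose the first-draft strategy once, keep the rest of the
# harness identical for both approaches.
parsing_step =
  if ENV.fetch('EXTRACTION_STRATEGY', 'raw_llm') == 'code_execution'
    Accounting::Steps::CodeExecutionParsingStep
  else
    Accounting::Steps::ParsingStep
  end

agent_pattern do
  steps initial: :idle do
    step :parsing, parsing_step # everything downstream stays the same
    step :reconciling, Accounting::Steps::ReconcileStep
    step :verifying, Accounting::Steps::VerifyStep
    step :done
  end
end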
Philosophical difference: "model is smart enough to get it right" vs. "model is smart enough to write code that gets it right."
A lack of 9's propagates through longer trajectories. Verify rigorously. The question is whether you verify the model's output or the output of the model's code.
Don't know which wins yet. Know which one lets me sleep better.
