Introduction
Over the past few weeks, I’ve been deep in the weeds building a proof-of-concept that uses an LLM agent to autonomously write and execute Python code inside isolated containers. The goal: take raw data in, get a polished, self-contained report out — fully automated, fully secure.
It was one of those projects that looks straightforward on a whiteboard and then humbles you the moment you open a terminal. Here are the most important things I learned along the way.
1. “Sandbox as Tool” Is a Better Pattern Than “Agent in Sandbox”
The first architectural decision I had to make was: where does the agent live relative to the sandbox?
The intuitive answer is to put everything inside the container — agent, code, data, all of it. But this creates a serious problem: your LLM API key ends up inside an isolated container. That expands your attack surface in ways you don’t want.
The better pattern is “Sandbox as Tool”: the agent runs on the host, and the sandbox is just one of several tools it can call. The container only ever receives code to execute — no secrets, no direct LLM API access, no external network calls. The agent stays on the host where it can be monitored, rate-limited, and controlled.
This feels like a subtle distinction, but it has massive downstream implications for security architecture, observability, and cost management.
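To make that concrete, here's a minimal sketch of the pattern. The sandbox client, its execute() method, and the ExecuteCodeTool wrapper are illustrative names rather than any particular provider's API; the point is that only generated code crosses the boundary, while credentials stay on the host:

# Host side: the agent process holds the API key and does the orchestration.
# The container only ever receives the code string it is asked to run.
import os
import shlex

class ExecuteCodeTool:
    """Wraps a sandbox so the agent can call it like any other tool."""

    def __init__(self, sandbox):
        self.sandbox = sandbox  # sandbox client, created and held on the host

    def run(self, code: str) -> str:
        # No secrets are passed in: the container gets code and nothing else.
        return self.sandbox.execute(f"python -c {shlex.quote(code)}")

llm_api_key = os.environ["LLM_API_KEY"]  # stays on the host, never uploaded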
2. A Clean Sandbox Abstraction Pays for Itself Immediately
I ended up evaluating several sandbox providers. What saved me an enormous amount of time was defining a minimal, common interface upfront:
- execute(command)
- upload_file(local_path, container_path)
- download_file(container_path, local_path)
- cleanup()
Every provider implemented those four methods. Switching between them became a one-line config change. The agent code, the prompts, the tools — all of them remained completely provider-agnostic.
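Concretely, that boundary can be as small as an abstract base class. Here's a sketch; the method names mirror the four calls above, and anything provider-specific lives inside each subclass:

from abc import ABC, abstractmethod

class Sandbox(ABC):
    """Minimal provider-agnostic sandbox interface."""

    @abstractmethod
    def execute(self, command: str) -> str:
        """Run a shell command in the container and return its output."""

    @abstractmethod
    def upload_file(self, local_path: str, container_path: str) -> None:
        """Copy a file from the host into the container."""

    @abstractmethod
    def download_file(self, container_path: str, local_path: str) -> None:
        """Copy a file from the container back to the host."""

    @abstractmethod
    def cleanup(self) -> None:
        """Tear down the container and release its resources."""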
If you’re building anything like this, define your abstraction boundary before you start integrating providers. The refactoring cost if you don’t is brutal.
3. Prompt Engineering Is Your Biggest Cost Lever — By Far
This one surprised me more than anything else. My initial load tests came back with costs roughly 10x higher than they needed to be. The model itself wasn’t the problem. The prompts were.
Breaking it down, the waste was almost entirely from:
- The agent not understanding that each code execution call starts in a fresh process with no variable persistence. It kept referencing variables from previous calls that no longer existed — causing error-retry loops.
- The agent generating the same artefacts multiple times across separate calls instead of batching them.
- Excessive verification steps — running redundant checks repeatedly at massive context sizes for trivial confirmations.
- No prompt caching — the full system prompt was being re-sent on every single LLM call.
After rewriting the system prompt to explicitly address the stateless execution model and enforce batching rules, and enabling prompt caching (which gives a significant discount on cached tokens), cost dropped by 89% with no change in output quality.
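The exact mechanics of caching vary by provider. As a rough sketch, with the Anthropic Python SDK you mark the large, static system prompt as cacheable (the model name below is just a placeholder):

import anthropic

SYSTEM_PROMPT = "..."  # the long, stable instructions (placeholder)

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder model name
    max_tokens=2048,
    system=[
        {
            "type": "text",
            "text": SYSTEM_PROMPT,
            # everything up to this marker is cached and billed at the
            # discounted cached-token rate on subsequent calls
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Analyse the uploaded dataset."}],
)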
The lesson: treat your system prompt as production code. It needs the same rigor as the infrastructure around it.
4. Tell the LLM Exactly What It Cannot Assume
LLMs are trained to be helpful, which sometimes means they fill in gaps with plausible-but-wrong assumptions. In an agentic coding context, this is dangerous.
In my case, the model assumed code state persisted between execution calls — a completely reasonable assumption in interactive Python (Jupyter-style), but dead wrong in an architecture where each execution call spawns a fresh process. The fix was being brutally explicit in the system prompt:
“CRITICAL: Each execute_code call is a fresh process. No variables, no imports, no state persists between calls.”
More broadly: don’t rely on the model inferring constraints from context. If there’s a hard rule about how your tools work, state it explicitly. Ambiguity at the prompt level becomes expensive retries at runtime.
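One place to put that rule, besides the system prompt, is the tool definition itself, so the model sees it alongside every call. A sketch using an Anthropic-style JSON schema (field names differ slightly across providers):

# Bake the hard rule into the tool description, not just the system prompt.
execute_code_tool = {
    "name": "execute_code",
    "description": (
        "Run a Python script in the sandbox. CRITICAL: each call is a fresh "
        "process. No variables, imports, or in-memory state persist between "
        "calls; re-import and re-load everything the script needs."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "code": {
                "type": "string",
                "description": "A complete, self-contained Python script.",
            }
        },
        "required": ["code"],
    },
}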
5. Separate Deterministic Presentation from LLM-Generated Content
One of the best architectural decisions I made was splitting the output into two distinct layers:
- LLM-generated body fragment: The agent writes the inner content — analysis sections, charts, data tables, narrative summary. This is where dataset-specific meaning lives.
- Deterministic wrapper: A fixed code layer injects consistent branding, layout, CSS tokens, and a footer disclaimer. It never changes based on what the LLM does.
This two-layer approach solved several problems at once:
- Branding stays consistent regardless of what the model generates
- The model doesn’t waste tokens trying to maintain a full-page shell
- Layout bugs get fixed once, in one place
- The final output is self-contained with no external asset dependencies
If you’re using LLMs to generate any kind of structured document, this pattern is worth stealing.
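For illustration, here's a stripped-down sketch of the wrapper layer, assuming the agent hands back an HTML body fragment as a string (the branding and disclaimer text are placeholders):

# The LLM only produces body_fragment; everything else is fixed code.
REPORT_TEMPLATE = """<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>{title}</title>
<style>{css_tokens}</style>
</head>
<body>
<header class="brand">Acme Analytics</header>
{body_fragment}
<footer class="disclaimer">Generated automatically from the supplied dataset.</footer>
</body>
</html>"""

def render_report(title: str, body_fragment: str, css_tokens: str) -> str:
    # The shell never changes based on what the model generated.
    return REPORT_TEMPLATE.format(
        title=title, css_tokens=css_tokens, body_fragment=body_fragment
    )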
6. Simpler Agents Often Win for Bounded Tasks
I also compared a classic ReAct agent against a more sophisticated “deep agent” framework designed for open-ended coding tasks — the kind of thing you’d use for an autonomous coding assistant.
The deep agent had more built-in capabilities: planning middleware, sub-agent orchestration, automatic context management. But for a well-defined pipeline with a clear start and end state, all of that overhead became a liability:
- A significant portion of input tokens were consumed by planning calls that added no value to a task with a known structure
- The system prompt was 3x larger, meaning more tokens on every call
- More LLM calls, more failures, higher cost — and the output was actually less detailed
The simpler ReAct agent — cheaper, faster, more focused — won on every metric that mattered.
The lesson isn’t that sophisticated agent frameworks are bad. It’s that they’re designed for open-ended tasks where planning and recovery are genuinely needed. If your task is bounded and well-structured, a tightly scoped ReAct loop with a well-crafted prompt will almost always beat a general-purpose agent.
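To put a shape on what "tightly scoped" means: the entire loop can be this small. A sketch, where call_llm is a hypothetical helper that returns either a tool call or a final answer:

# Minimal ReAct-style loop: the model either calls a tool or finishes.
# No planning middleware, no sub-agents, no context-management machinery.
def run_agent(task: str, tools: dict, max_steps: int = 20) -> str:
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        step = call_llm(history)  # hypothetical: returns a parsed model response
        if step["type"] == "final_answer":
            return step["content"]
        result = tools[step["tool"]](**step["arguments"])  # e.g. execute_code
        history.append({"role": "assistant", "content": str(step)})
        history.append({"role": "user", "content": f"Tool result: {result}"})
    raise RuntimeError("Agent did not finish within the step budget")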
7. Cold Start Latency Is Solvable — But Understand What Actually Dominates
Cold-starting an isolated container takes time. Depending on your infrastructure, that can range from under a second (Firecracker microVMs) to several seconds (Kubernetes pod scheduling + runtime boot).
I explored two strategies to address this:
- Warm pools: Pre-create a set of ready-to-claim containers. When a request comes in, claim one instantly and replenish the pool in the background (sketched after this list). This cut average init time by more than 3x.
- Snapshot-based images: Some cloud sandbox providers detect that you’re reusing the same environment definition and boot from a pre-built snapshot. First build is slow; every subsequent boot is fast.
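The warm pool doesn't need to be elaborate. A sketch, assuming a create_sandbox() factory (hypothetical) that returns a ready-to-use container:

# Keep a few containers ready; hand one out instantly and top the pool up
# in the background so the next request avoids the cold start too.
import queue
import threading

POOL_SIZE = 3
pool = queue.Queue()

def _replenish():
    pool.put(create_sandbox())  # hypothetical factory returning a ready sandbox

def claim_sandbox():
    try:
        sandbox = pool.get_nowait()  # instant if a warm container is available
    except queue.Empty:
        sandbox = create_sandbox()  # fall back to a cold start
    threading.Thread(target=_replenish, daemon=True).start()
    return sandbox

# At startup, pre-create the pool.
for _ in range(POOL_SIZE):
    _replenish()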
Both approaches work well. But here’s the important nuance: agent execution dominates total runtime by an order of magnitude. In my benchmarks, sandbox init was measured in single-digit seconds. Agent execution was measured in minutes. Optimizing cold start feels good but moves the needle much less than optimizing the agent itself.
Fix the agent first. Then fix the infrastructure.
8. Watch Out for Binary Data Over Streaming Interfaces
This one is a pure gotcha for anyone building on container orchestration platforms. Some execution APIs stream data as UTF-8 strings. If you’re downloading binary files — like PNG charts embedded in HTML — non-UTF-8 bytes get silently corrupted in transit.
The fix is to base64-encode the output inside the container before streaming it, then decode on the client side:
# Inside the container: archive the output directory and base64-encode it
tar cf - /output | base64
# On the client: decode the streamed text, then unpack the tar archive
import base64, io, tarfile
tar_bytes = base64.b64decode(raw_stream)  # raw_stream is the concatenated stdout text
with tarfile.open(fileobj=io.BytesIO(tar_bytes)) as tar:
    tar.extractall(".")
Simple fix, but it’s the kind of thing that only surfaces during load testing with real binary outputs — not during happy-path local dev runs. Always test file transfers with binary content. Don’t assume that because text files work, everything works.
9. Never Let the LLM Read the Full Dataset
This seems obvious in retrospect, but it’s worth stating clearly: do not let the agent read raw data into its context window.
A large CSV can be tens of megabytes of text. Even if it fit in context, the cost would be absurd and the useful signal-to-noise ratio would be terrible. The correct pattern:
- Sample headers and a few rows to understand the schema
- Write a script that processes the full dataset inside the sandbox
- Only the output of that script — summaries, aggregations, stats — returns to the agent
The agent never needs to see the raw data. It needs to understand the structure, write code that processes the data, and reason about the outputs. Those are very different things — and treating them as different things is what makes the system scalable to any dataset size.
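In code, the sampling step can be as small as this sketch (using pandas; the returned string is the only thing that enters the agent's context):

# Show the agent the shape of the data, never the data itself.
import pandas as pd

def describe_dataset(path: str, n_rows: int = 5) -> str:
    sample = pd.read_csv(path, nrows=n_rows)  # read only a handful of rows
    summary = [
        f"Columns: {list(sample.columns)}",
        f"Dtypes: {sample.dtypes.astype(str).to_dict()}",
        "First rows:",
        sample.to_string(index=False),
    ]
    return "\n".join(summary)  # a few hundred tokens, regardless of file size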
Closing Thoughts
The thing that struck me most about this project is how much the non-ML parts of the system matter. The sandbox architecture, the prompt design, the abstraction layers, the output structure — these are fundamentally software engineering problems, and they have more impact on the system’s cost, reliability, and quality than the choice of model.
LLMs are extraordinarily capable, but they’re also very literal. They do what you tell them. If your prompts are ambiguous, your costs will be high. If your architecture leaks assumptions, your agent will hallucinate workarounds. If your abstractions are clean, everything else becomes easier.
Build the boring parts well. The model will handle the rest.
Have questions or thoughts on any of these patterns? Drop them in the comments below.