{"id":169,"date":"2026-03-22T17:45:40","date_gmt":"2026-03-22T17:45:40","guid":{"rendered":"https:\/\/balamurali.in\/blog\/?p=169"},"modified":"2026-03-22T17:45:40","modified_gmt":"2026-03-22T17:45:40","slug":"what-i-learned-building-an-llm-powered-sandboxed-code-execution-system","status":"publish","type":"post","link":"https:\/\/balamurali.in\/blog\/learn-with-me\/what-i-learned-building-an-llm-powered-sandboxed-code-execution-system\/","title":{"rendered":"What I Learned Building an LLM-Powered Sandboxed Code Execution System"},"content":{"rendered":"\n<h2 class=\"wp-block-heading\">Introduction<\/h2>\n\n\n\n<p>Over the past few weeks, I&#8217;ve been deep in the weeds building a proof-of-concept that uses an LLM agent to autonomously write and execute Python code inside isolated containers. The goal: take raw data in, get a polished, self-contained report out \u2014 fully automated, fully secure.<\/p>\n\n\n\n<p>It was one of those projects that <em>looks<\/em> straightforward on a whiteboard and then humbles you the moment you open a terminal. Here are the most important things I learned along the way.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">1. &#8220;Sandbox as Tool&#8221; Is a Better Pattern Than &#8220;Agent in Sandbox&#8221;<\/h2>\n\n\n\n<p>The first architectural decision I had to make was: <em>where does the agent live relative to the sandbox?<\/em><\/p>\n\n\n\n<p>The intuitive answer is to put everything inside the container \u2014 agent, code, data, all of it. But this creates a serious problem: your LLM API key ends up inside an isolated container. That expands your attack surface in ways you don&#8217;t want.<\/p>\n\n\n\n<p>The better pattern is <strong>&#8220;Sandbox as Tool&#8221;<\/strong>: the agent runs on the host, and the sandbox is just one of several tools it can call. The container only ever receives code to execute \u2014 no secrets, no direct LLM API access, no external network calls. The agent stays on the host where it can be monitored, rate-limited, and controlled.<\/p>\n\n\n\n<p>This feels like a subtle distinction, but it has massive downstream implications for security architecture, observability, and cost management.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">2. A Clean Sandbox Abstraction Pays for Itself Immediately<\/h2>\n\n\n\n<p>I ended up evaluating multiple different sandbox providers. What saved me enormous amounts of time was defining a minimal, common interface upfront:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>execute(command)\nupload_file(local_path, container_path)\ndownload_file(container_path, local_path)\ncleanup()<\/code><\/pre>\n\n\n\n<p>Every provider implemented those four methods. Switching between them became a one-line config change. The agent code, the prompts, the tools \u2014 all of them remained completely provider-agnostic.<\/p>\n\n\n\n<p>If you&#8217;re building anything like this, define your abstraction boundary <em>before<\/em> you start integrating providers. The refactoring cost if you don&#8217;t is brutal.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">3. Prompt Engineering Is Your Biggest Cost Lever \u2014 By Far<\/h2>\n\n\n\n<p>This one surprised me more than anything else. My initial load tests came back with costs roughly <strong>10x higher than they needed to be<\/strong>. 
<hr/>

<h2>3. Prompt Engineering Is Your Biggest Cost Lever — By Far</h2>

<p>This one surprised me more than anything else. My initial load tests came back with costs roughly <strong>10x higher than they needed to be</strong>. The model itself wasn't the problem. The <em>prompts</em> were.</p>

<p>Breaking it down, the waste was almost entirely from:</p>

<ul>
<li>The agent not understanding that each code execution call starts in a <strong>fresh process with no variable persistence</strong>. It kept referencing variables from previous calls that no longer existed — causing error-retry loops.</li>
<li>The agent generating the same artefacts <strong>multiple times</strong> across separate calls instead of batching them.</li>
<li><strong>Excessive verification steps</strong> — running redundant checks repeatedly at massive context sizes for trivial confirmations.</li>
<li>No prompt caching — the full system prompt was being re-sent on every single LLM call.</li>
</ul>

<p>After I rewrote the system prompt to explicitly address the stateless execution model and enforce batching rules, and enabled prompt caching (which gives a significant discount on cached tokens), cost dropped by <strong>89%</strong> with no change in output quality.</p>

<p>The lesson: treat your system prompt as production code. It needs the same rigor as the infrastructure around it.</p>

<hr/>

<h2>4. Tell the LLM Exactly What It Cannot Assume</h2>

<p>LLMs are trained to be helpful, which sometimes means they fill in gaps with plausible-but-wrong assumptions. In an agentic coding context, this is dangerous.</p>

<p>In my case, the model assumed code state persisted between execution calls — a completely reasonable assumption in interactive Python (Jupyter-style), but dead wrong in an architecture where each execution call spawns a fresh process. The fix was being brutally explicit in the system prompt:</p>

<blockquote>
<p><em>"CRITICAL: Each execute_code call is a fresh process. No variables, no imports, no state persists between calls."</em></p>
</blockquote>

<p>More broadly: don't rely on the model inferring constraints from context. If there's a hard rule about how your tools work, state it explicitly. Ambiguity at the prompt level becomes expensive retries at runtime.</p>

<hr/>

<h2>5. Separate Deterministic Presentation from LLM-Generated Content</h2>

<p>One of the best architectural decisions I made was splitting the output into two distinct layers:</p>

<ul>
<li><strong>LLM-generated body fragment:</strong> The agent writes the inner content — analysis sections, charts, data tables, narrative summary. This is where dataset-specific meaning lives.</li>
<li><strong>Deterministic wrapper:</strong> A fixed code layer injects consistent branding, layout, CSS tokens, and a footer disclaimer. It never changes based on what the LLM does.</li>
</ul>
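<p>A minimal sketch of what that wrapper layer can look like, assuming the agent hands back an HTML fragment as a plain string (the shell markup, CSS tokens, and names here are illustrative, not the exact template from this project):</p>

<pre><code># Deterministic presentation layer: the LLM never sees or edits this shell.
PAGE_SHELL = """<!DOCTYPE html>
<html>
<head>
  <meta charset="utf-8">
  <title>{title}</title>
  <style>
    body {{ font-family: sans-serif; max-width: 960px; margin: 0 auto; }}
    .brand-header {{ background: #1a237e; color: #fff; padding: 16px; }}
    .disclaimer {{ color: #666; font-size: 12px; margin-top: 32px; }}
  </style>
</head>
<body>
  <div class="brand-header">Example Analytics Report</div>
  {body_fragment}
  <div class="disclaimer">Generated automatically from the supplied dataset.</div>
</body>
</html>"""


def render_report(title: str, body_fragment: str) -> str:
    """Wrap the LLM-generated body fragment in the fixed, branded shell."""
    return PAGE_SHELL.format(title=title, body_fragment=body_fragment)</code></pre>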
<p>This two-layer approach solved several problems at once:</p>

<ol>
<li>Branding stays consistent regardless of what the model generates</li>
<li>The model doesn't waste tokens trying to maintain a full-page shell</li>
<li>Layout bugs get fixed once, in one place</li>
<li>The final output is self-contained with no external asset dependencies</li>
</ol>

<p>If you're using LLMs to generate any kind of structured document, this pattern is worth stealing.</p>

<hr/>

<h2>6. Simpler Agents Often Win for Bounded Tasks</h2>

<p>I also compared a classic ReAct agent against a more sophisticated "deep agent" framework designed for open-ended coding tasks — the kind of thing you'd use for an autonomous coding assistant.</p>

<p>The deep agent had more built-in capabilities: planning middleware, sub-agent orchestration, automatic context management. But for a well-defined pipeline with a clear start and end state, all of that overhead became a liability:</p>

<ul>
<li>A significant portion of input tokens was consumed by <strong>planning calls that added no value</strong> to a task with a known structure</li>
<li>The system prompt was <strong>3x larger</strong>, meaning more tokens on every call</li>
<li>More LLM calls, more failures, higher cost — and the output was actually <em>less</em> detailed</li>
</ul>

<p>The simpler ReAct agent — cheaper, faster, more focused — won on every metric that mattered.</p>

<p>The lesson isn't that sophisticated agent frameworks are bad. It's that they're designed for open-ended tasks where planning and recovery are genuinely needed. If your task is bounded and well-structured, a tightly scoped ReAct loop with a well-crafted prompt will almost always beat a general-purpose agent.</p>

<hr/>

<h2>7. Cold Start Latency Is Solvable — But Understand What Actually Dominates</h2>

<p>Cold-starting an isolated container takes time. Depending on your infrastructure, that can range from under a second (Firecracker microVMs) to several seconds (Kubernetes pod scheduling + runtime boot).</p>

<p>I explored two strategies to address this:</p>

<ul>
<li><strong>Warm pools:</strong> Pre-create a set of ready-to-claim containers. When a request comes in, claim one instantly and replenish the pool in the background (a rough sketch follows below). This cut average init time by more than 3x.</li>
<li><strong>Snapshot-based images:</strong> Some cloud sandbox providers detect that you're reusing the same environment definition and boot from a pre-built snapshot. First build is slow; every subsequent boot is fast.</li>
</ul>
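<p>For reference, here is a minimal sketch of the warm-pool idea, reusing the SandboxSession abstraction from section 2. The create_sandbox factory and pool size are illustrative placeholders rather than the exact implementation:</p>

<pre><code>import queue
import threading
from typing import Callable


class WarmPool:
    """Keep a small pool of pre-created sandboxes ready to claim."""

    def __init__(self, create_sandbox: Callable[[], "SandboxSession"], size: int = 3):
        self._create = create_sandbox
        self._pool: queue.Queue = queue.Queue()
        for _ in range(size):
            self._pool.put(self._create())  # pay the cold-start cost upfront

    def claim(self) -> "SandboxSession":
        """Hand out a ready sandbox and refill the pool in the background."""
        sandbox = self._pool.get()  # near-instant while the pool is warm
        threading.Thread(
            target=lambda: self._pool.put(self._create()),
            daemon=True,
        ).start()
        return sandbox</code></pre>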
<p>Both approaches work well. But here's the important nuance: <strong>agent execution dominates total runtime by an order of magnitude</strong>. In my benchmarks, sandbox init was measured in single-digit seconds. Agent execution was measured in minutes. Optimizing cold start feels good but moves the needle much less than optimizing the agent itself.</p>

<p><strong>Fix the agent first. Then fix the infrastructure.</strong></p>

<hr/>

<h2>8. Watch Out for Binary Data Over Streaming Interfaces</h2>

<p>This one is a pure gotcha for anyone building on container orchestration platforms. Some execution APIs stream data as UTF-8 strings. If you're downloading binary files — like PNG charts embedded in HTML — non-UTF-8 bytes get silently corrupted in transit.</p>

<p>The fix is to base64-encode the output <em>inside the container</em> before streaming it, then decode on the client side:</p>

<pre><code># Inside container
tar cf - /output | base64

# On the client
import base64
data = base64.b64decode(raw_stream)</code></pre>

<p>Simple fix, but it's the kind of thing that only surfaces during load testing with real binary outputs — not during happy-path local dev runs. Always test file transfers with binary content. Don't assume that because text files work, everything works.</p>

<hr/>

<h2>9. Never Let the LLM Read the Full Dataset</h2>

<p>This seems obvious in retrospect, but it's worth stating clearly: <strong>do not let the agent read raw data into its context window.</strong></p>

<p>A large CSV can be tens of megabytes of text. Even if it fit in context, the cost would be absurd and the useful signal-to-noise ratio would be terrible. The correct pattern:</p>

<ol>
<li>Sample headers and a few rows to understand the schema</li>
<li>Write a script that processes the full dataset <em>inside the sandbox</em></li>
<li>Only the <em>output</em> of that script — summaries, aggregations, stats — returns to the agent</li>
</ol>

<p>The agent never needs to see the raw data. It needs to understand the structure, write code that processes the data, and reason about the outputs. Those are very different things — and treating them as different things is what makes the system scalable to any dataset size.</p>
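<p>In code, that split looks roughly like the sketch below. Both snippets run inside the sandbox; only their small printed output ever enters the agent's context. The file path and summary fields are illustrative:</p>

<pre><code>import json

import pandas as pd

DATASET = "/data/input.csv"  # hypothetical path inside the sandbox

# Step 1: schema peek -- a handful of rows so the agent learns column names and types.
sample = pd.read_csv(DATASET, nrows=5)
print(sample.to_string())
print(sample.dtypes.to_string())

# Step 2: full processing -- the whole file is read here, but only a compact
# summary is printed back to the agent, never the raw rows.
df = pd.read_csv(DATASET)
summary = {
    "rows": len(df),
    "columns": list(df.columns),
    "numeric_stats": df.describe().to_dict(),
}
print(json.dumps(summary, default=str))</code></pre>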
<hr/>

<h2>Closing Thoughts</h2>

<p>The thing that struck me most about this project is how much the <em>non-ML</em> parts of the system matter. The sandbox architecture, the prompt design, the abstraction layers, the output structure — these are fundamentally software engineering problems, and they have more impact on the system's cost, reliability, and quality than the choice of model.</p>

<p>LLMs are extraordinarily capable, but they're also very literal. They do what you tell them. If your prompts are ambiguous, your costs will be high. If your architecture leaks assumptions, your agent will hallucinate workarounds. If your abstractions are clean, everything else becomes easier.</p>

<p><strong>Build the boring parts well. The model will handle the rest.</strong></p>

<hr/>

<p><em>Have questions or thoughts on any of these patterns? Drop them in the comments below.</em></p>