For AI agent developers

Build AI agents that actually run code

The agent generates code. SandboxAPI runs it. The agent reads the output, fixes errors, installs missing packages, and iterates until the task is done. The same loop a human runs — at machine speed.

The canonical agent pattern

Most "AI runs code" demos break the moment the model needs to install a package, debug a traceback, or hold state across turns. With stateful sessions, that loop just works.

1. Agent generates code

The LLM writes Python that uses pandas to summarize a CSV.

2. Agent calls session_execute

SandboxAPI runs the code in an isolated gVisor sandbox and returns ModuleNotFoundError: No module named 'pandas'.

3. Agent reads the error, calls session_install_packages

{"manager":"pip","packages":["pandas"]} installs in <3s using the cached index.

4. Agent retries in the same session, with the same state

pandas is now installed and the session still holds its working directory and any state from the earlier turns, so the retry succeeds. Output goes back to the user.

Why sessions matter: Without stateful sessions, the agent has to re-do every step from scratch on every retry. With sessions, the loop is fast enough to iterate dozens of times in seconds — exactly how a human would debug.
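
Stripped of the LLM wiring, steps 2 through 4 are three SDK calls. This is a minimal sketch that reuses the session methods from the full example further down the page (sessions.create, sessions.execute, sessions.install); the code string itself is illustrative.

import os
from sandboxapi import SandboxAPI

sb = SandboxAPI(api_key=os.environ["SANDBOX_API_KEY"])
session = sb.sessions.create(language="python3", idle_ttl=600)

code = "import pandas as pd\nprint(pd.read_csv('/tmp/sales.csv')['revenue'].sum())"

# Step 2: the first attempt fails because pandas isn't installed yet
result = sb.sessions.execute(session.id, code=code)

if "ModuleNotFoundError" in (result.stderr or ""):
    # Step 3: install the missing package into the same session
    sb.sessions.install(session.id, manager="pip", packages=["pandas"])
    # Step 4: retry in the same session; nothing has to be rebuilt
    result = sb.sessions.execute(session.id, code=code)

print(result.stdout)
sb.sessions.close(session.id)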

Code example: OpenAI tool calling + Python SDK

Here's the entire wiring you need. The agent picks the tool, the SDK calls SandboxAPI, the result goes back to the model. Wrap it in a loop and you have a code interpreter.

import json
import os
from openai import OpenAI
from sandboxapi import SandboxAPI

client = OpenAI()
sb = SandboxAPI(api_key=os.environ["SANDBOX_API_KEY"])

# Open a session up front so state persists across tool calls
session = sb.sessions.create(language="python3", idle_ttl=600)

tools = [
  {
    "type": "function",
    "function": {
      "name": "execute_python",
      "description": "Run Python code in a stateful sandbox. Variables, files, packages persist across calls.",
      "parameters": {
        "type": "object",
        "properties": {"code": {"type": "string"}},
        "required": ["code"],
      },
    },
  },
  {
    "type": "function",
    "function": {
      "name": "install_packages",
      "description": "Install pip packages in the current session.",
      "parameters": {
        "type": "object",
        "properties": {"packages": {"type": "array", "items": {"type": "string"}}},
        "required": ["packages"],
      },
    },
  },
]

messages = [{"role": "user", "content": "Read the CSV at /tmp/sales.csv and report total revenue."}]

# Agent loop
while True:
    resp = client.chat.completions.create(
        model="gpt-4o", messages=messages, tools=tools,
    )
    msg = resp.choices[0].message
    messages.append(msg.model_dump())

    if not msg.tool_calls:
        print(msg.content)
        break

    for call in msg.tool_calls:
        args = json.loads(call.function.arguments)
        if call.function.name == "execute_python":
            result = sb.sessions.execute(session.id, code=args["code"])
            content = result.stdout + (("\n" + result.stderr) if result.stderr else "")
        elif call.function.name == "install_packages":
            sb.sessions.install(session.id, manager="pip", packages=args["packages"])
            content = "Installed " + ", ".join(args["packages"])
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": content,
        })

sb.sessions.close(session.id)

Why SandboxAPI fits AI agents

Common patterns

Code interpreter for any LLM

Wrap the SDK in two tool definitions (execute_python, install_packages) and you have a fully featured code interpreter for any model that supports function calling.
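
Because the sandbox side never changes between providers, the dispatch can live in one small helper. A minimal sketch, assuming the same two tool names and the session methods from the example above; run_tool is a hypothetical name, and name/arguments are whatever your model provider returns for a function call.

import json

def run_tool(sb, session_id, name, arguments):
    # Hypothetical helper: map any model's tool call onto SandboxAPI.
    # `arguments` is the raw JSON string the provider hands back.
    args = json.loads(arguments)
    if name == "execute_python":
        result = sb.sessions.execute(session_id, code=args["code"])
        return result.stdout + (("\n" + result.stderr) if result.stderr else "")
    if name == "install_packages":
        sb.sessions.install(session_id, manager="pip", packages=args["packages"])
        return "Installed " + ", ".join(args["packages"])
    return f"Unknown tool: {name}"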

Self-correcting agent

When the LLM produces broken code, the traceback feeds back as the next message. With sessions, the fix doesn't lose context. Most coding tasks converge in 1–3 iterations.
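
One way to wire that up outside a full chat loop is a bounded retry helper. A sketch, assuming the session methods from the example above; generate_fix is a hypothetical callback that asks your LLM for a corrected version of the code given the traceback.

MAX_ATTEMPTS = 3  # most fixes land within a few iterations

def run_until_clean(sb, session_id, code, generate_fix):
    # Execute, feed any traceback back to the model, and retry in place.
    result = sb.sessions.execute(session_id, code=code)
    for _ in range(MAX_ATTEMPTS - 1):
        if not result.stderr:
            break
        # The session keeps its variables and packages, so the fix only needs
        # to address the traceback, not rebuild earlier state.
        code = generate_fix(code, result.stderr)
        result = sb.sessions.execute(session_id, code=code)
    return result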

Multi-language pipelines

The agent can drop down to Bash to inspect a file, switch to Python to analyze it, and produce a JSON output. Each language gets its own session — sessions are cheap.
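
A sketch of that hand-off, assuming the sessions API also accepts a Bash language value (the example above only shows "python3", so treat "bash" as an assumption) and that the CSV is present in each sandbox.

import os
from sandboxapi import SandboxAPI

sb = SandboxAPI(api_key=os.environ["SANDBOX_API_KEY"])

bash = sb.sessions.create(language="bash")   # "bash" is an assumed language id
py = sb.sessions.create(language="python3")

# Bash: take a quick look at the file before committing to an analysis
# (assumes /tmp/sales.csv exists in each sandbox, e.g. uploaded beforehand)
peek = sb.sessions.execute(bash.id, code="head -n 3 /tmp/sales.csv; wc -l /tmp/sales.csv")
print(peek.stdout)

# Python: do the analysis and emit JSON for the next stage
analysis = sb.sessions.execute(py.id, code=(
    "import csv, json\n"
    "rows = list(csv.DictReader(open('/tmp/sales.csv')))\n"
    "print(json.dumps({'row_count': len(rows)}))\n"
))
print(analysis.stdout)

sb.sessions.close(bash.id)
sb.sessions.close(py.id)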

Auto-grading agent demos

Building a "show me you can solve LeetCode" demo? Use execute_with_expected — pass expected_output, get a wrong_answer status. No string-comparison logic in your code.

Start building free

Basic tier is free forever — 500 executions a month, more than enough to ship a working agent demo. Pro adds sessions and package install.

Start with Basic (free)
MCP integration guide