For AI agent developers

Build AI agents that actually run code

The agent generates code. SandboxAPI runs it. The agent reads the output, fixes errors, installs missing packages, and iterates until the task is done. The same loop a human runs — at machine speed.

The canonical agent pattern

Most "AI runs code" demos break the moment the model needs to install a package, debug a traceback, or hold state across turns. With stateful sessions, that loop just works.

1. Agent generates code

The LLM writes Python that uses pandas to summarize a CSV.

2. Agent calls session_execute

SandboxAPI runs the code in an isolated gVisor sandbox and returns ModuleNotFoundError: No module named 'pandas'.

3. Agent reads the error, calls session_install_packages

{"manager":"pip","packages":["pandas"]} installs in <3s using the cached index.

4. Agent retries in the same session, with the same state

pandas is now installed and the session still holds its working directory and any state from the earlier turns, so the retry succeeds. Output goes back to the user.

Why sessions matter: Without stateful sessions, the agent has to re-do every step from scratch on every retry. With sessions, the loop is fast enough to iterate dozens of times in seconds — exactly how a human would debug.
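
Stripped of the LLM wiring, steps 2 through 4 are three SDK calls. This is a minimal sketch that reuses the session methods from the full example further down the page (sessions.create, sessions.execute, sessions.install); the code string itself is illustrative.

import os
from sandboxapi import SandboxAPI

sb = SandboxAPI(api_key=os.environ["SANDBOX_API_KEY"])
session = sb.sessions.create(language="python3", idle_ttl=600)

code = "import pandas as pd\nprint(pd.read_csv('/tmp/sales.csv')['revenue'].sum())"

# Step 2: the first attempt fails because pandas isn't installed yet
result = sb.sessions.execute(session.id, code=code)

if "ModuleNotFoundError" in (result.stderr or ""):
    # Step 3: install the missing package into the same session
    sb.sessions.install(session.id, manager="pip", packages=["pandas"])
    # Step 4: retry in the same session; nothing has to be rebuilt
    result = sb.sessions.execute(session.id, code=code)

print(result.stdout)
sb.sessions.close(session.id)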

Code example: OpenAI tool calling + Python SDK

Here's the entire wiring you need. The agent picks the tool, the SDK calls SandboxAPI, the result goes back to the model. Wrap it in a loop and you have a code interpreter.

import json
import os
from openai import OpenAI
from sandboxapi import SandboxAPI

client = OpenAI()
sb = SandboxAPI(api_key=os.environ["SANDBOX_API_KEY"])

# Open a session up front so state persists across tool calls
session = sb.sessions.create(language="python3", idle_ttl=600)

tools = [
  {
    "type": "function",
    "function": {
      "name": "execute_python",
      "description": "Run Python code in a stateful sandbox. Variables, files, packages persist across calls.",
      "parameters": {
        "type": "object",
        "properties": {"code": {"type": "string"}},
        "required": ["code"],
      },
    },
  },
  {
    "type": "function",
    "function": {
      "name": "install_packages",
      "description": "Install pip packages in the current session.",
      "parameters": {
        "type": "object",
        "properties": {"packages": {"type": "array", "items": {"type": "string"}}},
        "required": ["packages"],
      },
    },
  },
]

messages = [{"role": "user", "content": "Read the CSV at /tmp/sales.csv and report total revenue."}]

# Agent loop
while True:
    resp = client.chat.completions.create(
        model="gpt-4o", messages=messages, tools=tools,
    )
    msg = resp.choices[0].message
    messages.append(msg.model_dump())

    if not msg.tool_calls:
        print(msg.content)
        break

    for call in msg.tool_calls:
        args = json.loads(call.function.arguments)
        if call.function.name == "execute_python":
            result = sb.sessions.execute(session.id, code=args["code"])
            content = result.stdout + (("\n" + result.stderr) if result.stderr else "")
        elif call.function.name == "install_packages":
            sb.sessions.install(session.id, manager="pip", packages=args["packages"])
            content = "Installed " + ", ".join(args["packages"])
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": content,
        })

sb.sessions.close(session.id)

Why SandboxAPI fits AI agents

Common patterns

Code interpreter for any LLM

Wrap the SDK in two tool definitions (execute_python, install_packages) and you have a fully featured code interpreter for any model that supports function calling.
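
Because the sandbox side never changes between providers, the dispatch can live in one small helper. A minimal sketch, assuming the same two tool names and the session methods from the example above; run_tool is a hypothetical name, and name/arguments are whatever your model provider returns for a function call.

import json

def run_tool(sb, session_id, name, arguments):
    # Hypothetical helper: map any model's tool call onto SandboxAPI.
    # `arguments` is the raw JSON string the provider hands back.
    args = json.loads(arguments)
    if name == "execute_python":
        result = sb.sessions.execute(session_id, code=args["code"])
        return result.stdout + (("\n" + result.stderr) if result.stderr else "")
    if name == "install_packages":
        sb.sessions.install(session_id, manager="pip", packages=args["packages"])
        return "Installed " + ", ".join(args["packages"])
    return f"Unknown tool: {name}"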

Self-correcting agent

When the LLM produces broken code, the traceback feeds back as the next message. With sessions, the fix doesn't lose context. Most coding tasks converge in 1–3 iterations.
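
One way to wire that up outside a full chat loop is a bounded retry helper. A sketch, assuming the session methods from the example above; generate_fix is a hypothetical callback that asks your LLM for a corrected version of the code given the traceback.

MAX_ATTEMPTS = 3  # most fixes land within a few iterations

def run_until_clean(sb, session_id, code, generate_fix):
    # Execute, feed any traceback back to the model, and retry in place.
    result = sb.sessions.execute(session_id, code=code)
    for _ in range(MAX_ATTEMPTS - 1):
        if not result.stderr:
            break
        # The session keeps its variables and packages, so the fix only needs
        # to address the traceback, not rebuild earlier state.
        code = generate_fix(code, result.stderr)
        result = sb.sessions.execute(session_id, code=code)
    return result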

Multi-language pipelines

The agent can drop down to Bash to inspect a file, switch to Python to analyze it, and produce a JSON output. Each language gets its own session — sessions are cheap.
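
A sketch of that hand-off, assuming the sessions API also accepts a Bash language value (the example above only shows "python3", so treat "bash" as an assumption) and that the CSV is present in each sandbox.

import os
from sandboxapi import SandboxAPI

sb = SandboxAPI(api_key=os.environ["SANDBOX_API_KEY"])

bash = sb.sessions.create(language="bash")   # "bash" is an assumed language id
py = sb.sessions.create(language="python3")

# Bash: take a quick look at the file before committing to an analysis
# (assumes /tmp/sales.csv exists in each sandbox, e.g. uploaded beforehand)
peek = sb.sessions.execute(bash.id, code="head -n 3 /tmp/sales.csv; wc -l /tmp/sales.csv")
print(peek.stdout)

# Python: do the analysis and emit JSON for the next stage
analysis = sb.sessions.execute(py.id, code=(
    "import csv, json\n"
    "rows = list(csv.DictReader(open('/tmp/sales.csv')))\n"
    "print(json.dumps({'row_count': len(rows)}))\n"
))
print(analysis.stdout)

sb.sessions.close(bash.id)
sb.sessions.close(py.id)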

Auto-grading agent demos

Building a "show me you can solve LeetCode" demo? Use execute_with_expected — pass expected_output, get a wrong_answer status. No string-comparison logic in your code.

Start building free

Basic tier is free forever — 500 executions a month, more than enough to ship a working agent demo. Pro adds sessions and package install.

Start with Basic (free)
MCP integration guide