
The New Agents SDK Is Really About Owning the Runtime

Mehdi Rezaei

The interesting part of OpenAI's April 15 Agents SDK update is not that agents can run shell commands. Plenty of systems could already bolt a terminal onto a model. The useful shift is that OpenAI is turning the runtime around the model into a first-class thing: files, tools, workspace layout, sandbox providers, memory, instructions, skills, patching, and state recovery are all becoming part of the same harness instead of a pile of custom glue.

That matters because most production agent work does not fail at the model call. It fails in the boring middle, where the agent cannot see the right files, writes output into the wrong place, loses state after a container expires, has too much access to secrets, or needs six slightly different adapters before it can do one useful task. The new SDK does not make those problems disappear, but it gives teams a more coherent place to solve them.

The runtime is now part of the product

A serious agent is not a chat completion with a few functions attached. It is closer to a worker process that can inspect evidence, produce artifacts, call tools, and survive more than one short turn. Once an agent can edit a repository, analyze a data room, generate files, or run checks, the execution environment becomes as important as the prompt.

The updated Agents SDK leans into that. OpenAI describes a model-native harness with sandbox-aware orchestration, filesystem tools, shell execution, apply-patch style editing, MCP, skills, AGENTS.md-style instructions, and configurable memory. The key word is harness. A harness is opinionated enough to give the model a predictable way to work, but it still leaves the application in control of which tools, files, and sandboxes are available.

Why sandbox support is the practical feature

Native sandbox execution is the part I would pay attention to first. Useful agents need a workspace where they can read inputs, install dependencies, run commands, edit files, and write final artifacts. Before this, every team had to decide how much of that to build: local shell wrappers, Docker adapters, provider-specific sandbox clients, file staging, mount logic, cleanup, retry behavior, and some improvised map of where the agent should put things.

The SDK now has a Manifest abstraction for describing the workspace contract. That sounds small, but it is exactly the kind of small that reduces production bugs. Inputs can be mounted as local files, directories, Git repos, or remote storage. Output directories can be declared. Paths are workspace-relative, which forces the task to be portable instead of depending on a developer's machine. The same agent definition can move from local development to Docker or a hosted provider with less rewriting.
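The exact shape of the Manifest API is not shown here, but the idea of a declared workspace contract can be sketched. The type names, mount kinds, and `resolveWorkspacePath` helper below are all illustrative assumptions, not SDK API; the point is that inputs, outputs, and workspace-relative paths are data, not convention.

```typescript
// Hypothetical workspace-contract shapes -- the SDK's actual Manifest API
// may differ. Inputs are declared mounts, outputs are declared directories.
type MountSource =
  | { kind: 'local'; path: string }
  | { kind: 'git'; url: string; ref?: string }
  | { kind: 'remote'; url: string }

interface WorkspaceManifest {
  mounts: Record<string, MountSource> // workspace-relative mount points
  outputs: string[]                   // directories the agent may write to
}

const manifest: WorkspaceManifest = {
  mounts: {
    repo: { kind: 'git', url: 'https://example.com/team/service.git', ref: 'main' },
    docs: { kind: 'local', path: './fixtures/docs' },
  },
  outputs: ['output'],
}

// Every path the agent sees is workspace-relative, so the same task
// definition stays portable across local, Docker, and hosted sandboxes.
function resolveWorkspacePath(root: string, rel: string): string {
  if (rel.startsWith('/') || rel.includes('..')) {
    throw new Error(`path must be workspace-relative: ${rel}`)
  }
  return `${root.replace(/\/$/, '')}/${rel}`
}
```

Rejecting absolute and `..` paths at the contract boundary is what makes "workspace-relative" a guarantee rather than a habit.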

For a full-stack team, this is the difference between a demo agent and an operational workflow. The demo says, 'Here is a prompt and a tool.' The production version says, 'Here is the repo snapshot, here are the allowed commands, here is where outputs go, here is what gets persisted, here is what cannot see credentials, and here is how the run resumes if the container dies.'

```typescript
import { Agent, Runner, shellTool } from '@openai/agents'

const agent = new Agent({
  name: 'Dependency triage agent',
  model: 'gpt-5.4',
  instructions: [
    'Inspect the mounted repository.',
    'Run only read-only package and test commands unless approval is granted.',
    'Write findings to output/triage.md.',
  ].join('\n'),
  tools: [
    shellTool({
      environment: {
        type: 'container_auto',
        networkPolicy: { type: 'disabled' },
        memoryLimit: '2g',
      },
    }),
  ],
})

const result = await Runner.run(agent, 'Find the safest upgrade path for failing dependencies.')
```

The security argument is not optional

OpenAI is explicit that agent systems should be designed with prompt injection and exfiltration attempts in mind. That is the right default. If a model can read untrusted documents and run code, the system has to assume the documents may contain instructions trying to steal secrets, change behavior, or leak data through tool calls.

This is where separating harness from compute becomes more than architecture cleanliness. Credentials should stay in the orchestrating application when possible, not inside the environment where model-generated commands execute. The sandbox should receive the minimum data and permissions needed for the task. Network access should be disabled by default for many coding and document workflows, then opened with narrow allowlists only when the job really needs it.
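A deny-by-default egress check makes that posture concrete. This is a minimal sketch, not SDK API; the `NetworkPolicy` shape and `isEgressAllowed` helper are assumptions about how an orchestrator might enforce a narrow allowlist.

```typescript
// Deny-by-default egress: the network is off unless a host is explicitly
// allowlisted for this workflow. Shapes here are illustrative.
interface NetworkPolicy {
  enabled: boolean
  allowedHosts: string[] // exact hostnames only, no wildcards
}

const codeReviewPolicy: NetworkPolicy = {
  enabled: true,
  allowedHosts: ['registry.npmjs.org'], // package metadata only
}

function isEgressAllowed(policy: NetworkPolicy, url: string): boolean {
  if (!policy.enabled) return false
  const host = new URL(url).hostname
  return policy.allowedHosts.includes(host)
}
```

The check lives in the orchestrating application, outside the sandbox, so a prompt-injected command cannot simply rewrite its own policy.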

What I would build differently now

If I were adding agent workflows to a SaaS product today, I would stop treating the sandbox as an implementation detail hidden below the agent abstraction. I would model it directly. Each workflow would have a workspace contract, a permissions profile, a network policy, an artifact contract, and a recovery strategy.

For example, a code review agent should not get your entire production environment. It needs a checkout, maybe a package-manager cache, a narrow set of commands, and a way to produce a patch or report. A data extraction agent should get a mounted document set and an output directory, not broad network access and application secrets. A support investigation agent might need read-only logs and traces, but any mutation should go through a human-approved tool.
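Those per-workflow contracts can be written down as data rather than left implicit. The `WorkflowProfile` shape and the two example profiles below are illustrative assumptions, not part of any SDK; the point is that access is declared, not inferred.

```typescript
// One explicit profile per workflow. Shapes are illustrative, not SDK API.
interface WorkflowProfile {
  name: string
  commands: string[]          // allowlisted commands
  network: 'disabled' | 'allowlist'
  secrets: 'none'             // the model-executed env never sees credentials
  artifacts: string[]         // where outputs must land
  mutations: 'forbidden' | 'human_approved'
}

const codeReview: WorkflowProfile = {
  name: 'code-review',
  commands: ['git diff', 'npm test'],
  network: 'disabled',
  secrets: 'none',
  artifacts: ['output/review.md', 'output/patch.diff'],
  mutations: 'forbidden',
}

const supportInvestigation: WorkflowProfile = {
  name: 'support-investigation',
  commands: ['grep', 'jq'],
  network: 'disabled',
  secrets: 'none',
  artifacts: ['output/findings.md'],
  mutations: 'human_approved', // any write goes through an approval tool
}

function isCommandAllowed(profile: WorkflowProfile, cmd: string): boolean {
  return profile.commands.some((c) => cmd === c || cmd.startsWith(c + ' '))
}
```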

The larger lesson is that agent quality is becoming an infrastructure problem as much as a prompt problem. Better prompts help, but they do not fix missing files, messy permissions, lost state, or ambiguous outputs. A boring workspace contract often improves reliability more than another paragraph of instructions.

The TypeScript gap is real

There is one important catch for JavaScript and TypeScript teams: the new harness and sandbox capabilities launch first in Python, with TypeScript support planned later. The JavaScript Agents SDK already has useful tool primitives, including shell tools that can run locally or in hosted containers, plus apply-patch and computer-use interfaces. But the newest sandbox-agent flow is not yet equally available across both languages.

That means a TypeScript-first team has a choice. If the agent runtime is central to the product right now, it may be worth running the agent worker in Python behind a clean API while keeping the web app and product backend in TypeScript. If the workflow is still early, you can keep using the JS SDK primitives and design your own workspace contract in a way that will map cleanly when the fuller TypeScript support lands.
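If you take the Python-worker route, the boundary that matters is a narrow, typed contract between the TypeScript backend and the worker. Everything below is a hypothetical sketch: the request and result shapes, the `AgentWorker` interface, and the in-memory stand-in are assumptions about one clean way to draw that line.

```typescript
// A narrow, typed contract between the TypeScript backend and a Python
// agent worker. The shapes are hypothetical; the point is that product
// code depends on this interface, not on either SDK directly.
interface AgentRunRequest {
  workflow: string
  inputs: Record<string, string> // workspace-relative mounts
  task: string
}

interface AgentRunResult {
  runId: string
  status: 'succeeded' | 'failed' | 'needs_approval'
  artifacts: string[]
}

interface AgentWorker {
  run(req: AgentRunRequest): Promise<AgentRunResult>
}

// In-memory stand-in used until the real worker is wired up; the same
// interface later fronts the Python service, or a future TS harness.
class FakeAgentWorker implements AgentWorker {
  async run(req: AgentRunRequest): Promise<AgentRunResult> {
    return {
      runId: `run_${req.workflow}_1`,
      status: 'succeeded',
      artifacts: ['output/triage.md'],
    }
  }
}
```

Because the product only sees `AgentWorker`, swapping the Python service for fuller TypeScript SDK support later is an implementation change, not a rewrite.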

Where teams will misuse this

The bad version of this trend is obvious: give the agent a big container, mount half the company, leave network open, and call it autonomy. That is not an architecture. It is a security incident waiting for a convincing PDF.

The better version is narrower. Use sandboxes to make boundaries explicit. Use manifests to make inputs and outputs predictable. Use hosted or provider-managed compute when isolation and cleanup matter. Keep secrets out of model-executed environments. Add approval gates for high-impact actions. Persist state because long-running work will fail eventually, not because durability looks good in a diagram.
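An approval gate for high-impact actions can be very small. This is a minimal sketch under stated assumptions: the `ApprovalGate` class and its method names are illustrative, not any SDK's API, and a real system would persist pending actions rather than hold them in memory.

```typescript
// Minimal approval gate: high-impact tool calls are parked until a human
// resolves them. Shapes and names are illustrative.
type Decision = 'approved' | 'rejected'

interface PendingAction {
  id: number
  description: string
  execute: () => string
}

class ApprovalGate {
  private pending = new Map<number, PendingAction>()
  private nextId = 1

  // Called by the agent loop; the id is surfaced to a reviewer out of band.
  request(description: string, execute: () => string): number {
    const id = this.nextId++
    this.pending.set(id, { id, description, execute })
    return id
  }

  // Called by the human-facing side; only approval runs the action.
  resolve(id: number, decision: Decision): string | null {
    const action = this.pending.get(id)
    if (!action) throw new Error(`unknown action ${id}`)
    this.pending.delete(id)
    return decision === 'approved' ? action.execute() : null
  }
}
```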

The Agents SDK update is worth taking seriously because it points at where agent products are going. The winning systems will not be the ones with the most dramatic prompts. They will be the ones where the model has a well-shaped place to work, just enough power to complete the task, and clear boundaries when the work touches real systems. Owning that runtime is now part of building the product.
