How much can the MCP code execution pattern reduce token usage?

Cloudflare Code Mode collapses 2,500+ Cloudflare API endpoints from roughly 1.17 million tokens to about 1,000 tokens, a 99.9% reduction. Anthropic's Salesforce-to-Sheets demo shows 150,000 tokens of tool definitions reduced to 2,000, a 98.7% drop. Real workloads land between 95% and 99%.

What is the difference between MCP code mode and dynamic toolsets?

Code mode runs agent-written code against a typed API surface in a sandbox. Dynamic toolsets, the Speakeasy approach, expose meta tools that emit tool schemas on demand via embeddings, for a reduction of around 96% without a sandbox. Code mode wins on the long tail; dynamic toolsets are simpler to operate.

How does Cloudflare's Code Mode MCP server work?

Cloudflare exposes the entire Cloudflare API through two tools: search filters the OpenAPI spec, and execute runs JavaScript against a typed client inside an isolated Dynamic Worker sandbox. Cloudflare reports that the pattern reduces a 1.17 million token enumeration to roughly 1,000 tokens of context.

When should I use the code execution pattern vs Tool Search?

Use code execution when you control the server, the API has hundreds or thousands of endpoints, and workflows compose multiple calls. Use Tool Search (with alwaysLoad off) for third-party MCP servers you cannot modify. Claude Code triggers Tool Search automatically when tool descriptions cross 10% of context.

What is the alwaysLoad option in Claude Code v2.1.121?

alwaysLoad is a per-server boolean in your MCP config. Setting it to true exempts that server's tools from Tool Search deferral, so they load at session start. Use it for small, high-frequency servers where a search round trip costs more than the tokens it saves. A server can also opt in individual tools through their _meta object.

Can I build a code-mode MCP server in Python?

Yes. Use the official mcp Python SDK to register two tools, search and execute. search returns module and signature snippets from a typed api/ folder. execute parses the agent-written code with ast.parse and dispatches it to a real isolated runner like Firecracker, gVisor, or a Deno isolate. Do not run model code in-process.

Is the MCP code execution pattern safe?

Only with real isolation. You are running model-generated code, so treat it the way you would treat user-uploaded code. Production deployments use Firecracker, gVisor, Cloudflare Workers, or Deno isolates with no network egress beyond the API surface. In-process runners are a footgun and have led to sandbox escapes in adjacent ecosystems.

MCP Code Execution Pattern: A Hands-On Claude Code Guide

Q: What is the MCP code execution pattern?

The MCP code execution pattern exposes a large API to an agent through two generic tools, search and execute, instead of registering one MCP tool per endpoint. The agent writes small programs that compose tool calls inside a sandboxed runtime. Anthropic and Cloudflare both ship implementations of the same idea.

The MCP code execution pattern exposes large APIs to an agent through two generic tools, search and execute, instead of one tool definition per endpoint. The agent writes code that composes calls; the model never sees thousands of tool schemas. Cloudflare shipped 2,500+ API endpoints in roughly 1,000 tokens this way. Here's how to apply it to your own servers.

May 3, 2026 | 12 min read | Last updated: 2026-05-03

Tags: Claude Code, MCP, Token Optimization, Code Execution, Agent Architecture

TL;DR

Three popular MCP servers can consume 143,000 of a 200,000-token context window before the agent reads its first user message, 72% of working memory eaten by tool descriptions.
The code execution pattern replaces one-tool-per-endpoint with two generic tools (search and execute) and a sandboxed runtime. Cloudflare reports 99.9% reduction; Anthropic's own demo shows 98.7%.
Three live approaches: code execution (Anthropic, Cloudflare), dynamic toolsets (Speakeasy), deferred loading (Claude Code Tool Search plus the new alwaysLoad option in v2.1.121). Pick by API size and operational appetite.
Production code-mode servers need real isolation: Firecracker, gVisor, Cloudflare Workers, or Deno isolates. Running model code in-process is unsafe. The Python SDK gives you the tool surface, not the sandbox.

What Is the MCP Code Execution Pattern?

The MCP code execution pattern presents an MCP server as a code API on a filesystem. The agent gets a sandboxed runtime and two generic tools. The first, search, lets the agent find a function signature without loading every schema. The second, execute, runs a small program that calls those functions and returns the result. The model never sees thousands of tool descriptions, only the few it asked for on the turn it needs them.

Compare that with the classic MCP shape. A normal MCP server registers, say, 80 tools. Each comes with a JSON Schema, a description, and parameter docs. Claude Code or any MCP client loads all of that into the system prompt at session start. If you have three of those servers connected, you're paying that cost three times before any work happens.

Anthropic's engineering blog put it cleanly in their code execution with MCP writeup: presenting tools as code on a filesystem allows models to read tool definitions on-demand, rather than reading them all up-front. Cloudflare's phrasing in the Code Mode for MCP launch is more direct: give agents an entire API in 1,000 tokens.

There are two roles in the pattern. The server needs to expose a discoverable code surface, usually an OpenAPI spec or a typed function library. The client needs a sandbox to run agent-written code in. Get either wrong and the pattern breaks. That's why most production examples come from companies with existing isolation infrastructure (Cloudflare Workers, Anthropic internal tooling) rather than from individual developers wiring it up the first time.

The 143,000-Token Problem

The pattern exists because MCP tool bloat became the dominant cost in agent workflows during early 2026. The number that gets cited most: three popular MCP servers (GitHub, Playwright, an IDE bridge) can consume 143,000 of a 200,000-token context window before the agent reads a single user message, per analysis in The New Stack's 10 strategies to reduce MCP token bloat (April 2026). That's 72% of working memory gone on tool descriptions that mostly never get called.

A single large MCP server commonly costs 10,000-17,000 tokens in descriptions alone, per BSWEN's tool overhead diagnostic (April 24, 2026). Multiply that by your average MCP client running three to seven servers and the math gets ugly fast.
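To make that arithmetic concrete, here is a minimal sketch of the budget math. The 200,000-token window, the 143,000-token figure, and the 10,000-17,000 per-server range come from the numbers above; the function names are illustrative only.

```python
CONTEXT_WINDOW = 200_000  # tokens, the window size cited above


def context_share(tool_tokens: int, window: int = CONTEXT_WINDOW) -> float:
    """Fraction of the context window consumed by tool descriptions."""
    return tool_tokens / window


def overhead_range(servers: int, low: int = 10_000, high: int = 17_000) -> tuple:
    """Best- and worst-case description cost for a number of large servers."""
    return servers * low, servers * high


print(f"{context_share(143_000):.1%}")  # 71.5%, which rounds to the 72% quoted above
print(overhead_range(3))                # (30000, 51000)
print(overhead_range(7))                # (70000, 119000)
```

At seven large servers the worst case alone is more than half of a 200,000-token window, before a single user message arrives.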

The numbers from the five big writeups in the last two weeks:

Source	Workload	Before	After	Reduction
Cloudflare Code Mode	2,500+ API endpoints	~1.17M tokens	~1,000 tokens	99.9%
Anthropic Engineering	Salesforce to Sheets	150,000 tokens	2,000 tokens	98.7%
Speakeasy Dynamic Toolsets	400-tool server	~410,000 tokens	~8,000 tokens	96%
Atlassian mcp-compressor	Schema overhead per tool	Baseline	3-30%	70-97%
Claude Code Tool Search	7+ MCP servers	~51,000 tokens	~8,500 tokens	46.9% (as reported)

These numbers measure different things. Tool Search defers loading on the client side without changing the server; dynamic toolsets replace static schemas with embedding-driven discovery; code mode replaces tool-per-endpoint entirely. They aren't in conflict, and the ceiling isn't the same. Code mode hits 99.9% because it eliminates per-tool description cost, not because it compresses them. Knowing which problem you're solving matters before you pick a fix.
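The headline percentages for the Cloudflare and Anthropic rows follow directly from their before and after counts. A small sketch of that calculation, using only the figures from the table (the helper name is illustrative):

```python
def reduction_pct(before: int, after: int) -> float:
    """Percentage of tokens eliminated going from `before` to `after`."""
    return (1 - after / before) * 100


rows = {
    "Cloudflare Code Mode": (1_170_000, 1_000),
    "Anthropic Engineering": (150_000, 2_000),
}

for name, (before, after) in rows.items():
    print(f"{name}: {reduction_pct(before, after):.1f}%")
# Cloudflare Code Mode: 99.9%
# Anthropic Engineering: 98.7%
```

Running the same formula on your own server's before and after counts is a quick way to check whether a published percentage measures the same thing you care about.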

How the Pattern Works End-to-End

A single agent turn under the code execution pattern looks like this. The user asks the agent to perform a multi-step task. The agent first calls search("create r2 bucket") against the server's typed surface. The server returns two or three candidate signatures plus minimal docs (a few hundred tokens). The agent reads those, then calls execute(code) with a short program that composes those signatures. The server runs the code in an isolated sandbox and returns the result. That's the whole shape.

The two-tool layout matters. search stays small (it returns text snippets, not schemas). execute takes arbitrary code as a string. The agent only sees the tool definitions it explicitly asked for, and only on the turn it needed them. By session end, total token cost on tool definitions usually lands under 5,000, regardless of how many endpoints the server actually exposes.
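On the wire, those two steps are ordinary MCP tools/call requests. The sketch below builds both payloads in Python. The query string is the one from the example above; the program passed to execute and the api.create_r2_bucket name are hypothetical, since the real names come from whatever search returns.

```python
import json


def tool_call(request_id: int, name: str, arguments: dict) -> dict:
    """Build a JSON-RPC 2.0 request for MCP's tools/call method."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


# Step 1: discover a signature.
search_req = tool_call(1, "search", {"query": "create r2 bucket"})

# Step 2: run a short program against the signatures that came back.
# The function name inside the program is hypothetical.
program = 'result = api.create_r2_bucket(name="logs")\n'
execute_req = tool_call(2, "execute", {"code": program})

print(json.dumps(search_req, indent=2))
print(json.dumps(execute_req, indent=2))
```

Both requests stay the same size no matter how many endpoints sit behind the server, which is where the flat token cost comes from.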

The sandbox shapes vary by implementation. Cloudflare uses Dynamic Workers (their existing Workers isolate, with no network egress outside the API surface). Anthropic's example treats the API as a filesystem of TypeScript files and runs in their internal runtime. The community implementation jx-codes/codemode-mcp uses a Deno isolate. They're the same pattern with different sandboxes.

One detail that surprises people: the agent's code can call multiple endpoints in one execute. A typical "create a DNS record, then a Worker route, then an R2 bucket" workflow is one tool round trip with code mode versus three with classic MCP. That latency win is real on long tasks; the trade-off is worth knowing about (more on that below).
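As a rough illustration of that composition, the sketch below stands in for the program an agent might submit to execute. All three functions are hypothetical stubs, not the real Cloudflare client; the point is only that three API calls travel inside one program.

```python
calls = []  # records every API call the program makes


# Hypothetical stand-ins for functions on the server's typed surface.
def create_dns_record(zone: str, name: str, target: str) -> dict:
    calls.append("create_dns_record")
    return {"zone": zone, "name": name, "target": target}


def create_worker_route(zone: str, pattern: str, script: str) -> dict:
    calls.append("create_worker_route")
    return {"zone": zone, "pattern": pattern, "script": script}


def create_r2_bucket(name: str) -> dict:
    calls.append("create_r2_bucket")
    return {"name": name}


def workflow() -> dict:
    """What the agent would submit as a single execute program."""
    record = create_dns_record("example.com", "app", "192.0.2.10")
    route = create_worker_route("example.com", "app.example.com/*", "app-worker")
    bucket = create_r2_bucket("app-assets")
    return {"record": record, "route": route, "bucket": bucket}


result = workflow()
print(f"{len(calls)} API calls inside one execute call")
```

With classic MCP each of those three calls would be its own tool invocation, with the intermediate results passing back through the model's context.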

Code Mode vs Dynamic Toolsets vs Deferred Loading

Three approaches surfaced in the April 2026 debate. They solve the same root problem (tool descriptions are too expensive) at different layers of the stack.

Approach	Server-side	Client-side	Reduction	Best for
Code execution / Code Mode	Sandbox plus `search` and `execute` tools.	Standard MCP client.	96-99.9%	Very large APIs (1,000+ endpoints), composable workflows.
Dynamic toolsets	Meta tools that emit schemas on demand via embeddings.	Standard MCP client.	90-96%	Mid-to-large servers without sandbox infrastructure.
Deferred loading	No changes required.	Claude Code defers tool definitions until query time.	30-50%	Third-party MCP servers you cannot modify.

Pick code execution when you control the server, the API is huge, and workflows compose multiple endpoints. Pick dynamic toolsets when your team can't operate a sandbox but you can index schemas with embeddings. Pick deferred loading when you're consuming third-party MCP servers you don't own. Claude Code does the last one for you automatically once tool descriptions cross 10% of context, no config required.
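Those selection rules can be written down as a small decision helper. This is a sketch, not a rule from any of the vendors: the 1,000-endpoint threshold is taken from the table above, and the parameter names are illustrative.

```python
def pick_approach(controls_server: bool, endpoints: int,
                  composes_calls: bool, can_run_sandbox: bool) -> str:
    """Map the selection criteria above onto one of the three approaches."""
    if not controls_server:
        # Third-party servers can only be handled on the client side.
        return "deferred loading"
    if endpoints >= 1_000 and composes_calls and can_run_sandbox:
        return "code execution"
    # You own the server but lack the size, chaining, or sandbox
    # capacity that justifies code mode.
    return "dynamic toolsets"


print(pick_approach(False, 2_500, True, True))   # deferred loading
print(pick_approach(True, 2_500, True, True))    # code execution
print(pick_approach(True, 400, False, False))    # dynamic toolsets
```

Treat the threshold as a starting point to tune against your own workload rather than a hard boundary.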

Speakeasy's counter-post ("you don't need code mode") makes a fair point: most servers don't need a sandbox to recover most of the savings. And earezki.com's analysis calls code mode "the long-tail escape hatch, not the front door," which I think is right. Reach for code execution when the API has thousands of endpoints or your workflows naturally chain calls. For mid-size APIs, dynamic toolsets get you 90% of the way with a tenth of the operational complexity.

Build a Minimal Code-Mode MCP Server in Python

Here's a sketch of a code-mode MCP server using the official mcp Python SDK. This isn't production code, but it's honest about the security boundary, which is the part most tutorials skip. Install the SDK first:

```bash
pip install mcp
```

Lay out the project so the typed API surface lives in its own folder. Each function in api/ is what the agent will search against and call from inside execute:

```text
myserver/
  server.py            # registers search + execute tools
  api/
    __init__.py
    list_users.py
    create_post.py
    update_post.py
  sandbox.py           # delegates to a real isolated runner
```

A representative api/list_users.py:

myserver/api/list_users.py

```python
from typing import Optional


def list_users(limit: int = 50, role: Optional[str] = None) -> list[dict]:
    """List users in the workspace.

    Args:
        limit: Max users to return (1-100).
        role: Optional role filter ("admin", "member", "guest").

    Returns: list[dict] with keys id, email, role.
    """
    # The docstring sits on the function, not the module, because search
    # reads func.__doc__ when it matches and returns results.
    # Real implementation hits your backend here.
    ...
```

The server registers two tools. search walks the api/ folder and returns matching docstrings. execute ships the agent's source string to the sandbox:

myserver/server.py

```python
"""Code-mode MCP server: two tools, sandboxed runtime.

Run as a module (python -m myserver.server) so the relative imports resolve.
"""
import importlib
import inspect
import pkgutil

from mcp.server.fastmcp import FastMCP

from . import api
from .sandbox import dispatch

mcp = FastMCP("codemode-example")


def _iter_api_modules():
    """Yield (short_name, module) for every module in the api/ package."""
    for _, name, _ in pkgutil.iter_modules(api.__path__):
        yield name, importlib.import_module(f"{api.__name__}.{name}")


@mcp.tool()
def search(query: str, limit: int = 5) -> list[dict]:
    """Find candidate functions in the api/ folder by docstring match."""
    q = query.lower()
    hits = []
    for name, mod in _iter_api_modules():
        for fname, func in inspect.getmembers(mod, inspect.isfunction):
            # Skip helpers that were only imported into the module.
            if func.__module__ != mod.__name__:
                continue
            doc = (func.__doc__ or "").lower()
            # Plain substring match: the whole query must appear verbatim.
            if q in fname.lower() or q in doc:
                sig = str(inspect.signature(func))
                hits.append({
                    "module": name,
                    "function": fname,
                    "signature": sig,
                    # Return only the first docstring paragraph to stay small.
                    "doc": (func.__doc__ or "").strip().split("\n\n")[0],
                })
    return hits[:limit]


@mcp.tool()
def execute(code: str, timeout_s: int = 30) -> dict:
    """Run a small program against the api/ surface inside an isolated runner.

    The string is delegated to a real sandbox (Firecracker, gVisor, Workers,
    Deno isolate). Do not run it in-process.
    """
    return dispatch(code, allowed_modules=["api"], timeout_s=timeout_s)


if __name__ == "__main__":
    mcp.run()
```
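One limitation of the substring match in search is that a multi-word query such as "list workspace users" only hits when that exact phrase appears in a name or docstring. A minimal sketch of word-overlap ranking is shown below as one possible alternative. It is self-contained, the two API functions are stand-ins, and score is a hypothetical helper rather than part of the mcp SDK.

```python
import inspect
from typing import Callable, Optional


def list_users(limit: int = 50, role: Optional[str] = None) -> list:
    """List users in the workspace."""
    return []


def create_post(title: str, body: str) -> dict:
    """Create a post and return it."""
    return {}


def score(query: str, func: Callable) -> int:
    """Count how many query words appear in the function name or docstring."""
    haystack = f"{func.__name__} {func.__doc__ or ''}".lower()
    return sum(1 for word in set(query.lower().split()) if word in haystack)


def search(query: str, funcs: list, limit: int = 5) -> list:
    """Rank functions by word overlap and drop the ones with no overlap."""
    ranked = sorted(funcs, key=lambda f: score(query, f), reverse=True)
    return [
        {"function": f.__name__, "signature": str(inspect.signature(f))}
        for f in ranked[:limit]
        if score(query, f) > 0
    ]


print(search("list workspace users", [list_users, create_post]))
print(search("delete invoice", [list_users, create_post]))  # no overlap, so []
```

For a large surface you would likely swap this for a proper index, but word overlap is often enough to make natural-language queries from the agent land on the right function.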

And the sandbox shim. This deliberately doesn't run code; it forwards the source to whatever isolated runner you operate. The comment block is the part you should read three times before shipping anything:

myserver/sandbox.py

```python
"""Sandbox dispatcher.

DO NOT REPLACE THIS WITH AN IN-PROCESS RUNNER.
You are about to run model-generated code. Treat it the way you would
treat code from an unauthenticated public form.

Production options:
  - Firecracker microVM (Lambda-style isolation)
  - gVisor containers (kernel syscall filtering)
  - Cloudflare Workers (V8 isolates with no host filesystem)
  - Deno isolate with --allow-net=api.example.com only
  - WASM runtime (Wasmtime, Wasmer) with no filesystem access

The dispatcher below is a stub; plug in the runner you actually trust.
"""
import ast


def dispatch(code: str, allowed_modules: list[str], timeout_s: int) -> dict:
    # Parse the source for syntax errors and a basic safety screen,
    # then ship it to your real sandbox over a controlled channel.
    # The screen is a courtesy check, not a security boundary.
    tree = ast.parse(code)  # raises SyntaxError on malformed source
    _reject_obvious_unsafe_imports(tree, allowed_modules)
    # Replace the next line with a call to your actual isolated runner.
    raise NotImplementedError(
        "Wire dispatch() to Firecracker, gVisor, Workers, or Deno."
    )


def _reject_obvious_unsafe_imports(tree: ast.AST, allowed: list[str]) -> None:
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            # "import api, os" carries several names; check every one.
            modules = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            if node.level:  # relative import such as "from . import x"
                raise PermissionError("relative imports are not allowed")
            modules = [node.module or ""]
        else:
            continue
        for mod in modules:
            top = mod.split(".")[0]
            if top not in allowed:
                raise PermissionError(f"import {top!r} not in allowed list")
```
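It helps to see why the import screen in sandbox.py can only ever be a courtesy check. The self-contained sketch below collects the modules named in import statements (the helper name is hypothetical) and shows that a dynamic import slips past untouched. The code strings are only parsed here, never executed.

```python
import ast


def imported_modules(code: str) -> set:
    """Top-level module names that appear in import statements."""
    found = set()
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Import):
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            found.add(node.module.split(".")[0])
    return found


# Every name in a multi-name import is visible to the screen.
print(imported_modules("import api, os"))
print(imported_modules("from api.users import list_users"))

# A dynamic import contains no import statement, so the screen sees nothing.
print(imported_modules("__import__('os').system('id')"))
```

The last case is the reason the isolation has to come from the runner itself: static screening catches honest mistakes, not a program that is trying to get out.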

That's the whole shape. Two tools, a typed folder, a sandbox stub. The reference implementation in jx-codes/codemode-mcp wires the same idea to a Deno isolate if you want a working template. For my own MCP servers I've mostly stayed in the classic shape so far, like the Method CRM MCP server; code mode only earns its keep when the API is genuinely large.

Configure Claude Code to Consume Code-Mode Servers

Claude Code v2.1.121 (April 28, 2026) shipped two MCP controls that matter here. The first is alwaysLoad, a per-server boolean in ~/.claude/settings.json or .mcp.json that opts a server out of Tool Search deferral. The second is a per-tool _meta override, set by the server itself, that does the same thing tool by tool.

For small MCP servers you call constantly (say, a 6-tool internal server with under 5,000 tokens of definitions), Tool Search adds a round trip without saving meaningful context. Mark them alwaysLoad:

~/.claude/settings.json

```json
{
  "mcpServers": {
    "core-tools": {
      "type": "http",
      "url": "https://mcp.example.com/mcp",
      "alwaysLoad": true
    }
  }
}
```

For large code-mode servers (the entire Cloudflare API, GitHub's full API, or your own 2,000-endpoint internal platform), leave alwaysLoad unset. The two tools (search and execute) are tiny enough that Tool Search keeps them deferred until the agent actually needs to invoke something. The combination is what you want: small servers loaded eagerly, big servers loaded lazily.

~/.claude/settings.json

```json
{
  "mcpServers": {
    "core-tools": {
      "type": "http",
      "url": "https://mcp.example.com/mcp",
      "alwaysLoad": true
    },
    "platform-codemode": {
      "type": "http",
      "url": "https://platform.example.com/mcp"
    }
  }
}
```

For per-tool granularity, an MCP server can mark individual tools always-loaded by including "anthropic/alwaysLoad": true in the tool's _meta object on the server side. Verify either way with claude mcp list on the CLI or /mcp inside Claude Code. Joe Njenga's benchmarks on a 7-server setup report that Tool Search alone takes the cost from 51,000 tokens to 8,500 before any code-mode server is even in the mix. The benchmark cites a 46.9% reduction, while the raw before and after figures work out to roughly 83%, so check which quantity each number measures before quoting either.
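For reference, this is roughly the shape a tool definition takes once the per-tool flag is present. The sketch is a plain dictionary, not SDK code: the tool name, description, and schema are hypothetical, the _meta key is the one named above, and how you attach _meta depends on the SDK you use.

```python
import json

# Hypothetical tool definition as a server might return it from tools/list.
tool = {
    "name": "search",
    "description": "Find candidate functions in the api/ folder.",
    "inputSchema": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
    # Per-tool opt-out of Tool Search deferral, using the key named above.
    "_meta": {"anthropic/alwaysLoad": True},
}

print(json.dumps(tool, indent=2))
```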

Tool Search auto-triggers when MCP tool descriptions cross 10% of context. Code-mode servers stay below that line by design. The two features compose naturally: code mode keeps your big server's description cost flat, Tool Search defers everything else until needed. Pin the Claude Code version using the same approach I covered in Regression-Proof Claude Code Workflows so a future release doesn't silently change how alwaysLoad behaves.
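A quick way to reason about that threshold is sketched below. It assumes a 200,000-token context window and reads "cross 10%" as strictly greater than; both are assumptions for illustration, and the function is not part of Claude Code.

```python
def tool_search_triggers(tool_tokens: int, context_window: int = 200_000,
                         threshold: float = 0.10) -> bool:
    """True when tool descriptions exceed the given share of the context window."""
    return tool_tokens > context_window * threshold


print(tool_search_triggers(51_000))  # True: a 7-server setup is well past 20,000 tokens
print(tool_search_triggers(1_000))   # False: a lone code-mode server stays under the line
```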

When the Code Execution Pattern Hurts You

Honesty section. The numbers are exciting, the operational reality isn't. Five things to plan for before you commit.

Sandbox isolation is a real engineering bill. You're running model-generated code. Treat it like user-uploaded code from an anonymous form. Production deployments need real isolation: Firecracker, gVisor, Workers, Deno isolates, or WASM runtimes. In-process runners are a footgun and have led to sandbox escapes in adjacent ecosystems. The threat model overlaps with prompt injection in CI, which I covered in Hardening Claude Code GitHub Actions.

Debugging gets opaque. When a classic tool call fails, you see the failed call name and arguments. When code execution fails, you get a stack trace from inside a sandbox running model-written code. Tracing requires more structured logging on the server side, ideally one log line per api call from inside execute. Plan for it; don't ship without it.
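One way to get a log line per API call is a decorator applied to every function on the api/ surface. The sketch below is an assumption about how you might structure it, not part of any SDK: log_lines stands in for a real log sink, and list_users is a stub.

```python
import functools
import json
import time

log_lines = []  # stand-in for a real log sink


def logged(func):
    """Emit one JSON log line per API call, including failed calls."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        status = "ok"
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            status = f"error: {type(exc).__name__}"
            raise
        finally:
            log_lines.append(json.dumps({
                "call": func.__name__,
                "kwargs": sorted(kwargs),  # argument names only, not values
                "status": status,
                "ms": round((time.perf_counter() - start) * 1000, 2),
            }))
    return wrapper


@logged
def list_users(limit: int = 50, role=None) -> list:
    if not 1 <= limit <= 100:
        raise ValueError("limit must be between 1 and 100")
    return []


list_users(limit=10)
try:
    list_users(limit=500)
except ValueError:
    pass

print("\n".join(log_lines))
```

Logging argument names rather than values keeps secrets out of the log by default; widen it deliberately if you need the values for debugging. The same events can feed a per-call audit trail.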

Server complexity goes up. A classic MCP server is a list of functions. A code-mode server is a runtime, a sandbox, an OpenAPI ingest, a search index, and a security boundary you have to keep patched. The line count is small; the operational cost is not.

Latency floor. Two round trips per task (search then execute) versus one tool call. For long composable workflows the math wins because the agent does five things in one execute. For short, single-call tasks, it's slower than the classic shape. Measure your workload before deciding.
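The break-even point is easy to sketch. The model below assumes exactly one search and one execute per task, which is optimistic, since a real agent may search more than once; the function name is illustrative.

```python
def round_trips(api_calls: int, code_mode: bool) -> int:
    """Tool round trips for a task that makes `api_calls` API calls."""
    if code_mode:
        return 2  # one search plus one execute, regardless of call count
    return api_calls  # classic MCP: one tool call per API call


for n in (1, 3, 5):
    print(n, round_trips(n, code_mode=False), round_trips(n, code_mode=True))
# 1 1 2
# 3 3 2
# 5 5 2
```

Under these assumptions code mode only pulls ahead once a task chains three or more calls, which matches the advice to measure your workload first.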

Auditability is harder. Code that composes five API calls in one execute looks like one tool invocation in your agent's top-level log. If you need granular per-call audit (compliance, cost attribution), instrument every api function from inside the sandbox and stream those events out separately. The agent log alone won't tell you what actually happened.

Code execution wins on token cost, loses on simplicity. Pick the trade based on your API surface area, not the headline percentage. For an internal API of 50 endpoints, dynamic toolsets or even just Tool Search will get you most of the way there with a fraction of the operational weight.

How This Composes With the Rest of Your MCP Stack

The pattern doesn't replace your existing MCP setup; it slots into it. Three integration points are worth knowing.

First, Atlassian's mcp-compressor (70-97% schema reduction) sits upstream of any approach. If you're running legacy MCP servers you can't rewrite, drop mcp-compressor in front and it shrinks the descriptions before the client sees them. It stacks with code mode on the servers you do control.

Second, the MCP spec proposal SEP-1576 (Mitigating Token Bloat in MCP) is in active discussion. The protocol may move toward optional code-mode hints in a future major version. Worth tracking if you're building MCP servers today; the patterns you adopt now should still hold once the spec catches up.

Third, if you're running a mixed setup (one code-mode internal server plus 2-3 third-party MCP servers), the right Claude Code config is alwaysLoad: true on the small servers, alwaysLoad left unset on the code-mode server, and Tool Search handling the rest automatically. Cost-track that setup with the JSONL pipeline from Claude Code Cost Tracking so you can verify the savings actually land in your billing.

Code execution is one tool in the kit, not the kit. For my own MCP work, I still ship classic-shape servers like the Jenkins MCP and WordPress MCP because those APIs are small and the simplicity wins. I reach for code mode when the API surface is genuinely huge or workflows naturally chain calls. Pick the pattern your problem actually has.