AI Development

Gemini 3.5 Flash for Agentic Coding: A Claude Coder's Guide

Gemini 3.5 Flash is Google's new Flash-tier coding model, generally available since May 19, 2026. It scores 76.2% on Terminal-Bench 2.1 and 83.6% on MCP Atlas, beating Gemini 3.1 Pro on 11 of 15 benchmarks. Pricing is $1.50 input and $9 output per 1M tokens. For Claude Code users, it's the right model for tool-heavy agent loops, not a replacement for production code edits.

May 25, 2026
-
12 min read
-Last updated: 2026-05-25
Gemini 3.5 FlashAgentic CodingMCPClaude CodeGoogle AICost Optimization
TL;DR
  • What it is: Gemini 3.5 Flash (GA May 19, 2026) is a Flash-tier model that outperforms Gemini 3.1 Pro on agentic benchmarks while costing 25% less per token than the Pro tier.
  • Pricing reality: $1.50/$9 per 1M tokens looks cheap, but it's 3x the price of Gemini 3 Flash Preview and runs about 5.5x more expensive per full benchmark suite according to Artificial Analysis.
  • The thinking_level trap: the default dropped from high to medium. Copy-pasted code from gemini-3-flash-preview silently produces dumber outputs. For agentic coding, set thinking_level: "low" explicitly.
  • Where Flash wins: MCP tool orchestration (83.6% MCP Atlas, beats Claude Opus 4.7 by 4.5 points), parallel function calling, fast iterative agent loops.
  • Where Claude Code still wins: production codebase editing (Sonnet 4.6 leads SWE-Bench Verified), defensive code, long-context retrieval past 128k tokens.
  • Routing rule: keep Claude Code for Edit and Write tasks; route MCP-heavy planning and tool fan-out to Gemini 3.5 Flash via OpenRouter or a thin custom MCP server.

What is Gemini 3.5 Flash and What Changed on May 19, 2026

Gemini 3.5 Flash is a Flash-tier Gemini model that Google announced at I/O 2026 and shipped straight to GA on the same day. It is the first Flash-tier model to outperform the previous Pro tier on real agentic coding benchmarks. The launch lives on the official Google blog and the technical details on the Google DeepMind model card.

The model is available on the Gemini API, AI Studio, Antigravity CLI (the successor to Gemini CLI), Vertex AI, the Gemini app, AI Mode in Search, and now GitHub Copilot per the May 19 changelog. The context window is 1,048,576 input tokens with a 65,536 output cap.

Why this matters for a Claude Code user: the cheap model is now smart enough to handle production agent loops. That changes routing math, not loyalty. If you already run Sonnet 4.6 or Opus 4.7 inside Claude Code, you don't throw the stack away. You ask which subtasks now belong on a cheaper, faster Gemini call. The rest of this guide walks through how to make that call with real numbers.

Gemini 3.5 Flash Benchmarks: Where It Beats Gemini 3.1 Pro

Gemini 3.5 Flash wins 11 of 15 published benchmarks against Gemini 3.1 Pro, including the ones that matter most for agentic coding. The headline numbers from the Google DeepMind model card and the WaveSpeed roundup are below.

BenchmarkGemini 3.5 FlashGemini 3.1 ProClaude Opus 4.7GPT-5.5
Terminal-Bench 2.176.2%70.3%n/a78.2%
MCP Atlas83.6%78.2%79.1%75.3%
GDPval-AA (Elo)16561314n/a1769
SWE-Bench Pro55.1%n/a64.3%n/a
ARC-AGI-272.1%~77%n/a84.6%
128k retrievalregressed (-7.6 pts vs 3.1 Pro)baselinestrongstrong

The single most important number on that table for Claude Code users is the 83.6% MCP Atlas score. MCP Atlas measures how reliably a model chains multi-step tool calls without stalling on a malformed or out-of-order call. For anyone running an MCP-heavy stack, that score predicts task-completion rate more directly than SWE-bench does. The current Flash score beats Claude Opus 4.7 by 4.5 points and GPT-5.5 by 8.3 points.

The honest other side: Gemini 3.5 Flash regresses 7.6 points on 128k-token retrieval versus Gemini 3.1 Pro, and gives up 5 points on ARC-AGI-2 versus the prior Pro tier (12.5 points to GPT-5.5). If you have a million-token context refactor, or a problem that looks like ARC-style abstract reasoning, Flash is the wrong answer. Claude Code with Sonnet 4.6 or Gemini 3.1 Pro is a better fit for those workloads today.

Gemini 3.5 Flash Pricing: Cheap per Token, Expensive per Task

Gemini 3.5 Flash is $1.50 per 1M input tokens, $9 per 1M output tokens, and $0.15 per 1M cached input tokens (see OpenRouter for live pricing). On its face the Flash tier looks cheap. Per task it is not.

Simon Willison's May 19, 2026 analysis cites Artificial Analysis benchmark-suite costs: running their full evaluation cost $1,551.60 on Gemini 3.5 Flash versus $892.28 on Gemini 3.1 Pro. Cheaper per token, more expensive per workload, because thinking tokens persist across turns and agent loops chew more output tokens. NxCode reports a similar multiplier on their own eval workload: roughly 9x the cost of gemini-3-flash on equivalent jobs ($1,552 vs $278).

The pricing comparison that matters for routing:

ModelInput ($/1M)Output ($/1M)Cached input ($/1M)
Gemini 3.5 Flash$1.50$9.00$0.15
Gemini 3.1 Pro$2.50$15.00-
Gemini 3 Flash Preview (deprecated)$0.50$3.00-
Claude Sonnet 4.6$3.00$15.00$0.30
Claude Opus 4.7$5.00$25.00$0.50
GPT-5.5$1.25$10.00-

One trap to call out before the next section. GitHub Copilot launched Gemini 3.5 Flash with a 14x premium-request multiplier (GitHub Changelog, May 19 2026). A 300-request Copilot Pro quota becomes about 21 Flash calls before overage. If you already have Claude Code and an OpenRouter or AI Studio API key, calling Flash directly at roughly $0.015 per call is almost always cheaper than burning Copilot quota.

The thinking_level Default Trap That Breaks Copy-Pasted Code

Google replaced the integer thinking_budget parameter with a string enum thinking_level and quietly dropped the default from high to medium. Code copy-pasted from gemini-3-flash-preview still runs, but it produces measurably worse outputs unless you set the new field. The official notes live on Google AI Developers - What's new in Gemini 3.5.

The four values are minimal, low, medium (new default), and high. Google retuned low specifically for coding and tool-calling workloads. For agent loops with MCP tools, thinking_level: "low" is faster, cheaper, and on coding benchmarks roughly equivalent to medium. For hard reasoning, set high.

Before and after diff

pythonagent.py (broken silently after migration)
# Before - gemini-3-flash-preview
from google import genai
from google.genai import types

config = types.GenerateContentConfig(
    thinking_config=types.ThinkingConfig(thinking_budget=-1),  # was "dynamic" / high
    temperature=0.2,                                            # ignored by 3.5
    top_p=0.95,                                                 # ignored by 3.5
)
pythonagent.py (correct for gemini-3.5-flash)
# After - gemini-3.5-flash, explicit and tuned for agent loops
from google import genai
from google.genai import types

config = types.GenerateContentConfig(
    thinking_config=types.ThinkingConfig(thinking_level="low"),  # for MCP agent loops
    # for hard reasoning tasks, use thinking_level="high"
    # for latency-sensitive work, use thinking_level="minimal"
)

Two cleanup notes from the migration. temperature, top_p, and top_k are no longer recommended controls in the new SDK profile. Leaving them in your config is not an error, but they are silently ignored - delete them so the next reader of your code doesn't assume they still work. And inspect response.usage_metadata on your first run: thinking tokens now persist across multi-turn conversations, and the per-task token count for an agent loop can climb 30 to 50 percent versus the preview model.

Gemini 3.5 Flash vs Claude Code (Sonnet 4.6, Opus 4.7) for Coding

The short version: Flash wins agent orchestration and MCP tool chains. Claude Code wins repo-level edits and defensive code generation. Pick by task, not by model loyalty.

Task typeBest modelReason
MCP tool orchestration, parallel function callingGemini 3.5 Flash83.6% MCP Atlas, ~289 tok/sec, $1.50 input
Multi-file refactor in a real repoClaude Sonnet 4.6 in Claude CodeDefault Claude Code model; strong SWE-Bench Verified
ARC-style abstract reasoningClaude Opus 4.7 or GPT-5.5Flash gives up 5 pts ARC-AGI-2 vs prior Pro
Long-context retrieval beyond 128kGemini 3.1 Pro or Sonnet 4.6 (1M ctx)Flash regresses 7.6 pts on 128k retrieval
Cheap intermediate planning inside an agentGemini 3.5 FlashCached input at $0.15/1M is the lowest among frontier models
Production code review with defensive patchesClaude Sonnet 4.6Anthropic models add error handling more naturally

The defensive-code observation isn't hand-wavy. Multiple head-to-head reviews this month converge on the same pattern. MindStudio and BuildFastWithAI both report that Claude Opus 4.7 anticipates edge cases and adds error handling more naturally, while Gemini 3.5 Flash produces more concise code that occasionally skips defensive patterns. That maps to my own experience: I trust Sonnet 4.6 to write production patches; I lean on Flash to coordinate the 30 tool calls that fetch the inputs.

When to Route Tasks from Claude Code to Gemini 3.5 Flash

My default: I keep Claude Code with Sonnet 4.6 as the editor for anything that touches the repo. The Edit, Write, Glob, and Grep tools stay where they are. That is the production path and it doesn't need a different model today.

Where I route to Gemini 3.5 Flash is the supporting cast of tasks around the editor:

  • MCP-heavy planning subtasks where an agent fans out 10 to 100 tool calls to query an API, hit a database, or coordinate with another agent. The 83.6% MCP Atlas score shows up here as fewer retries and fewer stalled tool calls.
  • Long-running background tasks where speed beats defensive depth: linting summaries, log triage, doc generation, scheduled cron-style agents. Flash's ~289 tok/sec output throughput is roughly 4x what Opus 4.7 delivers.
  • Cheap intermediate planning steps inside a larger agent loop where Sonnet 4.6 is overkill. Use Flash to pick which tool to call next, then hand control back to Sonnet for the actual code change.
  • Parallel sub-agent fan-out like the 93 parallel agents in Antigravity's demo described in the NxCode developer guide. Cached input pricing at $0.15/1M makes the fan-out economically viable.

Three ways I actually route

The mechanics matter. These are the three patterns I use, in order of how often I reach for them.

  1. OpenRouter as a routing proxy. Configure Claude Code or any Claude SDK call to dispatch specific tool calls to google/gemini-3.5-flash on OpenRouter. You keep one API key, one billing surface, and you can swap models without code changes. The OpenRouter provider page lists current pricing and provider status.
  2. A thin custom MCP server that wraps client.models.generate_content with gemini-3.5-flash as an exposed tool, then mount it inside Claude Code via ~/.claude.json. The MCP code execution pattern post covers how that wiring works in practice.
  3. Antigravity CLI for hybrid teams. If your team already migrated from Gemini CLI per the Antigravity migration guide, Flash is the default model behind agy. Use Antigravity for parallel agents and keep Claude Code as your primary editor.

Build an MCP Agent with Gemini 3.5 Flash in 40 Lines of Python

The Google GenAI SDK has native MCP support. You hand the SDK a connected MCP ClientSession, and it auto-executes tool calls and feeds the responses back to the model in a loop until the agent finishes. The official reference lives on Google AI Developers - Function calling.

Install the SDKs

bashterminal
pip install "google-genai>=2.0" "mcp>=1.4"
export GEMINI_API_KEY="your-key-from-aistudio"

Working agent example

The script below connects to an MCP server, hands the session to Gemini 3.5 Flash with thinking_level="low", and runs a real triage prompt. Replace your_mcp_server with the module path to whatever MCP server you already run.

pythonmcp_agent.py
import asyncio
from google import genai
from google.genai import types
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client


async def main() -> None:
    server = StdioServerParameters(
        command="python",
        args=["-m", "your_mcp_server"],
    )

    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            client = genai.Client()
            response = await client.aio.models.generate_content(
                model="gemini-3.5-flash",
                contents=(
                    "Triage the 5 most recent open PRs in this repo. "
                    "For each, return: PR number, risk score (low/med/high), "
                    "and a one-line reason. Use the tools available."
                ),
                config=types.GenerateContentConfig(
                    thinking_config=types.ThinkingConfig(thinking_level="low"),
                    tools=[session],  # SDK auto-executes MCP tool calls
                ),
            )

            print(response.text)
            print(response.usage_metadata)


if __name__ == "__main__":
    asyncio.run(main())

Why every choice is what it is

  • thinking_level="low": Google retunedlow for code and tool-calling. It is faster, cheaper, and on coding benchmarks comparable to medium. The default medium would quietly inflate cost without improving the tool-call sequence.
  • tools=[session]: the SDK accepts an MCP ClientSession directly. It introspects the server's tool list, calls each tool when the model requests it, matches the FunctionResponse by id and name, and continues the loop until the model stops asking for tool calls.
  • response.usage_metadata: log this on every run. Inspect ThoughtsTokenCount. Thinking tokens persist across turns and can inflate input costs 30 to 50 percent on long agent loops, per the NxCode developer guide.
  • No temperature, no top_p: these parameters are silently ignored in Gemini 3.5. Leaving them in your config will confuse the next person to read it.

Gemini 3.5 Flash in Antigravity, GitHub Copilot, and the Raw API

Flash ships across four meaningful surfaces. The right one depends on what you already pay for and how you build.

SurfaceCost modelBest for
Raw Gemini API$1.50 / $9 per 1M (cached $0.15)Custom agents, MCP servers, routing layers
Antigravity CLI (agy)Free weekly cap, Pro $19.99/mo, Ultra $249.99/moHybrid teams on Google's stack
GitHub Copilot14x premium-request multiplierExisting Copilot users with light volume
OpenRouter$1.50 / $9 per 1M + small markupRouting inside Claude Code or multi-model proxies

Sources for the table: the Antigravity pricing page, the GitHub Copilot changelog, and the OpenRouter provider page.

One opinionated note: for a Claude Code user with even one active OpenRouter or AI Studio key, raw API plus OpenRouter is almost always cheaper than burning Copilot quota at the 14x multiplier. If you don't already pay for Copilot, the decision is easy. If you do, do the math once on your own workload before changing anything.

Limitations and Gotchas

The honest list. None of these are deal-breakers, but each one is worth knowing before you swap an existing agent over.

  • No Computer Use yet. Flash doesn't drive a browser. For browser-driving agents, use a Pro-tier Gemini or Claude with Computer Use.
  • Knowledge cutoff January 2025. Tool-augmented prompts and web search are the standard workarounds for fresh facts.
  • Text-only output. Multimodal input works. Output is text only - no image or audio generation.
  • 128k retrieval regressed. If you have million-token contexts and need exact-recall retrieval at scale, Sonnet 4.6 with its 1M context or Gemini 3.1 Pro are stronger picks.
  • Thought-token inflation. Thinking tokens persist across multi-turn conversations and can inflate input costs 30 to 50 percent on agent loops. Track ThoughtsTokenCount from response.usage_metadata.
  • thinking_level: medium is the silent default. Set it explicitly in every config. The previous high default is gone.
  • TPU capacity hiccups. Multiple developers reported 503 errors during the first week, with one comment on the Hacker News launch thread flagging frequent stalls. Build retry-with-backoff into any production caller.

Frequently Asked Questions

Related Reading