Running claude ultrareview in CI/CD: GitHub Actions Guide
Claude Code 2.1.120 (April 28, 2026) added claude ultrareview [target] as a non-interactive CLI subcommand that runs the cloud multi-agent review from CI scripts. It prints findings to stdout, supports --json for raw output, and exits 0 on completion or 1 on failure. Wiring it into GitHub Actions takes one OAuth token the standard action doesn't ship.
claude ultrareviewis the non-interactive form of/ultrareview. Same multi-agent review, same 5 to 10 minute runtime, same $5 to $20 per run.- Authentication is the trap. Ultrareview needs a Claude.ai OAuth token (
CLAUDE_CODE_OAUTH_TOKEN), not an API key. Generate one withclaude setup-tokenon a Pro or Max account. - Use
--jsonto pipebugs.jsonthroughjqand post a severity-tagged PR comment. Exit 1 fails the build; exit 0 passes regardless of findings. - For Team and Enterprise orgs that want every PR reviewed without writing YAML, the managed Code Review product is a better fit. The CLI subcommand wins on flexibility and self-hosted runners.
What claude ultrareview Does in CI
claude ultrareview [target] launches the same fleet of reviewer agents that power the interactive /ultrareview slash command. The review runs in a remote Anthropic sandbox, multiple agents explore the diff in parallel, each finding is independently reproduced and verified, and the result lands back as a list of bugs roughly 5 to 10 minutes later. The official docs describe ultrareview as offering "higher signal" and "broader coverage" than the local /review at the cost of a longer wait and a per-run charge.
What changed on April 28, 2026 is that this review now runs without an interactive Claude Code session. Per the v2.1.120 changelog, Anthropic added a CLI subcommand that "runs /ultrareview non-interactively from CI or scripts, prints findings to stdout (--json for raw output) and exits 0 on completion or 1 on failure." Same review, no terminal, no human keystroke required.
The subcommand takes three target forms:
# Diff between current branch and the default branch
claude ultrareview
# A specific GitHub PR (clones from github.com remote)
claude ultrareview 1234
# Diff against an arbitrary base branch
claude ultrareview origin/mainTwo flags matter for automation. --json swaps the human formatting for the raw bugs.json payload, which is what you actually want for jq downstream. --timeout takes a value in minutes and defaults to 30, which is also the wall clock you should bound your CI job at. Output rules also matter: progress and the live session URL go to stderr; findings go to stdout. That keeps stdout parseable.
Exit codes are tight and correct. Exit 0 means the review finished, including the case where no bugs were found. Exit 1 means the review failed to launch, the remote session errored, or the timeout elapsed. Exit 130 means you (or your runner) hit Ctrl-C, but the remote review keeps running anyway, so follow the session URL on stderr if you want to recover the result. None of these codes equate to a build pass or fail by themselves; you decide what counts as "blocking" by parsing the JSON.
The Claude.ai OAuth Token Gotcha (Read This First)
Ultrareview will not run with an Anthropic API key. The official docs spell it out: "Ultrareview requires authentication with a Claude.ai account because it runs on Claude Code on the web infrastructure. If you are signed in with an API key only, run /login and authenticate with Claude.ai first." That kills the usual GitHub Actions pattern of pasting ANTHROPIC_API_KEY into a secret and calling anthropics/claude-code-action@v1. The action authenticates with API keys; ultrareview rejects them.
The same docs page also rules out the cloud-provider routes you might reach for as a workaround: "Ultrareview is not available when using Claude Code with Amazon Bedrock, Google Cloud Vertex AI, or Microsoft Foundry, and it is not available to organizations that have enabled Zero Data Retention." So OIDC via Bedrock or Vertex AI, which I cover for the wider hardening picture in Hardening Claude Code GitHub Actions, is off the table for ultrareview specifically.
The fix is a long-lived OAuth token. On a Pro or Max account, run the following on your laptop:
# Pro/Max account, browser flow
claude setup-token
# Output (truncated):
# Visit https://claude.ai/oauth/authorize?... in your browser
# After approval, paste the code shown:
# > ********
# Token issued (valid 1 year):
# sk-ant-oat01-***The token format is sk-ant-oat01-..., distinct from the Console sk-ant-api03-... API keys. It's valid for one year, displayed once, and can't be retrieved again from any UI. Copy it straight into a GitHub repository secret named CLAUDE_CODE_OAUTH_TOKEN via Settings → Secrets and variables → Actions. Treat it like any other production secret: never log it, never commit it, don't put it in an org-wide variable accessible to every repo.
Two operational notes most teams miss. First, the token bills against the human account's extra usage allowance, so most teams use a shared ci-bot@company.com Pro account dedicated to CI. Document who owns the credentials and the recovery email in your runbook before you wire the workflow up. Second, there is no in-console rotation UI for OAuth tokens. Set a calendar reminder for day 350 to run claude setup-token again and update the secret yourself before the year-mark.
If you skip this section and try to use ANTHROPIC_API_KEY anyway, the failure surfaces as the CLI dropping you into a /login prompt that expects a browser, which then errors out in the headless runner. That's the most common reason first-time CI integrations fail on the very first run.
The Minimum Viable GitHub Actions Workflow
Below is the smallest workflow that does something useful: it triggers on PR open and ready-for-review (skipping every push), installs a pinned CLI, runs claude ultrareview --json against the PR number, and stashes the findings as a build artifact. Drop it at .github/workflows/ultrareview.yml and it just works.
name: Claude Ultrareview
on:
pull_request:
types: [opened, ready_for_review]
permissions:
pull-requests: write
contents: read
jobs:
ultrareview:
if: github.event.pull_request.draft == false
runs-on: ubuntu-latest
timeout-minutes: 25
steps:
- uses: actions/checkout@v4
with:
fetch-depth: 0
- uses: actions/setup-node@v4
with:
node-version: "20"
- name: Install Claude Code CLI (pinned)
run: npm install -g @anthropic-ai/claude-code@2.1.120
- name: Verify install
run: claude --version
- name: Run claude ultrareview
env:
CLAUDE_CODE_OAUTH_TOKEN: ${{ secrets.CLAUDE_CODE_OAUTH_TOKEN }}
run: |
claude ultrareview --json --timeout 20 \
"${{ github.event.pull_request.number }}" > bugs.json
- name: Upload findings artifact
uses: actions/upload-artifact@v4
with:
name: ultrareview-findings
path: bugs.json
retention-days: 14A few choices in here are deliberate, not stylistic. fetch-depth: 0 grabs full git history so the bundled-tree path works on tiny PRs where shallow clones miss the merge base. The version pin (@2.1.120) matters because the subcommand is brand new and the CLI surface is actively moving; I covered the broader case for pinning in Regression-Proof Claude Code Workflows. The timeout-minutes: 25 at the job level paired with --timeout 20 on the subcommand gives the runner a five minute buffer to tear down on a stuck review.
The skip rule (if: github.event.pull_request.draft == false) is a single line that prevents you from spending $12 reviewing a draft you'll rewrite three times before it's ready. Combined with types: [opened, ready_for_review], the review fires once per PR at the moment review actually makes sense.
The first run end-to-end takes 6 to 11 minutes for a typical PR: 30 to 45 seconds of npm install, then 5 to 10 minutes for the remote review, then a few seconds to upload. On a GitHub-hosted Linux runner that charges around $0.008 per minute on private repos, the runner cost sits at roughly $0.05. The Anthropic charge dominates by two orders of magnitude.
Parse --json Findings and Post a PR Comment
The artifact upload from the minimum workflow is fine for offline triage, but most teams want findings posted directly on the PR. The bugs.json payload is shaped roughly like this (truncated to the keys you actually use):
{
"session_url": "https://claude.ai/code/session_01H...",
"summary": "Reviewed 12 files, 480 changed lines.",
"findings": [
{
"severity": "important",
"file": "src/auth/session.ts",
"line": 142,
"title": "Token refresh races with logout",
"body": "On concurrent logout/refresh, the refresh handler can resurrect a session that was just torn down..."
},
{
"severity": "nit",
"file": "src/auth/session.ts",
"line": 88,
"title": "parseExpiry silently returns 0 on malformed input",
"body": "..."
}
]
}severity takes one of three values: important, nit, and pre_existing. Important is what you want to gate on. Nits are useful as feedback but should not fail a build. Pre-existing bugs are bugs the reviewer found in code that wasn't touched by the PR; they're informational.
A small jq script turns the payload into a Markdown comment grouped by severity, then gh pr comment posts it with the GitHub CLI that ships preinstalled on every runner:
#!/usr/bin/env bash
set -euo pipefail
BUGS=${1:-bugs.json}
PR=${PR_NUMBER:-$GITHUB_PR_NUMBER}
IMPORTANT=$(jq '[.findings[] | select(.severity=="important")] | length' "$BUGS")
NIT=$(jq '[.findings[] | select(.severity=="nit")] | length' "$BUGS")
SESSION=$(jq -r .session_url "$BUGS")
{
echo "## Claude ultrareview"
echo ""
echo "**$IMPORTANT important** | **$NIT nits** | [session log]($SESSION)"
echo ""
jq -r '.findings | sort_by(.severity) | .[] |
"### [\(.severity | ascii_upcase)] \(.file):\(.line)\n**\(.title)**\n\n\(.body)\n"' "$BUGS"
} > comment.md
gh pr comment "$PR" --body-file comment.md
# Fail the build only on Important findings
if [ "$IMPORTANT" -gt 0 ]; then
echo "::error::Found $IMPORTANT important issues"
exit 1
fiTwo design choices worth pulling out. First, the build only fails on important findings. That keeps the workflow useful as a warning channel without becoming the team's nemesis on every typo-grade nit. If you want to gate harder, raise the threshold; if your team wants Claude as advisory only, drop the exit 1 line entirely and let every run pass. Second, the comment posts even on green runs because gh pr comment runs before the exit-code check. That tells reviewers "ultrareview ran and found nothing" instead of leaving them to wonder.
Wire it into the workflow with two extra steps:
- name: Post findings as PR comment
env:
GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
PR_NUMBER: ${{ github.event.pull_request.number }}
run: bash scripts/post-ultrareview-comment.sh bugs.jsonOne safety note: model output is untrusted input the moment it lands on a PR. Don't pipe bugs.json into eval, don't feed it to a downstream agent without sanitization, and assume the comment body could contain Markdown that tries to confuse a downstream reviewer agent. The same threat model I cover in Hardening Claude Code GitHub Actions applies here. Treat the JSON as text, not as instructions.
Cost Control and Frequency Guards
Anthropic's ultrareview docs quote a per-review range of "roughly $5 to $20 per review as extra usage". In practice my own runs cluster around $10-$14 on PRs of 200-600 changed lines. Pro and Max accounts get three free runs that expire May 5, 2026, which is genuinely tight if you're reading this on publication day. Spend them validating the workflow against real PRs in your repo, not testing the YAML on scratch branches.
A simple cost table at three volumes:
| PRs/month | At $12 avg | At $18 avg |
|---|---|---|
| 50 | $600 | $900 |
| 100 | $1,200 | $1,800 |
| 200 | $2,400 | $3,600 |
Five frequency guards move the needle far more than picking a cheaper tool would. The first three live as if: conditions on the job; the last two need a small step to compute the gate.
jobs:
ultrareview:
# 1. Skip drafts
# 2. Skip Dependabot, Renovate, and other bots
# 3. Skip if a "skip-ultrareview" label is present
# 4. Optional: only run when reviewers add a "needs-deep-review" label
if: |
github.event.pull_request.draft == false &&
github.actor != 'dependabot[bot]' &&
github.actor != 'renovate[bot]' &&
!contains(github.event.pull_request.labels.*.name, 'skip-ultrareview') &&
contains(github.event.pull_request.labels.*.name, 'needs-deep-review')
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
with: { fetch-depth: 0 }
- name: Skip tiny PRs
id: size
env:
GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
run: |
changed=$(gh pr diff "${{ github.event.pull_request.number }}" --name-only | wc -l)
echo "changed=$changed" >> "$GITHUB_OUTPUT"
[ "$changed" -ge 5 ] || echo "Skipping: only $changed files changed"
- name: Run claude ultrareview
if: steps.size.outputs.changed != '' && steps.size.outputs.changed >= 5
env:
CLAUDE_CODE_OAUTH_TOKEN: ${{ secrets.CLAUDE_CODE_OAUTH_TOKEN }}
run: |
claude ultrareview --json --timeout 20 \
"${{ github.event.pull_request.number }}" > bugs.jsonWith the label gate, you go from "every PR" to "PRs where a reviewer flagged the change as substantive". That cuts volume by 60-80% in my experience without losing the high-impact reviews. The diff-size threshold catches the long tail of one-line-fix PRs that don't need a 10-minute multi-agent sweep. The bot-actor skip alone has saved real teams from posting an $18 ultrareview comment on a Dependabot version bump.
Set an org-wide spend cap at claude.ai/admin-settings/usage before turning the workflow on for the whole engineering team. A bad week of mass-rebases can otherwise turn into a four-figure surprise on the Anthropic invoice. The same cost tracking patterns I covered in Claude Code Cost Tracking apply here for per-run accounting on your side.
When to Use the Managed Code Review Product Instead
Anthropic ships a separate product called Code Review that's easy to confuse with the CLI subcommand because both run multi-agent reviews. They're not the same thing. Code Review is a managed GitHub App available only to Team and Enterprise subscriptions. It posts findings as inline review comments, ships a check run with a severity table, costs "$15-25 in cost, scaling with PR size" per run, and triggers on PR open, every push, or via @claude review comments. No YAML required on your side.
| Dimension | claude ultrareview (CLI) | Code Review (managed) |
|---|---|---|
| Plan required | Pro / Max / Team / Enterprise | Team / Enterprise only |
| Auth | OAuth token in CI secret | GitHub App, no token in your repo |
| Trigger | Whatever your YAML says | PR open / every push / @claude review |
| Inline diff comments | You write the post step | Posted automatically |
| Cost per review | $5 to $20 | $15 to $25 |
| Customization | Anything jq and gh can do | REVIEW.md for severity rules and skip paths |
| Self-hosted runners | Yes | No (runs on Anthropic infra) |
Pick the CLI subcommand if you have a Pro or Max account, you want full control over trigger logic, you run on self-hosted runners, you want findings routed to non-GitHub destinations like Slack or Linear, or you're cost-sensitive and want exactly one review per PR open. Pick the managed Code Review if you're on Team or Enterprise, you want zero YAML, you want inline diff annotations and severity-tagged check runs out of the box, and REVIEW.md covers your tuning needs. Both bill as extra usage on the same Anthropic invoice; budgeting and spend caps are configured per organization.
One subtle point: the managed Code Review's "every push" mode multiplies cost by the number of pushes, which is invisible until it isn't. The CLI subcommand fires exactly when your YAML says it should, no more, no less. If your team rewrites PRs heavily before merge, the CLI form is the cheaper structure even before you account for the higher per-run cost of the managed product.
GitLab CI and the June 1 Copilot Billing Change
GitLab CI works with one structural change: the claude ultrareview <PR-number> form is GitHub-only because it clones from a github.com remote. On GitLab, the no-arg form bundles your local working tree and reviews against the default branch, which is what you want for an MR job anyway.
ultrareview:
image: node:20-alpine
stage: review
rules:
- if: $CI_PIPELINE_SOURCE == "merge_request_event"
when: on_success
variables:
GIT_DEPTH: 0
before_script:
- apk add --no-cache git curl jq glab
- npm install -g @anthropic-ai/claude-code@2.1.120
- claude --version
script:
- claude ultrareview --json --timeout 20 > bugs.json
- |
IMPORTANT=$(jq '[.findings[] | select(.severity=="important")] | length' bugs.json)
jq -r '.findings | sort_by(.severity) | .[] |
"### [\(.severity | ascii_upcase)] \(.file):\(.line)\n\(.title)\n"' bugs.json > comment.md
glab mr note "$CI_MERGE_REQUEST_IID" --message "$(cat comment.md)"
[ "$IMPORTANT" -eq 0 ]
artifacts:
paths: [bugs.json]
when: alwaysStore CLAUDE_CODE_OAUTH_TOKEN as a masked, protected CI/CD variable under Settings → CI/CD → Variables. The glab CLI handles the MR comment posting; if you don't want the extra binary, the GitLab REST API works just as well with curl and a project-scoped token.
The timing on this whole topic is sharper than it looks. On April 27, 2026, GitHub announced that Copilot code review will start consuming GitHub Actions minutes on June 1, 2026 on private repos, alongside a broader move to credit-based billing that drew sharp pushback from developers ("you will get less but pay the same price", per Visual Studio Magazine). Teams that "got code review for free" until now have 28 days to evaluate alternatives.
The honest comparison: claude ultrareview on GitHub-hosted runners charges you both Anthropic for the review (~$12) and GitHub for the runner minutes (~$0.05 for a 6-minute Linux job). The Anthropic charge dominates by orders of magnitude; the runner cost is a rounding error. The Copilot billing concern is structural for any CI-hosted AI reviewer, including this one. The right cost lever is the frequency guards from the previous section, not switching providers.
Known Edges and the Hardening Checklist
A few rough edges are worth knowing about before you turn this on org-wide.
Empty findings on huge PRs. Issue #50029 reports an empty findings: [] array on PRs with several thousand changed files. The workaround is to either split the PR or use the no-arg form which bundles local state instead of cloning the PR. If your team routinely opens 1000-file PRs, ultrareview probably isn't the right tool for that PR class anyway.
Repository too large to bundle. The docs note that Claude Code prompts for PR mode if your bundled tree exceeds the upload size. Push your branch, open a draft PR, then run claude ultrareview <PR-number> so the remote sandbox clones from github.com directly.
Free runs expire May 5, 2026. Each Pro and Max account gets exactly three runs to evaluate the feature. Don't burn them on Dependabot PRs you weren't going to review anyway.
Cost surprise on busy weeks. A merge train with 30 substantive PRs in a day at $14 each is $420 nobody planned for. The spend cap setting linked from the cost-control section is the only real defense.
The 12-item production checklist I run through before turning the workflow on a new repo:
- Pin the CLI version (
@anthropic-ai/claude-code@2.1.120or later). - Use a dedicated Pro or Max account whose owner is documented in the runbook.
- Store
CLAUDE_CODE_OAUTH_TOKENas a repository secret, not an org-wide variable. - Restrict workflow
permissions:topull-requests: write, contents: readand nothing else. - Skip drafts and bot actors with
if:conditions. - Add a label gate or diff-size gate so cost only fires on substantive PRs.
- Pass
--timeout 20with a job-leveltimeout-minutes: 25to bound runner cost. - Pipe stdout to a file. Never trust the comment poster with raw model output (link to hardening guide).
- Fail the build only on
importantfindings, not nits. - Set an org-level extra-usage spend cap at
claude.ai/admin-settings/usage. - Set a calendar reminder to rotate the OAuth token at day 350.
- Subscribe the runbook owner to the Claude Code changelog for breaking changes to the subcommand surface.
None of these items are clever. They're the boring guardrails that turn a "cool, it works" demo into a workflow you can leave running for a year without surprises. Five days after v2.1.120 shipped, almost no one has these in writing yet. Yours can.
Frequently Asked Questions
Related Reading
The same PR comment poster used here inherits the Comment and Control threat model. Tool allowlists, OIDC, script caps, and the assembled hardened workflow.
Read the guideWhy the version pin in this workflow matters: Anthropic shipped three confirmed regressions in seven weeks. Pin the CLI, lock effort, allowlist models, fixture-test.
Read the guide