Running claude ultrareview in CI/CD: GitHub Actions Guide

Claude Code 2.1.120 (April 28, 2026) added claude ultrareview [target] as a non-interactive CLI subcommand that runs the cloud multi-agent review from CI scripts. It prints findings to stdout, supports --json for raw output, and exits 0 on completion or 1 on failure. Wiring it into GitHub Actions requires one thing the standard action setup doesn't provide: a Claude.ai OAuth token.

May 3, 2026 | 12 min read | Last updated: 2026-05-03
Tags: Claude Code, GitHub Actions, CI/CD, Code Review, DevOps
TL;DR
  • claude ultrareview is the non-interactive form of /ultrareview. Same multi-agent review, same 5 to 10 minute runtime, same $5 to $20 per run.
  • Authentication is the trap. Ultrareview needs a Claude.ai OAuth token (CLAUDE_CODE_OAUTH_TOKEN), not an API key. Generate one with claude setup-token on a Pro or Max account.
  • Use --json to get the raw bugs.json payload on stdout, then run it through jq and post a severity-tagged PR comment. Exit 1 fails the build; exit 0 means the review completed, regardless of findings.
  • For Team and Enterprise orgs that want every PR reviewed without writing YAML, the managed Code Review product is a better fit. The CLI subcommand wins on flexibility and self-hosted runners.

What claude ultrareview Does in CI

claude ultrareview [target] launches the same fleet of reviewer agents that power the interactive /ultrareview slash command. The review runs in a remote Anthropic sandbox, multiple agents explore the diff in parallel, each finding is independently reproduced and verified, and the result lands back as a list of bugs roughly 5 to 10 minutes later. The official docs describe ultrareview as offering "higher signal" and "broader coverage" than the local /review at the cost of a longer wait and a per-run charge.

What changed on April 28, 2026 is that this review now runs without an interactive Claude Code session. Per the v2.1.120 changelog, Anthropic added a CLI subcommand that "runs /ultrareview non-interactively from CI or scripts, prints findings to stdout (--json for raw output) and exits 0 on completion or 1 on failure." Same review, no terminal, no human keystroke required.

The subcommand takes three target forms:

bash: terminal
# Diff between current branch and the default branch
claude ultrareview

# A specific GitHub PR (clones from github.com remote)
claude ultrareview 1234

# Diff against an arbitrary base branch
claude ultrareview origin/main

Two flags matter for automation. --json swaps the human formatting for the raw bugs.json payload, which is what you want for jq downstream. --timeout takes a value in minutes and defaults to 30; set your CI job's wall-clock limit slightly above whatever value you pass. Output routing also matters: progress and the live session URL go to stderr, while findings go to stdout. That keeps stdout parseable.
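To make the stream split concrete, here is a minimal sketch of a wrapper that keeps stdout as the findings file and stderr as a separate log. The helper name run_split and the stand-in fake_review are hypothetical; in CI the wrapped command would be the real claude ultrareview invocation.

```shell
# run_split OUT_FILE LOG_FILE CMD [ARGS...]
# Runs CMD with stdout (findings) written to OUT_FILE and stderr
# (progress, session URL) written to LOG_FILE. Returns CMD's exit status.
run_split() {
  out_file=$1
  log_file=$2
  shift 2
  "$@" >"$out_file" 2>"$log_file"
}

# Stand-in for the real CLI so the sketch runs anywhere (assumption).
fake_review() {
  echo "progress: reviewing..." >&2
  echo '{"findings": []}'
}

tmp=$(mktemp -d)
run_split "$tmp/bugs.json" "$tmp/review.log" fake_review
cat "$tmp/bugs.json"   # only the JSON payload, safe to hand to jq
```

In a workflow step the same helper could be called as run_split bugs.json review.log claude ultrareview --json --timeout 20 "$PR_NUMBER", which keeps the session URL in review.log for later recovery.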

The exit codes are few and well defined. Exit 0 means the review finished, including the case where no bugs were found. Exit 1 means the review failed to launch, the remote session errored, or the timeout elapsed. Exit 130 means you (or your runner) hit Ctrl-C, but the remote review keeps running anyway, so follow the session URL on stderr if you want to recover the result. A nonzero exit fails the CI step by default, but exit 0 says nothing about whether the findings are serious; you decide what counts as "blocking" by parsing the JSON.
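A small sketch of how a script might turn those exit codes into readable log lines. The function name describe_exit is hypothetical, and any code other than 0, 1, or 130 is treated as unknown because the changelog text quoted above does not describe it.

```shell
# describe_exit CODE -> prints one word for the exit codes described above.
describe_exit() {
  case "$1" in
    0)   echo "completed" ;;    # review finished, with or without findings
    1)   echo "failed" ;;       # launch error, remote error, or timeout
    130) echo "interrupted" ;;  # Ctrl-C; the remote review keeps running
    *)   echo "unknown" ;;      # not covered by the documented codes
  esac
}

describe_exit 0
describe_exit 130
```

Because GitHub Actions runs bash steps with -e, capture the status with a pattern such as status=0; claude ultrareview --json > bugs.json || status=$? before calling describe_exit "$status", otherwise the step aborts before your handler runs.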

The Claude.ai OAuth Token Gotcha (Read This First)

Ultrareview will not run with an Anthropic API key. The official docs spell it out: "Ultrareview requires authentication with a Claude.ai account because it runs on Claude Code on the web infrastructure. If you are signed in with an API key only, run /login and authenticate with Claude.ai first." That kills the usual GitHub Actions pattern of pasting ANTHROPIC_API_KEY into a secret and calling anthropics/claude-code-action@v1. The action authenticates with API keys; ultrareview rejects them.

The same docs page also rules out the cloud-provider routes you might reach for as a workaround: "Ultrareview is not available when using Claude Code with Amazon Bedrock, Google Cloud Vertex AI, or Microsoft Foundry, and it is not available to organizations that have enabled Zero Data Retention." So OIDC via Bedrock or Vertex AI, which I cover for the wider hardening picture in Hardening Claude Code GitHub Actions, is off the table for ultrareview specifically.

The fix is a long-lived OAuth token. On a Pro or Max account, run the following on your laptop:

bash: terminal
# Pro/Max account, browser flow
claude setup-token

# Output (truncated):
# Visit https://claude.ai/oauth/authorize?... in your browser
# After approval, paste the code shown:
# > ********
# Token issued (valid 1 year):
# sk-ant-oat01-***

The token format is sk-ant-oat01-..., distinct from the Console sk-ant-api03-... API keys. It's valid for one year, displayed once, and can't be retrieved again from any UI. Copy it straight into a GitHub repository secret named CLAUDE_CODE_OAUTH_TOKEN via Settings → Secrets and variables → Actions. Treat it like any other production secret: never log it, never commit it, don't put it in an org-wide variable accessible to every repo.

Two operational notes most teams miss. First, the token bills against the human account's extra usage allowance, so most teams use a shared ci-bot@company.com Pro account dedicated to CI. Document who owns the credentials and the recovery email in your runbook before you wire the workflow up. Second, there is no in-console rotation UI for OAuth tokens. Set a calendar reminder for day 350 to run claude setup-token again and update the secret yourself before the year-mark.

If you skip this section and try to use ANTHROPIC_API_KEY anyway, the failure surfaces as the CLI dropping you into a /login prompt that expects a browser, which then errors out in the headless runner. That's the most common reason first-time CI integrations fail on the very first run.
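One way to catch the wrong credential before the review step runs is a prefix check, sketched below. It relies only on the two token formats mentioned above (sk-ant-oat01- and sk-ant-api03-); the function name check_oauth_token is hypothetical, and a matching prefix does not prove the token is still valid.

```shell
# check_oauth_token TOKEN
# Returns 0 if TOKEN looks like a Claude.ai OAuth token, 1 otherwise.
# Only the prefix is inspected; the token itself is never printed.
check_oauth_token() {
  case "$1" in
    "")
      echo "CLAUDE_CODE_OAUTH_TOKEN is empty" >&2
      return 1 ;;
    sk-ant-oat01-*)
      return 0 ;;
    sk-ant-api03-*)
      echo "This looks like a Console API key, not an OAuth token" >&2
      return 1 ;;
    *)
      echo "Unrecognized token format" >&2
      return 1 ;;
  esac
}

# Placeholder value for illustration only, not a real token.
check_oauth_token "sk-ant-oat01-example" && echo "token format ok"
```

Running check_oauth_token "$CLAUDE_CODE_OAUTH_TOKEN" as its own step gives a one-line failure instead of a browser login prompt in a headless runner.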

The Minimum Viable GitHub Actions Workflow

Below is the smallest workflow that does something useful: it triggers on PR open and ready-for-review (skipping every push), installs a pinned CLI, runs claude ultrareview --json against the PR number, and stashes the findings as a build artifact. Drop it at .github/workflows/ultrareview.yml; it runs as soon as the CLAUDE_CODE_OAUTH_TOKEN secret from the previous section is in place.

yaml: .github/workflows/ultrareview.yml
name: Claude Ultrareview

on:
  pull_request:
    types: [opened, ready_for_review]

permissions:
  pull-requests: write
  contents: read

jobs:
  ultrareview:
    if: github.event.pull_request.draft == false
    runs-on: ubuntu-latest
    timeout-minutes: 25
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - uses: actions/setup-node@v4
        with:
          node-version: "20"

      - name: Install Claude Code CLI (pinned)
        run: npm install -g @anthropic-ai/claude-code@2.1.120

      - name: Verify install
        run: claude --version

      - name: Run claude ultrareview
        env:
          CLAUDE_CODE_OAUTH_TOKEN: ${{ secrets.CLAUDE_CODE_OAUTH_TOKEN }}
        run: |
          claude ultrareview --json --timeout 20 \
            "${{ github.event.pull_request.number }}" > bugs.json

      - name: Upload findings artifact
        uses: actions/upload-artifact@v4
        with:
          name: ultrareview-findings
          path: bugs.json
          retention-days: 14

A few choices in here are deliberate, not stylistic. fetch-depth: 0 grabs full git history so the bundled-tree path works on tiny PRs where shallow clones miss the merge base. The version pin (@2.1.120) matters because the subcommand is brand new and the CLI surface is actively moving; I covered the broader case for pinning in Regression-Proof Claude Code Workflows. The timeout-minutes: 25 at the job level paired with --timeout 20 on the subcommand gives the runner a five minute buffer to tear down on a stuck review.

The skip rule (if: github.event.pull_request.draft == false) is a single line that prevents you from spending $12 reviewing a draft you'll rewrite three times before it's ready. Combined with types: [opened, ready_for_review], the review fires once per PR at the moment review actually makes sense.

The first run end-to-end takes 6 to 11 minutes for a typical PR: 30 to 45 seconds of npm install, then 5 to 10 minutes for the remote review, then a few seconds to upload. On a GitHub-hosted Linux runner that charges around $0.008 per minute on private repos, the runner cost sits at roughly $0.05. The Anthropic charge dominates by two orders of magnitude.
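The runner-cost arithmetic above can be reproduced with a one-line helper. This is a sketch using the $0.008 per minute rate quoted in the paragraph; runner_cost is a hypothetical name, and awk is used because POSIX shell arithmetic is integer-only.

```shell
# runner_cost MINUTES RATE_PER_MINUTE -> dollars, rounded to two decimals
runner_cost() {
  awk -v m="$1" -v r="$2" 'BEGIN { printf "%.2f\n", m * r }'
}

runner_cost 6 0.008    # prints 0.05 (6 min x $0.008 = $0.048)
runner_cost 11 0.008   # prints 0.09 (11 min x $0.008 = $0.088)
```

Even at the 11-minute end of the range the runner cost stays under ten cents, which is why the guards later in this guide focus on the number of reviews rather than runner minutes.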

Parse --json Findings and Post a PR Comment

The artifact upload from the minimum workflow is fine for offline triage, but most teams want findings posted directly on the PR. The bugs.json payload is shaped roughly like this (truncated to the keys you actually use):

json: bugs.json
{
  "session_url": "https://claude.ai/code/session_01H...",
  "summary": "Reviewed 12 files, 480 changed lines.",
  "findings": [
    {
      "severity": "important",
      "file": "src/auth/session.ts",
      "line": 142,
      "title": "Token refresh races with logout",
      "body": "On concurrent logout/refresh, the refresh handler can resurrect a session that was just torn down..."
    },
    {
      "severity": "nit",
      "file": "src/auth/session.ts",
      "line": 88,
      "title": "parseExpiry silently returns 0 on malformed input",
      "body": "..."
    }
  ]
}

severity takes one of three values: important, nit, and pre_existing. Important is what you want to gate on. Nits are useful as feedback but should not fail a build. Pre-existing bugs are bugs the reviewer found in code that wasn't touched by the PR; they're informational.

A small jq script turns the payload into a Markdown comment grouped by severity, then gh pr comment posts it with the GitHub CLI, which ships preinstalled on GitHub-hosted runners (self-hosted runners need it installed separately):

bash: scripts/post-ultrareview-comment.sh
#!/usr/bin/env bash
set -euo pipefail

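# Path to the findings file; defaults to bugs.json in the working directory.
BUGS=${1:-bugs.json}
# PR_NUMBER comes from the workflow step's env block. GitHub does not set a
# GITHUB_PR_NUMBER variable, so fail with a clear message if it is missing.
PR=${PR_NUMBER:?PR_NUMBER must be set}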

IMPORTANT=$(jq '[.findings[] | select(.severity=="important")] | length' "$BUGS")
NIT=$(jq '[.findings[] | select(.severity=="nit")] | length' "$BUGS")
SESSION=$(jq -r .session_url "$BUGS")

{
  echo "## Claude ultrareview"
  echo ""
  echo "**$IMPORTANT important** | **$NIT nits** | [session log]($SESSION)"
  echo ""
  jq -r '.findings | sort_by(.severity) | .[] |
    "### [\(.severity | ascii_upcase)] \(.file):\(.line)\n**\(.title)**\n\n\(.body)\n"' "$BUGS"
} > comment.md

gh pr comment "$PR" --body-file comment.md

# Fail the build only on Important findings
if [ "$IMPORTANT" -gt 0 ]; then
  echo "::error::Found $IMPORTANT important issues"
  exit 1
fi

Two design choices worth pulling out. First, the build only fails on important findings. That keeps the workflow useful as a warning channel without becoming the team's nemesis on every typo-grade nit. If you want to gate harder, raise the threshold; if your team wants Claude as advisory only, drop the exit 1 line entirely and let every run pass. Second, the comment posts even on green runs because gh pr comment runs before the exit-code check. That tells reviewers "ultrareview ran and found nothing" instead of leaving them to wonder.
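If you expect to switch between blocking and advisory behavior, one option is to make the gate a parameter instead of editing the script each time. The sketch below is an assumption about how that could look; gate_status and the mode names blocking and advisory are hypothetical.

```shell
# gate_status MODE IMPORTANT_COUNT
# MODE=blocking: return 1 when there is at least one important finding.
# MODE=advisory: always return 0, so the run never fails the build.
gate_status() {
  case "$1" in
    advisory) return 0 ;;
    blocking) [ "$2" -eq 0 ] ;;
    *) echo "unknown mode: $1" >&2; return 2 ;;
  esac
}

gate_status blocking 0 && echo "pass"
gate_status blocking 2 || echo "fail: important findings present"
gate_status advisory 2 && echo "pass (advisory)"
```

The mode could then come from a repository variable, so a team can start in advisory mode and tighten later without touching the script.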

Wire it into the workflow with one extra step:

yaml: .github/workflows/ultrareview.yml (additions)
      - name: Post findings as PR comment
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          PR_NUMBER: ${{ github.event.pull_request.number }}
        run: bash scripts/post-ultrareview-comment.sh bugs.json

One safety note: model output is untrusted input the moment it lands on a PR. Don't pipe bugs.json into eval, don't feed it to a downstream agent without sanitization, and assume the comment body could contain Markdown that tries to confuse a downstream reviewer agent. The same threat model I cover in Hardening Claude Code GitHub Actions applies here. Treat the JSON as text, not as instructions.
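As a first-pass illustration of treating the JSON as text, the sketch below filters the payload before anything is posted: it keeps only the three known severities, caps field lengths, and breaks up @mentions. The function name sanitize_findings, the length caps, and the mention handling are all assumptions, and this is not a complete defense against prompt injection.

```shell
# sanitize_findings FILE -> filtered JSON on stdout (requires jq)
sanitize_findings() {
  jq '
    .findings |= [
      .[]?
      | objects
      | select(.severity == "important" or .severity == "nit" or .severity == "pre_existing")
      | {
          severity,
          file:  (.file  // "" | tostring | .[0:200]),
          line:  (.line  | tonumber? // 0),
          title: (.title // "" | tostring | .[0:200]),
          # Cap the body and defuse @mentions so the comment pings nobody.
          body:  (.body  // "" | tostring | .[0:2000] | gsub("@"; "@ "))
        }
    ]
  ' "$1"
}

# Tiny sample payload for illustration.
sample=$(mktemp)
printf '%s' '{"findings":[{"severity":"nit","file":"a.ts","line":3,"title":"t","body":"cc @admin"}]}' > "$sample"
sanitize_findings "$sample"
```

Running the comment script against the sanitized output rather than the raw file keeps unexpected severities and oversized bodies out of the PR thread.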

Cost Control and Frequency Guards

Anthropic's ultrareview docs quote a per-review range of "roughly $5 to $20 per review as extra usage". In practice my own runs cluster around $10-$14 on PRs of 200-600 changed lines. Pro and Max accounts get three free runs that expire May 5, 2026, which is genuinely tight if you're reading this on publication day. Spend them validating the workflow against real PRs in your repo, not testing the YAML on scratch branches.

A simple cost table at three volumes:

| PRs/month | At $12 avg | At $18 avg |
|-----------|------------|------------|
| 50        | $600       | $900       |
| 100       | $1,200     | $1,800     |
| 200       | $2,400     | $3,600     |
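The table is plain multiplication, so it is easy to rerun with your own volume and average. A minimal sketch, with monthly_cost as a hypothetical helper name:

```shell
# monthly_cost PRS_PER_MONTH AVG_DOLLARS_PER_REVIEW -> whole dollars
monthly_cost() {
  echo $(( $1 * $2 ))
}

for prs in 50 100 200; do
  echo "$prs PRs/month: \$$(monthly_cost "$prs" 12) at \$12 avg, \$$(monthly_cost "$prs" 18) at \$18 avg"
done
```

Swap in your own average once a few real runs have shown where your PRs land in the $5 to $20 range.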

Five frequency guards move the needle far more than picking a cheaper tool would. The first four live in the job-level if: condition; the fifth needs a small step to compute the gate.

yaml: .github/workflows/ultrareview.yml (frequency guards)
jobs:
  ultrareview:
    # 1. Skip drafts
    # 2. Skip Dependabot, Renovate, and other bots
    # 3. Skip if a "skip-ultrareview" label is present
    # 4. Optional: only run when reviewers add a "needs-deep-review" label
    if: |
      github.event.pull_request.draft == false &&
      github.actor != 'dependabot[bot]' &&
      github.actor != 'renovate[bot]' &&
      !contains(github.event.pull_request.labels.*.name, 'skip-ultrareview') &&
      contains(github.event.pull_request.labels.*.name, 'needs-deep-review')
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with: { fetch-depth: 0 }

      # The setup-node and pinned CLI install steps from the base workflow go here.

      - name: Skip tiny PRs
        id: size
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          changed=$(gh pr diff "${{ github.event.pull_request.number }}" --name-only | wc -l)
          echo "changed=$changed" >> "$GITHUB_OUTPUT"
          [ "$changed" -ge 5 ] || echo "Skipping: only $changed files changed"

      - name: Run claude ultrareview
        if: steps.size.outputs.changed != '' && steps.size.outputs.changed >= 5
        env:
          CLAUDE_CODE_OAUTH_TOKEN: ${{ secrets.CLAUDE_CODE_OAUTH_TOKEN }}
        run: |
          claude ultrareview --json --timeout 20 \
            "${{ github.event.pull_request.number }}" > bugs.json

With the label gate, you go from "every PR" to "PRs where a reviewer flagged the change as substantive". That cuts volume by 60-80% in my experience without losing the high-impact reviews. Note that the label gate only fires if the workflow also triggers on the labeled activity type; with types: [opened, ready_for_review] alone, a label added after the PR is opened never starts a run. The diff-size threshold catches the long tail of one-line-fix PRs that don't need a 10-minute multi-agent sweep. The bot-actor skip alone has saved real teams from posting an $18 ultrareview comment on a Dependabot version bump.
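Gate conditions written as YAML expressions are awkward to test, so it can help to mirror them in a small shell function and exercise that locally. The sketch below restates the five guards from the workflow above; should_review and its argument order are hypothetical, and the YAML remains the source of truth in CI.

```shell
# should_review DRAFT ACTOR LABELS CHANGED_FILES
# LABELS is a comma-separated list. Returns 0 to run the review, 1 to skip.
should_review() {
  draft=$1
  actor=$2
  labels=",$3,"
  changed=$4
  # 1. Skip drafts
  [ "$draft" = "false" ] || return 1
  # 2. Skip bot actors
  case "$actor" in 'dependabot[bot]'|'renovate[bot]') return 1 ;; esac
  # 3. Skip when the opt-out label is present
  case "$labels" in *,skip-ultrareview,*) return 1 ;; esac
  # 4. Require the opt-in label
  case "$labels" in *,needs-deep-review,*) ;; *) return 1 ;; esac
  # 5. Skip PRs that touch fewer than 5 files
  [ "$changed" -ge 5 ]
}

should_review false alice "needs-deep-review" 8 && echo "review"
should_review false 'dependabot[bot]' "needs-deep-review" 8 || echo "skip"
```

A handful of calls like these makes it obvious when a combination of labels and actors would trigger a paid run that nobody intended.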

Set an org-wide spend cap at claude.ai/admin-settings/usage before turning the workflow on for the whole engineering team. A bad week of mass-rebases can otherwise turn into a four-figure surprise on the Anthropic invoice. The same cost tracking patterns I covered in Claude Code Cost Tracking apply here for per-run accounting on your side.

When to Use the Managed Code Review Product Instead

Anthropic ships a separate product called Code Review that's easy to confuse with the CLI subcommand because both run multi-agent reviews. They're not the same thing. Code Review is a managed GitHub App available only to Team and Enterprise subscriptions. It posts findings as inline review comments, ships a check run with a severity table, costs roughly $15 to $25 per review (scaling with PR size), and triggers on PR open, on every push, or via @claude review comments. No YAML required on your side.

| Dimension            | claude ultrareview (CLI)       | Code Review (managed)                      |
|----------------------|--------------------------------|--------------------------------------------|
| Plan required        | Pro / Max / Team / Enterprise  | Team / Enterprise only                     |
| Auth                 | OAuth token in CI secret       | GitHub App, no token in your repo          |
| Trigger              | Whatever your YAML says        | PR open / every push / @claude review      |
| Inline diff comments | You write the post step        | Posted automatically                       |
| Cost per review      | $5 to $20                      | $15 to $25                                 |
| Customization        | Anything jq and gh can do      | REVIEW.md for severity rules and skip paths |
| Self-hosted runners  | Yes                            | No (runs on Anthropic infra)               |

Pick the CLI subcommand if you have a Pro or Max account, you want full control over trigger logic, you run on self-hosted runners, you want findings routed to non-GitHub destinations like Slack or Linear, or you're cost-sensitive and want exactly one review per PR open. Pick the managed Code Review if you're on Team or Enterprise, you want zero YAML, you want inline diff annotations and severity-tagged check runs out of the box, and REVIEW.md covers your tuning needs. Both bill as extra usage on the same Anthropic invoice; budgeting and spend caps are configured per organization.

One subtle point: the managed Code Review's "every push" mode multiplies cost by the number of pushes, which is invisible until it isn't. The CLI subcommand fires exactly when your YAML says it should, no more, no less. If your team rewrites PRs heavily before merge, the CLI form is the cheaper structure even before you account for the higher per-run cost of the managed product.

GitLab CI and the June 1 Copilot Billing Change

GitLab CI works with one structural change: the claude ultrareview <PR-number> form is GitHub-only because it clones from a github.com remote. On GitLab, the no-arg form bundles your local working tree and reviews against the default branch, which is what you want for an MR job anyway.

yaml: .gitlab-ci.yml
ultrareview:
  image: node:20-alpine
  stage: review
  rules:
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
      when: on_success
  variables:
    GIT_DEPTH: 0
  before_script:
    - apk add --no-cache git curl jq glab
    - npm install -g @anthropic-ai/claude-code@2.1.120
    - claude --version
  script:
    - claude ultrareview --json --timeout 20 > bugs.json
    - |
      IMPORTANT=$(jq '[.findings[] | select(.severity=="important")] | length' bugs.json)
      jq -r '.findings | sort_by(.severity) | .[] |
        "### [\(.severity | ascii_upcase)] \(.file):\(.line)\n\(.title)\n"' bugs.json > comment.md
      glab mr note "$CI_MERGE_REQUEST_IID" --message "$(cat comment.md)"
      [ "$IMPORTANT" -eq 0 ]
  artifacts:
    paths: [bugs.json]
    when: always

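Store CLAUDE_CODE_OAUTH_TOKEN as a masked CI/CD variable under Settings → CI/CD → Variables. Mark it protected only if your merge request pipelines run on protected branches, because GitLab exposes protected variables only to pipelines on protected branches and tags. The glab CLI handles the MR comment posting and needs its own credential with API access (it reads a GITLAB_TOKEN environment variable); if you don't want the extra binary, the GitLab REST API works just as well with curl and a project-scoped token.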
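For the curl route, a minimal sketch against the GitLab merge request notes endpoint might look like the following. CI_API_V4_URL, CI_PROJECT_ID, and CI_MERGE_REQUEST_IID are GitLab's predefined CI variables; GITLAB_API_TOKEN is a hypothetical masked variable holding a project access token with the api scope, and the helper names are illustrative.

```shell
# mr_notes_url API_BASE PROJECT_ID MR_IID -> notes endpoint URL
mr_notes_url() {
  printf '%s/projects/%s/merge_requests/%s/notes\n' "$1" "$2" "$3"
}

# post_mr_note FILE
# Posts FILE's contents as a merge request note. Intended to run inside a
# merge request pipeline where the CI_* variables are defined.
post_mr_note() {
  curl --fail --silent --show-error \
    --request POST \
    --header "PRIVATE-TOKEN: ${GITLAB_API_TOKEN}" \
    --data-urlencode "body@$1" \
    "$(mr_notes_url "$CI_API_V4_URL" "$CI_PROJECT_ID" "$CI_MERGE_REQUEST_IID")"
}

# URL construction only; gitlab.example.com is a placeholder host.
mr_notes_url "https://gitlab.example.com/api/v4" 42 7
```

In the job script, post_mr_note comment.md would replace the glab mr note line, and glab can then be dropped from the apk add list.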

The timing on this whole topic is sharper than it looks. On April 27, 2026, GitHub announced that Copilot code review will start consuming GitHub Actions minutes on June 1, 2026 on private repos, alongside a broader move to credit-based billing that drew sharp pushback from developers ("you will get less but pay the same price", per Visual Studio Magazine). Teams that "got code review for free" until now have about four weeks from this article's publication date to evaluate alternatives.

The honest comparison: claude ultrareview on GitHub-hosted runners charges you both Anthropic for the review (~$12) and GitHub for the runner minutes (~$0.05 for a 6-minute Linux job). The Anthropic charge dominates by orders of magnitude; the runner cost is a rounding error. The Copilot billing concern is structural for any CI-hosted AI reviewer, including this one. The right cost lever is the frequency guards from the previous section, not switching providers.

Known Edges and the Hardening Checklist

A few rough edges are worth knowing about before you turn this on org-wide.

Empty findings on huge PRs. Issue #50029 reports an empty findings: [] array on PRs with several thousand changed files. The workaround is to either split the PR or use the no-arg form which bundles local state instead of cloning the PR. If your team routinely opens 1000-file PRs, ultrareview probably isn't the right tool for that PR class anyway.

Repository too large to bundle. The docs note that Claude Code prompts for PR mode if your bundled tree exceeds the upload size. Push your branch, open a draft PR, then run claude ultrareview <PR-number> so the remote sandbox clones from github.com directly.

Free runs expire May 5, 2026. Each Pro and Max account gets exactly three runs to evaluate the feature. Don't burn them on Dependabot PRs you weren't going to review anyway.

Cost surprise on busy weeks. A merge train with 30 substantive PRs in a day at $14 each is $420 nobody planned for. The spend cap setting linked from the cost-control section is the only real defense.
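For the empty-findings edge in particular, it is worth distinguishing "the review found nothing" from "the payload is missing or malformed" before posting a clean result. A minimal sketch, assuming the bugs.json shape shown earlier; findings_state is a hypothetical helper and requires jq.

```shell
# findings_state FILE -> prints "invalid", "empty", or "present"
findings_state() {
  # Not JSON, or .findings is missing or not an array.
  if ! jq -e '.findings | type == "array"' "$1" >/dev/null 2>&1; then
    echo "invalid"
  elif [ "$(jq '.findings | length' "$1")" -eq 0 ]; then
    echo "empty"
  else
    echo "present"
  fi
}

sample=$(mktemp)
printf '%s' '{"findings": []}' > "$sample"
findings_state "$sample"
```

On a very large PR, an "empty" state could be reported as "no findings returned, verify manually" rather than as a clean bill of health.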

The 12-item production checklist I run through before turning the workflow on a new repo:

  1. Pin the CLI version (@anthropic-ai/claude-code@2.1.120 or later).
  2. Use a dedicated Pro or Max account whose owner is documented in the runbook.
  3. Store CLAUDE_CODE_OAUTH_TOKEN as a repository secret, not an org-wide variable.
  4. Restrict the workflow's permissions: block to pull-requests: write and contents: read, and nothing else.
  5. Skip drafts and bot actors with if: conditions.
  6. Add a label gate or diff-size gate so cost only fires on substantive PRs.
  7. Pass --timeout 20 with a job-level timeout-minutes: 25 to bound runner cost.
  8. Pipe stdout to a file. Never trust the comment poster with raw model output (see Hardening Claude Code GitHub Actions).
  9. Fail the build only on important findings, not nits.
  10. Set an org-level extra-usage spend cap at claude.ai/admin-settings/usage.
  11. Set a calendar reminder to rotate the OAuth token at day 350.
  12. Subscribe the runbook owner to the Claude Code changelog for breaking changes to the subcommand surface.
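Some checklist items can be verified mechanically. As one example, the sketch below checks item 1 by looking for an exact x.y.z version on the CLI install line of a workflow file. The function name has_pinned_cli is hypothetical, and the pattern only covers the npm install form used in this guide.

```shell
# has_pinned_cli FILE
# Returns 0 if FILE references @anthropic-ai/claude-code at an exact x.y.z version.
has_pinned_cli() {
  grep -Eq '@anthropic-ai/claude-code@[0-9]+\.[0-9]+\.[0-9]+' "$1"
}

# Illustration against a throwaway file.
wf=$(mktemp)
echo 'run: npm install -g @anthropic-ai/claude-code@2.1.120' > "$wf"
has_pinned_cli "$wf" && echo "CLI version is pinned"
```

A check like has_pinned_cli .github/workflows/ultrareview.yml can run in a lint job so that an unpinned install fails review before it reaches the default branch.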

None of these items are clever. They're the boring guardrails that turn a "cool, it works" demo into a workflow you can leave running for a year without surprises. Five days after v2.1.120 shipped, almost no one has these in writing yet. Yours can.

Related Reading

Harden Claude Code GitHub Actions: Prompt Injection Defense

The same PR comment poster used here inherits the Comment and Control threat model. Tool allowlists, OIDC, script caps, and the assembled hardened workflow.

Regression-Proof Claude Code Workflows: Pin, Lock, Test

Why the version pin in this workflow matters: Anthropic shipped three confirmed regressions in seven weeks. Pin the CLI, lock effort, allowlist models, fixture-test.
