Regression-Proof Claude Code Workflows: Pin, Lock, Test
Anthropic's April 23, 2026 postmortem confirmed three wrapper-level changes silently degraded Claude Code over seven weeks. The model never changed; the harness around it broke. To regression-proof your workflow against the next upstream surprise, pin the CLI version, lock effort with effortLevel, allowlist models, add a regression-detecting hook, and keep a fixture suite.
- Anthropic's April 23, 2026 postmortem confirmed three confounding wrapper changes degraded Claude Code over 7 weeks. The model never changed; the runtime around it did.
- Pin the CLI: npm install -g @anthropic-ai/claude-code@2.1.117 plus an ~/.npmrc lock so auto-upgrade can't silently move you to a broken version.
- Lock effortLevel in ~/.claude/settings.json, allowlist models with availableModels, and pin third-party deployments via modelOverrides.
- Run a Stop-hook canary against 3-5 fixture prompts after every session. Anthropic's evals missed it; yours don't have to.
What Anthropic's April 23 Postmortem Actually Said
On April 23, 2026, Anthropic published an engineering postmortem confirming what thousands of developers had been complaining about for weeks: Claude Code felt noticeably worse. The cause wasn't a model change. It was three separate harness-level bugs that compounded over seven weeks (March 4 to April 20).
The three changes Anthropic identified:
- Reasoning-effort downgrade. A default-effort change shipped March 4 and was reverted April 7. Affected Sonnet 4.6 and Opus 4.6.
- Thinking-cache bug. A prompt-caching optimization using the clear_thinking_20251015 API header with keep:1 shipped March 26 and was fixed April 10 in v2.1.101. Anthropic's own line: "Instead of clearing thinking history once, it cleared it on every turn for the rest of the session."
- System-prompt verbosity cap. A "Length limits: keep text between tool calls to ≤25 words" instruction was added April 16 and reverted April 20 in v2.1.116. Affected Sonnet 4.6, Opus 4.6, and Opus 4.7.
The lived experience of the second bug is worth quoting directly. Per Anthropic's own writeup: "Claude would continue executing, but increasingly without memory of why it had chosen to do what it was doing." If you ever had a session where Claude felt confused after a long task in late March or early April, that's why.
Two things stand out for practitioners. First, Anthropic's internal evals missed all three. The postmortem cites stale-session-only reproduction, interference from a separate server-side message-queuing experiment, suppressed thinking displays masking the bug in CLI sessions, and the fact that the broader eval suite wasn't initially run on the system-prompt change. Second, Anthropic's recommended user-side mitigation was simply: "As of April 23, we're resetting usage limits for all subscribers." That's it. No flags. No configuration. No version pinning advice. The rest of this post is the gap.
The model itself was fine throughout. The postmortem reports that "one of these evaluations showed a 3% drop for both Opus 4.6 and 4.7," which sounds small until you remember that the wrapper around the model is what shipped to thousands of developers nightly.
The 24-Hour v2.1.119/v2.1.120 Incident
One day after the postmortem dropped, on April 24, Anthropic shipped v2.1.119 and v2.1.120 within 24 hours. Together those releases triggered eight community-filed regressions. The community-maintained v2.1.117 survival checklist on GitHub Gist documents them and recommends rolling back as the shortest path to a working session.
The eight bugs filed against the official repo:
| Issue | Symptom |
|---|---|
| #53044, #53041 | claude --resume crashes at startup (TypeError in session restoration) |
| #53031 | Silent routing of claude-opus-4-7 to the 1M-context variant - changes pricing and cache behaviour |
| #53038 | Resize-redraw UI duplication regression |
| #53028 | Auto-update stopped working silently |
| #53035 | /mcp menu freezes when --resume is used in WSL2 |
| #53040 | CLAUDE.md ignored even below 1/3 context |
| #53012 | sandbox.excludedCommands doesn't release network enforcement |
| #53015 | Worktree creation hangs on macOS 26.4 (git merge blocks stdin) |
This isn't the first time. In March 2026, v2.1.89 shipped a rate-limit regression that consumed quotas 3 to 50 times faster than usual. Teams without version pinning got blindsided overnight; the only fix was a rollback they hadn't prepared for.
The pattern is consistent. Ship dates aren't stability dates. The community has filed an open feature request, issue #22106, asking for first-party version-management tooling and auto-rollback alerts. Until that lands, the workflow is yours to build.
Pin the Claude Code CLI to a Known-Good Version
The first defense is the cheapest. Treat @anthropic-ai/claude-code like any other dependency: pin it to a version you've actually validated, and stop auto-upgrade from changing that on you between coffee and lunch.
As of late April 2026, the community-validated "known-good" version is v2.1.117, which is the build immediately before the v2.1.119/v2.1.120 fiasco.
# Roll back / pin to v2.1.117
npm uninstall -g @anthropic-ai/claude-code
npm install -g @anthropic-ai/claude-code@2.1.117
# Confirm what's actually loaded
claude --version
That installs the version, but npm will happily upgrade you on the next global update unless you tell it not to. Add an explicit version pin to your user-level npm config:
@anthropic-ai/claude-code:version=2.1.117
On WSL2 or other non-npm distributions, the CLI ships with its own installer. The equivalent rollback command per the community gist is:
claude install --force 2.1.117
Once you're pinned, treat upgrades like any other dependency bump. Watch the releases page and the active issues list for at least 48 hours after a new minor version ships before adopting it. If your team lives in CI, drop the same pin into your onboarding script and your devcontainer image so every contributor lands on the same byte-identical CLI.
Two sanity checks I run after every install. First, claude --version confirms the binary. Second, inside Claude Code I run /status to confirm the active model and effort level, which is where the next defense lives.
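The first of those checks is easy to script for onboarding or CI. Here is a minimal sketch, assuming that claude --version prints a dotted x.y.z version somewhere in its output (the exact output format is an assumption, and PINNED, parse_version, is_pinned, and installed_version_output are illustrative names rather than anything the CLI provides):

```python
import re
import subprocess

PINNED = "2.1.117"  # the version you validated; adjust to your own pin


def parse_version(text):
    """Return the first x.y.z version found in text, or None."""
    match = re.search(r"\d+\.\d+\.\d+", text)
    return match.group(0) if match else None


def is_pinned(version_output, pinned=PINNED):
    """True only when the reported version equals the pin exactly."""
    return parse_version(version_output) == pinned


def installed_version_output():
    """Run `claude --version` and return its raw stdout."""
    result = subprocess.run(
        ["claude", "--version"], capture_output=True, text=True, timeout=30
    )
    return result.stdout
```

In an onboarding script you would call is_pinned(installed_version_output()) and exit non-zero when it returns False, so a drifted install fails loudly instead of silently.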
Lock Effort Level in settings.json
The reasoning-effort downgrade in March was invisible to most users because most users never set effortLevel explicitly. They got the silently lowered default and felt their tasks getting dumber without knowing why. The fix is to stop trusting defaults.
Per Anthropic's model-config docs, effort levels are scoped per model:
- Opus 4.7: low, medium, high, xhigh, max
- Opus 4.6 and Sonnet 4.6: low, medium, high, max
- As of v2.1.117, the default effort is xhigh on Opus 4.7 and high on Opus 4.6 and Sonnet 4.6. Pin it explicitly anyway.
The verified syntax is plain JSON in your user-level Claude Code config:
{
  "model": "claude-opus-4-7",
  "effortLevel": "xhigh"
}
Precedence matters when something needs to override. Anthropic's docs spell it out: "The environment variable takes precedence over all other methods, then your configured level, then the model default." So a CI job can do this without touching the file:
# Force a one-off session at low effort (e.g., a fast lint job)
CLAUDE_CODE_EFFORT_LEVEL=low claude --print "review this diff"
# Or set it once for the shell
export CLAUDE_CODE_EFFORT_LEVEL=high
For teams, ship effortLevel in managed or policy settings so developers can't drift below the floor. If adaptive reasoning misbehaves after an update on Opus 4.6 or Sonnet 4.6, there's a fallback flag worth knowing:
# Revert Opus 4.6 / Sonnet 4.6 to the older fixed thinking budget
# (Does NOT apply to Opus 4.7, which always uses adaptive reasoning.)
export CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING=1
# Pair with an explicit max thinking token count
export MAX_THINKING_TOKENS=20000
I keep effortLevel set in my user settings and refuse to rely on the picker's "auto" option. If the next regression is another silent default change, my workflow is unaffected. The same ~/.claude/settings.json I use for CLAUDE.md scoping and hooks holds this.
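The precedence rule quoted above (environment variable, then configured level, then model default) can be modelled in a few lines. This is a sketch of the rule as this post describes it, not Claude Code's actual implementation; resolve_effort and the two lookup tables are illustrative, and the tables cover only the two model IDs that appear in this post.

```python
import os

# Allowed levels and defaults as listed in this post (as of v2.1.117).
LEVELS = {
    "claude-opus-4-7": ["low", "medium", "high", "xhigh", "max"],
    "claude-sonnet-4-6": ["low", "medium", "high", "max"],
}
DEFAULTS = {
    "claude-opus-4-7": "xhigh",
    "claude-sonnet-4-6": "high",
}


def resolve_effort(model, settings, env=None):
    """Pick the effort level: env var, then settings, then model default."""
    env = os.environ if env is None else env
    level = (
        env.get("CLAUDE_CODE_EFFORT_LEVEL")
        or settings.get("effortLevel")
        or DEFAULTS[model]
    )
    if level not in LEVELS[model]:
        raise ValueError(f"{level!r} is not a valid effort level for {model}")
    return level
```

The useful property is the last branch: with an explicit effortLevel in settings, a changed model default never gets consulted, which is exactly why pinning protects you from a silent default change.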
Allowlist Your Model Set With availableModels and modelOverrides
The aliases opus, sonnet, and haiku resolve to "the recommended model for your account type." The problem is that "recommended" can change without you noticing. Issue #53031 from the v2.1.119 incident was exactly this: Claude Code started silently routing claude-opus-4-7 to its 1M-context variant, which changed both the price and the cache hit rate without a config change on the user's end.
For direct-API users, the fix is four top-level settings:
{
  "model": "claude-opus-4-7",
  "availableModels": ["claude-opus-4-7", "claude-sonnet-4-6", "haiku"],
  "effortLevel": "xhigh",
  "env": {
    "ANTHROPIC_DEFAULT_OPUS_MODEL": "claude-opus-4-7",
    "ANTHROPIC_DEFAULT_SONNET_MODEL": "claude-sonnet-4-6"
  }
}
Three things working together. model sets the initial selection. availableModels stops the picker, --model flag, and ANTHROPIC_MODEL env var from accepting anything outside the list. The env block matters most: without it, a user who picks "Default" in the picker bypasses your model pin and gets whatever Anthropic's upstream default resolves to that day.
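Because the three pieces only protect you together, it can help to lint the settings file for the ones that are missing. A rough sketch follows, where audit_settings and ALIAS_ENV are hypothetical helpers written for this post; it checks only the keys shown in the example above.

```python
# Alias -> env var that pins it, taken from the settings example above.
ALIAS_ENV = {
    "opus": "ANTHROPIC_DEFAULT_OPUS_MODEL",
    "sonnet": "ANTHROPIC_DEFAULT_SONNET_MODEL",
}


def audit_settings(settings):
    """Return a list of human-readable problems; empty means the pin holds."""
    problems = []
    allowed = settings.get("availableModels", [])
    model = settings.get("model")
    if not allowed:
        problems.append("availableModels is missing or empty")
    elif model not in allowed:
        problems.append(f"model {model!r} is not in availableModels")
    if "effortLevel" not in settings:
        problems.append("effortLevel is not set explicitly")
    env = settings.get("env", {})
    for alias, var in ALIAS_ENV.items():
        if var not in env:
            problems.append(f"{var} is unset, so the {alias} alias can drift")
    return problems
```

To use it, load ~/.claude/settings.json with json.loads and pass the resulting dict in; an empty list back means all three layers are present.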
On AWS Bedrock, Google Vertex AI, or Azure Foundry, you need one more layer. Anthropic's docs explicitly warn: "Without pinning, Claude Code uses model aliases that resolve to the latest version. When Anthropic releases a new model that isn't yet enabled in a user's account, Bedrock and Vertex AI users see a notice and fall back to the previous version for that session, while Foundry users see errors." The April 2026 modelOverrides map is the answer:
{
  "model": "claude-opus-4-7",
  "availableModels": ["claude-opus-4-7", "claude-sonnet-4-6"],
  "modelOverrides": {
    "claude-opus-4-7": "arn:aws:bedrock:us-east-2:123456789012:application-inference-profile/opus-prod",
    "claude-sonnet-4-6": "arn:aws:bedrock:us-east-2:123456789012:application-inference-profile/sonnet-prod"
  }
}
modelOverrides maps each Anthropic model ID to the provider-specific string Claude Code sends to Bedrock or Vertex. When a user picks "Opus 4.7" in the /model picker, Claude Code substitutes the ARN before calling out. New Anthropic releases don't change behaviour until you update this map.
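To make the substitution concrete, here is a toy model of the lookup described above. It is not Claude Code's code: resolve_provider_id and unmapped_models are hypothetical names, and falling back to the bare model ID when no override exists is an assumption made for the sketch.

```python
def resolve_provider_id(model_id, settings):
    """Map an Anthropic model ID to the string sent to the provider."""
    allowed = settings.get("availableModels", [])
    if allowed and model_id not in allowed:
        raise ValueError(f"{model_id} is not in availableModels")
    # Assumed fallback: use the bare model ID when no override is configured.
    return settings.get("modelOverrides", {}).get(model_id, model_id)


def unmapped_models(settings):
    """List allowlisted models that have no modelOverrides entry."""
    overrides = settings.get("modelOverrides", {})
    return [m for m in settings.get("availableModels", []) if m not in overrides]
```

Running unmapped_models over your settings before a rollout is a cheap way to spot an allowlisted model that would reach the provider without a pinned inference profile.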
This is the same allowlist-first philosophy I covered for tools in Hardening Claude Code GitHub Actions. Allowlists win because the failure mode of an unknown new entry is "blocked" rather than "automatically applied."
Add a Regression-Detecting Stop Hook
Anthropic's evals missed three regressions in seven weeks. Reading their status page won't catch the next one either. The right defense is your own canary: a small script that runs after every Claude Code session, replays a few golden-output prompts against fresh sessions, and pings you when the output drifts.
Wire it through the Stop hook in .claude/settings.json:
{
  "hooks": {
    "Stop": [
      {
        "hooks": [
          {
            "type": "command",
            "command": "python .claude/hooks/check-regression.py"
          }
        ]
      }
    ]
  }
}
The hook script reads each fixture, runs an ephemeral Claude session against it, and diffs the output against the saved expectation. Drift triggers a webhook to Slack or Discord. Keep it dumb and short:
#!/usr/bin/env python3
"""Stop-hook canary: replay fixtures and alert on drift."""
import json
import os
import re
import subprocess
import sys
import urllib.request
from pathlib import Path

FIXTURES = Path(".claude/hooks/fixtures")
WEBHOOK = os.environ.get("REGRESSION_WEBHOOK_URL")
DRIFT_THRESHOLD = 0.30  # alert when the word count drifts by more than 30%
# Guard variable (a name chosen for this script) so that replayed sessions,
# which may fire this same Stop hook when they end, do not recurse.
GUARD = "REGRESSION_CANARY_ACTIVE"


def post_alert(message):
    """Send the alert to the webhook, or to stderr if none is configured."""
    if not WEBHOOK:
        print(f"[regression] {message}", file=sys.stderr)
        return
    data = json.dumps({"text": f":rotating_light: {message}"}).encode()
    req = urllib.request.Request(WEBHOOK, data=data, headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req, timeout=5)


def run_fixture(prompt_file, expect_file):
    """Replay one fixture in a fresh session; return True if it passes."""
    prompt = prompt_file.read_text()
    expected = expect_file.read_text().strip()
    result = subprocess.run(
        ["claude", "--print", prompt],
        capture_output=True, text=True, timeout=300,
        env={**os.environ, GUARD: "1"},
    )
    output = result.stdout
    # Check 1: the required substring must appear verbatim.
    if expected not in output:
        post_alert(f"Fixture {prompt_file.name} drift: expected substring missing.")
        return False
    # Check 2: the word count must stay within the drift threshold.
    expected_tokens = len(re.findall(r"\w+", expected))
    actual_tokens = len(re.findall(r"\w+", output))
    if expected_tokens and abs(actual_tokens - expected_tokens) / expected_tokens > DRIFT_THRESHOLD:
        post_alert(
            f"Fixture {prompt_file.name} token delta {actual_tokens - expected_tokens} "
            f"vs expected {expected_tokens} ({DRIFT_THRESHOLD * 100:.0f}% threshold)."
        )
        return False
    return True


if __name__ == "__main__":
    if os.environ.get(GUARD):
        sys.exit(0)  # invoked from a replayed session; nothing to do
    pairs = sorted(FIXTURES.glob("*.prompt.md"))
    failures = []
    for prompt_file in pairs:
        # refactor.prompt.md -> refactor.expect.md
        expect_file = prompt_file.with_suffix("").with_suffix(".expect.md")
        if not expect_file.exists():
            continue
        if not run_fixture(prompt_file, expect_file):
            failures.append(prompt_file.name)
    sys.exit(0 if not failures else 1)
Two design notes. First, this runs after the user's real session, so it doesn't add latency to interactive work. Second, it uses claude --print for non-interactive replays, which is the right primitive for fixture testing. The full token accounting that powers the drift detector is the same JSONL stream I covered in Claude Code Cost Tracking; you can extend this script to read those logs for a more precise per-task budget signal.
The webhook target is yours to pick. I use a Slack channel called #claude-regressions with two people on watch. The point is the alert fires before you waste a workday wondering if your skills got worse.
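The drift arithmetic in the hook is worth seeing with numbers. The sketch below isolates it, using the same regex word count and 30% threshold as the script; word_count and drift_ratio are names introduced here for illustration, and the word count is a rough proxy rather than a model tokenizer.

```python
import re

DRIFT_THRESHOLD = 0.30  # same 30% threshold as the hook script


def word_count(text):
    """Count word-like tokens; a rough proxy, not a model tokenizer."""
    return len(re.findall(r"\w+", text))


def drift_ratio(expected, actual):
    """Relative change in word count between expected and actual output."""
    expected_words = word_count(expected)
    if expected_words == 0:
        return 0.0
    return abs(word_count(actual) - expected_words) / expected_words


expected = "word " * 40  # 40 words in the recorded expectation
terse = "word " * 25     # 25 words: |25 - 40| / 40 = 0.375
close = "word " * 36     # 36 words: |36 - 40| / 40 = 0.10

print(drift_ratio(expected, terse) > DRIFT_THRESHOLD)  # True
print(drift_ratio(expected, close) > DRIFT_THRESHOLD)  # False
```

One consequence to keep in mind: because the hook compares the whole response against the expectation file, the ratio only means something when the expectation is roughly as long as a full response. A three-line expectation paired with a long answer would trip the threshold on every run.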
Keep a Tiny Fixture Suite Per Repo
The hook is only as useful as the fixtures it runs. You don't need a full eval harness; three to five carefully chosen prompts cover the failure modes the postmortem revealed. Each fixture is two files: a prompt and an expected substring or short output snippet.
Here's the layout I use:
refactor.prompt.md
refactor.expect.md
test-gen.prompt.md
test-gen.expect.md
plan-mode.prompt.md
plan-mode.expect.md
long-horizon.prompt.md
long-horizon.expect.md
What each fixture catches:
- Refactor fixture. "Rename function oldName to newName across this 12-file module." Catches edit-tool regressions like the v2.1.119 ignored-CLAUDE.md bug.
- Test-generation fixture. "Write tests for utils/parse.ts." Expected output names a known-correct edge case. Catches reasoning-effort downgrades where Claude stops covering edge cases.
- Plan-mode fixture. A multi-step planning prompt with a checkpoint phrase that should appear in the plan. Catches output-cap bugs like the "≤25 words" system-prompt change.
- Long-horizon fixture. A 1500-token CLAUDE.md plus a multi-step task. Tests directly for the postmortem's memory-loss symptom: Claude continuing to act without remembering why.
A real refactor fixture pair looks like this:
refactor.prompt.md:
Rename the function `computeTotal` to `calculateTotal` across the
`src/billing/` directory. List every file changed in your final response.
Do not edit tests in `tests/snapshots/`.
refactor.expect.md:
src/billing/cart.ts
src/billing/checkout.ts
src/billing/invoice.ts
Commit these to the repo. They're self-documenting, run in under a minute total, and give you a 60-second binary check the next time someone says "Claude feels weird today." The whole apparatus is your own eval, independent of Anthropic's.
Re-record the expected outputs every time you intentionally bump the pinned version. That's your signal that the new version actually behaves the way you want, not just the way the changelog claims.
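One gap to watch when re-recording: the hook script silently skips any prompt file that has no matching expectation file, so a forgotten .expect.md means a fixture that never runs. A small sketch of a check for that, where unpaired_fixtures is a hypothetical helper and the directory is the one the hook script uses:

```python
from pathlib import Path


def unpaired_fixtures(fixtures_dir):
    """Return names of *.prompt.md files with no sibling *.expect.md."""
    missing = []
    for prompt_file in sorted(Path(fixtures_dir).glob("*.prompt.md")):
        stem = prompt_file.name[: -len(".prompt.md")]
        if not (prompt_file.parent / f"{stem}.expect.md").exists():
            missing.append(prompt_file.name)
    return missing


if __name__ == "__main__":
    for name in unpaired_fixtures(".claude/hooks/fixtures"):
        print(f"[fixtures] {name} has no expectation file")
```

Running this in CI, or right after a re-record, keeps the suite from shrinking without anyone noticing.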
The Rollback Runbook (When Something Breaks Anyway)
Even pinned, you eventually upgrade. And eventually one of those upgrades misbehaves. Keep a five-step playbook ready so the morning you notice something off doesn't become a half-day of debugging.
- Confirm what's actually running. Run claude --version. Inside Claude Code, run /status and check the "effective model" field. v2.1.119's silent-1M-swap (#53031) only showed there.
- Cross-check upstream signal. Open the releases page and the active issues list filtered by the version you're on. If someone else has filed your symptom in the last 24 hours, you're not imagining it.
- Roll back and lock.
  npm uninstall -g @anthropic-ai/claude-code
  npm install -g @anthropic-ai/claude-code@2.1.117
  echo "@anthropic-ai/claude-code:version=2.1.117" >> ~/.npmrc
- Run your fixture suite. The same canary script confirms the rollback restored expected behaviour. If a fixture still fails, the issue isn't the version - check your settings.json next.
- File a regression issue. Open an issue against anthropics/claude-code with your fixture diff and reproducer. Subscribe to issue #22106 for the auto-rollback alert feature, so you'll get pinged when first-party tooling exists.
Total time from "something feels wrong" to "back to a known state": under five minutes if you have the playbook bookmarked. Without it: anywhere between hours and a day, depending on how many tasks you finish before noticing the wheels are off.
What This Approach Still Can't Fix
Honesty section. Pinning the CLI and locking effort fixes a class of problems, not all of them. Three you should plan for separately.
API-side regressions still bite. The thinking-cache bug from the postmortem was server-side: the wrong API header behaviour. Pinning your CLI gives you nothing against changes to the inference layer. If you need full control there, evaluate Claude Managed Agents or self-host through Bedrock with strict provisioned-throughput SLAs.
Subscription policy changes are out of scope. The April 23 postmortem ended with a usage-limit reset for all subscribers. That's a commercial change, not a wrapper change. No version pin will block the next one. Watch the Anthropic status page and your billing email.
Pinned versions go stale. Eventually v2.1.117 will lack a feature you actually want. The playbook isn't to pin forever; it's to schedule the upgrade on your terms, run the fixture suite against the new version, and re-record expected outputs once you confirm behaviour matches the changelog. That puts you in control of the timing.
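Scheduling the upgrade on your terms comes down to two inputs: how long the release has been public and whether your fixtures pass against it. The sketch below combines them, assuming the 48-hour waiting period suggested earlier in this post; ready_to_evaluate and upgrade_decision are illustrative names, and the release timestamp is something you would look up yourself.

```python
from datetime import datetime, timedelta, timezone

SOAK = timedelta(hours=48)  # the waiting period suggested earlier in this post


def ready_to_evaluate(released_at, now=None, soak=SOAK):
    """True once a release has been public for at least the soak period."""
    now = now or datetime.now(timezone.utc)
    return now - released_at >= soak


def upgrade_decision(released_at, fixture_failures, now=None):
    """Combine the soak window with the fixture results into one verdict."""
    if not ready_to_evaluate(released_at, now):
        return "wait"
    return "stay pinned" if fixture_failures else "adopt and re-record"
```

Pass timezone-aware datetimes for both timestamps; fixture_failures can be the list of failing fixture names, so an empty list means the new version is safe to adopt.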
Regression-proofing isn't permanence. It's reducing how often the wrapper around the model surprises you, and giving you a 60-second binary check when it does. That's good enough.
Related Reading
- Hardening Claude Code GitHub Actions. The same allowlist-first philosophy applied to tools and triggers in CI. Tool allowlists, OIDC, script caps, and the assembled hardened workflow.
- Claude Code Cost Tracking. The JSONL log pipeline that powers per-task token tracking. Pair it with the regression hook in this post for a token-spike detector.