When AI builds itself, what's left for SRE/DevOps
Strange week to be working in infrastructure.
On June 4th, Anthropic published an essay called When AI builds itself saying, in plain English: “look, AI is already accelerating the development of AI itself, and it would be good for the world to have the option to pause this in a coordinated way.” Reuters covered it, The Economist asked whether AI will escape human control, and the BBC ran with it.
Five days later… that same Anthropic released Claude Fable 5, the public version of Mythos — their most capable model. The architecture is a curious thing: Fable 5 and Mythos 5 share the same underlying model, but Fable ships with classifiers that detect high-risk queries — offensive cybersecurity, biology/chemistry, model distillation — and, when they fire, hand the response off to Opus 4.8, the previous public model. Mythos 5, without the cyber guardrails, stays restricted to the ~200 vetted organizations of Project Glasswing.
“Folks, maybe we should pause. In the meantime, here’s the most powerful model we’ve ever shipped.” (With a seatbelt, airbag, and speed limiter, to be fair. But still.)
You can read that with cynicism (and plenty of people did — researchers like Mark Riedl called it a recursive self-improvement hype train). But regardless of intent, the numbers in the essay are what interest me most here, because they speak directly to our work.
The numbers that matter
Anthropic’s essay brings internal data I’d never seen published by a lab:
More than 80% of the code landing in Anthropic’s monorepo today is written by Claude. Each engineer is merging ~8x more code per day than in 2024. Claude Code’s success rate on open-ended problems (the ones without a clear spec — which is where we live) hit 76%, up 50 percentage points in six months. And METR measured that the length of tasks models complete on their own is doubling every ~4 months. In 2024 it was 4-minute tasks; today, 12+ hours.
There’s an example in the essay that got me: a routine upgrade started crashing tens of thousands of training jobs. An engineer pointed Claude at the live incident with little more than cluster access. The model isolated an obscure debugging flag, reproduced the crash, and confirmed the fix in ~2 hours. A 2-3 day job.
If you’re an SRE, you know exactly what kind of incident that is. And you know how much it’s worth.
The problem: has technical knowledge become a commodity?
The question hanging in the air: if producing software gets this easy, what value is left in technical knowledge?
My short answer: execution knowledge is becoming a commodity, yes. Knowing HPA syntax, memorizing kubectl flags, writing the Terraform module — the price of that has cratered. Anthropic’s own essay admits that “the doing” (writing code, running the experiment) now costs almost nothing in human time.
And here it pays to be surgical with the words, because that sentence has a huge catch: cheap in human time is not cheap on the invoice. The cost of producing software didn’t disappear — it’s moving columns on the balance sheet: out of payroll and into the inference bill. Hold onto that detail, because it’s exactly the link between commoditization and the bubble conversation just below.
But there’s a detail hidden in the text that’s the key to everything: they cite Amdahl’s law — the total speedup of a system is limited by the fraction you didn’t speed up. Accelerating one part of the process just moves the bottleneck somewhere else. The first bottleneck to show up was human code review. And the answer from anyone using AI seriously — including Anthropic itself, which today runs every change in the codebase through an automated reviewer powered by Claude — was predictable: put AI to review the AI. Does it work? Partly (their reviewer would have caught ~1/3 of the bugs behind historical claude.ai incidents). But the bottleneck doesn’t vanish. It drops one more level, to a far more uncomfortable layer: who verifies the verifier — and who still understands what’s being shipped?
Man. That is literally the job description of an SRE: being the reliability function of a system that produces change faster than humans can keep up with.
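To put a number on the Amdahl's law point above, here is a minimal sketch. The 60/40 split between writing code and reviewing or verifying it is a made-up assumption for illustration, not a figure from the essay, and `amdahl_speedup` is just a name I picked.

```python
def amdahl_speedup(accelerated_fraction: float, factor: float) -> float:
    """Overall speedup when `accelerated_fraction` of the work gets `factor` times faster."""
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be between 0 and 1")
    if factor <= 0:
        raise ValueError("factor must be positive")
    return 1.0 / ((1.0 - accelerated_fraction) + accelerated_fraction / factor)


# Hypothetical split: 60% of delivery time is writing code, 40% is review and verification.
for factor in (2, 8, 100, 1_000_000):
    print(f"coding {factor:>9,}x faster -> overall {amdahl_speedup(0.6, factor):.2f}x")

# With this split the ceiling is 1 / (1 - 0.6) = 2.5x, no matter how fast the coding gets.
```

Under that assumed split, making the coding part a million times faster still leaves the whole system below 2.5x. The untouched 40% is the entire story.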
“But what about the bubble?”
Fair. There’s a lot of talk that the bubble will pop, that AI’s economics will never pencil out, and that we’ll go back to humans writing the code by hand. The skeptical data exists, it’s good, and I recommend reading it:
The METR RCT study (Becker et al., 2025) became the paper most cited by skeptics: experienced devs working on open source projects they knew well were 19% slower using AI in early 2025 — while thinking they were 20% faster. That perception gap is real and we should take it seriously.
Except there’s a plot twist: in the February 2026 follow-up, METR couldn’t run the experiment properly anymore. Why? Because 30-50% of the invited devs refused to work without AI, even when paid by the hour. The control group evaporated. The cohort that stayed showed the slowdown all but gone, and METR now says AI “probably yields a productivity gain in 2026.” The tool stopped being optional so fast it broke the study’s methodology. That, on its own, is a data point.
Now, the cost. Here the story has two curves that seem to contradict each other — and both are true at the same time.
Curve 1: the unit price of intelligence is plummeting. Epoch AI’s analysis shows drops of 9x to 900x per year per capability tier, and there’s a recent paper attributing most of that to algorithmic progress — not VC subsidy burning cash to buy market share.
Curve 2: the total bill is exploding. Remember the cost that migrated from payroll to the inference bill? It arrived, with interest. Uber rolled out Claude Code to ~5,000 engineers in December 2025, drove adoption with an internal AI-usage leaderboard (yes, they gamified token consumption), watched cost per engineer hit $500–2,000/month, and burned through its entire 2026 AI budget by mid-April. Four months. The response came in May: a $1,500/month cap per agentic tool and a per-employee consumption dashboard. Microsoft went more drastic: it revoked Claude Code licenses for an entire division (~5,000 engineers, the Windows and M365 folks) six months after enabling them, for the same reason. And the big picture, per TechCrunch: the per-token price dropped ~98% since 2022, but agentic workloads consume on the order of 30x more tokens per task than a simple flow, and enterprise bills shot up. It got so serious that the Linux Foundation announced last week the Tokenomics Foundation — a standards body to do for token spend what FinOps did for the cloud bill. (And Fable 5, to top it off, arrived at $10 per million input tokens and $50 per million output — double Opus 4.8. The frontier offers no discounts.)
So that’s it, the bubble pops and we go back to coding by hand? My read: be careful not to confuse a cost-governance crisis with technological unviability. Look at how the companies reacted: Uber’s COO complained publicly that he can’t tie token consumption to useful features — but Uber’s response wasn’t to turn the AI off. It was caps, dashboards, per-engineer accountability. Microsoft didn’t abandon AI; it swapped tools to control cost. It’s exactly the cycle our generation watched in the cloud: bill explodes → FinOps is born → optimize → consumption keeps growing, only now it’s governed. Nobody went back to their own datacenter because the AWS bill came in high; an entire discipline was born to tame the bill. The financial bubble (valuations, datacenter capex) may very well pop and take companies down with it. The falling unit cost of intelligence, no — that’s a technical trend, not a financial one. Going back to a world where only humans code would require reversing both curves at once, and neither of them is pointing backward.
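The two curves are easier to reconcile with back-of-the-envelope arithmetic. The sketch below uses only the figures quoted above (the ~98% per-token price drop, the ~30x token multiplier, and Fable 5's $10/$50 per million tokens). The task counts and token sizes are hypothetical, and `monthly_cost_usd` is an illustrative helper, not anyone's billing formula.

```python
PRICE_DROP = 0.98       # per-token price drop since 2022 (figure quoted above)
TOKEN_MULTIPLIER = 30   # agentic tokens per task vs. a simple flow (figure quoted above)

# Cost of one agentic task today relative to one simple task at 2022 prices.
relative_cost_per_task = (1 - PRICE_DROP) * TOKEN_MULTIPLIER
print(f"relative cost per task: {relative_cost_per_task:.2f}x")


def monthly_cost_usd(tasks_per_day: int, input_tokens: int, output_tokens: int,
                     input_price_per_m: float = 10.0, output_price_per_m: float = 50.0,
                     workdays: int = 21) -> float:
    """Monthly inference bill for one engineer, priced per million tokens."""
    per_task = (input_tokens * input_price_per_m
                + output_tokens * output_price_per_m) / 1_000_000
    return tasks_per_day * workdays * per_task


# Hypothetical engineer: 10 agentic tasks a day, 400k input and 40k output tokens each.
print(f"${monthly_cost_usd(10, 400_000, 40_000):,.0f} per month")
```

On those two quoted figures alone, a single agentic task comes out cheaper than a simple 2022 task, not more expensive. If that holds, the exploding bill has to come from volume: more tasks per engineer, more engineers, bigger contexts. That is a governance problem, which is the point.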
The lock-in nobody signed
And here’s the concern I find the most underrated of all — and it’s not the Skynet from The Economist’s cover.
Faros analyzed two years of telemetry from 22,000 devs across more than 4,000 teams (real pipeline data, not a perception survey) and named the pattern: Acceleration Whiplash. Epics delivered per dev rose 66% and tasks 34% — there’s the jump everyone feels. Except, in the same dataset: median time-in-review for a PR rose 441%, incidents per PR rose ~243%, and 31% more PRs are landing with no review at all — not by policy, but because nobody can keep up with the volume. The entire delivery system was designed for code at human pace, and it’s being flooded by code at machine pace.
Addy Osmani named the bill that comes due later: comprehension debt — the growing gap between the code a team shipped and the code the team actually understands. Unlike technical debt, it’s invisible: the codebase looks healthy while comprehension rots underneath, until the day something breaks and nobody can narrate the logic. There’s already a paper on arXiv documenting the terminal stage of this in small teams: the system works, but the team can no longer maintain it without AI — to debug their own code, they go back to the AI that wrote it.
That’s the real lock-in of the era. It’s not vendor lock-in; it’s capability lock-in. The repository grows at a velocity and volume that has made manual maintenance mathematically unviable: there’s no “go back to coding by hand” button when 8x more change lands per day and a growing share of it lands with no review at all. Remember METR’s 30-50% of devs who refused to work without AI, even when paid by the hour? The door has already closed behind us, and unlike cloud lock-in, this one has no migration plan.
For anyone in ops, the smell is familiar: it’s the 3 a.m. page where the stack trace is on the screen, the Grafana panel is glowing red, and nobody on the call can explain what that module is supposed to do. The new part is that this stopped being a legacy-system exception and became the natural state of brand-new codebases, generated last week.
The hope: what actually pays off for SRE/DevOps
In September 2025, DORA (~5,000 professionals) coined the thesis that organized this debate: AI is an amplifier. It doesn’t fix a bad team — it magnifies the strengths of good organizations and the dysfunctions of bad ones — and the biggest return doesn’t come from the tool, it comes from the quality of the internal platforms, the workflows, and the organizational system underneath it.
To be honest: that report was born old. The data was collected in June/July 2025, when 61% of respondents had never even touched an agentic workflow, and the jump we’re living through arrived precisely after that. But here’s the thing: the 2026 telemetry didn’t refute the amplifier thesis. It showed the amplifier at max volume. In the same Faros dataset that documents the review chaos, the industry’s median cycle time dropped from 11 days (2020) to under 7 (2026), and the biggest drivers of that drop were AI-assisted review and mature async practices. Translation: the teams that treated verification as infrastructure are pulling ahead, while the others drown in their own throughput. The distance between the two groups has never been bigger, and it’s made of platform, not model.
Internal platform. Verification infrastructure. System. Who builds that?
Connecting the dots, the knowledge that gains value from here on in our field, as I read it:
Judgment and direction. Anthropic’s own essay admits the remaining human comparative advantage is “research taste”: choosing which problem matters, which result to trust, when a path is a dead end. In ops that translates to: which incident matters, which alert is noise, which migration is worth the risk. The model runs the playbook; someone has to decide whether the playbook makes sense.
Verification as a property, not a step. “Put AI to review it” was the easy answer — and a necessary one, the numbers show. But review, human or model, is a step; reliability is a property of the pipeline. Whoever materializes what “correct” means in an executable way — policy as code, contract tests, progressive delivery with automatic rollback, observability that proves behavior in production — builds the only thing that scales alongside code generation. SRE was always about trusting the system, not the person. Now neither the one writing nor the one reviewing is a person anymore, and the question “who verifies the verifier” has only one sane answer: the pipeline.
Operating the agents — and their bill. LLM AIOps research has exploded — there’s a survey in ACM Computing Surveys and a curated list of dozens of papers on RCA, incident triage, and agent-driven remediation. These agents are going to run somewhere, with identity, RBAC, audit trail, GPU scheduling, and SLOs. And after the semester Uber and Microsoft just had, it became obvious that governing their cost is a first-class problem: the Tokenomics Foundation existing is the market admitting it needs a FinOps for the agent era — caps, budgets, spend attribution, model-routing optimization, prompt caching. Anyone who’s already tamed a cloud bill recognizes this movie from the first frame, and knows it ends with a new discipline under the platform umbrella. This is our problem. In the best sense: it’s our opportunity.
Domain context and the mental model of the system. The model knows Kubernetes better than I do. It doesn’t know that the cluster in region X has had a janky peering link since 2023 and that team Y deploys on Fridays because of the India time zone. And in a world drowning in comprehension debt, whoever still carries the mental model of the system — and materializes it in docs, ADRs, and runbooks that both humans and agents consume — becomes the scarcest asset on the team. Commodity knowledge is the generic; the specifics of your system are your leverage, and they’re what turns a dumb agent into a useful one.
The roadmap: what to study to cover those four edges
Talking about abstract skill is easy; the hard part is knowing what to open on Monday morning. So here’s the map I find interesting — one macro axis per edge, and the concrete micro topics inside each. It’s not meant to be studied end to end: it’s to help you locate yourself and choose where your biggest gap is.
Edge 1 — Judgment and direction
Macro: systems thinking and decision-making under uncertainty. It’s the skill that automates the least and shows up the least in job descriptions — but it’s the one that separates senior from staff.
- Decision as artifact: ADRs in the MADR format — which even has an academic paper behind it (Kopp, Armbruster & Zimmermann, ZEUS 2018) — and internal RFCs: recording the why on top of the what. (If you’ve never written a real ADR, start there: it’s the cheapest judgment exercise there is.)
- SLOs and error budgets as a negotiation language: the Google SRE books and Implementing Service Level Objectives (Alex Hidalgo, O’Reilly) — an SLO isn’t an alert, it’s a decision tool about risk.
- Incident command and blameless postmortems: PagerDuty’s public incident response guide is the best free starting point. Coordinating the response matters more than typing the fix — even more when the one typing is an agent.
- Spec-driven development: writing specs an agent executes faithfully — GitHub’s Spec Kit formalized the practice into an open source workflow. Well-written intent became an engineering skill, not a PM one.
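The "SLO as a decision tool" idea from the list above gets concrete once you turn the target into a budget. A minimal sketch, assuming an availability-style SLO. The 25% reserve policy and the function names are my own illustration, not something taken from the Google SRE books or from Hidalgo.

```python
from datetime import timedelta


def error_budget(slo_target: float, window: timedelta) -> timedelta:
    """Downtime the SLO tolerates over the window (availability-style SLO)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return window * (1.0 - slo_target)


def budget_remaining(slo_target: float, window: timedelta, downtime: timedelta) -> float:
    """Fraction of the error budget still unspent; negative once the SLO is blown."""
    return 1.0 - downtime / error_budget(slo_target, window)


def can_take_risk(slo_target: float, window: timedelta, downtime: timedelta,
                  reserve: float = 0.25) -> bool:
    """Hypothetical policy: risky changes need more than `reserve` of the budget left."""
    return budget_remaining(slo_target, window, downtime) > reserve


window = timedelta(days=30)
print(error_budget(0.999, window))                           # roughly 43 minutes
print(can_take_risk(0.999, window, timedelta(minutes=30)))   # budget left above the reserve
print(can_take_risk(0.999, window, timedelta(minutes=40)))   # budget nearly spent
```

The threshold itself is the negotiation: product and ops agree on the reserve once, and "is this migration worth the risk?" stops being a matter of opinion.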
Edge 2 — Verification as a property of the pipeline
Macro: turning “correct” into executable code. The grounding here is plentiful: Veracode’s GenAI Code Security Report tested 100+ LLMs on 80 tasks and found an OWASP Top 10 vulnerability in 45% of the generated code (a rate that didn’t improve between 2025 and 2026, even with bigger models), and the academic literature has documented the problem since Pearce et al., “Asleep at the Keyboard?” (arXiv:2108.09293, IEEE S&P 2022). Add that to Faros’s 31% rise in PRs landing with no review and the conclusion is just one: the pipeline is the last line of defense left.
- Policy as code: OPA/Gatekeeper and Kyverno — admission, compliance, and security policies that run on every deploy, without depending on a human remembering.
- Progressive delivery: Argo Rollouts or Flagger — canary and blue/green with automatic metric analysis and ceremony-free rollback. It’s what makes merging volume safe.
- GitOps as audit trail: Argo CD or Flux, with every desired state versioned and every drift detected. When most commits come from a machine, the Git history is forensic evidence.
- Supply chain security: SLSA, Sigstore/cosign, SBOM with Syft/Grype — artifact provenance matters double when you don’t know who (or what) wrote the code.
- Tests that prove contracts: contract testing (Pact), mutation testing, and chaos engineering — formalized by Netflix in a paper (Basiri et al., Chaos Engineering, IEEE Software 2016, arXiv:1702.05843) and accessible today via Chaos Mesh or LitmusChaos — to validate resilience hypotheses instead of hoping for them.
- Observability as proof: end-to-end OpenTelemetry and SLO burn-rate alerts — production behavior is the final test no reviewer (human or model) replaces.
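To make "automatic metric analysis and ceremony-free rollback" less abstract, here is a sketch of the decision a canary gate makes. It is not how Argo Rollouts or Flagger implement analysis. The thresholds, the `Sample` type, and `canary_verdict` are all hypothetical, chosen only to show the shape of the logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def canary_verdict(baseline: Sample, canary: Sample, min_requests: int = 500,
                   max_ratio: float = 1.5, abs_floor: float = 0.001) -> str:
    """Return "wait", "promote", or "rollback" from error rates alone."""
    if canary.requests < min_requests:
        return "wait"  # not enough traffic to judge yet
    # Tolerate up to max_ratio times the baseline error rate, with a small absolute
    # floor so a near-perfect baseline does not fail the canary on a single error.
    allowed = max(baseline.error_rate * max_ratio, abs_floor)
    return "promote" if canary.error_rate <= allowed else "rollback"


stable = Sample(requests=10_000, errors=20)
print(canary_verdict(stable, Sample(requests=100, errors=0)))
print(canary_verdict(stable, Sample(requests=1_000, errors=2)))
print(canary_verdict(stable, Sample(requests=1_000, errors=9)))
```

Nobody reads the diff in this function. It does not care whether a human or a model wrote the change, and that is exactly why it scales.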
Edge 3 — Operating the agents and their bill
Macro: an agent is a workload like any other — with identity, SLOs, an attack surface, and a budget. LLMOps is the new part; the rest is classic platform applied to a noisy neighbor.
- Inference serving: vLLM — whose PagedAttention paper (Kwon et al., SOSP 2023, arXiv:2309.06180) is required reading to understand why inference GPUs waste memory — and alternatives like SGLang and TGI. Quantization, continuous batching, KV cache: the basics of making a GPU pay off.
- GPU on Kubernetes: NVIDIA GPU Operator, MIG/time-slicing, autoscaling with Karpenter/KEDA — capacity planning is a hard skill again.
- Model gateway and routing: LiteLLM, Envoy AI Gateway, or Kong AI Gateway — fallback across providers, prompt caching, per-team rate limiting. It’s the load balancer of the token era.
- LLM observability: OpenTelemetry’s GenAI semantic conventions as a neutral baseline, and tools like Langfuse, Arize Phoenix, or OpenLLMetry — plus continuous evals. Agent tracing is the new distributed tracing.
- Protocols and orchestration: MCP for tools, A2A for agent-to-agent; frameworks like LangGraph at the application level and K8s-native projects like kagent (CNCF Sandbox) for lifecycle via CRD and GitOps.
- Agent security: execution sandboxing (gVisor, Firecracker, Kata), workload identity with SPIFFE/SPIRE, least privilege on tool calls, and the OWASP Top 10 for LLM apps. Prompt injection is the new SQL injection — and the indirect kind is worse: the paper that formalized the attack (Greshake et al., arXiv:2302.12173) should be onboarding reading for any platform team.
- Token FinOps: the FinOps Foundation framework applied to inference — spend attribution per team, caps, showback/chargeback — and keep an eye on the just-announced Tokenomics Foundation, which wants to standardize exactly that.
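For the Token FinOps item, the core mechanic is the same showback table anyone from cloud FinOps already knows. A minimal sketch, assuming your gateway logs one record per request with a team tag and a cost. The `Usage` record shape and the team names are hypothetical, and the $1,500 cap is borrowed from the Uber figure above purely as an example value.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple


class Usage(NamedTuple):
    team: str        # attribution tag, e.g. derived from the API key (assumed)
    cost_usd: float  # cost of one request as logged by the gateway (assumed)


def showback(records: Iterable[Usage], monthly_cap_usd: float) -> dict[str, dict]:
    """Aggregate spend per team and flag whoever is over the monthly cap."""
    totals: dict[str, float] = defaultdict(float)
    for record in records:
        totals[record.team] += record.cost_usd
    return {
        team: {
            "spend_usd": round(spend, 2),
            "cap_used": round(spend / monthly_cap_usd, 2),
            "over_cap": spend > monthly_cap_usd,
        }
        for team, spend in sorted(totals.items())
    }


records = [
    Usage("payments", 900.0),
    Usage("payments", 750.0),
    Usage("search", 420.5),
]
for team, row in showback(records, monthly_cap_usd=1500.0).items():
    print(team, row)
```

The hard part in practice is not this aggregation. It is getting a trustworthy team tag onto every request, which is why the gateway item and this one belong together.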
Edge 4 — Domain context and mental model
Macro: institutional knowledge as infrastructure — readable by humans and agents. The CNCF Platforms whitepaper is the best neutral conceptual grounding for what “platform as a product” means. It’s the cheapest axis to start and the most underrated.
- IDP and service catalog: Backstage with TechDocs, or managed alternatives like Port and Cortex — the catalog is the map that gives context to both the new dev and the agent.
- Docs-as-code and executable runbooks: a runbook that becomes automation, and context files for the agent-team. The open AGENTS.md standard (and equivalents like CLAUDE.md) became the knowledge interface between your repository and the models.
- RAG over internal knowledge: start with the original paper (Lewis et al., 2020, arXiv:2005.11401), pick a vector store (Qdrant, pgvector, Weaviate, Milvus; it doesn’t matter for learning), and invest in the part that separates toy from tool: retrieval evaluation, with frameworks like RAGAS. An agent without good context is a confident intern.
- Postmortems and ADRs as a dataset: write with the assumption that this will be consumed by semantic search two years from now. Today’s incident is tomorrow’s agent context.
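On retrieval evaluation: before reaching for a framework, it helps to see how small the core idea is. The sketch below uses two classic information-retrieval metrics over a hand-labeled set. It is not the RAGAS API (RAGAS adds LLM-based metrics on top), and the questions and document IDs are invented for illustration.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of the relevant documents that appear in the top-k results."""
    if not relevant:
        raise ValueError("need at least one relevant document")
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1/rank of the first relevant hit, or 0.0 if nothing relevant came back."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


# Hypothetical labeled set: question -> (retriever output, docs a human marked relevant).
eval_set = {
    "why does region X drop packets?": (
        ["adr-012", "runbook-peering", "pm-2023-07"],
        {"runbook-peering", "pm-2023-07"},
    ),
    "who deploys on Fridays?": (
        ["adr-003", "adr-044", "runbook-dns"],
        {"adr-101"},
    ),
}

recalls = [recall_at_k(got, want, k=3) for got, want in eval_set.values()]
rranks = [reciprocal_rank(got, want) for got, want in eval_set.values()]
print(f"recall@3 = {sum(recalls) / len(recalls):.2f}")
print(f"MRR      = {sum(rranks) / len(rranks):.2f}")
```

A few dozen labeled questions like these, taken from real incidents, will tell you more about whether your agent has context than any amount of vector store benchmarking.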
If I had to prioritize: start with edge 2 — it’s where the Faros data shows the fire right now, and it’s what transfers most from what you already know. 3 is where the growth is (and the jobs). 4 is the silent multiplier you can advance in parallel, one ADR a week. And 1 isn’t studied in a sprint — it’s practiced inside all the others.
One last data point to close, because it sums up the future for me. Project Glasswing — the coalition bringing together Anthropic, AWS, Apple, Google, Microsoft, NVIDIA, the Linux Foundation itself, and others to use Mythos defensively — found more than ten thousand high- or critical-severity vulnerabilities in a few weeks, and just expanded to another 150 organizations across 15+ countries, including energy, water, and healthcare sectors. Finding a vulnerability, officially, became a commodity.
And the most telling detail came from a critic of the project: Bruce Schneier, dissecting the status report, pointed out that almost none of those vulnerabilities have been fixed so far. Ten thousand flaws found — and patching isn’t keeping up. You can disagree with Schneier about how much of this is Anthropic PR; but his observation independently confirms the thesis: the bottleneck of global security stopped being finding and became fixing fast enough — in production, without taking anything down, at scale.
That, my friends, is ops. It always was.
AI isn’t taking reliability out of fashion. It’s manufacturing change at a velocity that makes reliability the scarcest resource in the system. And scarcity, as anyone who’s ever negotiated a salary knows, is where the value lives.
Sources and further reading
News of the week
- Anthropic Institute — When AI builds itself (Jun 2026)
- Reuters — Anthropic says AI labs need coordinated plan to halt development if risks rise
- The Economist — Will artificial intelligence soon escape human control?
- TechCrunch — Anthropic’s Claude Fable 5 is a version of Mythos the public can access today
- NBC News — coverage of the Fable 5 launch
- CNBC — Fable 5 and Mythos 5 announcement and Project Glasswing’s expansion to 150 organizations
- ECO (in Portuguese) — Anthropic launches Claude Fable 5, the public version of the Mythos model with new safeguards
- crypto.news — Fable 5’s safety classifiers
- CRN — 5 things to know on Anthropic’s Claude Mythos and Project Glasswing
- BBC — coverage
Project Glasswing in depth
- Anthropic — the official Project Glasswing page, the initial update with the numbers, and the technical red team deep dive
- Cloudflare — Project Glasswing: what Mythos showed us — a hands-on account from a partner using the model in real security work (required reading if you’re in infra)
- Schneier on Security — Anthropic’s Project Glasswing Update — the skeptical counterpoint
The bill coming due (AI’s operational cost)
- Fortune (May 2026) — Uber burned through its entire 2026 AI budget in four months
- Fortune (May 2026) — Microsoft reports are exposing AI’s real cost problem
- TechCrunch (Jun 2026) — The token bill comes due: inside the industry scramble to manage AI’s runaway costs (includes the Tokenomics Foundation announcement, from the Linux Foundation)
Papers and studies
- Becker, Rush, Barnes & Rein (METR, 2025) — Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity — arXiv:2507.09089
- METR (Feb 2026) — We are Changing our Developer Productivity Experiment Design (the follow-up to the study above)
- Kwa et al. (METR, 2025) — Measuring AI Ability to Complete Long Tasks — arXiv:2503.14499 and the tracker metr.org/time-horizons
- DORA / Google Cloud (2025) — State of AI-assisted Software Development and the follow-up The ROI of AI-assisted Software Development (2026)
- Faros AI (Mar 2026) — AI Engineering Report 2026: The Acceleration Whiplash — telemetry from 22,000 devs / 4,000 teams
- Addy Osmani (Mar 2026) — Comprehension Debt: the hidden cost of AI-generated code (also on O’Reilly Radar)
- Beyond Technical Debt: How AI-Assisted Development Creates Comprehension Debt in Resource-Constrained Indie Teams — arXiv:2512.08942
- Zhang et al. (2025) — A Survey of AIOps in the Era of Large Language Models — ACM Computing Surveys — doi:10.1145/3746635
- Cottier et al. (Epoch AI, 2025) — LLM inference prices have fallen rapidly but unequally across tasks
- Algorithmic Efficiency and the Falling Cost of AI Inference (2025/2026) — arXiv:2511.23455
- Siegel et al. (2024) — CORE-Bench (research reproducibility by agents) — arXiv:2409.11363
- Curated LLM+AIOps paper list — awesome-LLM-AIOps
Papers and reports cited in the roadmap
- Veracode (2025, updated 2026) — GenAI Code Security Report
- Pearce et al. (IEEE S&P 2022) — Asleep at the Keyboard? Assessing the Security of GitHub Copilot’s Code Contributions — arXiv:2108.09293
- Greshake et al. (2023) — Not what you’ve signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection — arXiv:2302.12173
- Kwon et al. (SOSP 2023) — Efficient Memory Management for Large Language Model Serving with PagedAttention (the vLLM paper) — arXiv:2309.06180
- Lewis et al. (NeurIPS 2020) — Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks — arXiv:2005.11401
- Es et al. (2023) — RAGAS: Automated Evaluation of Retrieval Augmented Generation — arXiv:2309.15217
- Basiri et al. (IEEE Software 2016) — Chaos Engineering — arXiv:1702.05843
- Kopp, Armbruster & Zimmermann (ZEUS 2018) — Markdown Architectural Decision Records: Format and Tool Support
- CNCF TAG App Delivery — Platforms White Paper