#8 — DevOps Jobs Agentic AI Can Actually Do - B.O.R.I.S


Summary

After seven foundation-laying episodes, the hosts of Agentic AI in DevOps take the practitioner’s tour: which DevOps jobs agentic AI actually does well, and which still fight back. Devyatkin reframes the “AI deleted my production database” headlines, arguing they are functionally identical to “my terminal deleted my database” — the human gave the credentials and confirmed the action — and walks through why infrastructure-as-code is harder for agents than application code (one word: state). The hosts dig into the gap between C-suite adoption claims and practitioner reality, with Gonçalves noting that “use AI” is now a manager KPI, and they land on documentation, runbooks, and postmortems as the place where agents quietly earn their keep right now.

Key Topics

“AI deleted my production database” — blame the human

The episode opens on a recurring headline pattern: someone gives an agent live database credentials, lets it run any tool it asks for, and then blames the agent when production is gone. Devyatkin’s framing is that this is no different from a human deleting GitLab’s production database years ago, with no AI involved. The agent asked for confirmation; the operator gave it. As Devyatkin puts it, blaming the agent is like saying “my terminal deleted my database” — you take the responsibility, don’t blame the tool.

Samoylov makes the structural point underneath the headlines: in most of these incidents, dev and prod live in one basket, so the blast radius was always going to be production — “it’s just a matter of time.” Gonçalves shares a story from a practitioner who watched an agent actively try to route around a hook designed to block destructive rm commands. His own working pattern is to keep the agent out of the live path entirely: it generates Terraform, he reviews and applies. He compares letting an agent execute against production credentials to handing those credentials to a brand-new hire who has not yet built the judgment to know when to stop.

This connects directly to the hooks and confirmation patterns covered in episode #5 — the safeguards exist; the question is whether teams configure and respect them.
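As a rough illustration of what such a hook might check, here is a minimal sketch that assumes the hook receives the proposed shell command as a plain string. The function name `is_destructive` and the deny-list are hypothetical and do not reflect any agent platform's actual API.

```python
import shlex

# Hypothetical deny-list: programs treated as destructive outright.
DENIED_COMMANDS = {"mkfs", "dd", "shred"}


def is_destructive(command: str) -> bool:
    """Return True if the command looks like a recursive/forced rm
    or starts with a program on the deny-list. Illustrative only."""
    try:
        tokens = shlex.split(command)
    except ValueError:
        # Unparseable input (e.g. unbalanced quotes): block to be safe.
        return True
    if not tokens:
        return False
    program = tokens[0].rsplit("/", 1)[-1]  # "/bin/rm" -> "rm"
    if program in DENIED_COMMANDS:
        return True
    if program == "rm":
        for tok in tokens[1:]:
            if tok in ("--recursive", "--force"):
                return True
            # Bundled short flags such as -rf or -fR
            if tok.startswith("-") and not tok.startswith("--"):
                if any(ch in tok for ch in "rRf"):
                    return True
    return False


for cmd in ["rm -rf /var/lib/postgresql", "ls -la", "rm notes.txt"]:
    print(cmd, "->", "BLOCK" if is_destructive(cmd) else "allow")
```

This sketch only inspects the first program in the command, so chained commands or wrappers such as sudo would slip past it. That is the kind of gap an agent can find, which is why a pattern check is one layer and not a substitute for keeping production credentials out of the agent's reach.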

Code generation works; infrastructure-as-code is harder

Code generation is the use case that has clearly landed — Devyatkin notes it is why Cursor, Anthropic, and others are seeing strong revenue. The reason it works is that application code is largely stateless: an agent can read it, run it, and reason about it end to end on a laptop.

Infrastructure-as-code is a different shape. Terraform has state, and agents are often state-unaware: they generate plans that conflict on apply, miss moved blocks, or propose destroys where a refactor was intended. Devyatkin’s recommendation is to close that gap with skills, either your own or established community skills like Anton Babenko’s Terraform skill for AI agents, so the agent gets the procedural context it needs before it starts generating HCL. Gonçalves adds a mistake people often make: skipping plan mode, asking for something complex with no context, and then being surprised when the output is rough.
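One way to catch the "destroy where a refactor was intended" case before apply is to inspect the machine-readable plan. The sketch below assumes a plan exported with `terraform show -json` and reads the `resource_changes` entries and their `change.actions` lists. The function name `find_destroys` and the sample resources are illustrative, not something discussed in the episode.

```python
import json


def find_destroys(plan_json: str) -> list[str]:
    """Return addresses of resources the plan would delete or replace."""
    plan = json.loads(plan_json)
    flagged = []
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        # A replacement shows up as a delete paired with a create.
        if "delete" in actions:
            flagged.append(rc["address"])
    return flagged


# Hand-written sample in the shape of `terraform show -json` output.
sample = json.dumps({
    "resource_changes": [
        {"address": "aws_s3_bucket.logs", "change": {"actions": ["no-op"]}},
        {"address": "aws_db_instance.main", "change": {"actions": ["delete", "create"]}},
        {"address": "aws_instance.web", "change": {"actions": ["update"]}},
    ]
})
print(find_destroys(sample))
```

A non-empty result is a reasonable trigger for a human to stop and check whether a `moved` block was the intended change.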

A second, often-overlooked benefit of code-aware agents: they make unfamiliar repositories navigable. A platform engineer or SRE who is not fluent in the application’s language can drop into the repo with an agent and get a guided tour of how the pieces fit together — useful when the failure is in code you do not own.

The C-suite vs. practitioner expectation gap

The hosts return to a recurring observation from their own client work and from public reporting: leadership tends to describe AI adoption and impact in markedly more optimistic terms than the engineers actually using the tools. Devyatkin attributes part of this to a hype-driven expectation of magic: when the tool needs hand-holding, practitioners feel let down even when the underlying productivity gain is real.

Samoylov offers a more interesting read: managers know how to delegate. They expect failure, write detailed instructions up front, and budget for iteration. Practitioners often dump a small slice of context, iterate reactively, and never invest in setup — so they get worse results from the same tools.

Gonçalves adds a structural detail: AI usage is now a manager KPI in many engineering orgs. “Did your team adopt Cursor or Claude Code?” is being measured. That answers the survey question without answering the productivity question. He also flags a training gap — leadership assumes technical people will figure technical tools out on their own, hands them a license, and skips the part where someone explains how to actually use the thing well.

Skills, runbooks, and living documentation

This is where the hosts agree the wins are clearest right now. Runbooks have always been a halfway artifact: too judgment-heavy to wrap in a bash script, so they are written for humans. An agent that can apply some judgment is a runbook executor — the document becomes the automation.

Devyatkin describes a self-evolving pattern: at the end of a session that ran a skill, ask the agent to reflect on gaps and propose updates. Do that continuously and the skill improves itself in place. For greenfield skill creation, the “grill me” pattern — instruct the agent to interview you about a procedure, then write the skill from your answers — does most of the structural heavy lifting. The technique requires no extra tooling, just willingness to change how you work.

Samoylov frames the regulated-industry angle: keeping runbooks current is a compliance requirement, and this is exactly the kind of routine maintenance work that benefits from agents that already see infrastructure changes as they happen — close the gap between current state and documented state on a schedule.

Gonçalves connects this back to senior engineers: grill the top performers, capture what they actually do, hand the resulting skills to the rest of the team. Devyatkin cites a recent adoption study with the same recommendation — replicate the magic of your top performers through documented skills they author. The hosts note this raises an uncomfortable feeling — “you ask a person to build a replacement of himself,” as Samoylov puts it — and Gonçalves references news of Meta recording employee desktop interactions to train models, putting the same dynamic in a starker frame.

Postmortems

Adjacent to runbooks but worth its own callout: postmortems work surprisingly well even with naive prompting. Dump the Slack thread, a couple of screenshots, some logs from the relevant systems, and the agent can construct a timeline and draft. The judgment — what to improve, what the takeaway is — still belongs to a human, but the heavy lifting of correlation and narrative is exactly the part engineers most often skip. Devyatkin’s framing: agents remove the excuse for not writing them.

CI/CD pipelines

Pipelines are a sweet spot. Pipeline definitions are mostly declarative YAML with very little imperative state, which makes them easy for agents to reason about. Generation works well, and so does troubleshooting, especially with the gh CLI in the agent’s tool layer.

Gonçalves describes the multiplier effect: in projects with twenty repositories of varying pipeline quality, you can codify a golden standard and have the agent adapt it to PHP, Node.js, or whatever each repo uses — keeping deploy stages and method consistent while replacing the language-specific steps. Devyatkin extends the pattern with Cursor automations: package the golden-standard skill, run it on a schedule across your repository zoo, open pull requests where repos drift. Drift detection becomes a cron job that produces PRs.
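The deterministic half of that scheduled check can be sketched as a comparison of each repository's pipeline stages against the golden standard, leaving the agent to write the language-specific pull request. The repository names, stage names, and `find_drift` below are hypothetical, and the sketch assumes the stage lists have already been extracted from each repo's pipeline file.

```python
# Hypothetical golden standard: stages every pipeline should define.
GOLDEN_STAGES = ["lint", "test", "build", "deploy"]


def find_drift(repo_pipelines: dict[str, list[str]]) -> dict[str, list[str]]:
    """Map each drifting repo to the golden stages its pipeline is missing."""
    drift = {}
    for repo, stages in repo_pipelines.items():
        missing = [s for s in GOLDEN_STAGES if s not in stages]
        if missing:
            drift[repo] = missing
    return drift


# Illustrative input: stage names per repository.
pipelines = {
    "billing-php": ["lint", "test", "build", "deploy"],
    "web-node": ["test", "build", "deploy"],
    "legacy-api": ["build"],
}
for repo, missing in find_drift(pipelines).items():
    print(f"{repo}: open PR to add {', '.join(missing)}")
```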

Security checks: deterministic plus non-deterministic

Samoylov makes a precise point about layered analysis. Deterministic security tools (linters, scanners) all have allow-lists — rules someone disabled “just to ship” a year ago that have quietly become permanent. The agent does not care about that history. It can read the allow-list, understand why each entry was added, check whether the workaround is still necessary, and propose the actual fix when one now exists. The agents do not replace the deterministic tools; they bring a second axis of review that catches what the allow-list buried.
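A small pre-filter can hand the agent the allow-list entries most worth re-examining. This is a sketch under stated assumptions: the `Suppression` record, the rule IDs, and the 180-day threshold are all invented for illustration, and real scanners store their suppressions in their own formats.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Suppression:
    rule_id: str   # hypothetical scanner rule identifier
    reason: str    # free-text justification recorded when it was added
    added: date


def stale_suppressions(entries, today, max_age_days=180):
    """Return suppressions older than max_age_days or lacking a reason."""
    stale = []
    for e in entries:
        age = (today - e.added).days
        if age > max_age_days or not e.reason.strip():
            stale.append(e)
    return stale


entries = [
    Suppression("SEC-101", "just to ship", date(2024, 1, 10)),
    Suppression("SEC-202", "false positive in vendored code", date(2025, 5, 1)),
    Suppression("SEC-303", "", date(2025, 5, 20)),
]
for e in stale_suppressions(entries, today=date(2025, 6, 1)):
    print(f"review {e.rule_id}: added {e.added}, reason={e.reason!r}")
```

The age check only surfaces candidates. Deciding whether the workaround is still necessary, and what the actual fix is, remains the agent's and the reviewer's job.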

Security triage and SOC automation

Devyatkin describes a pattern he has seen in practice: a single engineer running what amounts to a 24/7 SOC pipeline by wiring agents into incoming security signals. The agents enrich each issue with context from multiple sources, decide whether it is worth a human’s attention, and only escalate the survivors. Gonçalves frames why this matters: a huge fraction of security alert work is investigation that ends in “this was a waste of time.” Pushing that filtering layer onto agents — with feedback (“this one was not useful, here’s why”) so the filter improves — is high-leverage and low-risk: the worst case is that a human still has to look.
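The enrich, decide, escalate flow can be sketched as a small pipeline. Everything here is an assumption made for illustration: the `Alert` shape, the `triage` function, and the two stand-in callables, which in a real setup would be agent calls and lookups against actual data sources.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    source: str
    title: str
    context: dict = field(default_factory=dict)


def triage(alerts, enrich, is_worth_human):
    """Enrich each alert, then return only those flagged for escalation."""
    escalated = []
    for alert in alerts:
        alert.context.update(enrich(alert))
        if is_worth_human(alert):
            escalated.append(alert)
    return escalated


def enrich(alert):
    # Stand-in for lookups against asset inventory, threat intel, and so on.
    return {"internet_facing": alert.source == "edge-waf"}


def is_worth_human(alert):
    # Stand-in for the agent's judgment call.
    return alert.context["internet_facing"]


alerts = [
    Alert("edge-waf", "SQLi pattern on login endpoint"),
    Alert("dev-scanner", "outdated dependency in sandbox"),
]
for a in triage(alerts, enrich, is_worth_human):
    print("escalate:", a.title)
```

Passing the two steps in as functions keeps the filter swappable, which is where the feedback loop described above would plug in.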

Code review

Devyatkin places AI-enabled code review in the “jury still out” bucket. Done well, agents can flag security issues and code duplication. But quality depends on both context and model: anything below Sonnet on high thinking mode tends to produce noise, and that tier is expensive to run. On a high-velocity team, the bill scales with PR volume in a way that can surprise. The hosts do not dismiss the use case; they flag it as one where the cost-benefit math needs explicit attention before you turn it on for the whole org.
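The cost-benefit math can start as back-of-the-envelope arithmetic. In the sketch below every number is a placeholder, not a real price or a measured token count, and `monthly_review_cost` is a hypothetical helper. Substitute your provider's current rates and your own PR statistics.

```python
def monthly_review_cost(prs_per_month, input_tokens_per_pr, output_tokens_per_pr,
                        usd_per_m_input, usd_per_m_output):
    """Estimate monthly spend in USD for running a review agent on every PR.

    Prices are expressed per million tokens.
    """
    per_pr = (input_tokens_per_pr * usd_per_m_input
              + output_tokens_per_pr * usd_per_m_output) / 1_000_000
    return prs_per_month * per_pr


# Placeholder inputs, chosen only to show how the bill scales with PR volume.
for prs in (100, 1_000, 5_000):
    cost = monthly_review_cost(prs, 200_000, 10_000, 3.0, 15.0)
    print(f"{prs} PRs/month -> ${cost:,.2f}")
```

The estimate is linear in PR volume, so a team that triples its merge rate triples the bill unless it narrows which PRs get reviewed.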

Incident investigation and on-call

Log analysis is where agents shine: pattern matching across thousands of lines is exactly the work humans get fatigued doing. Samoylov’s observation is that engineers who have stared at the same logs for months read them intuitively, but a new hire — or any human asked to triage unfamiliar logs at 3 a.m. — has no such advantage. Agents do not get tired and do not have to onboard.

Devyatkin’s caveat: applying agents to autonomous on-call pre-triage is plausible, but expensive in token terms. “Humans have alert fatigue. AI doesn’t. But your wallet does.” For high-volume alert streams, this is another use case where the production math matters more than the demo.

Resources

The GenAI Divide: State of AI in Business 2025 — MIT NANDA — MIT Project NANDA’s July 2025 report, source of the widely cited finding that ~95% of enterprise GenAI pilots show no measurable P&L impact while ~5% extract significant value.
How a GitLab developer accidentally deleted the production database — The 2017 GitLab incident Devyatkin references — a human, no AI involved, classic blast-radius lesson.
Anton Babenko’s Terraform skill for AI agents — Community Terraform/OpenTofu best-practices skill compatible with Claude Code and other agent platforms; the kind of procedural context the hosts recommend wrapping around state-aware IaC work.
Claude Code Skills documentation — Official guidance for authoring and using skills, the mechanism the hosts recommend for capturing runbooks, golden-standard pipelines, and senior-engineer judgment.
GitHub CLI (gh) — The tool that makes pipeline troubleshooting and PR/run interactions tractable for agents working on GitHub Actions.
Episode #3 — Skills, Powers, SOPs — The foundations episode on skills referenced throughout, including supply-chain considerations for third-party skills.
Episode #5 — Stop Your Agent Before It Breaks Prod — Hooks and the confirmation patterns that prevent the “AI deleted prod” failure mode the episode opens on.
Episode #6 — The Big AI Squeeze — Pricing and economics context for the cost-sensitivity flags around code review and on-call automation.
