#4 — Harness Engineering: What Claude Code Accidentally Taught Everyone - B.O.R.I.S

A packaging mistake exposed Claude Code's full source tree to the world — and instead of scandal, the community got a masterclass in how agentic coding tools actually work under the hood. In this episode, Andrey, Vladimir, and Fernando unpack what the disclosure revealed about the engineering behind coding agents, introduce the emerging discipline of "harness engineering," and argue that the model is only the horsepower — it is the harness that determines whether the agent gallops toward the right destination or off a cliff. Along the way, they weigh in on NVIDIA's OpenShell sandboxing, Cursor 3.0's agent-first interface, and why Docker containers are not the security blanket many DevOps engineers assume them to be.


Key Topics

NVIDIA OpenShell and the Agent Containment Problem

Vladimir introduces OpenShell, NVIDIA’s open-source sandboxing runtime related to the NemoClaw stack. OpenShell operates across four layers — network, filesystem, process, and inference — all configurable through a single YAML policy schema. The hosts frame this in the context of a growing concern: agents running locally on developer machines inherit full user permissions, making prompt injection, malicious skills, and supply chain attacks real threats.

The discussion turns to Docker containers, a tool many teams reflexively reach for when thinking about isolation. Vladimir pushes back hard: containers were built for packaging and process isolation, not as a security boundary. Container escapes are not trivial, but they happen often enough to be treated as routine, especially when the container runs as root with a full operating system and toolchain inside. Fernando adds that agents need file and network access to be useful, creating an inherent tension between containment and capability. Andrey raises an uncomfortable point: if agentic coding tools are already good enough to find and configure Sonos speakers on a local network, how capable would they be at escaping a container if asked?

The consensus: containers are better than nothing, but security is about layers. OpenShell represents a more purpose-built approach to constraining what agents can access and where they can reach.
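To make the "layers" idea concrete, here is a minimal Python sketch of a deny-by-default policy check across the four layers mentioned above (network, filesystem, process, inference). It is an illustration only: `AgentPolicy`, its field names, and `violations` are hypothetical and are not OpenShell's YAML schema or API.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class AgentPolicy:
    """Hypothetical policy object; the field names are NOT OpenShell's schema."""
    allowed_hosts: frozenset = frozenset()      # network layer
    writable_roots: tuple = ()                  # filesystem layer
    allowed_binaries: frozenset = frozenset()   # process layer
    allowed_models: frozenset = frozenset()     # inference layer


def path_is_writable(policy: AgentPolicy, path: str) -> bool:
    """Allow writes only to absolute paths under a writable root, with no '..' parts."""
    target = PurePosixPath(path)
    if not target.is_absolute() or ".." in target.parts:
        return False
    return any(target.is_relative_to(root) for root in policy.writable_roots)


def violations(policy: AgentPolicy, *, host=None, write_path=None,
               binary=None, model=None) -> list:
    """Check one requested action against every layer; an empty list means allowed."""
    found = []
    if host is not None and host not in policy.allowed_hosts:
        found.append(f"network: host {host!r} is not allowlisted")
    if write_path is not None and not path_is_writable(policy, write_path):
        found.append(f"filesystem: {write_path!r} is outside the writable roots")
    if binary is not None and binary not in policy.allowed_binaries:
        found.append(f"process: binary {binary!r} is not allowlisted")
    if model is not None and model not in policy.allowed_models:
        found.append(f"inference: model {model!r} is not allowlisted")
    return found


# Example values below are placeholders chosen for illustration.
policy = AgentPolicy(
    allowed_hosts=frozenset({"api.example.com"}),
    writable_roots=("/workspace",),
    allowed_binaries=frozenset({"git", "pytest"}),
    allowed_models=frozenset({"local-model"}),
)
print(violations(policy, host="api.example.com",
                 write_path="/workspace/src/app.py", binary="git"))
print(violations(policy, host="203.0.113.7",
                 write_path="/workspace/../etc/passwd", binary="curl"))
```

The point of the sketch is that each layer is checked independently, so an action that slips past one check (for example, an allowlisted binary) can still be stopped by another (a write outside the workspace). A real sandbox enforces this at the operating system level rather than in application code.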

Cursor 3.0 and the Convergence of Agentic Interfaces

The hosts discuss Cursor 3.0’s redesigned interface, which moves away from the traditional IDE file-tree layout toward a fully agentic view: workspaces and parallel chats on the left, the active conversation in the center, and a built-in terminal and browser on the right. Fernando compares the polish to “driving a BMW in a coding world” and notes the layout resembles both Codex and tmux — though considerably more refined than the latter.

Vladimir highlights a practical improvement: the previous Cursor version crammed agent panes into tiny side windows, which made parallel agent work impractical. The new layout better supports running five or more agents simultaneously, though he notes there are even faster development environments worth exploring in future episodes. Fernando traces the trend back to Google Antigravity, which popularized the idea that code output could be managed rather than manually written — a paradigm the entire industry is now converging toward.

The broader observation: Cursor, Codex, Antigravity, and CLI tools like OpenCode are all arriving at remarkably similar shapes, echoing the container orchestrator wars of a decade ago — but moving at far greater speed.

The Claude Code Source Code Disclosure

Andrey carefully frames what happened: a packaging mistake in the npm release of Claude Code included source maps that exposed the full unobfuscated TypeScript source. This was not a malicious leak but an accidental disclosure. Repositories sprang up with copies, and Anthropic began issuing takedown notices. Andrey mentions having seen numbers suggesting a couple of million dollars spent in tokens to build the tool, though he explicitly cautions listeners to “take them with a grain of salt” and not to quote him on that figure.
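For readers unfamiliar with why a source map can reveal original code: the Source Map v3 format has an optional `sourcesContent` array that may carry the full text of each original file. Whether that field was the exact mechanism in this incident is not stated in the episode, so treat the following as a generic pre-publish check. The function name `embedded_sources` and the sample map are made up for illustration.

```python
import json


def embedded_sources(source_map_text: str) -> list:
    """Return names of original files whose full text is embedded in the map.

    Handles plain (non-indexed) v3 maps only; maps that use "sections"
    are out of scope for this sketch.
    """
    data = json.loads(source_map_text)
    sources = data.get("sources", [])
    contents = data.get("sourcesContent") or []
    # Entries in sourcesContent may be null when a file's text is not embedded.
    return [name for name, text in zip(sources, contents) if text is not None]


# A tiny hand-written map used only to exercise the check.
example_map = json.dumps({
    "version": 3,
    "file": "cli.js",
    "sources": ["src/main.ts", "src/util.ts"],
    "sourcesContent": ["export const x = 1;\n", None],
    "mappings": "",
})
print(embedded_sources(example_map))
```

Running a check like this over every `.map` file in a package before publishing is one way to catch the kind of packaging mistake described above.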

The hosts argue the disclosure’s biggest beneficiary is the broader ecosystem. Open-source and commercial harness builders — including their own B.O.R.I.S — got a window into how Anthropic engineers the tool layer around their models. Community reverse-engineering of the disclosed source code surfaced a number of notable features and internal mechanisms, none of which have been officially confirmed by Anthropic:

Kairos: Described in community analysis as a proactive background loop, similar in concept to NemoClaw, suggesting autonomous monitoring capabilities.
Outer Dream: Interpreted by community analysts as a memory consolidation system inspired by how humans process information during sleep.
Memory layers: Community examination revealed what appears to be the full architecture of how Claude Code manages persistent context (visible in the .claude directory on disk).
Anti-distillation decoy tools: Identified in the source as apparent countermeasures against competitors using Claude to train competing models.
Numerous slash commands and hidden features: Community analysis also surfaced a large number of commands and capabilities — including a security review command, ultraplan (which reportedly offloads planning to Anthropic’s infrastructure), and many others that most users never discover. Fernando references a figure of roughly eighty-five commands from the community analysis, though the exact count could not be independently verified against Anthropic’s public documentation.

On the telemetry controversy, Andrey notes the code confirms Claude Code uses Sentry and Statsig, both publicly listed Anthropic partners, for error tracking and feature usage analytics. Nothing points to unauthorized data collection, though he speculates that stack traces in Sentry could inadvertently capture fragments of user information despite best intentions.

Harness Engineering: The Main Event

Fernando sets up the core metaphor: the model is the horse — raw intelligence and power — while the harness is everything that steers it. Claude Code, Cursor, Codex, Kiro — they all wrap the same foundation models with different harnesses, and the harness is what determines the quality of output.

Andrey introduces the formal framework from a recent guest article on the Martin Fowler blog titled “Harness Engineering.” The article distinguishes two layers:

Outer harness (user harness): Feed-forward and feedback controls configured by the user — skills, prompts, hooks, CLAUDE.md files, and custom rules. As explored in previous episodes on skills (episode #3) and context management (episode #2), this is where teams encode their standards and workflows.
Inner harness (provider harness): The system prompts, built-in tools, slash commands, and self-correction mechanisms built by the tool provider.

A well-built outer harness serves two goals: increasing the probability the agent gets it right on the first attempt, and providing feedback loops that self-correct before issues reach human review.

The framework further breaks down into two control types:

Feed-forward guides: Instructions provided before generation — skills, how-tos, rules, language server access, and RAG-style memory retrieval.
Sensors (feedback): Checks applied after generation — static analysis, unit tests, log inspection, browser verification, and review agents.

Each control type has two flavors:

Computational: Deterministic, fast, CPU-bound — linters, type checkers, test suites. Milliseconds to seconds. Reliable results.
Inferential: Semantic, slower, GPU-bound — AI code reviews, LLM-as-judge patterns. More expensive, non-deterministic.

Fernando draws a direct parallel to CI/CD pipelines and the testing pyramid: unit tests are cheap and fast (like computational sensors), while end-to-end tests are brittle and expensive (like inferential sensors). The same economics apply to harness engineering — lean on deterministic checks wherever possible and reserve expensive LLM-based review for where it truly adds value, as discussed in the context management principles from episode #2.
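The ordering described here can be sketched as a small feedback loop: run the cheap computational sensors first, and call the inferential ones only when those pass. This is a minimal illustration, not how any particular tool implements it. `run_sensors`, `harness_loop`, and the stub reviewer are hypothetical names, and the "LLM review" is a placeholder function rather than a real model call.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A sensor inspects generated code and returns an error message, or None if it passes.
Sensor = Callable[[str], Optional[str]]


def syntax_sensor(code: str) -> Optional[str]:
    """Computational sensor: deterministic and fast."""
    try:
        compile(code, "<agent-output>", "exec")
    except SyntaxError as exc:
        return f"syntax error: {exc.msg}"
    return None


def collect(code: str, sensors: Dict[str, Sensor]) -> List[str]:
    """Run every sensor in the group and gather the failures."""
    findings = []
    for name, check in sensors.items():
        message = check(code)
        if message is not None:
            findings.append(f"{name}: {message}")
    return findings


def run_sensors(code: str, computational: Dict[str, Sensor],
                inferential: Dict[str, Sensor]) -> List[str]:
    """Cheap checks first; expensive checks only if the cheap ones pass."""
    findings = collect(code, computational)
    if findings:
        return findings
    return collect(code, inferential)


def harness_loop(generate: Callable[[List[str]], str],
                 computational: Dict[str, Sensor],
                 inferential: Dict[str, Sensor],
                 max_attempts: int = 3) -> Tuple[Optional[str], int]:
    """Feed sensor findings back into generation until the output is clean."""
    feedback: List[str] = []
    for attempt in range(1, max_attempts + 1):
        code = generate(feedback)
        feedback = run_sensors(code, computational, inferential)
        if not feedback:
            return code, attempt
    return None, max_attempts


# Stubs standing in for a model and an LLM-as-judge reviewer.
drafts = iter(["def f(:\n    pass\n", "def f():\n    return 1\n"])
review_calls = []


def fake_generate(feedback: List[str]) -> str:
    return next(drafts)


def fake_review(code: str) -> Optional[str]:
    review_calls.append(code)
    return None


result = harness_loop(fake_generate, {"syntax": syntax_sensor}, {"review": fake_review})
print(result[1], len(review_calls))
```

In this run the first draft fails the syntax check, so the stub reviewer is never invoked for it. The expensive sensor is called once, on the draft that already passed the cheap one.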

Harness Differences in Practice: Claude Code vs. Kiro

The hosts compare how different inner harnesses produce dramatically different developer experiences with the same underlying models. Vladimir shares that in his hands-on usage, Kiro feels “really TDD heavy”: in his experience, it generates extensive unit tests before writing implementation code and leans on them for self-correction. This reflects Vladimir’s personal workflow with the tool; Kiro’s official design centers on a spec-driven workflow with optional correctness checks rather than mandating test-first development by default. Claude Code, by contrast, treats testing more like a typical developer: write the code first, maybe add some tests afterward. Andrey notes that Claude Code is “well known to disregard instructions” from CLAUDE.md and AGENTS.md files, while Kiro strictly follows its steering rules, which he attributes to how each tool’s inner harness is engineered.

Why Harness Engineering Matters for DevOps

The discussion turns to DevOps-specific implications. When troubleshooting a live system, the harness determines whether the agent has the right tools: DNS lookups, log access, metrics dashboards. Andrey gives a concrete example with AWS ECS service events — querying them via CLI dumps massive amounts of noise (“service reached steady state” repeated endlessly), which can overwhelm the model’s context window. A purpose-built tool that filters irrelevant events and surfaces only actionable information produces dramatically better results.
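A purpose-built filter of the kind Andrey describes could look like the sketch below. It assumes each event is a dict with a `message` key, modelled on the service events returned by the ECS API, and that steady-state notices can be recognized by the phrase "reached a steady state". Both the pattern and the function name `summarize_events` are assumptions made for illustration, and this is not the hosts' actual tool.

```python
import re
from typing import Dict, List

# Assumed noise pattern; the optional "a " covers both common phrasings.
NOISE_PATTERNS = (re.compile(r"reached (a )?steady state"),)


def summarize_events(events: List[Dict], limit: int = 20) -> Dict:
    """Drop routine events and keep at most `limit` actionable ones, in input order."""
    kept = []
    dropped = 0
    for event in events:
        message = event.get("message", "")
        if any(pattern.search(message) for pattern in NOISE_PATTERNS):
            dropped += 1
        else:
            kept.append(event)
    # Report how much was filtered so the model knows the list is not exhaustive.
    return {"dropped": dropped, "events": kept[:limit]}


# Hand-written sample events, not real service output.
sample_events = [
    {"id": "1", "message": "(service web) has reached a steady state."},
    {"id": "2", "message": "(service web) was unable to place a task."},
    {"id": "3", "message": "(service web) has reached a steady state."},
]
summary = summarize_events(sample_events)
print(summary["dropped"], [event["id"] for event in summary["events"]])
```

Returning the dropped count alongside the surviving events keeps the context small while still telling the model that routine events existed and were filtered out.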

This connects directly to the context management principles from episode #2: better tools mean less noise in the context window, which means better model reasoning. B.O.R.I.S, the hosts’ own agent, ships with specialized AWS tools that encode domain knowledge into the tool layer itself: they do not just run generic CLI commands but understand which output matters and which is noise.

Fernando summarizes: the raw model plus a generic CLI is not enough for production DevOps work. The harness — both the tools it provides and the feedback loops it enforces — is what bridges the gap between a “fancy chatbot” and a reliable operational teammate.

Resources

Harness Engineering for Coding Agent Users — Guest article on Martin Fowler’s blog defining the emerging discipline of building feed-forward guides and feedback sensors around coding agents.
NVIDIA OpenShell — Open-source, policy-driven sandbox runtime for AI agents, enforcing network, filesystem, process, and inference constraints via declarative YAML policies.
NVIDIA NemoClaw — Open-source stack adding privacy and security controls to agents by combining OpenShell sandboxing with managed inference and policy-based guardrails.
Cursor 3.0 — Major interface overhaul introducing an agent-first “Agents Window” for managing parallel agents across repositories and environments.
OpenAI Codex CLI — Open-source, Rust-based terminal coding agent supporting code editing, execution, web search, and sub-agent parallelization.
Kiro — AWS’s spec-driven agentic coding tool (IDE and CLI) with steering rules, correctness checks, and MCP support.
Claude Code Source Disclosure Coverage (VentureBeat) — Reporting on the accidental npm source map inclusion that exposed Claude Code’s TypeScript source across 1,906 files.

