What happens when your AI coding assistant forgets what it was just working on? Andrey and Fernando dive into the mechanics of context windows, reveal how MCP servers can silently eat 20-30% of your session before you even type a message, and explain why treating your AI agent like a micromanaged junior developer actually makes it perform worse.

Key Topics

Agentic AI News Roundup

The hosts open with several developments from the past week. First up is Anthropic’s addition of Dispatch to Cowork, which lets users assign tasks from the Claude mobile app to a desktop-backed Claude session — start a review on your commute, and have results waiting at your desk. Anthropic also launched Claude Code Channels, which connect a running Claude Code session to Telegram or Discord, letting developers interact with their agent from their phone. Fernando notes the convenience but flags the security trade-off: “As soon as you have on your phone a chat interface that can actually act on your machine… if someone gets a hold of that, your password doesn’t matter.”

Andrey adds that Anthropic clarified and began enforcing its existing policy against using Claude consumer subscriptions in third-party harnesses like OpenClaw — a tool that let users route their flat-rate Claude Max subscription through open-source agent harnesses, sometimes racking up massive token bills. Users can no longer rely on their included subscription limits in third-party tools; they now need either discounted usage bundles or API-key billing. Interestingly, the hosts note that OpenAI has been slower to enforce similar restrictions, potentially as a competitive play to attract more users.

Cursor Composer 2 and the Kimi Controversy

The conversation turns to Cursor’s release of Composer 2, their proprietary coding model. Fernando tested it as a daily driver for two days and reports it is fast and cheap compared to Opus. However, the release was clouded by a transparency issue: users quickly discovered that Composer 2 is built on top of Moonshot AI’s open-source Kimi K2.5 model after spotting model identifiers in API responses. Cursor later confirmed the Kimi base, noting they had added significant reinforcement learning on top, and Moonshot AI endorsed the partnership. Andrey credits Cursor’s narrow focus on coding data as a strategic advantage but cautions viewers to take their self-published benchmarks “with a grain of salt.” Fernando also tried running Kimi on AWS Bedrock for B.O.R.I.S but hit compatibility issues where the Strands framework confused tool calls with the end of the agentic loop.

What Is the Context Window and Why It Matters

The core of the episode, building on concepts introduced in episode #1, explains the context window — the amount of information an LLM can hold in its “attention” at any given time. Andrey breaks it down: a 200,000-token context window translates to roughly 150,000 words. Once you exceed that limit, the model starts losing information, either by dropping content from the beginning or the middle of the conversation, or by compacting (summarizing) earlier content.

Fernando highlights a practical observation: quality degrades well before the limit. “If you start reaching above sixty percent of the context window, you start seeing the answers start to not be as sharp anymore.” This aligns with research showing that effective capacity is typically 60-70% of the advertised maximum.
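The numbers the hosts cite can be captured in a small sketch. This is an illustrative heuristic, not anything from the episode’s tooling: `estimate_tokens` uses the common rough rule of ~4 characters per token, and `context_health` flags the ~60% soft limit Fernando describes.

```python
# Minimal sketch of a context-budget check. The 4-chars-per-token
# estimate and the 60% soft limit are heuristics drawn from the
# discussion, not measured values.

def estimate_tokens(text: str) -> int:
    """Crude token estimate: roughly 4 characters per token for English prose."""
    return max(1, len(text) // 4)

def context_health(used_tokens: int, window: int = 200_000,
                   soft_limit: float = 0.60) -> str:
    """Classify a session against the effective-capacity threshold."""
    ratio = used_tokens / window
    if ratio >= 1.0:
        return "overflow"    # model starts dropping or compacting content
    if ratio >= soft_limit:
        return "degrading"   # answers "start to not be as sharp anymore"
    return "healthy"
```

On a 200k window, a session at 130k tokens (65%) would already be flagged as degrading, matching Fernando’s observation that quality falls off well before the hard limit.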

The hosts note that frontier labs are racing to expand context windows — Gemini offers models with one million tokens, and Anthropic is testing Sonnet and Opus with similar capacities — but bigger windows come with higher inference costs and no guarantee that quality will hold.

What Fills Up the Context Window

The hosts walk through everything competing for space in your context window before you even start working:

  1. System prompt — The instructions that define the agent’s identity and behavior. As Fernando explains, “This is where Cursor becomes Cursor” — the system prompt is what differentiates one coding tool from another, even when they use the same underlying model.

  2. Tool descriptions — Every tool the agent can use, whether loaded from MCP servers or defined locally, includes a description of when and how to use it. Each MCP server can consume around 2,000 tokens. Fernando warns: “Before you even send the first message, twenty, thirty percent of your session is already gone.” This matches industry analysis showing MCP tool schemas can burn through significant context before any real work begins.

  3. Configuration files — Files like CLAUDE.md, agents.md, and similar project-level instructions all get loaded into context. Andrey references research suggesting that overly restrictive rules in these files actually degrade model performance: “You don’t want to micromanage a senior software engineer.”

  4. File reads and tool outputs — As the agent explores your codebase (grepping, reading files, running commands), all outputs flow into the context window. Fernando notes seeing 60,000-70,000 tokens consumed just in the exploration phase on large projects.
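The pre-loaded costs above can be added up in a quick back-of-the-envelope audit. The figures in the example are illustrative, based on the hosts’ estimates (~2,000 tokens per MCP server), not measurements of any specific tool:

```python
# Hypothetical audit of how much of the window is consumed before the
# first user message. Token figures are illustrative estimates from the
# episode, not measured values.

def preload_budget(system_prompt_tokens: int, mcp_servers: int,
                   config_file_tokens: int, window: int = 200_000) -> float:
    """Fraction of the context window consumed before any user input."""
    preloaded = (system_prompt_tokens
                 + mcp_servers * 2_000      # ~2k tokens per MCP server
                 + config_file_tokens)      # CLAUDE.md, agents.md, etc.
    return preloaded / window

# A 10k-token system prompt, 20 MCP servers, and 5k tokens of config
# already consume 27.5% of a 200k window — in line with Fernando's
# "twenty, thirty percent of your session is already gone."
```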

MCP Servers vs. CLI Tools

A key practical takeaway: MCP servers are not always the best approach. Andrey argues that for many use cases, simply having the agent run CLI tools (like the GitHub CLI instead of a GitHub MCP server) achieves the same result while consuming far less context. The CLI approach also uses the same interface a human developer would, making it more straightforward and scriptable. Fernando agrees, framing it as a shift toward using “the LLM as orchestrator” — calling deterministic scripts rather than loading heavy tool descriptions for the model to reason over.
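The CLI-as-orchestrator pattern can be sketched as follows. `gh pr list --json` is a real GitHub CLI invocation; the surrounding helper functions are hypothetical illustration, not any tool’s actual implementation. The point is that the agent pays only for the command output, not for a pre-loaded tool schema:

```python
# Minimal sketch of the "LLM as orchestrator" pattern: instead of loading
# a GitHub MCP server's tool descriptions into context, the agent shells
# out to the GitHub CLI and sees only the compact command output.
import json
import subprocess

def gh_pr_list_cmd(repo: str) -> list[str]:
    """Build the gh CLI invocation for listing open PRs as JSON."""
    return ["gh", "pr", "list", "--repo", repo, "--json", "number,title"]

def list_open_prs(repo: str) -> list[dict]:
    """Run the gh CLI (must be installed and authenticated) and parse its output."""
    result = subprocess.run(gh_pr_list_cmd(repo),
                            capture_output=True, text=True, check=True)
    return json.loads(result.stdout)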

Managing Context: Sub-Agents and Plan Mode

The hosts share several techniques for keeping context windows clean and productive:

  • Sub-agents: Delegating tasks to child LLM sessions that burn their own context windows, do the work, and return only the essential results. Andrey describes using the primary session as an “orchestrator” that dispatches work rather than doing everything itself.

  • Plan mode: Running a first agent to collect information and build a plan, then starting a fresh session with just the plan document. Fernando describes his workflow: complete a feature, run /clear to reset context, then start the next feature fresh — “as if you work for a full day, you go to sleep, next day, head is fresh.”

  • Dynamic tool loading: Rather than loading all tools upfront, B.O.R.I.S uses a pattern where the agent starts with only a tool-discovery capability and loads specific tool descriptions on demand. This keeps the initial context lean.

  • Tool output persistence: Instead of returning tool outputs directly into the agent’s context, results are saved externally and recalled only when needed, preventing context bloat from large command outputs.

  • Context hygiene: Fernando recommends developers run /context on a fresh session to audit what is pre-loaded and remove anything unnecessary — a simple exercise that can meaningfully improve output quality.
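The tool-output persistence idea above can be sketched in a few lines. This is an assumed design for illustration, not B.O.R.I.S’s actual implementation: large outputs are written to disk, and only a short preview plus a reference enters the agent’s context.

```python
# Minimal sketch of tool-output persistence (assumed design, not the
# actual B.O.R.I.S implementation): store the full result externally,
# return a lean stub into context, and recall on demand.
import hashlib
import tempfile
from pathlib import Path

def persist_output(output: str, preview_chars: int = 200) -> dict:
    """Save a tool result to disk; return only a preview and a reference."""
    key = hashlib.sha256(output.encode()).hexdigest()[:12]
    path = Path(tempfile.gettempdir()) / f"tool-output-{key}.txt"
    path.write_text(output)
    return {"preview": output[:preview_chars], "ref": str(path)}

def recall_output(ref: str) -> str:
    """Load the full result only when the agent actually needs it."""
    return Path(ref).read_text()
```

A 50,000-token command dump then costs the session only the preview, with the rest retrievable if a later step requires it.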

Indexing and Memory Approaches

The hosts compare how different tools handle the cold-start problem. Cursor runs embeddings of the entire codebase, creating a searchable map so it does not need to grep through every file. Claude Code, by architectural choice, does not build persistent memory — it starts from scratch each session, relying on grep and file reads. Users can bridge this gap with external vector databases or by manually creating markdown summary files at the end of sessions, though Fernando acknowledges “it’s still kind of an ecosystem where people are trying to figure out this pain point.”
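The embeddings approach can be sketched at its core: chunks of the codebase are embedded once into vectors, and queries are answered by cosine similarity rather than grepping every file. This is a generic illustration of the technique as described, not Cursor’s implementation; any real embedding model would supply the vectors.

```python
# Minimal sketch of embedding-based code search (generic technique, not
# Cursor's actual implementation). Vectors would come from a real
# embedding model; here they are supplied precomputed.
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def search(index: dict[str, list[float]], query_vec: list[float],
           k: int = 3) -> list[str]:
    """Return the k indexed chunks whose embeddings best match the query."""
    ranked = sorted(index.items(),
                    key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [path for path, _ in ranked[:k]]
```

A markdown-summary-file workflow trades this infrastructure for simplicity; the vector index trades setup cost for instant, grep-free retrieval across large codebases.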

Resources

Join B.O.R.I.S Slack Playground