#6 — The Big AI Squeeze - B.O.R.I.S - Your AI DevOps Teammate

LLM subsidies are drying up, subscription limits are tightening, and the astronomical data center CapEx has to be paid by someone — spoiler: it is you. In this episode, Fernando, Andrey, and Vladimir tackle what they call "the big squeeze": the two-sided pressure of rising AI costs and emerging local-inference technologies that could reshape how teams budget for and deploy AI. Along the way, they debate whether buying a Mac Mini is a rational investment or just hype, reveal the mental gymnastics required to get meaningful work out of a twenty-dollar subscription, and argue that if your AI vendor has no pricing page, that tells you everything you need to know about their business model.


Key Topics

The Vercel–Context AI Supply Chain Breach

The episode opens with the Vercel security incident, where a compromised credential at Context.ai — a third-party AI tool used by a Vercel employee — led to unauthorized access to Vercel’s internal systems. Vladimir breaks down the chain of assumptions that made the breach possible: the Vercel engineer assumed Context AI was handling security properly, the Google Workspace administrator assumed the platform was secure by default, and Context AI did not even know a company of Vercel’s size was connected to their legacy product. Every party assumed someone else was doing the hard security work. Vladimir warns this pattern will repeat as AI-accelerated development outpaces security diligence — “all the tools look beautiful and nice, but maybe not everywhere perfect.”

Vladimir notes the deeper concern around environment variables in the Vercel incident. Customers may have stored secrets in regular, non-sensitive environment variables rather than marking them as sensitive — and that data could have been exposed. Vercel says environment variables marked as sensitive were not readable, and it has no evidence those values were accessed, but anything stored in non-sensitive environment variables was at risk. With non-technical users now deploying production apps to Vercel, the chances of finding misconfigured environments are “super high.”
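The misconfiguration described above, secrets sitting in variables that are not marked sensitive, is the kind of thing a short script can flag. The following is a minimal sketch, not Vercel's API: it assumes you have already assembled a list of variables with a hypothetical "name" and "sensitive" field for each, and the name patterns are an illustrative, non-exhaustive guess at what a secret looks like.

```python
import re

# Name fragments that usually indicate a secret (illustrative, not exhaustive).
SECRET_HINTS = re.compile(
    r"(SECRET|TOKEN|PASSWORD|PASSWD|API_?KEY|PRIVATE_?KEY|CREDENTIAL)",
    re.IGNORECASE,
)


def find_unprotected_secrets(env_vars):
    """Return names that look like secrets but are not marked sensitive.

    `env_vars` is a hypothetical list of dicts with "name" and "sensitive"
    keys, for example built by hand from a project's settings.
    """
    return [
        var["name"]
        for var in env_vars
        if SECRET_HINTS.search(var["name"]) and not var.get("sensitive", False)
    ]


if __name__ == "__main__":
    project_env = [
        {"name": "DATABASE_PASSWORD", "sensitive": False},
        {"name": "STRIPE_API_KEY", "sensitive": True},
        {"name": "NEXT_PUBLIC_SITE_URL", "sensitive": False},
    ]
    print(find_unprotected_secrets(project_env))
```

A name-based check like this only catches obvious cases. A secret stored under an innocuous name would still slip through, so it complements a manual review rather than replacing one.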

The hosts also address the Lovable security scare, which they describe as an access-control and product-security failure rather than a traditional breach. Project data — including source code and credentials — was exposed to anyone on the platform, with documentation failing to convey the implications of the “public” setting. Lovable had disabled the feature for enterprise accounts almost a year earlier, but the broader exposure caught users by surprise. Fernando criticizes Lovable’s response of blaming users, noting they essentially told people “you clicked the public button, so what do you want from us.” Andrey adds that everyone is a builder now, and people approve permissions and load data into their environments without stopping to think — a pattern that makes this kind of exposure increasingly likely.

Gemma 4 and the TurboQuant Breakthrough

Fernando highlights Google’s Gemma 4 release as more significant than it first appears. Most open-source models hover around Haiku-level intelligence — useful but limited. Gemma 4’s 31-billion-parameter variant ranks among the top open-source models on LM Arena human-preference benchmarks, competitive with some proprietary mid-tier frontier models — putting meaningfully capable intelligence within reach of local hardware.

The smaller Gemma 4 variants are sized to run on consumer and edge devices, including laptops: their lower parameter counts and weight quantization reduce memory requirements compared to larger open-source models. This is distinct from TurboQuant, a separate Google Research vector quantization technique that Fernando dives into during the episode. TurboQuant targets the key-value cache rather than model weights, compressing it to roughly three to three-and-a-half bits per channel with near-zero quality loss, though the trade-offs depend on benchmark and configuration. Fernando reads from the research paper to explain: vectors are the fundamental way AI models process information, and TurboQuant applies classical data compression to reduce the size of high-dimensional key-value vectors, enabling faster similarity lookups and dramatically lowering KV-cache memory costs. The practical effect is that models can handle longer context windows and higher concurrency on the same hardware. Vladimir notes the comparison to HBO’s Silicon Valley is hard to avoid when Google names its compression technology “TurboQuant.”
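To make the memory argument concrete, here is a rough back-of-the-envelope sketch of KV-cache size at different bit widths. It is not TurboQuant itself, only the storage arithmetic. The model shape (layers, heads, head dimension, context length) is hypothetical, "bits per channel" is treated as bits per stored value, and any per-block metadata a real quantizer needs is ignored.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, context_tokens, bits_per_value):
    """Approximate KV-cache size in bytes for one sequence.

    Two tensors (keys and values) are stored per layer, each holding
    kv_heads * head_dim values per token.
    """
    values = 2 * layers * kv_heads * head_dim * context_tokens
    return values * bits_per_value / 8


if __name__ == "__main__":
    # Hypothetical model shape, chosen only to make the arithmetic concrete.
    shape = dict(layers=48, kv_heads=8, head_dim=128, context_tokens=128_000)
    full = kv_cache_bytes(bits_per_value=16, **shape)
    compressed = kv_cache_bytes(bits_per_value=3.5, **shape)
    print(f"16-bit cache:  {full / 1e9:.1f} GB")
    print(f"3.5-bit cache: {compressed / 1e9:.1f} GB")
    print(f"reduction:     {full / compressed:.1f}x")
```

Whatever the model shape, the reduction relative to a 16-bit cache is simply 16 divided by the compressed bit width, which is why a few bits per value translates directly into longer context or more concurrent sessions on the same memory.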

Together, these developments point toward a step change in local inference from different angles: smaller, capable model variants make local deployment feasible, while KV-cache compression like TurboQuant makes longer context and sustained use practical once you get there. Fernando expects labs are now racing to implement similar compression techniques. The practical implication for DevOps is substantial. As discussed in episode #5, agentic work involves massive amounts of tool calling, with each call constituting a turn. Smaller, faster models excel at this pattern. Fernando proposes the hybrid approach: run a local model for all the context gathering and tool calling, then bring in a hosted frontier model for the actual reasoning. This could meaningfully lower API bills while keeping output quality high.
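A quick way to sanity-check the hybrid idea is to estimate how much of the hosted bill disappears when tool-calling tokens move to a local model. The sketch below is illustrative only: the token volume, the share of tokens spent on tool calls, and the hosted price are all made-up inputs, and local inference is treated as free at the margin, which ignores hardware and power.

```python
def api_cost(tokens, price_per_million):
    """Cost in dollars for `tokens` at a flat price per million tokens."""
    return tokens / 1_000_000 * price_per_million


def hybrid_costs(total_tokens, tool_call_share, hosted_price_per_million):
    """Return (all_hosted_cost, hybrid_cost) in dollars.

    In the hybrid case, the share of tokens spent on tool calling and
    context gathering runs on a local model and is not billed.
    """
    all_hosted = api_cost(total_tokens, hosted_price_per_million)
    hosted_tokens = total_tokens * (1 - tool_call_share)
    hybrid = api_cost(hosted_tokens, hosted_price_per_million)
    return all_hosted, hybrid


if __name__ == "__main__":
    # Hypothetical month: 10M tokens, 75% of them tool calls, $10 per million.
    all_hosted, hybrid = hybrid_costs(10_000_000, 0.75, 10.0)
    print(f"all hosted: ${all_hosted:.2f}, hybrid: ${hybrid:.2f}")
```

The saving scales linearly with the tool-call share, so the approach only pays off for workloads where context gathering really does dominate token spend.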

The hosts also note Kimi K2.6 from Moonshot AI, which claims Opus-level performance but requires significantly beefier hardware. If TurboQuant-style compression is applied to models like Kimi, local inference options could expand dramatically.

Should You Be Buying a Mac Mini?

The Mac Mini buying frenzy is real. Vladimir reports that higher-tier Mac Studios and Mac Minis with 64GB+ RAM are out of stock in the US, with supply shortest on configurations offering the best memory-to-price ratio. The trend traces back to OpenClaw, the open-source personal AI assistant project, where a screenshot of the creator running the project from a hotel room Mac Mini went viral and spawned a wave of YouTube “this is why you should buy a Mac Mini” videos.

Vladimir explains the technical appeal: Apple’s M4 family chips offer exceptional memory bandwidth, which largely determines how quickly a model can generate tokens, not just raw memory capacity and CPU power. This makes them uniquely suited for local inference compared to similarly priced alternatives.
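The link between memory bandwidth and token speed can be sketched with a common rule of thumb: when generation is memory-bound, every weight has to be read once per token, so throughput is capped at bandwidth divided by model size. This is a simplified upper bound for a dense model under that assumption. The bandwidth figure and the 4-bit weight size below are illustrative inputs, not measured specifications, and real throughput will be lower once KV-cache reads and compute overhead are included.

```python
def max_tokens_per_second(bandwidth_gb_s, params_billion, bits_per_weight):
    """Upper bound on decode speed when generation is memory-bound.

    Each generated token requires streaming every weight from memory once,
    so throughput cannot exceed bandwidth divided by model size.
    """
    model_gb = params_billion * bits_per_weight / 8
    return bandwidth_gb_s / model_gb


if __name__ == "__main__":
    # Illustrative inputs: a 31B-parameter dense model at 4 bits per weight
    # on a machine with 250 GB/s of memory bandwidth.
    bound = max_tokens_per_second(250, 31, 4)
    print(f"at most about {bound:.0f} tokens per second")
```

Doubling the bandwidth doubles the bound, while adding CPU cores does not change it, which is the point Vladimir is making about why bandwidth matters more than raw compute here.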

But the hosts push back on the hype. Andrey shares that when Sirob Technologies evaluated running coding agents on a Mac Mini, the tokens-per-second throughput and intelligence level made the cost amortization “not make sense” even for the beefiest Mac Studio configuration. Fernando tried running models on an Intel MacBook and got roughly five tokens per second: entertainingly slow to read with his wife, but useless for real work. And as Andrey notes, that is just the beginning: as you load more context for the model to be useful, it gets even slower.
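The amortization question can be framed as a simple break-even calculation. The sketch below is not the evaluation Sirob Technologies ran. It is a generic formula with hypothetical prices, and it deliberately leaves out the two factors the hosts flag as decisive: lower throughput and lower model intelligence on local hardware.

```python
def breakeven_months(hardware_cost, monthly_power_cost, monthly_api_cost_avoided):
    """Months until local hardware pays for itself, or None if it never does."""
    monthly_saving = monthly_api_cost_avoided - monthly_power_cost
    if monthly_saving <= 0:
        return None
    return hardware_cost / monthly_saving


if __name__ == "__main__":
    # Hypothetical figures: a $4,000 machine, $20 per month in power,
    # replacing $220 per month of API spend.
    months = breakeven_months(4000, 20, 220)
    print(f"break-even after {months:.0f} months")
```

If the local model can only replace part of the hosted workload, the avoided API cost shrinks and the break-even point moves out accordingly, or disappears entirely.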

Vladimir offers a sharp counterpoint to the whole trend: many people buying Mac Minis for “local AI” are actually just running twenty-four-seven assistants that call OpenAI or Anthropic APIs, doing zero local processing. For those use cases, you are paying the Apple premium for macOS and its ecosystem when cheaper, lower-power hardware would suffice. “Start from the solar panels instead,” he quips, noting the environmental cost of everyone running always-on hardware at home.

The consensus: for personal assistants and embedding tasks, local hardware can make sense. For coding agents, the math does not work yet. For anyone consuming only public APIs, there are more affordable and energy-efficient options — including purpose-built AI hardware like NVIDIA Jetson and upcoming ARM-based chips designed specifically for inference workloads.

The Big Squeeze: Subscriptions, Pricing Gymnastics, and the End of Free

Andrey frames “the big squeeze” as pressure from both sides. On one side, hyperscalers are going cash-negative building data centers — Oracle is borrowing money, and someone has to pay for all that CapEx. On the other side, new compression technologies like TurboQuant are making local inference viable. The middle ground is where users are getting crushed.

Two years ago, a twenty-dollar AI subscription meant unlimited use. Now, as the hosts experienced firsthand in episode #3 with Cursor’s pricing shock, that same twenty dollars “turns into a pumpkin,” running out after roughly fifteen minutes of serious work. Vladimir describes the absurd workarounds people have adopted: using ChatGPT to plan how to use their Claude subscription, then copy-pasting the plan into Claude. Fernando describes his own multi-tool gymnastics: using Kiro credits to build a plan, feeding it to Cursor on a free tier, then reviewing with Claude Code if his usage window is available.

The five-hour usage windows have become common knowledge, with people setting alarms for 6 AM to align their automation with the reset. Annual plan subscribers face an additional sting: terms change mid-year, and suddenly what you budgeted for no longer matches what you get.
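The appeal of a 6 AM start is easier to see when the reset times are laid out. This is a small sketch under a stated assumption: a window opens at first use, lasts five hours, and the next one begins immediately after. Actual vendor behavior may differ.

```python
from datetime import datetime, timedelta


def window_resets(first_use, window_hours=5, count=3):
    """Return the next `count` reset times, assuming back-to-back windows."""
    return [
        first_use + timedelta(hours=window_hours * i)
        for i in range(1, count + 1)
    ]


if __name__ == "__main__":
    start = datetime(2026, 1, 5, 6, 0)  # hypothetical 6 AM first request
    for reset in window_resets(start):
        print(reset.strftime("%H:%M"))
```

Under this assumption, a 6 AM first request places resets at 11:00, 16:00, and 21:00, so fresh windows open late morning and mid-afternoon.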

Fernando connects this to the Opus 4.7 pricing discussed in episode #5: while the per-token price stays the same, the new tokenizer maps input to more tokens, and higher-effort runs produce more output. The public messaging says “same price, more capable,” but the API bill tells a different story. This kind of pricing gymnastics is becoming the norm.
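The gap between "same price" and a higher bill is straightforward arithmetic. In the sketch below every number is hypothetical, including the per-million prices, the tokenizer inflation factor, and the output growth factor. It only illustrates how an unchanged per-token price can still produce a larger invoice.

```python
def bill(input_tokens, output_tokens, input_price, output_price):
    """Dollar cost at per-million-token prices."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


def bill_after_change(input_tokens, output_tokens, input_price, output_price,
                      tokenizer_inflation, output_growth):
    """Same prices, but the same work now maps to more tokens."""
    return bill(
        input_tokens * tokenizer_inflation,
        output_tokens * output_growth,
        input_price,
        output_price,
    )


if __name__ == "__main__":
    # Hypothetical workload: 1M input tokens and 200k output tokens,
    # priced at $5 and $25 per million.
    before = bill(1_000_000, 200_000, 5.0, 25.0)
    # Assume the new tokenizer yields 25% more input tokens and
    # higher-effort runs produce 50% more output.
    after = bill_after_change(1_000_000, 200_000, 5.0, 25.0, 1.25, 1.5)
    print(f"before: ${before:.2f}, after: ${after:.2f}")
```

With these made-up inputs the bill rises from $10.00 to $13.75 even though neither per-token price moved.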

Building a Sustainable AI Business Model

The squeeze is not just a consumer problem — it is an existential question for AI-enabled businesses. Andrey explains the challenge Sirob Technologies faces with B.O.R.I.S: when your product mixes traditional software with LLM capabilities, token spend becomes a variable cost that directly affects pricing. Venture-backed startups can subsidize usage to attract customers, but that approach mirrors exactly what is happening with consumer subscriptions — it works until it does not.

Andrey argues for framing B.O.R.I.S-style AI costs by runtime or hours rather than exposing tokenomics — “this will cost you a thousand tokens” means nothing to someone who just has features to build. Price-per-hour is a concept familiar in DevOps consulting and contractor work, and Andrey’s belief is that this kind of framing will become more common. The optimization of which models to use and how to minimize token spend becomes the company’s problem to solve, not the customer’s.
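One way to picture the per-hour framing is as a conversion from expected token burn to an hourly rate, so the customer never sees tokens at all. The following is a hypothetical sketch, not the pricing model of B.O.R.I.S or Sirob Technologies: the token volume, blended price, fixed cost, and margin are all invented inputs.

```python
def hourly_price(tokens_per_hour, blended_price_per_million,
                 fixed_cost_per_hour, margin):
    """Hourly price that covers token and fixed costs at a target gross margin.

    `margin` is a fraction of the final price, for example 0.5 for 50%.
    """
    if not 0 <= margin < 1:
        raise ValueError("margin must be in [0, 1)")
    token_cost = tokens_per_hour / 1_000_000 * blended_price_per_million
    return (token_cost + fixed_cost_per_hour) / (1 - margin)


if __name__ == "__main__":
    # Hypothetical: 2M tokens per agent-hour at $10 per million,
    # $5 per hour of infrastructure, 50% target margin.
    print(f"${hourly_price(2_000_000, 10.0, 5.0, 0.5):.2f} per hour")
```

Under this framing, any reduction in tokens per hour from better model selection improves the vendor's margin without changing what the customer pays, which is why the optimization becomes the company's problem.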

Fernando warns that many tools in the emerging AI SRE category have no pricing page at all, which “tells you something about the business.” Vladimir adds that in practice, people check only two things on a vendor’s website: the pricing page and the name of the company behind the product, for due diligence. If pricing is missing or obscured behind credit systems that require a spreadsheet to decode (“nine thousand credits, and this action costs this many credits”), it signals an unsustainable or deliberately opaque business model.

The hosts observe that as costs rise, the economics start looking remarkably similar to pre-AI software development: a couple thousand dollars per feature, the same build-versus-buy calculations, the same project prioritization debates. The difference is speed and elasticity, but the pricing converges. Andrey’s hope is that this reality check will slow the flood of throwaway open-source projects — “people just running around burning tokens on stuff that no one is using” — and refocus the industry on whether AI work is actually solving real problems or just generating noise.

The Hybrid Future and Vertical Models

The hosts converge on a vision where intelligence is distributed across tiers. Small, fast local models handle tool calling and context gathering — Vladimir estimates the threshold at roughly twenty billion parameters, with Gemma 4 fitting the bill. Larger hosted models handle reasoning and orchestration. This maps directly to the harness engineering concepts from episode #4: the orchestrator agent thinks while sub-agents execute, and routing tools like OpenRouter or local harnesses manage which model handles which task.
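The tiered split described above can be sketched as a very small router. This is an illustration of the idea only, not the API of OpenRouter or any specific harness: the model identifiers and the task kinds are hypothetical names.

```python
from dataclasses import dataclass

LOCAL_MODEL = "local-small"       # hypothetical local model id
HOSTED_MODEL = "hosted-frontier"  # hypothetical hosted model id

# Task kinds the hosts assign to the small, fast local tier.
LOCAL_TASKS = {"tool_call", "context_gathering"}


@dataclass
class Task:
    kind: str
    prompt: str


def route(task):
    """Pick a model tier for a task based on its kind."""
    return LOCAL_MODEL if task.kind in LOCAL_TASKS else HOSTED_MODEL


if __name__ == "__main__":
    tasks = [
        Task("tool_call", "list pods in namespace prod"),
        Task("context_gathering", "summarize the last 50 log lines"),
        Task("reasoning", "propose a root cause for the outage"),
    ]
    for task in tasks:
        print(f"{task.kind} -> {route(task)}")
```

Defaulting unknown task kinds to the hosted tier is a conservative design choice: a misrouted task costs more money rather than producing a weaker answer.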

Fernando sees a broader trend emerging: instead of massive general-purpose frontier models doing everything, the industry is moving toward vertical specialization — models optimized for coding, documentation, tool calling, and other specific capabilities. The orchestrator selects the right specialist for each sub-task. This is still early, but the economic pressure of the big squeeze is accelerating it.

Anthropic no longer lets standard Claude subscription limits cover third-party harnesses like OpenClaw; users need extra usage billing or API-key billing, while first-party Claude products remain covered. This further fragments the tooling landscape and pushes people toward understanding which models work best for specific tasks.

Vladimir grounds the discussion with a practical heuristic: if the task is straightforward enough, skip the model entirely and write a Python script. Eight-billion-parameter models can chat, but they cannot reliably use tools. Models need to clear a minimum intelligence threshold to be useful for agentic work — and below that threshold, deterministic code is both cheaper and more reliable.
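As an example of the kind of task that needs no model at all, consider counting server errors in a log. The sketch below assumes a simplified, hypothetical log format of "METHOD PATH STATUS" per line. A real access log would need a different parser, but the logic stays deterministic and costs no tokens.

```python
from collections import Counter


def count_server_errors(log_lines):
    """Count 5xx responses per path from lines shaped like 'METHOD PATH STATUS'."""
    errors = Counter()
    for line in log_lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip lines that do not match the assumed format
        _method, path, status = parts
        if status.isdigit() and 500 <= int(status) <= 599:
            errors[path] += 1
    return errors


if __name__ == "__main__":
    sample = [
        "GET /api/orders 500",
        "GET /api/orders 200",
        "POST /login 503",
        "malformed line here please",
    ]
    for path, count in count_server_errors(sample).items():
        print(f"{path}: {count}")
```

The same input always yields the same counts, which is the reliability argument for reaching for a script before a small model.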

The episode closes with a question left for the audience: the industry is building at an unprecedented pace, but is it building things people actually use? Is it solving real problems or just creating noise? Next week, the series returns to its fundamentals track with an episode on memory — how agents remember, forget, and manage context across sessions.

Resources

Vercel April 2026 Security Incident — Official Bulletin — Vercel’s official disclosure of the supply chain breach originating from a compromised Context AI credential.
Lovable Security Crisis: 48 Days of Exposed Projects (The Next Web) — Timeline of the Lovable access-control failure and the structural security problems in vibe-coding platforms.
Gemma 4 Announcement (Google Blog) — Google’s official release of the Gemma 4 open model family featuring efficient multimodal architecture.
TurboQuant KV-Cache Compression (InfoQ) — Technical coverage of TurboQuant’s KV-cache compression achieving approximately 6x memory reduction with near-zero quality loss.
Kimi K2.6 Release (MarkTechPost) — Moonshot AI’s 1T-parameter model with 32B active parameters, claiming competitive performance with frontier hosted models.
LM Arena Leaderboard — Crowdsourced Elo-rating leaderboard for comparing LLM performance across categories, formerly known as LMSYS Chatbot Arena.
OpenRouter — Unified API gateway for routing requests across 300+ AI models from 60+ providers with automatic failover and cost optimization.
OpenClaw (GitHub) — The open-source personal AI assistant project that sparked the Mac Mini buying frenzy, running locally on Mac, Linux, and Raspberry Pi.

