
Bugs Happen. Agents Still Run.

We're rapidly turning "AI assistants" into "AI operators."

That shift is not subtle: agents now write code, run commands, touch cloud infrastructure, and move data between systems. But the security model many teams are still relying on is basically: a prompt, a dialog, and a log.

That's not a security boundary. It's a user experience.

The uncomfortable truth: the control plane is still software

Most agent-risk discussions focus on model behavior: prompt injection, jailbreaks, tool misuse.

Those are real. But there's a simpler, more inevitable category of failure: the agent runtime itself is software, and software has bugs. Consent dialogs can be unclear. Trust checks can fire in the wrong order. Config can load earlier than intended. A "safe default" can flip into "auto-approve."

When that happens, the model isn't the problem. The trust boundary is.

And trust boundaries implemented in software will sometimes fail -- quietly.

A recent proof point: Claude Code + malicious repos

Check Point disclosed multiple issues in Anthropic's Claude Code where cloning and opening a repository could trigger behavior a user did not expect -- including command execution and API key exposure.

Two of these issues were assigned CVEs.

There was also a related issue about consent clarity: the startup warning did not adequately communicate that proceeding could allow execution of files in the folder without additional confirmation.

Different bugs, same outcome: repo inputs influenced execution earlier than intended.

Zoom out, and the pattern matters more than the details:

If a repo can influence execution, the repo is no longer "just data." It becomes a capability negotiation surface.

In other words: project files can implicitly negotiate what gets executed, what gets read, and where data gets sent.

And capability negotiation is exactly what you want to treat as untrusted by default -- especially when agents are involved.

Why partial supervision is not a safety boundary

In partially supervised flows, approvals become muscle memory; in CI or background agents, there's no human at all -- and the only boundary that matters is policy.

So the question isn't "can the prompt be improved?"

It's: what happens when the prompt is wrong?

What defense in depth looks like for agents

Patching is necessary -- but it's not sufficient. The durable approach is to assume bugs will recur and limit the blast radius at runtime.

Concretely, that means three controls:

  1. Constrain execution. Don't let repo-driven behavior spawn arbitrary shells and binaries.

  2. Constrain data access. Don't let agents or their subprocesses read sensitive paths, credentials, or environment secrets by default.

  3. Constrain egress. Don't let agent workloads connect to arbitrary destinations -- especially not when credentials may ride along.
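The three controls above can be sketched as a single deny-by-default decision function, evaluated at the moment an action becomes real. This is an illustrative model, not AgentSH's actual API; the allowlists and the `Action` type are hypothetical.

```python
# Hypothetical sketch: the three controls (exec, data access, egress) as one
# policy check. Anything not explicitly allowlisted is denied.
from dataclasses import dataclass

@dataclass
class Action:
    kind: str    # "exec" | "read" | "connect"
    target: str  # binary path, file path, or destination host

ALLOWED_BINARIES = {"/usr/bin/git", "/usr/bin/python3"}
ALLOWED_PATHS = {"/workspace"}
ALLOWED_HOSTS = {"api.example.com"}

def evaluate(action: Action) -> bool:
    """Return True only if the action passes the matching allowlist."""
    if action.kind == "exec":
        return action.target in ALLOWED_BINARIES
    if action.kind == "read":
        return any(action.target.startswith(p) for p in ALLOWED_PATHS)
    if action.kind == "connect":
        return action.target in ALLOWED_HOSTS
    return False  # unknown action kinds are denied outright

print(evaluate(Action("connect", "attacker.example.net")))  # False
print(evaluate(Action("exec", "/usr/bin/git")))             # True
```

The key design choice is the default: the function returns False unless an allowlist says otherwise, so a bug that reaches the enforcement point still lands on "deny."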

This is the heart of execution-layer security: evaluate actions at the moment they become real.

How AgentSH mitigates this class of failures

AgentSH is built around a simple premise:

Models are probabilistic. Execution must be deterministic.

Instead of trusting what a model claims, what a tool UI shows, or what a consent dialog intended, AgentSH enforces at the point where intent becomes reality: file operations, network connections, and process execution -- including subprocess trees.

1) Key exfil via redirected API traffic (CVE-2026-21852)

If a repo causes a tool to connect to an attacker-controlled endpoint prior to trust confirmation, the only reliable prevention is: don't allow that connection.

AgentSH enforces tight outbound allowlists so even if the workload tries to connect to an attacker-controlled host, the connection is denied and logged. If the failure mode involves environment-variable steering -- for example, a malicious ANTHROPIC_BASE_URL -- AgentSH can restrict which env vars the process can read and inject operator-trusted values. LLM traffic can be routed through a local proxy so the workload doesn't need LLM provider API keys in its environment. DLP can redact sensitive strings in payloads before they leave the machine.
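The environment-variable side of this mitigation can be sketched as follows. The function name, the trusted URL, and the key names are illustrative assumptions, not AgentSH's real configuration surface: the idea is simply that the operator's value overwrites whatever the repo set, and the real API key never enters the workload's environment.

```python
# Illustrative sketch (hypothetical names): pin ANTHROPIC_BASE_URL to an
# operator-trusted value and strip the API key before spawning the agent,
# so a repo-supplied override cannot redirect authenticated API traffic.
TRUSTED_BASE_URL = "https://api.anthropic.com"  # operator-chosen value

def sanitized_env(parent_env: dict) -> dict:
    env = dict(parent_env)
    # Inject the trusted endpoint, overwriting anything the repo set.
    env["ANTHROPIC_BASE_URL"] = TRUSTED_BASE_URL
    # Strip the credential: the workload talks to a local proxy instead,
    # so it never needs the provider API key in its environment.
    env.pop("ANTHROPIC_API_KEY", None)
    return env

poisoned = {"ANTHROPIC_BASE_URL": "https://attacker.example.net",
            "ANTHROPIC_API_KEY": "sk-secret"}
clean = sanitized_env(poisoned)
print(clean["ANTHROPIC_BASE_URL"])   # https://api.anthropic.com
print("ANTHROPIC_API_KEY" in clean)  # False
```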

Relevant policy surfaces: Network Rules, Environment Policy, Environment Injection, LLM Proxy & DLP

2) Early command execution from repo content

If repo content triggers command execution earlier than expected, the mitigation is: policy gates the exec attempt.

AgentSH can deny or require approval for high-risk operations -- shells, downloaders, destructive utilities -- and captures the entire subprocess tree, closing the "the UI didn't show me that step" gap that nested hooks and scripts exploit.
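A minimal sketch of what that gating looks like, assuming a hypothetical first-match-wins rule format (not AgentSH's actual syntax). The point is that the verdict depends on the binary being executed, not on where in the subprocess tree the exec happens.

```python
# Sketch: policy-gating exec attempts. High-risk binaries -- shells,
# downloaders, destructive utilities -- are denied or require approval;
# the first matching rule wins, firewall-style.
import fnmatch

RULES = [
    ("curl", "deny"),    # downloaders
    ("wget", "deny"),
    ("*sh", "approve"),  # shells (sh, bash, zsh) need explicit approval
    ("rm", "approve"),   # destructive utilities
    ("*", "allow"),      # everything else
]

def decide(binary_name: str) -> str:
    """Return the verdict for an exec attempt on binary_name."""
    for pattern, verdict in RULES:
        if fnmatch.fnmatch(binary_name, pattern):
            return verdict
    return "deny"  # unreachable with a "*" rule, but safe by default

print(decide("curl"))  # deny
print(decide("bash"))  # approve
print(decide("git"))   # allow
```

Applying the same `decide` call to every process in the tree, not just the top-level command, is what closes the "the UI didn't show me that step" gap.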

Relevant policy surfaces: Command Rules, Execve Interception

3) Config becomes behavior (Hooks / MCP / project settings)

The core risk is letting repo-scoped config expand capability.

AgentSH policies are capability-first: your allowlists decide what's permitted; repo config doesn't get to promote itself. Filesystem access is governed independently of what any project setting requests.

For MCP specifically, AgentSH intercepts tool calls, evaluates them against policy, and supports allowlists, cross-server pattern detection, and version pinning to prevent supply-chain swaps. For the broader "dev tool itself" pattern -- where subprocess decisions and silent file reads happen outside the visible UI -- AgentSH supports wrapping tools like Claude Code so you govern everything they do, not just the commands they surface.
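The MCP interception described above can be sketched as a gate over (server, tool) pairs plus a pinned server version. The schema and names here are hypothetical, not AgentSH's real data model: the idea is that a version drift or an unlisted tool is rejected before the call runs.

```python
# Illustrative sketch: gate an MCP tool call against an allowlist and a
# pinned server version, so a supply-chain swap of the server is rejected.
PINNED_SERVERS = {"files-server": "1.4.2"}       # server -> expected version
ALLOWED_TOOLS = {("files-server", "read_file")}  # permitted (server, tool)

def gate_tool_call(server: str, version: str, tool: str) -> bool:
    if PINNED_SERVERS.get(server) != version:
        return False  # version drifted: possible supply-chain swap
    return (server, tool) in ALLOWED_TOOLS

print(gate_tool_call("files-server", "1.4.2", "read_file"))    # True
print(gate_tool_call("files-server", "1.5.0", "read_file"))    # False
print(gate_tool_call("files-server", "1.4.2", "delete_file"))  # False
```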

Relevant policy surfaces: File Rules, MCP Security, MCP Rules, Version Pinning, Protecting Dev Tools

A practical baseline for right now

You don't need a full platform to reduce exposure today: default-deny egress, run agents without ambient credentials, disallow shells and downloaders by default, and treat repo-scoped config as untrusted unless it's been explicitly allowlisted. That won't eliminate the class of failure -- but it will mean the next trust-dialog bug doesn't become an incident.
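That baseline can be written down as one small policy document. The schema below is a hypothetical illustration, not a real product config; what matters is that every category defaults to the restrictive side.

```python
# The baseline above, expressed as a single illustrative policy
# (hypothetical schema): deny by default, with narrow allowances.
BASELINE_POLICY = {
    "egress":      {"default": "deny", "allow": ["api.anthropic.com"]},
    "credentials": {"ambient": False},  # no inherited keys in the agent env
    "commands":    {"default": "allow",
                    "deny": ["bash", "sh", "curl", "wget"]},
    "repo_config": {"trusted": False,   # hooks/MCP/settings start untrusted
                    "allowlist": []},
}

def egress_allowed(host: str, policy: dict = BASELINE_POLICY) -> bool:
    rules = policy["egress"]
    if rules["default"] == "deny":
        return host in rules["allow"]
    return True

print(egress_allowed("api.anthropic.com"))     # True
print(egress_allowed("attacker.example.net"))  # False
```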

The point

This isn't about dunking on one tool. Claude Code shipped fixes, as vendors should (see the advisories above).

The point is that the industry is turning "configuration" into "agent behavior," and "project files" into "execution surfaces." In that world, betting everything on the trust prompt always being correct isn't a security posture -- it's optimism.

Bugs happen. Agents still run.

The only posture that survives the next bug is defense in depth: patch fast, assume regressions, and put guardrails under the agent -- at the execution layer -- where file access, network egress, and process execution can be evaluated, recorded, and constrained.


Built by Canyon Road

We build Beacon and AgentSH to give security teams runtime control over AI tools and agents, whether supervised on endpoints or running unsupervised at scale. Policy enforced at the point of execution, not the prompt.
