Anatomy of a tripwire: how we detect compromised agents with zero false positives

Tripwires are decoy tools installed alongside real MCP servers. Honest agents never call them, so every trigger is signal. Here's how the design holds up.

The Decoy Team · Engineering

The first question we get about tripwires is always the same: "Won't they light up constantly?" The answer, by design, is no, and the reason is the core of the design.

The setup

A tripwire is a normal-looking MCP tool. It advertises a plausible name, a plausible description, and a plausible schema. For example:

  • get_production_credentials(service: string)
  • read_env_secrets()
  • export_database(format: "sql" | "csv")

These are installed as part of decoy-tripwire alongside your real MCP servers. Agents discover them the same way they discover any other tool: through the standard MCP tool listing (tools/list) that follows the protocol handshake.
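As a rough illustration, the three decoys above could be described with the name, description, and inputSchema fields that MCP tool listings use. The description strings, the DECOY_TOOLS name, and the list_tools helper are assumptions made for this sketch, not decoy-tripwire's actual definitions.

```python
# Hypothetical decoy descriptors, shaped like entries in an MCP tool listing.
# The description text is invented for illustration.
DECOY_TOOLS = [
    {
        "name": "get_production_credentials",
        "description": "Fetch production credentials for a named service.",
        "inputSchema": {
            "type": "object",
            "properties": {"service": {"type": "string"}},
            "required": ["service"],
        },
    },
    {
        "name": "read_env_secrets",
        "description": "Read secrets from the environment.",
        "inputSchema": {"type": "object", "properties": {}},
    },
    {
        "name": "export_database",
        "description": "Export the database in the requested format.",
        "inputSchema": {
            "type": "object",
            "properties": {"format": {"type": "string", "enum": ["sql", "csv"]}},
            "required": ["format"],
        },
    },
]


def list_tools(real_tools):
    """Return real and decoy tools in one listing (hypothetical helper).

    Decoys are appended to the same list, so an agent sees them exactly
    the way it sees any legitimate tool.
    """
    return list(real_tools) + DECOY_TOOLS
```

Nothing in the listing marks a decoy as special, which is the point: the agent has no way to tell it apart from a real tool.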

The key property

No legitimate task ever has reason to call them.

That sounds obvious, but it's the whole trick. Your coding agent working on a PR doesn't need export_database. Your research agent reading docs doesn't need read_env_secrets. These tools advertise capabilities that only a compromised agent (one following an attacker's payload instead of the user's intent) would want to invoke.

When the invocation happens, we know three things instantly:

  1. The agent executed a capability nobody asked for.
  2. The capability matched a known exfiltration pattern.
  3. The compromise is active right now, not hypothetical.

That's a high-signal alert. Zero false positives isn't a marketing line. It follows from the setup: if no legitimate task ever has reason to call a tripwire, any call is evidence of compromise.
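The detection rule described above needs no scoring or thresholds. A minimal sketch, assuming a hypothetical dispatch function that sits in front of the tool handlers: any call whose name is in the decoy set raises an alert, and everything else passes through. The function names, the on_trigger callback, and the placeholder response are all assumptions for illustration.

```python
from typing import Any, Callable

# Names of the decoy tools from the examples in this post.
DECOY_NAMES = frozenset({
    "get_production_credentials",
    "read_env_secrets",
    "export_database",
})


def dispatch(
    tool_name: str,
    arguments: dict[str, Any],
    real_handlers: dict[str, Callable[..., Any]],
    on_trigger: Callable[[str, dict[str, Any]], None],
) -> Any:
    """Route a tool call; treat any decoy call as an alert (hypothetical)."""
    if tool_name in DECOY_NAMES:
        # No legitimate task calls a decoy, so membership alone is the signal.
        on_trigger(tool_name, arguments)
        # Placeholder response; what a real decoy returns is not specified here.
        return {"error": "temporarily unavailable"}
    return real_handlers[tool_name](**arguments)
```

The rule is a set membership test, so there is nothing to tune and no anomaly model that could drift.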

What we log

Every trigger captures the full trace: client fingerprint, session ID, the exact arguments passed, the surrounding tool call sequence, and (if we can reconstruct it) the prompt that led to the call. That's the forensic record you'll want when you're explaining to your CISO what happened.
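One way to picture that record is as a small structured type serialized to a JSON log line. This is a sketch only: the class name, field names, and the tool_name and triggered_at fields are assumptions, not the actual log schema.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Any, Optional


def _utc_now() -> str:
    """Current UTC time as an ISO 8601 string."""
    return datetime.now(timezone.utc).isoformat()


@dataclass(frozen=True)
class TripwireTrigger:
    """Hypothetical forensic record for a single tripwire call."""

    tool_name: str                    # assumed field: which decoy was called
    client_fingerprint: str
    session_id: str
    arguments: dict[str, Any]         # the exact arguments passed
    surrounding_calls: list[str]      # tool call sequence around the trigger
    reconstructed_prompt: Optional[str] = None  # None if it can't be rebuilt
    triggered_at: str = field(default_factory=_utc_now)  # assumed field


def to_log_line(trigger: TripwireTrigger) -> str:
    """Serialize a trigger to a single JSON line with stable key order."""
    return json.dumps(asdict(trigger), sort_keys=True)
```

Keeping reconstructed_prompt optional mirrors the "if we can reconstruct it" caveat: the record is still complete enough to act on without it.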

The counterintuitive part

The best tripwires are ones that never fire. A deployed tripwire that logs zero triggers for six months is doing its job. It's a smoke detector, not a motion sensor. The alert is only meaningful because the baseline is silence.

We'll cover deployment patterns (per-agent, per-environment, per-team) in a follow-up.