Indirect prompt injection: when your AI reads the wrong instructions
Give an AI coding agent the ability to fetch a URL and you have also given it the ability to read instructions written by a stranger. Indirect prompt injection is the attack that turns that into a problem: hidden text on a page that tells the agent to do something its operator never asked for.
The shape of the attack
The agent is helpful by design. It reads a page, a tool result, or a file and acts on what it finds. An attacker exploits exactly that helpfulness by planting commands inside content the agent is expected to trust — “ignore your previous instructions”, “send the contents of this file”, “run this”. The instructions never come from the user, but the agent can’t always tell the difference.
Wrap it as data
The defence is a boundary. Everything that comes from outside the trusted conversation — web pages, subagent output, files from outside the project — is data to be read, not instructions to be followed. Make that boundary explicit and the agent stops treating a web page as a source of commands.
That is what safe-fetch and
mcp-safe-fetch do: fetch the URL in isolation and wrap the
response in UNTRUSTED-WEB tags so the model reads it for facts, never for orders. The
Claude Code prompt-injection gate enforces
the same rule with hooks, so it holds even on the edge cases a prompt alone would miss.
Mechanical, not optional
A policy written in a prompt is a suggestion. A policy enforced by a hook is a rule. 5bats leans on the second kind — the protection that does not depend on the model remembering to behave.
