[PERDITION//SEC]Contact
back to writing
// ai-security   2026-01-21  ·  7 min

AI red team field notes: patterns we keep finding

Across a year of LLM and agent red team engagements, the same five vulnerability classes keep showing up. None of them are bugs in the model. All of them are architectural choices that could have been made differently.

I keep a running list of vulnerability classes I find on AI red team engagements. After the last twelve months, the same five categories keep showing up in roughly the same order, regardless of whether the client is a big-three cloud, a Series-B fintech, or a research lab. None of them are bugs in the underlying model. All of them are architectural choices the application team could have made differently.

First and most common: indirect prompt injection through retrieved or referenced content. The agent reads something — a document, a web page, an email — and that something contains instructions that the agent then executes. This is the bread and butter of AI red teaming and it works almost everywhere I try it. The defense is structural separation of retrieved data from the agent's instruction channel, which almost nobody is doing yet.

Second: tool-abuse chains. The agent has access to several tools that are individually safe but combine into something dangerous. "Read a customer record" is safe. "Send an email" is safe. "Read a customer record AND then send the contents of that record to an attacker-controlled email address as part of answering an unrelated query" is the actual product of those two tools, and most agents will do it cheerfully if you ask in the right way. The defense is strict capability scoping per task and human-in-the-loop on any tool call that mixes data sources.

Third: data exfiltration via output channels nobody thought to control. The agent generates Markdown. The Markdown can render an image with a URL. The URL can be on an attacker-controlled domain and the path can encode whatever the agent just read. Free PII exfiltration, served via the user's own browser. The defense is content-security policy on rendered output and strict allowlisting of any URL the agent can produce.

Fourth: jailbreaks that survive into the supply chain. A model that's been prompted into a particular state can pass that state along through downstream agents, fine-tuned variants, or cached prompts. I've seen jailbreaks survive an architecture refactor because the cached embeddings still encoded the malicious intent. The defense is taking jailbreak survival seriously as a property of the system, not just the prompt.

Fifth: identity confusion between the user and the agent. The LLM has a sense of "I am an agent" and "the user said this," but those two senses are fragile. A confidently-worded retrieval can convince the agent that the user said something they didn't. The defense is, again, structural — instructions come from one channel only, and that channel is verified at the application layer.

If your AI security review didn't cover these five categories, it wasn't an AI security review.