The Case for Headless AI Security

I believe AI governance is simpler than the industry assumes. Here is a long walk through the architecture I think actually works, and why.

The Map

There is a question that keeps showing up in engineering Slack channels, in security team meetings, in late-night threads on Hacker News: how are you supposed to secure all this AI code?

It is a good question. AI coding assistants are now the default. Engineers at every level use them daily. GitHub has reported that over 46% of code in files where Copilot is enabled, across all programming languages, is AI-generated.8 The productivity gains are real. But so is the risk, and I think the risk is underappreciated relative to the speed of adoption.

A 2023 Stanford study found that developers using AI assistants produced significantly less secure code than those working without one, while being more confident in their code's safety.1 A 2025 Veracode report found that 45% of AI-generated code across 100+ LLMs contains security flaws.1b Not bugs. Not style issues. Security vulnerabilities: SQL injection, hardcoded credentials, insecure deserialization, missing authentication checks. The kinds of flaws that lead to breaches. Snyk's 2024 AI Code Security Report confirmed the trend: 56% of surveyed organizations reported that AI coding tools had introduced security issues into their codebases, while only 10% had formal policies governing AI-generated code.9 That number will likely improve as models improve, but the structural problem (that AI generates plausible code faster than humans can review it) is not going away.

If you have been feeling overwhelmed by the complexity of AI governance, I want to offer what I think is good news: it is simpler than the industry is making it seem. The attack surface is large, but the architecture to govern it does not have to be. The tools you need are, in my view, fewer than you think.

This essay is a walk through that architecture as I see it. I will start with the territory as it actually exists, trace the shape of the threat, and arrive at a design that I believe is surprisingly minimal. I could be wrong about parts of this. But the reasoning is here for you to evaluate, and if you enjoy the journey, that is the point.

The Terrain

Let me begin by looking at what has actually changed in how software gets made.

Anthropic calls MCP the "USB-C for AI." The Model Context Protocol, launched in November 2024, is an open standard that solves the integration problem between AI agents and the tools they connect to. Instead of building custom connectors between every AI client and every tool, developers build against a single protocol. MCP servers already exist for Google Drive, Gmail, Slack, GitHub, Postgres, Stripe, Notion, Figma, Salesforce, and hundreds more. Monthly SDK downloads have reached 97 million across Python and TypeScript.

What makes MCP significant is not just the protocol itself. It is who adopted it. OpenAI adopted MCP in March 2025. Google confirmed support in April 2025. Microsoft integrated it into Windows 11, Copilot Studio, and VS Code. In December 2025, Anthropic donated MCP to the Agentic AI Foundation under the Linux Foundation, co-founded by Anthropic, OpenAI, and Block, with Google, Microsoft, and AWS as supporting members. MCP is the industry standard for agent-to-tool communication.

At the same time, Anthropic launched Computer Use (teaching Claude general computer skills rather than building task-specific tools), Claude Desktop's agent capabilities (autonomous multistep task execution by managing, reading, and editing files on your local machine), and the Claude Agent SDK (a framework for building agent-mediated workflows). The pattern is clear: AI is not just generating code. It is operating within entire organizations, autonomously, through existing channels.

The industry has a name for this shift. Foundamental coined "Outcome as a Service" in May 2024 to describe business models where providers deliver outcomes rather than tools. Gartner projects that by 2026, at least 40% of enterprise SaaS will include outcome-based pricing elements. IDC describes the vision explicitly: "Process teams will design workflows around end-to-end outcomes rather than application silos, supported by a new breed of 'headless' software modules accessible via APIs and marketplaces."

This is the terrain as I read it. AI agents connected to everything, operating autonomously, delivering outcomes through existing interfaces rather than new dashboards. The question is: how do you govern it? I have a perspective on this that I think is worth laying out in detail.

The Agentic Org

Before talking about governance, you need to understand what you are governing. And the best way to understand it is to look at what real companies are already building.

Consider a company that has gone all-in on AI agents. They have connected their agents to every system their employees touch: project management, source control, document storage, email, calendar, customer support, knowledge bases. The agents have credentials for all of these. They are not just generating code. They are reading tickets, writing emails, updating documentation, triaging support requests, and scheduling meetings.

The company then defines "skills" for each agent. A skill is a discrete job function: "triage inbound support tickets," "summarize weekly engineering progress," "draft responses to customer emails," "update project status from commit messages." The agents run these skills on a schedule, across every department, dozens of times a day. What used to require a person opening an application, reading context, making a decision, and taking action is now a background process.

This is extraordinarily powerful. An employee sends a message to the agent through a chat interface, and the agent queries across every connected system, correlates data, and returns a synthesized answer in seconds. Work that would have taken an hour of cross-referencing happens instantly.

It is also extraordinarily dangerous.

Every one of those integrations is an attack surface.3 The agent that reads customer support tickets can be tricked into exfiltrating them. The agent that updates project management can be manipulated into closing tickets or injecting malicious content into task descriptions. The agent that sends emails on behalf of employees can be prompt-injected into sending phishing messages from legitimate internal accounts.

The multi-hop chain

Here is the scenario that keeps security engineers up at night.

An attacker plants a prompt injection payload in a customer support ticket. The triage agent reads it, follows the injected instruction, and writes the payload into a document in the shared knowledge base. Another agent, the one that answers employee questions by searching the knowledge base, picks up the poisoned document and begins including the malicious instruction in its responses. A third agent, one with write access to the project management system, receives a poisoned response and executes the embedded instruction, modifying project priorities or creating tasks that redirect engineering effort.

The attack cascades through interconnected agents. Each one amplifies the original injection because none of them were designed to distrust input from other internal systems. This is not hypothetical. Security researchers have demonstrated indirect prompt injection attacks against real systems: Johann Rehberger showed in 2023 that Microsoft Copilot could be exploited through poisoned documents in SharePoint, causing the assistant to exfiltrate data from other connected systems without user awareness.10 The pattern is the same: trusted internal data becomes the attack vector.

Multi-hop prompt injection chain

  Attacker
     |
     v
  [Support Ticket]    <-- injects payload
     |
     v
  Agent A (Triage)    <-- reads ticket, follows injected instruction
     |  writes poisoned content
     v
  [Knowledge Base]
     |
     v
  Agent B (Q&A)       <-- retrieves poisoned doc, propagates instruction
     |  responds with embedded payload
     v
  Agent C (Project)   <-- executes injected instruction
     |
     v
  [Modified priorities, redirected work, data exfiltration]

Each hop looks like normal internal traffic. No single request is malicious on its own. The attack is in the sequence.

This is nearly impossible to detect at the network layer. Every individual API call looks legitimate. Every agent is using its own credentials. The data flowing between systems is normal operational traffic. A gateway or proxy sees nothing wrong because nothing is wrong with any single request. The attack lives in the sequence, in the content, in the way trusted internal systems have become vectors for injecting instructions into other trusted systems.

The Attack Surface

It would be easy to look at the agentic org and conclude that the attack surface is impossibly complex. Dozens of agents, hundreds of integrations, thousands of daily operations. Where do you even begin?

My answer, once I drew it out, turned out to be surprisingly simple. I think every AI security threat falls into one of two categories: what happens on the developer's machine, and what happens in the code they produce.4

The AI attack surface, simplified

  ENDPOINT                        CODE LAYER
  -------------------------       -------------------------
  Data leakage                    Prompt injection
  Unauthorized AI tools           Excessive agency
  Sensitive data in prompts       Insecure output handling
  Shadow AI                       RAG poisoning
                                  Memory poisoning
  SOLVED by:                      Hardcoded secrets
  MDM + EDR + DLP                 PII exposure
                                  Dependency vulns

                                  THE GAP

That is the entire surface as I see it. Two boxes. The left box is about controlling the developer's interaction with AI tools. The right box is about governing the code that AI helps produce and the agents that code creates. The left box has mature, well-understood solutions. The right box is where I believe the gap lives.

A few of the items on the right deserve brief explanation because they are newer than the traditional OWASP categories. RAG poisoning occurs when an attacker injects malicious content into the documents, databases, or knowledge bases that a Retrieval-Augmented Generation system uses for context. The model trusts the retrieved content and follows embedded instructions as if they were legitimate. Memory poisoning targets AI systems that persist conversation history or learned context across sessions. An attacker injects instructions that get stored in the agent's memory, influencing all future interactions even after the original malicious conversation ends. Both attacks exploit the same structural weakness: AI systems that treat data-plane content as control-plane instructions.11

Let me walk through each layer.

The Endpoint (Solved)

The most mature layer of AI governance is the endpoint. This is not a new problem. Organizations have been controlling what runs on managed devices for decades, and the existing tools apply directly to AI.

MDM (Mobile Device Management) controls which applications are allowed on corporate devices. If your concern is employees using unauthorized AI tools, MDM is the answer. You define an allowlist of approved applications, and nothing else runs. This extends to MCP servers: on MDM-enforced devices, you can say "you may only use these MCP servers" and maintain an updated allowlist. The enforcement happens at the device level, before any network traffic is generated.

EDR (Endpoint Detection and Response) monitors for suspicious behavior on the device itself. Unusual file access patterns, credential harvesting, data exfiltration attempts. If an AI desktop application starts behaving oddly, EDR catches it.

DLP (Data Loss Prevention) restricts what data flows into AI conversations. If an employee tries to paste customer records, financial data, or proprietary source code into a chat interface, DLP blocks it.

Endpoint governance layer

  Developer's Machine (MDM-managed)

    MDM: enforce app allowlist
    EDR: monitor behavior patterns
    DLP: block data leaks

    Approved AI Tools (Claude, Cursor, etc.)
      MCP servers, allowlist only: GitHub, Jira, Docs, Slack

This layer is well-understood. Existing enterprise tooling handles it. No new products needed.

This layer works. It is well-understood by IT security teams, it leverages existing enterprise tooling, and it provides real protection against the most common data leakage scenarios. If your concern is employees pasting sensitive data into a chat window, or unauthorized AI tools running on corporate laptops, MDM, EDR, and DLP are, in my view, the right answers. They are already deployed at most enterprises. No new products are needed here.

But endpoint controls have a blind spot. They govern the user's interaction with AI. They do not govern the code that AI helps write, or the agents that code creates, or the security posture of AI-integrated applications before they deploy to production.

Once code leaves the developer's machine and enters source control, endpoint security has no visibility.

The Gaps

The right side of the diagram is where things get interesting. Once AI-assisted code enters the repository, three things need to happen that are not happening today at most organizations.

1. A runtime security SDK

Application code that integrates with AI models needs runtime guardrails. Input validation before prompts reach the model. Output sanitization before model responses reach the user. Token budget enforcement before an agent burns through your API credits. Scope restrictions on agent actions before they execute something destructive.

These guardrails belong in the application itself. A runtime SDK provides the primitives: sanitize this input, validate this output, enforce this budget, restrict this scope. The SDK runs inside the application, where it has full context and near-zero latency. It is a library you import, not a proxy you route through.

The latency point matters more than it might seem. Engineers building agentic systems are often wary of adding security libraries because they fear slowing down the agent's "thinking" process. A network proxy adds a round trip on every model call: DNS resolution, TLS handshake, request serialization, response deserialization. That is 50 to 200 milliseconds per hop, compounding across multi-step agent chains. An in-process SDK, by contrast, executes validation logic in microseconds. It runs in the same memory space as the application, with no network overhead. For an agent making dozens of tool calls in a single task, the difference between in-process and network-proxy security is the difference between imperceptible overhead and a noticeably slower product.7
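To make the shape of these primitives concrete, here is a minimal sketch of what an in-process guardrail library might expose. All names, patterns, and thresholds are illustrative inventions, not a real SDK's API:

```typescript
// Minimal sketch of in-process guardrail primitives. All names and
// patterns here are illustrative, not a real SDK's API.

const INJECTION_PATTERNS: RegExp[] = [
  /ignore (all )?previous instructions/i,
  /disregard (the )?system prompt/i,
];

// Input validation: reject prompt text containing known injection phrasing.
function sanitizeInput(prompt: string): { ok: boolean; reason?: string } {
  for (const p of INJECTION_PATTERNS) {
    if (p.test(prompt)) return { ok: false, reason: `matched ${p.source}` };
  }
  return { ok: true };
}

// Token budget enforcement: a per-session counter checked before each call.
class TokenBudget {
  private used = 0;
  constructor(private readonly limit: number) {}
  charge(tokens: number): boolean {
    if (this.used + tokens > this.limit) return false;
    this.used += tokens;
    return true;
  }
}

// Scope restriction: agents may only invoke tools on an explicit allowlist.
function allowTool(name: string, allowlist: Set<string>): boolean {
  return allowlist.has(name);
}
```

Because all three checks run in the application's own process, they cost microseconds per call, which is the substance of the latency argument.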

2. A pre-commit VSCode Extension and post-commit source control app

Before code is committed, a VSCode Extension scans the staged diff for AI-specific vulnerabilities. It catches prompt injection payloads, hardcoded API keys, and insecure agent configurations before they ever enter the repository. This is fast and local. It runs in milliseconds on the developer's machine.

After code is committed and a pull request is opened, a source control app (GitHub App, GitLab integration) scans the full diff with the complete rule set and AI review. This is thorough and remote. It runs in seconds on a server with the full detection engine.

Two checkpoints. One local and fast, one remote and thorough. Together they create a layered defense that catches issues at the earliest possible moment and again before merge.

3. Pre-deployment probabilistic red teaming

Before code deploys to production, an AI agent reviews the complete change set with adversarial intent. Not just pattern matching, but reasoning about how an attacker might exploit the new code. This is the probabilistic layer. It catches architectural risks, subtle logic flaws, and context-dependent vulnerabilities that deterministic patterns cannot express.

Consider a concrete example. A developer adds a new endpoint that accepts a user-provided URL and fetches its content to display a preview. A deterministic scanner checks for obvious issues: is the URL validated? Is SSRF protection in place? But the red team agent reasons further. It considers: what if an attacker chains this endpoint with the internal agent that summarizes fetched documents? The agent could be directed to fetch a URL containing prompt injection payloads, which then propagate through the summarization pipeline. This cross-component reasoning, connecting a seemingly benign feature to the broader agent architecture, is exactly the kind of contextual analysis that deterministic patterns cannot express.

The red team runs after the deterministic scan passes, adding a final layer of scrutiny before code reaches users. Read-only. It observes and reports. It never modifies your code.

The full AI governance stack

  ENDPOINT (solved)               CODE LAYER (the gap)
  -----------------               --------------------
  MDM                             1. Runtime SDK
  EDR                                input/output guardrails,
  DLP                                token budgets, scope restrictions

  Controls the user's             2. Pre-commit Extension
  interaction with AI                local, fast, immediate

                                  3. Post-commit App
                                     GitHub/GitLab, thorough

                                  4. Pre-deploy Red Team
                                     probabilistic, read-only

                                  Controls the code AI produces
                                  and the agents that code creates

Two layers. One solved. One open.

That is the full stack as I envision it. Two layers. The endpoint layer is solved with existing enterprise tools. The code layer has four components: a runtime SDK, a pre-commit VSCode Extension, a post-commit source control app that reviews code for AI-specific vulnerabilities, and a pre-deployment red team, all backed by a full audit trail for observability. This is where I see the gap. This is what I believe needs to exist.

Why Not Gateways

If you have been following the AI security space, you might be wondering: what about gateways? What about MCP proxies? What about network-layer interception?

The current wave of AI security startups is building gateways: network-layer proxies that sit between your application and the AI model provider, intercepting every request and response. The pitch is compelling: "We are the firewall for AI. All your LLM traffic flows through us."

I want to be fair here: gateways are genuinely useful for some things. Cost observability, rate limiting, usage analytics, PII redaction on outbound prompts. If you need to know how much your organization spends on model API calls, a gateway gives you that in one place. These are real operational problems and gateways solve them well. Smart people are building them and I respect the work.

But as a security architecture, I think gateways have structural limitations that matter. Worse, they can create a false sense of security. A gateway catches the symptom (a dangerous request) but leaves the disease (the vulnerable code that constructed it) in the repository, ready to be triggered by other vectors the gateway might not see. The team sees "gateway is active, we are protected" while the architectural flaw persists in source control, waiting for a code path that bypasses the proxy entirely.5 Here is my reasoning for why this matters.

MDM already enforces MCP server allowlists

The argument for an MCP gateway goes something like: "You need a proxy between your agents and their MCP servers to control what tools they access." But on managed devices, MDM already enforces which MCP servers are allowed. You define an allowlist at the device level and keep it updated. The enforcement happens before any network traffic is generated. It seems to me like the gateway is solving a problem that device management already solved, though I acknowledge there may be edge cases in unmanaged environments where a gateway adds value here.

Agents bypass them

Modern AI agents do not make one API call. They make dozens, across multiple providers, often from different network contexts. A tool server running locally. An orchestration layer spawning sub-agents that each call different model APIs. An AI assistant making requests from an isolated sandbox. These calls do not flow through your corporate proxy. The gateway sees a fraction of the actual AI activity.

They are a single point of failure

Any system that requires all traffic to flow through a single intermediary is a single point of failure. When the gateway goes down, every AI feature in your application stops working. Your customers experience this as "the app is broken." You have introduced a dependency in the critical path that has no business being there.

They have no visibility into local or self-hosted models

A growing number of organizations run inference locally using open-weight models (Llama, Mistral, Qwen) or deploy models on private infrastructure for compliance or latency reasons. These model calls never leave the network. They never hit an external API endpoint. A gateway designed to intercept traffic between your application and a cloud model provider has zero visibility into this activity. The agent's behavior is identical, the security risks are identical, but the gateway sees nothing because there is no outbound traffic to intercept. As local inference becomes more common, this blind spot will grow.

They solve the wrong problem

This is the argument I feel strongest about. A gateway scans what your application sends to a model. It does not scan the code that builds the application. In my experience, the vulnerability is usually not in the API call. It is in the source code that constructs the prompt, handles the response, and decides what the agent is allowed to do. By the time a dangerous request reaches the gateway, the architectural mistake was made weeks ago in a pull request that nobody reviewed for AI-specific threats.

I believe prevention beats detection. Scanning the code that creates the agent is, in my view, more valuable than intercepting the agent's runtime traffic.

To be clear: this is not an argument against gateways existing. They serve real operational needs and a defense-in-depth approach might well include both. It is an argument against gateways as the primary security control. I believe the vulnerability lives in source code. The gateway sees the traffic that source code produces, which I think is too late and too shallow for the most important class of detections.

The Architecture

So if gateways are not the primary answer, what is?

I believe the right place to catch AI security issues is the same place you catch every other kind of security issue: in the code, at the pull request, before it merges. The architecture I am proposing is headless. It delivers outcomes, not dashboards.

  1. Install a GitHub or GitLab app on your organization. Read-only access. No config files, no CLI tools, no agent to install.
  2. Open a pull request. Pattern scanners and AI review run in parallel. Results appear as a PR comment in seconds.
  3. Get notified. Critical findings go to Slack. Weekly summaries arrive by email. Fix issues before they merge.

The entire security workflow happens where the developer already works: in the pull request, in Slack, in email. There is no new tool to adopt. Security governance becomes a side effect of the development process they are already following.

Headless scan pipeline

  Developer opens PR
     |
     v
  Scan Pipeline
    Deterministic patterns (85% of findings): 200+ rules, 12 scanner categories
    AI review (15% of findings): contextual analysis, adversarial reasoning
     |
     v
  Confidence Router
    High   -> finalize
    Medium -> cache
    Low    -> AI judge
     |
     v
  PR Comment        Slack Alert        Email Report

No dashboard. No login. No new workflow. Results arrive where you already work.

The design principle is minimal surface area. The system should do its job without requiring the user to learn a new interface, configure rules, or remember to check a dashboard. If findings appear in the PR comment thread, developers will read them because they already read PR comments. If critical alerts go to Slack, security teams will see them because they already watch Slack. The delivery mechanism matters as much as the detection engine.

The 85/15 Thesis

There is a popular belief in the AI security space that you need AI to fight AI. That the only way to detect prompt injection is with another model, that threat detection must be "AI-native" to be effective.

I think this is mostly wrong. I want to explain why, because the reasoning matters more than the conclusion.

In my experience building detection patterns, about 85% of AI-specific security threats have deterministic signatures. Prompt injection payloads contain recognizable phrases. Hardcoded API keys match well-known patterns. Agents configured without human confirmation steps have that configuration visible in the source code. These patterns do not require AI to detect. They require well-written pattern matching, tested against thousands of examples, executing in milliseconds at zero marginal cost. The 85% number is approximate (it comes from my own work, not a peer-reviewed study) but the directional claim is what matters: the majority of these threats are pattern-matchable.
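As a sketch of what such rules look like, here is a tiny rule table. The rule IDs and the agent-config pattern are illustrative inventions; the AWS access-key prefix and the injection phrasing are the kind of well-known, fixed shapes the argument relies on:

```typescript
// Illustrative deterministic rule table. Rule IDs and the config-key
// pattern are inventions; the AWS key prefix and injection phrasing
// are well-known, fixed shapes.
interface Rule {
  id: string;
  pattern: RegExp;
  severity: "high" | "medium";
}

const RULES: Rule[] = [
  // AWS access key IDs have a fixed prefix and length.
  { id: "secret.aws-access-key", pattern: /\bAKIA[0-9A-Z]{16}\b/, severity: "high" },
  // Common prompt-injection phrasing.
  { id: "prompt.override", pattern: /ignore (all )?previous instructions/i, severity: "high" },
  // A hypothetical agent config key that disables human confirmation.
  { id: "agent.no-confirm", pattern: /requireConfirmation\s*:\s*false/, severity: "medium" },
];

// Scan a diff and return the IDs of every rule that fires.
function scan(diff: string): string[] {
  return RULES.filter((r) => r.pattern.test(diff)).map((r) => r.id);
}
```

Every rule here is inspectable, testable, and free to run, which is the whole point.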

The remaining 15% genuinely needs AI judgment. A model API call inside an error-handling block might be fine, or it might be swallowing a critical security error. A data flow from user input to a database query might pass through a parameterized query builder (safe) or string concatenation (dangerous). For these ambiguous cases, you send the finding and its context to an AI model and ask for a verdict: confirm, dismiss, or escalate.

If I am right about this ratio, it is a feature, not a limitation:

  • Deterministic rules are predictable. Same input, same output, every time. No hallucinations. No prompt injection against your own security scanner.
  • Deterministic rules are fast. A regex scan completes in under one second. An AI review takes 3 to 10 seconds. Running everything through AI makes your tool 10x slower.
  • Deterministic rules are cheap. Zero API cost per pattern match. The difference between "scan everything with AI" and "scan 15% with AI" is an order of magnitude in cost.
  • Deterministic rules are auditable. Every pattern can be inspected, tested, version-controlled, and explained to an auditor. "The AI said it was dangerous" is not an acceptable audit finding.6

The part of this architecture I find most compelling, and the part I am most curious to see validated at scale, is that the deterministic percentage should increase over time. When the AI judge confirms the same pattern 100 times with greater than 95% consistency, that pattern becomes a candidate for promotion to a deterministic rule. Critically, a human reviews and approves every promotion. This is not optional. Without a human-in-the-loop gate, you risk "hallucinated" security rules: patterns the AI confidently but incorrectly validated, now running deterministically on every future scan with no second opinion. The human approves the promotion, verifies the pattern logic, and only then does the AI call get eliminated for future matches. In theory, the system gets faster, cheaper, and more reliable with every scan while maintaining the integrity of its rule set. Whether this holds in practice at scale is something I will learn by running it.

Self-improving detection loop

  New finding (ambiguous)
     |
     v
  AI Judge reviews with context
     |
     +-- Confirm (true positive)
     |     |
     |     v
     |   Tally: pattern X confirmed 100 times
     |     |
     |     v
     |   Human reviews promotion candidate
     |     +-- Approved -> deterministic rule (no AI needed for future matches)
     |     +-- Rejected -> stays probabilistic (AI continues to judge case-by-case)
     |
     +-- Dismiss (false positive) -> refine pattern to reduce noise
     |
     +-- Escalate (needs human review)

The longer it runs, the less it depends on AI. No rule is promoted without human approval.

My operating thesis: AI should build better deterministic logic, not be the logic. The goal is for AI to make itself unnecessary for an ever-growing fraction of detections.

What I Built

I built a system that implements the architecture described above. I want to walk through how some of the harder detection problems actually work in practice, because the interesting parts are in the details.

The scan pipeline

When a pull request is opened, the scanner receives a webhook, fetches the diff, and runs 12 scanner categories in parallel. Each scanner is a deterministic pattern matcher: regex-based rules tested against thousands of examples. The combined rule set covers prompt injection, agentic security, secret detection, PII exposure, dependency vulnerabilities, taint analysis, compliance rules, and more.2

Findings that pass a confidence threshold are finalized immediately. Ambiguous findings go to the AI judge, a single Claude API call per diff chunk that reviews the pattern match in context and returns one of three verdicts: true positive, false positive, or needs review. The response is structured JSON, not free text, so parsing is deterministic even though the judgment is not.
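A minimal sketch of that routing and parsing logic follows. The thresholds and JSON field names are assumptions chosen for illustration:

```typescript
// Sketch of the confidence router and verdict parsing. Thresholds and
// JSON field names are assumptions, not the real system's values.
type Verdict = "true_positive" | "false_positive" | "needs_review";

interface Finding {
  ruleId: string;
  confidence: number; // 0..1, assigned by the deterministic scanner
}

function route(f: Finding): "finalize" | "cache" | "ai_judge" {
  if (f.confidence >= 0.9) return "finalize"; // high: report immediately
  if (f.confidence >= 0.6) return "cache";    // medium: hold for correlation
  return "ai_judge";                          // low: ask the model
}

// The judge returns structured JSON, so parsing stays deterministic
// even though the judgment itself is not.
function parseVerdict(raw: string): Verdict {
  const parsed = JSON.parse(raw) as { verdict: string };
  const allowed: Verdict[] = ["true_positive", "false_positive", "needs_review"];
  if (!allowed.includes(parsed.verdict as Verdict)) {
    throw new Error(`unexpected verdict: ${parsed.verdict}`);
  }
  return parsed.verdict as Verdict;
}
```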

Agent intent analysis

Consider a PR that adds an MCP tool registration:

server.tool("send_email", { to: z.string(), body: z.string() }, async (args) => {
  await emailClient.send(args.to, args.body);  // no confirmation step
  return { sent: true };
});

The agentic scanner detects three things here: a tool registration with write capabilities (it sends email), an unbounded scope (any recipient, any body), and no confirmation flow (the agent executes immediately without human approval). The PR comment surfaces all three as a structured finding with the file path, line number, and a specific remediation: add a confirmation step before the send, or restrict the recipient to a validated allowlist.
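A hedged sketch of what the remediated handler might look like, using an in-memory confirmation queue and an invented allowlist in place of a real approval workflow:

```typescript
// Sketch of the remediated handler. The allowlist and in-memory
// confirmation queue are stand-ins for a real approval workflow.
interface ToolResult {
  sent: boolean;
  error?: string;
  pendingConfirmation?: number;
}

const ALLOWED_RECIPIENTS = new Set(["support@example.com"]);
const confirmQueue: { to: string; body: string }[] = [];

function sendEmailTool(args: { to: string; body: string }): ToolResult {
  // Fix for unbounded scope: restrict recipients to a validated allowlist.
  if (!ALLOWED_RECIPIENTS.has(args.to)) {
    return { sent: false, error: "recipient not on allowlist" };
  }
  // Fix for missing confirmation: queue for human approval, do not send now.
  const id = confirmQueue.push(args) - 1;
  return { sent: false, pendingConfirmation: id };
}
```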

This is not something traditional SAST tools look for, at least not today. They scan for SQL injection and XSS. They do not yet understand that an agent registering a tool with write access and no confirmation gate is a security risk. The vulnerability is architectural, not syntactic, and I think this category of architectural vulnerability is going to grow as agent systems get more complex.

Agent replay

For multi-agent systems, the scanner traces data flow between agents in the diff. If Agent A writes to a shared resource (a database, a message queue, a file) and Agent B reads from that same resource, the scanner identifies the chain and checks whether Agent B validates its input. This is how you catch the multi-hop injection pattern described earlier, not by intercepting network traffic, but by reading the code that builds the agents and asking: does this chain trust internal data without validation?

The scanner looks for concrete patterns: a tool handler that writes to a database table paired with a retrieval function that reads from the same table without sanitization. A message queue producer in one agent and a consumer in another that parses the message body directly into a prompt template. A shared file system where one agent writes summaries and another agent reads them as context for user-facing responses. In each case, the detection is static: the scanner reads the code, identifies the shared resource, and checks whether the consuming agent treats the data as untrusted input. If it does not, the finding is surfaced with the specific file paths and line numbers for both the producer and the consumer, so the developer can see the complete chain in a single PR comment.
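In miniature, the chain check is a join over code units: find a writer and a reader of the same shared resource, then flag the reader if it never treats the data as untrusted. The SQL-shaped signals and sanitizer keywords below are illustrative assumptions:

```typescript
interface CodeUnit {
  file: string;
  source: string;
}

interface ChainFinding {
  resource: string;
  producer: string; // file path of the writer
  consumer: string; // file path of the unvalidated reader
}

function findUnvalidatedChains(units: CodeUnit[]): ChainFinding[] {
  const findings: ChainFinding[] = [];
  for (const producer of units) {
    // Crude static signal for "writes to a shared resource": an INSERT
    // into some table. Real detection covers queues, files, caches, etc.
    const write = producer.source.match(/INSERT INTO (\w+)/i);
    if (!write) continue;
    const resource = write[1];
    for (const consumer of units) {
      const reads = new RegExp(`SELECT .* FROM ${resource}`, "i").test(consumer.source);
      const sanitizes = /sanitize|validate|escape/i.test(consumer.source);
      // Reader consumes the shared resource without any validation signal:
      // surface both ends of the chain so the developer sees it whole.
      if (reads && !sanitizes) {
        findings.push({ resource, producer: producer.file, consumer: consumer.file });
      }
    }
  }
  return findings;
}
```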

Breakage risk

When a PR upgrades a dependency to a new major version, the dep-analyzer checks the changelog and migration guide for breaking changes. But more importantly, it performs reachability analysis: does your code actually import the APIs that changed? A major version bump in lodash that removes _.pluck is only a risk if your codebase calls _.pluck. If it does, the PR comment shows the exact import path and suggests the migration. If it does not, the finding is suppressed.
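A minimal version of that reachability check follows. The data shape is an assumption for the example; lodash 4 really did remove `_.pluck`:

```typescript
interface BreakingChange {
  pkg: string;        // e.g. "lodash"
  removedApi: string; // e.g. "pluck"
}

// A breaking change is only a risk if some source file actually reaches
// the removed API. Two crude signals: a named import of the API from the
// package, or a _.pluck-style member call.
function isReachable(change: BreakingChange, sourceFiles: string[]): boolean {
  const named = new RegExp(
    `\\{[^}]*\\b${change.removedApi}\\b[^}]*\\}\\s*from\\s*['"]${change.pkg}['"]`,
  );
  const member = new RegExp(`\\b_\\.${change.removedApi}\\b`);
  return sourceFiles.some((src) => named.test(src) || member.test(src));
}
```

If `isReachable` returns false, the finding is suppressed and the PR comment stays quiet; if true, the comment can point at the exact file that needs the migration.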

The self-improving loop

Every PR comment includes a feedback mechanism. When a developer reacts to a finding or replies explaining why it is a false positive, that signal is recorded against the pattern that produced it. When a pattern accumulates enough false-positive feedback, the AI judge reviews the pattern itself and either refines it or suppresses it. When the AI judge consistently confirms a particular pattern type across many scans (greater than 95% precision over 100+ evaluations), that judgment gets promoted to a deterministic rule and the AI call is eliminated for future matches.
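The promote/suppress decision reduces to a small threshold function. The 95% precision and 100-evaluation figures come from the description above; the stats shape and the suppression cutoff are assumptions for illustration:

```typescript
interface PatternStats {
  confirmed: number; // AI judge or developer confirmed a true positive
  rejected: number;  // AI judge or developer flagged a false positive
}

type Decision = "promote" | "suppress" | "keep_judging";

function nextAction(stats: PatternStats): Decision {
  const total = stats.confirmed + stats.rejected;
  // Not enough evidence yet: keep routing ambiguous matches to the judge.
  if (total < 100) return "keep_judging";
  const precision = stats.confirmed / total;
  // Consistently confirmed: promote to a deterministic rule, drop the AI call.
  if (precision > 0.95) return "promote";
  // Mostly noise: refine or retire the pattern. (0.5 cutoff is an assumption.)
  if (precision < 0.5) return "suppress";
  return "keep_judging";
}
```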

This is the part I find most interesting technically, and the part I am most honest about being unproven. The system's dependency on AI should decrease over time. Early on, a higher percentage of findings need AI judgment. As the feedback loop runs, more judgments should become deterministic. If my thesis is correct, the product gets faster, cheaper, and more reliable the longer it operates on a codebase. But I have not yet seen this at enough scale to claim it as proven.

The Road Ahead

The governance question is not going away. Agents are getting more autonomous, more interconnected, and more embedded in organizational workflows. The attack surface I described (multi-hop injection chains, excessive agency, unvalidated tool registrations) will, in my estimation, get worse before it gets better, because the economic incentives all point toward more agents with more access.

But I do not think the architecture to govern it needs to be complex. The attack surface has two layers. The endpoint layer is solved with existing enterprise tools. The code layer needs a small number of components, each doing one thing well:

What you actually need

ENDPOINT (you already have this)
  MDM .............. app allowlists, MCP server allowlists
  EDR .............. behavior monitoring
  DLP .............. data leak prevention

CODE LAYER (the gap)
  Runtime SDK ...... in-app guardrails for AI integrations
  Pre-commit Ext ... local scan before code enters repo
  Post-commit App .. AI-specific code review on every PR
  Red Team ......... pre-deploy adversarial AI review
  Audit Trail ...... full observability and compliance evidence

Two layers. Existing tools + a thin platform.

The regulatory landscape is also converging on this direction. The EU AI Act, which entered into force in August 2024, imposes specific obligations on providers and deployers of high-risk AI systems, including requirements for risk management, technical documentation, and human oversight that map directly to the code-layer controls described above.12 In the United States, Executive Order 14110 on Safe, Secure, and Trustworthy AI (October 2023) directed NIST to develop frameworks for AI red-teaming and security evaluation. Organizations that build audit trails, deterministic detection, and human-in-the-loop governance into their development pipelines now will be better positioned when these requirements become enforceable standards rather than voluntary guidelines.

The interesting open question is whether the 85/15 ratio holds as agent architectures get more sophisticated. My intuition is that it does, since more sophisticated agents still have their behavior defined in source code, and source code patterns remain deterministic. But it is possible that novel agent interaction patterns will require a higher AI percentage before enough examples accumulate to generate deterministic rules. This is something I will learn by running the system at scale and watching the feedback loop.

Another open question is where the runtime SDK boundary should live. In-process (a library you import) gives you zero-latency enforcement and full context, but it requires code changes. A sidecar process gives you language independence but adds latency and loses context. I lean toward the in-process approach for security-critical guardrails and a sidecar for observability, but this is an area where real-world usage patterns will dictate the design.
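For the in-process option, the guardrail can literally wrap the model call, which is why it adds no network hop and sees full context. A hypothetical sketch of that shape; every name here is an assumption, not a real SDK API:

```typescript
// A guard inspects text and allows or blocks it with a reason.
type Guard = (text: string) => { allowed: boolean; reason?: string };

// Wrap any async model call with input and output guards. Because this runs
// in-process, enforcement happens before the prompt ever leaves the app and
// before the response reaches the caller.
function withGuardrails(
  call: (prompt: string) => Promise<string>,
  inputGuards: Guard[],
  outputGuards: Guard[],
): (prompt: string) => Promise<string> {
  return async (prompt: string): Promise<string> => {
    for (const guard of inputGuards) {
      const v = guard(prompt);
      if (!v.allowed) throw new Error(`input blocked: ${v.reason ?? "policy"}`);
    }
    const output = await call(prompt);
    for (const guard of outputGuards) {
      const v = guard(output);
      if (!v.allowed) throw new Error(`output blocked: ${v.reason ?? "policy"}`);
    }
    return output;
  };
}
```

A sidecar would implement the same interface over a local socket instead of a function call, trading the shared process context for language independence.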

This is why I built it. Not because the problem is impossibly complex, but because I believe the solution is surprisingly tractable once you identify where the vulnerability actually lives. The source code. The pull request. The place where the architectural mistake is made, weeks before any gateway would see the traffic it produces. I may be wrong about parts of the architecture I have laid out here. But the core claim, that the vulnerability lives in the code and that is where governance should start, is one I am confident enough to bet on.

1. Perry et al., "Do Users Write More Insecure Code with AI Assistants?" Stanford University, 2023 (ACM CCS). The study found that participants with access to AI coding assistants produced significantly less secure code than those without, while being more likely to believe their code was secure.

1b. Veracode, "AI-Generated Code Security Risks," 2025. Analysis across 100+ LLMs found that 45% of AI-generated code contains security flaws.

2. Coverage spans the OWASP Top 10 for LLM Applications (2025), the OWASP Top 10 for Agentic AI, the CWE Top 25 (MITRE, 2024), and nine compliance frameworks: GDPR, HIPAA, PCI-DSS, SOC 2, SOX, CCPA, FERPA, COPPA, and ISO 27001.

3. OWASP Top 10 for LLM Applications (2025) identifies "Excessive Agency" (LLM08) as a critical risk: agents granted unnecessary permissions, functions, or autonomy to act without proper controls. The OWASP Top 10 for Agentic AI further classifies "Uncontrolled Downstream Access" and "Agent-to-Agent Trust" as top-tier threats in multi-agent systems.

4. NIST AI 100-2 (Adversarial Machine Learning: A Taxonomy and Terminology of Attacks and Mitigations, January 2024) categorizes AI system threats into supply chain, training data, and inference-time attacks. MITRE ATLAS (Adversarial Threat Landscape for AI Systems) provides a complementary attack framework mapping adversarial techniques to the AI lifecycle, reinforcing that threats manifest both in the development environment and in the deployed artifacts.

5. NIST SP 800-53 Rev. 5 (Security and Privacy Controls for Information Systems) emphasizes "defense in depth" and warns against reliance on single-layer controls: "Organizations should not depend on a single security mechanism for any security function." The OWASP Application Security Verification Standard (ASVS) echoes this with V1.1.2, requiring that security controls are not bypassable by routing requests through alternative paths.

6. NIST AI 600-1 (AI Risk Management Framework: Generative AI Profile, July 2024) highlights the importance of explainability and traceability in AI-augmented security decisions. Section 2.6 notes that "organizations should maintain the ability to explain and reproduce decisions" made by or with AI systems, making deterministic, auditable rules essential for compliance and trust.

7. OWASP Top 10 for LLM Applications (2025) identifies "Unbounded Consumption" (LLM10) as a risk category that includes performance degradation from excessive overhead in the inference pipeline. The NIST Secure Software Development Framework (SSDF, SP 800-218) recommends integrating security checks "as close to the developer as possible" to minimize both latency and the cost of remediation, favoring in-process and pre-commit approaches over post-deployment controls.

8. GitHub CEO Thomas Dohmke reported at the GitHub Universe 2024 conference that over 46% of code written with GitHub Copilot enabled, across all programming languages, is now AI-generated, with the figure exceeding 55% in certain languages. This trajectory suggests that AI-assisted code will constitute the majority of new code within most enterprise codebases by 2026.

9. Snyk, "AI Code Security Report," 2024. The survey of over 500 security practitioners and developers found that 56% of organizations experienced AI-introduced security issues, while only 10% had formal governance policies for AI-generated code. The report also found that 80% of developers bypass established security policies to use AI coding tools.

10. Johann Rehberger, "Prompt Injection Attacks on Microsoft 365 Copilot," 2023-2024. Rehberger demonstrated multiple indirect prompt injection vectors against Microsoft Copilot, including data exfiltration through poisoned documents in SharePoint and Teams. The research was acknowledged by Microsoft and contributed to improvements in their content filtering. Similar indirect injection research has been published by Greshake et al. ("Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection," 2023).

11. OWASP Top 10 for LLM Applications (2025) classifies "Sensitive Information Disclosure" (LLM06) and "Prompt Injection" (LLM01) as the top two risks, both of which are exploitable through RAG and memory poisoning vectors. The OWASP Agentic AI Threat Modeling framework specifically identifies "Tainted Data in Shared Memory" and "Poisoned Retrieval Sources" as distinct attack patterns in multi-agent systems.

12. Regulation (EU) 2024/1689 (the EU AI Act) entered into force on August 1, 2024, with staged enforcement through 2027. Article 9 requires risk management systems, Article 11 mandates technical documentation, and Article 14 requires human oversight measures for high-risk AI systems. In the United States, Executive Order 14110 (October 30, 2023) directed NIST to develop guidelines for AI red-teaming, and NIST responded with AI 600-1 (the Generative AI Profile of the AI Risk Management Framework) in July 2024.

Clay Good is a security engineer building AI security infrastructure. More at claygood.com.