Prompt → RAG → MCP → Agent → Harness, and What?


Why the next layer in AI may be governance infrastructure, not just better agents.


🔎 Quick glossary — for readers new to this space:

  • Harness — the software layer around an AI model that controls what it can see, what tools it can use, and what actions it can take
  • OpenClaw — Anthropic's Claude Code, an AI agent that runs directly inside a developer's terminal and codebase
  • MCP — Model Context Protocol, a standard way for models and agents to connect to external tools, data sources, and services
  • Prompt injection — an attack where malicious instructions hidden in data override an agent's intended behavior
  • Fail-closed — a design default where a system stops and waits for human input when uncertain, rather than continuing
  • Tamper-evident audit trail — a record of system actions written in a way that cannot be altered after the fact, typically using cryptographic hashing
  • Drift — the process by which a system that was once safe gradually becomes unsafe through small, accumulated changes rather than a single visible event

Where We Left Off

The previous report analyzed the Claude Code source-map exposure and made one central claim: the model is not the whole product.
The harness is.
Context handling, tool routing, permissions, continuity, recovery, cost discipline. That is where production agent performance actually lives.
We closed that report with a blunt conclusion: "The next durable moat in AI may not be the model alone. It may be the harness."
This piece takes the next step in that argument.
If the leak showed that the real moat was the harness, it also raised a harder question. Once the harness itself became visible, the issue was no longer simply who had built the strongest one. The issue was whether the harness itself was being governed by anything independent of the team that built it.
That is where this piece begins.
MCP mattered because it standardized the connection layer between models, tools, and external systems — which is precisely why governance can no longer stop at the prompt boundary.

Two Findings That Define the Problem

On March 31, 2026, a 59.8 MB JavaScript sourcemap shipped inside the Claude Code npm package. Within hours, 512,000 lines of TypeScript were mirrored across GitHub.
A flood of analysis followed. Much of it focused on hidden prompts, tool calls, or line-by-line code secrets.
What interested us was different. Among the many breakdowns that appeared, two analyses mattered most here, because each exposed an operating failure inside the harness itself.

1. Alex Kim: silent failure at scale

The leaked source contained an internal BigQuery data point showing repeated compaction failures at significant scale. The fix was three lines of code. The more important fact was that the problem had been running silently long enough to require a data pull to surface it.
The system did not halt. It continued. Silently.

2. Adversa AI: guardrails failing in sequence

They reported that once a pipeline exceeded 50 subcommands, deny rules stopped running and security validators were skipped without surfacing the failure to the developer. The primary boundary between the agent and the system failed silently.
Both findings point to the same structure. The harness had rules. The rules had exceptions. The exceptions were invisible. No one inside the system detected either failure. Both were surfaced by external parties only after the fact.
Each incident, taken alone, could be read as an implementation error. Taken together, they point to something larger: a mature harness, built by a serious team, can run silent failures for months without its own monitoring detecting them.
That is the bridge to governance. Once the question is no longer whether the harness exists, but who is watching it, the discussion moves beyond bugs and into control, oversight, and accountability.

The Structural Problem Nobody Names Directly

The industry response was predictable. Governance became the next keyword.
LinkedIn filled with AI Governance Leads. Substack newsletters explained that what you need after the harness is a policy layer. RSA 2026 dedicated entire tracks to it. Vendors announced governance products within weeks of the leak.
Here is what none of that addresses.
The first structural problem is independence.
The people now selling governance are the same category of people who built the harnesses that failed. Their commercial incentive is to make AI systems run, not to constrain them. When a vendor builds a governance layer into its own agent platform, that layer is designed to make the platform more useful. It is not designed to be adversarial to the platform.
To be precise: this is not a claim that vendor-built controls are worthless. First-line controls built by the vendor serve a real purpose. A viable architecture combines vendor-built first-line controls with an independent second-line governance layer maintained by a party with no stake in the system's output.
The problem is not the first line. The problem arises when the first line is the only line, and when the same entity that builds the agent also defines, operates, and audits the rules that govern it. An auditor who reports to the entity being audited is not an auditor. A governance layer that runs inside the runtime it governs inherits that runtime's failure modes.
Anthropic's permission system failed because an internal performance optimization created a silent exception. That exception was invisible to the system's own monitoring. It was found by Adversa AI, an external security firm with no stake in Claude Code's success. That is what independent governance looks like. The problem is that it appeared after deployment, not before.
Real governance is adversarial by design. It asks: what is this system doing that it should not be doing? That question cannot be answered by the system itself.

The Speed Problem Makes It Worse

There is a second structural problem that compounds the first: speed.
Every governance framework currently being proposed operates at human speed. Committees review policies quarterly. NIST held three workshops in 2025, released a preliminary draft in December, and is targeting an initial public draft for 2026. Framework-level documents take months to produce and years to enforce.
The attack surface is moving faster.
OpenAI has publicly stated that prompt injection may never be fully solved for browser-based agents. Their own automated red-teaming found a new class of attacks that human researchers had not previously identified. Their RL-trained attacker can steer an agent into executing long-horizon harmful workflows that unfold across tens or hundreds of steps.
A joint paper from researchers at OpenAI, Anthropic, and Google DeepMind, published on arXiv in October 2025, tested 12 published defenses against prompt injection. Using adaptive attacks, they bypassed all 12 with success rates above 90% for most. Those defenses had originally reported near-zero attack success rates.
The governance gap is not a policy gap. It is a speed gap.
Framework-level documents are necessary. They establish vocabulary, assign accountability, and create legal surface area. They are not sufficient as a substitute for controls that run at machine speed. You cannot defend against attacks that defeat every published defense before your committee has convened to review the log.

What That Actually Requires

So what would governance actually have to look like if it had to be independent and operate at machine speed?
Three properties follow from that constraint. A fourth requirement — drift-aware revalidation over time — remains unresolved and is addressed separately below.
| Requirement | Why it exists | What fails without it |
| --- | --- | --- |
| Policy-as-Code, external to the runtime | The rules must sit outside the same context surface they are meant to constrain. | The agent can reinterpret or override the very rules meant to govern it. |
| Audit trail that is tamper-evident | Enforcement only matters if the record can be verified outside the system that produced it. | A compromised system can rewrite its own history. |
| Fail-closed as the architectural default | When uncertainty appears, the default cannot be to continue silently. | The system keeps operating through ambiguity, drift, or hidden failure. |

1. Policy-as-Code, external to the runtime.

The governing rules cannot live inside the same context surface they are meant to constrain. A CLAUDE.md file that instructs the agent what it cannot do is not governance. The agent reads it. The agent can be instructed to override it. The agent's own context window is the attack surface.
As Simon Willison described the OpenClaw architecture, access to private data, exposure to untrusted content, and the ability to act externally form a lethal trifecta. Any governance layer that lives inside that trifecta inherits its vulnerabilities.
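To make "outside the context surface" concrete, here is a minimal Python sketch of the structural idea: the rules live in the harness process as data the model never sees, and the harness consults them before executing any tool call. All names here (`Decision`, `RULES`, `check`) are illustrative, not any real product's API.

```python
# Minimal sketch of policy-as-code held outside the agent's context.
# The agent never reads RULES; the harness evaluates them, and the model
# only ever sees the resulting allow/deny decision.
from dataclasses import dataclass

@dataclass(frozen=True)
class Decision:
    allowed: bool
    reason: str

# Policy data: lives in the harness process, never serialized into a prompt.
RULES = [
    {"tool": "shell", "deny_patterns": ["rm -rf", "curl | sh"]},
]

def check(tool: str, payload: str) -> Decision:
    """Called by the harness before executing any tool call."""
    for rule in RULES:
        if rule["tool"] != tool:
            continue
        for pattern in rule["deny_patterns"]:
            if pattern in payload:
                return Decision(False, f"matched deny pattern {pattern!r}")
    return Decision(True, "no rule violated")
```

Because `check` runs in the harness rather than in the model's context window, a prompt-injected instruction to "ignore previous rules" has nothing to act on: the rules were never in the context to begin with.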

2. Audit trail that is tamper-evident.

Enforcement only matters if the record can be verified outside the system that produced it. A log is not enough. Logs live inside the same system they record. A compromised agent can write to its own logs.
Tamper-evident infrastructure means the record is written to a location the agent cannot reach, with a cryptographic signature that proves it has not been altered. The structure matters more than the platform.
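The structure is simple enough to sketch. Below is a minimal hash-chained log in Python: each entry commits to the hash of the previous entry, so altering any past record invalidates every hash after it. In a real deployment the chain head would be anchored somewhere the agent cannot write; the names here are illustrative.

```python
# Minimal sketch of a hash-chained, tamper-evident audit log.
import hashlib
import json

def append(log: list[dict], event: dict) -> None:
    """Append an event, chaining it to the previous entry's hash."""
    prev = log[-1]["hash"] if log else "0" * 64
    body = json.dumps({"prev": prev, "event": event}, sort_keys=True)
    log.append({
        "prev": prev,
        "event": event,
        "hash": hashlib.sha256(body.encode()).hexdigest(),
    })

def verify(log: list[dict]) -> bool:
    """Recompute the chain from genesis; any edit anywhere breaks it."""
    prev = "0" * 64
    for entry in log:
        body = json.dumps({"prev": prev, "event": entry["event"]}, sort_keys=True)
        if entry["prev"] != prev:
            return False
        if entry["hash"] != hashlib.sha256(body.encode()).hexdigest():
            return False
        prev = entry["hash"]
    return True

log: list[dict] = []
append(log, {"action": "tool_call", "tool": "shell", "allowed": True})
append(log, {"action": "halt", "reason": "ambiguous permission"})
assert verify(log)

log[0]["event"]["allowed"] = False   # tamper with history
assert not verify(log)               # the chain exposes it
```

The cryptography here is ordinary SHA-256; the governance property comes entirely from where the chain head is stored, which is the architectural gap the section above describes.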

3. Fail-closed as the architectural default.

When uncertainty appears, the default cannot be to continue silently. Without policy code precise enough to define uncertain state, and without an audit trail that records halts as well as executions, a fail-closed rule has nothing meaningful to enforce.
The Claude Code incidents pointed in the opposite direction: when uncertainty appeared, the system continued. Governance has to reverse that default.
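Reversing the default is a small amount of code; the hard part is choosing it. A minimal sketch with illustrative names:

```python
# Minimal sketch of a fail-closed gate. Every path that is not a positive
# "safe" classification, including a crashed check, resolves to a halt.
KNOWN_SAFE_TOOLS = {"read_file", "search"}

def classify(tool: str) -> bool:
    # May raise if policy state is inconsistent or unreadable.
    return tool in KNOWN_SAFE_TOOLS

def gate(tool: str) -> str:
    try:
        allowed = classify(tool)
    except Exception:
        return "halt"  # a broken check is uncertainty, not permission
    return "execute" if allowed else "halt"  # the default branch is halt
```

The fail-open version of this gate differs by two lines: it returns "execute" from the except branch and from the default branch. That two-line difference is the entire distinction between the behavior described in this section and the behavior observed in the Claude Code incidents.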
Taken together, these properties define the minimum shape of governance infrastructure rather than a policy layer attached after the fact.

How the Industry Is Responding — and What It Is Missing

The industry response is real. It is not sufficient. Each response addresses the problem from inside the system being governed. That is the constraint that none of them escapes.
Anthropic now advises separating static and dynamic context in system prompts to reduce attack surface. That is correct architectural practice. It partially addresses speed by reducing what an attacker can reach. It does not address independence. Anthropic still defines, operates, and monitors the control. It also does not address the record requirement: there is no tamper-evident audit trail that an external party can verify without access to Anthropic's internal systems.
OpenAI uses an adversarially trained model as an internal red-teamer. The attacker and the defender are both built by OpenAI, share the same deployment pipeline, and serve the same commercial objective. OpenAI acknowledged publicly that prompt injection for browser agents may never be fully solved. That is an honest statement about the speed problem. It is not a governance solution. An adversary inside the system it is testing cannot satisfy the independence requirement by definition.
The Cloud Security Alliance's AI Controls Matrix, released in July 2025, is the most operationally complete public framework available. It maps controls across the AI lifecycle and assigns accountability across providers, orchestrators, and application developers. It addresses the record requirement in principle by specifying what audit artifacts should exist. It does not close the speed gap: framework documents are updated on timelines measured in months, not milliseconds. And it does not enforce independence: compliance is self-reported.
The gap is structural, not accidental. Vendor-internal controls address execution-layer problems faster than external frameworks can. External frameworks establish accountability that internal controls cannot self-certify. Neither alone satisfies all three requirements. The missing piece is what sits between them: independent, machine-speed infrastructure that produces a record neither the vendor nor the framework can alter after the fact.
To be clear: this assessment is based on publicly available documentation. Anthropic, OpenAI, and CSA may operate internal controls not publicly disclosed. What can be evaluated is what is publicly verifiable. On that basis, the three requirements remain unsatisfied as a complete set.

What Is Actually Achievable Now

The four requirements above are not equally mature. Treating them as a single package overstates what is available today and understates what is genuinely hard.

1. Policy-as-Code is achievable now.

The infrastructure exists and has already been proven outside the AI domain. Open Policy Agent, maintained by the CNCF, is the clearest example. It decouples policy decisions from services and enforces rules at sub-millisecond latency across Kubernetes, APIs, and CI/CD pipelines.
The same pattern is now appearing in agent tooling. On April 2, 2026, Microsoft released the Agent Governance Toolkit under MIT license. It integrates with LangChain, CrewAI, LlamaIndex, OpenAI Agents SDK, and LangGraph through native extension points, meaning governance does not require rewriting agent code. NVIDIA also surfaced the same direction at GTC 2026 through OpenShell-style runtime sandboxing.

2. Tamper-evident audit trails are achievable now.

The cryptographic foundations are largely solved. Nitro, published at ACM CCS 2025, is a high-performance tamper-evident audit logging system using eBPF. A formal framework for constant-size cryptographic evidence structures for regulated AI workflows was published on arXiv in November 2025. PunkGo, a Rust sovereignty kernel for verifiable agent execution, pushes the same direction. The gap is not cryptographic. It is architectural.

3. Fail-closed as default is achievable, but only if teams choose it.

This is not a hard engineering problem. It is a priority problem. The patterns already exist. The question is whether teams implement them before an incident forces them to.

4. Drift-aware revalidation is the unsolved frontier.

None of the tools above address semantic drift. They catch what a system does wrong at execution time. They do not catch what a system does correctly today that will be wrong next month because the context around a rule has changed.
The practical implication for teams building now is straightforward:
  • implement Policy-as-Code and tamper-evident audit trails immediately using available tools
  • treat fail-closed as a default that requires explicit opt-out justification, not opt-in configuration
  • on drift, instrument the system to detect state divergence and flag it for human review
  • use manual revalidation if necessary; a quarterly process is imperfect, but still far better than none

The Drift Problem Remains Unsolved

Drift is what happens when a system that was once safe slowly becomes unsafe. Not through a single visible change. Through accumulated small movements that individually look like normal operation.
Three mechanisms produce drift in production agent systems:
  1. Path drift. An approved command writes to a path that was low-sensitivity when the approval was granted. Six months later, a dependency update wires that path into a deployment-critical pipeline. The approval still exists. The risk profile of the action has changed. The governance layer does not know this because it checks the rule, not the current context of what the rule governs.
  2. Approval semantic drift. A rule permits an agent to "summarize and forward email threads to the project channel." When the rule was written, the project channel was internal. After a contractor integration, the same channel now includes external parties. The rule has not changed. The meaning of executing it has.
  3. Model update boundary shift. A model version upgrade changes how the agent interprets ambiguous instructions at the boundary of a permission. Actions that previously triggered a permission request now pass through silently, because the new model resolves the ambiguity differently. The governance logic was calibrated to the previous model's behavior.
The three properties described earlier address the execution layer. They catch what a system does wrong in the moment. Drift operates on a different axis entirely: the action is executed correctly, but what "correct" means has quietly changed.
The Adversa AI bypass is a compressed version of all three. The permission logic was correct when written. A performance optimization changed the execution path. The governance logic was not updated to match. The divergence was invisible until an external party specifically looked for it.
Drift-aware governance requires one additional capability beyond the three properties: periodic re-checking of prior approvals against current system state. Not just "was this action allowed?" but "is the action this approval describes the same action being taken today?" No production system has solved this cleanly. The teams that solve it first will have built something that cannot be replicated by purchasing a governance product, because drift is always domain-specific. No vendor can pre-package the semantic understanding of what a rule means inside a specific deployment.
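To make the re-checking idea concrete, here is a minimal Python sketch with illustrative names: each approval stores a fingerprint of the context it was granted under, and revalidation compares that fingerprint against the live context, flagging divergence for human review rather than treating the approval as permanently valid.

```python
# Minimal sketch of drift-aware revalidation. The rule text never changes;
# what is re-checked is the context the approval was granted under.
import hashlib
import json

def fingerprint(context: dict) -> str:
    """Deterministic digest of the context an approval depends on."""
    body = json.dumps(context, sort_keys=True)
    return hashlib.sha256(body.encode()).hexdigest()

# Approval recorded when the project channel was internal-only.
approval = {
    "rule": "summarize and forward email threads to the project channel",
    "granted_under": fingerprint({"channel_domains": ["corp.example"]}),
}

def revalidate(approval: dict, current_context: dict) -> str:
    if fingerprint(current_context) == approval["granted_under"]:
        return "still_valid"
    return "flag_for_review"  # the rule is unchanged; its meaning may not be

# Contractor integration later adds an external domain to the same channel.
assert revalidate(approval, {"channel_domains": ["corp.example"]}) == "still_valid"
assert revalidate(
    approval,
    {"channel_domains": ["corp.example", "vendor.example"]},
) == "flag_for_review"
```

The hard, domain-specific part is not the comparison but deciding what belongs in the fingerprint: choosing which facts about the deployment an approval actually depends on is exactly the semantic understanding no vendor can pre-package.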

The Actual Answer to the Title

Prompt. RAG. MCP. Agent. Harness.
And what?
Not governance as a product category. Not a framework document. Not a vendor-bundled policy layer.
The answer is governance infrastructure with three properties: structurally independent of the system it governs, operating at machine speed rather than committee speed, and producing a tamper-evident record that cannot be altered by the system being governed.
The organizations that build this will not necessarily be the ones with the best models or the best harnesses.
They will be the ones who understood that a system cannot fully govern itself, that the speed gap between attacks and frameworks is structural rather than temporary, and that "Show us the record" is a demand that will only get louder.

"Show us the record."

That demand is already arriving. From insurers requiring audit evidence before covering AI-assisted workflows. From enterprise procurement teams who lived through March 31. From regulators whose enforcement timelines are converging on 2026 and 2027.
The record either exists or it does not.
Build the infrastructure that produces it before someone else decides what that record should contain.

A practical note

If your team is working on agent governance in practice — not as a policy memo, but as real control logic, audit design, and reviewable implementation — our team at Flamehaven would be glad to help.
No advertising angle.
No commission pitch.
No hype.
Our work is grounded in serious engineering: governed systems, careful code, original control logic, and architecture shaped by sustained implementation work and accumulated operational evidence over the past year.
If you are thinking through governance design, collaboration, or code-level development in this direction, feel free to DM me.
We are open to technical discussion, early collaboration, and helping teams turn governance ideas into working systems.

References

  1. Alex Kim — "The Claude Code Source Leak: fake tools, frustration regexes, undercover mode" — March 31, 2026
  2. Adversa AI — "Critical Vulnerability in Claude Code Emerges Days After Source Leak" — April 2026
  3. CyberScoop — "OpenAI says prompt injection may never be solved for browser agents like Atlas" — December 2025
  4. Nasr, Carlini, Sitawarin et al. — "The Attacker Moves Second: Stronger Adaptive Attacks Bypass Defenses Against LLM Jailbreaks and Prompt Injections" — October 2025
  5. Simon Willison — "The lethal trifecta for AI agents" — June 16, 2025
  6. Anthropic — "Reduce prompt leak" — 2026
  7. Cloud Security Alliance — "AI Controls Matrix (AICM)" — released July 10, 2025
  8. Open Policy Agent — CNCF Project — ongoing
  9. Microsoft — "Introducing the Agent Governance Toolkit" — April 2, 2026
  10. NVIDIA — "How Autonomous AI Agents Become Secure by Design With NVIDIA OpenShell" — March 23, 2026
  11. Zhao et al. — "Rethinking Tamper-Evident Logging: A High-Performance, Co-Designed Auditing System" — ACM CCS 2025
  12. Kao et al. — "Constant-Size Cryptographic Evidence Structures for Regulated AI Workflows" — November 2025
  13. Kell et al. — "Right to History: A Sovereignty Kernel for Verifiable AI Agent Execution" — February 2026
  14. Fabian et al. — "A Systematic Taxonomy of Security Vulnerabilities in the OpenClaw AI Agent Framework" — 2026

Flamehaven Research publishes under flamehaven.space, Substack, and dev.to (flamehaven01).
This piece is part of a series on verifiable AI governance infrastructure.