Flamehaven.space
My LLM Kept Forgetting My Project. So I Built a Governance Schema.

Session loss isn't a UX inconvenience — it's a structural failure with compounding consequences for long-running AI projects. This post defines the problem precisely and introduces MICA, a governance schema for AI context management.

Glossary: terms used in this article

🔸 MICA (Memory Invocation & Context Archive): A governance schema for AI context management. Defines how context should be structured, trusted, scored, and handed off across sessions.
🔸 Session Loss: The architectural characteristic of LLMs where no information persists between independent conversations. Not a bug. A design property with real engineering consequences for long-running projects.
🔸 Invoke Role: A governance label that defines a context item's eviction behavior. In this series: anchor (never evict), bridge (preserve across phases), hint (drop first under pressure), none (drop immediately).
🔸 Trust Class: The reliability classification of a context item's source. In this series: canonical (repo truth), distilled (summarized from sessions), raw (unprocessed session output), symbolic (reference only).
🔸 Anchor Item: A context item that cannot be dropped under any memory pressure. Eviction priority: 0. Later parts explain how this is enforced at the schema level.
🔸 Semantic Collapse: A pattern introduced in later parts of this series. A JSON Schema is applied to an LLM as a runtime contract rather than as a validator.
🔸 Fail-Closed Gate: An admission rule that excludes a context item if it fails any defined threshold. No exceptions. Formalized in v0.1.7.
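The glossary terms compose into a single record shape. A minimal sketch in Python of how invoke roles, trust classes, and a fail-closed gate fit together. Field names and the `min_score` threshold are illustrative assumptions, not the normative schema, which later parts introduce:

```python
from dataclasses import dataclass

# Eviction priority per invoke role: lower values are evicted later.
# anchor = 0 means "never evict" (enforcement is the store's job, not shown here).
EVICTION_PRIORITY = {"anchor": 0, "bridge": 1, "hint": 2, "none": 3}

TRUST_CLASSES = {"canonical", "distilled", "raw", "symbolic"}

@dataclass
class ContextItem:
    content: str
    invoke_role: str   # anchor | bridge | hint | none
    trust_class: str   # canonical | distilled | raw | symbolic
    score: float       # relevance score in [0, 1]

def admit(item: ContextItem, min_score: float = 0.5) -> bool:
    """Fail-closed gate: any single failed check excludes the item. No exceptions."""
    if item.invoke_role not in EVICTION_PRIORITY:
        return False
    if item.trust_class not in TRUST_CLASSES:
        return False
    if item.score < min_score:
        return False
    return True
```

The point of the fail-closed shape is that there is no override path: an item with an unknown role or an unscored source never enters the context, regardless of how relevant it looks.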

1. The Problem

The Structural Flaw
LLMs do not retain state between sessions. Every conversation starts from zero.
For a single task, that is manageable.
For a project maintained across dozens of sessions, it is not.
Such a project accumulates architectural decisions, non-negotiable constraints, protected files, and a decision history explaining why things are the way they are. In that setting, the lack of continuity becomes a structural failure with compounding consequences.
The failure mode is insidious. The model does not fail visibly. It produces well-formed, internally consistent output. What it cannot know is which of its inferences are wrong, because the context it received was incomplete. It cannot identify what is missing. Instead, it fills the gaps silently, drawing on training data rather than on the project's actual history.
Standard responses address parts of the problem. Longer system prompts, RAG pipelines, session summaries. None addresses the governance layer: which context items are authoritative, which are provisional, how they should be weighted against one another, and what happens when memory pressure forces eviction.
This post documents the specification I built to address that layer.

2. Where This Started

For a long time, my workaround was simple and inefficient. Copy a conversation from one AI, paste the whole thing into another, and ask the second one to analyze it. The idea was to get a fresh perspective. To avoid the first model's blind spots by bringing the work to a different context.
It rarely worked cleanly. The second model would inherit the framing, the vocabulary, even the confident wrong assumptions of the first session. It would pick up the conclusion without seeing the reasoning errors that led to it, then build on both the conclusion and the errors. This pattern is well-documented in multi-model workflows. The receiving model anchors to the original session's assumptions rather than evaluating the content independently. Researchers studying AI review and self-critique systems have noted the same anchoring effect. Same-session review produces systematically worse error detection than fresh-session review.
So I kept looking for a better way to carry context across sessions.
At some point, I came across Claude's memory feature. Not while looking for it specifically. While exploring settings. Claude had introduced persistent memory for paid plans in October 2025, with the stated goal that users shouldn't have to re-explain their context at the start of every session. Anthropic's own description: "your first conversation feels like your hundredth." The export feature produces structured output. Categories, dates, entries, one per line.
Conceptually, this is similar to Claude's Skills system: a modular, reusable layer that carries working state between sessions without rebuilding it from scratch each time. The intent is sound.
I used it. It helped. But something was consistently off.
The output looked like this:
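The actual export is not reproduced here; the following is a representative reconstruction of the shape, one entry per line with category and date. The entries themselves are hypothetical:

```text
[preferences] 2025-10-14  Prefers concise answers with code examples
[project]     2025-10-16  Patient data must never leave the local environment
[ideas]       2025-10-18  Considering a graph-based retrieval layer
[project]     2025-10-21  All model outputs require clinician review before use
```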
Four entries. Same format. Same weight. But they are not the same kind of thing. The second and fourth are hard constraints. In a medical AI context, violating them is not a quality issue. It is a safety issue. The first is a style preference. The third is an open question that may be abandoned next week. The export format has no mechanism to express that difference.
Others in the community have hit the same wall in different ways. From r/ClaudeAI:
  • A custom MCP memory server still caused the model to skip stored context ~40% of the time
  • "Notes are lossy. They capture what I thought was important, not what Claude actually found important."
The pattern is consistent. The flat format puts the entire prioritization burden on the user.

The Governance Gap

Storage is not Governance
Anthropic labels the memory import feature experimental, with the explicit caveat that Claude may not always successfully incorporate imported memories. The export is a snapshot. What it cannot express is governance: which items are anchors, which are provisional, what survives memory pressure, and what the source of each item actually was.
At the same time, I had been building AI-SLOP-Detector. A static analysis tool for detecting low-quality AI-generated code. Its core premise: not all signals are equal. Some patterns are weighted more heavily. Some trigger hard blocks regardless of aggregate score. The scoring model is explicit, versioned, and auditable.
The same structure was missing from context management. A flat export has no weights, no eviction rules, no provenance, no trust hierarchy. It is a list. A governance schema is something different.
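The contrast in concrete terms. The governed record below uses illustrative field names and a hypothetical `source` path; it is not the final schema:

```text
Flat export entry — a bare string, no metadata:

  "Patient data must never leave the local environment"

Governed item — the same content plus explicit governance metadata:

  {
    "content": "Patient data must never leave the local environment",
    "invoke_role": "anchor",
    "trust_class": "canonical",
    "source": "docs/constraints.md",
    "score": 0.97
  }
```

Only the second form can answer the questions that matter under memory pressure: can this be dropped, where did it come from, and how much should it weigh against a conflicting item.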
That gap is where MICA (Memory Invocation & Context Archive) started.

3. The Structural Problem with LLM Context

The Danger of Silent Regression
Language models do not retain information between sessions. This is a known architectural property, not an implementation gap. Every new conversation begins from zero.
For short, self-contained tasks, this is manageable. For long-running engineering projects, it is not. Codebases under active development. Governance systems with accumulated decision history. Architectures with non-negotiable invariants. Each creates a specific and compounding failure mode.
The failure is not that the model performs poorly. It is that the model performs well. Confidently. Using whatever context it was given. That context is almost always incomplete in ways the model cannot detect.
Consider what a project accumulates over months of development:
| Type of knowledge | Survives session loss? | Consequence of loss |
| --- | --- | --- |
| Current file contents | Partially (if re-provided) | Recoverable |
| Architecture decisions and rationale | No | Silent regression risk |
| Constraints that are non-negotiable | No | Violated without awareness |
| Solutions already tried and abandoned | No | Repeated work, repeated failures |
| Trust levels of different information sources | No | All context treated equally |
The model cannot distinguish between a constraint that is load-bearing and one that is provisional. It does not know which decisions have downstream dependencies. It cannot identify that a suggested refactoring was already evaluated and rejected for a reason not present in the current context.
The result: the model fills gaps with plausible inference. Inference drawn from training data, not from project history.

4. Why the Standard Fixes Don't Fully Work

The Illusion of Current Fixes
Every developer running a long project with an LLM eventually hits the wall and builds something to address it. The community has been running these experiments for a while now.
The diagram above shows the pattern. Each approach handles one or two dimensions. None handles governance.
  • Long system prompts are the first instinct. Write everything down, start every session with it. It works until maintaining the prompt becomes a second job. One developer described it precisely: "the scaffolding becomes the work." There is also a structural issue no token count fixes: the "lost in the middle" effect. Models attend well to the beginning and end of long contexts and degrade in between. An architecture constraint buried in paragraph 11 of a 3,000-token prompt may not receive adequate attention regardless of whether it fits in the window.
  • RAG pipelines handle knowledge injection well. They do not handle governance. Retrieving a paragraph about how a caching layer works is different from the model understanding that this specific caching layer cannot be modified without a deviation log entry. RAG provides facts. It does not provide the weight of facts.
  • Session summaries are a reasonable manual workaround and the one I used most. The problem: summarization is lossy by design.
    • Summaries strip the rationale behind decisions. The model gets the conclusion without the reasoning that produced it. When a new edge case appears that challenges an earlier decision, the model has no basis to evaluate whether the constraint still holds. It just follows the summary.
  • Flat memory exports — including Claude's own export format — do carry context across sessions. That part works. What they cannot express is priority. Which items are non-negotiable. Which are provisional. Which are stale. The burden of sorting that out falls entirely on the user, every time.
The common limitation across all four: they treat context as a document to be read. None treats it as a governed structure where different items have different trust levels, different eviction protection, and different behavior when they conflict.
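What "different eviction protection" means in practice can be sketched with a role-based eviction pass. This is an illustrative sketch, not MICA's normative algorithm; it assumes the role priorities defined in the glossary and a simple oldest-first tiebreak:

```python
# Role → eviction priority: higher values are dropped first.
# anchor = 0: never dropped, regardless of memory pressure.
PRIORITY = {"anchor": 0, "bridge": 1, "hint": 2, "none": 3}

def evict_to_budget(items, token_budget):
    """Drop items until the total token count fits the budget.

    items: list of (content, invoke_role, token_count) tuples.
    Evicts 'none' first, then 'hint', then 'bridge'; within a role,
    drops in list order (oldest first). Anchors are never dropped,
    even if the budget is still exceeded afterward.
    """
    kept = list(items)
    for role in ("none", "hint", "bridge"):
        while sum(tokens for _, _, tokens in kept) > token_budget:
            candidates = [item for item in kept if item[1] == role]
            if not candidates:
                break  # nothing left at this tier; move to the next
            kept.remove(candidates[0])
    return kept
```

The asymmetry is the whole point: a flat document gives every line the same survival odds, while a governed structure makes "this constraint outlives everything else in the window" a property the eviction logic enforces, not something the user restates each session.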

5. Looking for an Answer

The question was whether anyone had already solved this at the specification level.
The research on LLM memory management is substantial. MemGPT (Packer et al., 2023) introduced the OS analogy: main context as RAM, external storage as disk. Later work extended this into hierarchical architectures and tiered storage models.
A 2025 survey (arXiv:2509.18868) maps the landscape and documents a problem the field had started to name: self-degradation. Naive "add everything" strategies cause memory inflation. Inflation leads to error propagation. The agent performs worse over time. Not better.
The direction is clear: memory needs to be tiered, decayed, and managed. Not accumulated.
What I could not find was a specification for the governance layer. Not for a multi-agent research system. For a single developer maintaining a real project across sessions.
  • How to express that some items are non-negotiable anchors and others are provisional notes.
  • How to define eviction behavior a conforming implementation must follow. How to require provenance.
  • How to version and audit context state the way you version and audit code.
The existing systems were engines. What was missing was a schema.
That is what v0.0.1 attempted. The first version was rough. It got several things wrong. Part 2 covers what those failures revealed.

6. What This Series Covers

Introducing MICA
That is the problem. MICA is the attempt at a specification-level answer to it.
This is Part 1. It documents the problem and the motivation.
Part 2: v0.0.1. The first schema version, what it defined, what it got wrong, and why those failures revealed exactly the problem this series is trying to solve.
The series runs as long as there is something concrete to document.

Named failure mode from this post: the governance gap. Context systems that store what you said, but not what it meant, where it came from, or how much it mattered.

MICA is part of the Flamehaven governance-first AI systems practice. Schema, technical report, and production instance: flamehaven.space. Open-source tooling: AI-SLOP-Detector. All schema references follow the v0.1.7 Universal standard unless a specific earlier version is named.
