Why context files fail, what works instead, and how Atelier's memory should actually work. Companion to
problem-space.md (Phase 9, 10) and strategic-insights.md (Shift 3, Spec-Driven + Memory Convergence).
Last updated: 2026-04-13
## The Context Engineering Problem
Context engineering — the discipline of designing the full information environment an AI model operates within — is now infrastructure, not conversation. 60,000+ GitHub repositories include agent instruction files. But the evidence shows most of these files are making agents worse, not better.
## The ETH Zurich Findings (AGENTbench, 2026)
The most rigorous study on context files tested 138 real-world coding tasks across 12 Python repositories with Claude Code, Codex, and Qwen Code.
Three conditions tested:
- → No context file
- → LLM-generated context (auto-generated via /init commands)
- → Human-written context (manually curated)
Results:
| Condition | Task Success Impact | Inference Cost Impact |
|-----------|---------------------|-----------------------|
| No context file | Baseline | Baseline |
| LLM-generated | -0.5 to -2% (WORSE) | +20-23% (more expensive) |
| Human-written | +4% (marginal improvement) | +19% (still more expensive) |
The counter-intuitive finding: Auto-generated context files perform worse than having no file at all. They duplicate documentation agents can already access independently, adding cost without signal value.
Why LLM-generated context hurts:
- → Duplicates existing documentation agents find on their own
- → Adds 2-4 additional reasoning steps per task (more tokens, more chances for distraction)
- → Architecture overviews "do not provide effective overviews" — removing them while keeping commands and constraints produces identical behavior at lower cost
- → The "lost in the middle" phenomenon: agents ignore instructions during extended sessions when files are long
## The Context Decay Problem
The most vivid description from research: "I was feeding it a map of a city that didn't exist anymore."
What happens over time:
- → Instructions accumulate from debugging sessions
- → Contradictory patches pile up (rule A says X, rule B says not-X)
- → Relative dates lose meaning ("by next Thursday" → when?)
- → Structural references become stale after refactors
- → Abandoned approaches linger as active instructions
- → Tool-specific file proliferation (CLAUDE.md, .cursorrules, copilot-instructions.md, Gemini.md) drift apart
The decay timeline:
- → Week 1-2: Context file is accurate and helpful
- → Month 1: Some instructions are slightly stale
- → Month 3: The file is a mix of current and deprecated guidance. The "month-3 problem" again.
- → Month 6: The file is actively harmful — directing agents toward patterns that no longer exist
No automated staleness detection exists. This is manual maintenance, and developers don't maintain documentation.
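Even without a real detector, a trivial heuristic linter illustrates what staleness detection could look like. In this sketch, the function name and regexes are hypothetical and illustrative, not exhaustive; it flags two decay signals named above: relative dates and references to files that no longer exist.

```python
import re
from pathlib import Path

# Hypothetical staleness linter; both patterns are illustrative only.
RELATIVE_DATE = re.compile(
    r"\b(next|last)\s+(week|month|monday|tuesday|wednesday|thursday|friday)\b",
    re.IGNORECASE,
)
PATH_REF = re.compile(r"`([\w./-]+\.(?:py|ts|tsx|md|json))`")

def staleness_warnings(context_file: Path, repo_root: Path) -> list[str]:
    """Flag relative dates and backtick-quoted paths that no longer exist."""
    warnings: list[str] = []
    for lineno, line in enumerate(context_file.read_text().splitlines(), start=1):
        if RELATIVE_DATE.search(line):
            warnings.append(f"line {lineno}: relative date loses meaning over time")
        for ref in PATH_REF.findall(line):
            if not (repo_root / ref).exists():
                warnings.append(f"line {lineno}: references missing file {ref}")
    return warnings
```

Run periodically (or in CI), this catches the "map of a city that didn't exist anymore" drift before it misleads an agent.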
## What Should Be in a Context File
The research consensus (Augment, HumanLayer, Martin Fowler) is converging:
### Include (non-inferable details ONLY):
- → Custom build commands not in documentation
- → Non-standard tooling choices with explanations
- → Counterintuitive architectural decisions (with WHY)
- → Exact version requirements and constraints
- → Permission boundaries (always/ask-first/never)
### Exclude (agents find these independently):
- → Architecture overviews
- → Content already in README or existing docs
- → Standard framework conventions
- → Edge-case instructions rarely applicable
### Optimal length: Under 150 lines
- → Start under 150 lines
- → Split into subdirectories when exceeding 200 lines (root AGENTS.md → apps/web/AGENTS.md → apps/api/AGENTS.md)
- → Agents read the file closest to the file being edited; deeper files take precedence
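The closest-file rule can be made concrete. A minimal sketch (the function name is an assumption) that walks from the edited file's directory up to the repo root, returning the chain of AGENTS.md files root-first, so the deepest, highest-precedence file comes last:

```python
from pathlib import Path

def agents_chain(edited_file: Path, repo_root: Path) -> list[Path]:
    """Collect every AGENTS.md between the edited file and the repo root.
    Returned root-first: later (deeper) entries override earlier ones."""
    chain: list[Path] = []
    directory = edited_file.parent
    while True:
        candidate = directory / "AGENTS.md"
        if candidate.is_file():
            chain.append(candidate)
        # Stop at the repo root (or the filesystem root, as a safety guard).
        if directory == repo_root or directory == directory.parent:
            break
        directory = directory.parent
    return list(reversed(chain))
```

An agent would then apply the files in order, letting apps/web/AGENTS.md override the root file for edits under apps/web/.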
### The symlink solution for tool fragmentation:
CLAUDE.md, .cursorrules, copilot-instructions.md should all be symlinks to one canonical AGENTS.md. Same file, one source of truth.
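A sketch of that setup in Python (the alias list, and placing the Copilot file under .github/, are assumptions; adjust to your repo's layout):

```python
import os
from pathlib import Path

# Tool-specific filenames that should all resolve to one canonical file.
ALIASES = ["CLAUDE.md", ".cursorrules", ".github/copilot-instructions.md"]

def link_context_files(repo: Path) -> None:
    canonical = repo / "AGENTS.md"
    for alias in ALIASES:
        target = repo / alias
        target.parent.mkdir(parents=True, exist_ok=True)
        if target.is_symlink() or target.exists():
            target.unlink()  # replace any stale standalone copy with a link
        # Relative link, so the repo survives being moved or cloned elsewhere.
        target.symlink_to(os.path.relpath(canonical, target.parent))
```

Editing AGENTS.md now updates every tool's view at once; there is nothing to keep in sync.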
## What This Means for Atelier's Memory System
The existing Atelier concept: "Persistent, human-curated project memory that flows from research → design → build → test."
The research sharpens this into specific design requirements:
### Requirement 1: Memory Must Be Structured, Not Accumulated
Static context files fail because they accumulate. Atelier's memory must be structured by phase and card type, not appended chronologically.
What this looks like:
- → Research card completes → produces structured research findings (sources, synthesis, key decisions)
- → Design card completes → produces design spec (component list, tokens, interaction specs)
- → Build card completes → produces implementation notes (decisions made, edge cases found, patterns used)
- → Each card's output IS its memory contribution — structured, bounded, typed
This is the "spec-as-memory" insight from strategic-insights.md, now validated by the context engineering research: minimal, focused, phase-specific context outperforms generic accumulated context.
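A sketch of what "structured, bounded, typed" could mean concretely; the class and field names are assumptions drawn from the card outputs listed above:

```python
from dataclasses import dataclass

@dataclass
class ResearchMemory:          # output of a completed research card
    sources: list[str]
    synthesis: str
    key_decisions: list[str]

@dataclass
class DesignMemory:            # output of a completed design card
    components: list[str]
    tokens: dict[str, str]     # e.g. {"color.primary": "#1a1a2e"}
    interaction_specs: list[str]

@dataclass
class BuildMemory:             # output of a completed build card
    decisions: list[str]
    edge_cases: list[str]
    patterns: list[str]
```

Because each card type has a fixed shape, memory cannot silently grow a chronological tail: a card contributes its typed output and nothing else.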
### Requirement 2: Memory Must Be Editable, Not Auto-Generated
The ETH Zurich finding is definitive: auto-generated context hurts. Atelier's memory must be:
- → Auto-suggested by agents (propose facts for promotion to project memory)
- → Human-curated (user decides what's worth keeping)
- → Minimal by default (less is more — every entry must justify its existence)
The Windsurf approach (implicit learning over 48 hours) is the wrong model. The Auto Dream approach (automated consolidation) is better but still needs human editing.
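One way to encode "agents propose, humans decide, minimal by default" in code; the names and the hard cap are illustrative assumptions, not a spec:

```python
from dataclasses import dataclass
from typing import Callable

MAX_ENTRIES = 50  # assumed hard cap: every entry must justify its existence

@dataclass(frozen=True)
class ProposedFact:
    text: str
    source_card: str

def curate(existing: list[ProposedFact],
           proposals: list[ProposedFact],
           approve: Callable[[ProposedFact], bool]) -> list[ProposedFact]:
    """Agents propose; only human-approved facts enter project memory,
    and the total stays under a deliberate cap."""
    accepted = [p for p in proposals if approve(p)]
    if len(existing) + len(accepted) > MAX_ENTRIES:
        raise ValueError("memory at capacity: retire an entry first")
    return existing + accepted
```

The cap is the point: it forces the retirement conversation that static CLAUDE.md files never have.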
### Requirement 3: Memory Must Be Card-Scoped, Not Project-Scoped
A giant project memory file hits the same problems as a giant CLAUDE.md. Instead:
- → Each card has its own memory (the output/spec it produces)
- → Project memory = the union of active card memories, filtered by relevance to the current task
- → An agent working on a backend card gets: backend-relevant research findings + design spec + relevant prior build decisions. NOT: everything ever recorded about the project.
This solves the "lost in the middle" problem — agents ignore instructions in long context. Short, relevant, card-scoped context stays effective.
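Card-scoped retrieval can be sketched in a few lines; the entry shape and scope tags here are assumptions:

```python
def context_for(card_type: str, dependency_ids: set[str],
                project_memory: list[dict]) -> list[dict]:
    """Return only entries matching this card's type or its declared
    dependencies; never the whole project memory."""
    return [entry for entry in project_memory
            if entry["scope"] == card_type or entry["card_id"] in dependency_ids]
```

A backend card with a declared dependency on one design card gets exactly those slices, keeping the injected context short enough to stay effective.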
### Requirement 4: Memory Must Have Explicit Decay
The "context engineering decay" problem demands that memory has a lifecycle:
- → Active — currently informing agent work
- → Archived — no longer actively loaded but retrievable
- → Deprecated — explicitly marked as superseded (with reference to what replaced it)
- → Deleted — removed from the system
Without explicit decay, memory accumulates into the same mess that kills CLAUDE.md files.
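The lifecycle above, written as an explicit state machine. The transition rules are an assumption (for instance, that deleted entries cannot be resurrected and deprecated entries can only be deleted):

```python
from enum import Enum

class MemoryState(Enum):
    ACTIVE = "active"
    ARCHIVED = "archived"
    DEPRECATED = "deprecated"
    DELETED = "deleted"

ALLOWED = {
    MemoryState.ACTIVE:     {MemoryState.ARCHIVED, MemoryState.DEPRECATED, MemoryState.DELETED},
    MemoryState.ARCHIVED:   {MemoryState.ACTIVE, MemoryState.DEPRECATED, MemoryState.DELETED},
    MemoryState.DEPRECATED: {MemoryState.DELETED},   # superseded stays superseded
    MemoryState.DELETED:    set(),                   # no resurrection
}

def transition(current: MemoryState, target: MemoryState) -> MemoryState:
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```

Making the states first-class means the system can enforce review ("this entry has been active for 90 days, keep or archive?") instead of relying on manual grooming.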
### Requirement 5: Memory Must Be Export-Ready
Solo founders use 5-8 tools. Atelier's memory needs to flow outward:
- → Export as CLAUDE.md / AGENTS.md (for use in Claude Code outside Atelier)
- → Export as .cursorrules (for Cursor sessions)
- → Export as structured JSON (for MCP servers)
- → Import from existing context files (onboarding from current workflow)
The memory system is only valuable if it's the canonical source of truth that feeds all other tools.
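A sketch of the export surface; the entry shape and renderer names are assumptions, and a real exporter would also carry the "why" annotations alongside each fact:

```python
import json

def to_agents_md(entries: list[dict]) -> str:
    """Render curated memory as a minimal AGENTS.md: non-inferable facts only."""
    lines = ["# AGENTS.md", ""]
    lines += [f"- {entry['text']}" for entry in entries]
    return "\n".join(lines) + "\n"

def to_mcp_json(entries: list[dict]) -> str:
    """Structured form for MCP servers or other programmatic consumers."""
    return json.dumps({"memory": entries}, indent=2)
```

The same curated entries feed both renderers, which is what makes the memory canonical rather than one more file to sync.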
## The Existing Memory Landscape
### Static Context Files (CLAUDE.md, .cursorrules, AGENTS.md)
- → Strengths: Simple, version-controlled, tool-agnostic
- → Weaknesses: Manual maintenance, context decay, tool fragmentation, "lost in the middle" for long files
- → Best for: Stable project conventions that rarely change
### Auto Dream (Claude Code, March 2026)
- → How it works: Automated memory consolidation modeled after REM sleep. Runs after sessions, extracts key decisions and patterns.
- → Strengths: Automatic, no manual effort
- → Weaknesses: By session 15-20, accumulated memory is "a mess — stale entries, contradictory instructions, relative dates that no longer made sense"
- → Missing: Human curation, explicit decay, phase-awareness
### Windsurf Memories
- → How it works: Implicit learning over 48 hours. The tool decides what to remember.
- → Strengths: Zero-effort
- → Weaknesses: Implicit = user can't curate what matters. Black box. No phase-awareness.
### Kiro Living Specs
- → How it works: Specs that agents update as they work. "When an agent completes work, the spec updates to reflect reality."
- → Strengths: Addresses staleness automatically. Specs stay synced with code.
- → Weaknesses: Only covers the build phase. Requirements → design → tasks. No research memory, no cross-phase flow.
### agentmemory (Open Source)
- → Architecture: Vector store + knowledge graph + semantic memory
- → Strengths: Technical foundation is strong. Multiple memory types.
- → Weaknesses: Developer tool, not user-facing. No curation UX. No phase-awareness.
### Replit Decision-Time Guidance (April 2026)
How it works: Instead of front-loading all context into the prompt, a lightweight classifier analyzes the current trajectory (user messages, tool results, error patterns) and injects only the relevant guidance from a micro-instruction bank. Ephemeral reminders appear at decision points and disappear after, never persisting in conversation history.
Why it matters:
- → Addresses the "lost in the middle" problem — guidance appears only when relevant
- → False positives are safe — reminders function as suggestions, not constraints
- → Prompt caching preserved — core prompt stays static, achieving 90% cost reduction vs dynamic prompt modification
- → Scales from 4-5 static reminders to hundreds of contextually relevant interventions
- → Consultation triggers detect doom loops and invoke external agents with fresh plans (model switching)
Atelier connection: This is the technical validation of Atelier's card-scoped context injection idea. Instead of giving agents the entire project memory, you give them the relevant slice at the relevant moment. Replit proves this works better than front-loading. Atelier's card types ARE the classifier — a research card injects research-relevant context, a backend card injects backend-relevant context.
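The mechanism can be sketched in a few lines. The trigger keywords and instruction bank below are invented for illustration; Replit's actual classifier is a trained model, not keyword matching:

```python
# Micro-instruction bank: trigger -> ephemeral reminder (illustrative entries).
GUIDANCE_BANK = {
    "ModuleNotFoundError": "Check the active virtualenv before adding a dependency.",
    "migration": "Snapshot the database before running schema migrations.",
    "doom loop": "Stop retrying; consult a fresh agent with a new plan.",
}

def inject_guidance(trajectory: list[str], window: int = 5) -> list[str]:
    """Scan only the recent trajectory and return matching reminders.
    Reminders are ephemeral: they are never written back into history."""
    recent = " ".join(trajectory[-window:])
    return [tip for trigger, tip in GUIDANCE_BANK.items() if trigger in recent]
```

Because the reminders are computed from the trajectory rather than stored in it, the core prompt stays static and cacheable.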
### What Nobody Has Built (Atelier's Opportunity)
- → Phase-aware memory — memory structured by development phase (research → design → build → test) rather than accumulated chronologically
- → Card-scoped context injection — each task gets only the memory relevant to its type and dependencies
- → Human-curated with auto-suggest — agents propose, humans decide, system enforces minimalism
- → Explicit memory lifecycle — active → archived → deprecated → deleted, with forced review
- → Cross-tool export — memory as the canonical source that feeds CLAUDE.md, .cursorrules, MCP servers
## The Figma Taste Gap Data (Supporting the Design Moat)
From Figma's 2025 AI Report — data that validates why Atelier's design-first approach matters:
### The 31-Point Efficiency-Quality Gap
| Metric | Designers | Developers |
|--------|-----------|------------|
| "AI makes me more efficient" | 78% | 82% |
| "AI makes me better at my role" | 47% | 68% |
| "I trust AI output" | 32% | — |
| "AI improves work quality" | 40% | 68% |
The gap: 78% of designers say AI makes them more efficient, but only 47% say it makes them better. Efficiency ≠ quality. This is the taste gap in data form.
### Why Designers Trust AI Less
Design work is fundamentally more subjective than development. Developers can run tests to verify AI-generated code works. Designers must evaluate if a layout "feels right" or if copy "sounds on-brand" — harder to verify, harder to trust, harder to defend to stakeholders.
### The Strategic Implications
- → 51% of Figma teams building agentic AI (doubled from 21% in 2025)
- → 52% of designers report design becomes MORE important for AI products
- → 75% of successful AI products had tight design-development collaboration
- → 85% of designers consider AI skills essential to future success
- → Taste and intuition are increasingly important in hiring as AI raises the floor
What this means for Atelier: "AI has raised the floor, making it easier to produce 'pretty good' work. But the ceiling — work that resonates, differentiates, and endures — remains human." Atelier positions itself as the tool for people who care about the ceiling, not just the floor. The design taste moat isn't aesthetic — it's functional. It's the difference between 47% and 68% confidence in output quality.
## The Cost of Context
At scale, context engineering has real cost implications:
| Scale | Monthly Overhead (context files) |
|-------|----------------------------------|
| 1,000 tasks | ~$45 |
| 10,000 tasks | ~$450 |
| 100,000 tasks | ~$4,500 |
Mitigation: Prompt caching reduces context cost by 90% for repeated reads. Atelier's card-scoped context (smaller, more focused) is inherently cheaper than project-wide context injection.
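The table's implied per-task overhead, plus the caching mitigation, as a worked calculation. The flat $0.045-per-task rate is inferred from the 1,000-task row, and the function name is an assumption:

```python
OVERHEAD_PER_TASK = 45 / 1_000  # dollars per task, from the 1,000-task row

def monthly_overhead(tasks: int, cache_hit_rate: float = 0.0) -> float:
    """Context-file overhead per month, with prompt caching removing
    90% of the cost on cache hits."""
    per_task = OVERHEAD_PER_TASK * (1 - 0.9 * cache_hit_rate)
    return tasks * per_task
```

`monthly_overhead(100_000)` reproduces the ~$4,500 row; with every read cached, the same workload costs roughly a tenth of that.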
## Sources
### Context Engineering
- → Your CLAUDE.md Is Making Your Agent Dumber — Medium
- → How to Build AGENTS.md — Augment
- → Writing a Good CLAUDE.md — HumanLayer
- → Context Engineering for Coding Agents — Martin Fowler
- → AI Agent Memory 2026 — DEV
- → Context Engineering Tools — Anthropic Cookbook
- → Decision-Time Guidance — Replit Blog