Corpus Map
The Corpus Map is Guildhall’s compact index of a project. It helps agents reuse the right code, component, helper, package, test pattern, and convention without dumping the entire repo into every prompt.
It is Guildhall’s answer to a common agent failure: the worker sees one local file, invents a local solution, and misses the shared abstraction that already exists somewhere else.
What it stores
Guildhall writes the map under ./memory/:
| File | Purpose |
|---|---|
| ./memory/codebase-map.yaml | Current compact map. |
| ./memory/codebase-map.history.jsonl | Refresh history and why each refresh ran. |
| ./memory/codebase-map.stale.json | Last refresh failure, if the map could not be rebuilt. |
| ./memory/codebase-map.overrides.yaml | Human or learned corrections layered over automatic discovery. |
| ./memory/design-system.yaml | Optional project design-system source summarized into the map. |
The map contains:
- file fingerprints: path, size, modified time, SHA-256
- language and file kind
- exported symbols and imports
- short file summaries
- owned areas such as runtime, web UI, agents, tools, docs, and config
- canonical files for each area
- known abstractions such as shared UI controls or runtime helpers
- design-system token counts, primitives, component files, maturity, and reuse recommendations when a project design system exists
- suggested verification commands
It does not store full source contents. Helpers still open source files when they need evidence.
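As a rough illustration of the fingerprint fields listed above, the sketch below hashes a file and records its path, size, modified time, and SHA-256. The `FileFingerprint` class and `fingerprint_file` function are hypothetical names chosen for this example, not Guildhall's actual types.

```python
import hashlib
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class FileFingerprint:
    """Hypothetical shape for the fingerprint fields of one map entry."""
    path: str
    size: int
    modified_time: float
    sha256: str


def fingerprint_file(path: str) -> FileFingerprint:
    """Hash the file in chunks so large files are never fully loaded."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    stat = os.stat(path)
    return FileFingerprint(path, stat.st_size, stat.st_mtime, digest.hexdigest())
```

Only the digest and metadata are kept, which matches the point that the map never stores full source contents.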
How Guildhall builds it
The builder starts with a Git-aware file list:
- Prefer git ls-files --cached --others --exclude-standard.
- Fall back to a recursive walk when Git is not available.
- Skip generated, binary, dependency, and noisy memory paths.
- Skip command-shaped path fragments that may appear in agent notes or checkpoint metadata.
- Fingerprint text files and classify them by path and extension.
- Extract lightweight symbols and imports.
- Group files into areas.
- Detect reusable abstractions.
- Summarize the project design system when ./memory/design-system.yaml exists.
- Apply any overrides.
- Save the map and append a history event.
The first refresh is a full build. Later refreshes can be partial.
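The file-listing steps at the top of the list above might look like the following sketch. It runs the documented `git ls-files` command and falls back to a directory walk when Git is missing or the directory is not a repository. The function name and the `SKIP_DIRS` set are assumptions for illustration; the real exclusion rules are not spelled out here.

```python
import os
import subprocess

# Assumed skip list for illustration only; the real exclusions may differ.
SKIP_DIRS = {".git", "node_modules", "dist", "build"}


def list_project_files(root: str) -> list[str]:
    """Return project-relative paths, preferring Git's view of the tree."""
    try:
        result = subprocess.run(
            ["git", "ls-files", "--cached", "--others", "--exclude-standard"],
            cwd=root, capture_output=True, text=True, check=True,
        )
        return sorted(line for line in result.stdout.splitlines() if line)
    except (OSError, subprocess.CalledProcessError):
        pass  # Git missing or not a repository: fall back to a plain walk.

    found = []
    for current, dirs, files in os.walk(root):
        dirs[:] = [d for d in dirs if d not in SKIP_DIRS]  # prune in place
        for name in files:
            rel = os.path.relpath(os.path.join(current, name), root)
            found.append(rel.replace(os.sep, "/"))
    return sorted(found)
```

The Git path respects .gitignore automatically, which is why it is preferred; the fallback has to approximate that with a hard-coded skip list.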
Semantic enrichment is optional and explicit. A normal refresh builds the deterministic map without spending model tokens. When you run a semantic refresh, Guildhall first builds the deterministic map, then asks the contextIndexer model to add purpose, current-truth notes, architecture areas, canonical abstractions, risks, read-next guidance, and worker guidance. The model output is validated as structured JSON and stored under the map's semantic section.
Semantic refresh has a repair ladder. Guildhall first attempts strict parsing, then deterministic cleanup for obvious JSON issues such as fenced output and trailing commas. If the response still cannot parse, or if it parses but does not match the required schema, Guildhall performs one repair pass with a fast OpenAI-compatible model and asks it to preserve the substance while returning valid schema-shaped JSON.
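The first two rungs of that repair ladder can be sketched as below: a strict parse, then deterministic cleanup of fenced output and trailing commas. This is a minimal sketch with hypothetical function names; schema validation and the final model-based repair pass are left out, and the function only reports that a model repair would be needed.

```python
import json
import re

FENCE = "`" * 3  # a Markdown code fence, built here to avoid a literal fence in the source
FENCED = re.compile(
    r"^\s*" + FENCE + r"[a-zA-Z0-9_-]*\s*\n(.*?)\n?\s*" + FENCE + r"\s*$",
    re.DOTALL,
)


def strip_code_fence(text: str) -> str:
    """Remove a surrounding Markdown fence if the whole text is fenced."""
    match = FENCED.match(text)
    return match.group(1) if match else text


def drop_trailing_commas(text: str) -> str:
    """Delete commas that sit directly before a closing brace or bracket.

    Caveat: this naive regex would also alter such sequences inside strings.
    """
    return re.sub(r",(\s*[}\]])", r"\1", text)


def parse_with_cleanup(raw: str):
    """Return (value, stage), or (None, "needs_model_repair") if both rungs fail."""
    try:
        return json.loads(raw), "strict"
    except json.JSONDecodeError:
        pass
    cleaned = drop_trailing_commas(strip_code_fence(raw))
    try:
        return json.loads(cleaned), "cleanup"
    except json.JSONDecodeError:
        return None, "needs_model_repair"
```

Keeping the cheap deterministic rung ahead of the model call means the most common formatting slips never cost extra tokens.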
Guildhall also wires this into the normal agent lifecycle. If no map exists when an agent context is being built, the context builder creates one lazily from the task project or active worktree before rendering the prompt. After a worker changes files and hands work forward, the orchestrator refreshes the map from the dirty files and the checkpoint-touched files it can verify.
Manual refresh remains available for debugging, repair, or explicit control, but normal projects do not need a "remember to build the map" chore. If refresh fails, Guildhall keeps running and records stale status instead of blocking the task.
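The fail-soft behavior described above could be sketched like this: a refresh that raises is caught, and a stale record is written instead of interrupting the task. The function name, the record fields (`failed_at`, `error`), and the choice to clear the stale file after a later success are all assumptions made for illustration.

```python
import json
import os
import time


def refresh_or_mark_stale(memory_dir: str, rebuild) -> bool:
    """Run rebuild(); on failure write a stale record instead of raising."""
    stale_path = os.path.join(memory_dir, "codebase-map.stale.json")
    try:
        rebuild()
    except Exception as error:  # deliberately broad: a refresh must not block the task
        os.makedirs(memory_dir, exist_ok=True)
        with open(stale_path, "w", encoding="utf-8") as handle:
            # Hypothetical record shape; the real file format is not documented here.
            json.dump({"failed_at": time.time(), "error": str(error)}, handle)
        return False
    if os.path.exists(stale_path):
        os.remove(stale_path)  # assumed: a successful refresh clears the stale marker
    return True
```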
Partial refresh
Guildhall refreshes individual touched files when the project shape is stable. A worker completion can pass the files it changed; Guildhall updates those entries, recomputes affected areas and abstractions, and leaves unrelated entries alone.
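A partial refresh of this kind might be modeled as follows. The sketch copies the existing entry table, replaces only the touched entries, and collects the areas whose summaries need recomputing. The entry shape (a dict with an `area` key) and the `describe` callback are hypothetical.

```python
from typing import Callable, Optional


def apply_partial_refresh(
    entries: dict[str, dict],
    touched: list[str],
    describe: Callable[[str], Optional[dict]],
) -> tuple[dict[str, dict], set[str]]:
    """Return a new entry table plus the areas that need recomputing.

    describe(path) is a hypothetical callback that returns a fresh entry,
    or None when the file no longer exists.
    """
    updated = dict(entries)  # shallow copy: unrelated entries carry over untouched
    affected_areas: set[str] = set()
    for path in touched:
        old = updated.get(path)
        if old is not None:
            affected_areas.add(old["area"])
        new = describe(path)
        if new is None:
            updated.pop(path, None)  # file was deleted or is no longer indexed
        else:
            updated[path] = new
            affected_areas.add(new["area"])
    return updated, affected_areas
```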
Some changes force a full rebuild because they can change how the whole project fits together:
- package.json
- lockfiles
- workspace config
- TypeScript, Vite, Svelte, Vue, React, ESLint, or Prettier config
- .gitignore
- AGENTS.md
- ./guildhall.yaml
- ./memory/design-system.yaml
- schema/version changes
- very large touched-file sets
- missing or corrupt previous maps
This keeps refreshes cheap during normal work while still avoiding stale architecture guidance after project-wide changes.
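The decision between a partial and a full rebuild could be expressed as a small predicate over the touched paths. In this sketch the specific lockfile and config file names, the root-only match for guildhall.yaml, and the `MAX_PARTIAL_FILES` threshold are illustrative guesses, and schema/version checks are omitted.

```python
import posixpath

# File names that force a rebuild wherever they appear (partly illustrative).
FULL_REBUILD_NAMES = {
    "package.json", ".gitignore", "AGENTS.md",
    "package-lock.json", "pnpm-lock.yaml", "yarn.lock",
    "pnpm-workspace.yaml", "tsconfig.json",
}
# Project-root paths that force a rebuild.
FULL_REBUILD_PATHS = {"guildhall.yaml", "memory/design-system.yaml"}
# Common config file prefixes for the tools named above (assumed, not exhaustive).
CONFIG_PREFIXES = (
    "vite.config.", "svelte.config.", "vue.config.",
    "eslint.config.", ".eslintrc", "prettier.config.", ".prettierrc",
)
MAX_PARTIAL_FILES = 200  # hypothetical cutoff for "very large touched-file sets"


def needs_full_rebuild(touched: list[str], previous_map_ok: bool) -> bool:
    """Return True when a partial refresh would risk stale project-wide guidance."""
    if not previous_map_ok or len(touched) > MAX_PARTIAL_FILES:
        return True
    for path in touched:
        normalized = posixpath.normpath(path.replace("\\", "/"))
        name = posixpath.basename(normalized)
        if normalized in FULL_REBUILD_PATHS or name in FULL_REBUILD_NAMES:
            return True
        if name.startswith(CONFIG_PREFIXES):
            return True
    return False
```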
How agents use it
The context builder turns the map into a small prompt block:
## Corpus Map
Project: Local project with indexed files across TypeScript and Svelte.
Design system:
- Maturity: thin, approved
- Tokens: color 8, spacing 6, typography 4, radius 3, shadow 2
- Primitives: Button, Select, FrameCard
- UI surface area is larger than the captured token/primitive set; prefer extending the design system when a second repeated treatment appears.
Mapped area:
- Web UI: shared controls, surfaces, and UI conventions.
Reuse / Extend:
- Command buttons (./src/web/lib/Button.svelte)
- Use when: a user triggers an action from a toolbar, form, panel, drawer, or wizard.
- Avoid: local button padding, radius, neutral backgrounds, or one-off action styles.
Read next:
- ./src/web/lib/Button.svelte: Reuse Command buttons
Corpus fit required: before editing, name the existing primitive, helper,
package, design token, component, or area you are extending.
That block is intentionally small. It points the helper toward the right starting files and abstractions; it does not ask the model to trust the map blindly.
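To make the shape of that prompt block concrete, here is a minimal sketch of how the "Reuse / Extend" and "Read next" lines could be rendered from mapped abstractions. The dict keys (`name`, `path`, `use_when`, `avoid`) and the function name are hypothetical; the real map schema may differ.

```python
def render_reuse_block(abstractions: list[dict]) -> str:
    """Render abstraction entries as the compact reuse section of a prompt."""
    lines = ["Reuse / Extend:"]
    for item in abstractions:
        lines.append(f"- {item['name']} ({item['path']})")
        lines.append(f"  - Use when: {item['use_when']}")
        lines.append(f"  - Avoid: {item['avoid']}")
    lines.append("Read next:")
    for item in abstractions:
        lines.append(f"- {item['path']}: Reuse {item['name']}")
    return "\n".join(lines)
```

Each abstraction costs only a few lines of prompt, which is what keeps the block small even as the map grows.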
Evaluation ladder
Guildhall tests context-indexer models against a ladder rather than a single repository. Each rung isolates a different failure mode:
| Rung | Project | Corpus | Why |
|---|---|---|---|
| 1 | narrative-harness | Documentation and product intent | Proves the indexer can summarize specs, decisions, and future architecture without inventing implementation details. |
| 2 | linecraft | Small-to-medium code | Proves the indexer can map real source structure, canonical modules, and verification entrypoints at practical cost. |
| 3 | Guildhall UI slice | Design-system reuse | Proves the indexer can identify shared UI primitives and warn when repeated one-off styles should become a small abstraction. |
| 4 | jess | Hard architecture | Proves the indexer still gives bounded, accurate guidance in a deeper compiler/parser codebase. |
The first two rungs are deliberately different. narrative-harness is mostly documentation, so it is a product-intent test. linecraft is the first real code-corpus test.
CLI and Settings
You can rebuild the map manually:
guildhall corpus-map refresh [path]
You can also run the model-assisted semantic pass:
guildhall corpus-map refresh --semantic [path]
This uses the OpenAI-compatible provider configured in Guildhall and the contextIndexer model assignment when present. If no explicit context-indexer model is configured for that provider, Guildhall uses the current live-ladder fallback for semantic enrichment.
The semantic pass is intentionally allowed to spend tokens. Guildhall derives a generous completion budget from the compact Corpus Map prompt size, gives repair passes their own larger budget because they include both raw output and map context, and relies on schema plus usefulness checks to keep the saved map compact. The goal is not to starve the context indexer. The goal is to avoid runaway output while giving the model enough room to produce read-next guidance, worker guidance, and risks that agents can actually use.
The project Settings screen also has a compact Codebase Map panel showing file, area, abstraction, and design-system maturity counts plus the last build time. The panel is deliberately quiet: useful when you need it, invisible when you do not.
Design-system guidance
The Corpus Map treats the design system as part of codebase orientation, not as a separate aesthetic checklist. When ./memory/design-system.yaml exists, Guildhall records:
- counts for color, spacing, typography, radius, and shadow tokens
- documented primitives and their intended usage
- nearby component files that look like UI primitives
- whether the design system has been approved
- a maturity rating: absent, thin, emerging, or established
- recommendations for reuse or just-in-time systemization
This helps agents avoid the pattern where every screen invents its own button, card, badge, spacing, or color treatment. It also keeps the system from becoming ceremony for small projects. A thin or absent design system is not an automatic mandate to pause all work; it is a prompt to ask whether repetition has become stable enough that a shared token or primitive would now reduce future maintenance.
Why this matters
Corpus Map support lets Guildhall steer workers away from one-off solutions:
- Specs can name the abstraction a task is expected to reuse.
- Workers can start from mapped files before broad exploration.
- Reviewers can reject parallel implementations when a mapped abstraction was ignored.
- UI workers can see whether a project already has tokens and primitives before adding local styles.
- Future runs can query the map instead of relearning the repo from scratch.
The goal is not a perfect static analysis database. The goal is a durable, inspectable orientation layer that makes the right code easier to reuse than the wrong code is to invent.