Open model recommendations
Guildhall can run different models for different agent roles. During development, we test open and open-weight models against saved Guildhall task prompts so the defaults stay grounded in the work the agents actually do: blueprint drafting, coordinator decisions, bounded implementation, review, and gate checking.
These recommendations are intentionally practical. They are not a benchmark leaderboard, and they will change as providers update models, context windows, tool support, and output behavior.
Current recommendation
| Role | Recommended model | Why | Watch for |
|---|---|---|---|
| spec | deepseek-ai/DeepSeek-V4-Flash | Strong enough reasoning for project framing without requiring the largest lane. | Re-test for long, ambiguous specs before making it the only spec model. |
| coordinator | deepseek-ai/DeepSeek-V4-Flash | Good decision quality on structured coordination prompts and supports the OpenAI-compatible API shape Guildhall uses. | Keep deterministic guards around promotions and task handoffs. |
| worker | Qwen/Qwen3-235B-A22B-Instruct-2507 | Best strict pass rate in the current worker-lane replay set, with reliable JSON formatting. | It is slower than some flash models, so use lane-specific routing instead of making every role use the worker model. |
| reviewer | deepseek-ai/DeepSeek-V4-Flash | Good fit for critique and acceptance checks when paired with deterministic gate evidence. | Ask for concrete files, commands, and acceptance criteria, not vibes. |
| gateChecker | deepseek-ai/DeepSeek-V4-Flash or a deterministic path | Gate checks mostly run commands and parse evidence. A model helps when it summarizes failures or chooses the next recovery playbook. | Command exit codes and explicit checks stay in charge. |
| contextIndexer | zai-org/GLM-4.6 for semantic enrichment; deepseek-ai/DeepSeek-V4-Flash for cheap replay/default experiments | GLM led the combined live context-indexer ladder across documentation, code, UI design-system, and hard architecture tracks. DeepSeek remains a useful cheap challenger and repair model, but GLM is the current recommended semantic default. | Keep prompts tight, keep schema repair enabled, and retest as provider behavior changes; this lane summarizes architecture, not product or implementation decisions. |
Premium or experimental lanes:
- Qwen/Qwen3-Coder-480B-A35B-Instruct-Turbo is a plausible premium worker lane for difficult code tasks, but its output-shape behavior makes it a targeted experiment rather than the default.
- openai/gpt-oss-120b is worth retesting for reviewer-style work after Guildhall has a stricter schema-repair path for model outputs.
- zai-org/GLM-4.7-Flash, MiniMaxAI/MiniMax-M2.5, and stepfun-ai/Step-3.5-Flash are not recommended for structured Guildhall lanes from the current replay set unless the provider/output-format path is changed and retested.
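The lane-specific routing mentioned in the table can be pictured as a small lookup from role to model. The sketch below is illustrative only: the model IDs come from the table above, but the `DEFAULT_LANES` dictionary, the `model_for_role` function, and the override mechanism are hypothetical names, not Guildhall's actual configuration format.

```python
from typing import Dict, Optional

# Hypothetical routing table. Model IDs are taken from the recommendation
# table; the dictionary shape itself is an assumption for illustration.
DEFAULT_LANES: Dict[str, str] = {
    "spec": "deepseek-ai/DeepSeek-V4-Flash",
    "coordinator": "deepseek-ai/DeepSeek-V4-Flash",
    "worker": "Qwen/Qwen3-235B-A22B-Instruct-2507",
    "reviewer": "deepseek-ai/DeepSeek-V4-Flash",
    # The table also allows a fully deterministic gate-check path.
    "gateChecker": "deepseek-ai/DeepSeek-V4-Flash",
    "contextIndexer": "zai-org/GLM-4.6",
}


def model_for_role(role: str, overrides: Optional[Dict[str, str]] = None) -> str:
    """Return the model for a role, preferring a per-project override."""
    lanes = {**DEFAULT_LANES, **(overrides or {})}
    if role not in lanes:
        raise KeyError(f"unknown agent role: {role}")
    return lanes[role]
```

With a shape like this, trying the premium worker lane is a one-key override such as `model_for_role("worker", {"worker": "Qwen/Qwen3-Coder-480B-A35B-Instruct-Turbo"})`, and every other role keeps its default.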
How we test
Model testing replays the same frozen task input against every candidate. Each run holds constant:
- the exact agent role prompt,
- the same task context and project facts,
- the same tool schema or response format,
- the same role-specific scoring rubric.
Each run also captures latency, token use, schema compliance, and decision quality.
Deterministic checks come first. A model that chooses the right answer but cannot return the expected shape is not ready for a structured Guildhall lane.
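A minimal sketch of that ordering, assuming the lane expects a JSON object with `decision` and `reason` keys. The key names, the `score_case` function, and the result fields are hypothetical; the real Guildhall schemas and scoring code may differ.

```python
import json

# Hypothetical response schema for a structured lane.
REQUIRED_KEYS = {"decision", "reason"}


def score_case(raw_output: str, expected_decision: str) -> dict:
    """Check the output shape first; judge the decision only if the shape is valid."""
    failed_shape = {"schema_ok": False, "decision_ok": False, "strict_pass": False}
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        # The right answer wrapped in prose still fails the strict path.
        return failed_shape
    if not isinstance(parsed, dict) or not REQUIRED_KEYS.issubset(parsed):
        return failed_shape
    decision_ok = parsed["decision"] == expected_decision
    return {"schema_ok": True, "decision_ok": decision_ok, "strict_pass": decision_ok}
```

Under this scoring, a reply like `I would approve this. {"decision": "approve"}` counts as a strict failure even though the decision is correct, which is the behavior the paragraph above describes.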
Model comparison findings
The current recommendations came from replaying frozen Guildhall prompts against candidate models. The table below summarizes what we saw for each model; it is not a permanent benchmark.
| Model | What we saw | Recommendation |
|---|---|---|
| Qwen/Qwen3-235B-A22B-Instruct-2507 | Best strict pass rate and strongest structured-output reliability in the worker-style cases. | Use for worker by default. |
| deepseek-ai/DeepSeek-V4-Flash | Strong general decision quality across coordinator/reviewer-style cases, with acceptable structured output in replay. In the first live context-indexer ladder it produced one good docs result, then malformed JSON for linecraft. | Use for spec, coordinator, reviewer, and model-assisted gateChecker work. Keep it in the context-indexer set as the cheap challenger, but do not rely on it without schema repair. |
| zai-org/GLM-4.6 | Best combined live context-indexer result across the expanded ladder. It gave stronger semantic summaries overall and was faster than DeepSeek on the later Guildhall/Jess tracks, with schema repair covering occasional malformed JSON. | Use for semantic Corpus Map enrichment until a cheaper model passes the same live ladder. |
| Qwen/Qwen3.6-35B-A3B | Good replay candidate, but returned no parseable JSON in the first live context-indexer ladder. | Keep in the comparison set only if prompt/schema handling changes. |
| openai/gpt-oss-120b | Promising decisions, but weak format reliability in the current harness. | Retest after schema repair before recommending. |
| Qwen/Qwen3-Coder-480B-A35B-Instruct-Turbo | Strong decisions, but poor strict-format reliability in the current worker harness. | Keep as a premium worker experiment, not a default. |
| deepseek-ai/DeepSeek-V3.2 | Reasonable decisions, but failed the strict structured-output path. | Do not use for structured Guildhall lanes as tested. |
| openai/gpt-oss-20b | Did not clear enough worker/reviewer cases in this harness. | Do not recommend for default lanes yet. |
| zai-org/GLM-4.7-Flash, MiniMaxAI/MiniMax-M2.5, stepfun-ai/Step-3.5-Flash | Did not work well enough with the current structured lane requirements. | Do not recommend unless the provider/output path changes and is retested. |
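Several rows above depend on schema repair. Guildhall's actual repair path is not described here, so the following is only a sketch of one common approach: try the raw output, then the contents of a fenced block, then the outermost brace-delimited span. The `repair_json` name and the three-step order are assumptions.

```python
import json
import re
from typing import Optional

# Build the fence marker at runtime so this example can live inside a fenced block.
TICKS = "`" * 3
FENCE = re.compile(TICKS + r"(?:json)?\s*(.*?)" + TICKS, re.DOTALL)


def repair_json(raw: str) -> Optional[dict]:
    """Try progressively looser ways to recover one JSON object from model output."""
    candidates = [raw.strip()]
    fenced = FENCE.search(raw)
    if fenced:
        # Models often wrap JSON in a markdown code fence.
        candidates.append(fenced.group(1).strip())
    start, end = raw.find("{"), raw.rfind("}")
    if 0 <= start < end:
        # Fall back to the outermost {...} span, dropping surrounding prose.
        candidates.append(raw[start:end + 1])
    for text in candidates:
        try:
            value = json.loads(text)
        except json.JSONDecodeError:
            continue
        if isinstance(value, dict):
            return value
    return None  # unrecoverable: count it as a format failure
```

A repair step like this recovers wrapped or fenced JSON but not truncated or structurally invalid output, so it narrows format failures rather than removing the need for the strict check.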
Compare models locally
Use the built-in replay harness to generate a deterministic report:
guildhall model-bakeoff
By default, this writes:
- artifacts/model-bakeoff/model-bakeoff-report.json
- artifacts/model-bakeoff/model-bakeoff-report.md
You can choose another JSON output path:
guildhall model-bakeoff artifacts/model-bakeoff/my-report.json
To run the context-indexer replay specifically:
guildhall model-bakeoff --context-indexer
The current command uses saved replay scenarios and simulated model lanes. It is useful for checking Guildhall's reporting, scoring, and learning-candidate pipeline without spending provider credits.
The context-indexer replay set covers semantic code orientation: canonical abstraction selection, legacy/current path detection, design-system drift, and module contract summaries. Current DeepInfra candidate lanes are deepseek-ai/DeepSeek-V4-Flash, Qwen/Qwen3.6-35B-A3B, and zai-org/GLM-4.6.
The context-indexer report also records the real-project test ladder Guildhall uses when moving from replay checks to provider-backed evaluation:
| Track | Project | What it tests |
|---|---|---|
| Documentation corpus | narrative-harness | Product theory, specs, decisions, and architecture intent without claiming code exists where it does not. |
| Code corpus | linecraft | A small-to-medium real codebase with enough structure to compare architecture summaries and read-next guidance. |
| Design-system slice | guildhall/src/web and guildhall/packages/ui | Whether the indexer steers workers toward shared primitives instead of one-off controls or styling. |
| Hard architecture | jess | Whether summaries stay bounded and correct in a deeper compiler/parser architecture. |
Planned live comparison mode
The live version adds provider-backed candidate runs:
guildhall model-bakeoff --live \
  --provider openai-api \
  --models deepseek-ai/DeepSeek-V4-Flash,Qwen/Qwen3-235B-A22B-Instruct-2507 \
  --judge-model deepseek-ai/DeepSeek-V4-Flash
The judge model is not the whole evaluation. Guildhall scores hard facts first:
- Did the model return valid structured output?
- Did it call the right tool or produce the expected decision?
- Did it avoid false approvals and false escalations?
- Did it preserve the task's acceptance criteria?
- How long did it take, and how many tokens did it use?
After that, an explicit judge model can compare outputs under a rubric and explain tradeoffs. That evaluator is configurable, excluded from the candidate set by default, and treated as advisory evidence. You approve any change to global model defaults.
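One way to picture "hard facts first, judge as advisory" is a sort key in which the judge score can only break ties between candidates that are equal on deterministic checks. This is a sketch under assumptions: `CandidateResult`, its fields, and `rank_candidates` are hypothetical, and the live mode itself is still planned.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CandidateResult:
    model: str
    strict_passes: int        # cases with valid shape and the expected decision
    false_approvals: int      # approvals the rubric says should not have happened
    judge_score: float = 0.0  # advisory rubric score from the judge model


def rank_candidates(
    results: List[CandidateResult],
    judge_model: str,
    allow_judge_as_candidate: bool = False,
) -> List[CandidateResult]:
    """Order candidates by hard facts; the judge score only breaks ties."""
    eligible = [
        r for r in results
        if allow_judge_as_candidate or r.model != judge_model
    ]
    return sorted(
        eligible,
        key=lambda r: (-r.strict_passes, r.false_approvals, -r.judge_score),
    )
```

The ranking is a report, not a switch: in keeping with the text above, nothing in this sketch changes a default lane, and a candidate the judge prefers still ranks below one with more strict passes.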