Open model recommendations

Guildhall can run different models for different agent roles. During development, we test open and open-weight models against saved Guildhall task prompts so the defaults stay grounded in the work the agents actually do: blueprint drafting, coordinator decisions, bounded implementation, review, and gate checking.

These recommendations are intentionally practical. They are not a benchmark leaderboard, and they will change as providers update models, context windows, tool support, and output behavior.

Current recommendation

| Role | Recommended model | Why | Watch for |
| --- | --- | --- | --- |
| spec | deepseek-ai/DeepSeek-V4-Flash | Strong enough reasoning for project framing without requiring the largest lane. | Re-test for long, ambiguous specs before making it the only spec model. |
| coordinator | deepseek-ai/DeepSeek-V4-Flash | Good decision quality on structured coordination prompts and supports the OpenAI-compatible API shape Guildhall uses. | Keep deterministic guards around promotions and task handoffs. |
| worker | Qwen/Qwen3-235B-A22B-Instruct-2507 | Best strict pass rate in the current worker-lane replay set, with reliable JSON formatting. | It is slower than some flash models, so use lane-specific routing instead of making every role use the worker model. |
| reviewer | deepseek-ai/DeepSeek-V4-Flash | Good fit for critique and acceptance checks when paired with deterministic gate evidence. | Ask for concrete files, commands, and acceptance criteria, not vibes. |
| gateChecker | deepseek-ai/DeepSeek-V4-Flash or a deterministic path | Gate checks mostly run commands and parse evidence. A model helps when it summarizes failures or chooses the next recovery playbook. | Command exit codes and explicit checks stay in charge. |
| contextIndexer | zai-org/GLM-4.6 for semantic enrichment; deepseek-ai/DeepSeek-V4-Flash for cheap replay/default experiments | GLM led the combined live context-indexer ladder across documentation, code, UI design-system, and hard architecture tracks. DeepSeek remains a useful cheap challenger and repair model, but GLM is the current recommended semantic default. | Keep prompts tight, keep schema repair enabled, and retest as provider behavior changes; this lane summarizes architecture, not product or implementation decisions. |
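
The lane-specific routing suggested above (one model per role rather than the worker model everywhere) can be sketched as a small lookup table. This is an illustrative sketch, not Guildhall's configuration format: `DEFAULT_LANE_MODELS`, `model_for_role`, and the `overrides` parameter are hypothetical names. Only the role names and model IDs are taken from the recommendations.

```python
# Hypothetical lane routing table. Role names and model IDs come from the
# recommendations above; the dict layout and function name are assumptions.
DEFAULT_LANE_MODELS = {
    "spec": "deepseek-ai/DeepSeek-V4-Flash",
    "coordinator": "deepseek-ai/DeepSeek-V4-Flash",
    "worker": "Qwen/Qwen3-235B-A22B-Instruct-2507",
    "reviewer": "deepseek-ai/DeepSeek-V4-Flash",
    # Only used when a gate check needs a model at all; exit codes stay in charge.
    "gateChecker": "deepseek-ai/DeepSeek-V4-Flash",
    "contextIndexer": "zai-org/GLM-4.6",
}


def model_for_role(role, overrides=None):
    """Return the model ID for a role, preferring caller-supplied overrides."""
    merged = {**DEFAULT_LANE_MODELS, **(overrides or {})}
    if role not in merged:
        raise KeyError(f"no model configured for role {role!r}")
    return merged[role]
```

Keeping the mapping per role means a slower worker model does not slow down coordinator or reviewer calls, and a single lane can be swapped for an experiment without touching the others.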

Premium or experimental lanes:

  • Qwen/Qwen3-Coder-480B-A35B-Instruct-Turbo is a plausible premium worker lane for difficult code tasks, but its output-shape behavior makes it a targeted experiment rather than the default.
  • openai/gpt-oss-120b is worth retesting for reviewer-style work after Guildhall has a stricter schema-repair path for model outputs.
  • zai-org/GLM-4.7-Flash, MiniMaxAI/MiniMax-M2.5, and stepfun-ai/Step-3.5-Flash are not recommended for structured Guildhall lanes from the current replay set unless the provider/output-format path is changed and retested.

How we test

Model testing replays the same frozen task input across each candidate:

  • the exact agent role prompt,
  • the same task context and project facts,
  • the same tool schema or response format,
  • the same role-specific scoring rubric,
  • captured latency, token use, schema compliance, and decision quality.

Deterministic checks come first. A model that chooses the right answer but cannot return the expected shape is not ready for a structured Guildhall lane.
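
A minimal sketch of "shape first, decision second" scoring follows. The response shape (a JSON object with `decision` and `rationale` keys) and the names `REQUIRED_KEYS` and `score_response` are assumptions made for illustration; Guildhall's real lane schemas and scorer are not shown in this document.

```python
import json

# Assumed response shape for illustration only.
REQUIRED_KEYS = {"decision", "rationale"}


def score_response(raw_output, expected_decision):
    """Score one replayed response: check the shape, then the decision."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        # Unparseable output fails outright, even if the prose names the right answer.
        return {"schema_ok": False, "decision_ok": False, "strict_pass": False}

    schema_ok = isinstance(parsed, dict) and REQUIRED_KEYS.issubset(parsed)
    # The decision is only inspected once the shape check has passed.
    decision_ok = schema_ok and parsed["decision"] == expected_decision
    return {
        "schema_ok": schema_ok,
        "decision_ok": decision_ok,
        "strict_pass": schema_ok and decision_ok,
    }
```

Under this sketch, a reply such as `Sure! I would approve this.` scores as a strict failure against an expected decision of `approve`, which matches the rule that a correct answer in the wrong shape is not ready for a structured lane.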

Model comparison findings

The current recommendations came from comparing frozen Guildhall prompts across candidate models. The table below summarizes the result shape without treating it as a permanent benchmark.

| Model | What we saw | Recommendation |
| --- | --- | --- |
| Qwen/Qwen3-235B-A22B-Instruct-2507 | Best strict pass rate and strongest structured-output reliability in the worker-style cases. | Use for worker by default. |
| deepseek-ai/DeepSeek-V4-Flash | Strong general decision quality across coordinator/reviewer-style cases, with acceptable structured output in replay. In the first live context-indexer ladder it produced one good docs result, then malformed JSON for linecraft. | Use for spec, coordinator, reviewer, and model-assisted gateChecker work. Keep it in the context-indexer set as the cheap challenger, but do not rely on it without schema repair. |
| zai-org/GLM-4.6 | Best combined live context-indexer result across the expanded ladder. It gave stronger semantic summaries overall and was faster than DeepSeek on the later Guildhall/Jess tracks, with schema repair covering occasional malformed JSON. | Use for semantic Corpus Map enrichment until a cheaper model passes the same live ladder. |
| Qwen/Qwen3.6-35B-A3B | Good replay candidate, but returned no parseable JSON in the first live context-indexer ladder. | Keep in the comparison set only if prompt/schema handling changes. |
| openai/gpt-oss-120b | Promising decisions, but weak format reliability in the current harness. | Retest after schema repair before recommending. |
| Qwen/Qwen3-Coder-480B-A35B-Instruct-Turbo | Strong decisions, but poor strict-format reliability in the current worker harness. | Keep as a premium worker experiment, not a default. |
| deepseek-ai/DeepSeek-V3.2 | Reasonable decisions, but failed the strict structured-output path. | Do not use for structured Guildhall lanes as tested. |
| openai/gpt-oss-20b | Did not clear enough worker/reviewer cases in this harness. | Do not recommend for default lanes yet. |
| zai-org/GLM-4.7-Flash, MiniMaxAI/MiniMax-M2.5, stepfun-ai/Step-3.5-Flash | Did not work well enough with the current structured lane requirements. | Do not recommend unless the provider/output path changes and is retested. |
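
Several findings above depend on schema repair. Guildhall's actual repair path is not described in this document, so the following is only one common approach, shown as a sketch: scan the raw output for the first substring that decodes as a JSON object, which tolerates leading prose or a wrapping code fence. The function name `repair_json_output` is hypothetical.

```python
import json


def repair_json_output(raw):
    """Return the first decodable JSON object found in raw, or None.

    Illustrative only: a real repair path should still validate the result
    against the lane's schema, since prose containing braces can mislead it.
    """
    decoder = json.JSONDecoder()
    for index, char in enumerate(raw):
        if char != "{":
            continue
        try:
            # raw_decode parses one JSON value starting at index and ignores
            # whatever trailing text follows it.
            parsed, _end = decoder.raw_decode(raw, index)
        except json.JSONDecodeError:
            continue
        return parsed
    return None
```

A repair step like this recovers near-miss output, but it does not turn a model that returns no parseable JSON at all into a usable candidate; in that case it returns `None` and the case should still count as a format failure.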

Compare models locally

Use the built-in replay harness to generate a deterministic report:

```bash
guildhall model-bakeoff
```

By default, this writes:

  • artifacts/model-bakeoff/model-bakeoff-report.json
  • artifacts/model-bakeoff/model-bakeoff-report.md
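
For a first look at the JSON report from a script, a small loader is enough. The report's schema is not documented here, so this sketch deliberately assumes nothing about its fields and only lists what is at the top level. `load_report` and `describe_report` are hypothetical helper names; the default path is the one listed above.

```python
import json
from pathlib import Path

DEFAULT_REPORT = Path("artifacts/model-bakeoff/model-bakeoff-report.json")


def load_report(path=DEFAULT_REPORT):
    """Load the bake-off JSON report without assuming its schema."""
    with Path(path).open(encoding="utf-8") as handle:
        return json.load(handle)


def describe_report(report):
    """Summarize the top level: sorted keys for an object, a count for a list."""
    if isinstance(report, dict):
        return sorted(report)
    if isinstance(report, list):
        return [f"{len(report)} items"]
    return [type(report).__name__]
```

Calling `describe_report(load_report())` after a run shows which sections the report actually contains before any code is written against specific fields.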

You can choose another JSON output path:

```bash
guildhall model-bakeoff artifacts/model-bakeoff/my-report.json
```

To run the context-indexer replay specifically:

```bash
guildhall model-bakeoff --context-indexer
```

The current command uses saved replay scenarios and simulated model lanes. It is useful for checking Guildhall's reporting, scoring, and learning-candidate pipeline without spending provider credits.

The context-indexer replay set covers semantic code orientation: canonical abstraction selection, legacy/current path detection, design-system drift, and module contract summaries. Current DeepInfra candidate lanes are deepseek-ai/DeepSeek-V4-Flash, Qwen/Qwen3.6-35B-A3B, and zai-org/GLM-4.6.

The context-indexer report also records the real-project test ladder Guildhall uses when moving from replay checks to provider-backed evaluation:

| Track | Project | What it tests |
| --- | --- | --- |
| Documentation corpus | narrative-harness | Product theory, specs, decisions, and architecture intent without claiming code exists where it does not. |
| Code corpus | linecraft | A small-to-medium real codebase with enough structure to compare architecture summaries and read-next guidance. |
| Design-system slice | guildhall/src/web and guildhall/packages/ui | Whether the indexer steers workers toward shared primitives instead of one-off controls or styling. |
| Hard architecture | jess | Whether summaries stay bounded and correct in a deeper compiler/parser architecture. |

Planned live comparison mode

Once available, the live version will add provider-backed candidate runs:

```bash
guildhall model-bakeoff --live \
  --provider openai-api \
  --models deepseek-ai/DeepSeek-V4-Flash,Qwen/Qwen3-235B-A22B-Instruct-2507 \
  --judge-model deepseek-ai/DeepSeek-V4-Flash
```

The judge model is not the whole evaluation. Guildhall scores hard facts first:

  • Did the model return valid structured output?
  • Did it call the right tool or produce the expected decision?
  • Did it avoid false approvals and false escalations?
  • Did it preserve the task's acceptance criteria?
  • How long did it take, and how many tokens did it use?

After that, an explicit judge model can compare outputs under a rubric and explain tradeoffs. That evaluator is configurable, excluded from the candidate set by default, and treated as advisory evidence. Any change to global model defaults still requires your approval.
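
The "hard facts first, judge as advisory" ordering can be sketched as a ranking whose sort key uses only deterministic measurements. Everything here is an assumption made for illustration: the `CandidateResult` fields, the function names, and the priority order (strict pass rate, then false approvals plus false escalations, then latency) are not taken from Guildhall's scorer.

```python
from dataclasses import dataclass


@dataclass
class CandidateResult:
    """Hypothetical per-model summary of one live comparison run."""
    model: str
    cases: int
    strict_passes: int        # valid shape and the expected decision or tool call
    false_approvals: int
    false_escalations: int
    mean_latency_s: float
    judge_score: float = 0.0  # advisory only; never part of the sort key


def hard_fact_key(result):
    """Build a sort key from deterministic measurements only."""
    pass_rate = result.strict_passes / result.cases if result.cases else 0.0
    errors = result.false_approvals + result.false_escalations
    # Higher pass rate first, then fewer errors, then lower latency.
    return (-pass_rate, errors, result.mean_latency_s)


def rank_candidates(results):
    """Order candidates by hard facts; judge scores are carried along untouched."""
    return sorted(results, key=hard_fact_key)
```

In this sketch a model with a high judge score but a lower strict pass rate still ranks below a model that passes more cases, so the judge's output can inform a human decision without being able to overturn the deterministic results.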

Released under the FLL-1.2 License.