
fix(server): Qwen3.6-27B tool calling for claude-code Anthropic path#276

Open: dusterbloom wants to merge 8 commits into Luce-Org:main from dusterbloom:fix/qwen36-claude-code-tool-calling



@dusterbloom
Collaborator

Summary

Tool calling with Qwen3.6-27B (Q3_K_M / Q4_K_M, non-UD) silently fails for real claude-code traffic on /v1/messages, although the same model works fine through hermes (OpenAI shape). This is a confirmed community bug (HF discussion, reference fix repo).

Symptoms reproduced on a captured real claude-code request (22.8K-token system prompt + 24 tools):

  • Pre-fix: model emits 16 leading newlines, then ```bash\nls -lS /, then finish=stop. No structured tool_use.
  • Post-fix: model emits proper content_block_start type:tool_use, name:Bash with {"command":"ls -lhS /tmp 2>/dev/null | head -30"}, finish=tool_calls.

What's in the PR

Six coordinated server-side fixes, sequenced because each exposes the next:

  1. normalize_tools_for_qwen() — convert Anthropic input_schema → OpenAI parameters shape so the Jinja template sees the schema key Qwen3-Coder was trained on.
  2. <bash> / <read> / <write> / <edit> / <ls> / <grep> / <glob> native-tag parser fallback — Pattern 6 in tool_parser.cpp. Gated on `tools` being present (avoids fabricating phantom calls from prose).
  3. scrub_schema_metadata() — strip JSON-Schema metadata (`$schema`, `additionalProperties`, `$defs`, `oneOf`/`anyOf`/`allOf`/`not`) before the Jinja `render_extra_keys` macro turns them into garbage XML tags that hallucinate function names like `<function=cls>`.
  4. truncate_description() — cap each tool description at 500 bytes (paragraph break → sentence boundary → UTF-8-safe hard cut). Claude-code embeds 12KB of "use other tools instead" recipes inside Bash's own description, which steered Qwen to pick Write.
  5. Closed <think> prefill in Jinja renderer when thinking is disabled — mirrors the hardcoded Qwen renderer. Handles trailing whitespace variants in the template. (Diagnosed by Codex.)
  6. resolve_param_alias() — Q3 quant hallucinates short forms (`cmd`→`command`, `path`↔`file_path`, `expr`→`expression`, etc.). Resolves the emitted parameter name to the schema's canonical name via case-insensitive direct match plus a small alias table.

Plus PR #271's `find_tool_start` is extended to recognize the 7 native tags (rebase-resolved cleanly).

Test plan

  • 1013 LOC diff: 337 production + 676 tests (~2:1 test:prod ratio)
  • New unit coverage: native-tag parsing, scrub recursion incl. combinators, description truncation incl. UTF-8 2/3/4-byte safety, alias resolution, no-tools-no-fabrication gate, Anthropic→OpenAI shape normalization
  • All sources compile clean (only pre-existing fread warnings)
  • End-to-end verified against captured real claude-code request: `finish=tool_calls`, structured `tool_use` block, valid `command` argument
  • Underwent self-review cycle (cubic-style); 3 P1 + 5 P2 findings addressed in commit `129e606`
  • Local server interactive verification: launch with `--chat-template-file ` against your dflash_server build, run a multi-turn claude-code session, confirm agentic loop completes
  • Regression check on hermes (OpenAI shape) — should be unaffected; Pattern 6 gate ensures no behavior change when tools shape is already OpenAI

Recipe to deploy

```bash
dflash_server <Qwen3.6-27B GGUF> \
--host 127.0.0.1 --port 18099 --max-ctx 131072 \
--draft \
--chat-template-file
```

Plus `~/.claude/settings.json` env block:
```json
{"env": {
"CLAUDE_CODE_ATTRIBUTION_HEADER": "0",
"CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC": "1",
"CLAUDE_CODE_DISABLE_TELEMETRY": "1"
}}
```
(`CLAUDE_CODE_ATTRIBUTION_HEADER=0` must be in JSON, not shell `export` — see Unsloth docs.)

Known caveats

  • Q3_K_S works with all 6 fixes. Q4_K_M also works structurally, but the model prefers writing Python scripts over running Bash directly, a Qwen3-Coder training bias that is amplified at higher precision.
  • UD-Q3_K_XL / UD-Q4_K_XL (Unsloth Dynamic, calibrated on tool-calling examples) likely needs fewer of these patches — not tested in this PR since download wasn't available in-session.
  • Native-tag fallback (`<bash>` etc.) is opportunistic: `<edit>` and `<write>` can't satisfy Anthropic's Edit/Write tool schemas without `file_path` / `old_string` / `new_string`. They're kept for parser correctness but won't produce useful calls; flagged in code comments.


@cubic-dev-ai cubic-dev-ai Bot left a comment


No issues found across 5 files


Comment thread dflash/src/server/chat_template.cpp Outdated
// assistant suffix, which leaves the model in the wrong decoding state
// for tool use. Mirror the hard-coded behavior here when the rendered
// prompt ends with a bare assistant generation prompt.
if (!enable_thinking) {

This is a general code path. Based on your comment, we should check the model or arch here.

dusterbloom added a commit to dusterbloom/lucebox-hub that referenced this pull request May 26, 2026
… only

The kAssistantBare -> kAssistantPrefill post-processing in
render_chat_template_jinja was applied to all Jinja-rendered prompts.
Add arch_hint (ChatFormat) parameter, defaulting to QWEN3, and guard the
block with arch_hint == ChatFormat::QWEN3. Call site in http_server.cpp
passes chat_format_ so other archs (Laguna, Gemma4) are unaffected.

Addresses howard0su's review comment on PR Luce-Org#276.
@dusterbloom

Addressed in 0e3c79a: added arch_hint (ChatFormat) parameter to render_chat_template_jinja, gated the closed-think prefill block on arch_hint == ChatFormat::QWEN3. Call site in http_server.cpp passes chat_format_ so Laguna and Gemma4 Jinja templates are untouched.


easel added a commit to easel/lucebox-hub that referenced this pull request May 26, 2026
…template

Anthropic tool definitions use `input_schema` as the schema key; Qwen3-Coder's
chat template expects `parameters`. With claude-code's 24-tool requests the model
couldn't ground its tool schemas and fell back to plain-text `<bash>` blocks.

Adds `normalize_tools_for_qwen()` (38 LOC) that handles three input shapes:
- Anthropic (input_schema) → {type:function, function:{name,description,parameters}}
- OpenAI envelope already present → pass through unchanged
- Bare Qwen top-level (name+parameters, no wrapper) → wrap to OpenAI envelope

Wired into request parsing at body["tools"] assignment.

5 new unit tests: anthropic_bare, openai_passthrough, bare_qwen_passthrough,
mixed (both shapes in one array), empty (defensive). All 1454 assertions pass.
…calls

Model emits <bash>CMD</bash>, <ls>PATH</ls> etc. when its system prompt
uses that format. Extend tool_parser (Pattern 6) and sse_emitter hit-
detection to recognise these 7 tags: bash, read, write, edit, ls, grep,
glob. Case-insensitive lookup maps the emitted tag to the canonical tool
name from the request's tools array (e.g. <bash> → "Bash"). Eight new
unit tests added; 1483 assertions all pass.
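
To illustrate the native-tag fallback described above, here is a minimal sketch. It is not the code in tool_parser.cpp: the PR uses a std::regex (`re_native_tag`), while this sketch swaps in plain substring scanning to stay dependency-free. The names `NativeCall`, `lower_ascii`, and `parse_native_tags_sketch` are hypothetical, and the tool list is simplified to a vector of names. It assumes case-sensitive tag matching, a case-insensitive lookup of the canonical tool name, and no calls at all when the request carries no tools.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical result type for this sketch: canonical tool name plus raw tag body.
struct NativeCall {
    std::string tool;
    std::string body;
};

// Hypothetical helper: ASCII lower-casing for the case-insensitive name lookup.
static std::string lower_ascii(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}

// Sketch only. Scans `text` for <tag>BODY</tag> pairs using the seven native tags.
// `tool_names` stands in for the names found in the request's tools array.
std::vector<NativeCall> parse_native_tags_sketch(const std::string& text,
                                                 const std::vector<std::string>& tool_names) {
    static const char* const kTags[] = {"bash", "read", "write", "edit", "ls", "grep", "glob"};
    std::vector<NativeCall> calls;
    if (tool_names.empty()) return calls;  // gate: no tools, no fabricated calls

    std::size_t pos = 0;
    while ((pos = text.find('<', pos)) != std::string::npos) {
        bool matched = false;
        for (const char* tag : kTags) {
            const std::string open = std::string("<") + tag + ">";
            if (text.compare(pos, open.size(), open) != 0) continue;  // case-sensitive tag match
            const std::string close = std::string("</") + tag + ">";
            const std::size_t body_begin = pos + open.size();
            const std::size_t body_end = text.find(close, body_begin);
            if (body_end == std::string::npos) break;  // unclosed tag: treat as prose
            // Map the emitted tag to the canonical tool name, e.g. <bash> -> "Bash".
            for (const std::string& name : tool_names) {
                if (lower_ascii(name) == tag) {
                    calls.push_back({name, text.substr(body_begin, body_end - body_begin)});
                    break;
                }
            }
            pos = body_end + close.size();
            matched = true;
            break;
        }
        if (!matched) ++pos;
    }
    return calls;
}
```

With tools `{"Bash", "Read"}`, the input `<bash>ls -l /tmp</bash>` yields one call with tool `Bash` and body `ls -l /tmp`, while the same input with an empty tool list yields nothing.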
… Jinja XML collisions

The Unsloth Jinja template's render_extra_keys macro unrolls every JSON-Schema key
as a literal XML tag. Keys like $schema, additionalProperties, and $defs produced
garbage XML (<$schema>...</$schema>, <additionalProperties>False</additionalProperties>)
and crucially a nested <name> tag for each parameter that collided with the outer
function's <name> tag, causing the model to hallucinate function names like
<function=cls> with bogus parameters.

Adds scrub_schema_metadata() (28 LOC) that strips the five metadata keys at every
level of the schema tree (recursive through properties and items). Applied in all
three normalization paths (Anthropic input_schema, OpenAI passthrough, bare Qwen).

3 new unit tests: strips_schema_metadata, strips_metadata_recursively,
preserves_real_fields. All 1504 assertions pass, 0 failures.

End-to-end replay of req_003.json (22.8K-token claude-code request): model now
emits name:Write (real tool), stop_reason:tool_use, finish=tool_calls.
No <function=cls> hallucination.
…e leakage

Cap each tool and parameter description at 500 chars using paragraph-break
> sentence-boundary > hard-cut priority, snapping back past UTF-8 multibyte
sequences. Verified by 6 new unit tests (1529 assertions, 0 failures).
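
The truncation priority above can be sketched as follows. This is an illustrative reimplementation, not the PR's `truncate_description()`: the name `truncate_description_sketch` is hypothetical, and it assumes that the cap is measured in bytes, that the appended ellipsis (U+2026) counts toward the cap, and that a sentence boundary means a period followed by a space.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Sketch only. Returns `text` unchanged when it fits in `max_bytes`. Otherwise
// cuts at the last paragraph break, else the last sentence boundary, else a
// UTF-8-safe hard cut, and appends a UTF-8 ellipsis (bytes E2 80 A6).
std::string truncate_description_sketch(const std::string& text,
                                        std::size_t max_bytes = 500) {
    static const std::string kEllipsis = "\xE2\x80\xA6";
    assert(max_bytes > kEllipsis.size());
    if (text.size() <= max_bytes) return text;

    // Assumption: the ellipsis counts toward the cap.
    const std::size_t budget = max_bytes - kEllipsis.size();
    const std::string head = text.substr(0, budget);

    // 1. Prefer the last paragraph break inside the budget.
    std::size_t cut = head.rfind("\n\n");

    // 2. Otherwise the last sentence boundary, keeping the period.
    if (cut == std::string::npos || cut == 0) {
        const std::size_t dot = head.rfind(". ");
        cut = (dot == std::string::npos) ? std::string::npos : dot + 1;
    }

    // 3. Otherwise hard-cut, backing up while the first dropped byte is a
    //    continuation byte (10xxxxxx) so no multibyte sequence is split.
    if (cut == std::string::npos) {
        cut = budget;
        while (cut > 0 && (static_cast<unsigned char>(text[cut]) & 0xC0) == 0x80) --cut;
    }
    return text.substr(0, cut) + kEllipsis;
}
```

For a 600-byte string made only of two-byte `é` characters, the hard cut backs up from byte 497 to byte 496, so the result is 499 bytes long and ends on a complete character before the ellipsis.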
…nking is off

When the Jinja template ends with a bare <|im_start|>assistant\n (e.g. the
official Qwen3.6 template) and the request has thinking disabled, the
hardcoded Qwen renderer appends <think>\n\n</think>\n\n to put the model in
the right decoding state for tool use. The Jinja path was missing this
suffix, so /v1/messages requests rendered through Jinja produced a
different prompt shape than the OpenAI path. Mirror the hardcoded behavior.

Diagnosed by Codex rescue session 019e5fd0 against captured req_003.json
from a real claude-code run. Patch is dormant for templates that already
append their own assistant suffix (Unsloth Qwen3-Coder).
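
A minimal sketch of the post-processing step described above, under stated assumptions: the function name `ensure_closed_think_prefill` is hypothetical, the constants mirror the strings quoted in the commit message rather than the actual `kAssistantBare` / `kAssistantPrefill` definitions, and trailing-whitespace tolerance is implemented by trimming before the suffix check.

```cpp
#include <cstddef>
#include <string>

// Sketch only. When thinking is disabled and the rendered prompt ends with a
// bare assistant marker (ignoring trailing whitespace), re-emit the marker
// followed by a closed, empty <think> block. Otherwise return the prompt as-is.
std::string ensure_closed_think_prefill(const std::string& prompt, bool enable_thinking) {
    static const std::string kBareMarker = "<|im_start|>assistant";
    static const std::string kClosedThink = "<think>\n\n</think>\n\n";
    if (enable_thinking) return prompt;

    // Trim trailing whitespace so "\n", "\n\n" and trailing-space variants all match.
    const std::size_t last = prompt.find_last_not_of(" \t\r\n");
    if (last == std::string::npos) return prompt;  // empty or whitespace-only prompt
    const std::string trimmed = prompt.substr(0, last + 1);

    // Leave templates that already append their own assistant suffix untouched.
    if (trimmed.size() < kBareMarker.size() ||
        trimmed.compare(trimmed.size() - kBareMarker.size(), kBareMarker.size(),
                        kBareMarker) != 0) {
        return prompt;
    }
    return trimmed + "\n" + kClosedThink;
}
```

A prompt that already ends in `</think>` followed by blank lines does not end with the bare marker after trimming, so it passes through unchanged, which matches the "dormant" behavior the commit message describes.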
…names

Quantized models (notably Qwen3.6-27B-Q3) emit short forms of canonical
parameter names: <parameter=cmd> instead of <parameter=command>, <path>
instead of <file_path>, <expr> instead of <expression>. The schema-checking
client (claude-code) then rejects the tool call.

Add resolve_param_alias() that maps emitted keys to the schema's actual
keys via case-insensitive direct match, then a small alias table for
common cmd/command, path/file_path, query/pattern, expr/expression,
src/source, dst/destination shortenings. Helper is pure, returns the
original key if no canonical match exists.

Verified: Qwen3.6-27B-Q3_K_S now produces {"command":"ls -lhS /tmp..."}
for claude-code's Bash tool (was {"cmd":...} pre-fix).
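
The two-step resolution above can be sketched like this. It is an illustration, not the PR's `resolve_param_alias()`: the names `resolve_param_alias_sketch`, `AliasPair`, and `to_lower_copy` are hypothetical, the schema is reduced to a list of key names, and treating every alias pair as bidirectional is an assumption (the PR description only marks `path`↔`file_path` as two-way).

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical helper: ASCII lower-casing for case-insensitive comparison.
static std::string to_lower_copy(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}

// Sketch only. Maps an emitted parameter name onto one of `schema_keys`;
// returns the emitted name unchanged when nothing matches.
std::string resolve_param_alias_sketch(const std::string& emitted,
                                       const std::vector<std::string>& schema_keys) {
    struct AliasPair { const char* short_form; const char* long_form; };
    // Pairs taken from the commit message; bidirectional use is an assumption.
    static const AliasPair kAliases[] = {
        {"cmd", "command"}, {"path", "file_path"},  {"query", "pattern"},
        {"expr", "expression"}, {"src", "source"},  {"dst", "destination"},
    };
    const std::string key = to_lower_copy(emitted);

    // Step 1: case-insensitive direct match against the schema's own keys.
    for (const std::string& k : schema_keys)
        if (to_lower_copy(k) == key) return k;

    // Step 2: look up the other half of an alias pair and match that instead.
    for (const AliasPair& p : kAliases) {
        const char* other = nullptr;
        if (key == p.short_form) other = p.long_form;
        else if (key == p.long_form) other = p.short_form;
        if (other == nullptr) continue;
        for (const std::string& k : schema_keys)
            if (to_lower_copy(k) == other) return k;
    }
    return emitted;  // pure passthrough when no canonical match exists
}
```

Returning the schema's own spelling (rather than the table's) means a schema key such as `Command` is preserved exactly when the model emits `COMMAND`.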
…P2-5, P2-8)

P1 blockers:
- P1-1 (tool_parser.cpp): drop std::regex::icase from re_native_tag so
  Pattern 6 aligns with sse_emitter::find_tool_start (case-sensitive).
  Also bound the body quantifier to {0,65536}? to prevent catastrophic
  backtracking on adversarial input.
- P1-2 (tool_parser.cpp): gate Pattern 6 on tools.is_array() && !empty()
  so prose like 'please read the manual' or 'grep for the pattern' doesn't
  get fabricated into phantom tool calls.
- P1-3 (test_server_unit.cpp): rewrite test_truncate_preserves_unicode
  assertion to actually verify the byte before the ellipsis is not a UTF-8
  continuation byte. Add 2-byte (é) and 4-byte (𝄞) coverage too.

P2 fixes:
- P2-1 (http_server.cpp): scrub_schema_metadata now recurses into JSON
  Schema combinators (oneOf, anyOf, allOf, not). Anthropic tool defs use
  these for polymorphic params; without recursion the noise leaks.
- P2-3 (test_server_unit.cpp): add four resolve_param_alias tests
  (cmd→command, path→file_path, case-insensitive direct, passthrough)
  via the public parse_tool_calls API.
- P2-5 (chat_template.cpp): make think-prefill suffix check tolerant of
  trailing whitespace variants (\n\n, trailing space). Trim trailing
  whitespace, check for bare <|im_start|>assistant, then re-emit
  marker + prefill.
- P2-8 (test_server_unit.cpp): fix tautological assertion in
  test_truncate_at_paragraph_break (was checking '\xE2' on result.back()
  which is always the last byte of the ellipsis '\xA6').

Existing tests updated: bash_multiline/ls_with_path now pass tools (the
new P1-2 gate requires it). bash_no_match repurposed; new
no_tools_no_fabrication tests added to lock in the gate.
… only

The kAssistantBare -> kAssistantPrefill post-processing in
render_chat_template_jinja was applied to all Jinja-rendered prompts.
Add arch_hint (ChatFormat) parameter, defaulting to QWEN3, and guard the
block with arch_hint == ChatFormat::QWEN3. Call site in http_server.cpp
passes chat_format_ so other archs (Laguna, Gemma4) are unaffected.

Addresses howard0su's review comment on PR Luce-Org#276.
@dusterbloom dusterbloom force-pushed the fix/qwen36-claude-code-tool-calling branch from 0e3c79a to 5e861b4 Compare May 28, 2026 18:59
easel pushed a commit to easel/lucebox-hub that referenced this pull request May 28, 2026
Merge PR Luce-Org#276 as a stack parent while preserving the existing server unit coverage already carried by auto-integration. The PR head's only tree delta duplicated normalize/tool-call tests in test_server_unit.cpp and left invalid duplicate definitions, so restore the pre-merge file and record the reconciliation in the manifest.