Fix extract_fenced_code_block for language tags with non-word characters #1468

Open
vidigoat wants to merge 1 commit into simonw:main from vidigoat:fix-fenced-code-lang-symbols

vidigoat commented Jun 2, 2026

extract_fenced_code_block captured the language/info tag with (\w+)?. Because \w matches only word characters (letters, digits, and underscore; [A-Za-z0-9_] for ASCII input), a fenced block whose info string contains any other character fails to match entirely, and the function returns None. Common real-world language tags trigger this: ```c++, ```objective-c, ```c#, ```f#. As a result, llm -x / --extract / --extract-last and the template extract: / extract_last: options silently extract nothing when a model labels the block with one of these languages.
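To make the failure concrete, here is a minimal sketch that checks only the opening fence line. `OLD_OPENING` and `old_opening_matches` are hypothetical names for illustration, and the pattern is a simplified stand-in for the opening part of the old regex, not the exact expression in the library.

```python
import re

# Simplified stand-in for the opening-fence part of the old pattern:
# three or more backticks followed by an optional \w+ tag.
OLD_OPENING = re.compile(r"`{3,}(\w+)?")

# Built at runtime so this example contains no literal fence of its own.
FENCE = "`" * 3


def old_opening_matches(tag: str) -> bool:
    """Return True if the old-style pattern accepts a whole opening line with this tag."""
    return OLD_OPENING.fullmatch(FENCE + tag) is not None


# Word-only tags and the empty tag are accepted; tags containing
# '+', '-' or '#' are rejected because \w+ cannot consume them.
for tag in ["python", "", "c++", "objective-c", "c#", "f#"]:
    print(repr(tag), old_opening_matches(tag))
```

Because the opening line has to match before any code can be captured, a rejected tag means the whole block is skipped rather than extracted with a truncated tag.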

This widens the capture to [^\n`]* so the whole info string up to the newline is consumed, regardless of punctuation. A plain ``` fence (no language) and existing word-only tags are unaffected.

Added parametrized cases to test_extract_fenced_code_block for c++, objective-c, and c#. They fail on main (return None) and pass with this change; the rest of tests/test_utils.py stays green.
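The behaviour described above can be sketched roughly as follows. This is an illustrative approximation under stated assumptions, not the code in the diff: `WIDENED`, `extract_block`, `fenced`, and the group names are hypothetical, and the real function may differ in details such as how the closing fence is handled.

```python
import re
from typing import Optional

# Sketch of a pattern with the widened info-string capture: the tag is
# "anything except a newline or a backtick". Group names are illustrative.
WIDENED = re.compile(
    r"^(?P<fence>`{3,})(?P<info>[^\n`]*)\n(?P<code>.*?)^(?P=fence)[ ]*(?=\n|$)",
    re.MULTILINE | re.DOTALL,
)

# Built at runtime so this example contains no literal fence of its own.
FENCE = "`" * 3


def extract_block(text: str, last: bool = False) -> Optional[str]:
    """Return the body of the first (or last) fenced block, or None if there is none."""
    matches = list(WIDENED.finditer(text))
    if not matches:
        return None
    chosen = matches[-1] if last else matches[0]
    return chosen.group("code")


def fenced(tag: str, body: str) -> str:
    """Wrap body in a three-backtick fence labelled with tag (helper for the checks below)."""
    return f"{FENCE}{tag}\n{body}\n{FENCE}"


# Untagged, word-only, and punctuated tags all yield the same body here.
for tag in ["", "python", "c++", "objective-c", "c#"]:
    assert extract_block(fenced(tag, "x = 1")) == "x = 1\n"
```

Excluding the backtick from the character class keeps the info string from absorbing fence characters, which is in line with the CommonMark rule that the info string of a backtick fence may not contain backticks.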

The language-tag capture used \w+, which only matches [A-Za-z0-9_]. When a
fenced block was labelled with an info string containing a non-word character
(e.g. ```c++, ```objective-c, ```c#), the pattern failed to match and the
function returned None, silently breaking llm -x/--extract and the template
extract: options. Widen the capture to [^\n`]* so the whole info string is
consumed up to the newline.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>