Fix extract_fenced_code_block for language tags with non-word characters #1468
Open
vidigoat wants to merge 1 commit into
The language-tag capture used \w+, which only matches [A-Za-z0-9_]. When a fenced block was labelled with an info string containing a non-word character (e.g. ```c++, ```objective-c, ```c#), the pattern failed to match and the function returned None, silently breaking llm -x/--extract and the template extract: options. Widen the capture to [^\n`]* so the whole info string is consumed up to the newline.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
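The difference between the two captures can be checked in isolation. The sketch below is not the regex from the llm source: OLD_OPENING, NEW_OPENING and opening_matches are hypothetical names, and the patterns cover only the opening fence line, built from the two captures named in the commit message.

```python
import re

# Hypothetical stand-ins for the opening-fence part of the pattern.
# The real regex in llm may differ in its details.
OLD_OPENING = re.compile(r"^```(\w+)?\n")      # word characters only
NEW_OPENING = re.compile(r"^```([^\n`]*)\n")   # anything up to the newline, except backticks


def opening_matches(pattern, text):
    """Return True if the pattern matches the opening fence line of text."""
    return pattern.match(text) is not None


for tag in ["python", "", "c++", "objective-c", "c#"]:
    block = f"```{tag}\nbody\n```"
    # Print the tag, then whether the old and new patterns accept it.
    print(repr(tag), opening_matches(OLD_OPENING, block), opening_matches(NEW_OPENING, block))
```

With \w+, the engine consumes "c" from c++, cannot match the newline against "+", and has no shorter alternative that works, so the whole match fails rather than capturing a partial tag.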
extract_fenced_code_block captured the language/info tag with (\w+)?. \w only matches [A-Za-z0-9_], so a fenced block whose info string contains a non-word character fails to match entirely and the function returns None. Common, real-world language tags trigger this: ```c++, ```objective-c, ```c#, ```f#. The effect is that llm -x/--extract/--extract-last and the template extract:/extract_last: options silently extract nothing when a model labels the block with one of these languages.

This widens the capture to [^\n]* so the whole info string up to the newline is consumed, regardless of punctuation. Plain ``` (no language) and existing word-only tags are unaffected.

Added parametrized cases to test_extract_fenced_code_block for c++, objective-c, and c#. They fail on main (return None) and pass with this change; the rest of tests/test_utils.py stays green.
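The kind of cases described above might look roughly like the following. This is a simplified sketch, not the code or tests from this pull request: extract_first_block, FENCE and CASES are hypothetical names, the regex is a reduced stand-in that uses the capture given in the commit message, and a plain loop replaces pytest parametrization so the snippet runs on its own.

```python
import re

# Simplified stand-in for extract_fenced_code_block (not the llm source):
# returns the body of the first fenced block, or None if nothing matches.
FENCE = re.compile(r"^```[^\n`]*\n(.*?)^```\s*$", re.DOTALL | re.MULTILINE)


def extract_first_block(text):
    match = FENCE.search(text)
    return match.group(1) if match else None


# (input text, expected extracted body) pairs, one per language tag.
CASES = [
    ("```c++\nint x = 1;\n```", "int x = 1;\n"),
    ("```objective-c\nNSLog(@\"hi\");\n```", "NSLog(@\"hi\");\n"),
    ("```c#\nvar x = 1;\n```", "var x = 1;\n"),
    ("```python\nprint(1)\n```", "print(1)\n"),   # word-only tag still works
    ("```\nplain\n```", "plain\n"),               # no language tag
    ("no code here", None),                       # nothing to extract
]

for text, expected in CASES:
    assert extract_first_block(text) == expected
```

Swapping [^\n`]* for (\w+)? in FENCE makes the first three cases return None, which is the failure mode the description reports on main.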