fix(parsers/c): exclude on path components in extract_all, not a substring of the abspath #77

Open
gadievron wants to merge 2 commits into master from fix/parsers-c-exclude-on-path-components-in-extract

@gadievron

Collaborator

extract_all in c/function_extractor.py skipped files via `any(excl in str(file_path) for excl in
['.git','build','test','node_modules'])`, an unanchored substring test against the absolute path. As a
result, files whose path merely contained a token were wrongly skipped ('src/latest/main.c' and
'contest/sol.c' both contain 'test'), and an ancestor of repo_path containing a token poisoned the whole
scan (a checkout under '/home/tester/' excluded every file). Match on path components relative to
repo_path instead, using c's own token set.
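
A minimal sketch of the difference between the two checks, not the actual patch. The helper names `old_is_excluded` and `is_excluded` are hypothetical, and `PurePosixPath` is used only so the example behaves the same on every OS. This sketch also compares the file name itself against the token set, which is an assumption; the real change may look at directories only.

```python
from pathlib import PurePosixPath

# Token set quoted in the description above.
EXCLUDED = {'.git', 'build', 'test', 'node_modules'}


def old_is_excluded(file_path):
    """Previous behaviour: unanchored substring test on the whole path string."""
    return any(excl in str(file_path) for excl in EXCLUDED)


def is_excluded(file_path, repo_path):
    """Hypothetical replacement: compare whole path components below repo_path.

    relative_to() raises ValueError if file_path is not under repo_path.
    """
    rel = PurePosixPath(file_path).relative_to(repo_path)
    return any(part in EXCLUDED for part in rel.parts)


repo = PurePosixPath('/home/tester/proj')  # ancestor 'tester' contains 'test'
candidates = ['src/main.c', 'src/latest/main.c', 'contest/sol.c',
              'test/t.c', '.git/hooks/x.c']
for rel in candidates:
    path = repo / rel
    print(rel, old_is_excluded(path), is_excluded(path, repo))
```

Under the old check every candidate is excluded because of the `/home/tester/` ancestor; under the component check only the files below a real `test/` or `.git/` directory are.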

Scope: only the C parser's instance of the substring exclusion that recurs across the parsers'
extract_all implementations. The python/php/ruby extract_all siblings carry different token sets
(vendor*/tmp*/venv*) and are not changed here.

Tests: tests/test_c_extract_all_path_components.py spies on process_file and checks that files with a
token inside a path segment are processed while real test/ and .git dirs stay excluded. RED: 1 failed
(every file excluded via the ancestor-poison case) -> GREEN: 1 passed; full suite 177 passed / 63
skipped. The test loads the C function_extractor under a unique module name via importlib so it does
not pollute sys.modules['function_extractor'] for the sibling python parser tests.
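
One way to load a module under a unique name with importlib, sketched under assumptions: `load_module_as` and the module name `c_function_extractor_under_test` are hypothetical, and a throwaway stand-in file replaces the real c/function_extractor.py. The actual test may do this differently.

```python
import importlib.util
import sys
import tempfile
from pathlib import Path


def load_module_as(unique_name, source_path):
    """Load source_path as a module called unique_name.

    module_from_spec() and exec_module() do not register the module in
    sys.modules, so the plain 'function_extractor' key is left untouched.
    """
    spec = importlib.util.spec_from_file_location(unique_name, source_path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for c/function_extractor.py (illustrative content only).
    src = Path(tmp) / 'function_extractor.py'
    src.write_text("LANG = 'c'\n")

    had_plain_name = 'function_extractor' in sys.modules
    mod = load_module_as('c_function_extractor_under_test', src)
    print(mod.__name__, mod.LANG)
```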

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ws CI

The extract_all over-exclusion regression test recorded the processed path
via str(Path.relative_to(...)), which yields backslash separators on Windows
and fails the forward-slash 'in procset' assertions. Use .as_posix() so the
comparison is OS-independent. The substring-over-exclusion assertions are
unchanged.
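
The separator difference can be reproduced on any OS with `PureWindowsPath`. This is an illustrative sketch: the paths are made up, and `procset` only mirrors the name used in the message above.

```python
from pathlib import PureWindowsPath

# Simulate what Path.relative_to(...) yields on Windows.
rel = PureWindowsPath(r'C:\repo\src\latest\main.c').relative_to(r'C:\repo')

as_str = str(rel)          # backslash separators
as_posix = rel.as_posix()  # forward slashes regardless of OS

procset_str = {as_str}
procset_posix = {as_posix}

print('src/latest/main.c' in procset_str)
print('src/latest/main.c' in procset_posix)
```

Only the set built with `.as_posix()` satisfies a forward-slash membership assertion.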

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>