fix(parsers/php): extract procedural top-level + closure units; seed PHP entry points #76

Open
gadievron wants to merge 2 commits into master from fix/parsers-php-extract-procedural-top-level-closure-units
@gadievron

Collaborator

Two extraction/seeding defects on the PHP analysis path, plus the PHP
entry-point-seeding gap they depend on. All changes are confined to the extraction
layer (parsers/php/function_extractor.py) and the entry-point detector
(utilities/agentic_enhancer/entry_point_detector.py); the PHP call-graph builder is
untouched.

  1. Procedural top-level blackout:
    _extract_functions_from_tree emitted units only for named definitions; top-level
    procedural statements (assignments, echo, add_action(...) hook registrations)
    fell through the catch-all else and produced NO unit, so a WordPress-style
    plugin.php was invisible to reachability seeding. The Python parser has a
    module-level synthesizer (extract_module_level_code -> unit_type='module_level');
    PHP had none. Adds _extract_module_level_unit (called from process_file),
    synthesising a <file>:__module__ unit from program-level statements. Handles
    braceless + braced namespaces; emits nothing for files with no file-scope code.

  2. PHP entry-point seeding:
    entry_point_detector USER_INPUT_PATTERNS / MODULE_LEVEL_INPUT_PATTERNS were
    Python/JS-only, so a PHP handler reading $_POST was never an entry point (Check 3)
    and the module_level unit could not fire Check 4. Adds PHP superglobals
    ($_GET/$_POST/$_REQUEST/$_COOKIE/$_SERVER/$_FILES/$_ENV/$_SESSION), php://input /
    filter_input, and WordPress hook idioms (add_action/add_filter/do_action/
    apply_filters) for the module-level path.

  3. Anonymous closures + arrow functions as units:
    anonymous_function / arrow_function nodes fell through the same else and were
    never modeled; the named-definition walk also did not descend into function/method
    bodies, so nested closures were unreachable. Adds a closure dispatch branch
    (unit_type='closure', synthetic {closure@line:col} name) and makes
    function_definition / method_declaration recurse into their bodies. The
    closure-DISPATCH edge ($cb() -> closure) lives in call_graph_builder.py and is out
    of this file's scope; this fixes only the extraction half.
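
As a rough illustration of items 1 and 3, the sketch below walks a toy tree and emits units. It is not the PR's code and does not use the real tree-sitter API: `Node`, `walk`, `extract_units`, the `NON_PROCEDURAL` set, and the unit dictionary shape (including the 'function' unit_type) are hypothetical stand-ins. Only the node-type names, the 'closure' and 'module_level' unit types, and the <file>:__module__ and {closure@line:col} naming come from the description above. Namespace handling is left out.

```python
from dataclasses import dataclass, field


# Hypothetical stand-in for a parser node; the real extractor walks a parsed PHP tree.
@dataclass
class Node:
    type: str
    line: int = 0
    col: int = 0
    name: str = ""
    children: list = field(default_factory=list)


NAMED_DEFS = {"function_definition", "method_declaration"}
CLOSURES = {"anonymous_function", "arrow_function"}
# Assumed set of top-level node types that do NOT count as file-scope code.
NON_PROCEDURAL = NAMED_DEFS | {"class_declaration", "namespace_use_declaration", "php_tag"}


def walk(node, units):
    """Emit units for named definitions and closures, descending into every body."""
    if node.type in NAMED_DEFS:
        units.append({"name": node.name, "unit_type": "function"})
    elif node.type in CLOSURES:
        # Synthetic name built from the closure's position.
        units.append({"name": f"{{closure@{node.line}:{node.col}}}", "unit_type": "closure"})
    for child in node.children:
        walk(child, units)  # recursion is what makes nested closures reachable


def extract_units(filename, program):
    units = []
    walk(program, units)
    # Synthesize one module-level unit only when the file has file-scope statements.
    if any(child.type not in NON_PROCEDURAL for child in program.children):
        units.append({"name": f"{filename}:__module__", "unit_type": "module_level"})
    return units


# Toy tree for a WordPress-style plugin.php.
plugin = Node("program", children=[
    # add_action('init', function () { ... });
    Node("expression_statement", line=3, children=[
        Node("anonymous_function", line=3, col=19),
    ]),
    # function handler() { $f = fn($x) => $x; }
    Node("function_definition", line=7, name="handler", children=[
        Node("arrow_function", line=8, col=10),
    ]),
])
units = extract_units("plugin.php", plugin)
```

In this toy run the hook registration yields both a closure unit and the module-level unit, while a file containing only a class declaration would yield no module-level unit.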

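The entry-point seeding in item 2 can be pictured as pattern matching over a unit's source text. The sketch below is an assumption-laden illustration: the real USER_INPUT_PATTERNS / MODULE_LEVEL_INPUT_PATTERNS format is not shown in this PR, so the regexes, the names `PHP_USER_INPUT_PATTERNS`, `PHP_HOOK_PATTERNS`, `reads_user_input`, `is_entry_point`, and the unit dictionary shape are all hypothetical. The mapping of the two branches to Check 3 and Check 4 is likewise a guess based on the description.

```python
import re

# Hypothetical regexes covering the idioms listed in item 2.
PHP_USER_INPUT_PATTERNS = [
    re.compile(r"\$_(GET|POST|REQUEST|COOKIE|SERVER|FILES|ENV|SESSION)\b"),
    re.compile(r"php://input"),
    re.compile(r"\bfilter_input\s*\("),
]
PHP_HOOK_PATTERNS = [
    re.compile(r"\b(add_action|add_filter|do_action|apply_filters)\s*\("),
]


def reads_user_input(code: str) -> bool:
    """True if the code touches a PHP superglobal, php://input, or filter_input()."""
    return any(p.search(code) for p in PHP_USER_INPUT_PATTERNS)


def is_entry_point(unit: dict) -> bool:
    code = unit["code"]
    # Assumed analogue of Check 3: any unit that reads user input.
    if reads_user_input(code):
        return True
    # Assumed analogue of Check 4: hook idioms count only on the module-level path.
    if unit.get("unit_type") == "module_level":
        return any(p.search(code) for p in PHP_HOOK_PATTERNS)
    return False
```

For example, a function unit whose body is `return $_POST['title'];` would be seeded, and so would a module_level unit containing `add_action('init', 'boot');`, whereas the same hook call inside an ordinary function would not.
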
Out of scope (not fixed here):

  • The use_declaration -> namespace_use_declaration node-type correction is already
    handled by the existing PHP import-extraction code in _extract_imports; re-touching
    it here would duplicate that change. Left untouched.
  • Aliased `use Foo\Bar as Baz` -> alias-to-FQN translation lives in
    call_graph_builder.py::_resolve_class_call (out of this file's scope). An
    alias-capture in function_extractor alone would be unobservable (the only consumer
    of the imports map is call_graph_builder.py) and would risk regressing
    import-matching. No no-op change made.

Tests: tests/parsers/php/test_php_extractor.py (new; the package had no PHP
extractor tests). Modules loaded under unique importlib names (function_extractor.py
is a basename shared by every parser). Eight tests covering module_level synthesis,
no-false-positive on class-only files, PHP superglobal entry-point seeding (Check 3
+ Check 4), and closure/arrow-function unit extraction. 6 failed pre-fix (2 guard
tests green at base by design) -> 8 passed after the fix. ruff clean; full suite 184
passed / 63 skipped (suite excluding the new file is exactly 176/63, i.e. zero
regressions).
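
Loading a module under a unique importlib name, as the tests are described as doing, might look like the sketch below. The helper name `load_module_as` and the module names in the usage note are hypothetical; the `importlib.util` calls are standard library.

```python
import importlib.util
import sys
from pathlib import Path


def load_module_as(unique_name: str, path: Path):
    """Load the file at `path` under `unique_name` so same-basename modules do not collide."""
    spec = importlib.util.spec_from_file_location(unique_name, path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load {path}")
    module = importlib.util.module_from_spec(spec)
    # Register under the unique name before executing, as the import system does.
    sys.modules[unique_name] = module
    spec.loader.exec_module(module)
    return module
```

A test could then call something like load_module_as("php_function_extractor", Path("parsers/php/function_extractor.py")) without shadowing another parser's function_extractor module in sys.modules.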

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…th components

extract_all() skipped files whose ABSOLUTE path contained the substring 'tmp' (or
'vendor'/'node_modules'/'.git'/'.cache'). When the analyzed repo lives under such an ancestor
directory — e.g. a Linux /tmp working dir (as on CI), or any path with a 'template'-like
segment — every file was wrongly excluded and zero functions were extracted. Match the
excluded names against the path's components RELATIVE to the repo root instead, so only the
repo's own vendored/transient dirs are skipped, not ancestor directories.
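
A minimal sketch of the component-based check described here, assuming POSIX-style paths. The function name `is_excluded` and the constant `EXCLUDED_DIRS` are hypothetical; the directory names are the ones listed in this commit message. How the real extract_all() treats files outside the repo root is not stated, so returning False in that case is an arbitrary choice for the sketch.

```python
from pathlib import PurePosixPath

# Directory names taken from the commit message.
EXCLUDED_DIRS = {"tmp", "vendor", "node_modules", ".git", ".cache"}


def is_excluded(file_path: PurePosixPath, repo_root: PurePosixPath) -> bool:
    """True if a directory component below repo_root is an excluded name."""
    try:
        relative = file_path.relative_to(repo_root)
    except ValueError:
        return False  # outside the repo root; arbitrary choice for this sketch
    # Compare whole components, and skip the final component (the file name).
    return any(part in EXCLUDED_DIRS for part in relative.parts[:-1])


root = PurePosixPath("/tmp/ci/repo")
kept = is_excluded(PurePosixPath("/tmp/ci/repo/src/plugin.php"), root)
skipped = is_excluded(PurePosixPath("/tmp/ci/repo/vendor/lib/a.php"), root)
```

With the repo checked out under /tmp, `kept` is False because the ancestor tmp component is not part of the relative path, while `skipped` is True because vendor sits inside the repo.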

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>