Skip to content

Add recursive directory and glob scanning to the CLI for bulk pickle triage#283

Open
DevamShah wants to merge 1 commit into
trailofbits:masterfrom
DevamShah:add-recursive-directory-glob-scanning
Open

Add recursive directory and glob scanning to the CLI for bulk pickle triage#283
DevamShah wants to merge 1 commit into
trailofbits:masterfrom
DevamShah:add-recursive-directory-glob-scanning

Conversation

@DevamShah

Copy link
Copy Markdown

Add recursive directory and glob scanning to the CLI for bulk pickle triage

Summary

Teach the fickling CLI to accept a directory or glob for PICKLE_FILE and add a -R/--recursive flag, so a whole tree of pickle files can be safety-scanned in one command instead of one invocation per file.

Problem / motivation

Today the CLI scans a single file at a time. Triaging anything larger — a downloaded model cache, an unpacked archive, or a set of artifacts pulled off a host during an engagement — means shelling out a find ... -exec fickling --check-safety loop and hand-rolling the exit-code aggregation. The HuggingFace path (--huggingface) already proves the bulk-scan pattern end to end; local directories had no equivalent. This closes that gap with the analysis primitives the codebase already exposes.

Change

  • PICKLE_FILE now accepts a directory or a glob pattern. When it resolves to either (or --recursive is set), the CLI routes to a new bulk-scan path instead of opening a single file. Single-file behavior, STDIN (-), and the --inject / --create / --trace modes are untouched — bulk scanning only triggers when none of those single-file operations were requested.
  • Added -R/--recursive to walk a directory tree depth-first. -r was already taken by --replace-result, so the recursive flag uses the conventional -R (as in grep -R).
  • File classification reuses the existing HF_PICKLE_EXTENSIONS sets already defined in cli.py: raw pickle extensions (.bin, .pkl, .pickle) go through scan_file, and the zip-based .pt/.pth archives go through scan_zip_archive — the same loader.py functions the HuggingFace scanner uses. No new analysis logic is introduced.
  • Per-file verdicts aggregate into a single ClamAV-compatible exit code, matching constants.py semantics: any unsafe file → EXIT_UNSAFE (1); otherwise any scan error → EXIT_ERROR (2); otherwise EXIT_CLEAN (0). Unsafe takes precedence over errors so a malicious file is never masked by an unrelated read failure. A directory/glob that resolves to nothing returns EXIT_ERROR rather than a misleading "clean".
  • Scanning is graceful=True, so a corrupt or unparseable member never aborts the run — it is reported and the walk continues. -p/--print-results prints a per-file breakdown plus a final verdict line, mirroring the HuggingFace output format.
  • All path handling uses pathlib to satisfy the repo's flake8-use-pathlib lint rules; no os.path/glob module use was added.
  • README documents the new usage under "Scanning a directory or glob".

Security rationale

Bulk pickle triage is a recurring need for AI/ML supply-chain defenders and incident responders. A single malicious __reduce__ in one file out of hundreds in a model cache is enough for arbitrary code execution on load — this is the deserialization risk tracked as CWE-502 (Deserialization of Untrusted Data) and called out in the OWASP Machine Learning Security Top 10 (ML06: ML Supply Chain Attacks) and OWASP LLM Top 10 (LLM03: Supply Chain). Maps to MITRE ATT&CK T1195.002 (Supply Chain Compromise: Compromise Software Supply Chain).

The exit-code precedence is deliberately fail-loud: an unsafe verdict dominates errors so CI gates and scan scripts cannot be tricked into a passing status by a single unreadable file. Because the feature only composes existing, already-reviewed scan_file/scan_zip_archive analysis, it introduces no new parsing or deserialization surface — detection fidelity (and false-positive rate) is identical to the per-file CLI; only the iteration and aggregation are new.

Testing / validation

Added test/test_cli_recursive.py (15 unittest.TestCase cases) covering: mixed safe/unsafe directories, all-safe and no-pickle directories, recursive vs. non-recursive descent (a nested unsafe file is correctly ignored without -R and caught with it), -R short flag, nonexistent target → EXIT_ERROR, glob matching (including a glob that deliberately excludes an unsafe .bin to prove no over-matching), recursive ** globs, .pt/.pth zip-member scanning, single-file pass-through, and --print-results summary output. Payloads are built with the __reduce__/os.system pattern already used in test/test_scan_api.py — no hand-assembled pickle bytes.

Run against the repo toolchain (uv sync --all-extras):

uv run pytest -q test/test_cli_recursive.py   # 15 passed
uv run pytest -q test/                         # 130 passed, 5 skipped, 29 subtests passed
uv run ruff format --check fickling/cli.py test/test_cli_recursive.py   # clean
uv run ruff check  fickling/cli.py test/test_cli_recursive.py           # All checks passed
uv run ty check fickling/                      # 80 diagnostics (unchanged baseline; 0 in new code)

Manual end-to-end smoke test confirmed exit codes on a real tree: non-recursive scan of a dir whose only unsafe file is nested returns 0; --recursive returns 1; a nonexistent target returns 2.

The CLI scans one file per invocation, so triaging a model cache or an
unpacked archive means hand-rolling a find -exec loop and aggregating
exit codes manually. The --huggingface path already proves the bulk-scan
pattern; local directories had no equivalent.

PICKLE_FILE now accepts a directory or a glob, and a new -R/--recursive
flag walks a tree depth-first (-r was already --replace-result). File
classification reuses the existing HF_PICKLE_EXTENSIONS sets and the same
loader.scan_file / scan_zip_archive functions, so no new analysis or
deserialization surface is added. Per-file verdicts aggregate into one
ClamAV-compatible exit code: any unsafe file -> 1, else any error -> 2,
else 0, so a malicious file is never masked by an unrelated read failure.

Signed-off-by: Devam Shah <devamshah91@gmail.com>
@DevamShah DevamShah requested a review from ESultanik as a code owner June 23, 2026 01:36
@CLAassistant

Copy link
Copy Markdown

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants