Add recursive directory and glob scanning to the CLI for bulk pickle triage #283
Open
DevamShah wants to merge 1 commit into
The CLI scans one file per invocation, so triaging a model cache or an unpacked archive means hand-rolling a find -exec loop and aggregating exit codes manually. The --huggingface path already proves the bulk-scan pattern; local directories had no equivalent.
PICKLE_FILE now accepts a directory or a glob, and a new -R/--recursive flag walks a tree depth-first (-r was already --replace-result). File classification reuses the existing HF_PICKLE_EXTENSIONS sets and the same loader.scan_file / scan_zip_archive functions, so no new analysis or deserialization surface is added.
Per-file verdicts aggregate into one ClamAV-compatible exit code: any unsafe file -> 1, else any error -> 2, else 0, so a malicious file is never masked by an unrelated read failure.
Signed-off-by: Devam Shah <devamshah91@gmail.com>
Add recursive directory and glob scanning to the CLI for bulk pickle triage
Summary
Teach the fickling CLI to accept a directory or glob for PICKLE_FILE and add a -R/--recursive flag, so a whole tree of pickle files can be safety-scanned in one command instead of one invocation per file.
Problem / motivation
Today the CLI scans a single file at a time. Triaging anything larger (a downloaded model cache, an unpacked archive, or a set of artifacts pulled off a host during an engagement) means shelling out to a find ... -exec fickling --check-safety loop and hand-rolling the exit-code aggregation. The HuggingFace path (--huggingface) already proves the bulk-scan pattern end to end; local directories had no equivalent. This closes that gap with the analysis primitives the codebase already exposes.
Change
- PICKLE_FILE now accepts a directory or a glob pattern. When it resolves to either (or --recursive is set), the CLI routes to a new bulk-scan path instead of opening a single file. Single-file behavior, STDIN (-), and the --inject/--create/--trace modes are untouched: bulk scanning only triggers when none of those single-file operations were requested.
- Added -R/--recursive to walk a directory tree depth-first. -r was already taken by --replace-result, so the recursive flag uses the conventional -R (as in grep -R).
- File classification reuses the HF_PICKLE_EXTENSIONS sets already defined in cli.py: raw pickle extensions (.bin, .pkl, .pickle) go through scan_file, and the zip-based .pt/.pth archives go through scan_zip_archive, the same loader.py functions the HuggingFace scanner uses. No new analysis logic is introduced.
- Exit codes follow constants.py semantics: any unsafe file → EXIT_UNSAFE (1); otherwise any scan error → EXIT_ERROR (2); otherwise EXIT_CLEAN (0). Unsafe takes precedence over errors so a malicious file is never masked by an unrelated read failure. A directory/glob that resolves to nothing returns EXIT_ERROR rather than a misleading "clean".
- Scans run with graceful=True, so a corrupt or unparseable member never aborts the run; it is reported and the walk continues.
- -p/--print-results prints a per-file breakdown plus a final verdict line, mirroring the HuggingFace output format.
- Path handling uses pathlib to satisfy the repo's flake8-use-pathlib lint rules; no os.path/glob module use was added.
Security rationale
Bulk pickle triage is a recurring need for AI/ML supply-chain defenders and incident responders. A single malicious __reduce__ in one file out of hundreds in a model cache is enough for arbitrary code execution on load. This is the deserialization risk tracked as CWE-502 (Deserialization of Untrusted Data) and called out in the OWASP Machine Learning Security Top 10 (ML06: AI Supply Chain Attacks) and OWASP LLM Top 10 (LLM03: Supply Chain). It maps to MITRE ATT&CK T1195.002 (Supply Chain Compromise: Compromise Software Supply Chain).
The exit-code precedence is deliberately fail-loud: an unsafe verdict dominates errors so CI gates and scan scripts cannot be tricked into a passing status by a single unreadable file. Because the feature only composes existing, already-reviewed scan_file/scan_zip_archive analysis, it introduces no new parsing or deserialization surface. Detection fidelity (and false-positive rate) is identical to the per-file CLI; only the iteration and aggregation are new.
Testing / validation
Added test/test_cli_recursive.py (15 unittest.TestCase cases) covering: mixed safe/unsafe directories, all-safe and no-pickle directories, recursive vs. non-recursive descent (a nested unsafe file is correctly ignored without -R and caught with it), the -R short flag, nonexistent target → EXIT_ERROR, glob matching (including a glob that deliberately excludes an unsafe .bin to prove no over-matching), recursive ** globs, .pt/.pth zip-member scanning, single-file pass-through, and --print-results summary output. Payloads are built with the __reduce__/os.system pattern already used in test/test_scan_api.py, with no hand-assembled pickle bytes.
Run against the repo toolchain (uv sync --all-extras):
Manual end-to-end smoke test confirmed exit codes on a real tree: a non-recursive scan of a dir whose only unsafe file is nested returns 0; --recursive returns 1; a nonexistent target returns 2.
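For reviewers who want the target resolution and exit-code precedence in one place, here is a minimal sketch of how the described behavior could fit together. It is not the code in this PR: collect_targets, aggregate_exit_code, Verdict, and the two extension-set names are hypothetical, the extension sets are assumed from the extensions listed in the description, and filtering glob matches by extension is also an assumption.

```python
from enum import Enum
from pathlib import Path

# Exit codes as stated in the PR description (clean=0, unsafe=1, error=2).
EXIT_CLEAN, EXIT_UNSAFE, EXIT_ERROR = 0, 1, 2

# Assumed stand-ins for the HF_PICKLE_EXTENSIONS sets mentioned in the PR.
RAW_PICKLE_EXTS = {".bin", ".pkl", ".pickle"}
ZIP_PICKLE_EXTS = {".pt", ".pth"}


class Verdict(Enum):
    """Hypothetical per-file result type used only in this sketch."""
    SAFE = "safe"
    UNSAFE = "unsafe"
    ERROR = "error"


def collect_targets(target, recursive=False):
    """Resolve a directory or glob pattern into a sorted list of candidate files."""
    path = Path(target)
    if path.is_dir():
        # rglob descends into subdirectories; iterdir stays at the top level.
        candidates = path.rglob("*") if recursive else path.iterdir()
    else:
        parts = path.parts
        # Index of the first path component containing a glob metacharacter.
        magic = next(
            (i for i, part in enumerate(parts) if any(c in part for c in "*?[")),
            None,
        )
        if magic is None:
            return []  # neither an existing directory nor a glob pattern
        # Path.glob only accepts relative patterns, so split off the literal prefix.
        base = Path(*parts[:magic])
        candidates = base.glob("/".join(parts[magic:]))
    wanted = RAW_PICKLE_EXTS | ZIP_PICKLE_EXTS
    return sorted(p for p in candidates if p.is_file() and p.suffix.lower() in wanted)


def aggregate_exit_code(verdicts):
    """Fold per-file verdicts into one exit code; unsafe outranks error."""
    verdicts = list(verdicts)
    if not verdicts:
        return EXIT_ERROR  # nothing was scanned, so do not report "clean"
    if Verdict.UNSAFE in verdicts:
        return EXIT_UNSAFE
    if Verdict.ERROR in verdicts:
        return EXIT_ERROR
    return EXIT_CLEAN
```

Under these assumptions, a tree whose only unsafe file is nested yields 0 without recursion (the nested file is never collected) and 1 with it, and an empty result yields 2, which matches the smoke-test outcomes reported above.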