ML augmented SMB share hunter. Snaffler successor with a two stage classifier pipeline.
ShareSift ranks files on SMB shares by likelihood of containing credentials or secrets. Stage 1 runs a LightGBM path classifier on every path. Stage 2 runs a Qwen3 1.7B LoRA content classifier on the flagged files to confirm. Run Stage 1 alone or both stages together.
Snaffler catches the obvious patterns like id_rsa, NTDS.dit, and .kdbx. It misses the long tail. Custom scripts on shares, secrets in unusual filenames, and password directories with unconventional names all slip through.
ShareSift adds an ML layer on top. The path classifier beats Snaffler on recall by 29.3 percentage points on the Snaffler blind benchmark. The content classifier closes most of the gap to Biringa and Kul 2025 at one quarter the parameter count.
v0.51 ships a 2525-file Windows NTFS corpus built via Stauffer's
DiskForge — 75
synthetic-but-format-shaped credential files across 16 categories,
mixed with 2420 corporate-share noise files (HR policies, finance
reports, marketing assets, vendor PDFs, software install media,
log archives, project source, public templates) + 20 precision-
stress filenames (password_policy.docx, secrets_management_guidelines.pdf
etc) + 30 windows10 OS-template stubs.
Same paths, same ground truth, both pipelines:
| Tool | Keep policy | P | R | F1 | Caught | Missed | FPs |
|---|---|---|---|---|---|---|---|
| Upstream Snaffler | Yellow+ | 0.800 | 0.213 | 0.337 | 16 | 59 | 4 |
| Upstream Snaffler | Red+ | 0.800 | 0.213 | 0.337 | 16 | 59 | 4 |
| ShareSift v0.51 | Yellow+ | 0.158 | 0.840 | 0.266 | 63 | 12 | 336 |
| ShareSift v0.51 | Red+ | 0.466 | 0.720 | 0.565 | 54 | 21 | 62 |
| ShareSift v0.51 | Black-only | 0.833 | 0.333 | 0.476 | 25 | 50 | 5 |
At Red+ — the operator triage policy: ShareSift catches 3.4× more credentials than Snaffler (54 vs 16 of 75), with F1 0.565 vs 0.337.
The tradeoff is genuine: ShareSift surfaces more false positives
(62 vs 4) because the path classifier is aggressive at Yellow tier
on binary-extension noise (.msi / .iso / .psd). If you only
want findings you're sure about, run Black-only — precision rises
to 0.833. If you don't want 59 real credentials silently missed,
run Red+. The picker is yours.
The full per-category breakdown (which 12/16 categories hit 100% path-stage recall, which 4 need content scan or a v0.52 rule) is in docs/diskforge_winshare_v1_results.md.
| Benchmark | Metric | ShareSift v0.51 |
|---|---|---|
| Metasploitable 3 (Windows AD, 40 creds) | Recall | 100% (40/40) |
| Metasploitable 2 (Linux server, 34 creds) | Recall | 100% (34/34) |
| DiskForge Win10 forensic (13 plants) | Recall | 100% (13/13) |
| Linux rule-blind (500 paths, Snaffler-rule-exclusion benchmark) | F1 | 0.944 |
| Snaffler-blind Windows (500 LLM-labeled paths) | P | 0.984 |
| Snaffler-issues operator-grounded | Pass | 31/31 closed + 7/10 open |
ShareSift uses a discipline-honest research cycle: lock the next test set BEFORE writing rules that close the previous one's failures. Four generations as of v0.51:
| Generation | Source PRs | Pre-rule baseline | Post-rule | Closed at |
|---|---|---|---|---|
| v1 | #78 Cisco, #135 FileZilla, #67 ADO | 36% | 100% | v0.49 |
| v2 | #198 CMD-set, #155 Azure CLI, #98 cred-filename, Chrome/Edge | 50% | 100% | v0.49 |
| v3 | #154 -password, #140 Kerberos, #139 MDE, #112 SCCM | 90%¹ | 100% | v0.50 |
| v4 | OPEN PRs #192 PPK, #186 SCCM-broad | 60%² | 70% | (locked at v0.50) |
¹ v3 baseline at 90% reflects pysnaffler bundling upstream rules from PRs #140 and #112 (both merged). Not a strict generalization test.
² v4 uses OPEN PRs that pysnaffler does NOT bundle. The 60%→70% post-rule lift came from the SCCMContentLib$ rule (authored from v3-locked PR #112) catching PR #186's reg-export probe it was never authored against — a real generalization signal.
See docs/methodology.md for the full discipline cycle explanation and docs/v0p50_benchmark_sweep.md for the complete 12-benchmark scorecard with raw counts and the honest provenance breakdown.
Quick install — drop a binary on Kali (no git clone, no uv setup, no Python required):
# Truly single-file binary — Stage 1 path classifier + rules + tier engine
# + Snaffler-TSV + engagement DB + sort + query + export. ~77 MB, zero
# Python prereq. Available from v0.46 onward.
wget https://github.com/byevincent/ShareSift/releases/latest/download/sharesift
chmod +x sharesift
./sharesift --versionOr via pipx (needs Python 3.10+):
pipx install 'sharesift[smb]' # SMB-direct workflow
pipx install sharesift # Stage 1 onlyFull install from source — if you want to develop, train, or run the content classifier:
# Latest milestone release (recommended)
git clone --branch v0.51.0 https://github.com/byevincent/ShareSift.git
# Or track main for unreleased work
git clone https://github.com/byevincent/ShareSift.git
cd ShareSift
# Stage 1 only (path classifier, ~100MB)
uv sync
# Both stages (adds ~3GB of torch and transformers)
uv sync --group content-inference
# SMB-direct support (no mount required) — adds smbprotocol + pyspnego
uv sync --extra smbAdd --group content-training for LoRA fine-tuning. That pulls another 5GB.
Milestone releases: v0.51.0 (current — real corporate-share benchmark via DiskForge: F1 0.565 vs Snaffler 0.337 at Red+), v0.50.0 (held-out v3 100%, v4 70% baseline + SCCMContentLib$ generalization), v0.49.0 (held-out v1 100%, v2 100%, v3 90% + POSIX FileName bugfix), v0.48.0 (held-out v1 36→91%, v2 70%), v0.47.0 (Snaffler-issues benchmark + MSF2 recall 1.000), v0.46.0 (77MB single-file binary + DB exporters), v0.45.0 (top-K precision 0.20 → 0.70), v0.43.0 (Linux rule gap closure), v0.41.0 (engagement datastore). Intermediate tags are shown as pre-releases on the releases page.
Point ShareSift at a UNC + credentials and it walks the share over SMB2/3 directly. Auth flags match netexec conventions (-u/-p/-H/-k/-d).
# Password auth
uv run sharesift //10.10.10.5/Finance$ -u user -p pass
# Pass-the-hash (NT hash or LM:NT)
uv run sharesift //10.10.10.5/Finance$ -u CORP/user -H 'aad3b435b51404eeaad3b435b51404ee:27c4...'
# Kerberos (reads ticket from KRB5CCNAME)
uv run sharesift //dc01.corp.local/SYSVOL$ -u alice -k --use-kcache
# Anonymous / null session
uv run sharesift //10.10.10.5/Public --no-pass
# Pre-flight: confirm creds work before committing to a long walk
uv run sharesift //10.10.10.5/Finance$ -u user -p pass --checkOutput lands in ./sharesift-<host>-<share>/ by default (override with --output-dir). The full one-shot pipeline runs (enumerate → triage → content scan → verify → HTML report) using the live SMB session for content reads — no local mount required.
uv run sharesift /mnt/downloaded_share
# Or to skip live verification + the report and just get triaged hits:
uv run sharesift /mnt/downloaded_share \
--skip-verify --skip-report
# Explicit subcommand form (legacy v0.18 flag set, still supported):
uv run sharesift scan --share /mnt/downloaded_share --output-dir ./scan-outIntermediates land in --output-dir:
scan-out/
├── files.txt # enumerated paths
├── paths.jsonl # stage 1 scores
├── hits.jsonl # stage 1+2 results
├── verified.jsonl # live credential checks
└── report.html # interactive HTML report
Add --json for a structured end-of-run summary on stderr (useful for CI). Add -q / --quiet to silence progress or -v / --verbose for debug detail.
The individual subcommands below let you run each stage on its own when you need finer control.
Pipe output from your enumeration tool directly into ShareSift.
manspider --target \\fileserver -d corp.local | \
uv run sharesift score-paths --stdin
# Or from a file
uv run sharesift score-paths \
--input enumerated_paths.txt \
--output scored.jsonlOutput is JSONL with path, probability, and tier (Black, Red, Yellow, or null).
{"path": "\\\\fileserver\\Finance\\backups\\creds.kdbx", "probability": 0.987, "tier": "Black"}
{"path": "\\\\fileserver\\Dev\\notes.txt", "probability": 0.523, "tier": "Yellow"}
{"path": "\\\\fileserver\\Marketing\\Q4.pdf", "probability": 0.012, "tier": null}Work through Black first, then Red, then Yellow. Use jq to sort and filter.
find ./downloaded_share -type f | \
uv run sharesift scan-files --stdin \
--output deep_scan.jsonlStage 2 adds content_check and content_excerpt to each record.
{"path": "./downloaded_share/Dev/notes.txt", "path_probability": 0.52, "path_tier": "Yellow", "content_check": "yes", "content_excerpt": "API_KEY = 'sk_live_...'"}Stage 2 runs only on tier flagged paths. Override with --force-content. On CPU this takes 5 to 8 seconds per file. On CUDA it runs in about 150ms.
Two stage pipeline. Each stage runs independently.
┌─────────────────────────────────────────┐
│ Stage 1 router (by path shape) │
│ │
│ UNC path → Windows model │
path list → │ Unix path → Linux model │ → (probability, tier)
│ │
│ LightGBM + char n-grams + │
│ 8 hand features, calibrated │
│ probability, per-model tier band │
└─────────────────────────────────────────┘
│
(tier flagged subset)
↓
┌──────────────────────────┐
│ Stage 2: Qwen3-1.7B LoRA │ → (yes / no on secret presence)
│ via transformers + PEFT │
│ 4-bit base on CUDA │
│ bf16 on CPU │
└──────────────────────────┘
Stage 1 trains on 11,190 Windows and 1,685 Linux records. It scores each path in under one millisecond. Stage 2 is 1.5 to 3.4GB depending on your hardware.
Full design in docs/architecture.md and docs/build_plan.md.
ShareSiftPathRule is a SnaffleRule subclass that plugs the path classifier into pysnaffler's SMB enumeration loop.
uv sync --group pysnaffler-integrationfrom sharesift.pysnaffler_run import build_ruleset
from pysnaffler.snaffler import pySnaffler
# ML only: ShareSift replaces Snaffler's rule pack
ruleset = build_ruleset()
# Hybrid: Snaffler defaults plus ShareSift
ruleset = build_ruleset(include_defaults=True)
snaffler = pySnaffler(ruleset=ruleset, dry_run=True)No validation against real engagement findings. ShareSift labels come from a Claude rule and Codex audit pipeline on public corpus data. That is useful signal, but it is a different class from internal engagement grade ground truth.
Calibration holds in distribution. The tier band precision contracts are reliable on data from the same source as training. On out of distribution data the Windows model ECE rises from 0.007 to 0.30. Treat tier assignments as triage ordering, not probability contracts, when scanning real SMB shares.
Cross source generalization is weaker than the headline numbers suggest. Windows PR AUC drops from 0.97 to 0.76 when you train on Stack Exchange and test on GitHub Code Search. Real SMB shares are a third distribution that neither training nor evaluation covers.
The content classifier sits 13 F1 points below Biringa and Kul 2025. The gap comes from model size. Mistral 7B is four times larger than Qwen3 1.7B, and ShareSift targets an RTX 4070 deployment.
Rare credential categories are undertrained. Private keys, SSH credentials, cloud credentials, and IAC each have three or fewer training records. Recall on those classes is weak.
CPUs without AVX 512 are slow. Benchmarked at 5 to 8 seconds per file on a Ryzen 5 3600.
See docs/audit_2026-05-31.md and docs/audit_2026-05-30.md for the full audit history.
src/sharesift/ runtime package
path.py PathClassifier router (Windows and Linux models)
content.py content classifier wrapper
tier.py probability to tier band
features.py char n-gram and hand features
prompt.py content classifier chat template formatter
pysnaffler_rule.py SnaffleRule plugin
pysnaffler_run.py build_ruleset() helper
cli.py score-paths and scan-files entry point
src/eval/ training and evaluation scripts
tools/ training, dataset builders, audit tools
docs/ engineering log and architecture docs
models/ trained model weights
Apache 2.0. See NOTICE for GPLv3 components (vendored Snaffler ruleset and pysnaffler).
This is an active solo build. Track major design decisions in docs/build_plan.md and docs/journal.md. Open an issue before sending a PR.