You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Add an "OWASP LLM02 (Insecure Output Handling)" scorer pack to PyRIT — 4 new TrueFalseScorer subclasses for output-side threat detection, plus an optional companion seed dataset.
This builds directly on the architecture introduced in #1704 (the new RegexScorer base class + CredentialLeakScorer subclass pattern). The proposal extends that pattern across the remaining OWASP LLM02 sub-categories.
Motivation
PyRIT has strong coverage of LLM01 (Prompt Injection — input side): SelfAsk*Scorers, PromptShieldScorer, GandalfScorer, garak scenarios, ATR via #1715, etc.
LLM02 (Insecure Output Handling — output side) is comparatively under-instrumented:
LLM emits a destructive / union / comment-bypass SQL payload
3
ShellCommandOutputScorer
LLM emits a pipe-to-shell, destructive, reverse-shell, or env-exfil shell payload
4
PathTraversalOutputScorer
LLM emits ../../etc/passwd-style path traversal to known-sensitive targets
1 (dual-condition)
Each follows the exact pattern CredentialLeakScorer uses: subclass RegexScorer, set _DEFAULT_PATTERNS, pass categories=["security", "owasp-llm02-<category>"] to super().__init__(). ~50-60 lines per scorer including docstrings.
Companion seed dataset (optional): OWASPLLM02SeedDataset — 33 hand-curated developer-style adversarial requests (29 adversarial + 4 benign controls) calibrated to elicit each of the above payloads. Useful for red-team eval scenarios that want the inputs as well as the scorers.
Out of scope (explicitly to avoid duplicating ongoing work)
A companion arXiv preprint is in draft (cs.CR target, ~17K words, 8-model × 1,584-generation evaluation, with the regex taxonomy as the core methodology). Happy to share full text if useful for review context.
Surfacing the "author also contributor" dynamic up front: if maintainers would rather write the patterns yourselves using our work only as reference, that's a totally acceptable outcome — please say the word.
Questions before any PR
4 scorers or 1? Would you prefer 4 separate RegexScorer subclasses (mirroring CredentialLeakScorer), or a single OWASPLLM02Scorer parameterized by category list?
Scenario or atomic? Do you want a OWASPLLM02RedTeam scenario bundling these + existing scorers, or keep scorers atomic for scenario-author composition?
What we're not asking
Not asking you to merge anything blindly. Not asking to skip review. Happy to iterate on scope, drop any of the four scorers, or pause entirely if this conflicts with anything I've missed.
Proposal
Add an "OWASP LLM02 (Insecure Output Handling)" scorer pack to PyRIT — 4 new
TrueFalseScorersubclasses for output-side threat detection, plus an optional companion seed dataset.This builds directly on the architecture introduced in #1704 (the new
RegexScorerbase class +CredentialLeakScorersubclass pattern). The proposal extends that pattern across the remaining OWASP LLM02 sub-categories.Motivation
PyRIT has strong coverage of LLM01 (Prompt Injection — input side):
SelfAsk*Scorers,PromptShieldScorer,GandalfScorer, garak scenarios, ATR via #1715, etc.LLM02 (Insecure Output Handling — output side) is comparatively under-instrumented:
MarkdownInjectionScorerexists (markdown smuggling)RegexScorer+CredentialLeakScorershipping in FEAT Add RegexScorer and CredentialLeakScorer for regex-based secret detection #1704 (credential leak)These four categories appear directly in OWASP's LLM02 sub-class list. Each is fast and deterministic to detect via the same
RegexScorerpattern.Proposed scope
Four new
RegexScorersubclasses inpyrit/score/true_false/:XSSOutputScorerSQLInjectionOutputScorerShellCommandOutputScorerPathTraversalOutputScorer../../etc/passwd-style path traversal to known-sensitive targetsEach follows the exact pattern
CredentialLeakScoreruses: subclassRegexScorer, set_DEFAULT_PATTERNS, passcategories=["security", "owasp-llm02-<category>"]tosuper().__init__(). ~50-60 lines per scorer including docstrings.Companion seed dataset (optional):
OWASPLLM02SeedDataset— 33 hand-curated developer-style adversarial requests (29 adversarial + 4 benign controls) calibrated to elicit each of the above payloads. Useful for red-team eval scenarios that want the inputs as well as the scorers.Out of scope (explicitly to avoid duplicating ongoing work)
MarkdownInjectionScorer.Disclosure
The proposed patterns are derived from a published MIT-licensed regex catalog I (@ppcvote) maintain:
prompt-defense-audit(npm, TypeScript)prompt-defense-audit-py(PyPI, byte-parity Python port)prompt-defense-eval(Inspect AI eval task, awaiting registry merge at register: add prompt-defense-eval (OWASP LLM02 output-handling eval) UKGovernmentBEIS/inspect_evals#1659)A companion arXiv preprint is in draft (cs.CR target, ~17K words, 8-model × 1,584-generation evaluation, with the regex taxonomy as the core methodology). Happy to share full text if useful for review context.
Surfacing the "author also contributor" dynamic up front: if maintainers would rather write the patterns yourselves using our work only as reference, that's a totally acceptable outcome — please say the word.
Questions before any PR
RegexScorersubclasses (mirroringCredentialLeakScorer), or a singleOWASPLLM02Scorerparameterized by category list?RegexScorerbase class to merge before opening dependent PRs, or is a stacked PR acceptable?OWASPLLM02RedTeamscenario bundling these + existing scorers, or keep scorers atomic for scenario-author composition?What we're not asking
Not asking you to merge anything blindly. Not asking to skip review. Happy to iterate on scope, drop any of the four scorers, or pause entirely if this conflicts with anything I've missed.
cc @rlundeen2 @romanlutz (per recent activity on #1703 / #1702 / #513)