[DRAFT] FEAT add Agent Threat Rules (ATR) adversarial payload dataset loader#1715

Open
eeee2345 wants to merge 6 commits into microsoft:main from eeee2345:feat/atr-dataset-loader

@eeee2345 eeee2345 commented May 11, 2026

Description

Draft PR implementing the dataset loader proposed in #1702, per @romanlutz's directional guidance (GitHub-hosted source, taxonomy as-is, scorer kept separate for a follow-up). Now incorporating Roman's review feedback (see "Review feedback addressed" section below).

What this PR adds

A new remote dataset loader at pyrit/datasets/seed_datasets/remote/agent_threat_rules_dataset.py that surfaces the Agent Threat Rules (ATR) autoresearch adversarial-payload corpus as a PyRIT SeedDataset.

ATR is an open MIT-licensed detection standard for AI agent threats. The autoresearch corpus (data/autoresearch/adversarial-samples.json) contains 1,054 attack-payload entries across ten base rule scenarios in six of the ten ATR categories (prompt-injection, tool-poisoning, context-exfiltration, agent-manipulation, privilege-escalation, skill-compromise). Each payload carries an attack technique label (paraphrase, language_switch, encoding, role_play, and 17 others) and the agent surface it targets (user_input, content, tool_args, tool_name, tool_response, agent_output).

Reference: https://github.com/Agent-Threat-Rule/agent-threat-rules
License: MIT

Files touched

  • pyrit/datasets/seed_datasets/remote/agent_threat_rules_dataset.py (new) — the loader, three companion enums (ATRCategory, ATRDetectionField, ATRVariationType), and a _RULE_ID_TO_CATEGORY dict that resolves each rule_id to its ATR taxonomy category (typed as ATRCategory so the enum is the single source of truth)
  • pyrit/datasets/seed_datasets/remote/__init__.py — adds the import to trigger auto-registration via SeedDatasetProvider.__init_subclass__, plus four entries in __all__
  • tests/unit/datasets/test_agent_threat_rules_dataset.py (new) — 19 unit tests covering happy path, missing-key validation, unknown rule_id skip path, all four filter axes, enum-validation errors, empty-filter rejection, per-rule description, and dict-enum source-of-truth invariant

No PyRIT core code is modified. No new dependencies are added. The manual doc/code/datasets/0_dataset.md entry was removed in 44dce8b per Roman's pointer to #1707, which regenerates that listing from the notebook.

Implementation notes

  • Source URL is pinned to the ATR commit db793f9 (the HEAD of main when this PR was authored). This mirrors HarmBench's pinning convention (c0423b9) for reproducibility. To track upstream instead, pass the raw URL for main or for a different tag.
  • Each row of adversarial-samples.json maps to one SeedPrompt. The payload becomes value. The four upstream metadata fields (original_rule_id, technique, detection_field, variation_type) plus the upstream entry id are preserved on SeedPrompt.metadata. harm_categories is set to a single-element list with the ATR taxonomy category resolved via the loader's _RULE_ID_TO_CATEGORY dict.
  • The _RULE_ID_TO_CATEGORY dict is typed dict[str, ATRCategory] and stores enum members directly — a typo on either side is a static error rather than a silent data-quality bug at SeedPrompt construction.
  • Optional categories, techniques, detection_fields, and variation_types arguments narrow the dataset client-side after fetch. Enum arguments are validated against their expected types via the inherited _validate_enums helper, matching the pattern in _PromptIntelDataset. An empty list is rejected with ValueError (pass None to disable a filter) — the previous "empty list silently disables" behavior was a footgun.
  • Each SeedPrompt's description references the rule's own category (e.g. "Agent Threat Rules (ATR) adversarial payload in the prompt injection family. Rule ATR-2026-00001.") so downstream consumers reading metadata see the family that actually applies, not a corpus-wide list that ignores the rule's specific category.
  • Entries whose original_rule_id is not in the loader's category mapping are skipped (not errored) with an aggregate warning. This handles upstream rule additions that land before the loader's mapping is extended — users get a working dataset minus the unmapped rules, not a runtime failure.
  • The loader extends _RemoteDatasetLoader, so caching, file-type inference, and the public_url/file switch are all inherited — no duplicated infrastructure.
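
The row-to-seed mapping and the skip-with-aggregate-warning behavior described above can be sketched as follows. This is a minimal, self-contained illustration and not the loader's actual code: `SeedRecord` is a hypothetical stand-in for PyRIT's SeedPrompt, the `payload` and `id` JSON keys are assumed names for the payload and upstream entry id, and the enum and mapping list only a subset of the real entries.

```python
import logging
from dataclasses import dataclass, field
from enum import Enum

logger = logging.getLogger(__name__)


class ATRCategory(Enum):
    # Subset of the six categories, for brevity.
    PROMPT_INJECTION = "prompt-injection"
    TOOL_POISONING = "tool-poisoning"


# Values are enum members, so a misspelled member name fails at import time.
# Only one rule is listed here; the loader maps every known rule_id.
_RULE_ID_TO_CATEGORY: dict[str, ATRCategory] = {
    "ATR-2026-00001": ATRCategory.PROMPT_INJECTION,
}

# Upstream fields preserved on each seed ("id" is an assumed key name).
_METADATA_KEYS = ("id", "original_rule_id", "technique", "detection_field", "variation_type")


@dataclass
class SeedRecord:
    """Hypothetical stand-in for PyRIT's SeedPrompt."""

    value: str
    harm_categories: list[str]
    metadata: dict[str, str] = field(default_factory=dict)


def rows_to_seeds(rows: list[dict[str, str]]) -> list[SeedRecord]:
    """Map each upstream row to one seed, skipping rows with unmapped rule IDs."""
    seeds: list[SeedRecord] = []
    skipped = 0
    for row in rows:
        category = _RULE_ID_TO_CATEGORY.get(row["original_rule_id"])
        if category is None:
            skipped += 1  # counted here, reported once after the loop
            continue
        seeds.append(
            SeedRecord(
                value=row["payload"],  # "payload" is an assumed key name
                harm_categories=[category.value],
                metadata={key: row[key] for key in _METADATA_KEYS},
            )
        )
    if skipped:
        logger.warning("Skipped %d entries with an unmapped original_rule_id", skipped)
    return seeds
```

Counting skipped rows and warning once keeps the log readable when an upstream release adds a whole new rule with many payloads.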

Review feedback addressed

Roman's three inline comments + PR-body note from 5/13:

  1. Enum as single source of truth (line 32 → now defined before the dict). _RULE_ID_TO_CATEGORY is now dict[str, ATRCategory] and stores enum members directly. harm_categories is built as [category.value] at the SeedPrompt construction site. A new test (test_rule_id_mapping_uses_enum) asserts the invariant at test time, so future edits to the dict stay aligned with the enum.
  2. Empty-filter footgun (line 187). Empty list now raises ValueError for all four filter arguments (categories, techniques, detection_fields, variation_types). The "pass None to include all" path remains. Four new tests cover the raises.
  3. Per-rule description (line 228). Description is now computed per-seed from the rule's own category label, so a _AgentThreatRulesDataset(categories=[ATRCategory.PROMPT_INJECTION]) returns seeds whose description references prompt injection only, not all six families. A new test (test_per_rule_description_reflects_category) asserts descriptions differ across categories and each seed's description references its own rule_id.
  4. PR-body note about 0_dataset.md — removed; the manual entry was already dropped in 44dce8b per DOC: Execute 1_loading_datasets notebook to populate cell outputs #1707.
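
For readers skimming the thread, the empty-filter and per-rule description contracts from items 2 and 3 can be sketched roughly as below. The helper names `check_filter` and `describe` are hypothetical and do not appear in the loader, and raising ValueError for a wrong enum type is an assumption (this thread only confirms ValueError for the empty-list case).

```python
from enum import Enum
from typing import Optional, Sequence


class ATRCategory(Enum):
    # Subset of the six categories, for brevity.
    PROMPT_INJECTION = "prompt-injection"
    TOOL_POISONING = "tool-poisoning"


def check_filter(name: str, values: Optional[Sequence[Enum]], enum_type: type) -> None:
    """Validate one filter argument: None disables it, an empty list is an error."""
    if values is None:
        return
    if len(values) == 0:
        raise ValueError(f"{name} must not be empty; pass None to disable the filter")
    for value in values:
        if not isinstance(value, enum_type):
            # Assumption: the real helper may raise a different exception type.
            raise ValueError(f"{name} expects {enum_type.__name__} members, got {value!r}")


def describe(category: ATRCategory, rule_id: str) -> str:
    """Build a description that names only the seed's own category and rule."""
    family = category.value.replace("-", " ")
    return f"Agent Threat Rules (ATR) adversarial payload in the {family} family. Rule {rule_id}."
```

With this shape, `check_filter("categories", [], ATRCategory)` raises instead of silently returning the full dataset, and `describe(ATRCategory.PROMPT_INJECTION, "ATR-2026-00001")` yields the example description quoted in the implementation notes.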

What this PR does NOT include (per #1702 discussion)

  • No scorer. Per @romanlutz's guidance to keep that separate, the ATR taxonomy scorer is a follow-up after this loader lands.
  • No HuggingFace mirror. Source is GitHub-hosted per the initial direction; a HuggingFace sibling release is straightforward to add later if users want it.
  • No taxonomy mapping into other PyRIT category schemas. ATR's taxonomy is preserved on harm_categories as-is per the same guidance.

Optional context for PyRIT users

ATR was recently integrated into MISP at two layers (merged 2026-05-10 by Alexandre Dulaunoy, MISP project lead):

Mentioning since PyRIT users routing red-team output into MISP-compatible threat-intel or CSIRT pipelines could benefit from the original_rule_id metadata on each SeedPrompt resolving natively as MISP machine tags downstream — no translation layer needed. Not required for the loader itself; just a downstream interop note.

Tests and Documentation

19 unit tests in tests/unit/datasets/test_agent_threat_rules_dataset.py covering:

  • Dataset name and SeedDataset construction
  • SeedPrompt field population including metadata
  • Missing-key validation raises
  • Unknown rule_id skipped with aggregate warning
  • All four filter axes (categories, techniques, detection_fields, variation_types)
  • Combined filters
  • Invalid enum types raise (3 tests)
  • Empty filter lists raise (4 new tests, one per filter axis)
  • Per-rule description reflects the seed's own category and rule_id
  • _RULE_ID_TO_CATEGORY values are ATRCategory instances

ruff check and ruff format --check both pass on the new files and the modified __init__.py.

A real-network fetch against the pinned upstream URL was verified locally: 1,054 seeds load with the expected category distribution (prompt-injection 496, context-exfiltration 186, skill-compromise 124, tool-poisoning 93, agent-manipulation 93, privilege-escalation 62).

The new loader will be picked up automatically by tests/end_to_end/test_all_datasets.py via SeedDatasetProvider.get_all_providers() discovery — no parametrization update needed there.

JupyText was not run because this PR does not touch any notebooks or doc/code/ .py files.

Adds a new remote dataset loader at pyrit/datasets/seed_datasets/remote/
agent_threat_rules_dataset.py that surfaces the ATR autoresearch corpus
(1,054 attack-payload entries across six ATR taxonomy categories) as a
PyRIT SeedDataset.

Implements proposal in microsoft#1702, per directional guidance in that issue:
- Source pinned to GitHub (not HuggingFace) for the initial cut
- ATR taxonomy preserved as-is on harm_categories
- Scorer kept separate as a follow-up after this loader lands
- No PyRIT core code modified

Adds 13 unit tests covering happy path, missing-key validation, the
unknown-rule_id skip path, all four filter axes (categories, techniques,
detection_fields, variation_types), and enum-validation errors.

Updates pyrit/datasets/seed_datasets/remote/__init__.py to register the
loader via SeedDatasetProvider.__init_subclass__, and adds a one-line
entry to doc/code/datasets/0_dataset.md.

ruff check + ruff format both clean. Real-network fetch verified locally
against the pinned upstream URL.
Comment thread doc/code/datasets/0_dataset.md Outdated
@romanlutz pointed out the manual entry in 0_dataset.md is a small
hardcoded subset; the canonical list is generated by re-executing
1_loading_datasets.ipynb (which his microsoft#1707 handles). Dropping the
manual line; auto-registration via SeedDatasetProvider already
ensures agent_threat_rules appears in the regenerated notebook
output once microsoft#1707 lands.
Contributor

@romanlutz romanlutz left a comment

Thanks for the PR — I ran it locally and the diff is small, well-tested, and slots in cleanly. Three inline comments below on the loader plus one note on the PR body.

Verified locally on 44dce8b0: 13 unit tests pass; the e2e test tests/end_to_end/test_all_datasets.py::TestAllDatasets::test_fetch_dataset[_AgentThreatRulesDataset-_AgentThreatRulesDataset] is auto-discovered and passes against the pinned commit; ruff is clean; _parse_metadata() returns a valid SeedDatasetMetadata. The upstream JSON shape matches the enums exactly (10 unique original_rule_ids — all mapped — 6 detection_field values, 2 variation_type values, 21 unique techniques).

One meta note: the PR description still claims a doc/code/datasets/0_dataset.md change, but commit 44dce8b0 ("DOC: drop manual 0_dataset.md entry per #1707") removed it. Could you refresh the PR body so reviewers don't go looking for a file that isn't there?

Comment thread pyrit/datasets/seed_datasets/remote/agent_threat_rules_dataset.py Outdated
Comment thread pyrit/datasets/seed_datasets/remote/agent_threat_rules_dataset.py
Comment thread pyrit/datasets/seed_datasets/remote/agent_threat_rules_dataset.py Outdated
eeee2345 and others added 2 commits May 13, 2026 22:04
Three refactors per @romanlutz's 5/13 review:

1. ATRCategory enum is now the single source of truth for category
   strings. _RULE_ID_TO_CATEGORY is typed dict[str, ATRCategory] and
   stores enum members directly, so a typo on either side becomes a
   static error rather than a silent data-quality bug at SeedPrompt
   construction. harm_categories is built via category.value at the
   construction site.

2. Empty filter lists ([]) now raise ValueError for all four filter
   arguments (categories, techniques, detection_fields, variation_types).
   The previous behavior — empty list silently disabled the filter and
   returned the full dataset — was a footgun. Pass None to disable.

3. Per-SeedPrompt description is computed from the seed's own category
   label and rule_id, so a filtered call returns seeds whose description
   references only the active family, not a corpus-wide list.

Five new unit tests cover the new contracts (empty-list raises x4 and
per-rule description). One additional invariant test asserts that
_RULE_ID_TO_CATEGORY values are ATRCategory instances.
@eeee2345
Author

Thanks for the thorough review. All three inline points + the PR body note are addressed in 5f4490c:

  1. Enum as single source of truth: _RULE_ID_TO_CATEGORY is now dict[str, ATRCategory] with enum members as values. harm_categories=[category.value] at the construction site. New invariant test test_rule_id_mapping_uses_enum, so future drift fails the test suite instead of passing silently.

  2. Empty-filter footgun — Empty list now raises ValueError for all four filter arguments. Pass None to disable. Four new tests cover the raises.

  3. Per-rule description — Computed per-seed from the rule's own category label and rule_id. A _AgentThreatRulesDataset(categories=[ATRCategory.PROMPT_INJECTION]) returns seeds whose description references only prompt-injection. New test test_per_rule_description_reflects_category asserts descriptions vary and each references its own rule_id.

  4. PR body — Refreshed to reflect 44dce8b dropping the manual 0_dataset.md entry, and the new 19-test count.

Branch is now caught up with main (via update-branch — local rebase hit unrelated add/add conflicts in tests/unit/scenario/ and uv.lock that are not in any path this PR touches).

Local verification on 5f4490c: 19 unit tests pass (13 original + 6 new), ruff check + ruff format both clean.

Total unit tests in this PR: 19.

@romanlutz
Contributor

Heads-up: I opened #1735 to rename fetch_dataset -> fetch_dataset_async across the SeedDatasetProvider hierarchy (style-guide _async suffix). It includes a backward-compat shim, so this PR will continue to work as-is even if #1735 lands first — the only side effect is a DeprecationWarning at class-definition time.

Once #1735 is merged, this PR will need a one-line follow-up: rename async def fetch_dataset -> async def fetch_dataset_async in pyrit/datasets/seed_datasets/remote/agent_threat_rules_dataset.py (line 228). No other changes needed; your tests already pass against the new name via the bridge.

No action required from you right now — just flagging so you're not surprised by the warning.
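
To illustrate why the warning appears at class-definition time, a backward-compat bridge of this kind can be sketched as below. This is an assumption about the general shape, not the code in #1735: `ProviderBase` and `LegacyLoader` are hypothetical names, and the real shim may be implemented differently.

```python
import asyncio
import warnings


class ProviderBase:
    """Hypothetical base class bridging the old method name to the new one."""

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        defines_old = "fetch_dataset" in cls.__dict__
        defines_new = "fetch_dataset_async" in cls.__dict__
        if defines_old and not defines_new:
            # Fires when the subclass body is executed, not when it is called.
            warnings.warn(
                f"{cls.__name__}.fetch_dataset is deprecated; rename it to fetch_dataset_async",
                DeprecationWarning,
                stacklevel=2,
            )
            # Alias the legacy coroutine function under the new name.
            cls.fetch_dataset_async = cls.__dict__["fetch_dataset"]

    async def fetch_dataset_async(self):
        raise NotImplementedError


# Capture the warning here only so the example runs quietly.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")

    class LegacyLoader(ProviderBase):
        async def fetch_dataset(self):
            return ["seed"]
```

Callers that already use the new name, such as `asyncio.run(LegacyLoader().fetch_dataset_async())`, keep working against a subclass that still defines the old one.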

@eeee2345
Author

Rename to fetch_dataset_async done in 1a0d186, plus branch caught up with main (which includes #1735's parent-class rename). One-line change in the loader (line 216) + 11 call-site updates in the test file. Test function names left at test_fetch_dataset_* per the upstream convention I see in test_aya_redteaming_dataset.py.

@eeee2345 eeee2345 marked this pull request as ready for review May 16, 2026 06:05