feat: upgrade pipeline to large-scale production-ready dataset with Pydantic schemas, Ollama semantic merging, and train/val/test splits#2

Draft
Copilot wants to merge 4 commits into copilot/build-cybersecurity-dataset-pipeline from copilot/upgrade-cybersecurity-dataset-pipeline

Copilot AI commented Apr 3, 2026

Upgrades the RedBlue-Data pipeline from a small-scale output of roughly 300–500 findings to a production-ready instruction-tuning dataset with external data merging, stratified splits, Pydantic validation, and benchmark readiness.

New modules

  • scripts/schemas.py — Pydantic v2 Finding, RedTeamPair, BlueTeamPair models with field validators (severity normalization, CWE null-coercion), to_jsonl_dict() helpers
  • scripts/merge_external.py — scans external_datasets/ for JSONL/JSON, maps to unified schema via heuristics or Ollama semantic mapping, deduplicates against existing records using TF-IDF cosine similarity (threshold-configurable)
  • scripts/generate_dataset.py — full orchestration CLI: processes all reports → optional external merge → optional Ollama enrichment → stratified train/val/test splits (default 80/10/10) → optional HF Hub push
# Basic
python scripts/generate_dataset.py --max-reports 100

# With external merge + Ollama
python scripts/generate_dataset.py --merge-external --use-ollama --ollama-model llama3.2

# Custom splits + push
python scripts/generate_dataset.py --split-ratios 0.8 0.1 0.1 --push-to-hub myuser/redblue-data
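
The validators described for scripts/schemas.py (severity normalization, CWE null-coercion) could look roughly like the sketch below. It uses the Pydantic v2 `field_validator` API; the field names, the allowed severity set, and the list of "null-like" CWE strings are assumptions for illustration, not the PR's actual definitions.

```python
from typing import Optional

from pydantic import BaseModel, field_validator

# Assumed set of canonical severities; the real schema may use a different list.
_SEVERITIES = {"critical", "high", "medium", "low", "info"}
# Assumed placeholder strings that should be stored as a missing CWE.
_NULL_CWE = {"", "none", "null", "n/a", "unknown"}


class Finding(BaseModel):
    # Hypothetical minimal field set; the PR's Finding model has more fields.
    title: str
    severity: str
    cwe: Optional[str] = None

    @field_validator("severity", mode="before")
    @classmethod
    def normalize_severity(cls, value):
        # Trim and lowercase so " HIGH " and "high" validate to the same value.
        text = str(value).strip().lower()
        if text == "informational":
            text = "info"
        if text not in _SEVERITIES:
            raise ValueError(f"unknown severity: {value!r}")
        return text

    @field_validator("cwe", mode="before")
    @classmethod
    def coerce_null_cwe(cls, value):
        # Map placeholder strings such as "N/A" to None.
        if value is None or str(value).strip().lower() in _NULL_CWE:
            return None
        return str(value).strip()

    def to_jsonl_dict(self) -> dict:
        # Plain dict suitable for json.dumps when writing one JSONL line.
        return self.model_dump()
```

Running the normalization in `mode="before"` means raw strings from heterogeneous sources are cleaned before type validation, so downstream code only ever sees canonical values.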

Output layout:

dataset/
├── raw/    findings.jsonl, red_team.jsonl, blue_team.jsonl
├── train/
├── val/
└── test/
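
A stratified 80/10/10 split that fills the train/, val/, and test/ directories could be sketched as follows. This is a standard-library illustration only: the function name, the choice of stratification key, and the fixed seed are assumptions, and the PR may well rely on scikit-learn instead.

```python
import random
from collections import defaultdict


def stratified_split(records, key, ratios=(0.8, 0.1, 0.1), seed=42):
    """Split records into (train, val, test), stratified by records[i][key]."""
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1.0")

    # Group records by the stratification label (for example, severity).
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, val, test = [], [], []
    for label in sorted(groups):
        items = groups[label]
        rng.shuffle(items)
        n_train = int(len(items) * ratios[0])
        n_val = int(len(items) * ratios[1])
        train += items[:n_train]
        val += items[n_train:n_train + n_val]
        # Whatever remains after rounding down goes to the test split.
        test += items[n_train + n_val:]
    return train, val, test
```

Splitting per label keeps rare classes represented in every split, which a plain random shuffle does not guarantee.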

Upgraded: ollama_enhancer.py

  • Added two modes: validation/correction (enhance_finding) and semantic mapping (map_external_record) for external record schema conversion
  • Added opt-in disk cache keyed by sha256(model:prompt) — avoids redundant LLM calls across runs; off by default (cache_dir=None)
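
The opt-in cache keyed by sha256(model:prompt) might work along these lines. This is a sketch under assumptions: `cached_call`, the one-file-per-key JSON layout, and the `call_llm` callback are hypothetical stand-ins, not the functions in ollama_enhancer.py.

```python
import hashlib
import json
from pathlib import Path


def cache_key(model: str, prompt: str) -> str:
    # Key format taken from the PR description: sha256 of "model:prompt".
    return hashlib.sha256(f"{model}:{prompt}".encode("utf-8")).hexdigest()


def cached_call(model, prompt, call_llm, cache_dir=None):
    """Return call_llm(model, prompt), reusing a cached response if present."""
    if cache_dir is None:
        # Cache disabled (the default): always call the model.
        return call_llm(model, prompt)

    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{cache_key(model, prompt)}.json"

    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))["response"]

    response = call_llm(model, prompt)
    path.write_text(json.dumps({"response": response}), encoding="utf-8")
    return response
```

Including the model name in the key means switching models never returns a response generated by a different model.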

New: benchmarks/

Five benchmark task scripts (RB-1 to RB-5): vulnerability classification (Accuracy/Macro-F1), severity prediction (Weighted-F1), exploit generation (ROUGE-L), remediation quality (ROUGE-L), dataset statistics. Shared ROUGE-L implementation extracted to benchmarks/metrics.py.
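
For reference, ROUGE-L scores a candidate against a reference by the length of their longest common subsequence (LCS). The sketch below shows one common formulation, the balanced F1 over whitespace tokens; the tokenization and the function names are assumptions and may differ from benchmarks/metrics.py.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of sequences a and b."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, candidate: str) -> float:
    """ROUGE-L F1 over lowercased whitespace tokens (assumed tokenization)."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Keeping this in one shared module ensures the exploit-generation and remediation-quality tasks report comparable numbers.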

Documentation

  • dataset/README.md — full HF-style dataset card: schema tables, sample records, datasets.load_dataset + SFTTrainer usage, benchmark table, external dataset citations
  • README.md — updated Quick Start with external dataset download steps and generate_dataset.py examples

Dependencies added

pydantic>=2.0, tqdm>=4.60.0, pyyaml>=6.0, scikit-learn>=1.3.0

Copilot AI and others added 3 commits April 3, 2026 07:34
Agent-Logs-Url: https://github.com/djmahe4/RedBlue-Data/sessions/182102b2-75c2-4f83-9511-83f7db527b06

Co-authored-by: djmahe4 <137691824+djmahe4@users.noreply.github.com>
…dataset, benchmarks, update READMEs

Agent-Logs-Url: https://github.com/djmahe4/RedBlue-Data/sessions/182102b2-75c2-4f83-9511-83f7db527b06

Co-authored-by: djmahe4 <137691824+djmahe4@users.noreply.github.com>
…act shared ROUGE-L metric

Agent-Logs-Url: https://github.com/djmahe4/RedBlue-Data/sessions/182102b2-75c2-4f83-9511-83f7db527b06

Co-authored-by: djmahe4 <137691824+djmahe4@users.noreply.github.com>

djmahe4 commented Apr 3, 2026

/gemini review

gemini-code-assist (bot) left a comment

Code Review

This pull request introduces a comprehensive dataset generation pipeline and benchmarking suite, featuring stratified splitting, external dataset integration with TF-IDF deduplication, and Pydantic-based validation. The Ollama integration was enhanced with response caching and semantic mapping. Review feedback focused on optimizing performance by reusing thread pools and vectorizing similarity calculations, exposing the new cache via CLI arguments, and refining regex patterns for more robust JSON parsing from LLM outputs.

Comment on lines +140 to +145
from concurrent.futures import ThreadPoolExecutor
from ollama_enhancer import get_recommended_workers
max_workers = get_recommended_workers(cfg.ollama_model)
with ThreadPoolExecutor(max_workers=max_workers) as executor:
    futures = [executor.submit(ollama_enhance, rec, model=cfg.ollama_model) for rec in records]
    records = [f.result() for f in futures]

high

Creating a new ThreadPoolExecutor and calling get_recommended_workers inside _process_report is highly inefficient. get_recommended_workers performs a network benchmark against the Ollama API, which will now execute for every single report processed in the loop at line 474.

Improvement:

  1. Call get_recommended_workers once in main.
  2. Create a single ThreadPoolExecutor in main and pass it (or a mapping function) to _process_report to reuse threads and avoid redundant benchmarking.
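
The two steps above can be sketched as follows. The names `process_report` and `run_pipeline` are hypothetical, `enhance` stands in for `ollama_enhance` with the model already bound (for example via `functools.partial`), and `max_workers` is assumed to come from a single `get_recommended_workers` call made once in `main`.

```python
from concurrent.futures import ThreadPoolExecutor


def process_report(records, enhance, executor):
    """Enhance one report's records on a shared executor, preserving order."""
    # executor.map yields results in input order, like the original
    # submit/result list comprehension.
    return list(executor.map(enhance, records))


def run_pipeline(reports, enhance, max_workers):
    """Process every report with one pool instead of one pool per report."""
    results = []
    # The pool is created once, so threads are reused across all reports
    # and the worker-count benchmark is not repeated inside the loop.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        for records in reports:
            results.append(process_report(records, enhance, executor))
    return results
```

Passing the executor in, rather than creating it inside the per-report function, also makes that function easy to test with a small pool.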


sys.path.insert(0, str(Path(__file__).parent))

from config import Config

medium

The Config class is imported but not used in this script. GenerateConfig is defined and used locally instead. Removing unused imports helps maintain code clarity.

help="Minimum word count for extracted text")
parser.add_argument("--use-ollama", action="store_true", default=False,
                    help="Enable Ollama for semantic enrichment")
parser.add_argument("--ollama-model", type=str, default="llama3.2")

medium

The ollama_enhancer module now supports an on-disk response cache, but there is no command-line argument in this script to enable it. Consider adding an --ollama-cache-dir argument and passing it to the enhancement functions to leverage the caching feature.

Suggested change
- parser.add_argument("--ollama-model", type=str, default="llama3.2")
+ parser.add_argument("--ollama-model", type=str, default="llama3.2")
+ parser.add_argument("--ollama-cache-dir", type=Path, default=None,
+                     help="Directory for Ollama response cache (None to disable)")

Comment thread scripts/merge_external.py
Comment on lines +108 to +112
for idx, new_vec in enumerate(new_matrix):
    # Compare against existing records
    sims_existing = cosine_similarity(new_vec, existing_matrix).flatten()
    if sims_existing.max() >= threshold:
        continue

medium

This loop performs a separate cosine_similarity call for every new record against the entire existing_matrix. This is a performance bottleneck for large datasets.

It is more efficient to compute the similarity for the entire new_matrix against existing_matrix in a single vectorized call before the loop, and then check the pre-computed maximum similarity for each record during iteration.
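
A minimal sketch of that vectorized check, assuming dense NumPy arrays and a non-empty `existing_matrix`; the helper names are hypothetical. With the sparse TF-IDF matrices used in the script, a single `cosine_similarity(new_matrix, existing_matrix)` call from scikit-learn returns the same (n_new, n_existing) similarity matrix.

```python
import numpy as np


def max_similarity_to_existing(new_matrix, existing_matrix):
    """For each new row, return its highest cosine similarity to any existing row."""
    new = np.asarray(new_matrix, dtype=float)
    existing = np.asarray(existing_matrix, dtype=float)

    # L2-normalize rows; the clip avoids division by zero for all-zero rows.
    new = new / np.clip(np.linalg.norm(new, axis=1, keepdims=True), 1e-12, None)
    existing = existing / np.clip(
        np.linalg.norm(existing, axis=1, keepdims=True), 1e-12, None
    )

    # One matrix product yields every pairwise cosine similarity at once.
    sims = new @ existing.T
    return sims.max(axis=1)


def keep_indices(new_matrix, existing_matrix, threshold):
    """Indices of new rows that are NOT near-duplicates of an existing row."""
    max_sims = max_similarity_to_existing(new_matrix, existing_matrix)
    return [i for i, s in enumerate(max_sims) if s < threshold]
```

The loop then only needs to look up a precomputed maximum per record. Any comparison among the new records themselves would still need separate handling.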

pass

# Extract first JSON object from the response (matches across newlines)
match = re.search(r"\{.*\}", raw, re.DOTALL)

medium

The regex r"\{.*\}" is greedy. If the LLM response contains multiple JSON blocks or trailing text with a closing brace, it will capture everything from the first { to the last }, which often results in invalid JSON. Using a non-greedy match r"\{.*?\}" to find the first JSON object is generally more robust.

Suggested change
- match = re.search(r"\{.*\}", raw, re.DOTALL)
+ match = re.search(r"\{.*?\}", raw, re.DOTALL)
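
One caveat with the non-greedy pattern is that it stops at the first closing brace, so a nested object such as {"a": {"b": 1}} is cut short. A regex-free alternative, offered here only as a sketch and not as part of the suggested change, is to let the standard library's `json.JSONDecoder.raw_decode` parse from each candidate opening brace; the function name below is hypothetical.

```python
import json


def extract_first_json_object(raw: str):
    """Return the first parseable JSON object in raw, or None if there is none."""
    decoder = json.JSONDecoder()
    start = raw.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at index `start`
            # and ignores any trailing text after it.
            obj, _end = decoder.raw_decode(raw, start)
        except json.JSONDecodeError:
            pass
        else:
            if isinstance(obj, dict):
                return obj
        # Not valid JSON from this brace; try the next opening brace.
        start = raw.find("{", start + 1)
    return None
```

This handles nested objects, trailing prose, and multiple JSON blocks in one response, at the cost of retrying from each brace when the output is malformed.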

Removed optional red-teaming dataset download instruction.