feat: upgrade pipeline to large-scale production-ready dataset with Pydantic schemas, Ollama semantic merging, and train/val/test splits by Copilot · Pull Request #2 · djmahe4/RedBlue-Data

Copilot · 2026-04-03T11:16:48Z

Upgrades the RedBlue-Data pipeline from a small-scale output of roughly 300–500 findings to a production-ready instruction-tuning dataset, adding external data merging, stratified splits, Pydantic validation, and benchmark readiness.

New modules

scripts/schemas.py — Pydantic v2 Finding, RedTeamPair, BlueTeamPair models with field validators (severity normalization, CWE null-coercion), to_jsonl_dict() helpers
scripts/merge_external.py — scans external_datasets/ for JSONL/JSON, maps to unified schema via heuristics or Ollama semantic mapping, deduplicates against existing records using TF-IDF cosine similarity (threshold-configurable)
scripts/generate_dataset.py — full orchestration CLI: processes all reports → optional external merge → optional Ollama enrichment → stratified train/val/test splits (default 80/10/10) → optional HF Hub push
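As a rough illustration of the validators described for scripts/schemas.py, here is a minimal Pydantic v2 sketch. The field names (title, severity, cwe), the severity vocabulary, and the alias table are assumptions made for the example; the actual models in the PR may differ.

```python
from typing import Optional

from pydantic import BaseModel, field_validator

# Assumed vocabulary and aliases; the real mapping in schemas.py may differ.
_SEVERITY_ALIASES = {"crit": "critical", "med": "medium", "moderate": "medium"}
_SEVERITIES = {"critical", "high", "medium", "low", "informational"}


class Finding(BaseModel):
    title: str
    severity: str
    cwe: Optional[str] = None

    @field_validator("severity", mode="before")
    @classmethod
    def normalize_severity(cls, value):
        # Lowercase, trim, and map common aliases onto the canonical labels.
        text = str(value).strip().lower()
        text = _SEVERITY_ALIASES.get(text, text)
        if text not in _SEVERITIES:
            raise ValueError(f"unknown severity: {value!r}")
        return text

    @field_validator("cwe", mode="before")
    @classmethod
    def coerce_empty_cwe(cls, value):
        # Treat placeholder strings as a missing CWE.
        if value is None:
            return None
        text = str(value).strip()
        if text.lower() in {"", "n/a", "none", "null"}:
            return None
        return text

    def to_jsonl_dict(self) -> dict:
        # One plain dict per record, ready for json.dumps on a JSONL line.
        return self.model_dump()
```

Running the validators in "before" mode means messy inputs such as " CRIT " or "N/A" are cleaned up before type checking, so downstream code only ever sees canonical values.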

# Basic
python scripts/generate_dataset.py --max-reports 100

# With external merge + Ollama
python scripts/generate_dataset.py --merge-external --use-ollama --ollama-model llama3.2

# Custom splits + push
python scripts/generate_dataset.py --split-ratios 0.8 0.1 0.1 --push-to-hub myuser/redblue-data

Output layout:

dataset/
├── raw/    findings.jsonl, red_team.jsonl, blue_team.jsonl
├── train/
├── val/
└── test/
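The stratified split that populates train/, val/, and test/ could look roughly like the sketch below. It uses only the standard library; the function name, the choice of stratification key, and the fixed seed are assumptions for illustration, not the PR's implementation.

```python
import random
from collections import defaultdict


def stratified_split(records, key, ratios=(0.8, 0.1, 0.1), seed=42):
    """Split records into train/val/test while keeping per-class proportions of `key`."""
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1.0")

    # Group records by the stratification label.
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec)

    rng = random.Random(seed)
    splits = {"train": [], "val": [], "test": []}
    for label in sorted(groups):
        items = groups[label]
        rng.shuffle(items)
        n_train = int(len(items) * ratios[0])
        n_val = int(len(items) * ratios[1])
        splits["train"].extend(items[:n_train])
        splits["val"].extend(items[n_train:n_train + n_val])
        # Whatever remains after truncation goes to the test split.
        splits["test"].extend(items[n_train + n_val:])
    return splits
```

Splitting each class separately keeps rare labels represented in every split, which a plain random shuffle over the whole dataset does not guarantee.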

Upgraded: `ollama_enhancer.py`

Added two modes: validation/correction (enhance_finding) and semantic mapping (map_external_record) for external record schema conversion
Added opt-in disk cache keyed by sha256(model:prompt) — avoids redundant LLM calls across runs; off by default (cache_dir=None)
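A minimal sketch of an opt-in disk cache keyed by sha256(model:prompt) follows. The helper names (cache_key, cached_call), the one-file-per-key JSON layout, and the call_llm callback are hypothetical; only the key scheme and the cache_dir=None default come from the description above.

```python
import hashlib
import json
from pathlib import Path


def cache_key(model: str, prompt: str) -> str:
    # Including the model name keeps responses from different models apart.
    return hashlib.sha256(f"{model}:{prompt}".encode("utf-8")).hexdigest()


def cached_call(model, prompt, call_llm, cache_dir=None):
    """Return call_llm(model, prompt), reusing a cached response when one exists."""
    if cache_dir is None:
        # Cache disabled (the default): always call the model.
        return call_llm(model, prompt)

    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{cache_key(model, prompt)}.json"

    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))["response"]

    response = call_llm(model, prompt)
    path.write_text(json.dumps({"response": response}), encoding="utf-8")
    return response
```

Because the key is derived from both the model and the prompt, switching models never returns a stale answer produced by a different model.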

New: `benchmarks/`

Five benchmark task scripts (RB-1 to RB-5): vulnerability classification (Accuracy/Macro-F1), severity prediction (Weighted-F1), exploit generation (ROUGE-L), remediation quality (ROUGE-L), dataset statistics. Shared ROUGE-L implementation extracted to benchmarks/metrics.py.
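For reference, ROUGE-L scores a candidate against a reference by the length of their longest common subsequence (LCS). The sketch below is one plausible shape for a shared metric, assuming whitespace tokenization, lowercasing, and an equally weighted F1; the actual benchmarks/metrics.py may tokenize or weight differently.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, candidate: str) -> float:
    """ROUGE-L F1 over whitespace tokens (tokenization is an assumption)."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

For example, "the cat sat on the mat" versus "the cat lay on the mat" share a five-token subsequence out of six tokens on each side, so precision, recall, and F1 all equal 5/6.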

Documentation

dataset/README.md — full HF-style dataset card: schema tables, sample records, datasets.load_dataset + SFTTrainer usage, benchmark table, external dataset citations
README.md — updated Quick Start with external dataset download steps and generate_dataset.py examples

Dependencies added

pydantic>=2.0, tqdm>=4.60.0, pyyaml>=6.0, scikit-learn>=1.3.0

Agent-Logs-Url: https://github.com/djmahe4/RedBlue-Data/sessions/182102b2-75c2-4f83-9511-83f7db527b06 Co-authored-by: djmahe4 <137691824+djmahe4@users.noreply.github.com>


djmahe4 · 2026-04-03T11:21:09Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces a comprehensive dataset generation pipeline and benchmarking suite, featuring stratified splitting, external dataset integration with TF-IDF deduplication, and Pydantic-based validation. The Ollama integration was enhanced with response caching and semantic mapping. Review feedback focused on optimizing performance by reusing thread pools and vectorizing similarity calculations, exposing the new cache via CLI arguments, and refining regex patterns for more robust JSON parsing from LLM outputs.

gemini-code-assist · 2026-04-03T11:23:03Z

+        from concurrent.futures import ThreadPoolExecutor
+        from ollama_enhancer import get_recommended_workers
+        max_workers = get_recommended_workers(cfg.ollama_model)
+        with ThreadPoolExecutor(max_workers=max_workers) as executor:
+            futures = [executor.submit(ollama_enhance, rec, model=cfg.ollama_model) for rec in records]
+            records = [f.result() for f in futures]


Creating a new ThreadPoolExecutor and calling get_recommended_workers inside _process_report is highly inefficient. get_recommended_workers performs a network benchmark against the Ollama API, which will now execute for every single report processed in the loop at line 474.

Improvement:

Call get_recommended_workers once in main.

Create a single ThreadPoolExecutor in main and pass it (or a mapping function) to _process_report to reuse threads and avoid redundant benchmarking.
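The suggested restructuring might look like the sketch below. Both get_recommended_workers and ollama_enhance are replaced by trivial stand-ins so the example runs on its own, and process_report is a hypothetical simplification of _process_report; the point is only where the benchmark and the pool are created.

```python
from concurrent.futures import ThreadPoolExecutor


def get_recommended_workers(model: str) -> int:
    """Stand-in for the real benchmark; returns a constant here."""
    return 4


def ollama_enhance(record: dict, model: str) -> dict:
    """Stand-in for the real enrichment call."""
    return {**record, "enhanced_by": model}


def process_report(records, executor, model):
    # executor.map preserves input order, like the original list of futures.
    return list(executor.map(lambda rec: ollama_enhance(rec, model=model), records))


def main(reports, model="llama3.2"):
    # Benchmark once and build one pool for the whole run.
    max_workers = get_recommended_workers(model)
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return [process_report(records, executor, model) for records in reports]
```

With this layout the worker benchmark runs once per invocation rather than once per report, and threads are reused across reports.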

gemini-code-assist · 2026-04-03T11:23:03Z

+
+sys.path.insert(0, str(Path(__file__).parent))
+
+from config import Config


The Config class is imported but not used in this script. GenerateConfig is defined and used locally instead. Removing unused imports helps maintain code clarity.

gemini-code-assist · 2026-04-03T11:23:03Z

+                        help="Minimum word count for extracted text")
+    parser.add_argument("--use-ollama", action="store_true", default=False,
+                        help="Enable Ollama for semantic enrichment")
+    parser.add_argument("--ollama-model", type=str, default="llama3.2")


The ollama_enhancer module now supports an on-disk response cache, but there is no command-line argument in this script to enable it. Consider adding an --ollama-cache-dir argument and passing it to the enhancement functions to leverage the caching feature.

Suggested change

-    parser.add_argument("--ollama-model", type=str, default="llama3.2")
+    parser.add_argument("--ollama-model", type=str, default="llama3.2")
+    parser.add_argument("--ollama-cache-dir", type=Path, default=None,
+                        help="Directory for Ollama response cache (None to disable)")

gemini-code-assist · 2026-04-03T11:23:03Z

+    for idx, new_vec in enumerate(new_matrix):
+        # Compare against existing records
+        sims_existing = cosine_similarity(new_vec, existing_matrix).flatten()
+        if sims_existing.max() >= threshold:
+            continue


This loop performs a separate cosine_similarity call for every new record against the entire existing_matrix. This is a performance bottleneck for large datasets.

It is more efficient to compute the similarity for the entire new_matrix against existing_matrix in a single vectorized call before the loop, and then check the pre-computed maximum similarity for each record during iteration.
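A vectorized version of the check against existing records could be sketched as follows. The function name, the 0.85 default threshold, and fitting the vectorizer on both corpora are assumptions; the sketch covers only the comparison against existing records, not any comparison among the new records themselves.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def filter_near_duplicates(existing_texts, new_texts, threshold=0.85):
    """Return indices of new_texts that are not near-duplicates of existing_texts."""
    existing_texts, new_texts = list(existing_texts), list(new_texts)
    if not new_texts:
        return []
    if not existing_texts:
        return list(range(len(new_texts)))

    vectorizer = TfidfVectorizer().fit(existing_texts + new_texts)
    existing_matrix = vectorizer.transform(existing_texts)
    new_matrix = vectorizer.transform(new_texts)

    # One call yields an (n_new, n_existing) similarity matrix.
    sims = cosine_similarity(new_matrix, existing_matrix)
    max_sims = sims.max(axis=1)  # best match per new record

    return [idx for idx, sim in enumerate(max_sims) if sim < threshold]
```

The single call trades memory for speed: the full similarity matrix is materialized densely, so very large corpora may need to be processed in row chunks.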

gemini-code-assist · 2026-04-03T11:23:03Z

        pass

-    # Extract first JSON object from the response (matches across newlines)
    match = re.search(r"\{.*\}", raw, re.DOTALL)


The regex r"\{.*\}" is greedy. If the LLM response contains multiple JSON blocks or trailing text with a closing brace, it will capture everything from the first { to the last }, which often results in invalid JSON. Using a non-greedy match r"\{.*?\}" to find the first JSON object is generally more robust.

Suggested change

-    match = re.search(r"\{.*\}", raw, re.DOTALL)
+    match = re.search(r"\{.*?\}", raw, re.DOTALL)
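One caveat with the non-greedy pattern is that it stops at the first closing brace, so a response containing a nested object is cut off mid-structure. An alternative that avoids regex entirely is to let the standard library's JSON decoder find where the object ends. The helper below is a hypothetical sketch, not code from the PR.

```python
import json
from typing import Optional


def extract_first_json_object(raw: str) -> Optional[dict]:
    """Return the first valid JSON object embedded in raw, or None if there is none."""
    decoder = json.JSONDecoder()
    start = raw.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and
            # ignores any trailing text after it.
            obj, _end = decoder.raw_decode(raw, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        # Not valid JSON at this brace; try the next one.
        start = raw.find("{", start + 1)
    return None
```

Because the decoder tracks nesting and string escapes itself, this handles nested objects, braces inside string values, and trailing commentary after the JSON.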


Copilot AI and others added 3 commits April 3, 2026 07:34

chore: start large-scale pipeline upgrade

428c9c7

Agent-Logs-Url: https://github.com/djmahe4/RedBlue-Data/sessions/182102b2-75c2-4f83-9511-83f7db527b06 Co-authored-by: djmahe4 <137691824+djmahe4@users.noreply.github.com>

feat: add Pydantic schemas, Ollama caching, merge_external, generate_…

b01d0e1

…dataset, benchmarks, update READMEs Agent-Logs-Url: https://github.com/djmahe4/RedBlue-Data/sessions/182102b2-75c2-4f83-9511-83f7db527b06 Co-authored-by: djmahe4 <137691824+djmahe4@users.noreply.github.com>

fix: address code review - rename validators, model-keyed cache, extr…

c03a733

…act shared ROUGE-L metric Agent-Logs-Url: https://github.com/djmahe4/RedBlue-Data/sessions/182102b2-75c2-4f83-9511-83f7db527b06 Co-authored-by: djmahe4 <137691824+djmahe4@users.noreply.github.com>

Copilot AI assigned Copilot and djmahe4 Apr 3, 2026

Copilot created this pull request from a session on behalf of djmahe4 April 3, 2026 11:16

gemini-code-assist Bot reviewed Apr 3, 2026


Update README to remove red-teaming dataset section

b8f55af

Removed optional red-teaming dataset download instruction.

Copilot wants to merge 4 commits into copilot/build-cybersecurity-dataset-pipeline from copilot/upgrade-cybersecurity-dataset-pipeline
