feat: upgrade pipeline to large-scale production-ready dataset with Pydantic schemas, Ollama semantic merging, and train/val/test splits #2
Agent-Logs-Url: https://github.com/djmahe4/RedBlue-Data/sessions/182102b2-75c2-4f83-9511-83f7db527b06 Co-authored-by: djmahe4 <137691824+djmahe4@users.noreply.github.com>
…dataset, benchmarks, update READMEs Agent-Logs-Url: https://github.com/djmahe4/RedBlue-Data/sessions/182102b2-75c2-4f83-9511-83f7db527b06 Co-authored-by: djmahe4 <137691824+djmahe4@users.noreply.github.com>
…act shared ROUGE-L metric Agent-Logs-Url: https://github.com/djmahe4/RedBlue-Data/sessions/182102b2-75c2-4f83-9511-83f7db527b06 Co-authored-by: djmahe4 <137691824+djmahe4@users.noreply.github.com>
/gemini review
Code Review
This pull request introduces a comprehensive dataset generation pipeline and benchmarking suite, featuring stratified splitting, external dataset integration with TF-IDF deduplication, and Pydantic-based validation. The Ollama integration was enhanced with response caching and semantic mapping. Review feedback focused on optimizing performance by reusing thread pools and vectorizing similarity calculations, exposing the new cache via CLI arguments, and refining regex patterns for more robust JSON parsing from LLM outputs.
    from concurrent.futures import ThreadPoolExecutor
    from ollama_enhancer import get_recommended_workers
    max_workers = get_recommended_workers(cfg.ollama_model)
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(ollama_enhance, rec, model=cfg.ollama_model) for rec in records]
        records = [f.result() for f in futures]
Creating a new ThreadPoolExecutor and calling get_recommended_workers inside _process_report is highly inefficient. get_recommended_workers performs a network benchmark against the Ollama API, which will now execute for every single report processed in the loop at line 474.
Improvement:
- Call get_recommended_workers once in main.
- Create a single ThreadPoolExecutor in main and pass it (or a mapping function) to _process_report to reuse threads and avoid redundant benchmarking.
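The restructuring suggested above can be sketched as follows. This is a minimal illustration under stated assumptions, not the PR's code: `process_report`, `enhance`, and `recommended_workers` are hypothetical stand-ins for `_process_report`, `ollama_enhance`, and `get_recommended_workers`, and the network benchmark is replaced by a counter so the effect of hoisting it is visible.

```python
from concurrent.futures import ThreadPoolExecutor

benchmark_calls = 0  # counts how often the (expensive) worker benchmark runs


def recommended_workers(model: str) -> int:
    """Hypothetical stand-in for get_recommended_workers (no network access)."""
    global benchmark_calls
    benchmark_calls += 1
    return 4


def enhance(record: dict, model: str) -> dict:
    """Hypothetical stand-in for ollama_enhance: tags the record with the model."""
    return {**record, "enhanced_by": model}


def process_report(records: list[dict], executor: ThreadPoolExecutor, model: str) -> list[dict]:
    """Enhance one report's records on a pool owned by the caller."""
    # executor.map preserves input order and re-raises worker exceptions.
    return list(executor.map(lambda rec: enhance(rec, model), records))


def main(reports: list[list[dict]], model: str = "llama3.2") -> list[list[dict]]:
    max_workers = recommended_workers(model)  # benchmark once, not once per report
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return [process_report(recs, executor, model) for recs in reports]


results = main([[{"id": 1}, {"id": 2}], [{"id": 3}]])
```

Passing the executor in, rather than creating it inside the per-report function, keeps thread start-up and the benchmark out of the loop while leaving the per-report logic unchanged.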
    sys.path.insert(0, str(Path(__file__).parent))

    from config import Config

                        help="Minimum word count for extracted text")
    parser.add_argument("--use-ollama", action="store_true", default=False,
                        help="Enable Ollama for semantic enrichment")
    parser.add_argument("--ollama-model", type=str, default="llama3.2")
The ollama_enhancer module now supports an on-disk response cache, but there is no command-line argument in this script to enable it. Consider adding an --ollama-cache-dir argument and passing it to the enhancement functions to leverage the caching feature.
    - parser.add_argument("--ollama-model", type=str, default="llama3.2")
    + parser.add_argument("--ollama-model", type=str, default="llama3.2")
    + parser.add_argument("--ollama-cache-dir", type=Path, default=None,
    +                     help="Directory for Ollama response cache (None to disable)")
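A sketch of how such a flag might feed an on-disk cache. The `sha256(model:prompt)` key follows the PR description; the `cached_call` helper, the one-JSON-file-per-key layout, and the `fake_llm` function are assumptions made for illustration and are not the actual `ollama_enhancer` API.

```python
import argparse
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Callable, Optional


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    parser.add_argument("--ollama-model", type=str, default="llama3.2")
    parser.add_argument("--ollama-cache-dir", type=Path, default=None,
                        help="Directory for Ollama response cache (None to disable)")
    return parser


def cached_call(prompt: str, model: str, cache_dir: Optional[Path],
                call: Callable[[str, str], dict]) -> dict:
    """Hypothetical helper: return a cached response, or call the model and store the result."""
    if cache_dir is None:  # caching disabled
        return call(model, prompt)
    key = hashlib.sha256(f"{model}:{prompt}".encode("utf-8")).hexdigest()
    path = cache_dir / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    response = call(model, prompt)
    cache_dir.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(response), encoding="utf-8")
    return response


calls = []  # records every real (uncached) model invocation


def fake_llm(model: str, prompt: str) -> dict:
    """Stand-in for the Ollama request so the sketch runs offline."""
    calls.append(prompt)
    return {"model": model, "answer": prompt.upper()}


with tempfile.TemporaryDirectory() as tmp:
    args = build_parser().parse_args(["--ollama-cache-dir", tmp])
    first = cached_call("classify this finding", args.ollama_model, args.ollama_cache_dir, fake_llm)
    second = cached_call("classify this finding", args.ollama_model, args.ollama_cache_dir, fake_llm)
```

The second call is served from disk, so the model is invoked only once for an identical model and prompt pair.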
    for idx, new_vec in enumerate(new_matrix):
        # Compare against existing records
        sims_existing = cosine_similarity(new_vec, existing_matrix).flatten()
        if sims_existing.max() >= threshold:
            continue
This loop performs a separate cosine_similarity call for every new record against the entire existing_matrix. This is a performance bottleneck for large datasets.
It is more efficient to compute the similarity for the entire new_matrix against existing_matrix in a single vectorized call before the loop, and then check the pre-computed maximum similarity for each record during iteration.
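The vectorized variant described above might look like the following sketch. It covers only the comparison against existing records shown in the snippet; the `filter_new_records` name and the small dense arrays are illustrative assumptions (the pipeline's TF-IDF matrices would normally be sparse, which `cosine_similarity` also accepts).

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity


def filter_new_records(new_matrix, existing_matrix, threshold: float) -> list[int]:
    """Return indices of new rows whose best match in existing_matrix is below threshold."""
    # A single call yields an (n_new, n_existing) similarity matrix.
    sims = cosine_similarity(new_matrix, existing_matrix)
    max_sims = sims.max(axis=1)  # best existing match for each new record
    return [idx for idx, best in enumerate(max_sims) if best < threshold]


existing = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
new = np.array([[1.0, 0.0, 0.0],   # exact duplicate of existing row 0
                [0.0, 0.0, 1.0],   # unrelated to every existing row
                [0.9, 0.1, 0.0]])  # near-duplicate of existing row 0
kept = filter_new_records(new, existing, threshold=0.85)
```

The full similarity matrix has one entry per (new, existing) pair, so for very large inputs it may be worth computing it in row batches to bound memory use.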
        pass

    # Extract first JSON object from the response (matches across newlines)
    match = re.search(r"\{.*\}", raw, re.DOTALL)
The regex r"\{.*\}" is greedy. If the LLM response contains multiple JSON blocks or trailing text with a closing brace, it will capture everything from the first { to the last }, which often results in invalid JSON. Using a non-greedy match r"\{.*?\}" to find the first JSON object avoids this over-capture, but it stops at the first }, so it only works for flat objects and truncates any object that contains nested braces.
    - match = re.search(r"\{.*\}", raw, re.DOTALL)
    + match = re.search(r"\{.*?\}", raw, re.DOTALL)
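Where responses may contain nested objects, one alternative to either regex is to let the JSON parser decide where the object ends. The sketch below uses the standard library's `json.JSONDecoder.raw_decode`; the `extract_first_json_object` helper is a hypothetical name and is not part of the PR.

```python
import json
from typing import Optional


def extract_first_json_object(raw: str) -> Optional[dict]:
    """Return the first parseable JSON object embedded in raw, or None."""
    decoder = json.JSONDecoder()
    start = raw.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value beginning at `start` and ignores what follows.
            obj, _end = decoder.raw_decode(raw, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        start = raw.find("{", start + 1)  # try the next candidate brace
    return None


reply = 'Sure! {"severity": "high", "meta": {"cwe": "CWE-79"}} Let me know if {this} helps.'
parsed = extract_first_json_object(reply)
```

On this sample reply the greedy pattern would capture through the brace in `{this}`, and the non-greedy pattern would stop before the outer object is closed; the parser-based approach returns the complete nested object.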
Removed optional red-teaming dataset download instruction.
Upgrades the RedBlue-Data pipeline from a small-scale output of ~300–500 findings to a production-ready instruction-tuning dataset with external data merging, stratified splits, Pydantic validation, and benchmark readiness.
New modules
- scripts/schemas.py: Pydantic v2 Finding, RedTeamPair, BlueTeamPair models with field validators (severity normalization, CWE null-coercion) and to_jsonl_dict() helpers
- scripts/merge_external.py: scans external_datasets/ for JSONL/JSON, maps to unified schema via heuristics or Ollama semantic mapping, deduplicates against existing records using TF-IDF cosine similarity (threshold-configurable)
- scripts/generate_dataset.py: full orchestration CLI that processes all reports → optional external merge → optional Ollama enrichment → stratified train/val/test splits (default 80/10/10) → optional HF Hub push

Output layout:

Upgraded: ollama_enhancer.py
- … enhance_finding) and semantic mapping (map_external_record) for external record schema conversion
- … sha256(model:prompt), avoiding redundant LLM calls across runs; off by default (cache_dir=None)

New: benchmarks/
Five benchmark task scripts (RB-1 to RB-5): vulnerability classification (Accuracy/Macro-F1), severity prediction (Weighted-F1), exploit generation (ROUGE-L), remediation quality (ROUGE-L), dataset statistics. Shared ROUGE-L implementation extracted to benchmarks/metrics.py.

Documentation
- dataset/README.md: full HF-style dataset card with schema tables, sample records, datasets.load_dataset + SFTTrainer usage, benchmark table, external dataset citations
- README.md: updated Quick Start with external dataset download steps and generate_dataset.py examples

Dependencies added
pydantic>=2.0, tqdm>=4.60.0, pyyaml>=6.0, scikit-learn>=1.3.0
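The stratified train/val/test split listed for scripts/generate_dataset.py could work roughly as sketched here. The description does not say which field the split is stratified on or how rounding is handled, so the `severity` key, the `stratified_split` function, and the remainder-goes-to-test rule are all assumptions for illustration.

```python
import random
from collections import defaultdict


def stratified_split(records, key, ratios=(0.8, 0.1, 0.1), seed=42):
    """Split records into train/val/test while keeping each stratum's proportions."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, val, test = [], [], []
    for label in sorted(groups):
        items = groups[label]
        rng.shuffle(items)
        n_train = int(len(items) * ratios[0])
        n_val = int(len(items) * ratios[1])
        train.extend(items[:n_train])
        val.extend(items[n_train:n_train + n_val])
        test.extend(items[n_train + n_val:])  # remainder goes to test
    return train, val, test


records = ([{"id": i, "severity": "high"} for i in range(20)]
           + [{"id": i, "severity": "low"} for i in range(20, 50)])
train, val, test = stratified_split(records, key="severity")
```

Splitting each stratum separately keeps rare classes represented in every split, which a plain random 80/10/10 shuffle does not guarantee.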