Summary
Migrate the Vulnerability Enrichment and Intelligence Engine to use native LLM Structured Outputs instead of relying on prompt engineering and manual JSON parsing. This will guarantee that the AI-generated security findings strictly adhere to the required JSON schema, data types, and enumerations.
Problem
Currently, the application extracts JSON responses from the LLM via prompt instructions (e.g., "Return ONLY JSON") and manual string parsing when enriching security findings. This is highly brittle:
- Pipeline Instability: A single malformed JSON property or missing bracket breaks downstream database ingestion and reporting.
- Type Enforcement: Numerical values like cvss_score are sometimes returned as strings (e.g., "score: 8.5") instead of strict floats, breaking dashboard metrics.
- Enum Hallucinations: The LLM may hallucinate non-standard severity levels (e.g., Severe or Warning instead of Critical, High, Medium, etc.), which breaks dashboard filters and routing logic.
- Collection Crashes: Remediation steps might be returned as a single string instead of an array, causing client-side .map() errors on the frontend.
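These failure modes can be reproduced with nothing more than json.loads. The sketch below uses made-up model replies (invented for illustration, not captured from SecuScan), and the helper name try_parse and the Low severity level are assumptions.

```python
import json

fence = "`" * 3  # markdown code fence, built here to keep the example readable

# Hypothetical LLM replies, invented for illustration.
fenced_reply = fence + 'json\n{"severity": "High", "cvss_score": 8.5}\n' + fence
loose_reply = (
    '{"severity": "Severe", "cvss_score": "8.5", '
    '"remediation_steps": "Upgrade the package"}'
)


def try_parse(text):
    """Return the parsed object, or None when the text is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


# A markdown fence around the payload makes the whole reply unparseable.
fenced_result = try_parse(fenced_reply)

# Syntactically valid JSON can still carry the wrong types and values.
loose = try_parse(loose_reply)
problems = []
if not isinstance(loose["cvss_score"], (int, float)):
    problems.append("cvss_score is not a number")
if loose["severity"] not in {"Critical", "High", "Medium", "Low"}:  # Low assumed
    problems.append("severity is not a known level")
if not isinstance(loose["remediation_steps"], list):
    problems.append("remediation_steps is not a list")

print(fenced_result, problems)
```

The second reply is the more dangerous case: it parses without error, so the bad values only surface later in the database or on the dashboard.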
Proposed solution
Leverage the native Structured Outputs feature (JSON Schema validation) supported by modern LLM APIs.
- Define a strict schema (e.g., using a Pydantic model) for the enriched security finding payload, explicitly typing fields like severity (Enum), cvss_score (float), and remediation_steps (list of strings).
- Pass this schema directly into the LLM provider's structured output parameter.
- Remove the legacy fallback logic, string manipulation, and markdown stripping (removal of ```json fences) currently used to salvage AI outputs.
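A minimal sketch of the schema described above, assuming the three fields named in this issue are the whole payload (the real model likely has more). It uses only the standard library and writes the JSON Schema out by hand; a Pydantic model would generate an equivalent schema. The names Severity, FINDING_SCHEMA, EnrichedFinding, and parse_enriched_finding are hypothetical, the Low level is an assumption since the issue lists only "Critical, High, Medium, etc.", and the provider call is omitted because the structured output parameter differs between providers.

```python
import json
from dataclasses import dataclass
from enum import Enum


class Severity(str, Enum):
    # Assumed levels: the issue names Critical, High, Medium; Low is a guess.
    CRITICAL = "Critical"
    HIGH = "High"
    MEDIUM = "Medium"
    LOW = "Low"


# Hand-written JSON Schema for the enriched finding (hypothetical name).
# This dict is what would be handed to the provider's structured output option.
FINDING_SCHEMA = {
    "type": "object",
    "properties": {
        "severity": {"type": "string", "enum": [s.value for s in Severity]},
        "cvss_score": {"type": "number"},
        "remediation_steps": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["severity", "cvss_score", "remediation_steps"],
    "additionalProperties": False,
}


@dataclass(frozen=True)
class EnrichedFinding:
    severity: Severity
    cvss_score: float
    remediation_steps: list[str]


def parse_enriched_finding(raw: str) -> EnrichedFinding:
    """Turn schema-conforming JSON text into a typed object."""
    data = json.loads(raw)
    return EnrichedFinding(
        severity=Severity(data["severity"]),  # raises ValueError on an unknown level
        cvss_score=float(data["cvss_score"]),  # JSON numbers may arrive as int
        remediation_steps=data["remediation_steps"],
    )


reply = '{"severity": "High", "cvss_score": 8.5, "remediation_steps": ["Upgrade the package"]}'
print(parse_enriched_finding(reply))
```

Because the provider enforces the schema, the parsing step shrinks to a plain json.loads plus an enum conversion, with no fence stripping or regex salvage.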
Suggested scope
- Suggested files or directories: backend/secuscan/finding_intelligence.py and potentially backend/secuscan/models.py for schema definitions.
- Related route, page, component, API, or plugin: The finding intelligence service/API layer handling vulnerability enrichment.
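Acceptance criteria
- The severity field strictly returns one of the expected enum values, and cvss_score strictly returns a float.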
Test plan
- Trigger a scan using a plugin that feeds raw data into the finding intelligence module (e.g., a raw Nuclei or ZAP finding).
- Verify that the resulting enriched data is correctly persisted to the database without parsing exceptions.
- Inspect the API response to the frontend to ensure cvss_score is a native JSON number (not a string) and remediation_steps is a native JSON array.
- Run the backend test suite (e.g., pytest testing/backend/unit/test_finding_intelligence.py if available) to ensure no regressions.
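The type checks in this plan could be automated with a unit test along these lines. It is a sketch under assumptions: assert_finding_types and the sample payload are hypothetical, the payload stands in for a real API response, and the Low level is assumed as above.

```python
import json

ALLOWED_SEVERITIES = {"Critical", "High", "Medium", "Low"}  # Low is assumed


def assert_finding_types(body: str) -> None:
    """Check that a serialized finding uses native JSON types."""
    finding = json.loads(body)

    score = finding["cvss_score"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    assert isinstance(score, (int, float)) and not isinstance(score, bool)

    steps = finding["remediation_steps"]
    assert isinstance(steps, list)
    assert all(isinstance(step, str) for step in steps)

    assert finding["severity"] in ALLOWED_SEVERITIES


def test_enriched_finding_has_native_json_types():
    # Hypothetical stand-in for the body returned by the enrichment API.
    sample_api_response = json.dumps(
        {
            "severity": "Critical",
            "cvss_score": 9.8,
            "remediation_steps": ["Upgrade the package", "Rotate exposed credentials"],
        }
    )
    assert_finding_types(sample_api_response)
```

pytest would collect the test function by its test_ prefix; in a real suite the sample payload would be replaced by the response from the finding intelligence endpoint.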
Alternatives considered
- Improved Regex/Parsing logic: We considered writing more robust regex to extract JSON blocks, but this does not solve the issue of the LLM hallucinating incorrect property names or data types inside the block.
- Using a validation library (like Guardrails AI): While this enforces schemas, it adds unnecessary latency and an extra dependency, whereas native structured outputs solve the problem at the API level directly.
Additional context
I am a contributor participating in GSSoC 2026. This architectural optimization will significantly improve the core stability of the SecuScan data pipeline, and I would love to be assigned to implement it!