Generate the VDR from per-CVE source files#26

Open
ppkarwasz wants to merge 5 commits into main from feat/vdr-generation
@ppkarwasz
Member

This change replaces our hand-maintained src/site/static/cyclonedx/vdr.xml with a generated artifact assembled from one source file per (CVE, component) pair under src/vulnerabilities/.

To regenerate the VDR after editing any per-CVE file:

uv run scripts/vdr_aggregate.py

To split an existing monolithic VDR back into per-CVE files (one-time migration, or recovery):

uv run scripts/vdr_split.py

Why

The current hand-edited VDR is becoming hard to maintain reliably:

  1. Timestamps drift. In the latest release we forgot to bump metadata.timestamp to the max of every vulnerability.updated. The aggregator now computes this automatically.
  2. Ordering is hard to keep straight. Vulnerabilities in the file are not strictly sorted, and components are listed in an ad-hoc order. The aggregator enforces deterministic order: vulnerabilities by (year DESC, number DESC), components alphabetically by bom-ref.
  3. Merge conflicts on simultaneous additions. Adding seven vulnerabilities in a single batch (as in the most recent disclosure) is error-prone. Per-CVE files let contributors add or edit vulnerabilities independently.
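
The first two rules above can be sketched in a few lines of Python. This is only an illustration of the stated ordering and timestamp rules, not the code in `vdr_aggregate.py`; the helper names `cve_sort_key` and `latest_timestamp` are hypothetical.

```python
import re
from datetime import datetime

_CVE_RE = re.compile(r"^CVE-(\d{4})-(\d+)$")


def cve_sort_key(cve_id: str) -> tuple[int, int]:
    """Key for an ascending sort that yields (year DESC, number DESC)."""
    match = _CVE_RE.match(cve_id)
    if match is None:
        raise ValueError(f"not a CVE id: {cve_id!r}")
    year, number = match.groups()
    # Negate both parts so that a plain ascending sort puts the newest first.
    return (-int(year), -int(number))


def latest_timestamp(updated_values: list[str]) -> str:
    """Return the most recent of several ISO 8601 timestamps.

    Assumes every value carries a UTC offset (or a trailing "Z").
    """
    return max(
        updated_values,
        key=lambda value: datetime.fromisoformat(value.replace("Z", "+00:00")),
    )


ids = ["CVE-2017-5645", "CVE-2021-44228", "CVE-2021-45046", "CVE-2018-1285"]
ordered_ids = sorted(ids, key=cve_sort_key)
# Components are ordered by a plain string sort on bom-ref.
ordered_refs = sorted(["log4j-core", "log4j-1.2-api"])
```

Comparing the numeric parts rather than the raw strings matters once CVE numbers have different lengths (for example `CVE-2021-9999` versus `CVE-2021-44228`).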

How it works

Each vulnerability lives in its own file at src/vulnerabilities/<CVE-id>/<component>.cdx.xml: a self-contained CycloneDX 1.7 BOM with the affected component as metadata.component and a single <vulnerability> element. log4cxx-conan never gets its own file; its vulnerabilities ride along in the corresponding log4cxx file via a <components> entry plus a <dependencies> edge.

vdr_aggregate.py walks every per-CVE file, dedupes components by bom-ref, dedupes vulnerabilities by CVE id, and emits the monolithic vdr.xml. vdr_split.py performs the inverse for migration. Both scripts share vdr_common.py (constants, namespace handling, comparison, write-if-changed orchestration).
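
A rough sketch of the deduplication step, for readers who want the idea without opening the script. It uses the standard library's `xml.etree.ElementTree` rather than `lxml`, the function name `collect_unique` is hypothetical, and the namespace URI is assumed to follow the usual CycloneDX pattern. Merging the `<affects>` targets of duplicate CVEs is also an assumption; the description above does not say how duplicates are reconciled.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for CycloneDX 1.7.
NS = "http://cyclonedx.org/schema/bom/1.7"


def q(tag: str) -> str:
    """Qualify a local tag name with the CycloneDX namespace."""
    return f"{{{NS}}}{tag}"


def collect_unique(boms: list[str]):
    """Collect components by bom-ref and vulnerabilities by CVE id."""
    components: dict[str, ET.Element] = {}
    vulnerabilities: dict[str, ET.Element] = {}
    for text in boms:
        root = ET.fromstring(text)
        # metadata.component plus any ride-along <components> entries.
        found = root.findall(f"{q('metadata')}/{q('component')}")
        found += root.findall(f"{q('components')}/{q('component')}")
        for component in found:
            components.setdefault(component.get("bom-ref"), component)
        for vuln in root.findall(f"{q('vulnerabilities')}/{q('vulnerability')}"):
            cve_id = vuln.findtext(q("id"))
            first = vulnerabilities.setdefault(cve_id, vuln)
            if first is vuln:
                continue
            # Duplicate CVE: fold its <affects> targets into the first copy.
            affects = first.find(q("affects"))
            if affects is None:
                affects = ET.SubElement(first, q("affects"))
            for target in vuln.findall(f"{q('affects')}/{q('target')}"):
                affects.append(target)
    return components, vulnerabilities
```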

Idempotent writes

Both scripts read the existing output's serialNumber and version, build a candidate at the existing version, and compare it to the file on disk via a structural comparison that ignores comments, inter-element whitespace, and namespace prefixes. If the candidate is equivalent, the file is left untouched: no diff, no version churn. If it differs, the version is bumped by one and the file is rewritten.

This means re-running either script in a clean tree is a no-op, and a content edit produces exactly one version bump per affected file.
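
The write-if-changed flow can be sketched as follows. This is a simplified illustration using the standard library, not the implementation in `vdr_common.py`: the names `canonical` and `write_if_changed` are hypothetical, and the comparison here also strips leading and trailing whitespace inside text nodes, which may be looser than the real one.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def canonical(element: ET.Element) -> tuple:
    """Reduce an element to nested tuples for structural comparison.

    Comments are already dropped by the default parser, namespace prefixes
    are resolved into "{uri}local" tags, and whitespace between elements
    (element tails and whitespace-only text) is ignored.
    """
    children = tuple(canonical(child) for child in element)
    text = (element.text or "").strip()
    return (element.tag, tuple(sorted(element.attrib.items())), text, children)


def write_if_changed(path: Path, candidate: ET.Element) -> bool:
    """Write the candidate BOM only if it differs; return True if written."""
    version = 1
    if path.exists():
        existing = ET.parse(path).getroot()
        version = int(existing.get("version", "1"))
        # Build the candidate at the existing version and serial number.
        candidate.set("version", str(version))
        serial = existing.get("serialNumber")
        if serial is not None:
            candidate.set("serialNumber", serial)
        if canonical(candidate) == canonical(existing):
            return False  # equivalent: leave the file untouched
        version += 1  # real change: exactly one version bump
    candidate.set("version", str(version))
    ET.ElementTree(candidate).write(path, encoding="utf-8", xml_declaration=True)
    return True
```

Because the candidate is first built at the existing version, the version attribute itself never causes a spurious difference; only a content change does.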

Why split per (CVE, component), beyond automation

  1. Path to VEX. A monolithic VDR has no meaningful metadata.component, since it covers many subjects. Per-component files let metadata.component name the analyzing project (e.g. log4j-core), with the vulnerable dependency in vulnerability.affects and the dependency path in <dependencies>. That's the shape required for VEX, CSAF, and OpenVEX, so we can grow into those formats without restructuring our source of truth.
  2. Easier AsciiDoc generation. Per-CVE files let _vulnerabilities.adoc be assembled from one generated partial per CVE, instead of a single monolithic AsciiDoc file. Later, we could also generate a separate page per CVE.

Repository layout

scripts/
  vdr_common.py        # shared helpers (constants, clone, serialize, equivalent, write_bom_if_changed)
  vdr_aggregate.py     # per-CVE files -> vdr.xml
  vdr_split.py         # vdr.xml -> per-CVE files
src/vulnerabilities/
  CVE-2017-5645/log4j-core.cdx.xml
  CVE-2018-1285/log4net.cdx.xml
  ...
  template.cdx.xml     # editable template for new CVEs
src/site/static/cyclonedx/vdr.xml   # generated output

Notes for reviewers

  • The aggregated vdr.xml is now CycloneDX 1.7 (was 1.6). The bump is intentional: 1.x is semantically versioned, and the structural change is just a namespace rename.
  • Component ordering changed: alphabetical by bom-ref, so log4j-1.2-api now precedes log4j-core.
  • The aggregated header comment now warns it's generated and points at uv run scripts/vdr_aggregate.py for updates.

Adds a minimal `uv` project at the repo root so Python tooling scripts can run against a locked, reproducible environment alongside the existing Maven-based site build.

The layout consists of:

- `pyproject.toml`: declares `requires-python >= 3.11` and the tooling dependencies (currently just `lxml`). `tool.uv.package = false` since this repo is not itself an installable Python package.
- `.python-version`: pins the interpreter version uv picks up.
- `uv.lock`: committed so dependency versions do not drift between
  contributors or CI runs.

Scripts placed under `scripts/` may also carry PEP 723 inline metadata for standalone invocation via `uv run <script>`.
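
For illustration, a script carrying PEP 723 inline metadata might look like the sketch below. The script itself is hypothetical and not part of this change; only the comment block format at the top is defined by PEP 723. The `ImportError` fallback is there solely so the sketch also runs where `lxml` is not installed.

```python
# /// script
# requires-python = ">=3.11"
# dependencies = ["lxml"]
# ///
"""Hypothetical standalone script: count <vulnerability> elements in a BOM."""
import sys

try:
    from lxml import etree  # resolved by uv from the inline metadata above
except ImportError:  # fallback so the sketch runs without lxml
    import xml.etree.ElementTree as etree


def count_vulnerabilities(xml_bytes: bytes) -> int:
    """Count elements whose local name is 'vulnerability', in any namespace."""
    root = etree.fromstring(xml_bytes)
    return sum(
        1
        for element in root.iter()
        # Comments and processing instructions have a non-string tag in lxml.
        if isinstance(element.tag, str)
        and element.tag.rsplit("}", 1)[-1] == "vulnerability"
    )


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], "rb") as handle:
        print(count_vulnerabilities(handle.read()))
```

Saved under a hypothetical name such as `scripts/count_vulns.py`, it would be invoked as `uv run scripts/count_vulns.py src/site/static/cyclonedx/vdr.xml`.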

To bootstrap the environment, install `uv` and run:

    uv sync

The `vdr_split` script deterministically splits our monolithic VDR into one file per vulnerability and component. The files are stored at:

src/vulnerabilities/CVE-XXXX-YYYY/<bom-ref>.cdx.xml

where `bom-ref` is the slug we used as a BOM reference.

The Conan package for Log4cxx has no identity of its own, so its vulnerabilities are recorded in the same file as the corresponding Log4cxx ones.
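
As a sketch of the path scheme just described, assuming the Conan package uses the bom-ref `log4cxx-conan` and rides along in the `log4cxx` file (the function and the alias table are hypothetical, not taken from `vdr_split.py`):

```python
from pathlib import PurePosixPath

# Assumed alias table: components that never get a file of their own.
RIDE_ALONG = {"log4cxx-conan": "log4cxx"}


def per_cve_path(cve_id: str, bom_ref: str, base: str = "src/vulnerabilities") -> PurePosixPath:
    """Return the per-CVE source file that records this (CVE, component) pair."""
    owner = RIDE_ALONG.get(bom_ref, bom_ref)
    return PurePosixPath(base) / cve_id / f"{owner}.cdx.xml"


example = per_cve_path("CVE-2017-5645", "log4j-core")
```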

A template for new CycloneDX files is stored in `src/vulnerabilities/template.cdx.xml`.

The `vdr_aggregate` script performs the reverse operation of `vdr_split`:

- It merges all the CycloneDX documents in `src/vulnerabilities` into a single VDR.
- If the result differs from the committed file, it bumps the version.

The comparison ignores whitespace between elements.