Skip to content

Metadata automation#246

Open
jim-nvidia wants to merge 4 commits into
NVIDIA:mainfrom
jim-nvidia:metadata-automation
Open

Metadata automation#246
jim-nvidia wants to merge 4 commits into
NVIDIA:mainfrom
jim-nvidia:metadata-automation

Conversation

@jim-nvidia
Copy link
Copy Markdown
Contributor

Onboarding type

  • New product onboarding (new components.d/<slug>.yml file)
  • Other (catalog change, README fix, infrastructure, etc.)

For new product onboarding — author affirmations

By submitting this PR, I confirm on behalf of my team:

  • Skills cleared for open source release per NVIDIA's internal IP review process (six-question check, all answers affirmative)
  • License selected: Apache 2.0 / CC-BY 4.0 / Dual (Apache 2.0 + CC-BY 4.0). Specify: _____
  • No new license or new third-party component introduced beyond what the source repo already carries
  • Source repo is public and under an NVIDIA-owned GitHub org
  • .agents/skills/ or skills/ path used for new entries (or existing path retained for legacy entries per components.d/<slug>.yml)

NVIDIA contributors: see the internal onboarding guide for the IP review process details and license selection.

Reviewer checklist (OSS Skills PIC)

  • Author confirmations above are checked
  • components.d/<slug>.yml entry valid (required fields, unique catalog_dir, path exists in source repo, filename slug matches name)
  • SKILL.md frontmatter spec-compliant (at least one sampled)
  • No new license or third-party dependency requiring OSRB filing

All PRs

  • All commits signed off with DCO (git commit -s).
    If you forgot, run git rebase --signoff origin/main && git push --force-with-lease to retroactively sign all commits in your branch.

Other context (for non-onboarding PRs)

cc: @sayalinvidia @jasonnvidia @
Introduces an end-to-end generator for the catalog-wide metadata.json and skills.sh.json files, plus the schemas, configs, workflow, and docs that support it.

What it does

Generator (.github/scripts/marketplace/generate-skill-metadata.py):

  • Walks skills/ and parses SKILL.md frontmatter
  • Carries forward existing values from the prior metadata.json baseline for byte-stable regen runs (unchanged skills produce zero AI calls)
  • For materially changed skills (name or description changed), runs an "amend" mode: passes existing classification values as context and asks the model per-field whether to keep or update — biased toward preservation
  • Falls back to an OpenAI-compatible LLM enrichment call (NVIDIA Inference API by default) for required fields the deterministic path cannot resolve, with strict JSON contract and no enum invention
  • Per-skill enrichment failures go to a skill_warnings list; valid skills write regardless (partial-success exit code 1)
  • Validates every output against the schemas before writing

Schemas and configs (.github/scripts/marketplace/):

  • metadata.schema.json — single source of truth for controlled vocabulary
  • skills-sh.schema.json — output structure for skills.sh.json
  • skills-subdomains.json — subdomain titles, descriptions, ordering (keys are in canonical curated sequence; generator preserves insertion order)
  • metadata-exclusions.yaml — temporary withholding list (currently empty)

Workflow (.github/workflows/generate-skill-metadata.yml):

  • pull_request: runs --check --no-ai; PRs must produce byte-stable output
  • workflow_dispatch and post-sync: regenerates with AI; opens an auto-PR on changes; opens or updates a tracking issue on validation failure

Docs (.github/scripts/marketplace/README.md): local usage, CI behavior, AI contract.

Setup required (repo admin)

Before the workflow can run against NVIDIA/skills, a repo admin must configure the following in Settings → Secrets and variables → Actions:

Name Type Value
INFERENCE_API_KEY Repository secret NVIDIA Inference API key
INFERENCE_MODEL Repository variable Model name (e.g. openai/gpt-5.5)
INFERENCE_API_URL Repository variable Optional — defaults to NVIDIA Inference API endpoint

Test plan

  • Generator runs locally end-to-end against the full skills tree (gpt-5.5 via Inference API)
  • --check --no-ai mode produces byte-stable output on an unmodified tree
  • Amend mode tested: fill-mode and amend-mode calls verified against live API
  • Partial-success path: a skill failing AI enrichment does not block output for others
  • Group ordering: subdomain sequence in skills.sh.json matches skills-subdomains.json insertion order
  • CI workflow run in NVIDIA/skills (requires secrets configured by admin)

jasonnvidia and others added 4 commits June 5, 2026 15:41
Introduce an end-to-end generator for the catalog-wide metadata.json and
skills.sh.json files, plus the schemas, configs, workflow, and docs that
support it.

Generator (.github/scripts/marketplace/generate-skill-metadata.py):
- walks skills/ and parses SKILL.md frontmatter
- carries forward existing values from the prior metadata.json baseline
  for byte-stable regen runs
- falls back to an OpenAI-compatible LLM enrichment call (NVIDIA Inference
  API by default) for required fields the deterministic path cannot
  resolve, with strict JSON contract and no enum invention
- validates every output against the schemas before writing

Schemas and configs co-located under .github/scripts/marketplace/:
- metadata.schema.json (single source of truth for controlled vocabulary)
- skills-sh.schema.json (output structure for skills.sh.json)
- skills-subdomains.json (subdomain titles, descriptions, ordering)
- metadata-exclusions.yaml (temporary withholding list; currently empty)

Workflow (.github/workflows/generate-skill-metadata.yml):
- pull_request: runs --check --no-ai; PRs must produce byte-stable output
- workflow_dispatch and post-sync: regenerate with AI; opens an auto-PR
  on changes; opens or updates a tracking issue on validation failure

Docs:
- docs/metadata-generation.md (local usage, CI behavior, AI contract)
- docs/metadata-generation-prd.md (design)
- docs/components-d-product-primary-audit.md (one-time audit of the
  components.d/ name vs product.primary enum mismatch surfaced while
  building this; provides decision points for the team)

Signed-off-by: Jim Eagan <jeagan@nvidia.com>
The docs/ directory is the Fern-published external documentation site
(docs.nvidia.com/skills). Internal pipeline documentation does not
belong there.

- docs/metadata-generation.md → .github/scripts/marketplace/README.md
  (auto-renders next to the generator and its schemas).
- Removed docs/metadata-generation-prd.md (kept locally; design doc not
  needed in tree once the pipeline is shipped).
- Removed docs/components-d-product-primary-audit.md (one-time team
  audit, archived elsewhere).

Updated the moved README's relative paths so links to sibling marketplace
files use ./ and the workflow link uses ../../workflows/. Dropped the
PRD cross-link from the README opening paragraph.

Signed-off-by: Jim Eagan <jeagan@nvidia.com>
When a skill's SKILL.md `name` or `description` changes, the generator
previously updated those fields in metadata.json but silently kept the
five MVP classification fields from the baseline, never re-evaluating
them against the new content. This commit teaches the AI client an
"amend" mode: the existing values are passed in as context and the model
is asked, per field, whether to keep the value verbatim or change it
because the new content clearly warrants a different controlled value.
The prompt biases toward preservation — only clear mismatches are
amended — so byte-stability is the common case for routine wording
edits, while genuine content shifts get the metadata they deserve.

Skills classified as `unchanged` still trigger zero AI calls, and
--no-ai still preserves the existing metadata for materially-changed
skills as-is.

Also drops the explicit `temperature: 0` from the API body. The task
is strict controlled-vocabulary classification constrained by
response_format=json_object, so the model's default temperature is
fine, and omitting the field keeps us compatible with deployments
(e.g. gpt-5.x) that reject any explicit value.

Verified end-to-end against the live inference API on gpt-5.5 with
both a fill-mode call and an amend-mode call.

Signed-off-by: Jim Eagan <jeagan@nvidia.com>
…nt failures

Three fixes to the metadata generation pipeline:

1. Group sequence in skills.sh.json was being alphabetized on every run,
   destroying curated subdomain ordering. Generator now iterates
   skills-subdomains.json keys in insertion order rather than sorting by
   title.

2. Subdomain group descriptions were being overwritten with diverged
   values. Updated skills-subdomains.json to use canonical descriptions
   and corrected the $comment to reflect key-order-based emission.

3. A single skill failing AI enrichment blocked output for all other
   skills (all-or-nothing write). Per-skill enrichment failures now go to
   a separate skill_warnings list; valid skills are written and the run
   exits 1 with a PARTIAL SUCCESS report. validate_inventory_round_trip
   updated to exclude intentionally skipped skills.

Also adds a $comment to skills.sh.json marking it as generated and
directing editors to skills-subdomains.json for ordering and description
changes.

Signed-off-by: Jim Eagan <jeagan@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants