Update release with new fabrica-based services; remove old services by travisbcotton · Pull Request #50 · OpenCHAMI/release

travisbcotton · 2026-04-02T13:14:40Z

Pull Request Template

Thank you for your contribution! Please ensure the following before submitting:

Checklist

My code follows the style guidelines of this project
I have added/updated comments where needed
I have added tests that prove my fix is effective or my feature works
I have run make test (or equivalent) locally and all tests pass
DCO Sign-off: All commits are signed off (git commit -s) with my real name and email
REUSE Compliance:
- Each new/modified source file has SPDX copyright and license headers
- Any non-commentable files include a <filename>.license sidecar
- All referenced licenses are present in the LICENSES/ directory

Description

Please include a summary of the change and which issue is fixed.
Also include relevant motivation and context.

Fixes #(issue)

Type of Change

Bug fix
New feature
Breaking change
Documentation update

For more info, see Contributing Guidelines.

davidallendj · 2026-04-09T16:03:56Z

Just a couple of other notes before merging. We need to update the *.container files to use the most up-to-date version of our services including:

SMD after this PR is merged.
Tokensmith to v0.3.0 or later
Boot-service after creating a release
Metadata-service after creating a release

We also need to update systemd/targets/openchami.target to require the new services as well.
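To see which versions the quadlet units currently pin before bumping them, something like the following could help. This is only a sketch: `list_container_images` is a hypothetical helper, and it assumes the units live in /etc/containers/systemd and declare their image with a plain `Image=` line.

```shell
#!/usr/bin/env bash
# List the image reference each quadlet unit currently pins.
# Usage: list_container_images [DIR]   (DIR defaults to /etc/containers/systemd)
# Hypothetical helper for illustration; not part of this repository.
list_container_images() {
    local dir="${1:-/etc/containers/systemd}"
    local f image
    for f in "$dir"/*.container; do
        [ -e "$f" ] || continue    # no matches: the glob stays literal, skip it
        # Take the first Image= line; everything after '=' is the image reference.
        image="$(sed -n 's/^Image=//p' "$f" | head -n 1)"
        printf '%s\t%s\n' "$(basename "$f")" "${image:-<no Image= line>}"
    done
}
```

Running `list_container_images` before and after the update gives a quick tab-separated list to compare against the releases named above.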

davidallendj · 2026-04-09T22:05:45Z

Another note...we're going to update the CoreDHCP config in /etc/openchami/configs/coredhcp.yaml to reflect the change from this PR if we upgrade to the latest version.

Here's a snippet of what the tutorial config should look like after the changes:

    - coresmd: |
        svc_base_uri=https://demo.openchami.cluster:8443 
        ipxe_base_uri=http://172.16.0.254:8081 
        ca_cert=/root_ca/root_ca.crt 
        cache_valid=30s 
        lease_time=1h 
        single_port=false
    - bootloop: |
        lease_file=/tmp/coredhcp.db 
        script_path=default 
        lease_time=5m 
        ipv4_start=172.16.0.200 
        ipv4_end=172.16.0.250

davidallendj · 2026-04-13T20:13:53Z

A couple of changes:

I think OPAAL_URL can be removed
I think JWKS_URL should be updated to use the tokensmith JWKS endpoint. In the tutorial, it will be something like http://tokensmith:8080/.well-known/jwks.json.
The same change needs to be made to SMD_JWKS_URL as well.
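The three changes above could be applied to an environment file roughly like this. It is a sketch under assumptions: `update_jwks_env` is a hypothetical helper, the file is assumed to hold plain `NAME=value` lines, and the default URL is the tutorial value mentioned above.

```shell
#!/usr/bin/env bash
# Drop OPAAL_URL and point JWKS_URL and SMD_JWKS_URL at the tokensmith JWKS endpoint.
# Usage: update_jwks_env FILE [URL]
# Hypothetical helper; assumes plain NAME=value lines and a URL that
# contains neither '|' nor '&' (both are special in the sed expressions below).
update_jwks_env() {
    local file="$1"
    local url="${2:-http://tokensmith:8080/.well-known/jwks.json}"
    local tmp rc
    tmp="$(mktemp)" || return 1
    # The ^ anchors keep the JWKS_URL rule from also matching SMD_JWKS_URL.
    sed -e '/^OPAAL_URL=/d' \
        -e "s|^JWKS_URL=.*|JWKS_URL=${url}|" \
        -e "s|^SMD_JWKS_URL=.*|SMD_JWKS_URL=${url}|" \
        "$file" > "$tmp" && cat "$tmp" > "$file"   # cat keeps the file's mode and owner
    rc=$?
    rm -f "$tmp"
    return "$rc"
}
```

Any other variables in the file are left untouched.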

davidallendj · 2026-04-15T20:23:10Z

We'll need /etc/openchami/configs/haproxy.cfg to be updated to remove the old service routes and add the new ones for tokensmith, boot-service, and metadata-service.

synackd · 2026-04-15T21:59:08Z

We'll have to note these major changes in the release notes once this is merged. We'll want to bump the minor version on the tag.

davidallendj · 2026-04-16T15:13:36Z

Should we provide a /etc/openchami/configs/boot-service.yaml here alongside the /etc/openchami/configs/tokensmith.json? I think it should go in systemd/configs/boot-service.yaml here, to be copied to the appropriate location.

Edit: Just to add, here's the default boot-service config.yaml:

systemd/configs/boot-service.yaml

# SPDX-FileCopyrightText: 2025 OpenCHAMI Contributors
#
# SPDX-License-Identifier: MIT

# OpenCHAMI Boot Service Configuration Example
#
# This is a comprehensive example configuration file for the OpenCHAMI boot service.
# To use this configuration:
#   1. Copy this file to config.yaml: cp config.example.yaml config.yaml
#   2. Customize the settings below for your environment
#   3. Remove or comment out sections you don't need
#
# Configuration precedence (highest to lowest):
#   1. Command-line flags
#   2. Environment variables (e.g., BOOT_SERVICE_PORT=8082)
#   3. Configuration file (config.yaml)
#   4. Default values

# =============================================================================
# SERVER CONFIGURATION
# =============================================================================

# HTTP server settings
port: 8082                    # Port to listen on
host: "0.0.0.0"              # Interface to bind to (0.0.0.0 for all interfaces)
read_timeout: 30             # HTTP read timeout in seconds
write_timeout: 30            # HTTP write timeout in seconds
idle_timeout: 120            # HTTP idle timeout in seconds

# =============================================================================
# STORAGE CONFIGURATION
# =============================================================================

# Data storage settings
data_dir: "./data"           # Directory for storing boot configurations
storage_type: "file"         # Storage backend: "file", "database" (future)

# Database settings (when storage_type: "database")
# database:
#   driver: "postgres"
#   host: "localhost"
#   port: 5432
#   name: "boot_service"
#   user: "boot_user"
#   password: "boot_password"
#   ssl_mode: "require"
#   max_connections: 25
#   connection_timeout: 30

# =============================================================================
# FEATURE TOGGLES
# =============================================================================

# Authentication
enable_auth: false           # Enable TokenSmith JWT authentication
                            # Set to true for production environments

# Metrics and monitoring
enable_metrics: true         # Enable Prometheus metrics endpoint
metrics_port: 9092          # Port for metrics endpoint (/metrics)

# API compatibility
enable_legacy_api: true     # Enable legacy BSS-compatible endpoints
                           # Disable to force use of new API only

# =============================================================================
# AUTHENTICATION CONFIGURATION (when enable_auth: true)
# =============================================================================

auth:
  # Core authentication settings
  enabled: false             # Must match enable_auth above

  # JWT validation method (choose one):

  # Option 1: JWKS URL (recommended for production)
  jwks_url: "https://auth.openchami.org/.well-known/jwks.json"
  jwks_refresh_interval: "1h"  # How often to refresh JWKS cache

  # Option 2: Static RSA public key (for development/testing)
  # jwt_public_key: |
  #   -----BEGIN PUBLIC KEY-----
  #   MIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEA...
  #   -----END PUBLIC KEY-----

  # JWT validation options
  jwt_issuer: "https://auth.openchami.org"     # Expected token issuer
  jwt_audience: "boot-service"                  # Expected token audience
  validate_expiration: true                     # Check token expiration
  validate_issuer: true                        # Validate issuer claim
  validate_audience: true                      # Validate audience claim

  # Authorization requirements
  required_claims: ["sub", "iss", "aud"]      # Required JWT claims
  required_scopes: ["boot:read"]              # Required OAuth2 scopes

  # Development/testing options (never use in production)
  allow_empty_token: false    # Allow requests without tokens
  non_enforcing: false       # Log auth failures but don't block requests

# =============================================================================
# HARDWARE STATE MANAGER INTEGRATION
# =============================================================================

# HSM (Hardware State Manager) settings
hsm_url: "http://localhost:27779"  # URL of the HSM service
                                   # Set to your HSM endpoint

# TokenSmith-backed HSM service authentication
# When both hsm_url and tokensmith_url are configured, boot-service exchanges a
# bootstrap token for short-lived service tokens and adds them to HSM requests.
# Standardized env vars: TOKENSMITH_URL, TOKENSMITH_BOOTSTRAP_TOKEN,
# TOKENSMITH_TARGET_SERVICE, TOKENSMITH_SCOPES, TOKENSMITH_REFRESH_SKEW_SEC
tokensmith_url: "http://localhost:8080"
tokensmith_target_service: "hsm"
tokensmith_scopes: "hsm:read"
tokensmith_refresh_skew_sec: 120
# tokensmith_bootstrap_token: "<bootstrap-jwt>"  # Prefer env var for secrets
# Environment fallback: TOKENSMITH_BOOTSTRAP_TOKEN

# HSM authentication (when HSM requires auth)
# hsm_auth:
#   type: "service_token"      # Authentication type for HSM
#   service_name: "boot-service"
#   token_endpoint: "http://tokensmith:8080/token"

# =============================================================================
# EXTERNAL SERVICES
# =============================================================================

# TokenSmith authentication service (when enable_auth: true)
tokensmith:
  url: "http://localhost:8080"                    # TokenSmith service URL
  timeout: 30                                    # Request timeout in seconds

  # Service-to-service authentication
  service_auth:
    enabled: false                               # Enable service tokens
    service_name: "boot-service"                 # This service's identifier
    token_endpoint: "/token"                     # Token endpoint path

# BSS (Boot Script Service) integration
bss:
  enabled: false                                 # Enable BSS integration
  url: "http://localhost:27778"                 # BSS service URL
  timeout: 30                                   # Request timeout in seconds

# =============================================================================
# LOGGING AND MONITORING
# =============================================================================

# Logging configuration
logging:
  level: "info"               # Log level: debug, info, warn, error
  format: "json"             # Log format: json, text
  output: "stdout"           # Log output: stdout, stderr, file
  # file: "/var/log/boot-service.log"  # Log file (when output: file)

# Health check configuration
health:
  enabled: true              # Enable health check endpoint
  endpoint: "/health"        # Health check URL path
  timeout: 5                # Health check timeout in seconds

# =============================================================================
# PERFORMANCE AND SCALING
# =============================================================================

# Request limits
limits:
  max_request_size: "10MB"   # Maximum request body size
  max_concurrent: 100        # Maximum concurrent requests
  rate_limit: 1000          # Requests per minute per IP

# Caching (future feature)
# cache:
#   enabled: false
#   type: "memory"           # Cache type: memory, redis
#   ttl: "5m"               # Cache TTL
#   max_size: "100MB"       # Maximum cache size

# =============================================================================
# DEVELOPMENT AND TESTING
# =============================================================================

# Development mode settings
development:
  enabled: false             # Enable development mode
  cors_enabled: true        # Enable CORS for browser testing
  cors_origins: ["*"]       # Allowed CORS origins
  debug_endpoints: false    # Enable debug/diagnostic endpoints
  mock_services: false      # Use mock external services

# =============================================================================
# DEPLOYMENT ENVIRONMENT EXAMPLES
# =============================================================================

# Uncomment and modify one of these sections for your deployment environment:

# --- Development Environment ---
# enable_auth: false
# enable_metrics: true
# logging:
#   level: "debug"
# development:
#   enabled: true
#   debug_endpoints: true

# --- Production Environment ---
# enable_auth: true
# enable_metrics: true
# auth:
#   enabled: true
#   jwks_url: "https://auth.openchami.org/.well-known/jwks.json"
#   jwt_issuer: "https://auth.openchami.org"
#   jwt_audience: "boot-service"
#   required_scopes: ["boot:read"]
# logging:
#   level: "info"
#   format: "json"

# --- Kubernetes/Container Environment ---
# port: 8080
# host: "0.0.0.0"
# data_dir: "/data"
# auth:
#   jwks_url: "http://tokensmith:8080/.well-known/jwks.json"
#   jwt_issuer: "openchami-tokensmith"
#   jwt_audience: "openchami-cluster"
# hsm_url: "http://smd:27779"
# logging:
#   format: "json"
#   output: "stdout"

synackd · 2026-04-16T20:21:09Z

Regarding the CoreDHCP config snippet for /etc/openchami/configs/coredhcp.yaml quoted earlier in this thread: we may want to add default hostname rules, since the default when no rules are given is to prefix hostnames with unknown-. Maybe something like:

rule=type:Node,hostname:n{04d}
rule=type:NodeBMC,hostname:{id}

The above will make the node hostnames be like n0001 and make the BMC hostnames be their xname.
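For a feel of what the node rule produces, here is a small illustration. It assumes `{04d}` means "node number zero-padded to four digits", which is my reading of the example hostname above rather than a statement about how coresmd implements it; `render_node_hostname` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Illustrate the n{04d} pattern: 'n' followed by the node number,
# zero-padded to at least four digits.
# Hypothetical helper; pass a plain decimal number without leading zeros,
# since printf would treat a leading 0 as octal.
render_node_hostname() {
    printf 'n%04d\n' "$1"
}
```

With this reading, node 1 becomes n0001 and node 42 becomes n0042. The padding is a minimum width, so a five-digit node number would simply yield a longer hostname.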

synackd

Initial code review without testing this yet.

synackd

Testing now. I get:

sed: can't read /etc/containers/systemd/opaal.container: No such file or directory

when running the openchami-certificate-update script.

If getting rid of hydra, we'll want to remove references to it, e.g. in

release/scripts/openchami_profile.sh

Line 27 in d77457c

${CONTAINER_CMD:-docker} exec hydra hydra create client \

We can probably just get rid of those functions.
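A quick way to find the remaining references to removed containers, sketched under the assumption that a whole-word search is precise enough. `find_stale_refs` is a hypothetical helper, and the names to search for are passed in rather than hard-coded.

```shell
#!/usr/bin/env bash
# Print every line under DIR that mentions one of the given names as a whole word.
# Usage: find_stale_refs DIR NAME [NAME...]
# Returns 0 if something was found, 1 if nothing matched, 2 if no names were given.
# Hypothetical helper; relies on a grep that supports -r (GNU and BSD grep do).
find_stale_refs() {
    local dir="$1"
    shift
    local name pattern=""
    for name in "$@"; do
        pattern="${pattern:+$pattern|}$name"    # build an alternation: a|b|c
    done
    [ -n "$pattern" ] || return 2
    # -r recurse, -n show line numbers, -w whole words only, -E extended regex
    grep -rnwE -- "$pattern" "$dir"
}
```

For example, `find_stale_refs scripts hydra opaal` run from the repository root would list hits in the form path:line:text, which should include the openchami_profile.sh line quoted above.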

davidallendj · 2026-05-12T14:56:19Z

We might have to mount a volume now and set --data-dir with some of the upstream changes to how metadata-service works. I get a "permission denied" error when I try to start it with pr-8 with the current Exec.

I tried adding a volume like Volume=/opt/workdir/data:/data and chmod 777 /opt/workdir/data and that fixes the permission denied issue for me.

May 12 14:51:27 openchami-testing.novalocal systemd[1]: Started The metadata-service container.
May 12 14:51:27 openchami-testing.novalocal metadata-service[2974370]: 2026/05/12 14:51:27 Starting github.com/OpenCHAMI/metadata-service server...
May 12 14:51:27 openchami-testing.novalocal metadata-service[2974370]: Error: failed to initialize file storage: failed to create file backend: failed to create base directory /data: mkdir /data: permission denied
May 12 14:51:27 openchami-testing.novalocal metadata-service[2974370]: Usage:
May 12 14:51:27 openchami-testing.novalocal metadata-service[2974370]: ochami-metadata-server serve [flags]
May 12 14:51:27 openchami-testing.novalocal metadata-service[2974370]:
May 12 14:51:27 openchami-testing.novalocal metadata-service[2974370]: Flags:
May 12 14:51:27 openchami-testing.novalocal metadata-service[2974370]: --data-dir string Directory for file storage (default "/data")
May 12 14:51:27 openchami-testing.novalocal metadata-service[2974370]: -h, --help help for serve
May 12 14:51:27 openchami-testing.novalocal metadata-service[2974370]: --host string Host to bind to (default "0.0.0.0")
May 12 14:51:27 openchami-testing.novalocal metadata-service[2974370]: --idle-timeout int Idle timeout in seconds (default 60)
May 12 14:51:27 openchami-testing.novalocal metadata-service[2974370]: -p, --port int Port to listen on (default 8080)
May 12 14:51:27 openchami-testing.novalocal metadata-service[2974370]: --read-timeout int Read timeout in seconds (default 15)
May 12 14:51:27 openchami-testing.novalocal metadata-service[2974370]: --wireguard-only Restrict access to WireGuard network only
May 12 14:51:27 openchami-testing.novalocal metadata-service[2974370]: --wireguard-server string Enable WireGuard userspace controller (CIDR, e.g. 100.97.0.1/16)
May 12 14:51:27 openchami-testing.novalocal metadata-service[2974370]: --wireguard-state-file string Path to WireGuard state file for persistence (default "/data/wireguard/state.yaml")
May 12 14:51:27 openchami-testing.novalocal metadata-service[2974370]: --write-timeout int Write timeout in seconds (default 15)
May 12 14:51:27 openchami-testing.novalocal metadata-service[2974370]:
May 12 14:51:27 openchami-testing.novalocal metadata-service[2974370]: Global Flags:
May 12 14:51:27 openchami-testing.novalocal metadata-service[2974370]: --config string config file (default is $HOME/.ochami-metadata.yaml)
May 12 14:51:27 openchami-testing.novalocal metadata-service[2974370]: --debug Enable debug logging
May 12 14:51:27 openchami-testing.novalocal metadata-service[2974370]:
May 12 14:51:27 openchami-testing.novalocal metadata-service[2974370]: 2026/05/12 14:51:27 failed to initialize file storage: failed to create file backend: failed to create base directory /data: mkdir /data: permission denied
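The volume workaround described above could be scripted along these lines. This is a sketch, not the change that landed in the PR: `prepare_data_dir` is a hypothetical helper, and the default path and the 777 mode simply mirror what was used during testing.

```shell
#!/usr/bin/env bash
# Prepare a host directory for metadata-service and print the quadlet line
# that mounts it at /data (the service's default --data-dir per the usage text).
# Usage: prepare_data_dir [DIR]   (DIR defaults to /opt/workdir/data)
# Hypothetical helper for illustration only.
prepare_data_dir() {
    local dir="${1:-/opt/workdir/data}"
    mkdir -p "$dir" || return 1
    # World-writable, as in the workaround; this is what cleared the
    # "mkdir /data: permission denied" error during testing.
    chmod 777 "$dir" || return 1
    printf 'Volume=%s:/data\n' "$dir"
}
```

The printed line is meant to be added to the metadata-service .container file by hand; the function itself does not edit any unit.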

We'll also have to remove the --tokensmith-url flag, at least for now.

I added a volume for metadata to store things. I also removed --tokensmith-url until that is added back in.

I think this was literally added back today with this PR:
OpenCHAMI/metadata-service#12

Unresolving for further discussion/action. I assume we'll want this.

Posted a comment to bump the linked issue. Looks like it needs to be rebased/resolved before merging.

Signed-off-by: Travis Cotton <trcotton@lanl.gov>

Signed-off-by: Devon Bautista <17506592+synackd@users.noreply.github.com>

erl-hpe · 2026-06-05T20:18:33Z

Alex asked me to try testing this out. I have tried using the (PR-52) OpenCHAMI Installer that I developed and it got stuck waiting for Hydra:

Error: no container with name or ID "hydra" found: no such container
Error: no container with name or ID "hydra" found: no such container
Error: no container with name or ID "hydra" found: no such container
Error: no container with name or ID "hydra" found: no such container
Error: no container with name or ID "hydra" found: no such container
Error: no container with name or ID "hydra" found: no such container
Error: no container with name or ID "hydra" found: no such container
Error: no container with name or ID "hydra" found: no such container
Error: no container with name or ID "hydra" found: no such container
/opt/workdir/OpenCHAMI-Install.sh:235:[main]: cannot get openchami access token
exiting on error [1] from /opt/workdir/OpenCHAMI-Install.sh:235
ERROR: install script '/opt/workdir/OpenCHAMI-Install.sh' exited failed to run - Command '['su', '-', 'rocky', '/opt/workdir/OpenCHAMI-Install.sh']' returned non-zero exit status 1.

The step in question (which comes from the quadlet tutorial) is trying (10 times) to get the DEMO_ACCESS_TOKEN using:

export DEMO_ACCESS_TOKEN="$(sudo bash -lc 'gen_access_token')"

Is this no longer the correct way to obtain the access token? I notice that Hydra has been removed and, it seems, replaced with TokenSmith. Is there a new way to get the token from TokenSmith?

synackd · 2026-06-05T20:28:38Z

That's because that function hasn't been updated yet. See https://github.com/OpenCHAMI/openchami.org/pull/100/changes/BASE..08154d8ad060aae113378f3e9a6fb3f1d27515a4#diff-b2078f278f9bf7a1610c6ae9134a53c18cc543d5d13f4c4425b9123ca9184e17R1242 and related discussion.

erl-hpe · 2026-06-05T21:16:18Z

Thanks! That helps. From there I should be able to figure out what I will need to do to get the installer up to date...

erl-hpe · 2026-06-08T15:52:19Z

Just wondering, since all of the stuff needed to generate the token is known but different, why not keep the abstraction in the form of gen_access_token and simply modify it instead of having to update the tutorial on that point? It seems to me, the ideal when bringing in big changes like this is, wherever possible, to make them transparent to prior users and anything that might have been built on prior practices. I presume that was considered and rejected for some reason here. Curious why.

davidallendj · 2026-06-08T16:01:49Z

That's the plan. We just haven't made the update yet for the tutorial to work. Working changes for the tutorial documentation are here.

erl-hpe · 2026-06-08T16:04:14Z

Ok. So, the fact that this is still in progress is a reflection of future work you are planning on and the tutorial changes are, at least for now, a parallel work in progress that will be minimized as this moves toward completion, but is currently needed to allow people to work with this PR? That makes sense.

That also informs how I will work alongside of this. I will plan to incorporate the changes to the tutorial locally and temporarily into my code so I can get a sense of the deviation, but not plan to update my code with those changes until I see the final result.

davidallendj · 2026-06-08T16:07:39Z

Yes, we're trying to change everything at once so everything works in the tutorial like before and we can keep it as frictionless as possible. We're still figuring out some parts of it and testing the new services here though.

FYI if you need the new command for the access token to continue testing, here it is:

export DEMO_ACCESS_TOKEN=$(sudo podman exec tokensmith /bin/sh -c "/usr/local/bin/tokensmith user-token create --audience smd --key-file /tokensmith/data/keys/private.pem --subject 'admin@example.com' --scopes 'admin' --enable-local-user-mint")
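To keep the gen_access_token abstraction discussed earlier in the thread, the command above could be wrapped roughly like this. It is a sketch of what an updated function might look like, not the version that will ship: the `CONTAINER_CMD` override follows the pattern already used in openchami_profile.sh, and defaulting it to podman is an assumption.

```shell
#!/usr/bin/env bash
# Mint an access token from the running tokensmith container.
# Sketch only: wraps the command posted above without changing its arguments.
# CONTAINER_CMD may be set to another runtime; podman is an assumed default.
gen_access_token() {
    ${CONTAINER_CMD:-podman} exec tokensmith /bin/sh -c \
        "/usr/local/bin/tokensmith user-token create --audience smd --key-file /tokensmith/data/keys/private.pem --subject 'admin@example.com' --scopes 'admin' --enable-local-user-mint"
}
```

With a definition like this loaded from the profile script, the existing tutorial line `export DEMO_ACCESS_TOKEN="$(sudo bash -lc 'gen_access_token')"` would keep working unchanged, since the function is already run as root there.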

erl-hpe · 2026-06-08T16:08:35Z

Yep. I picked that up from the linked tutorial PR. Thanks!

travisbcotton marked this pull request as draft April 7, 2026 20:51

travisbcotton force-pushed the trcotton/tokensmith-container branch from 9bf779d to ef8d070 April 7, 2026 20:52

travisbcotton changed the title ~~added tokensmith basic config file; update env file~~ Update release with new fabrica-based services; remove old services Apr 7, 2026