Proposal: add pkg/monitor — lightweight per-workload resctrl mon_group lifecycle #190

@cmcantalupo

Summary

We propose a new package in goresctrl that manages the lifecycle of resctrl monitoring groups (mon_groups) on a per-workload basis, independent of the config-driven pkg/rdt allocation model. The primary use case is assigning per-pod (or per-container) RMIDs so that downstream energy-monitoring tools like Kepler can attribute hardware energy counters (Intel AET / RAPL) to individual workloads.

Motivation

Today, getting per-pod energy attribution on Intel platforms requires a runtime-lifecycle hook that:

  1. Creates a resctrl mon_group directory (kernel assigns an RMID)
  2. Writes the container's init PID into the tasks file before user threads fork (so all children inherit the RMID)
  3. Removes the mon_group on pod teardown (kernel releases the RMID)
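The three steps above can be sketched as plain filesystem operations. This is a minimal illustration, not the proposed package: the function names are hypothetical, and the rmdir operation is passed in so the same code can run against a temporary directory instead of a real resctrl mount.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
)

// createMonGroup makes <root>/mon_groups/<name>. On a real resctrl mount the
// kernel assigns an RMID as a side effect of the mkdir. An already existing
// directory is treated as success.
func createMonGroup(root, name string) (string, error) {
	dir := filepath.Join(root, "mon_groups", name)
	if err := os.Mkdir(dir, 0o755); err != nil && !os.IsExist(err) {
		return "", fmt.Errorf("create mon_group %q: %w", name, err)
	}
	return dir, nil
}

// assignPID writes a single PID to the group's tasks file. Doing this for the
// container's init PID before it forks lets all children inherit the RMID.
func assignPID(dir string, pid int) error {
	return os.WriteFile(filepath.Join(dir, "tasks"), []byte(strconv.Itoa(pid)), 0o644)
}

// removeMonGroup removes the group directory using the injected rmdir
// function. A directory that is already gone is not an error.
func removeMonGroup(dir string, rmdir func(string) error) error {
	if err := rmdir(dir); err != nil && !os.IsNotExist(err) {
		return fmt.Errorf("remove mon_group %s: %w", dir, err)
	}
	return nil
}

// demoLifecycle runs all three steps against a throwaway directory tree and
// returns what was written to the tasks file.
func demoLifecycle() (string, error) {
	root, err := os.MkdirTemp("", "resctrl-sketch-")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(root)
	if err := os.Mkdir(filepath.Join(root, "mon_groups"), 0o755); err != nil {
		return "", err
	}
	dir, err := createMonGroup(root, "demo-pod")
	if err != nil {
		return "", err
	}
	if err := assignPID(dir, 4242); err != nil {
		return "", err
	}
	data, err := os.ReadFile(filepath.Join(dir, "tasks"))
	if err != nil {
		return "", err
	}
	// A temp dir holds a regular tasks file, so RemoveAll stands in for rmdir.
	if err := removeMonGroup(dir, os.RemoveAll); err != nil {
		return "", err
	}
	if _, err := os.Stat(dir); !os.IsNotExist(err) {
		return "", fmt.Errorf("group %s still present", dir)
	}
	return string(data), nil
}
```

Against a real mount, the caller would pass `os.Remove` as the rmdir function and `/sys/fs/resctrl` (or a ctrl_group directory under it) as the root.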

This logic is inherently runtime-agnostic: only the trigger mechanism differs across runtimes. A working implementation already exists in containers/nri-plugins PR #666, but the feature's largest consumer is OpenShift (CRI-O), whose native extension mechanism is OCI createRuntime hooks, not NRI. Rather than maintain two independent copies of the same resctrl logic, we'd like to host the core in goresctrl and have multiple thin adapters import it.

Three consumer adapters (in priority order):

| # | Adapter | Runtime | Trigger mechanism |
|---|---------|---------|-------------------|
| 1 | OCI createRuntime hook binary | CRI-O / Podman | hooks.d JSON, container state on stdin |
| 2 | NRI plugin (StartContainer callback) | containerd / k3s | containerd/nri stub |
| 3 | Podman standalone | Podman (rootful) | Same hook binary, non-k8s key path |
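For the hook-based adapters, the input is the OCI container state document that the runtime passes on stdin. The sketch below decodes the fields a hook would need (`id`, `pid`, `annotations`, as defined by the OCI runtime spec) and derives a workload key. The function names and the annotation-based key lookup are assumptions for illustration; which annotation, if any, carries the pod UID depends on the runtime and is not specified here.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"io"
)

// ociState mirrors the subset of the OCI runtime-spec state document that a
// createRuntime hook would need.
type ociState struct {
	ID          string            `json:"id"`
	Status      string            `json:"status"`
	Pid         int               `json:"pid"`
	Bundle      string            `json:"bundle"`
	Annotations map[string]string `json:"annotations"`
}

// decodeState parses a state document and rejects one without an id or pid.
func decodeState(data []byte) (*ociState, error) {
	var s ociState
	if err := json.Unmarshal(data, &s); err != nil {
		return nil, fmt.Errorf("decode container state: %w", err)
	}
	if s.ID == "" || s.Pid <= 0 {
		return nil, errors.New("container state lacks id or pid")
	}
	return &s, nil
}

// readState reads the whole document from r (os.Stdin in a real hook).
func readState(r io.Reader) (*ociState, error) {
	data, err := io.ReadAll(r)
	if err != nil {
		return nil, fmt.Errorf("read container state: %w", err)
	}
	return decodeState(data)
}

// workloadKey returns the value of the given annotation if present, and
// otherwise falls back to the container ID (the non-Kubernetes key path).
// The annotation name is supplied by the caller; none is assumed here.
func workloadKey(s *ociState, annotation string) string {
	if v := s.Annotations[annotation]; v != "" {
		return v
	}
	return s.ID
}
```

A hook binary would call `readState(os.Stdin)`, then hand the derived key and `Pid` to the mon_group logic.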

Why not extend pkg/rdt?

pkg/rdt already models monitoring groups (CtrlGroup.CreateMonGroup, MonGroup, AddPids, GetMonData), but its API is bound to the full Initialize() + SetConfig() lifecycle that owns the entire resctrl hierarchy — it manages ctrl_groups, schemata, and class membership as a coherent whole.

The mon_group lifecycle feature does the opposite: it creates mon_groups under pre-existing ctrl_groups it does not manage (those are created by whatever allocation plugin is running). It tracks lightweight per-key state and does idempotent orphan cleanup on restart. Forcing this into pkg/rdt's config-driven model would either require consumers to Initialize the full RDT subsystem (which they don't want to control), or would compromise pkg/rdt's coherent ownership model.

A separate, lightweight package keeps both concerns clean. Later, we can optionally add a Group.MonData() method that delegates to pkg/rdt's existing reader, bridging the two without coupling them structurally.

Proposed API sketch

package monitor // github.com/intel/goresctrl/pkg/monitor

import "log/slog"

type Options struct {
    ResctrlRoot  string            // default "/sys/fs/resctrl"
    GroupPrefix  string            // e.g. "nri-" or "oci-" — namespace for ownership
    KeyValidator func(string) bool // e.g. PodUIDValidator for k8s; nil = permissive default
}

type Manager struct { /* internal state */ }

func New(o Options) (*Manager, error)
func SetLogger(l *slog.Logger)

func (m *Manager) EnsureGroup(key, rdtClass string) (*Group, error)
func (m *Manager) AssignPID(key string, pid int) error
func (m *Manager) Remove(key string) error
func (m *Manager) AddMember(key, memberID string)
func (m *Manager) RemoveMember(key, memberID string)
func (m *Manager) MemberCount(key string) int
func (m *Manager) Reconcile(live []string) error
func (m *Manager) List() []string

type Group struct { /* key, dir */ }
func (g *Group) Key() string
func (g *Group) Path() string

// Exported validators for adapter use
func PodUIDValidator(key string) bool
func DefaultKeyValidator(key string) bool

// Typed errors
var ErrNotTracked, ErrNoRMIDs, ErrBadKey, ErrBadClass error
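
To make the `KeyValidator` hook concrete, here is one way the two validators could behave. This is a sketch under stated assumptions, not the proposed implementation: it assumes pod UIDs are lowercase UUID strings and that a permissive default only needs to guarantee the key is safe as a single directory name. The function names and the 64-character limit are hypothetical.

```go
package main

import "regexp"

// Assumption: a Kubernetes pod UID is a lowercase, hyphenated UUID.
var podUIDPattern = regexp.MustCompile(
	`^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$`)

// Assumption: a permissive key is 1-64 characters, starts with an
// alphanumeric, and contains no path separators. Requiring an alphanumeric
// first character also excludes "." and "..".
var safeKeyPattern = regexp.MustCompile(`^[A-Za-z0-9][A-Za-z0-9_.-]{0,63}$`)

// looksLikePodUID reports whether key has the assumed pod UID shape.
func looksLikePodUID(key string) bool {
	return podUIDPattern.MatchString(key)
}

// isSafePathComponent reports whether key can be used as a single directory
// name without escaping its parent directory.
func isSafePathComponent(key string) bool {
	return safeKeyPattern.MatchString(key)
}
```

Either function has the `func(string) bool` shape of `Options.KeyValidator`. The point of validating at all is that the key ends up in a directory name, so anything containing `/` or `..` must be rejected before any filesystem call.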

Design notes:

  • GroupPrefix namespaces the on-disk directories so that two co-deployed mechanisms (e.g. OCI hook + NRI plugin on the same node during migration) never fight over the same directory or reap each other's groups during Reconcile.
  • Generic key, not "pod UID", in the API. Pod-UID validation is opt-in via KeyValidator, keeping the core usable for non-Kubernetes workloads (e.g. Podman containers keyed by container ID).
  • Does not create ctrl_groups. If rdtClass is specified and the corresponding directory doesn't exist, EnsureGroup returns an error. Allocation is another plugin's responsibility.
  • Follows goresctrl conventions: log/slog via SetLogger, pkg/path for testable roots, injectable mkdir/rmdir (like pkg/rdt's groupCreateFunc/groupRemoveFunc).
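
The prefix-scoped orphan cleanup described in the first note can be sketched as follows. This is an illustrative stand-alone function, not the proposed `Reconcile` method: it handles a single mon_groups directory, takes the rmdir operation as a parameter, and all names are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// reconcile removes every directory under monGroupsDir whose name starts with
// prefix and whose key is not in live. Directories with other prefixes are
// never touched. It returns the names it removed.
func reconcile(monGroupsDir, prefix string, live []string, rmdir func(string) error) ([]string, error) {
	if prefix == "" {
		return nil, errors.New("empty prefix would claim every group")
	}
	keep := make(map[string]bool, len(live))
	for _, key := range live {
		keep[prefix+key] = true
	}
	entries, err := os.ReadDir(monGroupsDir)
	if err != nil {
		return nil, err
	}
	var removed []string
	var errs []error
	for _, e := range entries {
		name := e.Name()
		if !e.IsDir() || !strings.HasPrefix(name, prefix) || keep[name] {
			continue
		}
		if err := rmdir(filepath.Join(monGroupsDir, name)); err != nil && !os.IsNotExist(err) {
			errs = append(errs, err)
			continue
		}
		removed = append(removed, name)
	}
	return removed, errors.Join(errs...)
}

// demoReconcile builds a temp directory with two "oci-" groups and one "nri-"
// group, keeps key "a" alive, and checks that the foreign group survives.
func demoReconcile() ([]string, error) {
	dir, err := os.MkdirTemp("", "mon-groups-sketch-")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(dir)
	for _, name := range []string{"oci-a", "oci-b", "nri-a"} {
		if err := os.Mkdir(filepath.Join(dir, name), 0o755); err != nil {
			return nil, err
		}
	}
	removed, err := reconcile(dir, "oci-", []string{"a"}, os.Remove)
	if err != nil {
		return nil, err
	}
	if _, err := os.Stat(filepath.Join(dir, "nri-a")); err != nil {
		return nil, fmt.Errorf("foreign group was reaped: %w", err)
	}
	return removed, nil
}
```

In the demo, only `oci-b` is removed: `oci-a` is live and `nri-a` belongs to a different prefix. Running the function a second time with the same live set removes nothing, which is the idempotence the restart path relies on.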

Testing approach

All core tests run without a real resctrl mount — they use t.TempDir() with injected filesystem operations, following the pattern established by pkg/rdt's own unit tests. A later PR may add an integration test gate for Group.MonData() that requires a real mount.
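
A test in this style might look like the sketch below. The `monGroupMaker` type is a hypothetical stand-in for the proposed `Manager`, reduced to the one detail that matters here: the mkdir operation is a field, so a test can supply either the real `os.Mkdir` on a `t.TempDir()` root or a recording stub that never touches the filesystem.

```go
package main

import (
	"os"
	"path/filepath"
	"testing"
)

// monGroupMaker is a hypothetical stand-in for the proposed Manager.
type monGroupMaker struct {
	root   string
	prefix string
	mkdir  func(path string, perm os.FileMode) error // injectable for tests
}

// ensureGroup creates <root>/mon_groups/<prefix><key> through the injected
// mkdir and treats an already existing directory as success.
func (m *monGroupMaker) ensureGroup(key string) (string, error) {
	dir := filepath.Join(m.root, "mon_groups", m.prefix+key)
	if err := m.mkdir(dir, 0o755); err != nil && !os.IsExist(err) {
		return "", err
	}
	return dir, nil
}

// recordingMkdir returns a stub that appends each requested path to calls
// and creates nothing.
func recordingMkdir(calls *[]string) func(string, os.FileMode) error {
	return func(path string, _ os.FileMode) error {
		*calls = append(*calls, path)
		return nil
	}
}

// TestEnsureGroupOnTempRoot exercises the real os.Mkdir against a temp root,
// so no resctrl mount is required.
func TestEnsureGroupOnTempRoot(t *testing.T) {
	root := t.TempDir()
	if err := os.Mkdir(filepath.Join(root, "mon_groups"), 0o755); err != nil {
		t.Fatal(err)
	}
	m := &monGroupMaker{root: root, prefix: "oci-", mkdir: os.Mkdir}
	dir, err := m.ensureGroup("k1")
	if err != nil {
		t.Fatalf("ensureGroup: %v", err)
	}
	if fi, err := os.Stat(dir); err != nil || !fi.IsDir() {
		t.Fatalf("group directory %s was not created", dir)
	}
	// A second call must succeed without error (idempotent).
	if _, err := m.ensureGroup("k1"); err != nil {
		t.Fatalf("second ensureGroup: %v", err)
	}
}
```

The recording stub is useful for asserting on the exact paths requested, for example that the prefix is applied, without creating any directories.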

Reference CLI / hook binary

We'd like to include a minimal reference CLI under cmd/resctrl-mon-hook/ (consistent with goresctrl's existing cmd/rdt/, cmd/blockio/, etc.) that demonstrates the OCI hook adapter. Production deployment artifacts (Helm, MachineConfig, DaemonSet) would live in a consumer repository, not here.

Questions for maintainers

  1. Package name: pkg/monitor vs pkg/mongroup vs pkg/podmon — any preference?
  2. cmd/ placement: is a reference OCI hook binary under cmd/resctrl-mon-hook/ appropriate, or should all adapter binaries live externally?
  3. Optional MonData() bridge to pkg/rdt: should this be in-package (creates a compile-time dependency on pkg/rdt), in a sub-package (pkg/monitor/rdt), or entirely external?
  4. Contribution format: single PR or split into (a) skeleton+types, (b) core ops, (c) reconcile, (d) reference CLI?

Related work

  • containers/nri-plugins PR #666 — the feature this library will back
  • Kepler — the downstream consumer of the mon_group counters
  • pkg/rdt ctrlmongroup.go — existing mon-group modeling for the allocation path

Ready to provide a more detailed implementation plan if this direction gets a green light. Happy to iterate on the API surface before writing code.
