Summary
We propose a new package in goresctrl that manages the lifecycle of resctrl monitoring groups (mon_groups) on a per-workload basis, independent of the config-driven pkg/rdt allocation model. The primary use case is assigning per-pod (or per-container) RMIDs so that downstream energy-monitoring tools like Kepler can attribute hardware energy counters (Intel AET / RAPL) to individual workloads.
Motivation
Today, getting per-pod energy attribution on Intel platforms requires a runtime-lifecycle hook that:
- Creates a resctrl mon_group directory (kernel assigns an RMID)
- Writes the container's init PID into the tasks file before user threads fork (so all children inherit the RMID)
- Removes the mon_group on pod teardown (kernel releases the RMID)
This logic is inherently runtime-agnostic — only the trigger mechanism differs across runtimes. Today we have a working implementation inside containers/nri-plugins PR #666, but the feature's largest consumer is OpenShift (CRI-O), where the native extension mechanism is OCI createRuntime hooks — not NRI. Rather than maintain two independent copies of the same resctrl logic, we'd like to host the core in goresctrl and have multiple thin adapters import it.
Three consumer adapters (in priority order):
| # | Adapter | Runtime | Trigger mechanism |
|---|---------|---------|-------------------|
| 1 | OCI createRuntime hook binary | CRI-O / Podman | hooks.d JSON, container state on stdin |
| 2 | NRI plugin (StartContainer callback) | containerd / k3s | containerd/nri stub |
| 3 | Podman standalone | Podman (rootful) | Same hook binary, non-k8s key path |
Why not extend pkg/rdt?
pkg/rdt already models monitoring groups (CtrlGroup.CreateMonGroup, MonGroup, AddPids, GetMonData), but its API is bound to the full Initialize() + SetConfig() lifecycle that owns the entire resctrl hierarchy — it manages ctrl_groups, schemata, and class membership as a coherent whole.
The mon_group lifecycle feature does the opposite: it creates mon_groups under pre-existing ctrl_groups it does not manage (those are created by whatever allocation plugin is running). It tracks lightweight per-key state and does idempotent orphan cleanup on restart. Forcing this into pkg/rdt's config-driven model would either require consumers to Initialize the full RDT subsystem (which they don't want to control), or would compromise pkg/rdt's coherent ownership model.
A separate, lightweight package keeps both concerns clean. Later, we can optionally add a Group.MonData() method that delegates to pkg/rdt's existing reader, bridging the two without coupling them structurally.
Proposed API sketch
package monitor // github.com/intel/goresctrl/pkg/monitor

type Options struct {
	ResctrlRoot  string            // default "/sys/fs/resctrl"
	GroupPrefix  string            // e.g. "nri-" or "oci-"; namespace for ownership
	KeyValidator func(string) bool // e.g. PodUIDValidator for k8s; nil = permissive default
}

type Manager struct { /* internal state */ }

func New(o Options) (*Manager, error)
func SetLogger(l *slog.Logger)

func (m *Manager) EnsureGroup(key, rdtClass string) (*Group, error)
func (m *Manager) AssignPID(key string, pid int) error
func (m *Manager) Remove(key string) error
func (m *Manager) AddMember(key, memberID string)
func (m *Manager) RemoveMember(key, memberID string)
func (m *Manager) MemberCount(key string) int
func (m *Manager) Reconcile(live []string) error
func (m *Manager) List() []string

type Group struct { /* key, dir */ }

func (g *Group) Key() string
func (g *Group) Path() string

// Exported validators for adapter use
func PodUIDValidator(key string) bool
func DefaultKeyValidator(key string) bool

// Typed errors
var ErrNotTracked, ErrNoRMIDs, ErrBadKey, ErrBadClass error
Design notes:
- GroupPrefix namespaces the on-disk directories so that two co-deployed mechanisms (e.g. OCI hook + NRI plugin on the same node during migration) never fight over the same directory or reap each other's groups during Reconcile.
- Generic key, not "pod UID", in the API. Pod-UID validation is opt-in via KeyValidator, keeping the core usable for non-Kubernetes workloads (e.g. Podman containers keyed by container ID).
- Does not create ctrl_groups. If rdtClass is specified and the corresponding directory doesn't exist, EnsureGroup returns an error. Allocation is another plugin's responsibility.
- Follows goresctrl conventions: log/slog via SetLogger, pkg/path for testable roots, injectable mkdir/rmdir (like pkg/rdt's groupCreateFunc/groupRemoveFunc).
Testing approach
All core tests run without a real resctrl mount — they use t.TempDir() with injected filesystem operations, following the pattern established by pkg/rdt's own unit tests. A later PR may add an integration test gate for Group.MonData() that requires a real mount.
Reference CLI / hook binary
We'd like to include a minimal reference CLI under cmd/resctrl-mon-hook/ (consistent with goresctrl's existing cmd/rdt/, cmd/blockio/, etc.) that demonstrates the OCI hook adapter. Production deployment artifacts (Helm, MachineConfig, DaemonSet) would live in a consumer repository, not here.
Questions for maintainers
- Package name: pkg/monitor vs pkg/mongroup vs pkg/podmon. Any preference?
- cmd/ placement: is a reference OCI hook binary under cmd/resctrl-mon-hook/ appropriate, or should all adapter binaries live externally?
- Optional MonData() bridge to pkg/rdt: should this be in-package (creates a compile-time dependency on pkg/rdt), in a sub-package (pkg/monitor/rdt), or entirely external?
- Contribution format: single PR or split into (a) skeleton+types, (b) core ops, (c) reconcile, (d) reference CLI?
Related work
- containers/nri-plugins PR #666: the feature this library will back
- Kepler: the downstream consumer of the mon_group counters
- pkg/rdt ctrlmongroup.go: existing mon-group modeling for the allocation path
Ready to provide a more detailed implementation plan if this direction gets a green light. Happy to iterate on the API surface before writing code.