
[26.04_linux-nvidia] linux-nvidia-7.0: SMT-aware asymmetric CPU capacity idle selection #405

Closed

arighi wants to merge 4 commits into NVIDIA:26.04_linux-nvidia from arighi:linux-nvidia-7.0

Conversation


@arighi arighi commented May 5, 2026

On Vera Rubin, the firmware exposes CPUs with different capacities through ACPI/CPPC. Unlike Grace systems, Vera Rubin also supports SMT. As a result, the Linux scheduler enables the asymmetric CPU capacity idle selection policy, but the current implementation is not SMT-aware. This can lead to suboptimal task placement, where tasks are scheduled on both SMT siblings of the same core even when fully idle SMT cores are available elsewhere in the system.

This behavior can significantly reduce performance: slowdowns of up to 2x have been observed in certain CPU-intensive workloads.

This series is a backport of the upstream patch series available at:
https://lore.kernel.org/all/20260428051720.3180182-1-arighi@nvidia.com

NOTE: the original series includes additional patches that are not needed in linux-nvidia-7.0:

  • PATCH 1/6 is a refactoring that is valid only in kernels >= 7.0, because it requires 71fedc41c23b ("sched/fair: Switch to rcu_dereference_all()"), and is not worth backporting;
  • PATCH 6/6 is incorrect and will be dropped, so it is not backported.

The series is currently under review on the mailing list, but consensus has been reached with the scheduler maintainers and the changes are expected to be merged for v7.2.

Given the potential impact on Vera Rubin performance, it seems reasonable to backport and apply these patches to the linux-nvidia kernel and carry them as NVIDIA SAUCE for now, until the upstream solution becomes available.

The patch series has been validated on both Vera and Grace, running DCPerf Mediawiki and benchblas (NVBLAS).

NOTE: the same series has been applied to the linux-nvidia-6.17 kernel (see also #395)


LP: https://bugs.launchpad.net/ubuntu/+source/linux-nvidia-bos/+bug/2150671


github-actions Bot commented May 5, 2026

PR Validation Report

Patchscan ✅ No Missing Fixes

All cherry-picked commits checked — no missing upstream fixes found.

PR Lint ❌ Errors found

Details
Checking 4 commits...

Cherry-pick digest:
E: 55ee0d2c62b3 ("NVIDIA: VR: SAUCE: sched/fair: Attach sc"): diff MISMATCH with lore patch (add [Author: reason] annotation if intentional)
┌──────────────┬───────────────────────────────────────────────┬────────────┬─────────┬───────────────────────────┐
│ Local        │ Referenced upstream / Patch subject           │ Patch-ID   │ Subject │ SoB chain                 │
├──────────────┼───────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 9011a2603e7a │ [SAUCE] sched/fair: add sis_util support to s │ N/A        │ N/A     │ arighi, nayak, arighi     │
├──────────────┼───────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 55e068ef94cd │ [SAUCE] sched/fair: reject misfit pulls onto  │ N/A        │ N/A     │ arighi, arighi            │
├──────────────┼───────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 42459d49ffef │ [SAUCE] sched/fair: prefer fully-idle smt cor │ N/A        │ N/A     │ arighi, arighi            │
├──────────────┼───────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 55ee0d2c62b3 │ sched/fair: attach sched_domain_shared to sd_ │ MISMATCH   │ found   │ ok, backporter: arighi    │
└──────────────┴───────────────────────────────────────────────┴────────────┴─────────┴───────────────────────────┘

Lint: all checks passed.

arighi force-pushed the linux-nvidia-7.0 branch from f6d6def to 6f6d76d on May 5, 2026 08:40

nvmochs commented May 5, 2026

@arighi Do you suspect these will be accepted by the maintainer soon? If so, maybe we should wait so we can pick from -next? FYI we're still a couple weeks out from the first 7.0 build.


59a400f NVIDIA: SAUCE: sched/fair: Attach sched_domain_shared to sd_asym_cpucapacity

Also, for this patch, would it make sense to just pick the dependent patches so this patch picks clean?

@jamieNguyenNVIDIA left a comment

Acked-by: Jamie Nguyen <jamien@nvidia.com>


arighi commented May 5, 2026

> @arighi Do you suspect these will be accepted by the maintainer soon? If so, maybe we should wait so we can pick from -next? FYI we're still a couple weeks out from the first 7.0 build.

We may still need to refine a few small things before merging upstream, but the core patch is unlikely to change at this point. I'm still aiming for inclusion in v7.1, though it's difficult to be certain: scheduler changes can be unpredictable, and last-minute feedback can easily push things out.

> 59a400f NVIDIA: SAUCE: sched/fair: Attach sched_domain_shared to sd_asym_cpucapacity
>
> Also, for this patch, would it make sense to just pick the dependent patches so this patch picks clean?

I actually started pulling in all the dependent patches, but it turns out this one needs around 6 additional patches to apply cleanly, which raises some concerns about potential regressions. IMHO at this point it feels safer to make a few targeted adjustments to the original patch, rather than bringing in 6 more patches. And it'd make the code more maintainable as well (less risk of conflicts with stable updates). WDYT?


nvmochs commented May 5, 2026

> > @arighi Do you suspect these will be accepted by the maintainer soon? If so, maybe we should wait so we can pick from -next? FYI we're still a couple weeks out from the first 7.0 build.
>
> We may still need to refine a few small things before merging upstream, but the core patch is unlikely to change at this point. I'm still aiming for inclusion in v7.1, though it's difficult to be certain, scheduler changes can be unpredictable, and last-minute feedback can easily push things out.

I see. I guess we can always re-pick the upstream version at a later point in time if we feel it's worth it.

> > 59a400f NVIDIA: SAUCE: sched/fair: Attach sched_domain_shared to sd_asym_cpucapacity
> > Also, for this patch, would it make sense to just pick the dependent patches so this patch picks clean?
>
> I actually started pulling in all the dependent patches, but it turns out this one needs around 6 additional patches to apply cleanly, which raises some concerns about potential regressions. IMHO at this point it feels safer to make a few targeted adjustments to the original patch, rather than bringing in 6 more patches. And it'd make the code more maintainable as well (less risk of conflicts with stable updates). WDYT?

My main concern stems from this being the LTS branch, so these SAUCE patches will be around for a long time. So we would definitely want to approach this in whatever manner will optimally support maintainability.

That said, is there any benefit of having this PR for non-Vera systems? At the moment we're taking the approach of not adding VR-specific support in the LTS kernel. I have a Jira open to add VR support later once all pieces are upstream if it makes sense from a product perspective.


arighi commented May 7, 2026

> > > @arighi Do you suspect these will be accepted by the maintainer soon? If so, maybe we should wait so we can pick from -next? FYI we're still a couple weeks out from the first 7.0 build.
> >
> > We may still need to refine a few small things before merging upstream, but the core patch is unlikely to change at this point. I'm still aiming for inclusion in v7.1, though it's difficult to be certain, scheduler changes can be unpredictable, and last-minute feedback can easily push things out.
>
> I see. I guess we can always re-pick the upstream version at a later point in time if we feel it's worth it.

Yes, the idea is to have something that solves the problem for VR while minimizing the changes to the v7.0 sched core. If the upstream series diverges too much (unlikely) and it's worth backporting the changes, we can still revert these patches and apply the new ones (but honestly I don't think that will be necessary).

> > > 59a400f NVIDIA: SAUCE: sched/fair: Attach sched_domain_shared to sd_asym_cpucapacity
> > > Also, for this patch, would it make sense to just pick the dependent patches so this patch picks clean?
> >
> > I actually started pulling in all the dependent patches, but it turns out this one needs around 6 additional patches to apply cleanly, which raises some concerns about potential regressions. IMHO at this point it feels safer to make a few targeted adjustments to the original patch, rather than bringing in 6 more patches. And it'd make the code more maintainable as well (less risk of conflicts with stable updates). WDYT?
>
> My main concern stems from this being the LTS branch, so these SAUCE patches will be around for a long time. So we would definitely want to approach this in whatever manner will optimally support maintainability.

Right, and I think backporting more patches (that will never be backported to v7.0 upstream) will just increase the risk of conflicts when we apply stable updates. This series is essentially the minimal set of changes needed to introduce SMT awareness to asym-capacity idle selection, touching as little code as possible.

> That said, is there any benefit of having this PR for non-Vera systems? At the moment we're taking the approach of not adding VR-specific support in the LTS kernel. I have a Jira open to add VR support later once all pieces are upstream if it makes sense from a product perspective.

As I mentioned in #406, the last patch (the SIS_UTIL one) also affects Grace. The first 3, instead, affect only Vera.

But this series is mostly focused on VR, so if the plan is to add VR support later, it might make sense to include this as part of those later patches as well.

clsotog pushed a commit to clsotog/NV-Kernels that referenced this pull request May 7, 2026
BugLink: https://bugs.launchpad.net/bugs/2139460

commit 7458f72 upstream.

If of_genpd_add_provider_onecell() fails during probe, the previously
created generic power domains are not removed, leading to a memory leak
and potential kernel crash later in genpd_debug_add().

Add proper error handling to unwind the initialized domains before
returning from probe to ensure all resources are correctly released on
failure.

Example crash trace observed without this fix:

  | Unable to handle kernel paging request at virtual address fffffffffffffc70
  | CPU: 1 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.18.0-rc1 NVIDIA#405 PREEMPT
  | Hardware name: ARM LTD ARM Juno Development Platform/ARM Juno Development Platform
  | pstate: 00000005 (nzcv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
  | pc : genpd_debug_add+0x2c/0x160
  | lr : genpd_debug_init+0x74/0x98
  | Call trace:
  |  genpd_debug_add+0x2c/0x160 (P)
  |  genpd_debug_init+0x74/0x98
  |  do_one_initcall+0xd0/0x2d8
  |  do_initcall_level+0xa0/0x140
  |  do_initcalls+0x60/0xa8
  |  do_basic_setup+0x28/0x40
  |  kernel_init_freeable+0xe8/0x170
  |  kernel_init+0x2c/0x140
  |  ret_from_fork+0x10/0x20

Fixes: 898216c ("firmware: arm_scmi: add device power domain support using genpd")
Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
Reviewed-by: Peng Fan <peng.fan@nxp.com>
Cc: stable@vger.kernel.org
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
CVE-2025-68204
Signed-off-by: Manuel Diewald <manuel.diewald@canonical.com>
Signed-off-by: Edoardo Canepa <edoardo.canepa@canonical.com>

nvmochs commented May 8, 2026

> > > > @arighi Do you suspect these will be accepted by the maintainer soon? If so, maybe we should wait so we can pick from -next? FYI we're still a couple weeks out from the first 7.0 build.
> > >
> > > We may still need to refine a few small things before merging upstream, but the core patch is unlikely to change at this point. I'm still aiming for inclusion in v7.1, though it's difficult to be certain, scheduler changes can be unpredictable, and last-minute feedback can easily push things out.
> >
> > I see. I guess we can always re-pick the upstream version at a later point in time if we feel it's worth it.
>
> Yes, the idea is to have something that solves the problem for VR, minimizing the changes to the v7.0 sched core. If the upstream series will diverge too much (unlikely) and it's worth backporting the changes, we can still revert these patches and re-apply the new ones (but honestly I don't think it'll be necessary).
>
> > > > 59a400f NVIDIA: SAUCE: sched/fair: Attach sched_domain_shared to sd_asym_cpucapacity
> > > > Also, for this patch, would it make sense to just pick the dependent patches so this patch picks clean?
> > >
> > > I actually started pulling in all the dependent patches, but it turns out this one needs around 6 additional patches to apply cleanly, which raises some concerns about potential regressions. IMHO at this point it feels safer to make a few targeted adjustments to the original patch, rather than bringing in 6 more patches. And it'd make the code more maintainable as well (less risk of conflicts with stable updates). WDYT?
> >
> > My main concern stems from this being the LTS branch, so these SAUCE patches will be around for a long time. So we would definitely want to approach this in whatever manner will optimally support maintainability.
>
> Right, and I think backporting more patches (that will never be backported to v7.0 upstream) will just increase the risk of conflicts when we apply stable updates. This series is essentially the bare minimum set of changes to introduce SMT-awareness to asym-cpu touching the bare minimum amount of code.
>
> > That said, is there any benefit of having this PR for non-Vera systems? At the moment we're taking the approach of not adding VR-specific support in the LTS kernel. I have a Jira open to add VR support later once all pieces are upstream if it makes sense from a product perspective.
>
> As I mentioned in #406, the last patch (the SIS_UTIL one) is also affecting Grace. The first 3 instead are only affecting Vera.
>
> But this series is mostly focused on VR, so if the plan is to add VR support later, it might make sense to include this as part of that later patches as well.

@arighi Thanks for the clarifications.

We discussed this during our Grace/Vera sync meeting earlier today and decided it makes sense to move forward with this PR in the 7.0-LTS kernel. Once the commit tags are updated we will ack and merge.

arighi and others added 4 commits May 8, 2026 16:43
…cpucapacity

BugLink: https://bugs.launchpad.net/bugs/2150671

On asymmetric CPU capacity systems, the wakeup path uses
select_idle_capacity(), which scans the span of sd_asym_cpucapacity
rather than sd_llc.

The has_idle_cores hint however lives on sd_llc->shared, so the
wakeup-time read of has_idle_cores operates on an LLC-scoped blob while
the actual scan/decision spans the wider asym domain; nr_busy_cpus also
lives in the same shared sched_domain data, but it's never used in the
asym CPU capacity scenario.

Therefore, move the sched_domain_shared object to sd_asym_cpucapacity
whenever the CPU has a SD_ASYM_CPUCAPACITY_FULL ancestor and that
ancestor is non-overlapping (i.e., not built from SD_NUMA). In that case
the scope of has_idle_cores matches the scope of the wakeup scan.

Fall back to attaching the shared object to sd_llc in three cases:

  1) plain symmetric systems (no SD_ASYM_CPUCAPACITY_FULL anywhere);

  2) CPUs in an exclusive cpuset that carves out a symmetric capacity
     island: has_asym is system-wide but those CPUs have no
     SD_ASYM_CPUCAPACITY_FULL ancestor in their hierarchy and follow
     the symmetric LLC path in select_idle_sibling();

  3) exotic topologies where SD_ASYM_CPUCAPACITY_FULL lands on an
     SD_NUMA-built domain. init_sched_domain_shared() keys the shared
     blob off cpumask_first(span), which on overlapping NUMA domains
     would alias unrelated spans onto the same blob. Keep the shared
     object on the LLC there; select_idle_capacity() gracefully skips
     the has_idle_cores preference when sd->shared is NULL.

While at it, also rename the per-CPU sd_llc_shared to sd_balance_shared,
as it is no longer strictly tied to the LLC.

Co-developed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
(backported from https://lore.kernel.org/all/20260428051720.3180182-1-arighi@nvidia.com)
[ arighi:
   - backport full logic to attach sd->shared in build_sched_domains()
   - do not rename sd_llc_shared to reduce the risk of conflicts ]
Signed-off-by: Andrea Righi <arighi@nvidia.com>
…pacity idle selection

BugLink: https://bugs.launchpad.net/bugs/2150671

On systems with asymmetric CPU capacity (e.g., ACPI/CPPC reporting
different per-core frequencies), the wakeup path uses
select_idle_capacity() and prioritizes idle CPUs with higher capacity
for better task placement. However, when those CPUs belong to SMT cores,
their effective capacity can be much lower than the nominal capacity
when the sibling thread is busy: SMT siblings compete for shared
resources, so a "high capacity" CPU that is idle but whose sibling is
busy does not deliver its full capacity. This effective capacity
reduction cannot be modeled by the static capacity value alone.

Introduce SMT awareness in the asym-capacity idle selection policy: when
SMT is active, always prefer fully-idle SMT cores over partially-idle
ones.

Prioritizing fully-idle SMT cores yields better task placement because
the effective capacity of partially-idle SMT cores is reduced; always
preferring them when available leads to more accurate capacity usage on
task wakeup.

On an SMT system with asymmetric CPU capacities, SMT-aware idle
selection has been shown to improve throughput by around 15-18% for
CPU-bound workloads running a number of tasks equal to the number of
SMT cores.

Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Christian Loehle <christian.loehle@arm.com>
Cc: Koba Ko <kobak@nvidia.com>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Reported-by: Felix Abecassis <fabecassis@nvidia.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
(cherry picked from https://lore.kernel.org/all/20260428051720.3180182-1-arighi@nvidia.com)
Signed-off-by: Andrea Righi <arighi@nvidia.com>
…ings on asym-capacity

BugLink: https://bugs.launchpad.net/bugs/2150671

When SD_ASYM_CPUCAPACITY load balancing considers pulling a misfit task,
capacity_of(dst_cpu) can overstate available compute if the SMT sibling is
busy: the core does not deliver its full nominal capacity.

If SMT is active and dst_cpu is not on a fully idle core, skip this
destination so we do not migrate a misfit expecting a capacity upgrade we
cannot actually provide.

Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Christian Loehle <christian.loehle@arm.com>
Cc: Koba Ko <kobak@nvidia.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Reported-by: Felix Abecassis <fabecassis@nvidia.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
(cherry picked from https://lore.kernel.org/all/20260428051720.3180182-1-arighi@nvidia.com)
Signed-off-by: Andrea Righi <arighi@nvidia.com>
…pacity()

BugLink: https://bugs.launchpad.net/bugs/2150671

Add to select_idle_capacity() the same SIS_UTIL-controlled idle-scan
mechanism already used by select_idle_cpu(): when sched_feat(SIS_UTIL)
is enabled and the LLC domain has sched_domain_shared data, derive the
per-attempt scan limit from sd->shared->nr_idle_scan.

That bounds the walk on large LLCs and allows an early return once the
scan limit is reached, if we already picked a sufficiently strong
idle-core candidate (best_fits == ASYM_IDLE_CORE_UCLAMP_MISFIT).

Co-developed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
(cherry picked from https://lore.kernel.org/all/20260428051720.3180182-1-arighi@nvidia.com)
Signed-off-by: Andrea Righi <arighi@nvidia.com>
arighi force-pushed the linux-nvidia-7.0 branch from 6f6d76d to 9011a26 on May 8, 2026 14:44
@clsotog left a comment

Acked-by: Carol L Soto <csoto@nvidia.com>


nvmochs commented May 8, 2026

Thanks Andrea!

No further issues from me.

Acked-by: Matthew R. Ochs <mochs@nvidia.com>


nvmochs commented May 8, 2026

Merged, closing PR.

fa30595374b2 (nresolute/main-next) NVIDIA: VR: SAUCE: sched/fair: Add SIS_UTIL support to select_idle_capacity()
ba0cb6d3c0fa NVIDIA: VR: SAUCE: sched/fair: Reject misfit pulls onto busy SMT siblings on asym-capacity
27869379e41f NVIDIA: VR: SAUCE: sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection
6f0e658c9d51 NVIDIA: VR: SAUCE: sched/fair: Attach sched_domain_shared to sd_asym_cpucapacity

@nvmochs nvmochs closed this May 8, 2026