Pull requests: NVIDIA/nvidia-resiliency-ext
Pull requests list
[CC] Attribution inline processing stop repeated failures
#341
opened May 21, 2026 by
helisha91
Contributor
[codex] Apply managed attribution restart decisions
#334
opened May 13, 2026 by
namitdhameja
Contributor
Draft
Add dmesg health logging and attribution
#328
opened May 11, 2026 by
namitdhameja
Contributor
fix(attribution): refresh FR surface and references
#317
opened Apr 30, 2026 by
sbak5
Contributor
Refresh async checkpoint IPC cache on pointer change
ci-approved
Approved to run CI
#314
opened Apr 28, 2026 by
sbak5
Contributor
feat: ft_launcher integration of log analysis for restart decisions
#293
opened Apr 7, 2026 by
namitdhameja
Contributor
ft_launcher: integrate log analysis attribution for restart decisions
ci-approved
#269
opened Feb 27, 2026 by
namitdhameja
Contributor
Haim updates new version of logsage and nvdataflow
#243
opened Jan 11, 2026 by
helisha91
Contributor
Infra HC service over UDS
ci-approved
#227
opened Dec 6, 2025 by
namitdhameja
Contributor
Add cycle tracking and REST API for failure attribution
#217
opened Nov 4, 2025 by
hexinw-nvidia
Contributor
Draft
feat: add non-retryable exception pattern matching
#212
opened Oct 28, 2025 by
hexinw-nvidia
Contributor
Draft
Auto restart
ci-approved
#139
opened Aug 6, 2025 by
hexinw-nvidia
Contributor
Draft
Add example for multimodal models
ci-approved
#131
opened Jul 25, 2025 by
Ava-A4098
Added in-process wrapper restart latency
#118
opened Jul 13, 2025 by
namitdhameja
Contributor
Test UT.
ci-approved
#79
opened May 17, 2025 by
hexinw-nvidia
Contributor
Draft