Add exporter health status and degradation signals #16
Code Review
This pull request introduces a new /health HTTP endpoint to the exporter, tracking collector readiness, collection counts, execution times, and errors. The reviewer provided valuable feedback on improving the health check robustness. Specifically, they suggested tracking whether required log monitors are successfully started and marking the health status as degraded if they fail. Additionally, they recommended returning a 200 status code instead of 503 for a degraded state to prevent Kubernetes from unnecessarily restarting the exporter container during transient collection failures. They also provided suggestions for corresponding unit tests.
  if any(metric in self.metrics_required for metric in self.kernel_log_monitor.get_supported_metrics()):
-     self.kernel_log_monitor.start(mount_point)
+     self.kernel_log_monitor_started = self.kernel_log_monitor.start(mount_point)

  if any(metric in self.metrics_required for metric in self.sys_log_monitor.get_supported_metrics()):
-     self.sys_log_monitor.start(mount_point)
+     self.sys_log_monitor_started = self.sys_log_monitor.start(mount_point)
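When log-related metrics (such as driver_log_errors or sdk_eid_errors) are configured but the corresponding log monitor (kernel_log_monitor or sys_log_monitor) fails to start, for example because the log file does not exist or is not mounted, the /health endpoint currently still returns 200 OK with status "ok". This hides configuration errors and environment problems.
Consider recording whether each monitor is required, so that get_health_status() can later verify that it started successfully.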
Suggested change:
- if any(metric in self.metrics_required for metric in self.kernel_log_monitor.get_supported_metrics()):
-     self.kernel_log_monitor_started = self.kernel_log_monitor.start(mount_point)
- if any(metric in self.metrics_required for metric in self.sys_log_monitor.get_supported_metrics()):
-     self.sys_log_monitor_started = self.sys_log_monitor.start(mount_point)
+ self.kernel_log_monitor_required = any(metric in self.metrics_required for metric in self.kernel_log_monitor.get_supported_metrics())
+ if self.kernel_log_monitor_required:
+     self.kernel_log_monitor_started = self.kernel_log_monitor.start(mount_point)
+ self.sys_log_monitor_required = any(metric in self.metrics_required for metric in self.sys_log_monitor.get_supported_metrics())
+ if self.sys_log_monitor_required:
+     self.sys_log_monitor_started = self.sys_log_monitor.start(mount_point)
def get_health_status(self):
    return {
        "ready": self.ready,
In get_health_status(), use the recorded required flags to verify that each required log monitor started successfully. If a required monitor failed to start, set monitors_healthy to False.
Suggested change:
- def get_health_status(self):
-     return {
-         "ready": self.ready,
+ def get_health_status(self):
+     monitors_healthy = True
+     if getattr(self, "kernel_log_monitor_required", False) and not self.kernel_log_monitor_started:
+         monitors_healthy = False
+     if getattr(self, "sys_log_monitor_required", False) and not self.sys_log_monitor_started:
+         monitors_healthy = False
+     return {
+         "ready": self.ready,
+         "monitors_healthy": monitors_healthy,
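The two suggestions above (recording whether a monitor is required, then checking it in get_health_status()) can be sketched end to end as follows. This is only a minimal illustration: StubLogMonitor, SketchCollector, and start_monitors are hypothetical names, the mapping of metrics to monitors is made up for the example, and it assumes start() returns a truthy value on success, which the PR's assignment implies but this page does not confirm.

```python
class StubLogMonitor:
    """Hypothetical stand-in for a log monitor (not the exporter's real class)."""

    def __init__(self, supported_metrics, can_start):
        self._supported_metrics = list(supported_metrics)
        self._can_start = can_start

    def get_supported_metrics(self):
        return self._supported_metrics

    def start(self, mount_point):
        # Assumption: the real start() returns True on success, False otherwise.
        return self._can_start


class SketchCollector:
    """Hypothetical collector that tracks required/started flags per monitor."""

    def __init__(self, metrics_required, kernel_log_monitor, sys_log_monitor):
        self.metrics_required = set(metrics_required)
        self.kernel_log_monitor = kernel_log_monitor
        self.sys_log_monitor = sys_log_monitor
        self.ready = False
        self.kernel_log_monitor_required = False
        self.kernel_log_monitor_started = False
        self.sys_log_monitor_required = False
        self.sys_log_monitor_started = False

    def _is_required(self, monitor):
        # A monitor is required if any configured metric depends on it.
        return any(m in self.metrics_required for m in monitor.get_supported_metrics())

    def start_monitors(self, mount_point):
        self.kernel_log_monitor_required = self._is_required(self.kernel_log_monitor)
        if self.kernel_log_monitor_required:
            self.kernel_log_monitor_started = bool(self.kernel_log_monitor.start(mount_point))
        self.sys_log_monitor_required = self._is_required(self.sys_log_monitor)
        if self.sys_log_monitor_required:
            self.sys_log_monitor_started = bool(self.sys_log_monitor.start(mount_point))
        self.ready = True

    def get_health_status(self):
        # Only a required monitor that failed to start counts as unhealthy.
        monitors_healthy = True
        if self.kernel_log_monitor_required and not self.kernel_log_monitor_started:
            monitors_healthy = False
        if self.sys_log_monitor_required and not self.sys_log_monitor_started:
            monitors_healthy = False
        return {"ready": self.ready, "monitors_healthy": monitors_healthy}
```

Initializing the flags in __init__ makes the getattr fallbacks from the suggestion unnecessary in this sketch. A monitor that is not required never affects monitors_healthy, even if it could not have started.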
elif details.get("last_collect_error"):
    status = "degraded"
def _handle_health(self):
    collector = EXPORTER_CONTEXT.get("collector")
    payload = build_health_payload(collector)
    status_code = 200 if payload["status"] == "ok" else 503
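When payload["status"] is "degraded" (for example, a single collection failed and last_collect_error is set), the endpoint returns a 503 status code.
If users configure /health as a Kubernetes liveness probe, a 503 response causes Kubernetes to restart the exporter container repeatedly. However, collection failures are usually caused by problems in the GPU driver, the hardware, or a backend service. Restarting the exporter container does not fix these underlying problems and may instead lead to a restart loop and a complete loss of monitoring data.
Suggestion:
For the "degraded" status, return 200, and return 503 only for "starting" (initialization not yet complete). Alternatively, provide a finer-grained distinction so that the container is not killed by mistake for problems that are not a crash of the exporter process itself.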
Suggested change:
- status_code = 200 if payload["status"] == "ok" else 503
+ status_code = 200 if payload["status"] in ("ok", "degraded") else 503
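The status-to-HTTP mapping suggested above can be isolated in a small helper so it is easy to test on its own. This is a sketch, not the exporter's actual code: status_code_for and render_health_response are hypothetical names, and the JSON response body is an assumption.

```python
import json


def status_code_for(status):
    """Map a health status string to an HTTP status code."""
    # "ok" and "degraded" both mean the exporter process is alive and serving,
    # so a liveness probe should see a success code.
    if status in ("ok", "degraded"):
        return 200
    # Anything else, such as "starting", is reported as unavailable.
    return 503


def render_health_response(payload):
    """Return (status_code, body_bytes) for a health payload dict."""
    body = json.dumps(payload).encode("utf-8")
    return status_code_for(payload["status"]), body
```

Kubernetes HTTP probes treat any response code from 200 up to but not including 400 as success, so with this mapping a degraded exporter keeps running while the response body still carries the "degraded" status for callers that inspect it.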
def test_returns_degraded_when_last_collect_failed(self):
    collector = FakeCollector({"ready": True, "last_collect_error": "boom"})
    payload = build_health_payload(collector)
    self.assertEqual(payload["status"], "degraded")
Add a unit test that checks "degraded" is returned when monitors_healthy is False, to confirm the logic is correct.
Suggested change:
- def test_returns_degraded_when_last_collect_failed(self):
-     collector = FakeCollector({"ready": True, "last_collect_error": "boom"})
-     payload = build_health_payload(collector)
-     self.assertEqual(payload["status"], "degraded")
+ def test_returns_degraded_when_last_collect_failed(self):
+     collector = FakeCollector({"ready": True, "last_collect_error": "boom"})
+     payload = build_health_payload(collector)
+     self.assertEqual(payload["status"], "degraded")
+ def test_returns_degraded_when_monitors_unhealthy(self):
+     collector = FakeCollector({"ready": True, "last_collect_error": "", "monitors_healthy": False})
+     payload = build_health_payload(collector)
+     self.assertEqual(payload["status"], "degraded")
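For the new test to pass, build_health_payload has to treat monitors_healthy being False as a degraded state. The PR's actual implementation is not shown on this page, so the following is only a guess at its shape: the bodies of FakeCollector and build_health_payload, the "details" key, and the handling of a missing collector are all assumptions. Only the "starting", "degraded", and "ok" status values and the last_collect_error check come from the review.

```python
class FakeCollector:
    """Assumed test double: returns a fixed health-status dict."""

    def __init__(self, details):
        self._details = details

    def get_health_status(self):
        return dict(self._details)


def build_health_payload(collector):
    """Sketch of deriving an overall status from collector health details."""
    if collector is None:
        # Assumption: no collector registered yet means still starting.
        return {"status": "starting", "details": {}}
    details = collector.get_health_status()
    if not details.get("ready"):
        status = "starting"
    elif details.get("last_collect_error"):
        status = "degraded"
    elif not details.get("monitors_healthy", True):
        # A required log monitor failed to start.
        status = "degraded"
    else:
        status = "ok"
    return {"status": status, "details": details}
```

Defaulting monitors_healthy to True keeps older collectors, or fakes that omit the key, from being reported as degraded.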
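This change lets the health check endpoint report the exporter's current state more accurately. It distinguishes between a monitor that is not required, a collection error, and a monitor failure, so upstream systems can tell genuine unavailability apart from a degraded state that can still serve requests.
The corresponding tests have been added and pass. They focus on the logic that returns the degraded status and on how monitor state is represented in the health payload.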