Add exporter health status and degradation signals #16

Open
ghangz wants to merge 4 commits into
MetaX-MACA:main from
ghangz:mengz/health-report-status

@ghangz ghangz commented Jun 25, 2026

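This change lets the health check endpoint report the exporter's current state more accurately. It returns finer-grained health results that distinguish between a monitor that is not required, a collection error, and a monitor failure, so upstream systems can tell a truly unavailable exporter apart from a degraded one that can keep serving.

The corresponding tests have been added and pass. They focus on the logic that returns the degraded status and on how monitor state is represented in the health payload.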

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a new /health HTTP endpoint to the exporter, tracking collector readiness, collection counts, execution times, and errors. The reviewer provided valuable feedback on improving the health check robustness. Specifically, they suggested tracking whether required log monitors are successfully started and marking the health status as degraded if they fail. Additionally, they recommended returning a 200 status code instead of 503 for a degraded state to prevent Kubernetes from unnecessarily restarting the exporter container during transient collection failures. They also provided suggestions for corresponding unit tests.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread mx_exporter/mx_exporter.py Outdated
Comment on lines +67 to +71
  if any(metric in self.metrics_required for metric in self.kernel_log_monitor.get_supported_metrics()):
-     self.kernel_log_monitor.start(mount_point)
+     self.kernel_log_monitor_started = self.kernel_log_monitor.start(mount_point)

  if any(metric in self.metrics_required for metric in self.sys_log_monitor.get_supported_metrics()):
-     self.sys_log_monitor.start(mount_point)
+     self.sys_log_monitor_started = self.sys_log_monitor.start(mount_point)

medium

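When log-related metrics are configured for collection (for example driver_log_errors or sdk_eid_errors) but the corresponding log monitor (kernel_log_monitor or sys_log_monitor) fails to start (for example because the log file does not exist or is not mounted), the /health endpoint currently still returns 200 OK with status "ok". This hides configuration errors and environment problems.

Consider recording whether each monitor is required, so that get_health_status() can later verify that it started successfully.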

Suggested change
- if any(metric in self.metrics_required for metric in self.kernel_log_monitor.get_supported_metrics()):
-     self.kernel_log_monitor.start(mount_point)
-     self.kernel_log_monitor_started = self.kernel_log_monitor.start(mount_point)
- if any(metric in self.metrics_required for metric in self.sys_log_monitor.get_supported_metrics()):
-     self.sys_log_monitor.start(mount_point)
-     self.sys_log_monitor_started = self.sys_log_monitor.start(mount_point)
+ self.kernel_log_monitor_required = any(metric in self.metrics_required for metric in self.kernel_log_monitor.get_supported_metrics())
+ if self.kernel_log_monitor_required:
+     self.kernel_log_monitor_started = self.kernel_log_monitor.start(mount_point)
+ self.sys_log_monitor_required = any(metric in self.metrics_required for metric in self.sys_log_monitor.get_supported_metrics())
+ if self.sys_log_monitor_required:
+     self.sys_log_monitor_started = self.sys_log_monitor.start(mount_point)
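
The bookkeeping suggested above can be sketched as follows. This is a minimal illustration, not the exporter's code: StubLogMonitor, ExporterSketch, the log file names, and the mapping of metrics to monitors are all hypothetical, and the sketch assumes start() returns a truthy value on success, as the assignment to the started flags implies.

```python
import os


class StubLogMonitor:
    """Hypothetical stand-in for a log monitor; not the exporter's real class."""

    def __init__(self, log_name, supported_metrics):
        self.log_name = log_name
        self.supported_metrics = list(supported_metrics)

    def get_supported_metrics(self):
        return self.supported_metrics

    def start(self, mount_point):
        # Assumption: start() reports success as a bool. In this stub,
        # "success" only means the log file exists under the mount point.
        return os.path.isfile(os.path.join(mount_point, self.log_name))


class ExporterSketch:
    """Records, per monitor, whether it is required and whether it started."""

    def __init__(self, metrics_required, kernel_log_monitor, sys_log_monitor):
        self.metrics_required = set(metrics_required)
        self.kernel_log_monitor = kernel_log_monitor
        self.sys_log_monitor = sys_log_monitor
        # Define defaults up front so later reads never need getattr().
        self.kernel_log_monitor_required = False
        self.kernel_log_monitor_started = False
        self.sys_log_monitor_required = False
        self.sys_log_monitor_started = False

    def _is_required(self, monitor):
        return any(m in self.metrics_required for m in monitor.get_supported_metrics())

    def start_monitors(self, mount_point):
        self.kernel_log_monitor_required = self._is_required(self.kernel_log_monitor)
        if self.kernel_log_monitor_required:
            self.kernel_log_monitor_started = bool(self.kernel_log_monitor.start(mount_point))

        self.sys_log_monitor_required = self._is_required(self.sys_log_monitor)
        if self.sys_log_monitor_required:
            self.sys_log_monitor_started = bool(self.sys_log_monitor.start(mount_point))
```

Initializing the four flags in the constructor is a design choice of this sketch: it means a health check that runs before start_monitors() sees well-defined values instead of missing attributes.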

Comment on lines +485 to +487
def get_health_status(self):
    return {
        "ready": self.ready,

medium

In get_health_status(), use the required flags recorded earlier to verify that every required log monitor started successfully. If a required monitor failed to start, set monitors_healthy to False.

Suggested change
- def get_health_status(self):
-     return {
-         "ready": self.ready,
+ def get_health_status(self):
+     monitors_healthy = True
+     if getattr(self, "kernel_log_monitor_required", False) and not self.kernel_log_monitor_started:
+         monitors_healthy = False
+     if getattr(self, "sys_log_monitor_required", False) and not self.sys_log_monitor_started:
+         monitors_healthy = False
+     return {
+         "ready": self.ready,
+         "monitors_healthy": monitors_healthy,
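
One way to generalize the check above is to derive monitors_healthy from a table of (required, started) pairs, which also yields a per-monitor state for the health payload. The function name summarize_monitors, the "monitors" key, and the state labels "not_required", "running", and "failed" are hypothetical choices for this sketch, not names taken from the PR.

```python
def summarize_monitors(monitors):
    """Summarize monitor health.

    `monitors` maps a monitor name to a (required, started) pair.
    A monitor only counts against health if it is required and did not start.
    """
    states = {}
    healthy = True
    for name, (required, started) in monitors.items():
        if not required:
            states[name] = "not_required"
        elif started:
            states[name] = "running"
        else:
            states[name] = "failed"
            healthy = False
    return {"monitors_healthy": healthy, "monitors": states}
```

A monitor that is simply not required never lowers monitors_healthy, which keeps the "not required" case distinct from the "failed to start" case in the reported state.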

Comment thread mx_exporter/__init__.py Outdated
Comment on lines +134 to +135
    elif details.get("last_collect_error"):
        status = "degraded"

medium

In build_health_payload(), if monitors_healthy is False, mark the health status as "degraded" so that the health check endpoint exposes monitor startup failures.

Suggested change
- elif details.get("last_collect_error"):
-     status = "degraded"
+ elif details.get("last_collect_error") or not details.get("monitors_healthy", True):
+     status = "degraded"
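
The status decision discussed above can be sketched end to end. Only the elif branch appears in the PR; the surrounding branches, the "details" key, the handling of a missing collector, and the name build_health_payload_sketch are assumptions made for illustration. FakeCollector mirrors the test double used in the PR's tests, under the assumption that it simply returns the dict it was given.

```python
class FakeCollector:
    """Test double: get_health_status() returns a copy of a fixed dict."""

    def __init__(self, details):
        self._details = details

    def get_health_status(self):
        return dict(self._details)


def build_health_payload_sketch(collector):
    # Assumption: a missing collector is treated like one that is not ready.
    details = collector.get_health_status() if collector is not None else {}

    if not details.get("ready"):
        status = "starting"
    elif details.get("last_collect_error") or not details.get("monitors_healthy", True):
        # Default of True keeps collectors that do not report
        # monitors_healthy from being flagged as degraded.
        status = "degraded"
    else:
        status = "ok"

    return {"status": status, "details": details}
```

Checking readiness first means an exporter that has not finished initializing reports "starting" even if an early collection error or monitor failure is already recorded.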

Comment thread mx_exporter/__init__.py Outdated
    def _handle_health(self):
        collector = EXPORTER_CONTEXT.get("collector")
        payload = build_health_payload(collector)
        status_code = 200 if payload["status"] == "ok" else 503

medium

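When payload["status"] is "degraded" (for example, a single collection failed and last_collect_error is set), the endpoint returns a 503 status code.
If users configure the /health endpoint as a Kubernetes liveness probe, a 503 response causes Kubernetes to restart the exporter container repeatedly. Collection failures, however, are usually caused by problems in the GPU driver, the hardware, or a backend service. Restarting the exporter container does not fix those underlying problems and can instead lead to a restart loop and a complete loss of monitoring data.

Suggestion:
Return 200 for the "degraded" status and return 503 only for "starting" (initialization not yet complete). Alternatively, provide a finer-grained distinction so that the container is not killed for problems other than a crash of the exporter process itself.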

Suggested change
- status_code = 200 if payload["status"] == "ok" else 503
+ status_code = 200 if payload["status"] in ("ok", "degraded") else 503
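
The mapping suggested above can be sketched with the standard library's http.server. The names status_code_for and make_handler, the JSON response body, and the 404 for other paths are hypothetical; the exporter's real handler and response format may differ.

```python
import json
from http.server import BaseHTTPRequestHandler


def status_code_for(status):
    """Map a health status string to an HTTP status code.

    "degraded" still answers 200 so a liveness probe does not restart a
    process that is alive but cannot collect; anything else is 503.
    """
    return 200 if status in ("ok", "degraded") else 503


def make_handler(get_payload):
    """Build a request handler class that serves /health from `get_payload()`."""

    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/health":
                self.send_error(404)
                return
            payload = get_payload()
            body = json.dumps(payload).encode("utf-8")
            self.send_response(status_code_for(payload["status"]))
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, format, *args):
            # Silence per-request logging in this sketch.
            pass

    return HealthHandler
```

To try it locally, something like HTTPServer(("127.0.0.1", 8000), make_handler(lambda: {"status": "degraded"})).serve_forever() would serve the endpoint, with HTTPServer imported from http.server and the port chosen arbitrarily. Because the status remains in the response body, callers can still tell "degraded" apart from "ok" even though both return 200.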

Comment on lines +25 to +28
    def test_returns_degraded_when_last_collect_failed(self):
        collector = FakeCollector({"ready": True, "last_collect_error": "boom"})
        payload = build_health_payload(collector)
        self.assertEqual(payload["status"], "degraded")

medium

Add a unit test that checks "degraded" is returned when monitors_healthy is False, to confirm the logic is correct.

Suggested change
-     def test_returns_degraded_when_last_collect_failed(self):
-         collector = FakeCollector({"ready": True, "last_collect_error": "boom"})
-         payload = build_health_payload(collector)
-         self.assertEqual(payload["status"], "degraded")
+     def test_returns_degraded_when_last_collect_failed(self):
+         collector = FakeCollector({"ready": True, "last_collect_error": "boom"})
+         payload = build_health_payload(collector)
+         self.assertEqual(payload["status"], "degraded")
+     def test_returns_degraded_when_monitors_unhealthy(self):
+         collector = FakeCollector({"ready": True, "last_collect_error": "", "monitors_healthy": False})
+         payload = build_health_payload(collector)
+         self.assertEqual(payload["status"], "degraded")

@ghangz ghangz changed the title from "Add exporter health status endpoint" to "??????????????" Jun 26, 2026
@ghangz ghangz changed the title from "??????????????" to "Add exporter health status and degradation signals" Jun 26, 2026