Add exporter health status and degradation signals by ghangz · Pull Request #16 · MetaX-MACA/mxExporter

ghangz · 2026-06-25T10:05:58Z

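This change lets the health check endpoint report the exporter's current state more accurately. It distinguishes between a monitor that is not required, a collection error, and a monitor failure, so that upstream systems can tell a truly unavailable exporter from a degraded one that can keep serving.

The corresponding tests have been added and pass. They focus on the degraded-status return logic and on how monitor state is represented in the health payload.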

gemini-code-assist

Code Review

This pull request introduces a new /health HTTP endpoint to the exporter, tracking collector readiness, collection counts, execution times, and errors. The reviewer provided valuable feedback on improving the health check robustness. Specifically, they suggested tracking whether required log monitors are successfully started and marking the health status as degraded if they fail. Additionally, they recommended returning a 200 status code instead of 503 for a degraded state to prevent Kubernetes from unnecessarily restarting the exporter container during transient collection failures. They also provided suggestions for corresponding unit tests.


gemini-code-assist · 2026-06-25T10:08:06Z

        if any(metric in self.metrics_required for metric in self.kernel_log_monitor.get_supported_metrics()):
-            self.kernel_log_monitor.start(mount_point)
+            self.kernel_log_monitor_started = self.kernel_log_monitor.start(mount_point)

        if any(metric in self.metrics_required for metric in self.sys_log_monitor.get_supported_metrics()):
-            self.sys_log_monitor.start(mount_point)
+            self.sys_log_monitor_started = self.sys_log_monitor.start(mount_point)


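When log-related metrics (such as driver_log_errors or sdk_eid_errors) are configured but the corresponding log monitor (kernel_log_monitor or sys_log_monitor) fails to start (for example, because the log file does not exist or is not mounted), the /health endpoint currently still returns 200 OK with status "ok". This hides misconfiguration and environment problems.

Consider recording whether each monitor is required, so that get_health_status() can later verify that it started successfully.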

Suggested change

-        if any(metric in self.metrics_required for metric in self.kernel_log_monitor.get_supported_metrics()):
-            self.kernel_log_monitor_started = self.kernel_log_monitor.start(mount_point)
-        if any(metric in self.metrics_required for metric in self.sys_log_monitor.get_supported_metrics()):
-            self.sys_log_monitor_started = self.sys_log_monitor.start(mount_point)
+        self.kernel_log_monitor_required = any(metric in self.metrics_required for metric in self.kernel_log_monitor.get_supported_metrics())
+        if self.kernel_log_monitor_required:
+            self.kernel_log_monitor_started = self.kernel_log_monitor.start(mount_point)
+        self.sys_log_monitor_required = any(metric in self.metrics_required for metric in self.sys_log_monitor.get_supported_metrics())
+        if self.sys_log_monitor_required:
+            self.sys_log_monitor_started = self.sys_log_monitor.start(mount_point)
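
The bookkeeping described in this comment can be tried in isolation. The following is a minimal sketch, not the exporter's actual code: `FakeLogMonitor`, `Collector`, `start_monitors()` and `monitors_healthy()` are hypothetical names, it assumes that `start()` returns a truthy value on success, and the assignment of driver_log_errors to the kernel log monitor is an assumption made only for the example.

```python
class FakeLogMonitor:
    """Hypothetical stand-in for a log monitor; not the real mxExporter class."""

    def __init__(self, supported_metrics, log_available):
        self._supported_metrics = list(supported_metrics)
        self._log_available = log_available

    def get_supported_metrics(self):
        return self._supported_metrics

    def start(self, mount_point):
        # Assumption: start() reports success as True and failure as False.
        return self._log_available


class Collector:
    """Hypothetical collector that records which monitors are required and started."""

    def __init__(self, metrics_required, kernel_log_monitor, sys_log_monitor):
        self.metrics_required = set(metrics_required)
        self.kernel_log_monitor = kernel_log_monitor
        self.sys_log_monitor = sys_log_monitor
        # Initialise every flag so later health checks never hit a missing attribute.
        self.kernel_log_monitor_required = False
        self.kernel_log_monitor_started = False
        self.sys_log_monitor_required = False
        self.sys_log_monitor_started = False

    def start_monitors(self, mount_point):
        self.kernel_log_monitor_required = any(
            metric in self.metrics_required
            for metric in self.kernel_log_monitor.get_supported_metrics()
        )
        if self.kernel_log_monitor_required:
            self.kernel_log_monitor_started = bool(self.kernel_log_monitor.start(mount_point))

        self.sys_log_monitor_required = any(
            metric in self.metrics_required
            for metric in self.sys_log_monitor.get_supported_metrics()
        )
        if self.sys_log_monitor_required:
            self.sys_log_monitor_started = bool(self.sys_log_monitor.start(mount_point))

    def monitors_healthy(self):
        # A monitor counts against health only if it was required and did not start.
        kernel_failed = self.kernel_log_monitor_required and not self.kernel_log_monitor_started
        sys_failed = self.sys_log_monitor_required and not self.sys_log_monitor_started
        return not (kernel_failed or sys_failed)


# The kernel log is required but unavailable; the sys log monitor is not required at all.
collector = Collector(
    metrics_required={"driver_log_errors"},
    kernel_log_monitor=FakeLogMonitor(["driver_log_errors"], log_available=False),
    sys_log_monitor=FakeLogMonitor(["sdk_eid_errors"], log_available=False),
)
collector.start_monitors("/host")
```

Initialising the flags in `__init__` is a design choice of this sketch; it removes the need for defensive `getattr` lookups when the flags are read later.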

gemini-code-assist · 2026-06-25T10:08:06Z

+    def get_health_status(self):
+        return {
+            "ready": self.ready,


In get_health_status(), use the previously recorded required flags to check whether each required log monitor started successfully. If a required monitor failed to start, set monitors_healthy to False.

Suggested change

     def get_health_status(self):
+        monitors_healthy = True
+        if getattr(self, "kernel_log_monitor_required", False) and not self.kernel_log_monitor_started:
+            monitors_healthy = False
+        if getattr(self, "sys_log_monitor_required", False) and not self.sys_log_monitor_started:
+            monitors_healthy = False
         return {
             "ready": self.ready,
+            "monitors_healthy": monitors_healthy,

gemini-code-assist · 2026-06-25T10:08:06Z

+    elif details.get("last_collect_error"):
+        status = "degraded"


In build_health_payload(), if monitors_healthy is False, mark the health status as "degraded" so that the health check endpoint exposes monitor startup failures.

Suggested change

-    elif details.get("last_collect_error"):
-        status = "degraded"
+    elif details.get("last_collect_error") or not details.get("monitors_healthy", True):
+        status = "degraded"
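
To make the resulting status logic concrete, here is a minimal sketch of how a payload builder could derive the status. It is not the PR's full implementation: only the `elif` branch is taken from the diff above. The "starting" branch for a collector that is not ready, the handling of a missing collector, the `details` key in the payload, and the `FakeCollector` shape are assumptions for illustration.

```python
class FakeCollector:
    """Hypothetical test double that returns a fixed health dictionary."""

    def __init__(self, health):
        self._health = health

    def get_health_status(self):
        return dict(self._health)


def build_health_payload(collector):
    # Assumption: a missing collector is reported the same way as one that is not ready.
    details = collector.get_health_status() if collector is not None else {}

    if not details.get("ready"):
        status = "starting"
    elif details.get("last_collect_error") or not details.get("monitors_healthy", True):
        # Either a failed collection or a required monitor that did not start.
        status = "degraded"
    else:
        status = "ok"

    # Assumption: the payload carries the raw details next to the status.
    return {"status": status, "details": details}


healthy = build_health_payload(FakeCollector({"ready": True, "last_collect_error": ""}))
unhealthy_monitor = build_health_payload(
    FakeCollector({"ready": True, "last_collect_error": "", "monitors_healthy": False})
)
```

Defaulting `monitors_healthy` to `True` keeps collectors that do not report the field from being flagged as degraded.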

gemini-code-assist · 2026-06-25T10:08:06Z

+    def _handle_health(self):
+        collector = EXPORTER_CONTEXT.get("collector")
+        payload = build_health_payload(collector)
+        status_code = 200 if payload["status"] == "ok" else 503


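When payload["status"] is "degraded" (for example, a single collection failed and last_collect_error is set), the endpoint returns a 503 status code.
If users configure /health as a Kubernetes liveness probe, a 503 response causes Kubernetes to restart the exporter container repeatedly. Collection failures, however, are usually caused by GPU driver, hardware, or backend service faults. Restarting the exporter container does not fix these underlying problems and may instead lead to a restart loop and a complete loss of monitoring data.

Suggestion:
Return 200 for the "degraded" status and return 503 only for "starting" (initialization not yet complete). Alternatively, provide a finer-grained distinction so that the container is not killed for problems other than a crash of the exporter process itself.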

Suggested change

-        status_code = 200 if payload["status"] == "ok" else 503
+        status_code = 200 if payload["status"] in ("ok", "degraded") else 503
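
The mapping proposed here can be written as a small lookup, which keeps the policy in one place. This is a sketch under assumptions: `health_status_code()` and `STATUS_CODES` are hypothetical names, and answering 503 for any unrecognised status is a choice made for this example, not something the PR specifies.

```python
from http import HTTPStatus

# "degraded" still answers 200 so that a liveness probe does not restart the container.
STATUS_CODES = {
    "ok": HTTPStatus.OK,
    "degraded": HTTPStatus.OK,
    "starting": HTTPStatus.SERVICE_UNAVAILABLE,
}


def health_status_code(status):
    # Assumption: any status outside the table is reported as 503.
    return int(STATUS_CODES.get(status, HTTPStatus.SERVICE_UNAVAILABLE))


codes = {status: health_status_code(status) for status in ("ok", "degraded", "starting")}
```

With this mapping, a client that needs to tell "ok" from "degraded" has to read the status field in the response body, because both answer with the same HTTP code.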

gemini-code-assist · 2026-06-25T10:08:06Z

+    def test_returns_degraded_when_last_collect_failed(self):
+        collector = FakeCollector({"ready": True, "last_collect_error": "boom"})
+        payload = build_health_payload(collector)
+        self.assertEqual(payload["status"], "degraded")


Add a unit test for the case where monitors_healthy is False and the status is "degraded", to confirm that the logic is correct.

Suggested change

     def test_returns_degraded_when_last_collect_failed(self):
         collector = FakeCollector({"ready": True, "last_collect_error": "boom"})
         payload = build_health_payload(collector)
         self.assertEqual(payload["status"], "degraded")
+
+    def test_returns_degraded_when_monitors_unhealthy(self):
+        collector = FakeCollector({"ready": True, "last_collect_error": "", "monitors_healthy": False})
+        payload = build_health_payload(collector)
+        self.assertEqual(payload["status"], "degraded")


Prepare health-report-status PR

45b3b80

gemini-code-assist Bot reviewed Jun 25, 2026


report degraded exporter monitor state

1880760


ghangz changed the title ~~Add exporter health status endpoint~~ ?????????????? Jun 26, 2026

ghangz changed the title ~~??????????????~~ Add exporter health status and degradation signals Jun 26, 2026

ghangz added 2 commits June 29, 2026 17:22

sync review update: remove dep/libmxsml.so

f8a2147

sync review update: remove dep/mxsmlBindings.py

e481a57

ghangz wants to merge 4 commits into MetaX-MACA:main from ghangz:mengz/health-report-status
