Skip to content

拆分内核日志与系统日志监控开关#17

Open
ghangz wants to merge 4 commits into
MetaX-MACA:mainfrom
ghangz:mengz/separate-log-monitor-switches
Open

拆分内核日志与系统日志监控开关#17
ghangz wants to merge 4 commits into
MetaX-MACA:mainfrom
ghangz:mengz/separate-log-monitor-switches

Conversation

@ghangz

@ghangz ghangz commented Jun 25, 2026

Copy link
Copy Markdown

目前日志监控开关粒度偏粗,无法只打开一种来源的日志采集。这个改动把内核日志和系统日志拆成独立开关,既降低默认开销,也方便针对具体问题单独启用最需要的监控。

这次改动保持默认行为兼容,只在明确需要的场景下引入新的诊断、配置入口或结果输出,避免把沐曦适配相关的问题拖到更晚的运行阶段才暴露出来。

我已经在沐曦 GPU 环境里完成了对应分支验证,并补充或执行了以下相关测试:

  • tests/test_parser_monitor_flags.py

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces independent controls for kernel and system log monitoring via new command-line flags (--kernel-log-monitor and --sys-log-monitor). The argument parsing logic was refactored into a separate helper function, and a unit test was added to verify the new flags. The feedback suggests optimizing the log monitors to avoid starting background threads when disabled, using keyword arguments for MxCollector instantiation to improve maintainability, and adding a test case to verify the default values of the new flags.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +66 to 70
if kernel_log_monitor_flag and any(metric in self.metrics_required for metric in self.kernel_log_monitor.get_supported_metrics()):
self.kernel_log_monitor.start(mount_point)

if any(metric in self.metrics_required for metric in self.sys_log_monitor.get_supported_metrics()):
if sys_log_monitor_flag and any(metric in self.metrics_required for metric in self.sys_log_monitor.get_supported_metrics()):
self.sys_log_monitor.start(mount_point)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

MxCollector 初始化时,即使 kernel_log_monitor_flagsys_log_monitor_flag 被设置为 0(即禁用对应的日志监控),KernelLogMonitorSysLogMonitor 依然会在 init_members() 中被实例化。

这会导致其内部的 LogHandler 被创建,并立即启动后台守护线程(transfer 线程)。这些线程会一直运行(执行 while True 循环、获取锁、复制空数据并休眠),从而造成不必要的系统资源开销。这与本次修改“降低默认开销”的初衷相违背。

建议优化方案:
可以考虑重构 LogHandlerLogMonitor,将后台线程的启动时机延迟到 start() 方法被调用时,而不是在 __init__ 中直接启动。这样在未启用对应监控时,就不会创建和运行多余的后台线程。

Comment thread mx_exporter/__init__.py
Comment on lines +148 to +158
mx_collector = MxCollector(
cfg_file,
registry,
args.interval/1000,
args.ib_monitor,
args.mount_point,
args.kubelet_path,
args.k8s_domains,
args.kernel_log_monitor,
args.sys_log_monitor,
)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

在实例化 MxCollector 时,目前使用了大量的 positional arguments(位置参数)。这种方式在参数较多时非常容易出错,且未来如果 MxCollector.__init__ 的参数顺序发生变化或新增参数,极易导致代码中断,可维护性较差。

建议改用 keyword arguments(关键字参数)来显式传递参数,这样不仅能提高代码的可读性,还能增强代码的健壮性。

Suggested change
mx_collector = MxCollector(
cfg_file,
registry,
args.interval/1000,
args.ib_monitor,
args.mount_point,
args.kubelet_path,
args.k8s_domains,
args.kernel_log_monitor,
args.sys_log_monitor,
)
mx_collector = MxCollector(
config_file=cfg_file,
registry=registry,
gather_interval=args.interval/1000,
ib_monitor_flag=args.ib_monitor,
mount_point=args.mount_point,
kubelet_path=args.kubelet_path,
k8s_domains=args.k8s_domains,
kernel_log_monitor_flag=args.kernel_log_monitor,
sys_log_monitor_flag=args.sys_log_monitor,
)

Comment on lines +7 to +11
def test_parser_has_separate_log_monitor_switches(self):
parser = build_arg_parser()
args = parser.parse_args(["--kernel-log-monitor", "0", "--sys-log-monitor", "1"])
self.assertEqual(args.kernel_log_monitor, 0)
self.assertEqual(args.sys_log_monitor, 1)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

建议增加对默认参数值的单元测试,以确保在不显式指定 --kernel-log-monitor--sys-log-monitor 时,它们能够正确地默认启用(即值为 1),从而保证向后兼容性。

Suggested change
def test_parser_has_separate_log_monitor_switches(self):
parser = build_arg_parser()
args = parser.parse_args(["--kernel-log-monitor", "0", "--sys-log-monitor", "1"])
self.assertEqual(args.kernel_log_monitor, 0)
self.assertEqual(args.sys_log_monitor, 1)
def test_parser_has_separate_log_monitor_switches(self):
parser = build_arg_parser()
args = parser.parse_args(["--kernel-log-monitor", "0", "--sys-log-monitor", "1"])
self.assertEqual(args.kernel_log_monitor, 0)
self.assertEqual(args.sys_log_monitor, 1)
def test_parser_default_log_monitor_switches(self):
parser = build_arg_parser()
args = parser.parse_args([])
self.assertEqual(args.kernel_log_monitor, 1)
self.assertEqual(args.sys_log_monitor, 1)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant