# [GAutoWeb][Controller] Centralized log collection and troubleshooting for large-scale deployments

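> Once GAutoWeb controllers are deployed at scale, the logs from every node need to be collected and investigated centrally. Based on the existing architecture of the autoit repository, this document reviews what already exists, identifies the gaps, and recommends options ordered by the investment they require.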

## What the current architecture already provides

| Log type | Produced at | How it is centralized | How to view it |
|---------|---------|---------|---------|
| **Subtask stdout/stderr** | Python/shell run on the controller | Real-time `POST /api/subtask/log/append` → backend writes `data/subtask_logs/subtask-{id}.log` | Web UI real-time SSE tail + download |
| **Subtask local copy** | `<scriptsDir>/logs/*.log` | Local machine only | Requires SSH to each machine |
| **Controller's own logs** | Rust `tracing` → stdout | **Not reported** | Local terminal / systemd journal |
| **Backend logs** | loguru → stdout | Single machine | `journalctl -u gauto.backend` |

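### Subtask log pipeline

1. The controller writes a local file under `<scriptsDir>/logs/` and simultaneously tees the output to the backend via `log_forward.rs`.
2. The backend's `subtask_service.append_log()` appends to `GAUTOWEB_SUBTASK_LOG_ROOT` (default `data/subtask_logs/subtask-{id}.log`).
3. The Web UI tails the log in real time through `/api/subtask/{id}/log/stream` (SSE); `/api/subtask/{id}/log` downloads the full log.
4. Lark notifications can include a link to the subtask log.

**Conclusion**: for finding out, per subtask, why a script crashed, the existing Web UI + Lark is mostly sufficient. For per-controller views, cross-machine queries, full-text search, and long-term retention, the existing setup falls short.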
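Step 2 of the pipeline can be illustrated with a minimal sketch. This is not the actual `subtask_service.append_log()` implementation; the function signatures are hypothetical, and it assumes `GAUTOWEB_SUBTASK_LOG_ROOT` names a directory whose default is `data/subtask_logs`.

```python
import os
from pathlib import Path


def log_root() -> Path:
    # Assumption: the env var points at a directory, defaulting to data/subtask_logs.
    return Path(os.environ.get("GAUTOWEB_SUBTASK_LOG_ROOT", "data/subtask_logs"))


def subtask_log_path(root: Path, subtask_id: int) -> Path:
    # One flat file per subtask, named subtask-{id}.log as in the table above.
    return root / f"subtask-{subtask_id}.log"


def append_log(root: Path, subtask_id: int, chunk: str) -> Path:
    """Append one forwarded chunk to the subtask's central log file."""
    root.mkdir(parents=True, exist_ok=True)
    path = subtask_log_path(root, subtask_id)
    with path.open("a", encoding="utf-8") as fh:
        fh.write(chunk)
    return path
```

Because every chunk lands in a single append-only file on one host, disk usage, backup, and search all scale with that one machine, which is the limitation discussed in the next section.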

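## Main gaps for large-scale deployments

1. **Controller agent logs are not centralized.** Heartbeat failures, adb/ios discovery errors, git sync failures, claim/report errors and the like all stay in each machine's stdout and are invisible from the Web UI.
2. **Storage is a flat file on a single machine.** As the number of controllers and tasks grows, disk space, backup, and search all become bottlenecks.
3. **No unified search dimensions.** There is no way to search across machines by `controller_id`, `hostname`, `platform`, or `ERROR`.
4. **The local copy and the central copy may diverge.** During a network outage `log/append` may drop segments, so troubleshooting needs a clear rule for which copy is authoritative.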
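To make gap 4 concrete, here is a small sketch of how the two copies of one subtask log could be compared once both are on the same machine. It assumes the local file and the central file are meant to be byte-for-byte identical streams, which the source does not confirm; the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    # Hash in blocks so large log files do not have to fit in memory.
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for block in iter(lambda: fh.read(65536), b""):
            digest.update(block)
    return digest.hexdigest()


def compare_copies(local: Path, central: Path) -> str:
    """Classify the relationship between the local and central copies."""
    local_size = local.stat().st_size
    central_size = central.stat().st_size
    if local_size == central_size and sha256_of(local) == sha256_of(central):
        return "identical"
    if central_size < local_size:
        # Consistent with segments dropped while log/append was failing.
        return "central-shorter"
    return "differs"
```

A `central-shorter` result is the case where the local copy under `<scriptsDir>/logs/` is worth consulting as a supplement.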

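## Recommended options (from smallest to largest investment)

### Option A: Minimal change, make full use of the existing pipeline (for <20 machines)

- **Subtask logs**: keep using the existing SSE + Web UI; failure notifications go through Lark.
- **Controller logs**: run the controller binary under **systemd + journald** on every machine, and use `journalctl -u controller -f` to investigate agent problems locally.
- **Backend**: `make logs` (journalctl) already exists, and the backend runs on a single API server.
- **Operational convention**: the backend's **`subtask-{id}.log` is authoritative**; the local `<scriptsDir>/logs/` is only a supplementary source.

**Pros**: no new components. **Cons**: as the number of controllers grows, you have to SSH into each machine, and there is no global search.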
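The per-machine SSH cost of Option A can be seen in a small helper that builds one `journalctl` invocation per host. This is only a sketch: the host names are hypothetical, the unit name `controller` is taken from the command above, and `--grep` requires a reasonably recent systemd built with PCRE2 support.

```python
import shlex
from typing import List, Optional


def journalctl_command(
    host: str,
    unit: str = "controller",
    since: str = "1 hour ago",
    pattern: Optional[str] = None,
) -> List[str]:
    """Build an ssh command that reads one unit's journal on a remote host."""
    remote = ["journalctl", "-u", unit, "--since", since, "--no-pager", "-o", "short-iso"]
    if pattern:
        remote += ["--grep", pattern]
    # ssh passes the remote command through a shell, so quote it as one string.
    return ["ssh", host, shlex.join(remote)]


if __name__ == "__main__":
    # Hypothetical host list; each entry would be run with subprocess.run(cmd).
    for host in ["lab-linux-01", "lab-linux-02"]:
        print(journalctl_command(host, pattern="heartbeat error"))
```

Every additional controller adds one more SSH round trip, which is why this approach stops being practical beyond a couple of dozen machines.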

### Option B (recommended): Agent collection + Loki/Grafana (for 20~500 machines)

```
Each controller machine:
  systemd journal (controller process)
  + scriptsDir/logs/*.log
  + optional: backend journal
        ↓
  Promtail / Vector / Fluent Bit (lightweight agent)
        ↓
  Loki (or the company's existing ELK/OpenSearch)
        ↓
  Grafana Explore / Dashboard / Alert
```

**Label design**:

```text
controller_id=123
controller_name=lab-mac-01
hostname=...
platform=android|ios|macos
log_type=agent|subtask
subtask_id=456   # subtask logs only
```
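The label set above can be expressed as a small validation helper that also renders a LogQL stream selector. This is an illustrative sketch; the function names are hypothetical, and the allowed values are copied from the label design above.

```python
from typing import Dict, Optional

PLATFORMS = {"android", "ios", "macos"}
LOG_TYPES = {"agent", "subtask"}


def build_labels(
    controller_id: int,
    controller_name: str,
    hostname: str,
    platform: str,
    log_type: str,
    subtask_id: Optional[int] = None,
) -> Dict[str, str]:
    """Validate the label values and return them as a string-to-string mapping."""
    if platform not in PLATFORMS:
        raise ValueError(f"unknown platform: {platform}")
    if log_type not in LOG_TYPES:
        raise ValueError(f"unknown log_type: {log_type}")
    labels = {
        "controller_id": str(controller_id),
        "controller_name": controller_name,
        "hostname": hostname,
        "platform": platform,
        "log_type": log_type,
    }
    if log_type == "subtask":
        # subtask_id only exists on subtask logs.
        if subtask_id is None:
            raise ValueError("subtask logs need a subtask_id")
        labels["subtask_id"] = str(subtask_id)
    return labels


def to_selector(labels: Dict[str, str]) -> str:
    """Render labels as a LogQL stream selector such as {log_type="agent"}."""
    def escape(value: str) -> str:
        return value.replace("\\", "\\\\").replace('"', '\\"')

    parts = [f'{key}="{escape(value)}"' for key, value in sorted(labels.items())]
    return "{" + ", ".join(parts) + "}"
```

One design consideration: Loki's documentation advises keeping label cardinality low, so at high task volumes `subtask_id` may be better carried in the log line or as structured metadata than as an indexed label.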

**Grafana LogQL examples**:

```logql
{log_type="agent"} |= "heartbeat error"
{log_type="subtask", subtask_id="456"}
{hostname=~"lab-.*"} |~ "(?i)(error|failed|traceback)"
```
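The same queries can also be issued outside Grafana through Loki's HTTP API. The sketch below only builds the request URL for the `GET /loki/api/v1/query_range` endpoint; the base URL is a placeholder, and authentication or multi-tenancy headers, if your deployment needs them, are not covered.

```python
import time
from urllib.parse import urlencode


def query_range_url(base_url: str, logql: str, start_ns: int, end_ns: int, limit: int = 100) -> str:
    """Build a Loki query_range URL; start and end are Unix epoch nanoseconds."""
    params = {
        "query": logql,
        "start": str(start_ns),
        "end": str(end_ns),
        "limit": str(limit),
        "direction": "backward",  # newest entries first
    }
    return f"{base_url.rstrip('/')}/loki/api/v1/query_range?{urlencode(params)}"


if __name__ == "__main__":
    now_ns = time.time_ns()
    hour_ns = 3600 * 10**9
    # Placeholder Loki address; replace with the real one.
    print(query_range_url(
        "http://loki.example.internal:3100",
        '{log_type="agent"} |= "heartbeat error"',
        now_ns - hour_ns,
        now_ns,
    ))
```

The resulting URL can be fetched with any HTTP client, which makes it straightforward to script recurring checks from a CLI.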

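**Relationship to the existing system**:

- **Keep** `log/append` → real-time log viewing in the Web UI (the product experience is unchanged).
- **In parallel**, ship the same files and agent logs to Loki (for troubleshooting, alerting, and cross-machine search).
- Tier the subtask logs: **Loki for long-term storage, backend files for short-term hot storage**.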

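### Option C: In-platform enhancement, extend GAutoWeb itself (if you do not want a new stack)

| Change | Purpose |
|-----|------|
| Add `POST /api/controller/log/append` for the controller (modeled on the subtask endpoint) | Centralize agent logs |
| `tracing` outputs JSON with a fixed `controller_id` field | Structured output that is easy to filter |
| Move subtask logs to **MinIO/S3/NFS** | The backend disk stops being a single point of failure |
| Add a simple backend query API: `?controllerId=&since=&q=ERROR` | Search across subtasks from the Web UI or a CLI |
| Log retention cron (e.g. 30 days) | Keep disk usage under control |

**Pros**: consistent with the existing model. **Cons**: full-text search, aggregation, and alerting all have to be built in-house.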
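To show what the structured-JSON and query-API rows amount to, here is a sketch that filters JSON log lines by the three parameters `controllerId`, `since`, and `q`. The record fields other than `controller_id` (`ts`, `level`, `message`) and both function names are assumptions for illustration, not an existing GAutoWeb schema.

```python
import json
from datetime import datetime, timezone
from typing import Iterable, Iterator, Optional


def make_record(controller_id: int, level: str, message: str, ts: Optional[datetime] = None) -> str:
    """Serialize one log event as a JSON line with a fixed controller_id field."""
    ts = ts or datetime.now(timezone.utc)
    return json.dumps({
        "ts": ts.isoformat(),
        "controller_id": controller_id,
        "level": level,
        "message": message,
    })


def search(
    lines: Iterable[str],
    controller_id: Optional[int] = None,
    since: Optional[datetime] = None,
    q: Optional[str] = None,
) -> Iterator[dict]:
    """Yield records matching every filter that is set; skip malformed lines."""
    for line in lines:
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(rec, dict) or "ts" not in rec:
            continue
        if controller_id is not None and rec.get("controller_id") != controller_id:
            continue
        if since is not None and datetime.fromisoformat(rec["ts"]) < since:
            continue
        haystack = f"{rec.get('level', '')} {rec.get('message', '')}"
        if q is not None and q not in haystack:
            continue
        yield rec
```

This is a linear scan, so it illustrates the stated drawback of Option C: anything beyond simple substring filtering (indexing, aggregation, alerting) would still have to be built.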

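### Option D: Company-wide observability (if the infrastructure already exists)

If the company already runs **ELK, OpenSearch, Datadog, Alibaba Cloud SLS, Tencent Cloud CLS**, or similar:

- Install **Filebeat / Vector** on every controller to collect the journal and `scriptsDir/logs/`.
- Connect the backend to the same platform.
- Reuse existing alert rules (ERROR rate, heartbeat timeout, clustering of failed subtasks).

**Principle**: prefer the company platform; do not stand up a separate Loki.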

## Suggested troubleshooting workflow

| Scenario | Check first | Check second |
|-----|--------|--------|
| A single subtask failed | Web SSE / `/api/subtask/{id}/log` | Loki `{subtask_id="..."}` |
| A controller is not working | `last_seen_at` on the Controller page | Agent logs: heartbeat, adb, claim |
| adb unauthorized in bulk | WARN entries aggregated by hostname | Device USB / authorization state |
| Backend receives no logs | `subtask log append failed` in the controller agent log | Network, API load, disk full |
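The "controller is not working" row starts from `last_seen_at`. A minimal sketch of that first check follows; the five-minute threshold and the shape of the input mapping are assumptions, since the source does not state the heartbeat interval.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List


def stale_controllers(
    last_seen: Dict[int, datetime],
    now: datetime,
    max_silence: timedelta = timedelta(minutes=5),  # assumed threshold
) -> List[int]:
    """Return the IDs of controllers whose last heartbeat is older than max_silence."""
    return sorted(cid for cid, seen in last_seen.items() if now - seen > max_silence)


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    # Hypothetical data: controller_id -> last_seen_at
    seen = {123: now - timedelta(minutes=1), 124: now - timedelta(minutes=30)}
    print(stale_controllers(seen, now))
```

The controllers it returns are the ones whose agent logs (heartbeat, adb, claim) are worth pulling next.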

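## Recommended combination

**If dozens to hundreds of controllers will be rolled out soon**:

1. **Short term (within 1 day)**: run every controller under systemd so its logs go to journald; state explicitly that the central `subtask_logs/` is authoritative for subtask logs.
2. **Medium term (1~2 weeks, recommended)**: deploy **Loki + Grafana + Promtail**; collect journald and `<scriptsDir>/logs/`; attach the labels `controller_id`, `controller_name`, and `platform`.
3. **Long term (depending on volume)**: keep hot subtask log data for 7~30 days and move cold data to object storage or rely on Loki retention; optionally add structured JSON tracing upload to the controller.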
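The hot-data retention in step 3 can be sketched as a small cleanup routine run from cron. It assumes the flat `subtask-{id}.log` layout described earlier and uses file modification time as the age; the function names and the dry-run default are illustrative choices.

```python
import time
from pathlib import Path
from typing import List, Optional


def expired_logs(root: Path, max_age_days: int = 30, now: Optional[float] = None) -> List[Path]:
    """List subtask log files whose mtime is older than max_age_days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    return sorted(p for p in root.glob("subtask-*.log") if p.stat().st_mtime < cutoff)


def purge(root: Path, max_age_days: int = 30, dry_run: bool = True, now: Optional[float] = None) -> List[Path]:
    """Delete expired logs unless dry_run is set; return the affected paths."""
    victims = expired_logs(root, max_age_days, now)
    if not dry_run:
        for path in victims:
            path.unlink()
    return victims


if __name__ == "__main__":
    # Dry run by default: only report what would be deleted.
    for path in purge(Path("data/subtask_logs"), max_age_days=30):
        print("would delete", path)
```

If cold data is to be archived rather than dropped, the upload to object storage would need to happen before the `unlink()` call.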

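**If there are fewer than 10 machines and the main need is reading script output**: the existing Web UI plus the local journal is usually enough, and Loki is unnecessary.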

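## Open questions

1. How many controllers are expected? (10 / 50 / 200+ directly affects the choice.)
2. Does the company already run ELK, Loki, SLS, or similar? If so, integrating with it should take priority.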

