A two-plane monitoring system for Linux hosts running Nginx or Apache, now with lightweight centralized remote monitoring via BT-Panel API — one Agent, many servers, no remote installation required.
Server-Mate is a lightweight server monitoring and AI operations workflow for Linux web hosts running Nginx or Apache.
As of v1.5.x, Server-Mate collects a comprehensive Linux system metrics stack across four layers — CPU detail, memory/swap, disk IOPS, network rates, process accounting, inode usage, TCP connection states, and systemd service health — all via psutil and the standard library with zero new dependencies. It also retains the v1.4.x Centralized Remote Monitoring architecture: a single Agent host can monitor an entire fleet by pulling logs through the BT-Panel (Baota) HTTP API with no remote installation required.
It splits responsibilities into two planes:
- Server Agent: a Python collector that tails local logs (or pulls remote logs through BT-Panel), samples host metrics, and writes SQLite rollups
- AI Analyzer: an OpenClaw-side layer that generates reports, pushes webhooks, explains issues, and drives guarded automation
- Centralized remote collection: pull Nginx / Apache logs from many hosts through BT-Panel API — no agent, daemon, or extra package on the remote machines
- Real-time metrics: CPU, memory, disk, load, and network I/O via `psutil`
- Extended Linux metrics, Layer 1: CPU user/system/iowait; memory available; swap usage; per-cycle disk IOPS and network Mbps; NIC error and drop counters
- Extended Linux metrics, Layer 2: process count, zombie detection, and top-5 CPU/memory process ranking
- Extended Linux metrics, Layer 3: root-disk inode usage; configurable extra partition monitoring (e.g. `/data`, `/home`)
- Extended Linux metrics, Layer 4: TCP connection state breakdown (ESTABLISHED / TIME_WAIT / CLOSE_WAIT); systemd service health probes
- Log parsing: normalized Nginx and Apache access/error log processing
- Traffic analytics: PV, UV, IP count, QPS, bandwidth, and status-code breakdowns
- Spider detection: crawler-family recognition and traffic separation
- Smart alerts: DingTalk, WeCom, Feishu, and Telegram webhook delivery
- 10 new alert kinds: `iowait_high`, `swap_high`, `memory_critical`, `net_errors`, `high_iops`, `zombie_process`, `inode_low`, `disk_multi_low`, `tcp_timewait_high`, `service_down`
- SSH Security Shield: auth-log brute-force detection linked to auto-ban
- AI diagnosis: plain-language explanations and remediation guidance
- Auto reports: daily, weekly, and monthly PDF reports with AI commentary
- SSL expiry checks: certificate remaining days in PDF summaries and webhook messages
- Guarded Automation: optional auto-ban and auto-heal with cooldowns, allowlists, and audit logs
- Add observability to Linux hosts without replacing your current stack
- Get AI-powered explanations instead of reading raw logs line by line
- Generate daily, weekly, and monthly ops reports automatically
- Detect suspicious IPs, 404 scans, 5xx spikes, and SSH brute-force attempts
- Enable safe automation with allowlists, TTLs, cooldowns, and audit trails
- Dynamic Theme Switcher: Added a sleek drop-down theme menu at the top-right header with Sun, Moon, and System Monitor icons. Allows switching between Light Mode, Dark Mode, and System Auto mode.
- Chart.js Color Adaptation: When the theme changes, JavaScript adjusts the Chart.js grid lines, tick fonts, legend text, and scale colors, then redraws the canvas immediately. This resolves the blue/purple accent lines that appeared on a light grey background.
- Persistent Preferences: Saves the user's theme selection in `localStorage` so it persists across restarts.
- Viewport Grids: CSS now collapses the 4 circular gauges into a 2x2 grid on tablets and a single-column stack on mobile.
- Horizontal Table Scrolling: All dashboard data tables are wrapped in responsive scrolling blocks (`overflow-x: auto`) with minimum widths, so tables are not squeezed or broken on mobile screens.
- Flexible Typography: Fluid page padding (shrinking from `24px` to `12px` on mobile) and font-size scaling for legibility.
Server-Mate now features a built-in, lightweight web dashboard served directly by the Python agent in daemon mode.
- Modern Dashboard UI: Visually stunning dark console with glassmorphic cards, responsive columns, and real-time updating SVG progress circle gauges (CPU, RAM, Swap, Disk).
- Line Charts Trend: Includes dynamic real-time system resource & traffic load charts (CPU utilization vs. QPS history) rendered with Chart.js.
- SRE Command Center: Shows active system alerts, Top 5 CPU processes list, monitored sites traffic statistics table, and the active firewall blocked IPs log in real time.
- Zero Dependencies: Powered by Python's standard `http.server` running in a background daemon thread (enabled via `--dashboard` or `dashboard.enabled: true`, on port `8000`). Easy to embed in any webpage or to reverse-proxy with Nginx.
Elevates the heuristic auto_ban into a smart AI-driven firewall:
- Intelligent Auditing: When a traffic-related warning (e.g. `suspicious_ip_burst` or a CPU-spike candidate) occurs, the Agent captures the candidate IP's request log context (URIs, methods, statuses, and User-Agents) and queries the LLM for classification.
- Spider & Crawler Protection: The LLM decides whether the client is a scraper/scanner (it calls `ban_ip`) or a crawler/harmless user (it calls `whitelist` to bypass the ban).
- Auditing Cache: Cached decisions (`llm_shield_cache`) persist in state for 24 hours to prevent repetitive API calls and save tokens.
- Enables running SRE monitoring, recovery tracking, and diagnostics fully standalone, without OpenClaw. You can now configure `api_key` directly in `config.yaml` under the `ai_analysis` block.
To prevent token wastage on recurring alerts (like cpu_high triggering every 10 minutes when CPU remains busy), Server-Mate now caches AI-generated diagnoses in its persistent state.
- Cache Reuse: Reuses the cached diagnosis for matching alert categories and domains for up to 1 hour (configurable via `ai_cooldown_seconds: 3600`), cutting token spend by up to 90%.
- Caching Badge: Alerts using a cached diagnosis carry a `【AI: 缓存】` ("AI: cached") badge.
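The cache-reuse behaviour can be pictured as a small TTL cache keyed by alert kind and domain. The sketch below is illustrative only: `DiagnosisCache`, `diagnose`, and the `call_llm` callable are hypothetical names, and the real agent persists its cache in the state file rather than in memory.

```python
import time


class DiagnosisCache:
    """Hypothetical in-memory TTL cache keyed by (alert_kind, domain)."""

    def __init__(self, cooldown_seconds=3600, clock=time.time):
        self.cooldown_seconds = cooldown_seconds
        self._clock = clock
        self._entries = {}  # (alert_kind, domain) -> (stored_at, diagnosis)

    def get(self, alert_kind, domain):
        entry = self._entries.get((alert_kind, domain))
        if entry is None:
            return None
        stored_at, diagnosis = entry
        if self._clock() - stored_at >= self.cooldown_seconds:
            # Entry is older than the cooldown: drop it and force a fresh call.
            del self._entries[(alert_kind, domain)]
            return None
        return diagnosis

    def put(self, alert_kind, domain, diagnosis):
        self._entries[(alert_kind, domain)] = (self._clock(), diagnosis)


def diagnose(cache, alert_kind, domain, call_llm):
    """Return (diagnosis, from_cache); call_llm is only invoked on a miss."""
    cached = cache.get(alert_kind, domain)
    if cached is not None:
        return cached, True
    fresh = call_llm(alert_kind, domain)
    cache.put(alert_kind, domain, fresh)
    return fresh, False
```

The `from_cache` flag is what a caller could use to decide whether to attach the cached badge to the alert card.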
Switched to a two-pass psutil sampling method with a 1.0-second delay for system and process metrics.
- False Alarm Elimination: Eliminates transient false positives caused by Python interpreter startup or log-parsing CPU usage.
- Process Metrics Accuracy: Fixes the issue where all processes reported `0.0%` CPU utilization due to single-pass sampling limitations.
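To see why two passes are needed: CPU utilization is a rate, so it has to be computed from two cumulative readings taken some time apart. A single reading has no baseline, which is why single-pass sampling reported `0.0%`. The following is a minimal sketch of the idea with a pluggable reader, not the agent's actual code; `two_pass_cpu_percent` and `read_cpu_seconds` are hypothetical names.

```python
import time


def two_pass_cpu_percent(read_cpu_seconds, interval=1.0,
                         sleep=time.sleep, clock=time.monotonic):
    """Sample cumulative CPU seconds twice and convert the delta to a percentage.

    read_cpu_seconds: callable returning cumulative CPU time in seconds
    (for example, user + system time of one process).
    """
    start_cpu, start_wall = read_cpu_seconds(), clock()
    sleep(interval)  # the 1.0-second gap between the two passes
    end_cpu, end_wall = read_cpu_seconds(), clock()

    elapsed = end_wall - start_wall
    if elapsed <= 0:
        return 0.0
    return max(0.0, 100.0 * (end_cpu - start_cpu) / elapsed)
```

With `psutil`, the same effect is usually achieved by priming `cpu_percent()` once, waiting, and reading it again; the helper above just makes the arithmetic explicit.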
Integrated auto_ban with cpu_high and iowait_high warnings to mitigate application-level CC/DDoS attacks on slow endpoints (e.g. /responses).
- Dynamic Mitigation: When CPU is exhausted and `ban_on_cpu_spike` is active, the Agent scans logs for the top client IP. If its request rate exceeds `cpu_spike_rpm_threshold: 60.0`, the IP is automatically banned (e.g., using `iptables`).
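The selection step can be sketched as follows, assuming the agent already has the client IPs of the requests seen in the current window. `pick_ban_candidate` and its parameters are hypothetical; only the 60.0 RPM default comes from the text above.

```python
from collections import Counter


def pick_ban_candidate(client_ips, window_seconds, rpm_threshold=60.0,
                       allowlist=frozenset()):
    """Return (ip, requests_per_minute) for the busiest client, or None.

    client_ips: one entry per request observed in the window.
    allowlist: trusted IPs that must never be proposed for a ban.
    """
    counts = Counter(ip for ip in client_ips if ip not in allowlist)
    if not counts or window_seconds <= 0:
        return None
    ip, hits = counts.most_common(1)[0]
    rpm = hits * 60.0 / window_seconds
    # Only the single top talker is considered, and only above the threshold.
    return (ip, rpm) if rpm > rpm_threshold else None
```

For example, 150 requests from one IP in a 120-second window is 75 RPM, which exceeds the default threshold and would make that IP a ban candidate.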
When hardware or service alerts trigger, the Agent automatically executes a suite of relevant troubleshooting commands locally (via subprocess) or remotely (via BT-Panel API exec_shell). The diagnostic report is appended directly to the webhook alert push message:
- CPU/Memory/Swap alerts: runs `ps` sorted by usage, `uptime`, `free`, and `dmesg` / `journalctl` OOM filters.
- Disk/Inode alerts: runs `df -hT`, `df -i`, and `du -sh` directory scans to locate major space/inode consumers.
- Network/TCP alerts: runs `ss -s`, `ss -tn state time-wait`, and `ip -s link`.
- Service down alerts: runs `systemctl status <unit>` and `journalctl -u <unit>` logs.
Server-Mate tracks active alerts in state and notifies webhooks when they resolve (e.g. `✅ Server-Mate 已恢复`, "Server-Mate recovered"), complete with:
- Duration stats: calculates how long the incident lasted (e.g. `持续时长: 约 2 分 34 秒`, "duration: about 2 min 34 s").
- Peak values: displays the highest metric value reached during the incident.
- Jitter mitigation: a minimum duration setting (`recovery_min_duration_seconds`, default 30s) suppresses noise from short-lived transient spikes.
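A rough sketch of how recovery tracking could work is shown below. `IncidentTracker` is a hypothetical class, and the in-memory dict stands in for the agent's persisted state; only the 30-second default is taken from the text.

```python
class IncidentTracker:
    """Hypothetical tracker for alert start time, peak value, and recovery."""

    def __init__(self, min_duration_seconds=30):
        self.min_duration_seconds = min_duration_seconds
        self._active = {}  # alert kind -> {"started_at": ts, "peak": value}

    def observe(self, kind, firing, value, now):
        """Feed one sample; return a recovery record when an incident ends."""
        incident = self._active.get(kind)
        if firing:
            if incident is None:
                self._active[kind] = {"started_at": now, "peak": value}
            else:
                incident["peak"] = max(incident["peak"], value)
            return None
        if incident is None:
            return None  # nothing was active, nothing to resolve
        del self._active[kind]
        duration = now - incident["started_at"]
        if duration < self.min_duration_seconds:
            return None  # too short: treat as jitter, send no recovery notice
        return {"kind": kind, "duration_seconds": duration, "peak": incident["peak"]}
```

An incident that starts at `t=0`, peaks at 97.5, and clears at `t=154` would yield a record with a duration of 154 seconds (2 min 34 s) and a peak of 97.5.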
All new metrics are collected via psutil and the Python standard library. Zero new dependencies are required.
Layer 1 — CPU detail, memory/swap, disk IOPS & network rates
- `cpu_user_pct`, `cpu_system_pct`, `cpu_iowait_pct` from `cpu_times_percent()`
- `memory_used_bytes`, `memory_available_bytes`, `swap_used_pct`, `swap_used_bytes`
- Per-cycle delta IOPS: `disk_read_iops`, `disk_write_iops` (ops/s), `disk_read_bytes_delta`, `disk_write_bytes_delta`
- `net_rx_mbps`, `net_tx_mbps`, `net_rx_errs`, `net_tx_errs`, `net_rx_drop`, `net_tx_drop`
Layer 2 — Process accounting
- `process_count`, `process_running`, `process_sleeping`, `process_zombie`
- `top_cpu_procs` and `top_mem_procs`: top-5 processes by CPU and memory usage
Layer 3 — Inode & extra partition monitoring
- `disk_inode_used_pct`: inode saturation on the root mount via `os.statvfs()`
- Configurable `extra_disk_partitions` list: per-mount `used_pct`, `free_bytes`, and `inode_used_pct`
Layer 4 — TCP states & systemd service health
- `tcp_established`, `tcp_time_wait`, `tcp_close_wait` via `psutil.net_connections(kind="tcp")`
- Configurable `service_probes` list: checks each unit with `systemctl is-active` and returns `service_failed_units`
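As an illustration of the Layer 3 inode metric, the standard library is enough to compute inode saturation for a mount point. This is a minimal sketch; the helper name `inode_used_pct` is hypothetical, and returning `None` for filesystems that report no inode total is an assumption of this example.

```python
import os


def inode_used_pct(mount_point="/"):
    """Return the percentage of inodes in use on a mount, or None if unknown."""
    stats = os.statvfs(mount_point)
    if stats.f_files == 0:
        # Some filesystems do not expose a fixed inode total.
        return None
    used = stats.f_files - stats.f_ffree
    return round(100.0 * used / stats.f_files, 2)
```

Calling `inode_used_pct("/data")` for each entry in an extra-partition list would give the per-mount equivalent.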
| Alert kind | Trigger |
|---|---|
| `iowait_high` | CPU iowait > `iowait_pct` (default 30%) |
| `swap_high` | Swap usage > `swap_pct` (default 60%) |
| `memory_critical` | Available memory < `memory_min_available_mb` (default 200 MB) |
| `net_errors` | NIC errors + drops > `net_error_count` (default 100) |
| `high_iops` | Write IOPS > `disk_write_iops` (default 5 000/s) |
| `zombie_process` | Any zombie process present |
| `inode_low` | Inode usage > `inode_used_pct` (default 90%) |
| `disk_multi_low` | Extra partition free ratio < `disk_free_ratio` |
| `tcp_timewait_high` | TIME_WAIT connections > `tcp_timewait_count` (default 5 000) |
| `service_down` | Any `service_probes` unit reports non-active |
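The triggers in the table reduce to simple comparisons against the threshold keys. The sketch below is one way to express them; `evaluate_alerts` is a hypothetical function, the metric names follow the layer lists above, 1 MB is treated as 1024 × 1024 bytes, and `disk_multi_low` is omitted because the table gives no default for `disk_free_ratio`.

```python
DEFAULT_THRESHOLDS = {
    "iowait_pct": 30.0,
    "swap_pct": 60.0,
    "memory_min_available_mb": 200,
    "net_error_count": 100,
    "disk_write_iops": 5000,
    "inode_used_pct": 90.0,
    "tcp_timewait_count": 5000,
}


def evaluate_alerts(metrics, thresholds=None):
    """Return the alert kinds that fire for one metrics snapshot (a dict)."""
    t = {**DEFAULT_THRESHOLDS, **(thresholds or {})}
    nic_faults = sum(metrics.get(key, 0) for key in
                     ("net_rx_errs", "net_tx_errs", "net_rx_drop", "net_tx_drop"))
    # A missing memory reading must not fire, so default to "infinite" memory.
    available_mb = metrics.get("memory_available_bytes", float("inf")) / (1024 * 1024)
    checks = [
        ("iowait_high", metrics.get("cpu_iowait_pct", 0) > t["iowait_pct"]),
        ("swap_high", metrics.get("swap_used_pct", 0) > t["swap_pct"]),
        ("memory_critical", available_mb < t["memory_min_available_mb"]),
        ("net_errors", nic_faults > t["net_error_count"]),
        ("high_iops", metrics.get("disk_write_iops", 0) > t["disk_write_iops"]),
        ("zombie_process", metrics.get("process_zombie", 0) > 0),
        ("inode_low", metrics.get("disk_inode_used_pct", 0) > t["inode_used_pct"]),
        ("tcp_timewait_high", metrics.get("tcp_time_wait", 0) > t["tcp_timewait_count"]),
        ("service_down", bool(metrics.get("service_failed_units"))),
    ]
    return [kind for kind, firing in checks if firing]
```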
- `migrate_schema()` is called automatically from `init_database()` and adds 11 new `metric_rollups` columns to any existing database using `PRAGMA table_info`. No manual migration is needed and no data is lost.
- Four new `system_metrics` keys (`collect_processes`, `collect_tcp_states`, `service_probes`, `extra_disk_partitions`) all default to safe values; existing `config.yaml` files work without changes.
- Seven new threshold keys with sensible defaults: `iowait_pct`, `swap_pct`, `memory_min_available_mb`, `net_error_count`, `disk_write_iops`, `inode_used_pct`, `tcp_timewait_count`.
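The additive migration pattern described here can be sketched with `sqlite3` alone: read the existing columns with `PRAGMA table_info`, then `ALTER TABLE ... ADD COLUMN` only for the ones that are missing. The column subset and SQL types below are assumptions for illustration, not the project's real list of 11 columns.

```python
import sqlite3

# Hypothetical subset of the new columns; names follow the metric lists above.
NEW_COLUMNS = {
    "cpu_iowait_pct": "REAL",
    "swap_used_pct": "REAL",
    "tcp_time_wait": "INTEGER",
}


def migrate_schema(conn, table="metric_rollups", new_columns=NEW_COLUMNS):
    """Add any missing columns to an existing table; return the names added."""
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk) rows.
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    added = []
    for name, sql_type in new_columns.items():
        if name not in existing:
            conn.execute(f"ALTER TABLE {table} ADD COLUMN {name} {sql_type}")
            added.append(name)
    conn.commit()
    return added
```

Because the function only ever adds columns, running it repeatedly is harmless and existing rows keep their data, with the new columns set to `NULL`.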
- Compliant signing scheme: the client now implements BT-Panel's full standard signing algorithm, an HMAC-like MD5 signature where `request_token = md5(str(request_time) + md5(api_key))`. The signature is regenerated on every attempt (including retries), so each request carries a fresh `request_time` and never replays a stale token
- POST-only transport with merged form data: per BT's official documentation, every call is issued as `POST` with the auth dict merged into the same `application/x-www-form-urlencoded` body as the business parameters
- Session pooling for high-frequency collection: each panel is backed by a per-client `requests.Session()`. Connection pooling (TCP/TLS reuse) and BT session-cookie persistence are honoured automatically, eliminating per-call handshake overhead in cron-driven workloads with many sites per panel
- Agentless multi-site fan-out: remote sites flow into the same `sites[]` matrix as local ones. Traffic rollups, spider classification, AI diagnosis, webhook routing, and PDF reports apply to remote hosts with zero additional wiring. Setting `panel_id` to empty preserves legacy local-tail behaviour byte-for-byte
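The signing formula above translates directly into a few lines of `hashlib`. The sketch below shows the token computation and the merge of auth fields into the form body; `bt_auth_fields` and `build_form` are hypothetical helper names, and the business parameter in the usage note is a placeholder rather than a documented BT-Panel field.

```python
import hashlib
import time


def bt_auth_fields(api_key, now=None):
    """Build request_time and request_token per the formula in the text."""
    request_time = int(time.time() if now is None else now)
    key_md5 = hashlib.md5(api_key.encode("utf-8")).hexdigest()
    request_token = hashlib.md5(
        (str(request_time) + key_md5).encode("utf-8")
    ).hexdigest()
    return {"request_time": request_time, "request_token": request_token}


def build_form(api_key, params, now=None):
    """Merge business parameters and fresh auth fields into one POST body."""
    form = dict(params)
    # Called once per attempt, so retries never reuse an old request_time.
    form.update(bt_auth_fields(api_key, now))
    return form
```

A caller would pass the result as the `data=` argument of a `requests.Session().post(...)` call, for example `build_form(api_key, {"path": "/www/wwwlogs/site.log"})`.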
- Byte-offset chunking: remote reads are issued as `tail -c +<offset> | head -c <chunk>` over BT's `ExecShell` endpoint, so the full file body never traverses the network
- 5 MB single-cycle pull ceiling: each cron tick fetches at most `chunk_bytes` (default 5 242 880 bytes). When a remote `error_log` grows by hundreds of megabytes during an upstream incident, the Agent's memory footprint stays constant. The bounded read avoids HTTP timeouts and OOM, and the residual bytes roll over to subsequent cron ticks
- Defense-in-depth bounding: the 5 MB ceiling is enforced both at the Python caller layer and inside the remote shell pipeline itself (`head -c 5242880`), so even an anomalous panel response cannot push past the bound
- Backlog visibility: a `backlog_bytes` field is stamped onto the persisted remote cursor (`status="backlog"`) and a WARNING is emitted whenever a site is falling behind real time, so under-collection is never silent
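The chunking and backlog arithmetic can be sketched as a pure planning function. Everything here is an assumption for illustration: `plan_remote_read` is a hypothetical name, the remote file size is assumed to be known already, the `"ok"` status value is invented, and the cursor is assumed to store the number of bytes already consumed (so the 1-based `tail -c +N` start is `offset + 1`).

```python
import shlex

MAX_CHUNK_BYTES = 5 * 1024 * 1024  # 5 242 880, the ceiling from the text


def plan_remote_read(path, offset, remote_size, chunk_bytes=MAX_CHUNK_BYTES):
    """Plan one bounded pull: shell command, next cursor offset, and backlog."""
    chunk = max(1, min(chunk_bytes, MAX_CHUNK_BYTES))  # caller-side bound
    pending = max(0, remote_size - offset)
    to_read = min(pending, chunk)
    backlog = pending - to_read
    # head -c repeats the bound inside the remote pipeline itself.
    command = f"tail -c +{offset + 1} {shlex.quote(path)} | head -c {chunk}"
    return {
        "command": command,
        "next_offset": offset + to_read,
        "backlog_bytes": backlog,
        "status": "backlog" if backlog else "ok",
    }
```

With 12 000 000 unread bytes, one tick would advance the cursor by 5 242 880 bytes and leave 6 757 120 bytes as backlog for later ticks.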
- Command-injection defense: every remote shell path supplied via configuration is hardened with `shlex.quote` plus a structural check that rejects embedded NUL / CR / LF before it is spliced into any `ExecShell` command. A malicious `access_log` value such as `"/path; rm -rf /"` is contained inside a single-quoted shell literal and is never interpreted as a separate token
- NTP time-drift auto-detection: BT signature failures most commonly surface as cryptic HTTP 200 + `{"status": false, "msg": "request_token error"}` payloads, not as 401/403. The client now recognises both English and Chinese variants of the auth-failure message and emits a precise, actionable hint ("Authentication failed. Please check if the time on the Agent server and the Remote BT panel are synchronized (NTP Time Drift).") both in the raised exception and in the WARNING log, so operators stop chasing the wrong root cause
- Per-site fault isolation: a failure in any one remote panel is caught, logged, and stamped into the cursor's `status` field; it cannot crash the cron tick or starve the other configured sites
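The path-hardening step is small enough to show in full. This is a sketch of the technique as described, with a hypothetical helper name (`quote_remote_path`), not a copy of the project's code.

```python
import shlex


def quote_remote_path(path):
    """Reject control characters, then shell-quote a configured log path."""
    if not path or any(ch in path for ch in ("\x00", "\r", "\n")):
        raise ValueError("remote path is empty or contains NUL/CR/LF")
    # shlex.quote wraps anything unsafe in single quotes, so ';' stays literal.
    return shlex.quote(path)
```

For the malicious value from the text, `quote_remote_path("/path; rm -rf /")` returns `'/path; rm -rf /'` (including the single quotes), which the remote shell treats as one filename.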
- Auth log parsing: incrementally parses `logs.auth_log`, or auto-detects `/var/log/auth.log` and `/var/log/secure`, for `Failed password` fingerprints
- Linked auto-ban: repeated SSH failures raise `ssh_brute_force` alerts and can flow into the existing whitelist-aware auto-ban pipeline
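A minimal sketch of `Failed password` fingerprinting is shown below. The regular expression targets the common OpenSSH log wording; the function names and the `min_failures=5` threshold are assumptions for illustration, not the project's configured values.

```python
import re
from collections import Counter

# Matches both "Failed password for root from ..." and
# "Failed password for invalid user admin from ...".
FAILED_PASSWORD = re.compile(
    r"Failed password for (?:invalid user )?(?P<user>\S+) "
    r"from (?P<ip>\S+) port \d+"
)


def count_ssh_failures(lines):
    """Count failed SSH password attempts per source IP."""
    failures = Counter()
    for line in lines:
        match = FAILED_PASSWORD.search(line)
        if match:
            failures[match.group("ip")] += 1
    return failures


def brute_force_candidates(lines, min_failures=5):
    """Return source IPs with at least min_failures failed attempts."""
    counts = count_ssh_failures(lines)
    return sorted(ip for ip, hits in counts.items() if hits >= min_failures)
```

Candidates returned here would still need to pass the allowlist check before any ban is applied.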
- Certificate inspection: report generation now checks each configured site certificate with Python `ssl` and `socket`
- Visible everywhere: remaining days appear in PDF overview blocks and webhook markdown summaries, with warning markers below 15 days
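A certificate check with only `ssl` and `socket` could look like the sketch below. The function names are hypothetical, and the split into a network fetch plus a pure date calculation is a design choice of this example so the arithmetic can be checked offline.

```python
import socket
import ssl
import time


def fetch_not_after(hostname, port=443, timeout=5.0):
    """Connect with TLS and return the certificate's notAfter string."""
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            return tls.getpeercert()["notAfter"]


def days_until(not_after, now=None):
    """Whole days remaining before expiry (negative once expired)."""
    expires_at = ssl.cert_time_to_seconds(not_after)
    now = time.time() if now is None else now
    return int((expires_at - now) // 86400)


def needs_warning(days_remaining, warn_below=15):
    """Apply the 'below 15 days' warning marker rule from the text."""
    return days_remaining < warn_below
```

Typical use would be `needs_warning(days_until(fetch_not_after("site-a.example.com")))`, run once per configured site during report generation.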
- URL / Referer truncation: query strings are removed before table rendering, then long text is hard-truncated
- Stable table layouts: oversized tokens no longer break dense PDF pages
- New channel: the webhook center now supports Telegram bot delivery
- Environment fallback: `TELEGRAM_BOT_TOKEN` and `TELEGRAM_CHAT_ID` are used when config values are empty
- Auto-provisioning: the report generator automatically downloads the required GeoLite2 `.mmdb` database from a public mirror if it is missing
- MaxMind-first workflow: if `./data/GeoIP.conf` exists and `geoipupdate` is installed, Server-Mate refreshes GeoLite2 from your own MaxMind account before falling back to the public mirror
- Pre-send AI review: warning and critical alerts can call the shared AI endpoint before webhook delivery
- Two-sentence output: alert cards append a compact `AI Diagnosis` block with plain-language cause and next action
- `--generate-service`: the agent can print a host-local systemd unit template for daemon hosting with `Restart=always`
- Matrix configuration: monitor multiple domains on the same host with a `sites[]` array
- System metrics: dedicated `system_metrics` settings for host-global resources
- Scope separation: host-global metrics are separated from site-local traffic rollups via `__host__`
- Logrotate support: handles inode changes, file truncation, and temporary file absence
- Incremental reading: robust state tracking across log rotations and restarts
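The three rotation cases (new inode, truncation, file temporarily missing) can be handled with a cursor that stores the inode and byte offset. The sketch below is an illustration under that assumption; `read_new_bytes` and the cursor shape are hypothetical, not the agent's actual state format.

```python
import os


def read_new_bytes(path, cursor):
    """Read bytes appended since the last call.

    cursor: {"inode": int, "offset": int} from the previous run, or {}.
    Returns (data, new_cursor).
    """
    try:
        stat = os.stat(path)
    except FileNotFoundError:
        # File is briefly absent mid-rotation: keep the cursor, retry later.
        return b"", cursor

    offset = cursor.get("offset", 0)
    if cursor.get("inode") != stat.st_ino or stat.st_size < offset:
        # New inode means the file was rotated; a smaller size means it was
        # truncated. Either way, start again from the beginning.
        offset = 0

    with open(path, "rb") as handle:
        handle.seek(offset)
        data = handle.read()
    return data, {"inode": stat.st_ino, "offset": offset + len(data)}
```

One known limit of this approach: a file that is truncated and then regrows past the stored offset between two runs cannot be distinguished from a normal append.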
- Dry-run mode: test automation policies before enabling real actions
- Whitelist-aware auto-ban: protects trusted IPs and known spiders
- TTL-based unban: automatic unban after configurable TTL
- Cooldown protection: prevents action storms with per-rule cooldowns
- Mandatory notifications: all automation actions are logged and notified
- `automation_actions` table: complete audit trail of automation events
- `banned_ips` table: tracks active bans with TTL and metadata
- `config.example.yaml`: recommended starting point for v1.3.2 with multi-site, Telegram, SSH auth monitoring, SSL checks, AI diagnosis, and Guarded Automation pre-configured
# Clone repository
git clone https://github.com/tankeito/server-mate.git
cd server-mate
# Install dependencies
python3 -m pip install psutil pyyaml matplotlib requests
# Optional: GeoIP support
python3 -m pip install geoip2 maxminddb aiohttp
# Optional: official MaxMind updater
# CentOS / Rocky / AlmaLinux: sudo yum install geoipupdate
# Ubuntu / Debian: sudo apt-get install geoipupdate

Start by copying `config.example.yaml` to `config.yaml`.
In OpenClaw, keep config.yaml, metrics.db, logs/, and reports/ inside the current workspace, meaning under ./, instead of writing into global system directories.
If AI features are enabled, OpenClaw injects OPENAI_API_KEY automatically, so you do not need to run export OPENAI_API_KEY=... manually.
agent:
  host_id: web-01
  timezone: Asia/Shanghai
  mode: once

system_metrics:
  enabled: true

logs:
  auth_log: ""

sites:
  - domain: site-a.example.com
    site_host: site-a.example.com
    enabled: true
    access_log: ./logs/site-a.access.log
    error_log: ./logs/site-a.error.log
  - domain: site-b.example.com
    site_host: site-b.example.com
    enabled: true
    access_log: ./logs/site-b.access.log
    error_log: ./logs/site-b.error.log

storage:
  database_file: ./metrics.db
  rollup_minutes: [10, 60]

notifications:
  webhooks:
    dingtalk:
      enabled: true
      url: https://oapi.dingtalk.com/robot/send?access_token=YOUR_TOKEN
    telegram:
      enabled: false
      bot_token: ""
      chat_id: ""

reports:
  report_language: zh
  report_export_dir: ""
  public_base_url: ""
  geoip_city_db: ./data/GeoLite2-City.mmdb
  geoip_update_config: ./data/GeoIP.conf
  ai_analysis:
    enabled: true
    simulate: false
    api_key_env: OPENAI_API_KEY
  daily:
    enabled: true
    push_time: "08:30"
    channels: [dingtalk]
  output_dir: ./reports

automation:
  dry_run: true
  auto_ban:
    enabled: false
  auto_heal:
    enabled: false

- Put your MaxMind config at `./data/GeoIP.conf`
- Create `./data/GeoIP.conf` manually and keep the real file out of Git
- Install `geoip2` and its supporting packages such as `maxminddb` and `aiohttp` if you want real region lookups in reports
- Free GeoLite2 account: MaxMind GeoLite sign up
- License key guide: Generate a License Key
- `geoip_update_config` is optional, but `./data/GeoIP.conf` is the recommended local path
- If you do not want to use MaxMind directly, Server-Mate still falls back to the public `.mmdb` mirror
- If a MaxMind key was ever exposed in plain text, rotate it before production use
# One-shot collection
python3 scripts/server_agent.py --config config.yaml --once
# View collected rollups
python3 scripts/report_generator.py --config config.yaml daily --date 2026-03-26 --json

crontab -e

Add these lines:
# Data collection every 10 minutes
*/10 * * * * /usr/bin/env bash -lc 'python3 ./scripts/server_agent.py --config ./config.yaml --once >> ./logs/server-mate-agent.log 2>&1'
# Daily PDF report at 01:00
0 1 * * * /usr/bin/env bash -lc 'python3 ./scripts/report_generator.py --config ./config.yaml pdf --range daily --send >> ./logs/server-mate-report.log 2>&1'
# Weekly PDF report every Monday at 01:10
10 1 * * 1 /usr/bin/env bash -lc 'python3 ./scripts/report_generator.py --config ./config.yaml pdf --range weekly --send >> ./logs/server-mate-report.log 2>&1'
# Monthly PDF report on the 1st at 01:20
20 1 1 * * /usr/bin/env bash -lc 'python3 ./scripts/report_generator.py --config ./config.yaml pdf --range monthly --send >> ./logs/server-mate-report.log 2>&1'

+--------------------------------------------------------------+
| Server Agent (Linux Host) |
| - psutil metrics (CPU / memory / disk / network) |
| - Incremental log reading (Nginx / Apache access + error) |
| - JSON event emission |
| - SQLite rollup writing |
+--------------------------------------------------------------+
|
| SQLite / JSON events
v
+--------------------------------------------------------------+
| AI Analyzer (OpenClaw) |
| - Aggregation and storage |
| - Natural-language query handling |
| - AI error diagnosis |
| - Webhook delivery (DingTalk / WeCom / Feishu / Telegram) |
| - Guarded auto-ban / auto-heal |
| - PDF report generation (daily / weekly / monthly) |
+--------------------------------------------------------------+
- Agent collection: generates `system_snapshot`, `access_event`, and `error_event`
- SQLite rollups: writes 10-minute and 60-minute buckets
- Report generator: reads rollups and generates PDF or Markdown output
- Webhook center: sends alerts and report summaries
- AI analysis: optionally calls an LLM for explanations and recommendations
| Event Type | Purpose | Key Fields |
|---|---|---|
| `system_snapshot` | Host health metrics | `cpu_pct`, `memory_pct`, `disk_free_bytes`, `load_1m` |
| `access_event` | Parsed access log | `client_ip`, `uri`, `status`, `response_ms`, `user_agent` |
| `error_event` | Parsed error log | `severity`, `component`, `category`, `fingerprint`, `message` |
| `action_event` | Audit trail | `action`, `target`, `reason`, `dry_run`, `result`, `ttl_seconds` |
| Metric | Definition |
|---|---|
| PV | Total request count in the selected window |
| UV | Unique visitor key, typically IP plus user-agent fallback |
| IP Count | Unique client IPs |
| QPS | request_count / window_seconds |
| Slow Request | response_ms > threshold, default 2000ms |
| Bandwidth Out | Sum of response bytes |
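The definitions in the table can be computed from a list of `access_event` dicts in a few lines. In this sketch, `traffic_summary` is a hypothetical name, the UV key is assumed to be the (`client_ip`, `user_agent`) pair, and `bytes_sent` is an assumed field name for response size, since the event table above does not list one.

```python
def traffic_summary(events, window_seconds, slow_ms=2000):
    """Aggregate PV, UV, IP count, QPS, slow requests, and bandwidth."""
    ips = {event["client_ip"] for event in events}
    # Assumed UV key: client IP combined with the user agent string.
    visitors = {(event["client_ip"], event.get("user_agent", "")) for event in events}
    return {
        "pv": len(events),
        "uv": len(visitors),
        "ip_count": len(ips),
        "qps": len(events) / window_seconds if window_seconds > 0 else 0.0,
        "slow_requests": sum(1 for e in events if e.get("response_ms", 0) > slow_ms),
        "bandwidth_out_bytes": sum(e.get("bytes_sent", 0) for e in events),
    }
```

For a 10-minute rollup, `window_seconds` would be 600, so 3 requests in the bucket correspond to a QPS of 0.005.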
| Field | Type | Default | Description |
|---|---|---|---|
| `host_id` | string | - | Logical host name for alerts and reports |
| `timezone` | string | `UTC` | Local timezone for bucket scheduling |
| `mode` | string | `once` | `once` or `daemon` |
| `poll_interval_seconds` | int | `60` | Agent loop interval in daemon mode |
| `state_file` | string | `./server_agent_state.json` | Cursor state file for incremental reading |
| Field | Type | Default | Description |
|---|---|---|---|
| `enabled` | boolean | `true` | Whether to collect host-global metrics |
| `disk_root` | string | `/` | Mount point used for disk checks |
| `collect_network_io` | boolean | `true` | Whether to collect network I/O |
| Field | Type | Description |
|---|---|---|
| `auth_log` | string | SSH auth log path; auto-detected if empty |
| Field | Type | Description |
|---|---|---|
| `domain` | string | Site domain used for report naming and SSL checks |
| `site_host` | string | Display name for the site |
| `enabled` | boolean | Whether this site is enabled |
| `access_log` | string | Access log path. Local path when `panel_id` is empty; absolute remote path on the target host when `panel_id` is set |
| `error_log` | string | Error log path. Same semantics as `access_log` |
| `panel_id` | string | Optional. Binds this site to a remote BT panel defined in `remote_panels`. Leave empty (or omit) to read local files via the original `LocalLogReader`. When set, logs are pulled through the bound panel via `BTRemoteLogReader` |
A top-level mapping from panel_id to a BT-Panel connection profile. Sites reference these profiles via their panel_id field to enable agentless remote log collection.
| Field | Type | Default | Description |
|---|---|---|---|
| `<panel_id>` | string (key) | - | Logical identifier referenced by `sites[].panel_id` |
| `url` | string | - | BT-Panel base URL including port, e.g. `https://panel-hk.example.com:8888` |
| `api_key` | string | `""` | BT-Panel interface key. Plaintext is supported as a fallback only; prefer `api_key_env` |
| `api_key_env` | string | `""` | Environment variable to read the `api_key` from at runtime. Takes precedence over `api_key` |
| `timeout_seconds` | int | `15` | Per-request HTTP timeout |
| `retries` | int | `2` | Bounded retry count for transient transport errors. Auth failures are never retried |
| `chunk_bytes` | int | `5242880` | Hard upper bound (in bytes) on a single `ExecShell` pull and a single cron-cycle fetch. Default is 5 MB |
| `verify_tls` | boolean | `true` | TLS certificate verification toggle. Disable only for self-signed panels |
Example:
remote_panels:
  bt-prod-hk:
    url: https://panel-hk.example.com:8888
    api_key_env: BT_PANEL_HK_API_KEY   # preferred
    timeout_seconds: 15
    retries: 2

sites:
  - domain: site-local.example.com
    enabled: true
    access_log: ./logs/site-local.access.log   # local site, no panel_id
    error_log: ./logs/site-local.error.log
  - domain: site-remote.example.com
    enabled: true
    panel_id: bt-prod-hk                       # remote site, bound to panel
    access_log: /www/wwwlogs/site-remote.example.com.log
    error_log: /www/wwwlogs/site-remote.example.com.error.log
⚠️ Security Warning: A BT-Panel `api_key` carries the same authority as root on every host the panel manages. Never commit a `config.yaml` containing a plaintext `api_key` to version control. The supported workflow is:
- Add `config.yaml` to `.gitignore` (already the project default).
- Inject the key via `api_key_env` and export it from a non-tracked location such as `/etc/server-mate/env`, a systemd `EnvironmentFile=`, or your secrets manager.
- If a key is ever committed accidentally, even to a private fork, rotate it from the BT panel immediately; do not rely on `git rm`.
| Field | Type | Default | Description |
|---|---|---|---|
| `database_file` | string | `./metrics.db` | SQLite database path |
| `rollup_minutes` | array | `[10, 60]` | Rollup bucket granularities |
| Channel | Fields |
|---|---|
| `dingtalk` | `enabled`, `url`, `timeout_seconds`, `at_all` |
| `wecom` | `enabled`, `url`, `timeout_seconds` |
| `feishu` | `enabled`, `url`, `timeout_seconds` |
| `telegram` | `enabled`, `bot_token`, `chat_id`, `timeout_seconds` |
| Field | Type | Description |
|---|---|---|
| `report_language` | string | `zh` or `en` |
| `report_export_dir` | string | Export directory exposed externally for PDFs |
| `public_base_url` | string | URL prefix for report download links |
| `geoip_city_db` | string | GeoLite2 City database path |
| `geoip_update_config` | string | MaxMind updater config path |
| `daily.enabled` | boolean | Enable daily reports |
| `daily.push_time` | string | `"08:30"` format |
| `weekly.push_weekday` | int | 1-7, where 1 means Monday |
| `monthly.push_day` | int | 1-28 |
| Field | Type | Description |
|---|---|---|
| `dry_run` | boolean | When true, actions are logged and notified but not executed |
| `auto_ban.enabled` | boolean | Enables automatic banning |
| `auto_ban.whitelist_ips` | array | IP allowlist |
| `auto_ban.ban_ttl_seconds` | int | Ban TTL before automatic release |
| `auto_heal.enabled` | boolean | Enables automatic healing |
| `auto_heal.cooldown_seconds` | int | Cooldown between service restarts |
Generated: every day at configured push_time
Contents:
- PV, UV, and IP totals for the prior 24 hours
- Top pages with PV/UV columns, top IPs with region, and top referers
- Spider traffic breakdown
- Status-code distribution (`2xx`, `3xx`, `4xx`, `5xx`)
- Top errors and slow endpoints
- AI health commentary, if enabled
Generated: every Monday at the configured time
Contents:
- 7-day traffic trend
- Blocked IP trends
- Crawler traffic patterns
- Suspicious route clusters
- Recurring error fingerprints
- AI weekly summary
Generated: on the 1st of each month
Contents:
- 30-day traffic and performance trend
- Disk growth analysis
- Bandwidth peak detection
- Capacity warnings
- Remediation summary
- AI monthly review
| Alert Type | Default Threshold | Window |
|---|---|---|
| CPU High | `> 85%` | 5 consecutive minutes |
| Memory High | `> 85%` | 5 consecutive minutes |
| Disk Low | `< 10% free` | Instant |
| 5xx Burst | `> 20 errors` | 1 minute |
| Suspicious IP | `> 200 RPM` | 1 minute |
| 404 Scan Burst | Sudden spike | Short window |
| Slow Routes | `> 2000ms average` | Alert window |
Requirements:
- Allowlist support for trusted IPs
- Clear evidence of abuse, not just flash crowds
- Cooldown and per-hour action caps
- TTL, for example 24 hours
- Audit records with exact commands
Good Candidates:
- Repeated request-rate breaches from one IP
- Scanner-like user-agents with 404 spray patterns
- Brute-force hits against admin routes
Requirements:
- Repeated `502` or upstream-failure evidence
- Failing health checks or a second confirming signal
- One restart attempt per cooldown window
- Post-action verification
- Escalation path when restart fails
Preferred Sequence:
- Alert
- Dry-run recommendation
- One guarded restart of a proven failing service
- Re-check error rate and service health
- Escalate instead of looping forever
server-mate/
├── SKILL.md # Skill definition and triggers
├── README.md # English documentation
├── README_ZH.md # Chinese documentation
├── user-guide.md # Detailed deployment guide
├── config.example.yaml # Full example config template
├── agents/
│ └── openai.yaml # OpenAI agent interface config
├── references/
│ ├── architecture.md # System design and component boundaries
│ ├── data-contracts.md # Event schemas and metric definitions
│ ├── ops-playbook.md # Thresholds, alerts, and automation policies
│ └── sqlite-schema.md # Database schema and query patterns
├── scripts/
│ ├── server_agent.py # Main collector daemon
│ ├── report_generator.py # PDF and Markdown report generator
│ └── webhook_center.py # Webhook delivery service
└── config.yaml # Runtime configuration file
Solution:
# CentOS / Rocky / AlmaLinux
sudo yum install google-noto-sans-cjk-ttc-fonts
# Ubuntu / Debian
sudo apt-get update
sudo apt-get install fonts-noto-cjk
# Refresh font cache
fc-cache -fv

Solution:
- Set `report_export_dir` in config
- Set `public_base_url` in config
- Expose the export directory through Nginx or Apache
Solution:
- Verify the `database_file` path
- Confirm the agent is writing rollups
- Confirm the configured site identifiers match the stored data scopes
Solution:
- Make sure the current agent version has created the `slow_request_rollups` and `suspicious_ip_rollups` tables
- GitHub Issues: https://github.com/tankeito/server-mate/issues
- Repository: https://github.com/tankeito/server-mate
- Email: tqd354@gmail.com
Server-Mate | Lightweight Server Monitoring and AI Ops
Developed by tankeito | MIT License | 2026