Stuck-key incident capture: add stdout tail + system load, document GUI-only gap (MAL-57)#907
Conversation
…UI-only gap (MAL-57) The 2026-06-10 MAL-57 incidents exposed three gaps in the stuck-key incident capture: 1. Captures snapshotted kanata stderr, but the pqrs driver-connection evidence (connected / driver connected: false / output backend unavailable / dropping KEY_* Release) lives in stdout. Add a stdout tail (last 200 lines) filtering the high-volume "virtual_hid_keyboard_ready true" heartbeat; "false" lines are kept since they indicate driver disconnects. 2. No system-load context despite CPU starvation being the leading hypothesis. Add loadavg (getloadavg) plus the top 5 CPU-consuming processes via ps; process listing is skipped under tests. 3. The detector only runs while KeyPath.app is open — both 2026-06-10 incidents were missed because the GUI was closed. Documented on the service and in docs/bugs/MAL-57-duplicate-keypresses.md. Log tails now read only the trailing 1MB of the file instead of loading it whole (kanata stdout grows to 175MB+). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…og tail - Use KeyPathConstants.Logs.kanataStdout/kanataStderr instead of re-hardcoding the daemon log paths. - Run ps via SubprocessRunner with a 5s timeout instead of raw Process — bounds how long a slow ps (plausible under the CPU starvation being diagnosed) can stall the snapshot that recovery awaits before restarting kanata. - logTail: decode lossily and drop the partial first line of a truncated read; previously a 1MB seek landing mid-multibyte-character made String(data:encoding:) fail and the whole tail degraded to "(file not readable)". Regression test added. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f0552a1161
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
…ations Master independently shipped an overlapping capture (#897: tailOfFile, loadavg/core count, raw ps) plus the test-sandbox redirect for the incident directory (#901). Resolved to a single implementation keeping the best of both: - From master: AppPaths-based diagnosticsDirectory (test sandbox), active_cpu_count, loadavg_1m_5m_15m label, maxBytes test parameter. - From this branch: vhid-ready spam filter (with "false" lines kept), 1MB tail window, UTF-8-safe lossy decode + partial-first-line drop, SubprocessRunner-with-timeout for ps, pid column. - New (Codex review): explicit truncation marker line when the tail window clips older output. Doc: folded the capture-gap notes into the existing #897 telemetry paragraph instead of keeping a duplicate same-named section. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Code Review — PR #907Solid diagnostic improvement overall. The changes directly address the MAL-57 post-mortem gaps (stdout vs stderr, memory-safe tail read, Correctness
The truncation marker is inserted after
Trailing empty line when log files end with
Style
await lines.append(contentsOf: topCPUProcessLines(limit: 5))This compiles (the lines.append(contentsOf: await topCPUProcessLines(limit: 5))Minor, but the second form is what you'd expect to see in Swift concurrency code. What's good
Non-blockingThe two correctness items above are both defensive hardening on diagnostic capture code, not the hot path, so they don't block this. The trailing-empty-line issue in particular is pre-existing. Tagging as suggestions for the next pass if you agree with the direction. |
Hardening from claude-review on #907: a \n-terminated log (the normal case) produced an empty-string artifact as the tail's last element; all empty lines are now dropped. The truncation marker is documented as being on top of maxLines. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
|
Addressed the claude-review feedback in ba0bc38: |
Code Review — PR #907OverviewSolid, well-scoped incident-capture improvement. The 2026-06-10 MAL-57 incidents being useless because (a) the wrong log file was captured and (b) the GUI was closed is exactly the kind of post-incident hardening that pays off. The implementation is clean and the test coverage is thorough. What's working well
Minor observations1. In await lines.append(contentsOf: systemLoadLines())
await lines.append(contentsOf: topCPUProcessLines(limit: 5))This is valid Swift — let loadLines = await systemLoadLines()
lines.append(contentsOf: loadLines)Low priority, but worth a nit if you're touching this file again. 2. Empty result when all lines are filtered (non-truncated read) If the tail window fits under 1MB (so 3. The old 4. Truncation marker count guarantee The docstring says the result "can be SummaryNo blocking issues. The code is correct, the edge cases are handled, and the tests cover the important regressions. The Verdict: Approve — ready to merge. |
Code Review - PR 907OverviewEnriches StuckKeyRecoveryService incident captures with: a filtered stdout tail (1MB cap, heartbeat spam stripped), a bounded system-load snapshot (loadavg + ps via SubprocessRunner with timeout), and a documented GUI-only limitation. The PR correctly identifies why the 2026-06-10 captures were useless and directly addresses it. Implementation is clean and well-tested. Strengths
IssuesMinor: unusual async expression form In Valid Swift (the compiler propagates Not a bug -- readability nit only. Minor: truncation marker reports "1024KB" for the default 1MB window With Test nit: The fixture is ~20KB (under the 1MB cap), so no marker is prepended. But the test asserts Non-Issues
VerdictApprove with minor nits. Core logic is correct, memory-safety improvement is important, and test coverage is solid. The two cosmetic issues ("1024KB" vs "1MB", async expression style) can be addressed here or in a follow-on. |
What
Enriches the stuck-key incident captures written by
StuckKeyRecoveryServiceto~/Library/Logs/KeyPath/stuck-key-incidents/, closing the gaps that made the 2026-06-10 MAL-57 captures useless or absent:connected,driver connected: false,output backend unavailable during write,dropping KEY_* Release) lives in stdout, not the stderr tail the capture previously snapshotted. Incident files now include the last 200 stdout lines with the high-volumevirtual_hid_keyboard_ready trueheartbeat filtered out (1M+ occurrences in the live 175MB log).... falselines are deliberately kept — they indicate driver disconnects.loadavg(viagetloadavg) and the top 5 CPU consumers (viaSubprocessRunnerrunningpswith a 5s timeout).docs/bugs/MAL-57-duplicate-keypresses.md. Closing this gap (daemon-side detection) is left as future work.Implementation notes
logTailhelper reads only the trailing 1MB of a log file viaFileHandleseek — the previous stderr capture loaded the entire file into memory, which would have been 175MB+ for stdout. Truncated reads decode lossily and drop the partial first line, so a seek landing mid-multibyte-character can't degrade the whole tail to "(file not readable)".psruns throughSubprocessRunner.shared(5s timeout) rather than rawProcess(), bounding how long a slowps— plausible under exactly the CPU starvation being diagnosed — can stall the snapshot that recovery awaits before restarting kanata. Process listing is skipped entirely under tests per the KeyPathTestCase rules.KeyPathConstants.Logsinstead of re-hardcoded strings.Reviewer notes
ps)./code-review(medium) against the branch diff; the confirmed findings (shared path constants,SubprocessRunnerreuse + timeout, UTF-8 boundary bug in the tail read) are addressed in the second commit.Testing
false-variant survival), missing file, truncated-read partial-line/UTF-8 regression, and test-environmentpsskip. FullStuckKeyRecoveryServiceTestssuite passes (10 tests, <0.3s).