[copilot] Add CI postmortem skill and weekly agentic workflow #25303
rolfbjarne wants to merge 19 commits into main
New Copilot CLI skill that analyzes CI builds across recent PRs to identify failures unrelated to any specific PR:
- Flaky tests (pass on rerun with same commit)
- Shared regressions (same failure across multiple unrelated PRs)
- Infrastructure issues (provisioning, timeouts, etc.)
The skill operates in 4 phases: discovery, extraction, classification, and issue filing (with user confirmation before any GitHub issue changes).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
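A minimal sketch of how the first two categories above could be told apart. The `Run` record, the `classify` function, and the `min_prs` threshold are hypothetical and not taken from SKILL.md; infrastructure issues are not covered here because they depend on log content rather than pass/fail outcomes.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One test outcome in one CI run (hypothetical record shape)."""
    pr: int
    commit: str
    test: str
    passed: bool


def classify(runs, min_prs=2):
    """Map test name -> 'flaky' or 'shared-regression'.

    Assumption: a test is flaky if the same commit produced both a pass and
    a failure, and a shared regression if it failed in at least `min_prs`
    different PRs. Flaky takes precedence when both apply.
    """
    outcomes = defaultdict(set)      # (test, commit) -> set of pass/fail outcomes
    failing_prs = defaultdict(set)   # test -> PRs in which it failed
    for run in runs:
        outcomes[(run.test, run.commit)].add(run.passed)
        if not run.passed:
            failing_prs[run.test].add(run.pr)

    result = {}
    for test, prs in failing_prs.items():
        if len(prs) >= min_prs:
            result[test] = "shared-regression"
    for (test, _commit), seen in outcomes.items():
        if seen == {True, False}:
            result[test] = "flaky"
    return result
```

A test that fails in only one PR, and never passes on a rerun of the same commit, is left out of the result, since it is most likely caused by that PR.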
Update SKILL.md to reflect lessons learned from running the skill:
- Add steps for downloading and parsing HtmlReport artifacts
- Add NUnit XML parsing for individual test failures
- Add handling for crashes, build failures, and dotnettests
- Fix --query-order flag (not supported by az pipelines build list)
- Add HTML entity normalization for test name deduplication
- Note performance concerns with large artifact downloads
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
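To illustrate the HTML entity normalization item, here is a small sketch using Python's standard `html.unescape`. The helper names `normalize_test_name` and `dedupe` are hypothetical, and the whitespace collapsing is an extra assumption beyond what the commit message states.

```python
import html


def normalize_test_name(name: str) -> str:
    """Decode HTML entities and collapse whitespace (the latter is an assumption)."""
    return " ".join(html.unescape(name).split())


def dedupe(names):
    """Return the sorted set of distinct test names after normalization."""
    return sorted({normalize_test_name(n) for n in names})
```

With this, `Foo&lt;int&gt;.Bar` from an HTML report and `Foo<int>.Bar` from NUnit XML count as one test rather than two.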
- Exclude AppSizeTest from filing (expected to fail across PRs)
- Add rule: always file one issue per test, never group unrelated tests
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The HtmlReport download step takes 96% of the total analysis time. Make it explicit that HtmlReports should only be downloaded for jobs where TestSummary confirms test failures, not for all failed jobs. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… skill
- Extract workerName from timeline to correlate failures with bots
- Identify bot-specific failures (disproportionate failure rates)
- Detect cross-bot infrastructure patterns (timeouts, REST API, paths)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
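One way "disproportionate failure rates" could be computed, as a sketch only. The function name `suspicious_bots`, the `factor` of 2.0, and the `min_jobs` floor are illustrative assumptions; the skill itself may use a different rule.

```python
from collections import Counter


def suspicious_bots(jobs, factor=2.0, min_jobs=5):
    """Return worker names whose failure rate is at least `factor` times the overall rate.

    `jobs` is an iterable of (worker_name, failed) pairs, where worker_name
    would come from the timeline's workerName field. Workers with fewer than
    `min_jobs` jobs are ignored to avoid flagging on tiny samples.
    """
    total = Counter()
    failed = Counter()
    for worker, did_fail in jobs:
        total[worker] += 1
        if did_fail:
            failed[worker] += 1

    all_jobs = sum(total.values())
    if all_jobs == 0:
        return []
    overall_rate = sum(failed.values()) / all_jobs
    if overall_rate == 0:
        return []

    flagged = []
    for worker, count in total.items():
        rate = failed[worker] / count
        if count >= min_jobs and rate >= factor * overall_rate:
            flagged.append(worker)
    return sorted(flagged)
```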
Runs every Sunday (fuzzy schedule) to analyze the past week's CI failures and file issues for flaky tests, infrastructure problems, and bot-specific issues. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Don't reopen fix-closed issues less than 2 weeks old
- Require failing builds from main branch
- Allow reopen for lack-of-info or debug-instrumentation closures
- Always allow commenting on closed issues with explanation
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
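The reopen rules above could be expressed as a small decision function. This is one reading of the rules, not the skill's actual logic: it assumes the main-branch requirement applies to fix-closed issues, and that lack-of-info and debug-instrumentation closures can be reopened regardless. `CloseReason` and `decide_action` are hypothetical names.

```python
from datetime import datetime, timedelta
from enum import Enum


class CloseReason(Enum):
    FIXED = "fixed"
    LACK_OF_INFO = "lack-of-info"
    DEBUG_INSTRUMENTATION = "debug-instrumentation"


def decide_action(reason, closed_at, now, fails_on_main):
    """Return 'reopen' or 'comment' for a closed issue that matches a new failure.

    Commenting (with an explanation) is always allowed, so it is the fallback.
    """
    if reason in (CloseReason.LACK_OF_INFO, CloseReason.DEBUG_INSTRUMENTATION):
        return "reopen"
    # Fix-closed: give the fix two weeks to propagate before reopening.
    if now - closed_at < timedelta(weeks=2):
        return "comment"
    # Assumption: only failures seen on main count as evidence the fix did not hold.
    return "reopen" if fails_on_main else "comment"
```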
Include actual compiler/linker/assertion errors from NUnit XML. Flag when different PRs show different errors for the same test (likely different root causes). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Use AzDO URL format with j= and t= parameters from timeline record IDs to link directly to the failing log, not just the build. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
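For illustration, a direct log link could be built as follows. Only the `j=` and `t=` parameters come from the commit message; the rest of the URL shape (`dev.azure.com` host, `_build/results`, `buildId`, `view=logs`) is an assumption based on how such links commonly look, and `log_url` is a hypothetical helper.

```python
from urllib.parse import urlencode


def log_url(org, project, build_id, job_id, task_id):
    """Build a link to a specific task log within a build.

    job_id and task_id are the timeline record IDs of the job and the task.
    """
    query = urlencode({
        "buildId": build_id,
        "view": "logs",
        "j": job_id,
        "t": task_id,
    })
    return f"https://dev.azure.com/{org}/{project}/_build/results?{query}"
```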
'Path does not exist' on artifact publish is a downstream symptom of earlier failures (Install dotnet workloads, azdev-secrets, etc). Always trace back to the first failed task in the timeline. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Emphasize that only the first failed step (without continueOnError) is the root cause. All subsequent failures are cascading and must not be reported as separate issues. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
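The root-cause rule can be sketched as picking the earliest failed step that was not allowed to fail. The record shape below (dicts with `name`, `order`, `result`, `continueOnError`) is a simplified, hypothetical view of timeline data, not the exact AzDO schema.

```python
def root_cause(records):
    """Return the first failed step without continueOnError, or None.

    Every later failure in the same job is treated as cascading and is
    not reported separately.
    """
    blocking_failures = [
        r for r in records
        if r["result"] == "failed" and not r.get("continueOnError", False)
    ]
    return min(blocking_failures, key=lambda r: r["order"], default=None)
```

Under this rule, a 'Path does not exist' failure on artifact publish is never returned when an earlier step such as workload installation has already failed.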
✅ [PR Build #3ae047d] Build passed (Build macOS tests) ✅
Pipeline on Agent

🔥 [CI Build #3ae047d] Test results 🔥
❌ Tests failed on VSTS: test results
0 tests crashed, 2 tests failed, 173 tests passed.
Failures
- ❌ monotouch tests (iOS): 1 tests failed, 15 tests passed. Html Report (VSDrops) Download
- ❌ monotouch tests (tvOS): 1 tests failed, 15 tests passed. Html Report (VSDrops) Download
Successes
- ✅ cecil: All 1 tests passed. Html Report (VSDrops) Download
macOS tests
- ✅ Tests on macOS Monterey (12): All 5 tests passed. Html Report (VSDrops) Download
Linux Build Verification
Pipeline on Agent
For failures in the Windows integration stage, always identify the macOS bot from the 'Reserve macOS bot for tests' job, even when the failure is on a Windows bot (e.g. ssh connection failures). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
✅ [PR Build #0ffa72f] Build passed (Detect API changes) ✅
Pipeline on Agent
✅ [PR Build #0ffa72f] Build passed (Build packages) ✅
Pipeline on Agent

✅ API diff for current PR / commit
NET (empty diffs)
✅ API diff vs stable
NET (empty diffs)
ℹ️ Generator diff
Generator Diff: vsdrops (html) vsdrops (raw diff) gist (raw diff) - Please review changes)
Pipeline on Agent
Summary
Adds a new Copilot skill and agentic workflow for automated weekly CI post-mortem analysis.
What's included
- .agents/skills/macios-ci-postmortem/SKILL.md: Skill definition with a 4-phase workflow
  - […] ci-postmortem + copilot labels
- .agents/skills/macios-ci-postmortem/references/azure-devops-cli.md: AzDO CLI reference
- .github/workflows/ci-postmortem.md (+ compiled .lock.yml): Agentic workflow running weekly on Sunday
Key design decisions
- […] workerName from timelines to detect bot-concentrated failures
Validation
The skill was run manually twice (Apr 21-27 and Apr 21-28 windows). Results:
🤖 Pull request created by Copilot