diff --git a/plugins/aem/cloud-service/skills/aem-workflow/workflow-triaging/SKILL.md b/plugins/aem/cloud-service/skills/aem-workflow/workflow-triaging/SKILL.md
index fcaeaf9f..526ca71d 100644
--- a/plugins/aem/cloud-service/skills/aem-workflow/workflow-triaging/SKILL.md
+++ b/plugins/aem/cloud-service/skills/aem-workflow/workflow-triaging/SKILL.md
@@ -6,21 +6,32 @@ license: Apache-2.0
 
 # AEM Workflow Triaging — Cloud Service
 
-Classify workflow issues, determine what logs and data to gather, and map to the correct runbook or log search. Optimized for **production support** on **AEM as a Cloud Service**.
+Classify workflow issues, determine what logs and data to gather, and map to the correct runbook or log search. Optimized for **production support** on **AEM as a Cloud Service (AEMaaCS)**.
+
+## Audience
+
+AEMaaCS support and operations engineers (and the IDE LLM acting on their behalf) classifying workflow incidents across environments — environment ID + time-range + Cloud Manager Logs / Splunk context, before drilling into one instance. Use this skill for cross-environment log mining and symptom classification; switch to `workflow-debugging` once the instance and root cause are identified.
 
 ## Variant Scope
 
-- This skill is **cloud-service-only**.
-- Log access via Cloud Manager download or log streaming.
-- No JMX — workflow counts and queue metrics come from logs, APIs, or Developer Console.
+- AEM as a Cloud Service only.
+- **Not for AEM 6.5 LTS / AMS.** If the target is 6.5 LTS, stop and use the 6.5-lts variant of this skill — Splunk index/sourcetype paths, the JMX surface, and several log signatures there do not apply as written on AEMaaCS.
+- Log access via **Cloud Manager** → Environments → Logs (download or streaming), or Splunk if logs are indexed there.
+- **No JMX on AEMaaCS production.** Workflow counts and queue metrics come from logs, Developer Console status producers, and the Sling Job Console. JMX MBeans exist on the local AEMaaCS SDK (`localhost:4502`) but must not be assumed available on cloud environments.
+- **All remediation lands via Git + Cloud Manager pipeline.** There is no Felix Console write access or Package Manager on production AEMaaCS environments.
+
+## Dependencies
+
+- `workflow-debugging` — once a symptom is classified and an environment/instance is identified, route here for the step-by-step runbook and remediation.
+- `workflow-debugging/reference.md` — canonical diagnostic tool pointers, log patterns, and external doc links for AEMaaCS.
 
 ---
 
 ## When to use this skill
 
-- User asks: "Workflow errors on <host> for the past X hours", "Workflow activity on <host>", "Why did workflow X fail?", "What should I collect to debug this workflow ticket?"
+- User asks: "Workflow errors on `` for the past X hours", "Workflow activity on ``", "Why did workflow X fail?", "What should I collect to debug this workflow ticket?"
 - User needs: Symptom classification, log patterns to search, Splunk queries, or required inputs for a runbook.
-- Context: AEM Cloud Service (e.g. cm-p12345-e67890).
+- Context: AEM Cloud Service (environment ID format: `cm-p-e`).
 
 ---
 
@@ -40,8 +51,11 @@ Map the user's description to a **symptom_id** and runbook.
 | User cannot see work item or complete/delegate/return | user_cannot_see_or_complete_item | runbook-inbox-and-permissions.md |
 | Cannot delete workflow model (running instances) | cannot_delete_model | runbook-model-delete-and-update.md |
 | Jobs queued a long time; slow completion; queue depth high | slow_throughput_queue_backlog | runbook-job-throughput-and-concurrency.md |
+| Auto-advance / timeout jobs not firing; participant step stuck past its configured timeout | workflow_auto_advance_failure | runbook-job-throughput-and-concurrency.md |
 | New or changed workflow not starting or step not executing | workflow_setup_validation | runbook-validate-workflow-setup.md |
 
+> **WorkItem vs. TaskManager task — do not confuse these.** Most workflow Inbox items are workflow work items (`WorkItem`), created by Participant steps and managed by the workflow engine; they are stored under `/var/workflow/instances`, not in TaskManager. TaskManager (`/var/taskmanagement/tasks`) only holds tasks created explicitly via the Task API — used by Projects, Assets tasks, and custom integrations. For `task_not_in_inbox` and `user_cannot_see_or_complete_item` symptoms on a workflow: investigate the Participant step assignee configuration, Inbox filters, and workflow permissions — not TaskManager storage. Diagnosing the wrong backend wastes significant time.
+
 ---
 
 ## Step 2: Required inputs for triage
@@ -50,34 +64,37 @@ Before suggesting a runbook or Splunk search, try to obtain:
 
 | Input | Purpose |
 |-------|---------|
-| **Host / instance** | e.g. cm-p163724-e1759416 (Cloud Service program-environment format). |
+| **Environment ID** | AEMaaCS format: `cm-p-e` (e.g. `cm-p163724-e1759416`). |
 | **Time range** | e.g. "past 4 hours", "past 10 hours" – for log/Splunk scope. |
 | **Workflow model or step name** | e.g. "Dynamic Media Reupload", "DAM Update Asset", "testmodel". |
-| **Instance ID** (if known) | From Workflow console URL or payload; ties logs to one instance. |
-| **Payload path** (if known) | e.g. /content/dam/...; for path-related errors. |
+| **Instance ID** (if known) | From Workflow Console URL or payload; ties logs to one instance. |
+| **Payload path** (if known) | e.g. `/content/dam/...`; for path-related errors. |
 | **Log source** | Cloud Manager log download, log streaming, or Splunk index/sourcetype. |
 
-If the user only provides host + time, respond with the **generic** workflow error searches and note that narrowing by model/instance ID will improve accuracy.
+If the user only provides environment ID + time, respond with the **generic** workflow error searches and note that narrowing by model or instance ID will improve accuracy.
 
 ---
 
 ## Step 3: Log patterns and Splunk (what to search)
 
-Logs on Cloud Service are accessed via **Cloud Manager** → Environments → Logs (download or streaming). When logs are in **Splunk** (or any log aggregator), use these patterns.
+Logs on AEMaaCS are accessed via **Cloud Manager** → Environments → Logs (download or streaming). The primary file is `error.log`. When logs are indexed in **Splunk** (or any log aggregator), use these patterns.
 
-| Scenario | Primary log pattern(s) | Splunk hint |
-|----------|------------------------|-------------|
+| Scenario | Primary log pattern(s) | Note |
+|----------|------------------------|------|
 | Step failed | `Error executing workflow step` | Add instance ID or model name to narrow. |
-| Process not found | `getProcess for '*' failed` | Extract process name for OSGi check. |
+| Process not found | `getProcess for '*' failed` | Extract process name; check OSGi Components for `process.label` mismatch. |
 | Stuck at Process step | Same as step failed + `getProcess` | Combine with payload path. |
-| Stale workflow | `Cannot archive workitem` | Correlate time with instance. |
-| Lock / throughput | `wait for a lock` or `refreshing the session since we had to wait` | Timechart by host. |
-| Permission | `Terminate failed` / `Resume failed` / `Suspend failed` + verifyAccess | Or `AccessControlException`. |
-| Payload path | `PathNotFoundException` + workflow/payload | Launcher: "launcher config". |
+| Stale workflow | `Cannot archive workitem` | Correlate time with instance ID. |
+| Lock / throughput | `refreshing the session since we had to wait for a lock` | Reduce effective concurrency — on AEMaaCS, job queue settings are not directly tunable at runtime; address via code changes: split workflows, offload heavy steps asynchronously, or externalize processing. Raising concurrency makes lock contention worse. |
+| Permission | `Terminate failed` / `Resume failed` / `Suspend failed` + verifyAccess | Or `AccessControlException`. Check `enforceWorkflowInitiatorPermissions`. |
+| Payload path | `PathNotFoundException` (workflow/payload) | Payload deleted, or launcher config path missing. |
 | Launcher not starting | `Error adding launcher config` / `Error retrieving launcher config entries` | Path: `/conf/global/settings/workflow/launcher/config`. |
 | Purge failure | `Workflow purge '*' :` | Filter by repository exception / invalid state. |
+| Transient workflow retries exhausted | `retrys exceeded - remove isTransient` | Process step kept throwing after `cq.workflow.job.retry` retries. Fix step code; instance persisted for admin handling. |
+| Thread pool full | `RejectedExecutionException` | `default` pool saturated with `blockPolicy=ABORT`; timeout/auto-advance jobs dropped. |
+| Operation on finished instance | `Workflow is already finished` | Check logic that calls terminate/resume on a completed or aborted instance. |
 
-**Example Splunk searches (replace index/sourcetype/field names as needed):**
+**Example Splunk searches (replace index/sourcetype/field names for your environment):**
 
 - All workflow step errors (last 24h):
   `index=aem sourcetype=aem:error "Error executing workflow step" | table _time host message | sort - _time`
@@ -86,50 +103,83 @@ Logs on Cloud Service are accessed via **Cloud Manager** → Environments → Lo
 - By workflow model or instance:
   `index=aem ("Error executing workflow step" OR WorkflowException) (message=** OR message=**) | sort - _time`
 - Lock contention:
-  `index=aem "wait for a lock" OR "refreshing the session since we had to wait" | table _time host message`
+  `index=aem "refreshing the session since we had to wait for a lock" | table _time host message`
+- Thread pool exhaustion (auto-advance impact):
+  `index=aem "RejectedExecutionException" | table _time host message`
+
+> **Note:** Indexes and sourcetypes vary by organization; adapt queries accordingly.
 
 ---
 
-## Step 4: Example triage prompts and responses
+## Step 4: Developer Console and Sling Job diagnostics
 
-| User prompt | Triage response |
-|-------------|------------------|
-| "Workflow errors on <host> for the past X hours" | Classify as workflow_fails_or_shows_error / step_failed_retries_exhausted. Search Cloud Manager logs or Splunk for "Error executing workflow step", "Error processing workflow job", "getProcess for … failed" on that host. Route to runbook-workflow-fails-or-shows-error. |
-| "Workflow activity on <host> for the past X hours" | Clarify: "activity" = counts (started/completed/failed) or list of errors? For errors, use same searches. For counts on Cloud Service, use log aggregation or custom reporting API — no JMX. |
-| "Why did <workflow-or-step> fail? Show failure details." | Need: host, time range, and if possible instance ID. Search Cloud Manager logs for "Error executing workflow step" + model/step name or instance ID; return exception type, message, and stack. Route to runbook-workflow-fails-or-shows-error. |
-| "Task not in Inbox" | symptom_id: task_not_in_inbox. Route to runbook-task-not-in-inbox. Gather: instance ID, assignee, whether user is initiator/assignee; check Inbox filters and enforceWorkitemAssigneePermissions. |
-| "Workflow not starting" | symptom_id: workflow_not_starting_launcher. Route to runbook-launcher-not-starting. Gather: model name, payload path, launcher config path; search logs for launcher errors. |
-| "Workflow stuck / not progressing" | symptom_id: workflow_stuck_not_progressing. Route to runbook-workflow-stuck. First: Does instance have a current work item? If no → stale. If yes, follow decision tree by step type. |
+On AEMaaCS production, use the **Developer Console** status producers and the **Sling Jobs page** for metrics not available from logs alone. JMX is not available on production AEMaaCS; these are the equivalents.
 
----
+| What to check | Tool / URL | Purpose |
+|---------------|-----------|---------|
+| Workflow queue depth and failed jobs | Sling Jobs page: `/system/console/slingevent` | `Queued Jobs > 0` with `Active Jobs = 0` → jobs not being picked up. `Failed Jobs` count per topic. |
+| Workflow job topic statistics | Sling Jobs page: topic `com/adobe/granite/workflow/job/var/workflow/models/` | High `Failed Jobs` / low `Finished Jobs` → process step throwing exceptions. |
+| Sling `default` thread pool saturation | Thread Pools page: `/system/console/status-slingthreadpools` | `active count = max pool size` AND `blockPolicy = ABORT` → new scheduled tasks (including workflow timeout detection) are silently rejected. |
+| Thread stack trace | Thread Dump: `/system/console/status-jstack-threaddump` | All `sling-default-*` threads stuck on same stack → blocking culprit for auto-advance failure. |
+| Sling Scheduler status | Scheduler page: `/system/console/status-slingscheduler` | Confirm `ApacheSlingdefault` uses `ThreadPool: default`. Note: `com/adobe/granite/workflow/timeout/job` is a Sling Job topic, not visible here — check the Sling Jobs page instead. |
+| OSGi bundle / process registration | OSGi Components: `/system/console/components` | Confirm WorkflowProcess component with matching `process.label` is Active. |
+| Instance state | Workflow Console: `/libs/cq/workflow/admin/console/content/instances.html` | Instance status, current work item, history. |
+
+**Developer Console access:** AEM Cloud Service → Developer Console. Status producers (thread dumps, Sling Jobs, thread pools) are read-only on all tiers. On the local AEMaaCS SDK (`localhost:4502/system/console/jmx`) JMX MBeans are also available — use them for local development only; do not document JMX steps for production.
 
-## Step 5: What logs can and cannot answer
+**Safety:** Never recommend remediation operations that bypass Git + Cloud Manager pipeline (e.g. Felix Console config changes) on cloud environments. All config changes go in `ui.config` and deploy via pipeline.
 
-**Can answer (with AEM workflow logs in Cloud Manager / Splunk):**
+---
 
-- Step failures: exception type, message, stack (by host, time, model, step).
-- Process not registered: which `process.label` is missing.
-- Stuck: step errors, getProcess failures, lock wait, payload/path errors.
-- Stale: "Cannot archive workitem" and transition errors.
-- Throughput: lock wait, session refresh, JobHandler volume.
-- Permission: Terminate/Resume/Suspend failed (verifyAccess), AccessControlException.
-- Payload/launcher: PathNotFoundException, launcher config errors.
-- Purge: "Workflow purge …" repository exception or invalid state.
+## Step 5: Example triage prompts and responses
 
-**Cannot answer directly (Cloud Service limitations):**
+| User prompt | Triage response |
+|-------------|-----------------|
+| "Workflow errors on `` for the past X hours" | Classify as `workflow_fails_or_shows_error` / `step_failed_retries_exhausted`. Download or stream `error.log` from Cloud Manager; search for `Error executing workflow step`, `Error processing workflow job`, `getProcess for … failed`. Check Sling Jobs page for failed job count per topic. Route to `runbook-workflow-fails-or-shows-error`. |
+| "Workflow activity on `` for the past X hours" | Clarify: counts (started/completed/failed) or list of errors? For errors, use log searches above. For counts on AEMaaCS, use Cloud Manager log aggregation or the Sling Jobs page — no JMX. |
+| "Why did `` fail? Show failure details." | Need: environment ID, time range, instance ID if known. Search `error.log` for `Error executing workflow step` + model/step name or instance ID; return exception type, message, and stack. Route to `runbook-workflow-fails-or-shows-error`. |
+| "Task not in Inbox" | `symptom_id: task_not_in_inbox`. Route to `runbook-task-not-in-inbox`. Gather: instance ID, assignee, whether user is initiator/assignee. Check Inbox filters and `enforceWorkitemAssigneePermissions` via Developer Console OSGi config view. |
+| "Workflow not starting" | `symptom_id: workflow_not_starting_launcher`. Route to `runbook-launcher-not-starting`. Gather: model name, payload path, launcher config path; search logs for launcher errors. |
+| "Workflow stuck / not progressing" | `symptom_id: workflow_stuck_not_progressing`. Route to `runbook-workflow-stuck`. First: does the instance have a current work item? If no → stale. If yes, follow decision tree by step type. |
+| "Auto-advance / timeout jobs not firing" | `symptom_id: workflow_auto_advance_failure`. Route to `runbook-job-throughput-and-concurrency`. Check Developer Console thread dump for `sling-default-*` thread saturation; check Sling Jobs page for `com/adobe/granite/workflow/timeout/job` topic; search `error.log` for `RejectedExecutionException`. |
 
-- Console state (e.g. "is there a current work item?"). Use Workflow Console UI or custom API.
-- JMX counts (e.g. countStaleWorkflows, queue depth). No JMX on Cloud Service — use log aggregation, custom HTTP APIs, or Developer Console.
-- Thread pool metrics. Request thread dump via Developer Console or support.
-- Configuration status ZIP. Request from support.
+---
 
-Always pair log-based triage with the appropriate runbook for actions (retry via Inbox, Purge Scheduler config, pipeline deploy).
+## Step 6: What logs and Developer Console can and cannot answer
+
+**Can answer (with AEM workflow logs from Cloud Manager + Developer Console on AEMaaCS):**
+
+- Step failures: exception type, message, stack (by environment, time, model, step).
+- Process not registered: which `process.label` is missing (logs + Developer Console OSGi Components).
+- Stuck: step errors, `getProcess` failures, lock wait, payload/path errors.
+- Stale: `Cannot archive workitem` and transition errors in logs.
+- Queue metrics: Sling Jobs page (`/system/console/slingevent`) → queued, active, failed per topic.
+- Thread pool saturation: Thread Pools page (`/system/console/status-slingthreadpools`).
+- Throughput: lock wait, session refresh, JobHandler volume in logs.
+- Permission: Terminate/Resume/Suspend failed (`verifyAccess`), `AccessControlException` in logs.
+- Payload/launcher: `PathNotFoundException`, launcher config errors in logs.
+- Purge: `Workflow purge …` repository exception or invalid state in logs.
+
+**Cannot answer directly (AEMaaCS limitations vs 6.5 LTS):**
+
+| What is needed | AEMaaCS alternative |
+|----------------|---------------------|
+| JMX `countStaleWorkflows` | Deploy a custom `StaleWorkflowServlet` (see `workflow-debugging` Step 6); call with `?dryRun=true`. |
+| JMX `countRunningWorkflows` | Workflow Console UI, or a custom count servlet. |
+| JMX `retryFailedWorkItems` | Inbox UI → Retry (single); or a custom bulk-retry servlet (see `workflow-debugging` Step 6). |
+| JMX `purgeCompleted` | `com.adobe.granite.workflow.purge.Scheduler-.cfg.json` deployed via pipeline. |
+| JMX `restartStaleWorkflows` | Custom `StaleWorkflowServlet` with `POST ...?dryRun=false`. |
+| Config status ZIP | Developer Console status producers; or request from Adobe Support. |
+| Console state (current work item) | Workflow Console UI (`/libs/cq/workflow/admin/...`) or custom API. |
+| Runtime process step code behavior | Requires code review + log correlation. |
+| Pod restart | Adobe Support ticket — Cloud Manager does not expose a customer-facing restart action. |
+
+Always pair log-based triage with Developer Console diagnostics and the appropriate runbook for actions (Inbox Retry, Purge Scheduler config, Cloud Manager pipeline deploy).
 
 ---
 
 ## References (in repo)
 
-- **Machine-readable index:** `aem-agent-marketplace-workflow-knowledge-base/docs/debugging-index.md`
-- **Decision guide:** `runbooks/runbook-decision-guide.md`
-- **Splunk scenarios and queries:** `Workflow-docs/splunk-workflow-triaging.md`
-- **Error patterns:** `docs/error-patterns.md`
+- **Diagnostic tool pointers and log patterns:** [`../workflow-debugging/reference.md`](../workflow-debugging/reference.md)
+- **Step-by-step runbook (per symptom):** [`../workflow-debugging/SKILL.md`](../workflow-debugging/SKILL.md)
+- **Cloud Service guardrails (paths, service users, OSGi annotations):** [`../workflow-development/references/workflow-foundation/cloud-service-guardrails.md`](../workflow-development/references/workflow-foundation/cloud-service-guardrails.md)
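
Illustration only, not part of the patch: the Step 3 log-pattern table added in this diff could be applied to a downloaded `error.log` with a small script along these lines. This is a minimal sketch under stated assumptions. The scenario labels and pattern strings are copied from the table; the function names (`classify`, `summarize`), the sample log lines, and the reading of the table's `'*'` wildcard as a single-quoted name are hypothetical choices made for the example, not confirmed AEM log formats.

```python
import re
from collections import Counter

# (scenario label, compiled pattern) pairs. Labels and pattern text come from
# the Step 3 table; the '*' in the table is ASSUMED to stand for a quoted name.
PATTERNS = [
    ("Step failed", re.compile(r"Error executing workflow step")),
    ("Process not found", re.compile(r"getProcess for '.*?' failed")),
    ("Stale workflow", re.compile(r"Cannot archive workitem")),
    ("Lock / throughput",
     re.compile(r"refreshing the session since we had to wait for a lock")),
    ("Permission",
     re.compile(r"(Terminate|Resume|Suspend) failed|AccessControlException")),
    ("Payload path", re.compile(r"PathNotFoundException")),
    ("Launcher not starting",
     re.compile(r"Error adding launcher config"
                r"|Error retrieving launcher config entries")),
    ("Purge failure", re.compile(r"Workflow purge '.*?' :")),
    ("Transient workflow retries exhausted",
     re.compile(r"retrys exceeded - remove isTransient")),
    ("Thread pool full", re.compile(r"RejectedExecutionException")),
    ("Operation on finished instance",
     re.compile(r"Workflow is already finished")),
]


def classify(line):
    """Return the label of the first pattern that matches, or None."""
    for label, pattern in PATTERNS:
        if pattern.search(line):
            return label
    return None


def summarize(lines):
    """Count matching lines per scenario label; unmatched lines are ignored."""
    return Counter(
        label for label in (classify(line) for line in lines)
        if label is not None
    )


# Hypothetical sample lines, invented for this sketch (not real AEM output).
SAMPLE_LINES = [
    "*ERROR* Error executing workflow step",
    "*ERROR* getProcess for 'hypothetical-process' failed",
    "*WARN* java.util.concurrent.RejectedExecutionException: task rejected",
    "*INFO* unrelated message",
    "*ERROR* Error executing workflow step",
]

if __name__ == "__main__":
    for label, count in summarize(SAMPLE_LINES).most_common():
        print(f"{count:4d}  {label}")
```

Because `classify` returns the first match, the order of `PATTERNS` decides how a line containing two signatures is labeled; a line carrying both a step failure and a `PathNotFoundException` would be counted as "Step failed" here.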