@@ -6,21 +6,32 @@ license: Apache-2.0

# AEM Workflow Triaging — Cloud Service

Classify workflow issues, determine what logs and data to gather, and map to the correct runbook or log search. Optimized for **production support** on **AEM as a Cloud Service**.
Classify workflow issues, determine what logs and data to gather, and map to the correct runbook or log search. Optimized for **production support** on **AEM as a Cloud Service (AEMaaCS)**.

## Audience

AEMaaCS support and operations engineers (and the IDE LLM acting on their behalf) classifying workflow incidents across environments — environment ID + time-range + Cloud Manager Logs / Splunk context, before drilling into one instance. Use this skill for cross-environment log mining and symptom classification; switch to `workflow-debugging` once the instance and root cause are identified.

Review comment (Member):

Is this skill supposed to be Adobe-internal? This audience statement seems to indicate that.

For this repository we aim to have customer-facing skills of general applicability. As long as it's public, this skill should be usable by (developer) users of AEM as a Cloud Service.

Additionally, I find the instructions to deploy a custom StaleWorkflowServlet problematic as there is absolutely no guidance on how to securely write and deploy it.


## Variant Scope

- This skill is **cloud-service-only**.
- Log access via Cloud Manager download or log streaming.
- No JMX — workflow counts and queue metrics come from logs, APIs, or Developer Console.
- AEM as a Cloud Service only.
- **Not for AEM 6.5 LTS / AMS.** If the target is 6.5 LTS, stop and use the 6.5-lts variant of this skill — Splunk index/sourcetype paths, the JMX surface, and several log signatures there do not apply as written on AEMaaCS.
- Log access via **Cloud Manager** → Environments → Logs (download or streaming), or Splunk if logs are indexed there.
- **No JMX on AEMaaCS production.** Workflow counts and queue metrics come from logs, Developer Console status producers, and the Sling Job Console. JMX MBeans exist on the local AEMaaCS SDK (`localhost:4502`) but must not be assumed available on cloud environments.
- **All remediation lands via Git + Cloud Manager pipeline.** There is no Felix Console write access or Package Manager on production AEMaaCS environments.

## Dependencies

- `workflow-debugging` — once a symptom is classified and an environment/instance is identified, route here for the step-by-step runbook and remediation.
- `workflow-debugging/reference.md` — canonical diagnostic tool pointers, log patterns, and external doc links for AEMaaCS.

---

## When to use this skill

- User asks: "Workflow errors on <host> for the past X hours", "Workflow activity on <host>", "Why did workflow X fail?", "What should I collect to debug this workflow ticket?"
- User asks: "Workflow errors on `<env-id>` for the past X hours", "Workflow activity on `<env-id>`", "Why did workflow X fail?", "What should I collect to debug this workflow ticket?"
- User needs: Symptom classification, log patterns to search, Splunk queries, or required inputs for a runbook.
- Context: AEM Cloud Service (e.g. cm-p12345-e67890).
- Context: AEM Cloud Service (environment ID format: `cm-p<programId>-e<environmentId>`).
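
A minimal sketch of how a triage helper might check that a user-supplied value looks like an environment ID before scoping a log search. It assumes both IDs are purely numeric, as in the `cm-p12345-e67890` example; the names `ENV_ID_RE`, `EnvironmentId`, and `parse_environment_id` are hypothetical and not part of any AEM API.

```python
import re
from typing import NamedTuple, Optional

# Assumption: "cm-p<programId>-e<environmentId>" with numeric IDs,
# matching the example cm-p12345-e67890.
ENV_ID_RE = re.compile(r"^cm-p(?P<program>\d+)-e(?P<environment>\d+)$")


class EnvironmentId(NamedTuple):
    program_id: str
    environment_id: str


def parse_environment_id(value: str) -> Optional[EnvironmentId]:
    """Return the parsed IDs, or None if the value is not an environment ID."""
    match = ENV_ID_RE.match(value.strip())
    if match is None:
        return None
    return EnvironmentId(match.group("program"), match.group("environment"))
```

A `None` result is a cue to ask the user for the environment ID rather than guessing from a hostname.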

---

@@ -40,8 +51,11 @@ Map the user's description to a **symptom_id** and runbook.
| User cannot see work item or complete/delegate/return | user_cannot_see_or_complete_item | runbook-inbox-and-permissions.md |
| Cannot delete workflow model (running instances) | cannot_delete_model | runbook-model-delete-and-update.md |
| Jobs queued a long time; slow completion; queue depth high | slow_throughput_queue_backlog | runbook-job-throughput-and-concurrency.md |
| Auto-advance / timeout jobs not firing; participant step stuck past its configured timeout | workflow_auto_advance_failure | runbook-job-throughput-and-concurrency.md |
| New or changed workflow not starting or step not executing | workflow_setup_validation | runbook-validate-workflow-setup.md |

> **WorkItem vs. TaskManager task — do not confuse these.** Most workflow Inbox items are workflow work items (`WorkItem`), created by Participant steps and managed by the workflow engine; they are stored under `/var/workflow/instances`, not in TaskManager. TaskManager (`/var/taskmanagement/tasks`) only holds tasks created explicitly via the Task API — used by Projects, Assets tasks, and custom integrations. For `task_not_in_inbox` and `user_cannot_see_or_complete_item` symptoms on a workflow: investigate the Participant step assignee configuration, Inbox filters, and workflow permissions — not TaskManager storage. Diagnosing the wrong backend wastes significant time.
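
The WorkItem versus TaskManager distinction above can be sketched as a path check. This is an illustration only: it assumes the triager already has a repository path for the Inbox item, and the function name and return labels are hypothetical.

```python
# Storage roots named in the note above.
WORKFLOW_INSTANCES_ROOT = "/var/workflow/instances"
TASKMANAGER_ROOT = "/var/taskmanagement/tasks"


def _is_under(path: str, root: str) -> bool:
    """True if path is the root itself or a descendant of it."""
    return path == root or path.startswith(root + "/")


def inbox_item_backend(path: str) -> str:
    """Classify an Inbox item path by the backend that owns it."""
    if _is_under(path, WORKFLOW_INSTANCES_ROOT):
        return "workflow_work_item"  # investigate Participant step, Inbox filters, permissions
    if _is_under(path, TASKMANAGER_ROOT):
        return "taskmanager_task"    # created explicitly via the Task API
    return "unknown"
```

Matching on `root + "/"` rather than a bare prefix avoids misclassifying sibling paths that merely share the leading characters.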

---

## Step 2: Required inputs for triage
@@ -50,34 +64,37 @@ Before suggesting a runbook or Splunk search, try to obtain:

| Input | Purpose |
|-------|---------|
| **Host / instance** | e.g. cm-p163724-e1759416 (Cloud Service program-environment format). |
| **Environment ID** | AEMaaCS format: `cm-p<programId>-e<environmentId>` (e.g. `cm-p163724-e1759416`). |
| **Time range** | e.g. "past 4 hours", "past 10 hours" – for log/Splunk scope. |
| **Workflow model or step name** | e.g. "Dynamic Media Reupload", "DAM Update Asset", "testmodel". |
| **Instance ID** (if known) | From Workflow console URL or payload; ties logs to one instance. |
| **Payload path** (if known) | e.g. /content/dam/...; for path-related errors. |
| **Instance ID** (if known) | From Workflow Console URL or payload; ties logs to one instance. |
| **Payload path** (if known) | e.g. `/content/dam/...`; for path-related errors. |
| **Log source** | Cloud Manager log download, log streaming, or Splunk index/sourcetype. |

If the user only provides host + time, respond with the **generic** workflow error searches and note that narrowing by model/instance ID will improve accuracy.
If the user only provides environment ID + time, respond with the **generic** workflow error searches and note that narrowing by model or instance ID will improve accuracy.
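
One way to picture the rule above is a small record of the triage inputs plus a check for whether the search can be narrowed. This is a sketch under assumptions: the `TriageInputs` fields mirror the table rows, and the names and the `"generic"` / `"narrowed"` labels are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TriageInputs:
    environment_id: str                   # e.g. cm-p163724-e1759416
    time_range: str                       # e.g. "past 4 hours"
    model_or_step: Optional[str] = None   # e.g. "DAM Update Asset"
    instance_id: Optional[str] = None
    payload_path: Optional[str] = None
    log_source: Optional[str] = None


def search_scope(inputs: TriageInputs) -> str:
    """Return "narrowed" when a model/step name or instance ID is known."""
    if inputs.model_or_step or inputs.instance_id:
        return "narrowed"
    # Only environment ID + time range: fall back to the generic searches
    # and tell the user that a model or instance ID would improve accuracy.
    return "generic"
```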

---

## Step 3: Log patterns and Splunk (what to search)

Logs on Cloud Service are accessed via **Cloud Manager** → Environments → Logs (download or streaming). When logs are in **Splunk** (or any log aggregator), use these patterns.
Logs on AEMaaCS are accessed via **Cloud Manager** → Environments → Logs (download or streaming). The primary file is `error.log`. When logs are indexed in **Splunk** (or any log aggregator), use these patterns.

| Scenario | Primary log pattern(s) | Splunk hint |
|----------|------------------------|-------------|
| Scenario | Primary log pattern(s) | Note |
|----------|------------------------|------|
| Step failed | `Error executing workflow step` | Add instance ID or model name to narrow. |
| Process not found | `getProcess for '*' failed` | Extract process name for OSGi check. |
| Process not found | `getProcess for '*' failed` | Extract process name; check OSGi Components for `process.label` mismatch. |
| Stuck at Process step | Same as step failed + `getProcess` | Combine with payload path. |
| Stale workflow | `Cannot archive workitem` | Correlate time with instance. |
| Lock / throughput | `wait for a lock` or `refreshing the session since we had to wait` | Timechart by host. |
| Permission | `Terminate failed` / `Resume failed` / `Suspend failed` + verifyAccess | Or `AccessControlException`. |
| Payload path | `PathNotFoundException` + workflow/payload | Launcher: "launcher config". |
| Stale workflow | `Cannot archive workitem` | Correlate time with instance ID. |
| Lock / throughput | `refreshing the session since we had to wait for a lock` | Reduce effective concurrency — on AEMaaCS, job queue settings are not directly tunable at runtime; address via code changes: split workflows, offload heavy steps asynchronously, or externalize processing. Raising concurrency makes lock contention worse. |
| Permission | `Terminate failed` / `Resume failed` / `Suspend failed` + verifyAccess | Or `AccessControlException`. Check `enforceWorkflowInitiatorPermissions`. |
| Payload path | `PathNotFoundException` (workflow/payload) | Payload deleted, or launcher config path missing. |
| Launcher not starting | `Error adding launcher config` / `Error retrieving launcher config entries` | Path: `/conf/global/settings/workflow/launcher/config`. |
| Purge failure | `Workflow purge '*' :` | Filter by repository exception / invalid state. |
| Transient workflow retries exhausted | `retrys exceeded - remove isTransient` | Process step kept throwing after `cq.workflow.job.retry` retries. Fix step code; instance persisted for admin handling. |
| Thread pool full | `RejectedExecutionException` | `default` pool saturated with `blockPolicy=ABORT`; timeout/auto-advance jobs dropped. |
| Operation on finished instance | `Workflow is already finished` | Check logic that calls terminate/resume on a completed or aborted instance. |
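
For a downloaded `error.log`, a subset of the patterns in the table could be applied with a short script such as the sketch below. The scenario labels are hypothetical shorthand (they are not the `symptom_id` values), the `*` placeholders from the table are approximated with `.*`, and exact log wording may vary between releases, so treat the regexes as a starting point.

```python
import re
from collections import Counter
from typing import Iterable, Optional

# (label, pattern) pairs; the first match wins. Labels are illustrative only.
PATTERNS = [
    ("step_failed", re.compile(r"Error executing workflow step")),
    ("process_not_found", re.compile(r"getProcess for .* failed")),
    ("stale_workflow", re.compile(r"Cannot archive workitem")),
    ("lock_throughput",
     re.compile(r"refreshing the session since we had to wait for a lock")),
    ("launcher_not_starting",
     re.compile(r"Error adding launcher config|Error retrieving launcher config entries")),
    ("transient_retries_exhausted", re.compile(r"retrys exceeded - remove isTransient")),
    ("thread_pool_full", re.compile(r"RejectedExecutionException")),
    ("finished_instance", re.compile(r"Workflow is already finished")),
]


def classify_line(line: str) -> Optional[str]:
    """Return the label of the first matching pattern, or None."""
    for label, pattern in PATTERNS:
        if pattern.search(line):
            return label
    return None


def summarize(lines: Iterable[str]) -> Counter:
    """Count matching lines per scenario label."""
    counts: Counter = Counter()
    for line in lines:
        label = classify_line(line)
        if label is not None:
            counts[label] += 1
    return counts
```

Usage could be as simple as `summarize(open("error.log", encoding="utf-8", errors="replace"))`; the per-label counts then indicate which runbook to open first.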

**Example Splunk searches (replace index/sourcetype/field names as needed):**
**Example Splunk searches (replace index/sourcetype/field names for your environment):**

- All workflow step errors (last 24h):
`index=aem sourcetype=aem:error "Error executing workflow step" | table _time host message | sort - _time`
@@ -86,50 +103,83 @@ Logs on Cloud Service are accessed via **Cloud Manager** → Environments → Lo
- By workflow model or instance:
`index=aem ("Error executing workflow step" OR WorkflowException) (message=*<modelName>* OR message=*<instanceId>*) | sort - _time`
- Lock contention:
`index=aem "wait for a lock" OR "refreshing the session since we had to wait" | table _time host message`
`index=aem "refreshing the session since we had to wait for a lock" | table _time host message`
- Thread pool exhaustion (auto-advance impact):
`index=aem "RejectedExecutionException" | table _time host message`

> **Note:** Indexes and sourcetypes vary by organization; adapt queries accordingly.
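
Because index and field names differ per organization, the "by workflow model or instance" search could be assembled from parameters instead of edited by hand. The sketch below is one possible helper; the function name is hypothetical, the `message` field and `aem` index are taken from the examples above, and values are always quoted so that model names containing spaces stay in one term. It does not escape embedded double quotes.

```python
from typing import Optional


def workflow_error_search(
    index: str = "aem",
    model_name: Optional[str] = None,
    instance_id: Optional[str] = None,
) -> str:
    """Build a Splunk search string for workflow step errors."""
    query = f'index={index} ("Error executing workflow step" OR WorkflowException)'
    # Add one wildcard filter per known value; skip missing ones.
    filters = [f'message="*{value}*"' for value in (model_name, instance_id) if value]
    if filters:
        query += " (" + " OR ".join(filters) + ")"
    return query + " | sort - _time"
```

With no model or instance ID the helper returns the generic search, which matches the fallback described in Step 2.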

---

## Step 4: Example triage prompts and responses
## Step 4: Developer Console and Sling Job diagnostics

| User prompt | Triage response |
|-------------|------------------|
| "Workflow errors on &lt;host&gt; for the past X hours" | Classify as workflow_fails_or_shows_error / step_failed_retries_exhausted. Search Cloud Manager logs or Splunk for "Error executing workflow step", "Error processing workflow job", "getProcess for … failed" on that host. Route to runbook-workflow-fails-or-shows-error. |
| "Workflow activity on &lt;host&gt; for the past X hours" | Clarify: "activity" = counts (started/completed/failed) or list of errors? For errors, use same searches. For counts on Cloud Service, use log aggregation or custom reporting API — no JMX. |
| "Why did &lt;workflow-or-step&gt; fail? Show failure details." | Need: host, time range, and if possible instance ID. Search Cloud Manager logs for "Error executing workflow step" + model/step name or instance ID; return exception type, message, and stack. Route to runbook-workflow-fails-or-shows-error. |
| "Task not in Inbox" | symptom_id: task_not_in_inbox. Route to runbook-task-not-in-inbox. Gather: instance ID, assignee, whether user is initiator/assignee; check Inbox filters and enforceWorkitemAssigneePermissions. |
| "Workflow not starting" | symptom_id: workflow_not_starting_launcher. Route to runbook-launcher-not-starting. Gather: model name, payload path, launcher config path; search logs for launcher errors. |
| "Workflow stuck / not progressing" | symptom_id: workflow_stuck_not_progressing. Route to runbook-workflow-stuck. First: Does instance have a current work item? If no → stale. If yes, follow decision tree by step type. |
On AEMaaCS production, use the **Developer Console** status producers and the **Sling Jobs page** for metrics not available from logs alone. JMX is not available on production AEMaaCS; these are the equivalents.

---
| What to check | Tool / URL | Purpose |
|---------------|-----------|---------|
| Workflow queue depth and failed jobs | Sling Jobs page: `/system/console/slingevent` | `Queued Jobs > 0` with `Active Jobs = 0` → jobs not being picked up. `Failed Jobs` count per topic. |
| Workflow job topic statistics | Sling Jobs page: topic `com/adobe/granite/workflow/job/var/workflow/models/<modelName>` | High `Failed Jobs` / low `Finished Jobs` → process step throwing exceptions. |
| Sling `default` thread pool saturation | Thread Pools page: `/system/console/status-slingthreadpools` | `active count = max pool size` AND `blockPolicy = ABORT` → new scheduled tasks (including workflow timeout detection) are silently rejected. |
| Thread stack trace | Thread Dump: `/system/console/status-jstack-threaddump` | All `sling-default-*` threads stuck on same stack → blocking culprit for auto-advance failure. |
| Sling Scheduler status | Scheduler page: `/system/console/status-slingscheduler` | Confirm `ApacheSlingdefault` uses `ThreadPool: default`. Note: `com/adobe/granite/workflow/timeout/job` is a Sling Job topic, not visible here — check the Sling Jobs page instead. |
| OSGi bundle / process registration | OSGi Components: `/system/console/components` | Confirm WorkflowProcess component with matching `process.label` is Active. |
| Instance state | Workflow Console: `/libs/cq/workflow/admin/console/content/instances.html` | Instance status, current work item, history. |
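
The two numeric checks in the table (jobs not being picked up, and `default` pool saturation) can be written down as simple predicates. This is a sketch only: it assumes the values are read manually from the Sling Jobs and Thread Pools pages, and the dataclass and field names are hypothetical rather than anything those pages export.

```python
from dataclasses import dataclass


@dataclass
class QueueStats:
    queued_jobs: int
    active_jobs: int
    failed_jobs: int


@dataclass
class ThreadPoolStats:
    active_count: int
    max_pool_size: int
    block_policy: str


def jobs_not_picked_up(stats: QueueStats) -> bool:
    """Queued Jobs > 0 with Active Jobs = 0."""
    return stats.queued_jobs > 0 and stats.active_jobs == 0


def pool_rejecting_tasks(stats: ThreadPoolStats) -> bool:
    """Pool at capacity with blockPolicy = ABORT, so new tasks are rejected.

    Uses >= rather than strict equality as a defensive choice.
    """
    at_capacity = stats.active_count >= stats.max_pool_size
    return at_capacity and stats.block_policy.strip().upper() == "ABORT"
```

If `pool_rejecting_tasks` is true, the thread dump check in the table is the natural next step for finding what the `sling-default-*` threads are blocked on.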

**Developer Console access:** AEM Cloud Service → Developer Console. Status producers (thread dumps, Sling Jobs, thread pools) are read-only on all tiers. On the local AEMaaCS SDK (`localhost:4502/system/console/jmx`) JMX MBeans are also available — use them for local development only; do not document JMX steps for production.

## Step 5: What logs can and cannot answer
**Safety:** Never recommend remediation operations that bypass Git + Cloud Manager pipeline (e.g. Felix Console config changes) on cloud environments. All config changes go in `ui.config` and deploy via pipeline.

**Can answer (with AEM workflow logs in Cloud Manager / Splunk):**
---

- Step failures: exception type, message, stack (by host, time, model, step).
- Process not registered: which `process.label` is missing.
- Stuck: step errors, getProcess failures, lock wait, payload/path errors.
- Stale: "Cannot archive workitem" and transition errors.
- Throughput: lock wait, session refresh, JobHandler volume.
- Permission: Terminate/Resume/Suspend failed (verifyAccess), AccessControlException.
- Payload/launcher: PathNotFoundException, launcher config errors.
- Purge: "Workflow purge …" repository exception or invalid state.
## Step 5: Example triage prompts and responses

**Cannot answer directly (Cloud Service limitations):**
| User prompt | Triage response |
|-------------|-----------------|
| "Workflow errors on `<env-id>` for the past X hours" | Classify as `workflow_fails_or_shows_error` / `step_failed_retries_exhausted`. Download or stream `error.log` from Cloud Manager; search for `Error executing workflow step`, `Error processing workflow job`, `getProcess for … failed`. Check Sling Jobs page for failed job count per topic. Route to `runbook-workflow-fails-or-shows-error`. |
| "Workflow activity on `<env-id>` for the past X hours" | Clarify: counts (started/completed/failed) or list of errors? For errors, use log searches above. For counts on AEMaaCS, use Cloud Manager log aggregation or the Sling Jobs page — no JMX. |
| "Why did `<workflow-or-step>` fail? Show failure details." | Need: environment ID, time range, instance ID if known. Search `error.log` for `Error executing workflow step` + model/step name or instance ID; return exception type, message, and stack. Route to `runbook-workflow-fails-or-shows-error`. |
| "Task not in Inbox" | `symptom_id: task_not_in_inbox`. Route to `runbook-task-not-in-inbox`. Gather: instance ID, assignee, whether user is initiator/assignee. Check Inbox filters and `enforceWorkitemAssigneePermissions` via Developer Console OSGi config view. |
| "Workflow not starting" | `symptom_id: workflow_not_starting_launcher`. Route to `runbook-launcher-not-starting`. Gather: model name, payload path, launcher config path; search logs for launcher errors. |
| "Workflow stuck / not progressing" | `symptom_id: workflow_stuck_not_progressing`. Route to `runbook-workflow-stuck`. First: does the instance have a current work item? If no → stale. If yes, follow decision tree by step type. |
| "Auto-advance / timeout jobs not firing" | `symptom_id: workflow_auto_advance_failure`. Route to `runbook-job-throughput-and-concurrency`. Check Developer Console thread dump for `sling-default-*` thread saturation; check Sling Jobs page for `com/adobe/granite/workflow/timeout/job` topic; search `error.log` for `RejectedExecutionException`. |

- Console state (e.g. "is there a current work item?"). Use Workflow Console UI or custom API.
- JMX counts (e.g. countStaleWorkflows, queue depth). No JMX on Cloud Service — use log aggregation, custom HTTP APIs, or Developer Console.
- Thread pool metrics. Request thread dump via Developer Console or support.
- Configuration status ZIP. Request from support.
---

Always pair log-based triage with the appropriate runbook for actions (retry via Inbox, Purge Scheduler config, pipeline deploy).
## Step 6: What logs and Developer Console can and cannot answer

**Can answer (with AEM workflow logs from Cloud Manager + Developer Console on AEMaaCS):**

- Step failures: exception type, message, stack (by environment, time, model, step).
- Process not registered: which `process.label` is missing (logs + Developer Console OSGi Components).
- Stuck: step errors, `getProcess` failures, lock wait, payload/path errors.
- Stale: `Cannot archive workitem` and transition errors in logs.
- Queue metrics: Sling Jobs page (`/system/console/slingevent`) → queued, active, failed per topic.
- Thread pool saturation: Thread Pools page (`/system/console/status-slingthreadpools`).
- Throughput: lock wait, session refresh, JobHandler volume in logs.
- Permission: Terminate/Resume/Suspend failed (`verifyAccess`), `AccessControlException` in logs.
- Payload/launcher: `PathNotFoundException`, launcher config errors in logs.
- Purge: `Workflow purge …` repository exception or invalid state in logs.

**Cannot answer directly (AEMaaCS limitations vs 6.5 LTS):**

| What is needed | AEMaaCS alternative |
|----------------|---------------------|
| JMX `countStaleWorkflows` | Deploy a custom `StaleWorkflowServlet` (see `workflow-debugging` Step 6); call with `?dryRun=true`. |
| JMX `countRunningWorkflows` | Workflow Console UI, or a custom count servlet. |
| JMX `retryFailedWorkItems` | Inbox UI → Retry (single); or a custom bulk-retry servlet (see `workflow-debugging` Step 6). |
| JMX `purgeCompleted` | `com.adobe.granite.workflow.purge.Scheduler-<alias>.cfg.json` deployed via pipeline. |
| JMX `restartStaleWorkflows` | Custom `StaleWorkflowServlet` with `POST ...?dryRun=false`. |
| Config status ZIP | Developer Console status producers; or request from Adobe Support. |
| Console state (current work item) | Workflow Console UI (`/libs/cq/workflow/admin/...`) or custom API. |
| Runtime process step code behavior | Requires code review + log correlation. |
| Pod restart | Adobe Support ticket — Cloud Manager does not expose a customer-facing restart action. |

Always pair log-based triage with Developer Console diagnostics and the appropriate runbook for actions (Inbox Retry, Purge Scheduler config, Cloud Manager pipeline deploy).

---

## References (in repo)

- **Machine-readable index:** `aem-agent-marketplace-workflow-knowledge-base/docs/debugging-index.md`
- **Decision guide:** `runbooks/runbook-decision-guide.md`
- **Splunk scenarios and queries:** `Workflow-docs/splunk-workflow-triaging.md`
- **Error patterns:** `docs/error-patterns.md`
- **Diagnostic tool pointers and log patterns:** [`../workflow-debugging/reference.md`](../workflow-debugging/reference.md)
- **Step-by-step runbook (per symptom):** [`../workflow-debugging/SKILL.md`](../workflow-debugging/SKILL.md)
- **Cloud Service guardrails (paths, service users, OSGi annotations):** [`../workflow-development/references/workflow-foundation/cloud-service-guardrails.md`](../workflow-development/references/workflow-foundation/cloud-service-guardrails.md)