feat(s4): UC1 smoke test PASS — close #65 by bjridicodes · Pull Request #100 · aria-aiops/aria

bjridicodes · 2026-06-18T07:54:26Z

Summary

UC1 smoke test ran live against Azure CDP cluster (cdp-master-01)
Agent 1 stubbed; Agents 2 → 3 → 4 executed against real infrastructure
Result: PASS — 1 log line retrieved, classified disk/HIGH (0.93), Slack notified
UC2 and UC3 deferred (GCP billing blocked; Azure HDInsight not deployed yet)

Changes

scripts/smoke_uc1.py — smoke test script (EnvVarVault, Agent 1 stub, live SSH)
infra/terraform/uc_testing/azure/uc1-hadoop-onprem/main.tf — VM type Standard_D2s_v3 (B2ms unavailable in West Europe), skip_provider_registration = true for azurerm 3.x compat
documentation/reports/s4_uc1_smoke_test_2026-06-17.md — full results, infra detail, issues encountered
README.md — S4 smoke test results table; S4 roadmap marked ✅ Done (UC1 passed; UC2/UC3 deferred)

Smoke test output

log_lines:         1        (DISK_FAILURE WARN on /var/log/hadoop/hdfs/)
root_cause:        disk
confidence:        HIGH (0.93)
notification_sent: True
PASS

Deferred

UC2 (Managed Spark / HDInsight) and UC3 (GCP native) blocked on GCP billing resolution
CDP_SSH_USER in Infisical (aria-cdp) needs updating to aria before S5 full run

Test plan

make lint passes (all 4 tools: black, isort, ruff, mypy)
Smoke test passed on live Azure infrastructure
No secrets committed (all via Infisical / env vars)

Closes #65

🤖 Generated with Claude Code

* ci: add ruff and mypy to lint job; add tool config to pyproject.toml Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix: black formatting + update SNOW error messages to mention both config paths Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix: isort import ordering in 5 files Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * tooling: add pre-commit config and Makefile for local lint/format Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix: resolve all ruff violations (E402, E741, UP035, UP045, F401) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix: reformat 3 files with black after manual edits Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix: resolve all mypy errors across source files and tests - Add type guards (assert not None) in log_extractor and agent1 router - Narrow str | None → str with `or ""` in incident_reader - Add missing type annotations (Any params, return types) in connectors - Suppress third-party library type noise with targeted type: ignore comments - Exclude tests from disallow_untyped_defs in mypy config Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * tooling: point Makefile to project .venv instead of system Python Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat(s6): LangGraph orchestration + ReAct loop scaffold (M6)
  - Add PipelineState.pending_log_request and loop_iterations fields for the ReAct loop (Agent 3 ↔ Agent 2); stub Agent 3 never fires it
  - New ClassifierAgent stub (core/agents/classifier.py): always returns error_class="unknown"/LOW; M4 drops in a real LLM-based implementation
  - New ARIAPipeline (core/orchestrator/pipeline.py): LangGraph StateGraph wiring A1 → A2 → A3 → A4; conditional edge routes errors to A4 directly; ReAct loop capped at 5 iterations
  - InMemoryCommunicator for dry-run and unit testing (no Slack token needed)
  - dry_run() config accessor; ARIA_DRY_RUN=true injects all in-memory stubs
  - Pipeline REST API: POST /api/v1/pipeline/run + GET /api/v1/pipeline/health
  - 11 new tests (7 unit, 4 integration); total 201 passing
* docs: update README to reflect M6 completion
  Mark M6 orchestration as done, update Agent 0 and Agent 3 status, add pipeline router to directory tree, mark pipeline endpoint live.

…-check (#22)

* ci: add ruff and mypy to lint job; add tool config to pyproject.toml
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix: black formatting + update SNOW error messages to mention both config paths
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix: isort import ordering in 5 files
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* tooling: add pre-commit config and Makefile for local lint/format
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix: resolve all ruff violations (E402, E741, UP035, UP045, F401)
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix: reformat 3 files with black after manual edits
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix: resolve all mypy errors across source files and tests
  - Add type guards (assert not None) in log_extractor and agent1 router
  - Narrow str | None → str with `or ""` in incident_reader
  - Add missing type annotations (Any params, return types) in connectors
  - Suppress third-party library type noise with targeted type: ignore comments
  - Exclude tests from disallow_untyped_defs in mypy config
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* tooling: point Makefile to project .venv instead of system Python
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix: require host_key_secret for SSHLogConnector, resolves CodeQL alert
  WarningPolicy accepted unknown SSH host keys silently, enabling MITM. Now raises ValueError when host_key_secret is not configured so the connector fails clearly rather than connecting insecurely. Tests updated to mock _load_known_host_key and pass host_key_secret in _make_connector defaults.
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
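The fail-closed behaviour described in the host-key fix can be sketched as below. This is only an illustration under stated assumptions: `require_host_key_secret` is a hypothetical helper name, and the real SSHLogConnector may perform the check elsewhere (for example in its constructor).

```python
from typing import Optional


def require_host_key_secret(host: str, host_key_secret: Optional[str]) -> str:
    """Return the configured secret name, or refuse to proceed.

    Hypothetical helper: the point is that a missing host_key_secret raises
    ValueError instead of silently accepting an unknown SSH host key.
    """
    if not host_key_secret:
        raise ValueError(
            f"host_key_secret is required for SSH host {host!r}; "
            "refusing to connect without a pinned host key"
        )
    return host_key_secret
```

Failing at configuration time keeps the insecure path unreachable, rather than relying on a warning that nobody reads.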

* feat(s7): implement Agent 3, LLM-based error classifier with REST endpoint
  Closes #2 (ARI-18), #3 (ARI-19), #5 (ARI-20), #4 (ARI-21), #19 (ARI-63)
  - Replace M6 stub in ClassifierAgent with real LLM call via LLMClientInterface
  - JSON prompt → parse → confidence band derivation (≥0.7=HIGH, ≥0.5=MED, <0.5=LOW)
  - ClassificationError raised on LLM or parse failure; pipeline top-level catch handles it
  - Stub fallback preserved when no LLM client injected (dry-run compatibility)
  - POST /api/v1/agent3/run + GET /api/v1/agent3/health endpoints
  - 11 unit tests (mocked LLM, no Anthropic import needed)
  - 3 integration tests against Anthropic (CDP disk-full, Databricks OOM, Oracle listener)
* docs: add docstrings to all agents, routers, connectors, and tests
  Adds missing function and method docstrings across the codebase (agents, API routers, connectors, implementations, and test helpers) to bring all modules up to the project's comment-everything standard.
* docs: update README and API docs to reflect Agent 3 as implemented
  - Mark Agent 3 ✅ Implemented in agent section, API table, repo structure, and roadmap
  - Correct error_class values in aria_apis.md to match implementation (oom|cpu|disk|network|auth|db_lock|pipeline|unknown)
  - Update Agent 3 API status from 🔜 M4 to ✅ Implemented (S7)
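The confidence band derivation stated in the commit (≥0.7=HIGH, ≥0.5=MED, <0.5=LOW) can be written as a small pure function. The function name `confidence_band` is an assumption for illustration; only the thresholds come from the commit message.

```python
def confidence_band(score: float) -> str:
    """Map a numeric confidence score to a band.

    Thresholds follow the commit message; the function name is hypothetical.
    """
    if score >= 0.7:
        return "HIGH"
    if score >= 0.5:
        return "MED"
    return "LOW"


# The UC1 smoke test reported a score of 0.93, which lands in the HIGH band.
smoke_band = confidence_band(0.93)
```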

…#30) * feat(s7): implement Agent 3 — LLM-based error classifier with REST endpoint Closes #2 (ARI-18), #3 (ARI-19), #5 (ARI-20), #4 (ARI-21), #19 (ARI-63) - Replace M6 stub in ClassifierAgent with real LLM call via LLMClientInterface - JSON prompt → parse → confidence band derivation (≥0.7=HIGH, ≥0.5=MED, <0.5=LOW) - ClassificationError raised on LLM or parse failure; pipeline top-level catch handles it - Stub fallback preserved when no LLM client injected (dry-run compatibility) - POST /api/v1/agent3/run + GET /api/v1/agent3/health endpoints - 11 unit tests (mocked LLM, no Anthropic import needed) - 3 integration tests against Anthropic (CDP disk-full, Databricks OOM, Oracle listener) * docs: add docstrings to all agents, routers, connectors, and tests Adds missing function and method docstrings across the codebase — agents, API routers, connectors, implementations, and test helpers — to bring all modules up to the project's comment-everything standard. * docs: update README and API docs to reflect Agent 3 as implemented - Mark Agent 3 ✅ Implemented in agent section, API table, repo structure, and roadmap - Correct error_class values in aria_apis.md to match implementation (oom|cpu|disk|network|auth|db_lock|pipeline|unknown) - Update Agent 3 API status from 🔜 M4 to ✅ Implemented (S7) * feat(m7): add DOD test incidents, log fixtures, and cluster hosts file - Create scripts/create_dod_test_data.py — populates ServiceNow dev instance with 13 CMDB CIs, cluster member relationships, and 10 test incidents covering the full DOD test matrix (simple + edge cases) - Create data/cluster_hosts.json — CI name → IP lookup for Agent 2 ReAct loop (consumed when Agent 3 requests logs from a secondary service) - Add cdp_log_dirs() to core/config.py — reads cdp.log_dirs from conf.yaml, falling back to standard /var/log/hadoop-* paths - Update api/dependencies.py to call cfg.cdp_log_dirs() instead of hard-coding the log directory list - Gitignore data/dod_incident_mapping.json 
(script output, not source)

Log files for all 10 incidents live on the VPS at /home/brm/projects/Hadoop/var/log/ (simulated cluster, not committed). conf.yaml updated locally to point cdp.log_dirs at the Hadoop folder.

GitHub issue #29 opened for S8: implement ReAct loop trigger in Agent 3.

…gress (#31)

* feat(s8): ReAct loop trigger: Agent 3 requests cross-service logs, Agent 2 resolves and merges
  - ClassifierAgent: extends LLM prompt with optional log_request field; when the LLM identifies a cross-service root cause it sets pending_log_request instead of classifying, signalling the orchestrator to loop back to Agent 2
  - LogExtractorAgent: adds cluster_hosts injection and _run_for_log_request path; resolves the named CI via substring match against cluster_hosts, fetches logs from the resolved host, and merges new lines with existing log_result so Agent 3 sees full combined evidence on the next pass
  - Unit tests: loop trigger fires/doesn't fire (classifier); pending_log_request path, result merge, unknown CI graceful fallback (log extractor)
  - Integration tests: DOD-006 (oom), DOD-007 (disk), single-pass regression, budget exhaustion at _MAX_LOOP_ITERATIONS=5
  Closes #29
* style: black formatting fixes
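The loop-back and budget behaviour described above can be sketched in plain Python. This is not the LangGraph wiring used in the repository: `run_react_loop`, `classify`, and `fetch_logs` are hypothetical names, and only the budget of 5 iterations and the merge-then-reclassify flow are taken from the commit message.

```python
from typing import Callable, Optional

MAX_LOOP_ITERATIONS = 5  # mirrors _MAX_LOOP_ITERATIONS=5 from the commit message


def run_react_loop(
    classify: Callable[[list[str]], Optional[str]],
    fetch_logs: Callable[[str], list[str]],
    log_lines: list[str],
) -> tuple[list[str], int]:
    """Loop between a classifier and a log fetcher until done or out of budget.

    `classify` returns the name of a CI whose logs it wants next, or None
    once it is ready to classify. Both callables are stand-ins for Agent 3
    and Agent 2 respectively.
    """
    iterations = 0
    while iterations < MAX_LOOP_ITERATIONS:
        request = classify(log_lines)
        if request is None:
            break  # classifier is satisfied; no further log request pending
        # Merge new lines with existing evidence so the next pass sees both.
        log_lines = log_lines + fetch_logs(request)
        iterations += 1
    return log_lines, iterations
```

A classifier that never stops asking for logs exits after five fetches, which is the budget-exhaustion case the integration tests mention.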

* docs: update README and architecture docs for S8 ReAct loop trigger
* feat: wire ClaudeCodeLLMClient + add M7 validation report
  - Add implementations/llm/claude_code/llm_client.py: routes all LLM calls through the Claude Code CLI (claude -p) using subscription auth instead of a credit-based ANTHROPIC_API_KEY. Strips markdown code fences from CLI output, which the CLI sometimes adds despite system prompt instructions.
  - Rewire api/dependencies.py to use ClaudeCodeLLMClient for all agents (Agents 1, 2, 3, 4) in both production and dry-run modes.
  - Add documentation/reports/phase1_validation_test1_report.md: full M7 acceptance test report covering all 10 DOD incidents, AC-01–AC-06 assessment, findings, and next steps.
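The fence-stripping step mentioned above might look roughly like this. It is a minimal sketch, not the code in llm_client.py: `strip_code_fences` and the regular expression are assumptions, and the example only handles output that is wrapped in a single fence.

```python
import re

FENCE = "`" * 3  # three backticks, built indirectly so this example nests cleanly

# Matches text that is entirely wrapped in one fenced block, with an optional
# language tag after the opening fence.
_FENCE_RE = re.compile(
    rf"^\s*{FENCE}[\w-]*[ \t]*\n(.*?)\n?[ \t]*{FENCE}\s*$", re.DOTALL
)


def strip_code_fences(text: str) -> str:
    """Return the payload inside a wrapping code fence, or the text unchanged."""
    match = _FENCE_RE.match(text)
    return match.group(1).strip() if match else text.strip()
```

Running this before `json.loads` lets the caller treat fenced and unfenced CLI output the same way.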

… UC3 GCP native (#34)

Adds three UC testing cluster TF modules under infra/terraform/uc_testing/ as infrastructure for Phase 1.5 S4 testing wiring sprint.

UC1 (uc1-hadoop-onprem): 5 GCP VMs mimicking on-prem Hadoop cluster (cdp-master-01, cdp-data-01/02, cdp-utility-01, cdp-bus-01); SSH key stored in Secret Manager as aria-uc1-ssh-private-key.

UC2 (uc2-dataproc): Dataproc cluster aria-uc2-cluster (1 master + 2 workers, image 2.1-debian12); idle_delete_ttl=3600s; YARN log aggregation to GCS.
- Fixed: roles/logging.viewer added to ARIA GKE SA IAM binding, required for GCPLogConnector to read Cloud Logging.
- Fixed: PLATFORM_TAG output changed from "dataproc" to "gcp" to match ARIA PlatformTag enum.

UC3 (uc3-gcp-native): GCP project with BQ, GCS, Dataflow, Cloud Run.
- Fixed: pubsub.googleapis.com and cloudfunctions.googleapis.com APIs added.
- Fixed: roles/logging.viewer and roles/monitoring.viewer added for ARIA SA to support S6 GCPLogConnector + Cloud Monitoring integration.

Shared modules (shared/modules/): VPC, service account, secrets; referenced by UC1 via relative path ../shared/modules/vpc.

* fix(ci): restrict GITHUB_TOKEN to contents:read on all workflows Resolves 3 CodeQL security warnings (actions/missing-workflow-permissions). Both ci.yml and integration.yml only checkout code and run tests — they need no write permissions. Explicit read-only scope follows least-privilege principle and eliminates the default broad token if a workflow is compromised. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * docs: update README for Phase 1.5 — roadmap, tech stack, repo structure - Status badge updated to Phase 1.5 Hardening - Three operating phases: Phase 1 marked complete; Phase 1.5 bridge section added with 6-sprint overview table - Roadmap table: Phase 1 M7 marked done (local validation complete); Phase 1.5 sprints S1–S6 added; Phase 2/3 unchanged - Tech stack: Vertex AI (ADC auth, P1.5 S3) added to LLM row; GCP Secret Manager added to vault row - Plugin architecture diagram: vault row updated to list all implementations - Repo structure: deployment/monolithic/, infra/terraform/uc_testing/, tests/acceptance/, Dockerfile, vertex_ai LLM client, monitoring router added Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * chore: publish LICENSE (Apache 2.0) + update README badge and section - Remove LICENSE from .gitignore (was held back pending finalisation) - LICENSE file already contained the correct Apache 2.0 text - README badge: MIT → Apache 2.0 - README license section: replace placeholder text with link to LICENSE file Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * chore: add CONTRIBUTORS file — Bayrem JRIDI

* feat(observability): S1 structured logging — one event stream, three consumers Phase 1.5 S1. Replaces ad-hoc logging.getLogger across 18 files with a single canonical structured event stream (structlog), rendered for ops, machines, and monitoring/corpus reuse from one instrumentation pass. Core: - core/logging_config.py — dual sink: pretty console (ARIA_LOG_FORMAT toggle) + always-JSON rolling file (daily, 30-day retention). PII scrub, schema_version stamping, enum/datetime coercion. Wired through stdlib root so third-party logs share the sinks. Idempotent configure_logging(). - core/observability.py — frozen event vocabulary, run-context binding via contextvars (run_id/incident_number ambient on every event), log_agent_lifecycle decorator (agent_started/completed/failed + duration_ms), RunAccumulator, and build_run_record() (shared with S2 so logging and monitoring never diverge). - core/models.py — RunRecord + RunStatus (full S2 field set); run_id on PipelineState. Instrumentation: - Orchestrator emits pipeline_started/completed (full RunRecord), routing_decision, react_loop_iteration. - Agents 1–4 decorated; each emits one domain event (ci_resolved, log_query_completed, classification_completed, notification_sent). - Anthropic + Claude Code clients emit llm_call_completed (tokens where available). PII: incident free-text (description/long_description/raw_record/caller) redacted before any sink — folds in review finding #87's logging concern. Tests: 18 new unit tests (config processors, file-is-JSON, lifecycle decorator, accumulator, RunRecord assembly). Full gate green (black/isort/ruff/mypy); the one pre-existing test_missing_instance_raises failure is unrelated (#87, conf.yaml precedence) and passes in clean CI. 
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(observability): drop unused _configured global (CodeQL) CodeQL flagged the module-level _configured flag as an unused global (its write is only read on a subsequent call, which the single-function dataflow can't see). Replace the boolean guard with a check on whether our _aria_managed handlers are already attached to the root logger — removes the global and ties idempotency to real state (robust to module reloads and multiple entry points).
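The idempotency change described in the CodeQL fix (checking for already-attached `_aria_managed` handlers instead of a module-level flag) can be sketched with the standard library alone. The handler type and everything beyond the `_aria_managed` marker and the `configure_logging` name are assumptions; the real function also sets up structlog processors and a rolling file sink.

```python
import logging


def configure_logging() -> None:
    """Attach the managed handler to the root logger at most once.

    Idempotency is tied to real state: if a handler carrying the
    _aria_managed marker is already on the root logger, do nothing.
    """
    root = logging.getLogger()
    if any(getattr(h, "_aria_managed", False) for h in root.handlers):
        return
    handler = logging.StreamHandler()  # stand-in for the console/file sinks
    setattr(handler, "_aria_managed", True)
    root.addHandler(handler)


configure_logging()
configure_logging()  # second call is a no-op
```

Because the check inspects the root logger itself, it stays correct across module reloads and multiple entry points, which a module-level boolean would not.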

…ting mode scaffold (#89)

- Closes #42: RunRecord delta (RUNNING status, nullable end_time, confidence_band, dict round-trip helpers shared by SQLite + JSON API).
- Closes #43: RunStoreInterface + SQLiteRunStore (stdlib sqlite3, per-call connections, time/status/error_class filters, count()).
- Closes #44: RunStateStoreInterface + InMemoryRunStateStore.
- Closes #45: orchestrator writes one RunRecord per run (success, partial, failed incl. crash path) and tracks current_agent live.
- Closes #46: GET /api/v1/runs (+ /{run_id}, /{run_id}/status) with pagination, total count, and server-side filters.
- Closes #47: ARIA_OPERATING_MODE scaffold; inform implemented, hitm/autonomous raise NotImplementedError naming their phase.
- Closes #87: ServiceNow connector tests now pin core.config._raw to {}. A populated local conf.yaml (not just env leak) made test_missing_instance_raises non-deterministic; YAML parse failures in core/config now log a warning instead of being swallowed.

Tests: 4 new test files (unit run store / state store / operating mode + monitoring API integration via TestClient). httpx added to requirements for fastapi.testclient.
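A rough sketch of the ARIA_OPERATING_MODE scaffold from #47 follows. The function name `resolve_operating_mode`, the default of `inform` when the variable is unset, and the error wording are assumptions; the commit only states that inform is implemented and that hitm/autonomous raise NotImplementedError.

```python
import os
from typing import Optional


def resolve_operating_mode(value: Optional[str] = None) -> str:
    """Validate the operating mode, reading ARIA_OPERATING_MODE if not given.

    Hypothetical helper. Defaulting to "inform" is an assumption.
    """
    raw = value if value is not None else os.environ.get("ARIA_OPERATING_MODE", "inform")
    mode = raw.strip().lower()
    if mode == "inform":
        return mode
    if mode in {"hitm", "autonomous"}:
        # The real scaffold names the phase; that detail is not reproduced here.
        raise NotImplementedError(f"operating mode {mode!r} is not implemented yet")
    raise ValueError(f"unknown operating mode: {raw!r}")
```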

* feat(dashboard): P1.5 S2 — Alpine.js ops dashboard (run list, live view, detail, filters) Closes #48: /dashboard run list — status badges, confidence band, relative timestamps, 30s auto-refresh, pagination. Served by a small request-time-gated router (ARIA_DASHBOARD_ENABLED, off by default) instead of an import-time StaticFiles mount so the flag is testable and off means unreachable. Closes #49: run.html detail view — per-agent accordion, A1→A2→A3→A4 live step indicator polling /status at 1s and stopping on the 404 completion signal; history filters pass through to GET /api/v1/runs server-side. Closes #50: dashboard integration tests complete the S2 test suite (backend halves landed with #89). * fix(dashboard): resolve CodeQL py/path-injection alert Page name is now only ever a dict key into a server-defined path allowlist - the user-supplied string never enters a filesystem path expression. ---------
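The path-injection fix above (page name used only as a dict key into a server-defined allowlist) can be illustrated as follows. The directory `dashboard/static`, the `index` key, and the `resolve_page` name are assumptions; `run.html` is the only file name taken from the commit message.

```python
from pathlib import Path
from typing import Optional

STATIC_DIR = Path("dashboard/static")  # hypothetical location of the pages

# Server-defined allowlist: the only paths that can ever be served.
_PAGES = {
    "index": STATIC_DIR / "index.html",
    "run": STATIC_DIR / "run.html",
}


def resolve_page(page_name: str) -> Optional[Path]:
    """Look the page up by key; user input never becomes part of a path."""
    return _PAGES.get(page_name)
```

A request for `../../etc/passwd` simply misses the dictionary and yields `None`, which the router can translate into a 404.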

* feat(s3): Docker + config + LLM portability — P1.5 S3 (#51 #52 #53 #54 #55 #56 #57 #58 #84) - ARIA_CONFIG_PATH: config loader now reads conf.yaml from a configurable path (ARIA_CONFIG_PATH env var), enabling ConfigMap mounts in container deployments. - llm.provider: dynamic LLM client selection (anthropic | claude_code | vertex_ai) via conf.yaml or ARIA_LLM_PROVIDER. Defaults to 'anthropic' — removes ClaudeCodeLLMClient as the hardwired default, closing the tool-exfiltration risk identified in #84. - VertexAILLMClient: new LLMClientInterface for GCP Vertex AI (ADC auth, no API key). Routes to AnthropicVertex for Claude-on-Vertex models and to the Gemini SDK for Gemini models. - GCPSecretManagerVault: new VaultInterface backed by GCP Secret Manager (ADC auth). vault_backend config key selects between env/gcp/hashicorp/aws/azure. - Dockerfile + .dockerignore: python:3.11-slim, non-root aria user (uid 1000), curl health check, uvicorn entrypoint. conf.yaml excluded from image — always mounted at runtime via ARIA_CONFIG_PATH. - deployment/monolithic/: docker-compose.yml with bind-mount config pattern and named log volume; conf.yaml.example for the monolithic deployment. - deployment/README.md: four deployment patterns (Docker CLI, compose, Cloud Run, GKE ConfigMap) with LLM provider and vault backend selection tables. - CI: docker-smoke job builds the image and hits /api/v1/health on every PR and push to main. - 41 new unit tests; 294 total passing. make lint clean. 
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * docs: add installation guide (Docker + K8s) and README deployment section - documentation/guides/installation.md: full installation guide covering Docker (local/VM), docker-compose, and Kubernetes paths; conf.yaml prep, LLM provider selection, vault backend options - documentation/index.md: link to new installation guide - README.md: add Deployment section with Docker quickstart, K8s outline, and pointer to the full guide Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix(security): eliminate dynamic SQL construction in SQLiteRunStore Replace _build_filters() + f-string SQL with static query templates using the `? IS NULL OR column = ?` pattern. SQL strings are now module-level constants; user input (HTTP query params) flows only into the parameter tuple and never into the query string. Closes CodeQL alerts #5 and #6. * fix(tests): resolve code-quality bot flags on PR #91 - test_gcp_secret_manager_vault: replace mixed import/import-from for gcp_secret_manager module with consistent from-import + reload via sys.modules[__module__] - test_vertex_llm_client: remove unused PermissionDenied class stub from test_permission_denied_raises_llm_auth_error (side_effect already uses LLMAuthError directly)
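The `? IS NULL OR column = ?` pattern from the SQLiteRunStore fix can be demonstrated with stdlib sqlite3. The table layout, column set, and the `list_runs` helper below are hypothetical; only the pattern itself (a constant query string, with filters passed purely as parameters) comes from the commit message.

```python
import sqlite3
from typing import Optional

# Module-level constant: the query text never changes with user input.
_SELECT_RUNS = """
SELECT run_id, status, error_class
FROM runs
WHERE (? IS NULL OR status = ?)
  AND (? IS NULL OR error_class = ?)
ORDER BY run_id
"""


def list_runs(
    conn: sqlite3.Connection,
    status: Optional[str] = None,
    error_class: Optional[str] = None,
) -> list:
    """Each optional filter is bound twice: once for the NULL test, once for the match."""
    params = (status, status, error_class, error_class)
    return conn.execute(_SELECT_RUNS, params).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE runs (run_id TEXT, status TEXT, error_class TEXT)")
conn.executemany(
    "INSERT INTO runs VALUES (?, ?, ?)",
    [("r1", "success", "disk"), ("r2", "failed", "oom"), ("r3", "success", "oom")],
)
```

Passing `None` for a filter makes its clause true for every row, so one static statement covers all filter combinations.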

…ty nodes Cherry-picked content from PR #82 (Tobi-Adesoye, commit 793c500). Only the three runbook files are brought in — the test regression from that commit (7 deleted KB unit tests) and the orphaned scripts/uc1_parser.py are intentionally excluded. These runbooks will be validated against actual TF log paths as part of #60. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…rectness

- Closes #59: cluster_hosts.json restructured for UC1 TF node names (cdp-master-01, cdp-data-01/02, cdp-utility-01, cdp-bus-01). IPs placeholder until TF apply.
- Closes #60: UC1 KB runbooks validated and enriched against TF log paths. _KEYWORD_RE extended with Kafka, ZooKeeper, NiFi, AuthenticationException, DiskOutOfSpaceException, GC overhead. test_file_kb.py: restored original 8 tests, updated fixture count (2→8), added TestUC1RunbookAcceptance (3 tests).
- Closes #61: cdp_ssh_key_secret() config option added (core/config.py); SSH key vault key now configurable via conf.yaml cdp.ssh_key_secret, default CDP_SSH_KEY unchanged. api/dependencies.py wired to cfg.cdp_ssh_key_secret(). conf_template.yaml annotated with UC1 TF secret alignment guidance and full TF log dir paths.
- Closes #62: GCPLogConnector: resource_types param adds resource.type OR-clause and cluster_name host label alias for Dataproc. api/dependencies.py sets ['cloud_dataproc_cluster', 'cloud_dataproc_job'] for UC2. 3 new filter tests.
- Closes #63: UC2 Dataproc KB runbooks (dataproc_cluster.md, dataproc_job.md) added. TestUC2RunbookAcceptance (2 tests).
- Closes #64: gcp_native.md added as UC3 graceful degradation marker. Agent 2 returns LOW confidence / empty logs for native GCP services; Agent 4 notifies with gap message.
- Closes #85: _validate_log_paths() in log_extractor.py: drops LLM-planned paths outside /var/log/ before passing to connectors. 4 new unit tests.
- Closes #83: ClassificationError caught in _agent3_node (pipeline.py): Agent 4 now always runs, notify-only guarantee preserved. ClassifierAgent adds 1 retry + 1s sleep before raising. 1 retry test + 1 pipeline resilience test added.

309 unit tests green (was 294, +15 new).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
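The path guard from #85 could be sketched as below. This is not the repository's `_validate_log_paths()`: the public name `validate_log_paths` and the normalisation of `..` segments before the prefix check are assumptions added for the example. Only the rule itself (drop planned paths outside /var/log/) comes from the commit message.

```python
import posixpath

_ALLOWED_ROOT = "/var/log"


def validate_log_paths(paths: list[str]) -> list[str]:
    """Keep only paths that resolve to /var/log or somewhere beneath it.

    Normalising first is an assumption of this sketch; it stops
    "/var/log/../../etc/passwd" from passing a naive prefix check.
    """
    kept = []
    for raw in paths:
        normalised = posixpath.normpath(raw)
        if normalised == _ALLOWED_ROOT or normalised.startswith(_ALLOWED_ROOT + "/"):
            kept.append(normalised)
    return kept
```

Comparing against `"/var/log/"` with the trailing slash, rather than `"/var/log"`, also rejects look-alike siblings such as `/var/logs`.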

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Both dataproc_cluster.md and dataproc_job.md scored equally for "cluster" queries due to "cluster_name" appearing in the Log Paths section of dataproc_job.md. This caused non-deterministic test failures in CI where the wrong runbook was returned for cluster-level incidents (YARN missing). Remove the token by replacing the multi-line filter block with a single sentence that doesn't contain "cluster", making dataproc_cluster.md the unique winner for cluster-targeted queries. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Previous fix replaced cluster_name label text but left two more cluster occurrences: the word literal in "UC2 cluster:" and the hyphen-split token from "aria-uc2-cluster". Since _tokenize uses re.findall(r"\w+") — hyphens split but underscores don't — aria-uc2-cluster tokenises to ["aria","uc2","cluster"], still tying with dataproc_cluster.md for "dataproc-cluster gcp" queries. Replace both with "UC2 job runner: aria-uc2-dataproc" so dataproc_job.md scores 0.5 and dataproc_cluster.md scores 0.75 for cluster queries. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
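The tokenisation behaviour behind this fix is easy to reproduce. The sketch below uses the same `re.findall(r"\w+")` call named in the commit message; the function name `tokenize` and the absence of any lowercasing or scoring are simplifications of this example, not a description of the real `_tokenize`.

```python
import re


def tokenize(text: str) -> list[str]:
    """Split on anything that is not a word character.

    Hyphens are not word characters, so they split tokens;
    underscores are, so they do not.
    """
    return re.findall(r"\w+", text)


hyphenated = tokenize("aria-uc2-cluster")    # hyphens split into three tokens
underscored = tokenize("cluster_name")       # stays one token
replacement = tokenize("aria-uc2-dataproc")  # no "cluster" token remains
```

This is why renaming the example to `aria-uc2-dataproc` removes the stray `cluster` token from dataproc_job.md.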

…structure

Two architectural fixes and two supporting additions in one commit:

1. Split knowledge_base fixtures into resource_kb/ (Agent 2) and analyser_kb/ (Agent 3). Eliminates the design confusion that put failure vocabulary in Agent 2's resource catalog, which was the root cause of the S4 CI score-tie failures.
2. Consolidate from 8 per-component files to 3 per-cluster files in resource_kb. Each file describes a cluster's physical/logical resources and log paths, with no error keywords and no failure descriptions. The cdp_cluster.md covers all 5 UC1 nodes in one file; aria_uc2_cluster.md covers Dataproc logical resources.
3. Add analyser_kb/ with 5 labeled log excerpts (OOM, disk, auth, YARN safe mode, OK baseline) injected into Agent 3's prompt as few-shot examples. These files double as a training corpus for the future fine-tuned Agent 3 model.
4. ClassifierAgent gains analyser_kb_dir param + _load_analyser_kb() loader. cfg.analyser_kb_dir() reads knowledge_base.analyser_kb_dir / ARIA_ANALYSER_KB_DIR. get_agent3() passes the configured dir at construction.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

GCP billing blocked (OR_BACR2_44); Azure $200 credit available. Ports all 3 UC testing environments to Azure so S4 smoke test can proceed independently while GCP billing resolves.

Changes:
- Restructure infra/terraform/uc_testing/: GCP configs moved to gcp/ subfolder (pure git rename, no content changes)
- New azure/uc1-hadoop-onprem/: 5 Standard_B2ms VMs + VNet + NSG simulating CDP Hadoop cluster; cloud-init installs Hadoop 3.3.6 and creates /var/log/hadoop/* structure; ~$24 for 2 days
- New azure/uc2-hdinsight/: HDInsight Spark cluster (aria-uc2-cluster) + Log Analytics workspace; Diagnostic Settings route Syslog to workspace for AzureLogConnector; ~$3 for 3h window
- New azure/uc3-azure-native/: Event Hubs + Synapse + Storage; references UC2 Log Analytics workspace; no logs seeded (LOW confidence expected); ~$2 for 2 days
- api/dependencies.py: wire AzureLogConnector into Agent 2 registry (PlatformTag.AZURE → AzureLogConnector) and wire azure vault backend in _get_vault(); both implementations already existed (ARI-52)

Closes #93
Closes #94
Closes #95
Closes #96
Closes #97

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…or Azure - AzureLogConnector is now implemented and wired (not a stub) — update Agent 2 connector list and implementations/clusters/cloud/azure/ entry - infra/terraform/uc_testing/ now has gcp/ and azure/ subfolders — update repository structure section - S4 roadmap entry updated to reflect Azure UC equivalents addition Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Agent 1 stubbed; Agents 2→3→4 ran live against Azure VM (cdp-master-01, REDACTED-IP). SSH log extraction returned 1 DISK_FAILURE line; Agent 3 classified as disk/HIGH (0.93); Slack notification delivered.

- scripts/smoke_uc1.py: UC1 smoke test script (EnvVarVault, Agent 1 stub)
- infra/terraform/uc_testing/azure/uc1-hadoop-onprem/main.tf: Standard_D2s_v3, skip_provider_registration (azurerm 3.x compat)
- documentation/reports/s4_uc1_smoke_test_2026-06-17.md: full results + issues log
- README.md: S4 smoke results table, S4 roadmap marked done

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

bjridicodes and others added 23 commits May 21, 2026 15:38

docs: update conf_template with cdp.log_dirs option, set M7 to in progress (#31)

51a0637

docs(readme): S4 status in progress

301e334

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

github-code-quality Bot found potential problems Jun 18, 2026


Comment thread implementations/coms/google_chat/connector.py Fixed

Merge branch 'main' into feat/s4-testing-infra

5a946f3

bayrem approved these changes Jun 18, 2026


bayrem merged commit 48f0b76 into main Jun 18, 2026
7 checks passed

bayrem deleted the feat/s4-testing-infra branch June 18, 2026 08:17

bayrem merged 24 commits into main from feat/s4-testing-infra
