feat(s4): UC1 smoke test PASS — close #65 #100

Merged
bayrem merged 24 commits into main from feat/s4-testing-infra on Jun 18, 2026

@bjridicodes

Contributor

Summary

  • UC1 smoke test ran live against Azure CDP cluster (cdp-master-01)
  • Agent 1 stubbed; Agents 2 → 3 → 4 executed against real infrastructure
  • Result: PASS — 1 log line retrieved, classified disk/HIGH (0.93), Slack notified
  • UC2 and UC3 deferred (GCP billing blocked; Azure HDInsight not deployed yet)

Changes

  • scripts/smoke_uc1.py — smoke test script (EnvVarVault, Agent 1 stub, live SSH)
  • infra/terraform/uc_testing/azure/uc1-hadoop-onprem/main.tf — VM type Standard_D2s_v3 (B2ms unavailable in West Europe), skip_provider_registration = true for azurerm 3.x compat
  • documentation/reports/s4_uc1_smoke_test_2026-06-17.md — full results, infra detail, issues encountered
  • README.md — S4 smoke test results table; S4 roadmap marked ✅ Done (UC1 passed; UC2/UC3 deferred)

Smoke test output

log_lines:         1        (DISK_FAILURE WARN on /var/log/hadoop/hdfs/)
root_cause:        disk
confidence:        HIGH (0.93)
notification_sent: True
PASS

Deferred

  • UC2 (Managed Spark / HDInsight) and UC3 (GCP native) blocked on GCP billing resolution
  • CDP_SSH_USER in Infisical (aria-cdp) needs updating to aria before S5 full run

Test plan

  • make lint passes (all 4 tools: black, isort, ruff, mypy)
  • Smoke test passed on live Azure infrastructure
  • No secrets committed (all via Infisical / env vars)

Closes #65

🤖 Generated with Claude Code

bjridicodes and others added 23 commits May 21, 2026 15:38
* ci: add ruff and mypy to lint job; add tool config to pyproject.toml

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: black formatting + update SNOW error messages to mention both config paths

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: isort import ordering in 5 files

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* tooling: add pre-commit config and Makefile for local lint/format

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: resolve all ruff violations (E402, E741, UP035, UP045, F401)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: reformat 3 files with black after manual edits

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: resolve all mypy errors across source files and tests

- Add type guards (assert not None) in log_extractor and agent1 router
- Narrow str | None → str with `or ""` in incident_reader
- Add missing type annotations (Any params, return types) in connectors
- Suppress third-party library type noise with targeted type: ignore comments
- Exclude tests from disallow_untyped_defs in mypy config

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* tooling: point Makefile to project .venv instead of system Python

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* feat(s6): LangGraph orchestration + ReAct loop scaffold (M6)

- Add PipelineState.pending_log_request and loop_iterations fields for
  the ReAct loop (Agent 3 ↔ Agent 2); stub Agent 3 never fires it
- New ClassifierAgent stub (core/agents/classifier.py) — always returns
  error_class="unknown"/LOW; M4 drops in a real LLM-based implementation
- New ARIAPipeline (core/orchestrator/pipeline.py) — LangGraph StateGraph
  wiring A1 → A2 → A3 → A4; conditional edge routes errors to A4 directly;
  ReAct loop capped at 5 iterations
- InMemoryCommunicator for dry-run and unit testing (no Slack token needed)
- dry_run() config accessor; ARIA_DRY_RUN=true injects all in-memory stubs
- Pipeline REST API: POST /api/v1/pipeline/run + GET /api/v1/pipeline/health
- 11 new tests (7 unit, 4 integration); total 201 passing
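
The routing rules in this commit message (errors go straight to Agent 4, and the Agent 3 ↔ Agent 2 loop is bounded at 5 iterations) can be sketched in plain Python. This is only an illustration of the decision logic, not the LangGraph wiring itself: `route_after_agent3`, the `error` field, and the node names are hypothetical; only `pending_log_request` and `loop_iterations` are named in the commit.

```python
from dataclasses import dataclass
from typing import Optional

MAX_LOOP_ITERATIONS = 5  # loop bound stated in the commit message


@dataclass
class PipelineState:
    # The two ReAct fields come from the commit; `error` is an assumed field.
    pending_log_request: Optional[str] = None
    loop_iterations: int = 0
    error: Optional[str] = None


def route_after_agent3(state: PipelineState) -> str:
    """Return the next node name (node names are hypothetical)."""
    if state.error is not None:
        return "agent4"  # errors skip the loop and go straight to notification
    if state.pending_log_request and state.loop_iterations < MAX_LOOP_ITERATIONS:
        return "agent2"  # loop back to fetch the requested logs
    return "agent4"  # nothing pending, or loop budget exhausted
```

In a real graph a function like this would typically be registered as a conditional edge; here it is kept free of any framework import so the logic can be tested on its own.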

* docs: update README to reflect M6 completion

Mark M6 orchestration as done, update Agent 0 and Agent 3 status,
add pipeline router to directory tree, mark pipeline endpoint live.
…-check (#22)

* ci: add ruff and mypy to lint job; add tool config to pyproject.toml

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: black formatting + update SNOW error messages to mention both config paths

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: isort import ordering in 5 files

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* tooling: add pre-commit config and Makefile for local lint/format

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: resolve all ruff violations (E402, E741, UP035, UP045, F401)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: reformat 3 files with black after manual edits

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: resolve all mypy errors across source files and tests

- Add type guards (assert not None) in log_extractor and agent1 router
- Narrow str | None → str with `or ""` in incident_reader
- Add missing type annotations (Any params, return types) in connectors
- Suppress third-party library type noise with targeted type: ignore comments
- Exclude tests from disallow_untyped_defs in mypy config

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* tooling: point Makefile to project .venv instead of system Python

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: require host_key_secret for SSHLogConnector — resolves CodeQL alert

WarningPolicy accepted unknown SSH host keys silently, enabling MITM.
Now raises ValueError when host_key_secret is not configured so the
connector fails clearly rather than connecting insecurely.

Tests updated to mock _load_known_host_key and pass host_key_secret in
_make_connector defaults.
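
A minimal sketch of the fail-closed guard described above, assuming the check happens at construction time (the commit message does not say where the `ValueError` is raised). `SSHLogConnectorSketch` is a hypothetical stand-in, not the real `SSHLogConnector`, and no SSH library is used.

```python
from typing import Optional


class SSHLogConnectorSketch:
    """Hypothetical stand-in for SSHLogConnector; only the guard is shown."""

    def __init__(self, host: str, host_key_secret: Optional[str] = None) -> None:
        if not host_key_secret:
            # Fail loudly instead of silently trusting an unknown host key.
            raise ValueError(
                f"host_key_secret is required for {host}; "
                "refusing to connect without a pinned host key"
            )
        self.host = host
        self.host_key_secret = host_key_secret
```

Raising early keeps the insecure path unreachable: there is no code branch left in which a connection is attempted without a known host key.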

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* feat(s7): implement Agent 3 — LLM-based error classifier with REST endpoint

Closes #2 (ARI-18), #3 (ARI-19), #5 (ARI-20), #4 (ARI-21), #19 (ARI-63)

- Replace M6 stub in ClassifierAgent with real LLM call via LLMClientInterface
- JSON prompt → parse → confidence band derivation (≥0.7=HIGH, ≥0.5=MED, <0.5=LOW)
- ClassificationError raised on LLM or parse failure; pipeline top-level catch handles it
- Stub fallback preserved when no LLM client injected (dry-run compatibility)
- POST /api/v1/agent3/run + GET /api/v1/agent3/health endpoints
- 11 unit tests (mocked LLM, no Anthropic import needed)
- 3 integration tests against Anthropic (CDP disk-full, Databricks OOM, Oracle listener)
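
The confidence band thresholds listed above (≥0.7=HIGH, ≥0.5=MED, <0.5=LOW) map to a small function. The name `confidence_band` and the string return values are assumptions for illustration; the actual implementation may use an enum.

```python
def confidence_band(score: float) -> str:
    """Map a raw confidence score to the band named in the commit message."""
    if score >= 0.7:
        return "HIGH"
    if score >= 0.5:
        return "MED"
    return "LOW"
```

With these thresholds, the 0.93 score reported in the UC1 smoke test falls in the HIGH band.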

* docs: add docstrings to all agents, routers, connectors, and tests

Adds missing function and method docstrings across the codebase — agents,
API routers, connectors, implementations, and test helpers — to bring all
modules up to the project's comment-everything standard.

* docs: update README and API docs to reflect Agent 3 as implemented

- Mark Agent 3 ✅ Implemented in agent section, API table, repo structure, and roadmap
- Correct error_class values in aria_apis.md to match implementation (oom|cpu|disk|network|auth|db_lock|pipeline|unknown)
- Update Agent 3 API status from 🔜 M4 to ✅ Implemented (S7)
…#30)

* feat(s7): implement Agent 3 — LLM-based error classifier with REST endpoint

Closes #2 (ARI-18), #3 (ARI-19), #5 (ARI-20), #4 (ARI-21), #19 (ARI-63)

- Replace M6 stub in ClassifierAgent with real LLM call via LLMClientInterface
- JSON prompt → parse → confidence band derivation (≥0.7=HIGH, ≥0.5=MED, <0.5=LOW)
- ClassificationError raised on LLM or parse failure; pipeline top-level catch handles it
- Stub fallback preserved when no LLM client injected (dry-run compatibility)
- POST /api/v1/agent3/run + GET /api/v1/agent3/health endpoints
- 11 unit tests (mocked LLM, no Anthropic import needed)
- 3 integration tests against Anthropic (CDP disk-full, Databricks OOM, Oracle listener)

* docs: add docstrings to all agents, routers, connectors, and tests

Adds missing function and method docstrings across the codebase — agents,
API routers, connectors, implementations, and test helpers — to bring all
modules up to the project's comment-everything standard.

* docs: update README and API docs to reflect Agent 3 as implemented

- Mark Agent 3 ✅ Implemented in agent section, API table, repo structure, and roadmap
- Correct error_class values in aria_apis.md to match implementation (oom|cpu|disk|network|auth|db_lock|pipeline|unknown)
- Update Agent 3 API status from 🔜 M4 to ✅ Implemented (S7)

* feat(m7): add DOD test incidents, log fixtures, and cluster hosts file

- Create scripts/create_dod_test_data.py — populates ServiceNow dev
  instance with 13 CMDB CIs, cluster member relationships, and 10 test
  incidents covering the full DOD test matrix (simple + edge cases)
- Create data/cluster_hosts.json — CI name → IP lookup for Agent 2 ReAct
  loop (consumed when Agent 3 requests logs from a secondary service)
- Add cdp_log_dirs() to core/config.py — reads cdp.log_dirs from conf.yaml,
  falling back to standard /var/log/hadoop-* paths
- Update api/dependencies.py to call cfg.cdp_log_dirs() instead of
  hard-coding the log directory list
- Gitignore data/dod_incident_mapping.json (script output, not source)

Log files for all 10 incidents live on the VPS at
/home/brm/projects/Hadoop/var/log/ (simulated cluster, not committed).
conf.yaml updated locally to point cdp.log_dirs at the Hadoop folder.

GitHub issue #29 opened for S8: implement ReAct loop trigger in Agent 3.
* feat(s8): ReAct loop trigger — Agent 3 requests cross-service logs, Agent 2 resolves and merges

- ClassifierAgent: extends LLM prompt with optional log_request field; when the LLM
  identifies a cross-service root cause it sets pending_log_request instead of
  classifying, signalling the orchestrator to loop back to Agent 2
- LogExtractorAgent: adds cluster_hosts injection and _run_for_log_request path;
  resolves the named CI via substring match against cluster_hosts, fetches logs
  from the resolved host, and merges new lines with existing log_result so Agent 3
  sees full combined evidence on the next pass
- Unit tests: loop trigger fires/doesn't fire (classifier); pending_log_request
  path, result merge, unknown CI graceful fallback (log extractor)
- Integration tests: DOD-006 (oom), DOD-007 (disk), single-pass regression,
  budget exhaustion at _MAX_LOOP_ITERATIONS=5
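
The two LogExtractorAgent steps described above, substring resolution of a CI name and merging of new log lines, might look roughly like this. Everything here is a sketch under assumptions: the function names are hypothetical, the match direction (requested name contained in the host name, case-insensitive) is a guess, and dropping exact duplicate lines during the merge is not stated in the commit.

```python
from typing import Optional


def resolve_host(ci_name: str, cluster_hosts: dict[str, str]) -> Optional[str]:
    """Return the address of the first host whose name contains ci_name."""
    needle = ci_name.lower()
    for name, address in cluster_hosts.items():
        if needle in name.lower():
            return address
    return None  # unknown CI: the caller falls back gracefully


def merge_log_lines(existing: list[str], new: list[str]) -> list[str]:
    """Append new lines to existing ones, keeping order and skipping exact repeats."""
    seen = set(existing)
    merged = list(existing)
    for line in new:
        if line not in seen:
            merged.append(line)
            seen.add(line)
    return merged
```

Returning `None` for an unresolved CI mirrors the "unknown CI graceful fallback" case covered by the unit tests listed above.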

Closes #29

* style: black formatting fixes
* docs: update README and architecture docs for S8 ReAct loop trigger

* feat: wire ClaudeCodeLLMClient + add M7 validation report

- Add implementations/llm/claude_code/llm_client.py — routes all LLM
  calls through the Claude Code CLI (claude -p) using subscription auth
  instead of a credit-based ANTHROPIC_API_KEY. Strips markdown code
  fences from CLI output, which the CLI sometimes adds despite system
  prompt instructions.
- Rewire api/dependencies.py to use ClaudeCodeLLMClient for all agents
  (Agents 1, 2, 3, 4) in both production and dry-run modes.
- Add documentation/reports/phase1_validation_test1_report.md — full
  M7 acceptance test report covering all 10 DOD incidents, AC-01–AC-06
  assessment, findings, and next steps.
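
Stripping markdown code fences from CLI output, as mentioned for `ClaudeCodeLLMClient`, can be sketched with a single regular expression. `strip_code_fences` is a hypothetical helper, and it assumes the whole response is wrapped in one fenced block; the real client may handle more cases.

```python
import re

FENCE = "`" * 3  # a literal markdown fence, built this way to keep the listing readable

# Matches: opening fence with optional language tag, body, closing fence.
_FENCE_RE = re.compile(
    rf"^\s*{FENCE}[\w-]*[ \t]*\n(.*?)\n?[ \t]*{FENCE}\s*$",
    re.DOTALL,
)


def strip_code_fences(text: str) -> str:
    """Return the fenced body if the text is one fenced block, else the trimmed text."""
    match = _FENCE_RE.match(text)
    return match.group(1).strip() if match else text.strip()
```

Unfenced responses pass through unchanged apart from whitespace trimming, so the helper is safe to apply to every response before JSON parsing.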
… UC3 GCP native (#34)

Adds three UC testing cluster TF modules under infra/terraform/uc_testing/
as infrastructure for Phase 1.5 S4 testing wiring sprint.

UC1 (uc1-hadoop-onprem): 5 GCP VMs mimicking on-prem Hadoop cluster
(cdp-master-01, cdp-data-01/02, cdp-utility-01, cdp-bus-01); SSH key
stored in Secret Manager as aria-uc1-ssh-private-key.

UC2 (uc2-dataproc): Dataproc cluster aria-uc2-cluster (1 master + 2 workers,
image 2.1-debian12); idle_delete_ttl=3600s; YARN log aggregation to GCS.
Fixed: roles/logging.viewer added to ARIA GKE SA IAM binding — required
for GCPLogConnector to read Cloud Logging. Fixed: PLATFORM_TAG output
changed from "dataproc" to "gcp" to match ARIA PlatformTag enum.

UC3 (uc3-gcp-native): GCP project with BQ, GCS, Dataflow, Cloud Run.
Fixed: pubsub.googleapis.com and cloudfunctions.googleapis.com APIs added.
Fixed: roles/logging.viewer and roles/monitoring.viewer added for ARIA SA
to support S6 GCPLogConnector + Cloud Monitoring integration.

Shared modules (shared/modules/): VPC, service account, secrets — referenced
by UC1 via relative path ../shared/modules/vpc.
* fix(ci): restrict GITHUB_TOKEN to contents:read on all workflows

Resolves 3 CodeQL security warnings (actions/missing-workflow-permissions).
Both ci.yml and integration.yml only checkout code and run tests — they
need no write permissions. An explicit read-only scope follows the least-privilege
principle and removes the broad default token permissions that could be abused if a
workflow is compromised.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* docs: update README for Phase 1.5 — roadmap, tech stack, repo structure

- Status badge updated to Phase 1.5 Hardening
- Three operating phases: Phase 1 marked complete; Phase 1.5 bridge section
  added with 6-sprint overview table
- Roadmap table: Phase 1 M7 marked done (local validation complete); Phase 1.5
  sprints S1–S6 added; Phase 2/3 unchanged
- Tech stack: Vertex AI (ADC auth, P1.5 S3) added to LLM row;
  GCP Secret Manager added to vault row
- Plugin architecture diagram: vault row updated to list all implementations
- Repo structure: deployment/monolithic/, infra/terraform/uc_testing/,
  tests/acceptance/, Dockerfile, vertex_ai LLM client, monitoring router added

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* chore: publish LICENSE (Apache 2.0) + update README badge and section

- Remove LICENSE from .gitignore (was held back pending finalisation)
- LICENSE file already contained the correct Apache 2.0 text
- README badge: MIT → Apache 2.0
- README license section: replace placeholder text with link to LICENSE file

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* chore: add CONTRIBUTORS file — Bayrem JRIDI
* feat(observability): S1 structured logging — one event stream, three consumers

Phase 1.5 S1. Replaces ad-hoc logging.getLogger across 18 files with a single
canonical structured event stream (structlog), rendered for ops, machines, and
monitoring/corpus reuse from one instrumentation pass.

Core:
- core/logging_config.py — dual sink: pretty console (ARIA_LOG_FORMAT toggle) +
  always-JSON rolling file (daily, 30-day retention). PII scrub, schema_version
  stamping, enum/datetime coercion. Wired through stdlib root so third-party logs
  share the sinks. Idempotent configure_logging().
- core/observability.py — frozen event vocabulary, run-context binding via
  contextvars (run_id/incident_number ambient on every event), log_agent_lifecycle
  decorator (agent_started/completed/failed + duration_ms), RunAccumulator, and
  build_run_record() (shared with S2 so logging and monitoring never diverge).
- core/models.py — RunRecord + RunStatus (full S2 field set); run_id on PipelineState.
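
The combination of ambient run context via `contextvars` and a lifecycle decorator can be illustrated with the standard library alone. This is a sketch, not the project's `core/observability.py`: the in-memory `EVENTS` list stands in for the structlog sinks, and `emit` and `run_id_var` are hypothetical names. The event names and `duration_ms` field follow the commit message.

```python
import contextvars
import functools
import time
from typing import Any, Callable

run_id_var: contextvars.ContextVar[str] = contextvars.ContextVar("run_id", default="-")
EVENTS: list[dict[str, Any]] = []  # stand-in sink for this sketch


def emit(event: str, **fields: Any) -> None:
    """Record an event with the ambient run_id attached."""
    EVENTS.append({"event": event, "run_id": run_id_var.get(), **fields})


def log_agent_lifecycle(agent: str) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Emit agent_started, then agent_completed or agent_failed with duration_ms."""

    def decorator(fn: Callable[..., Any]) -> Callable[..., Any]:
        @functools.wraps(fn)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            emit("agent_started", agent=agent)
            start = time.perf_counter()
            try:
                result = fn(*args, **kwargs)
            except Exception as exc:
                elapsed = round((time.perf_counter() - start) * 1000, 2)
                emit("agent_failed", agent=agent, error=type(exc).__name__, duration_ms=elapsed)
                raise
            elapsed = round((time.perf_counter() - start) * 1000, 2)
            emit("agent_completed", agent=agent, duration_ms=elapsed)
            return result

        return wrapper

    return decorator
```

Because the run ID lives in a context variable, agents never need to pass it explicitly; setting it once at pipeline start is enough for every event emitted during that run.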

Instrumentation:
- Orchestrator emits pipeline_started/completed (full RunRecord), routing_decision,
  react_loop_iteration.
- Agents 1–4 decorated; each emits one domain event (ci_resolved,
  log_query_completed, classification_completed, notification_sent).
- Anthropic + Claude Code clients emit llm_call_completed (tokens where available).

PII: incident free-text (description/long_description/raw_record/caller) redacted
before any sink — folds in review finding #87's logging concern.

Tests: 18 new unit tests (config processors, file-is-JSON, lifecycle decorator,
accumulator, RunRecord assembly). Full gate green (black/isort/ruff/mypy); the one
pre-existing test_missing_instance_raises failure is unrelated (#87, conf.yaml
precedence) and passes in clean CI.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* fix(observability): drop unused _configured global (CodeQL)

CodeQL flagged the module-level _configured flag as an unused global (its write
is only read on a subsequent call, which the single-function dataflow can't see).
Replace the boolean guard with a check on whether our _aria_managed handlers are
already attached to the root logger — removes the global and ties idempotency to
real state (robust to module reloads and multiple entry points).
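
The handler-based idempotency check described above can be sketched as follows. Only the `_aria_managed` marker comes from the commit message; the choice of a `StreamHandler` and the function body are assumptions.

```python
import logging


def configure_logging() -> None:
    """Attach the managed handler once; later calls are no-ops."""
    root = logging.getLogger()
    if any(getattr(handler, "_aria_managed", False) for handler in root.handlers):
        return  # already configured: state is read from the logger, not a global flag
    handler = logging.StreamHandler()
    setattr(handler, "_aria_managed", True)  # marker used by the check above
    root.addHandler(handler)
```

Since the check inspects the root logger itself, it stays correct even if the module is reloaded, which a module-level boolean would not.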
…ting mode scaffold (#89)

Closes #42: RunRecord delta — RUNNING status, nullable end_time,
confidence_band, dict round-trip helpers shared by SQLite + JSON API.
Closes #43: RunStoreInterface + SQLiteRunStore (stdlib sqlite3,
per-call connections, time/status/error_class filters, count()).
Closes #44: RunStateStoreInterface + InMemoryRunStateStore.
Closes #45: orchestrator writes one RunRecord per run (success,
partial, failed incl. crash path) and tracks current_agent live.
Closes #46: GET /api/v1/runs (+ /{run_id}, /{run_id}/status) with
pagination, total count, and server-side filters.
Closes #47: ARIA_OPERATING_MODE scaffold — inform implemented,
hitm/autonomous raise NotImplementedError naming their phase.
Closes #87: ServiceNow connector tests now pin core.config._raw to {}
— a populated local conf.yaml (not just env leak) made
test_missing_instance_raises non-deterministic; YAML parse failures
in core/config now log a warning instead of being swallowed.

Tests: 4 new test files (unit run store / state store / operating
mode + monitoring API integration via TestClient). httpx added to
requirements for fastapi.testclient.
* feat(dashboard): P1.5 S2 — Alpine.js ops dashboard (run list, live view, detail, filters)

Closes #48: /dashboard run list — status badges, confidence band,
relative timestamps, 30s auto-refresh, pagination. Served by a small
request-time-gated router (ARIA_DASHBOARD_ENABLED, off by default)
instead of an import-time StaticFiles mount so the flag is testable
and off means unreachable.
Closes #49: run.html detail view — per-agent accordion, A1→A2→A3→A4
live step indicator polling /status at 1s and stopping on the 404
completion signal; history filters pass through to GET /api/v1/runs
server-side.
Closes #50: dashboard integration tests complete the S2 test suite
(backend halves landed with #89).

* fix(dashboard): resolve CodeQL py/path-injection alert

Page name is now only ever a dict key into a server-defined path
allowlist - the user-supplied string never enters a filesystem path
expression.
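
The allowlist approach can be sketched as a plain dictionary lookup. The directory, the `index` entry, and the function name are hypothetical; only `run.html` is mentioned elsewhere in this PR.

```python
from pathlib import Path
from typing import Optional

STATIC_DIR = Path("dashboard/static")  # hypothetical location of the dashboard pages

# Server-defined allowlist: page name -> file path.
_PAGES: dict[str, Path] = {
    "index": STATIC_DIR / "index.html",
    "run": STATIC_DIR / "run.html",
}


def resolve_page(name: str) -> Optional[Path]:
    """Look the page up by key; user input is never joined into a path."""
    return _PAGES.get(name)
```

A traversal attempt such as `../../etc/passwd` is simply a missing key, so the caller can answer with a 404 without any path normalisation logic.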

---------
* feat(s3): Docker + config + LLM portability — P1.5 S3 (#51 #52 #53 #54 #55 #56 #57 #58 #84)

- ARIA_CONFIG_PATH: config loader now reads conf.yaml from a configurable path
  (ARIA_CONFIG_PATH env var), enabling ConfigMap mounts in container deployments.
- llm.provider: dynamic LLM client selection (anthropic | claude_code | vertex_ai)
  via conf.yaml or ARIA_LLM_PROVIDER. Defaults to 'anthropic' — removes
  ClaudeCodeLLMClient as the hardwired default, closing the tool-exfiltration
  risk identified in #84.
- VertexAILLMClient: new LLMClientInterface for GCP Vertex AI (ADC auth, no API
  key). Routes to AnthropicVertex for Claude-on-Vertex models and to the Gemini
  SDK for Gemini models.
- GCPSecretManagerVault: new VaultInterface backed by GCP Secret Manager (ADC
  auth). vault_backend config key selects between env/gcp/hashicorp/aws/azure.
- Dockerfile + .dockerignore: python:3.11-slim, non-root aria user (uid 1000),
  curl health check, uvicorn entrypoint. conf.yaml excluded from image —
  always mounted at runtime via ARIA_CONFIG_PATH.
- deployment/monolithic/: docker-compose.yml with bind-mount config pattern and
  named log volume; conf.yaml.example for the monolithic deployment.
- deployment/README.md: four deployment patterns (Docker CLI, compose, Cloud Run,
  GKE ConfigMap) with LLM provider and vault backend selection tables.
- CI: docker-smoke job builds the image and hits /api/v1/health on every PR and
  push to main.
- 41 new unit tests; 294 total passing. make lint clean.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* docs: add installation guide (Docker + K8s) and README deployment section

- documentation/guides/installation.md: full installation guide covering
  Docker (local/VM), docker-compose, and Kubernetes paths; conf.yaml prep,
  LLM provider selection, vault backend options
- documentation/index.md: link to new installation guide
- README.md: add Deployment section with Docker quickstart, K8s outline,
  and pointer to the full guide

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(security): eliminate dynamic SQL construction in SQLiteRunStore

Replace _build_filters() + f-string SQL with static query templates using
the `? IS NULL OR column = ?` pattern. SQL strings are now module-level
constants; user input (HTTP query params) flows only into the parameter
tuple and never into the query string. Closes CodeQL alerts #5 and #6.
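
The `? IS NULL OR column = ?` pattern can be shown with `sqlite3` from the standard library. The table layout, column set, and `list_runs` name are assumptions made for this sketch; the point is that the SQL text is a constant and each optional filter is bound twice as a parameter.

```python
import sqlite3
from typing import Optional

# Static query: filters are switched off by binding NULL, never by editing the SQL.
_LIST_RUNS_SQL = """
SELECT run_id, status, error_class
FROM runs
WHERE (? IS NULL OR status = ?)
  AND (? IS NULL OR error_class = ?)
ORDER BY run_id
"""


def list_runs(
    conn: sqlite3.Connection,
    status: Optional[str] = None,
    error_class: Optional[str] = None,
) -> list[tuple]:
    """Run the constant query; each filter value appears twice in the parameters."""
    params = (status, status, error_class, error_class)
    return conn.execute(_LIST_RUNS_SQL, params).fetchall()
```

The trade-off is one extra bound parameter per filter in exchange for a query string that static analysis can verify never contains user input.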

* fix(tests): resolve code-quality bot flags on PR #91

- test_gcp_secret_manager_vault: replace mixed import/import-from for
  gcp_secret_manager module with consistent from-import + reload via
  sys.modules[__module__]
- test_vertex_llm_client: remove unused PermissionDenied class stub from
  test_permission_denied_raises_llm_auth_error (side_effect already uses
  LLMAuthError directly)
…ty nodes

Cherry-picked content from PR #82 (Tobi-Adesoye, commit 793c500).
Only the three runbook files are brought in — the test regression from
that commit (7 deleted KB unit tests) and the orphaned scripts/uc1_parser.py
are intentionally excluded.

These runbooks will be validated against actual TF log paths as part of #60.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…rectness

Closes #59 — cluster_hosts.json restructured for UC1 TF node names (cdp-master-01,
cdp-data-01/02, cdp-utility-01, cdp-bus-01). IPs placeholder until TF apply.

Closes #60 — UC1 KB runbooks validated and enriched against TF log paths.
_KEYWORD_RE extended with Kafka, ZooKeeper, NiFi, AuthenticationException,
DiskOutOfSpaceException, GC overhead. test_file_kb.py: restored original 8 tests,
updated fixture count (2→8), added TestUC1RunbookAcceptance (3 tests).

Closes #61 — cdp_ssh_key_secret() config option added (core/config.py); SSH key vault
key now configurable via conf.yaml cdp.ssh_key_secret, default CDP_SSH_KEY unchanged.
api/dependencies.py wired to cfg.cdp_ssh_key_secret(). conf_template.yaml annotated
with UC1 TF secret alignment guidance and full TF log dir paths.

Closes #62 — GCPLogConnector: resource_types param adds resource.type OR-clause and
cluster_name host label alias for Dataproc. api/dependencies.py sets
['cloud_dataproc_cluster', 'cloud_dataproc_job'] for UC2. 3 new filter tests.

Closes #63 — UC2 Dataproc KB runbooks (dataproc_cluster.md, dataproc_job.md) added.
TestUC2RunbookAcceptance (2 tests).

Closes #64 — gcp_native.md added as UC3 graceful degradation marker. Agent 2 returns
LOW confidence / empty logs for native GCP services; Agent 4 notifies with gap message.

Closes #85 — _validate_log_paths() in log_extractor.py: drops LLM-planned paths
outside /var/log/ before passing to connectors. 4 new unit tests.
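
A minimal sketch of this kind of path filter, assuming purely lexical normalisation (symlinks on the remote host are not resolved here). The function name and constant are hypothetical and the real `_validate_log_paths()` may differ.

```python
import posixpath

ALLOWED_ROOT = "/var/log"


def validate_log_paths(paths: list[str]) -> list[str]:
    """Keep only paths that normalise to a location under /var/log."""
    kept = []
    for raw in paths:
        normalized = posixpath.normpath(raw)  # collapses ".." and "." segments
        if normalized == ALLOWED_ROOT or normalized.startswith(ALLOWED_ROOT + "/"):
            kept.append(normalized)
    return kept
```

Normalising before the prefix check matters: `/var/log/../../etc/shadow` starts with the allowed prefix as a raw string but resolves outside it.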

Closes #83 — ClassificationError caught in _agent3_node (pipeline.py): Agent 4 now
always runs, notify-only guarantee preserved. ClassifierAgent adds 1 retry + 1s sleep
before raising. 1 retry test + 1 pipeline resilience test added.
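
The "1 retry + 1s sleep before raising" behaviour can be sketched as a small wrapper. `classify_with_retry` and its injectable `sleep` parameter are hypothetical, and the `ClassificationError` class is redefined locally so the example is self-contained.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class ClassificationError(Exception):
    """Local stand-in for the project's ClassificationError."""


def classify_with_retry(
    call: Callable[[], T],
    retries: int = 1,
    delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Invoke call(); on ClassificationError retry up to `retries` times, then re-raise."""
    attempt = 0
    while True:
        try:
            return call()
        except ClassificationError:
            if attempt >= retries:
                raise  # the pipeline node catches this so Agent 4 still runs
            attempt += 1
            sleep(delay_s)
```

Passing `sleep` as a parameter lets unit tests verify the delay without actually waiting.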

309 unit tests green (was 294, +15 new).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Both dataproc_cluster.md and dataproc_job.md scored equally for "cluster"
queries due to "cluster_name" appearing in the Log Paths section of
dataproc_job.md. This caused non-deterministic test failures in CI where
the wrong runbook was returned for cluster-level incidents (YARN missing).

Remove the token by replacing the multi-line filter block with a single
sentence that doesn't contain "cluster", making dataproc_cluster.md the
unique winner for cluster-targeted queries.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Previous fix replaced cluster_name label text but left two more cluster
occurrences: the word literal in "UC2 cluster:" and the hyphen-split
token from "aria-uc2-cluster". Since _tokenize uses re.findall(r"\w+")
— hyphens split but underscores don't — aria-uc2-cluster tokenises to
["aria","uc2","cluster"], still tying with dataproc_cluster.md for
"dataproc-cluster gcp" queries.

Replace both with "UC2 job runner: aria-uc2-dataproc" so dataproc_job.md
scores 0.5 and dataproc_cluster.md scores 0.75 for cluster queries.
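
The tokenisation behaviour behind this fix is easy to reproduce. The sketch below uses the same `re.findall(r"\w+")` call quoted in the commit message; the wrapper name `tokenize` is an assumption, and the scoring function itself is not reproduced here.

```python
import re


def tokenize(text: str) -> list[str]:
    """Split on anything that is not a word character; underscores count as word characters."""
    return re.findall(r"\w+", text)


print(tokenize("aria-uc2-cluster"))  # ['aria', 'uc2', 'cluster']
print(tokenize("cluster_name"))  # ['cluster_name']
print(tokenize("UC2 job runner: aria-uc2-dataproc"))  # ['UC2', 'job', 'runner', 'aria', 'uc2', 'dataproc']
```

The first line shows why the hyphenated cluster name still contributed a standalone `cluster` token, while the replacement text in the last line contributes none.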

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…structure

Two architectural fixes, plus two supporting additions, in one commit:

1. Split knowledge_base fixtures into resource_kb/ (Agent 2) and analyser_kb/
   (Agent 3). Eliminates the design confusion that put failure vocabulary in
   Agent 2's resource catalog — the root cause of the S4 CI score-tie failures.

2. Consolidate from 8 per-component files to 3 per-cluster files in resource_kb.
   Each file describes a cluster's physical/logical resources and log paths — no
   error keywords, no failure descriptions. The cdp_cluster.md covers all 5 UC1
   nodes in one file; aria_uc2_cluster.md covers Dataproc logical resources.

3. Add analyser_kb/ with 5 labeled log excerpts (OOM, disk, auth, YARN safe mode,
   OK baseline) injected into Agent 3's prompt as few-shot examples. These files
   double as a training corpus for the future fine-tuned Agent 3 model.

4. ClassifierAgent gains analyser_kb_dir param + _load_analyser_kb() loader.
   cfg.analyser_kb_dir() reads knowledge_base.analyser_kb_dir / ARIA_ANALYSER_KB_DIR.
   get_agent3() passes the configured dir at construction.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
GCP billing blocked (OR_BACR2_44); Azure $200 credit available.
Ports all 3 UC testing environments to Azure so S4 smoke test can
proceed independently while GCP billing resolves.

Changes:
- Restructure infra/terraform/uc_testing/: GCP configs moved to gcp/
  subfolder (pure git rename, no content changes)
- New azure/uc1-hadoop-onprem/: 5 Standard_B2ms VMs + VNet + NSG
  simulating CDP Hadoop cluster; cloud-init installs Hadoop 3.3.6 and
  creates /var/log/hadoop/* structure; ~$24 for 2 days
- New azure/uc2-hdinsight/: HDInsight Spark cluster (aria-uc2-cluster)
  + Log Analytics workspace; Diagnostic Settings route Syslog to
  workspace for AzureLogConnector; ~$3 for 3h window
- New azure/uc3-azure-native/: Event Hubs + Synapse + Storage;
  references UC2 Log Analytics workspace; no logs seeded (LOW confidence
  expected); ~$2 for 2 days
- api/dependencies.py: wire AzureLogConnector into Agent 2 registry
  (PlatformTag.AZURE → AzureLogConnector) and wire azure vault backend
  in _get_vault(); both implementations already existed (ARI-52)

Closes #93
Closes #94
Closes #95
Closes #96
Closes #97

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…or Azure

- AzureLogConnector is now implemented and wired (not a stub) — update
  Agent 2 connector list and implementations/clusters/cloud/azure/ entry
- infra/terraform/uc_testing/ now has gcp/ and azure/ subfolders — update
  repository structure section
- S4 roadmap entry updated to reflect Azure UC equivalents addition

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Agent 1 stubbed; Agents 2→3→4 ran live against Azure VM (cdp-master-01,
REDACTED-IP). SSH log extraction returned 1 DISK_FAILURE line; Agent 3
classified as disk/HIGH (0.93); Slack notification delivered.

- scripts/smoke_uc1.py: UC1 smoke test script (EnvVarVault, Agent 1 stub)
- infra/terraform/uc_testing/azure/uc1-hadoop-onprem/main.tf: Standard_D2s_v3,
  skip_provider_registration (azurerm 3.x compat)
- documentation/reports/s4_uc1_smoke_test_2026-06-17.md: full results + issues log
- README.md: S4 smoke results table, S4 roadmap marked done

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Comment thread implementations/coms/google_chat/connector.py Fixed
@bayrem bayrem merged commit 48f0b76 into main Jun 18, 2026
7 checks passed
@bayrem bayrem deleted the feat/s4-testing-infra branch June 18, 2026 08:17

Development

Successfully merging this pull request may close these issues.

[P1.5-S4-ARI-31] CMDB validation + end-to-end smoke test (1 incident per UC)

2 participants