
feat(infra): Azure UC equivalents + infra directory restructure [S4]#98

Merged
bayrem merged 9 commits into main from feat/s4-testing-infra on Jun 17, 2026

@bjridicodes

Contributor

Summary

  • GCP billing blocked (OR_BACR2_44); Azure $200 credit available via MS AI Fest
  • Restructures infra/terraform/uc_testing/ — GCP configs moved to gcp/ subfolder (pure rename, no content changes)
  • Adds Azure equivalents of all 3 UC testing environments so S4 smoke test can proceed independently
  • Wires already-implemented AzureLogConnector (ARI-52) and Azure vault into the ARIA dependency factory

Changes

| Area | What changed |
| --- | --- |
| infra/terraform/uc_testing/gcp/ | GCP configs moved here (renamed only) |
| infra/terraform/uc_testing/azure/uc1-hadoop-onprem/ | 5 × Standard_B2ms VMs + VNet + NSG; cloud-init installs Hadoop 3.3.6 + CDP log dirs (~$24/2 days) |
| infra/terraform/uc_testing/azure/uc2-hdinsight/ | HDInsight Spark cluster aria-uc2-cluster + Log Analytics workspace; Syslog → workspace via Diagnostic Settings (~$3/3h) |
| infra/terraform/uc_testing/azure/uc3-azure-native/ | Event Hubs + Synapse + Storage; no logs seeded (→ LOW confidence expected) (~$2/2 days) |
| api/dependencies.py | Wire PlatformTag.AZURE → AzureLogConnector in Agent 2 registry; wire azure vault backend in _get_vault() |
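
For reviewers unfamiliar with the registry, the wiring in api/dependencies.py can be pictured roughly as follows. This is a minimal sketch, not the ARIA code: only PlatformTag.AZURE and AzureLogConnector are names taken from this PR, while the enum values, the stand-in connector classes, CONNECTOR_REGISTRY, and get_connector() are hypothetical.

```python
from enum import Enum


class PlatformTag(Enum):
    # Hypothetical members; the real enum lives in the ARIA codebase.
    GCP = "gcp"
    AZURE = "azure"


class GCPLogConnector:
    """Stand-in for the existing GCP connector."""


class AzureLogConnector:
    """Stand-in for the connector implemented under ARI-52."""


# Assumed shape of the Agent 2 registry: platform tag -> connector class.
CONNECTOR_REGISTRY = {
    PlatformTag.GCP: GCPLogConnector,
    PlatformTag.AZURE: AzureLogConnector,  # the wiring this PR adds
}


def get_connector(tag: PlatformTag):
    """Instantiate the connector registered for a platform tag."""
    try:
        return CONNECTOR_REGISTRY[tag]()
    except KeyError:
        raise ValueError(f"no log connector registered for {tag.name}") from None
```

With a registry of this shape, adding a platform is a one-line change, which is consistent with the "only wiring" claim in the test plan.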

Closes

Closes #93 Closes #94 Closes #95 Closes #96 Closes #97

Test plan

  • make lint passes (CI)
  • Unit tests pass — no ARIA code logic changed, only wiring
  • az login with bjridi.codes@gmail.com account (once Azure credit confirmed active)
  • terraform init && terraform apply in each azure/ UC directory
  • UC1: SSH into cdp-master-01, inject log entry, run pipeline smoke test
  • UC2: Submit HDInsight job, verify Syslog appears in Log Analytics, run pipeline smoke test
  • UC3: Run pipeline smoke test, confirm confidence_band=LOW and notification_sent=True
  • terraform destroy all 3 UC resource groups after smoke tests pass

🤖 Generated with Claude Code

bjridicodes and others added 8 commits June 16, 2026 13:48
…ty nodes

Cherry-picked content from PR #82 (Tobi-Adesoye, commit 85b658e).
Only the three runbook files are brought in — the test regression from
that commit (7 deleted KB unit tests) and the orphaned scripts/uc1_parser.py
are intentionally excluded.

These runbooks will be validated against actual TF log paths as part of #60.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…rectness

Closes #59 — cluster_hosts.json restructured for UC1 TF node names (cdp-master-01,
cdp-data-01/02, cdp-utility-01, cdp-bus-01). IPs placeholder until TF apply.

Closes #60 — UC1 KB runbooks validated and enriched against TF log paths.
_KEYWORD_RE extended with Kafka, ZooKeeper, NiFi, AuthenticationException,
DiskOutOfSpaceException, GC overhead. test_file_kb.py: restored original 8 tests,
updated fixture count (2→8), added TestUC1RunbookAcceptance (3 tests).

Closes #61 — cdp_ssh_key_secret() config option added (core/config.py); SSH key vault
key now configurable via conf.yaml cdp.ssh_key_secret, default CDP_SSH_KEY unchanged.
api/dependencies.py wired to cfg.cdp_ssh_key_secret(). conf_template.yaml annotated
with UC1 TF secret alignment guidance and full TF log dir paths.

Closes #62 — GCPLogConnector: resource_types param adds resource.type OR-clause and
cluster_name host label alias for Dataproc. api/dependencies.py sets
['cloud_dataproc_cluster', 'cloud_dataproc_job'] for UC2. 3 new filter tests.
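
As an illustration of the resource.type OR-clause, here is a minimal sketch of how such a clause could be assembled. The helper name build_resource_type_clause() is hypothetical and the real GCPLogConnector filter code is not shown in this PR; only the two resource type strings come from the commit.

```python
def build_resource_type_clause(resource_types):
    """Return an OR-clause over resource.type, or '' when no types are given."""
    if not resource_types:
        return ""
    terms = [f'resource.type="{rt}"' for rt in resource_types]
    return "(" + " OR ".join(terms) + ")"


# Resource types the commit sets for UC2.
UC2_RESOURCE_TYPES = ["cloud_dataproc_cluster", "cloud_dataproc_job"]
clause = build_resource_type_clause(UC2_RESOURCE_TYPES)
```

Wrapping the clause in parentheses keeps the OR terms grouped when the clause is combined with other filter conditions.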

Closes #63 — UC2 Dataproc KB runbooks (dataproc_cluster.md, dataproc_job.md) added.
TestUC2RunbookAcceptance (2 tests).

Closes #64 — gcp_native.md added as UC3 graceful degradation marker. Agent 2 returns
LOW confidence / empty logs for native GCP services; Agent 4 notifies with gap message.

Closes #85 — _validate_log_paths() in log_extractor.py: drops LLM-planned paths
outside /var/log/ before passing to connectors. 4 new unit tests.
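
The path check can be sketched as below. This is an assumption-laden illustration, not the body of _validate_log_paths(): the function name, the normalisation of ".." segments, and returning normalised paths are all choices made for this example.

```python
import posixpath

ALLOWED_ROOT = "/var/log"


def validate_log_paths(paths):
    """Keep only paths that resolve to a location under /var/log/."""
    kept = []
    for path in paths:
        normalised = posixpath.normpath(path)  # collapses ".." segments
        # Require the trailing slash so "/var/logs/..." is not accepted.
        if normalised.startswith(ALLOWED_ROOT + "/"):
            kept.append(normalised)
    return kept
```

Normalising before the prefix check matters in this sketch: a planned path such as /var/log/../etc/shadow would otherwise pass a naive startswith test.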

Closes #83 — ClassificationError caught in _agent3_node (pipeline.py): Agent 4 now
always runs, notify-only guarantee preserved. ClassifierAgent adds 1 retry + 1s sleep
before raising. 1 retry test + 1 pipeline resilience test added.
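
The retry-then-degrade behaviour might look like the following sketch. Only ClassificationError, the single retry, and the 1s sleep come from the commit; classify_with_retry(), agent3_node(), the state keys, and the choice to retry on any exception are hypothetical.

```python
import time


class ClassificationError(Exception):
    """Raised when classification still fails after the retry."""


def classify_with_retry(classify, payload, retries=1, delay_s=1.0, sleep=time.sleep):
    """Call classify(payload); on failure retry `retries` times, then raise."""
    for attempt in range(retries + 1):
        try:
            return classify(payload)
        except Exception as exc:
            if attempt == retries:
                raise ClassificationError(str(exc)) from exc
            sleep(delay_s)


def agent3_node(state, classify, sleep=time.sleep):
    """Record a classification failure in state instead of propagating it."""
    try:
        state["classification"] = classify_with_retry(
            classify, state["logs"], sleep=sleep
        )
    except ClassificationError as exc:
        state["classification"] = None
        state["classification_error"] = str(exc)
    return state  # a later notification step can still run on this state
```

Catching the error inside the node, rather than letting it abort the pipeline, is what allows a downstream notify step to run in every case.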

309 unit tests green (was 294, +15 new).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Both dataproc_cluster.md and dataproc_job.md scored equally for "cluster"
queries due to "cluster_name" appearing in the Log Paths section of
dataproc_job.md. This caused non-deterministic test failures in CI where
the wrong runbook was returned for cluster-level incidents (YARN missing).

Remove the token by replacing the multi-line filter block with a single
sentence that doesn't contain "cluster", making dataproc_cluster.md the
unique winner for cluster-targeted queries.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Previous fix replaced cluster_name label text but left two more cluster
occurrences: the word literal in "UC2 cluster:" and the hyphen-split
token from "aria-uc2-cluster". Since _tokenize uses re.findall(r"\w+")
— hyphens split but underscores don't — aria-uc2-cluster tokenises to
["aria","uc2","cluster"], still tying with dataproc_cluster.md for
"dataproc-cluster gcp" queries.

Replace both with "UC2 job runner: aria-uc2-dataproc" so dataproc_job.md
scores 0.5 and dataproc_cluster.md scores 0.75 for cluster queries.
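
The tokenisation behaviour described above is easy to reproduce. The sketch below assumes _tokenize is a plain re.findall(r"\w+") call, as the commit states; the function name tokenize() and the variable names are illustrative, and the scoring step is deliberately left out because it is not shown in this PR.

```python
import re


def tokenize(text):
    """Return the runs of word characters (letters, digits, underscore) in text."""
    return re.findall(r"\w+", text)


hyphenated = tokenize("aria-uc2-cluster")   # hyphens are not word characters
underscored = tokenize("cluster_name")      # underscore is a word character
replacement = tokenize("UC2 job runner: aria-uc2-dataproc")
```

Here hyphenated contains a standalone "cluster" token, underscored stays a single "cluster_name" token, and replacement contains no "cluster" token at all.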

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…structure

Two architectural fixes (items 1 and 2) plus supporting additions (items 3 and 4) in one commit:

1. Split knowledge_base fixtures into resource_kb/ (Agent 2) and analyser_kb/
   (Agent 3). Eliminates the design confusion that put failure vocabulary in
   Agent 2's resource catalog — the root cause of the S4 CI score-tie failures.

2. Consolidate from 8 per-component files to 3 per-cluster files in resource_kb.
   Each file describes a cluster's physical/logical resources and log paths — no
   error keywords, no failure descriptions. The cdp_cluster.md covers all 5 UC1
   nodes in one file; aria_uc2_cluster.md covers Dataproc logical resources.

3. Add analyser_kb/ with 5 labeled log excerpts (OOM, disk, auth, YARN safe mode,
   OK baseline) injected into Agent 3's prompt as few-shot examples. These files
   double as a training corpus for the future fine-tuned Agent 3 model.

4. ClassifierAgent gains analyser_kb_dir param + _load_analyser_kb() loader.
   cfg.analyser_kb_dir() reads knowledge_base.analyser_kb_dir / ARIA_ANALYSER_KB_DIR.
   get_agent3() passes the configured dir at construction.
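
A lookup like cfg.analyser_kb_dir() could be sketched as follows. The key knowledge_base.analyser_kb_dir and the variable ARIA_ANALYSER_KB_DIR come from the commit; the standalone function signature, the dict-based config, and the precedence (environment variable first, then config, then default) are assumptions for illustration.

```python
import os


def analyser_kb_dir(conf, environ=os.environ, default=None):
    """Resolve the analyser KB directory: env var, then config, then default."""
    env_value = environ.get("ARIA_ANALYSER_KB_DIR")
    if env_value:
        return env_value  # assumed: the environment variable wins
    return conf.get("knowledge_base", {}).get("analyser_kb_dir", default)
```

For example, calling analyser_kb_dir({"knowledge_base": {"analyser_kb_dir": "kb/analyser_kb"}}, environ={}) returns the configured directory.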

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
GCP billing blocked (OR_BACR2_44); Azure $200 credit available.
Ports all 3 UC testing environments to Azure so S4 smoke test can
proceed independently while GCP billing resolves.

Changes:
- Restructure infra/terraform/uc_testing/: GCP configs moved to gcp/
  subfolder (pure git rename, no content changes)
- New azure/uc1-hadoop-onprem/: 5 Standard_B2ms VMs + VNet + NSG
  simulating CDP Hadoop cluster; cloud-init installs Hadoop 3.3.6 and
  creates /var/log/hadoop/* structure; ~$24 for 2 days
- New azure/uc2-hdinsight/: HDInsight Spark cluster (aria-uc2-cluster)
  + Log Analytics workspace; Diagnostic Settings route Syslog to
  workspace for AzureLogConnector; ~$3 for 3h window
- New azure/uc3-azure-native/: Event Hubs + Synapse + Storage;
  references UC2 Log Analytics workspace; no logs seeded (LOW confidence
  expected); ~$2 for 2 days
- api/dependencies.py: wire AzureLogConnector into Agent 2 registry
  (PlatformTag.AZURE → AzureLogConnector) and wire azure vault backend
  in _get_vault(); both implementations already existed (ARI-52)

Closes #93
Closes #94
Closes #95
Closes #96
Closes #97

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…or Azure

- AzureLogConnector is now implemented and wired (not a stub) — update
  Agent 2 connector list and implementations/clusters/cloud/azure/ entry
- infra/terraform/uc_testing/ now has gcp/ and azure/ subfolders — update
  repository structure section
- S4 roadmap entry updated to reflect Azure UC equivalents addition

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
bayrem previously approved these changes Jun 17, 2026
bayrem dismissed their stale review June 17, 2026 19:45

The merge-base changed after approval.

bayrem merged commit 65f6eb6 into main Jun 17, 2026
7 checks passed
bayrem deleted the feat/s4-testing-infra branch June 17, 2026 19:46