feat(infra): Azure UC equivalents + infra directory restructure [S4] #98
Merged
…ty nodes
Cherry-picked content from PR #82 (Tobi-Adesoye, commit 85b658e). Only the three runbook files are brought in; the test regression from that commit (7 deleted KB unit tests) and the orphaned scripts/uc1_parser.py are intentionally excluded. These runbooks will be validated against actual TF log paths as part of #60.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…rectness
Closes #59: cluster_hosts.json restructured for UC1 TF node names (cdp-master-01, cdp-data-01/02, cdp-utility-01, cdp-bus-01). IPs placeholder until TF apply.
Closes #60: UC1 KB runbooks validated and enriched against TF log paths. _KEYWORD_RE extended with Kafka, ZooKeeper, NiFi, AuthenticationException, DiskOutOfSpaceException, GC overhead. test_file_kb.py: restored original 8 tests, updated fixture count (2→8), added TestUC1RunbookAcceptance (3 tests).
Closes #61: cdp_ssh_key_secret() config option added (core/config.py); SSH key vault key now configurable via conf.yaml cdp.ssh_key_secret, default CDP_SSH_KEY unchanged. api/dependencies.py wired to cfg.cdp_ssh_key_secret(). conf_template.yaml annotated with UC1 TF secret alignment guidance and full TF log dir paths.
Closes #62: GCPLogConnector: resource_types param adds resource.type OR-clause and cluster_name host label alias for Dataproc. api/dependencies.py sets ['cloud_dataproc_cluster', 'cloud_dataproc_job'] for UC2. 3 new filter tests.
Closes #63: UC2 Dataproc KB runbooks (dataproc_cluster.md, dataproc_job.md) added. TestUC2RunbookAcceptance (2 tests).
Closes #64: gcp_native.md added as UC3 graceful degradation marker. Agent 2 returns LOW confidence / empty logs for native GCP services; Agent 4 notifies with gap message.
Closes #85: _validate_log_paths() in log_extractor.py: drops LLM-planned paths outside /var/log/ before passing to connectors. 4 new unit tests.
Closes #83: ClassificationError caught in _agent3_node (pipeline.py): Agent 4 now always runs, notify-only guarantee preserved. ClassifierAgent adds 1 retry + 1s sleep before raising. 1 retry test + 1 pipeline resilience test added.
309 unit tests green (was 294, +15 new).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
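The path filter described under #85 can be illustrated with a short sketch. This is not the code from log_extractor.py, which is not shown in this PR page; the function name validate_log_paths, the ALLOWED_ROOT constant, and the choice to normalise ".." segments before checking are all assumptions made for illustration.

```python
import posixpath

# Root named in the commit message; the constant name is hypothetical.
ALLOWED_ROOT = "/var/log/"


def validate_log_paths(paths):
    """Return only the paths that resolve to a location under /var/log/."""
    kept = []
    for raw in paths:
        # Collapse "." and ".." segments so "/var/log/../etc/shadow"
        # cannot pass a naive prefix check (an assumption about intent).
        normalized = posixpath.normpath(raw)
        if normalized.startswith(ALLOWED_ROOT):
            kept.append(normalized)
    return kept


planned = [
    "/var/log/hadoop/hdfs/namenode.log",
    "/etc/passwd",
    "/var/log/../etc/shadow",
]
print(validate_log_paths(planned))  # ['/var/log/hadoop/hdfs/namenode.log']
```

Normalising before the prefix check is one possible design; a stricter variant could also reject any path that contained ".." at all rather than silently resolving it.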
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Both dataproc_cluster.md and dataproc_job.md scored equally for "cluster" queries due to "cluster_name" appearing in the Log Paths section of dataproc_job.md. This caused non-deterministic test failures in CI where the wrong runbook was returned for cluster-level incidents (YARN missing). Remove the token by replacing the multi-line filter block with a single sentence that doesn't contain "cluster", making dataproc_cluster.md the unique winner for cluster-targeted queries. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Previous fix replaced cluster_name label text but left two more cluster occurrences: the word literal in "UC2 cluster:" and the hyphen-split token from "aria-uc2-cluster". Since _tokenize uses re.findall(r"\w+") — hyphens split but underscores don't — aria-uc2-cluster tokenises to ["aria","uc2","cluster"], still tying with dataproc_cluster.md for "dataproc-cluster gcp" queries. Replace both with "UC2 job runner: aria-uc2-dataproc" so dataproc_job.md scores 0.5 and dataproc_cluster.md scores 0.75 for cluster queries. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
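The tokenisation behaviour described in this commit message can be checked directly. The sketch below uses the same regular expression quoted above; the function name tokenize is a stand-in for the project's _tokenize, whose full body (for example, any lowercasing) is not shown here, and the scoring step is deliberately left out.

```python
import re


def tokenize(text):
    # \w matches letters, digits and underscore, so hyphens act as
    # separators while underscores stay inside a token.
    return re.findall(r"\w+", text)


print(tokenize("aria-uc2-cluster"))   # ['aria', 'uc2', 'cluster']
print(tokenize("cluster_name"))       # ['cluster_name']
print(tokenize("aria-uc2-dataproc"))  # ['aria', 'uc2', 'dataproc']
```

The replacement name aria-uc2-dataproc no longer yields a standalone "cluster" token, which is the property the fix relies on.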
…structure
Two architectural fixes (items 1 and 2) plus two supporting changes (items 3 and 4) in one commit:
1. Split knowledge_base fixtures into resource_kb/ (Agent 2) and analyser_kb/ (Agent 3). This eliminates the design confusion that put failure vocabulary in Agent 2's resource catalog, which was the root cause of the S4 CI score-tie failures.
2. Consolidate from 8 per-component files to 3 per-cluster files in resource_kb. Each file describes a cluster's physical/logical resources and log paths, with no error keywords and no failure descriptions. cdp_cluster.md covers all 5 UC1 nodes in one file; aria_uc2_cluster.md covers Dataproc logical resources.
3. Add analyser_kb/ with 5 labeled log excerpts (OOM, disk, auth, YARN safe mode, OK baseline) injected into Agent 3's prompt as few-shot examples. These files double as a training corpus for the future fine-tuned Agent 3 model.
4. ClassifierAgent gains analyser_kb_dir param + _load_analyser_kb() loader. cfg.analyser_kb_dir() reads knowledge_base.analyser_kb_dir / ARIA_ANALYSER_KB_DIR. get_agent3() passes the configured dir at construction.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
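A minimal sketch of how the analyser KB directory could be resolved from the two sources named above. The commit message does not say which source wins, so the precedence here (environment variable over conf.yaml) is an assumption, as are the function name resolve_analyser_kb_dir and the example paths.

```python
import os

ENV_VAR = "ARIA_ANALYSER_KB_DIR"  # name taken from the commit message


def resolve_analyser_kb_dir(conf, env=None):
    """Pick the analyser KB directory from the environment or parsed conf.yaml.

    Assumption: the environment variable overrides the config file.
    """
    env = os.environ if env is None else env
    override = env.get(ENV_VAR)
    if override:
        return override
    # An empty YAML section parses to None, so fall back to an empty dict.
    section = conf.get("knowledge_base") or {}
    return section.get("analyser_kb_dir")


conf = {"knowledge_base": {"analyser_kb_dir": "knowledge_base/analyser_kb"}}
print(resolve_analyser_kb_dir(conf, env={}))
print(resolve_analyser_kb_dir(conf, env={ENV_VAR: "/opt/aria/analyser_kb"}))
```

Passing env explicitly keeps the function easy to unit-test without mutating the process environment.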
GCP billing blocked (OR_BACR2_44); Azure $200 credit available. Ports all 3 UC testing environments to Azure so S4 smoke test can proceed independently while GCP billing resolves.
Changes:
- Restructure infra/terraform/uc_testing/: GCP configs moved to gcp/ subfolder (pure git rename, no content changes)
- New azure/uc1-hadoop-onprem/: 5 Standard_B2ms VMs + VNet + NSG simulating CDP Hadoop cluster; cloud-init installs Hadoop 3.3.6 and creates /var/log/hadoop/* structure; ~$24 for 2 days
- New azure/uc2-hdinsight/: HDInsight Spark cluster (aria-uc2-cluster) + Log Analytics workspace; Diagnostic Settings route Syslog to workspace for AzureLogConnector; ~$3 for 3h window
- New azure/uc3-azure-native/: Event Hubs + Synapse + Storage; references UC2 Log Analytics workspace; no logs seeded (LOW confidence expected); ~$2 for 2 days
- api/dependencies.py: wire AzureLogConnector into Agent 2 registry (PlatformTag.AZURE → AzureLogConnector) and wire azure vault backend in _get_vault(); both implementations already existed (ARI-52)
Closes #93
Closes #94
Closes #95
Closes #96
Closes #97
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
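The registry wiring mentioned for api/dependencies.py can be pictured as a mapping from platform tag to connector class. The sketch below is illustrative only: the two connector classes are empty stand-ins for the real ones, the enum values, CONNECTOR_REGISTRY, and get_connector are hypothetical names, and the real registry may construct connectors with arguments.

```python
from enum import Enum


class PlatformTag(Enum):
    # Member values are assumptions; only the member names appear in the PR.
    GCP = "gcp"
    AZURE = "azure"


class GCPLogConnector:
    """Stand-in for the real connector class."""


class AzureLogConnector:
    """Stand-in for the real connector class."""


# Hypothetical registry: one entry per platform, the Azure entry being new.
CONNECTOR_REGISTRY = {
    PlatformTag.GCP: GCPLogConnector,
    PlatformTag.AZURE: AzureLogConnector,
}


def get_connector(tag):
    """Instantiate the connector registered for a platform tag."""
    try:
        connector_cls = CONNECTOR_REGISTRY[tag]
    except KeyError:
        raise ValueError(f"no log connector registered for {tag}") from None
    return connector_cls()


print(type(get_connector(PlatformTag.AZURE)).__name__)  # AzureLogConnector
```

Keeping the mapping in one dictionary means adding a platform is a one-line change, which matches the small footprint of the wiring described above.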
…or Azure
- AzureLogConnector is now implemented and wired (not a stub): update Agent 2 connector list and implementations/clusters/cloud/azure/ entry
- infra/terraform/uc_testing/ now has gcp/ and azure/ subfolders: update repository structure section
- S4 roadmap entry updated to reflect Azure UC equivalents addition
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
bayrem previously approved these changes on Jun 17, 2026
bayrem approved these changes on Jun 17, 2026
Summary
- GCP billing blocked (OR_BACR2_44); Azure $200 credit available via MS AI Fest
- Restructure infra/terraform/uc_testing/: GCP configs moved to gcp/ subfolder (pure rename, no content changes)
- Wire AzureLogConnector (ARI-52) and Azure vault into the ARIA dependency factory
Changes
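- infra/terraform/uc_testing/gcp/: existing GCP configs (pure rename, no content changes)
- infra/terraform/uc_testing/azure/uc1-hadoop-onprem/: 5 Standard_B2ms VMs + VNet + NSG simulating the CDP Hadoop cluster (~$24 for 2 days)
- infra/terraform/uc_testing/azure/uc2-hdinsight/: HDInsight Spark cluster aria-uc2-cluster + Log Analytics workspace; Syslog → workspace via Diagnostic Settings (~$3/3h)
- infra/terraform/uc_testing/azure/uc3-azure-native/: Event Hubs + Synapse + Storage (~$2 for 2 days)
- api/dependencies.py: wire PlatformTag.AZURE → AzureLogConnector in Agent 2 registry; wire azure vault backend in _get_vault()
Closes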
Closes #93 Closes #94 Closes #95 Closes #96 Closes #97
Test plan
- make lint passes (CI)
- az login with bjridi.codes@gmail.com account (once Azure credit confirmed active)
- terraform init && terraform apply in each azure/ UC directory
- UC3 smoke test: expect confidence_band=LOW and notification_sent=True
- terraform destroy all 3 UC resource groups after smoke tests pass
🤖 Generated with Claude Code