fix(beta9): 5 bugs from Cubic review of upstream PR — status-checking, idempotency, NodePort Service, SDK batch-embeddings by Wingie · Pull Request #3 · Wingie/beta9

Wingie · 2026-04-20T13:59:39Z

Summary

Addresses 5 bugs Cubic AI flagged on the (now-closed) upstream PR beam-cloud#1562. All fixes land with tests; full Go + SDK suites pass (62/62 SDK, was 61/62).

Branch off origin/main; one commit per fix for clean review.

Fixes

Commit	Finding	File
`e463ced4`	SDK batch-embedding field routing — `embedding` now receives a single vector, `embeddings` receives the batch list-of-lists	`sdk/src/beta9/inference.py`
`55adb935`	`UnloadModel` status-checks worker response; non-200 logs error with ≤4 KiB body, returns error, leaves `LoadState` unchanged	`pkg/gateway/inference_router.go`
`cd7acd37`	`/inference/pull` status-checks Ollama before decoding; decode errors return 502; includes `isAllowedModelName()` allowlist (blocks `..`) + `MaxBytesReader(4096)`	`pkg/agent/control.go`
`241e932a`	`KeepaliveLoop` Start/Stop guarded with `sync.Once` + atomic started flag; Stop-before-Start no longer hangs; Stop is idempotent; concurrent Stop is safe	`pkg/agent/keepalive.go`
`14d47d1f`	Added `Service/gateway-nodeport` (type NodePort, ports 31994 + 31993, selector `app=gateway`) alongside the existing ClusterIP Service	`manifests/k3d/beta9.yaml`

Tests added / updated

sdk/tests/test_cubic_fixes.py::test_batch_embeddings — pre-existing, now passes
pkg/gateway/inference_router_test.go — new — TestUnloadModelNonOK (200/500/404 table) + TestUnloadModelNodeNotFound
pkg/agent/control_test.go — new — TestInferencePullStatus (200/404/500) + TestInferencePullDecodeFailureReturnsBadGateway + TestInferencePullRejectsInvalidModelName + TestInferencePullMethodNotAllowed
pkg/agent/keepalive_test.go — new — TestKeepaliveLoopIdempotent + TestKeepaliveLoopConcurrentStop (32 goroutines) + TestKeepaliveLoopStopBeforeStart

Verification

go build ./... — clean
go test ./... -count=1 -timeout=300s — all packages PASS
go vet ./... — clean on touched files (pre-existing warnings in pkg/worker/worker.go and pkg/gateway/services/stub.go unchanged)
cd sdk && python -m pytest tests/ -x --tb=short — 62 passed (was 61 passed / 1 failed before this PR)

Notes

The NodePort Service (commit 14d47d1) was validated via Python YAML parse (the build host had no kubectl). Worth re-verifying with kubectl apply --dry-run=client -f manifests/k3d/beta9.yaml before apply.
Cubic's line numbers drifted vs current origin/main; fixes are by function identity, not line — the underlying bugs were unchanged.
/inference/pull allowlist and body cap are inlined here (not pulled from security-hardening-b9agent branch) since this PR is based on origin/main for a clean diff.
Fix context (all 40 Cubic findings + cross-check against our earlier audits): https://github.com/Wingie/flowstate-agents/blob/main/wip-specs/beta9/cubic-review-findings-1562-2026-04-20.md

Co-Authored-By: Claudistrator savetheplanet@agentosaurus.com

Cubic review finding (sdk/src/beta9/inference.py:307): Batch embed() results were being assigned to the singular `embedding` field while `embeddings` (the documented plural list-of-lists) stayed empty. tests/test_cubic_fixes.py::test_batch_embeddings exercises this path and asserts `len(result.embeddings) == 3`. Fix: track whether caller passed a string (single) or a list (batch), and populate: - `embedding` = flat vector for single-input calls (back-compat) - `embeddings` = full list-of-lists for batch calls Co-Authored-By: Claudistrator <savetheplanet@agentosaurus.com>

Cubic review finding (pkg/gateway/inference_router.go:463): UnloadModel previously cleared the registry's LoadState to Idle no matter what the worker returned. Any 4xx/5xx response would leave the registry inconsistent with the worker's actual in-memory model set — a cache-coherency bug. Fix: - On non-200 response: log error with status + up to 4 KiB of body, return an error, and leave the model's LoadState unchanged. - Only on 200: drain body and transition to LoadStateIdle. Added table-driven TestUnloadModelNonOK + TestUnloadModelNodeNotFound in pkg/gateway/inference_router_test.go covering 200/500/404 paths and the missing-node early return. Co-Authored-By: Claudistrator <savetheplanet@agentosaurus.com>

…broke Cubic review finding (pkg/agent/control.go:227): handleInferencePull forwarded to Ollama's /api/pull, looped over the NDJSON body with a decoder, and returned StatusOK unconditionally. A non-200 from Ollama or a malformed stream both produced a false success response. Fix: - Check resp.StatusCode != 200 before streaming; non-OK now returns 502 with a 512-byte body preview from Ollama for diagnosis. - Track decoder errors; if decode fails (or the stream is empty) the handler returns 502 instead of pretending success. - Success case now returns 201 Created to distinguish from the pre-fix false-success 200. Inline P0-A hardening (required by spec, not yet on origin/main): - MaxBytesReader caps request body at 4 KiB. - isAllowedModelName() allowlist [A-Za-z0-9._/:-] + length <= 128 and blocks ".." path-traversal sequences before they reach Ollama. Added TestInferencePullStatus (200/404/500 upstream), TestInferencePullDecodeFailureReturnsBadGateway, TestInferencePullRejectsInvalidModelName, TestInferencePullMethodNotAllowed in pkg/agent/control_test.go. ControlServer gained an optional ollamaBaseURL field so httptest servers can stand in for the daemon. Co-Authored-By: Claudistrator <savetheplanet@agentosaurus.com>

Cubic review finding (pkg/agent/keepalive.go, Start ~L113 / Stop): Calling Stop() twice — or Start() after Stop() — closed an already closed channel and produced a "close of closed channel" panic. The bug is reachable during shutdown races and during retries after a transient failure. Fix: - Added `startOnce` + `stopOnce` sync.Once guards plus an atomic `started` flag. Start is a no-op after the first call; subsequent Stops are no-ops too. - Stop-before-Start is also safe: when started == 0 we return without waiting on doneCh (which would otherwise never close). Tests in pkg/agent/keepalive_test.go: - TestKeepaliveLoopIdempotent — Start, Stop, Stop, Start, Stop - TestKeepaliveLoopConcurrentStop — 32 goroutines stopping in parallel - TestKeepaliveLoopStopBeforeStart — Stop never blocks pre-Start Co-Authored-By: Claudistrator <savetheplanet@agentosaurus.com>

Cubic review finding (manifests/k3d/beta9.yaml around L361): The manifest's comment declared "using NodePort instead" but the only gateway Service was ClusterIP on 1994/1993/9090. Tailscale peers and external workers hitting :31994/:31993 on the OCI node saw connection refused because no NodePort Service ever materialised — matches empty output of `ss -tlnp | grep 31994` on the live host. Fix: add `Service/gateway-nodeport` (type: NodePort) alongside the existing ClusterIP Service, with: - nodePort: 31994 -> targetPort 1994 (HTTP) - nodePort: 31993 -> targetPort 1993 (gRPC) - selector: app=gateway Kept the ClusterIP Service unchanged for in-cluster traffic. Validated with `python3 -c "import yaml; yaml.safe_load_all(...)"`; kubectl dry-run not available on the build host but the Service matches the Kubernetes v1 Service schema used elsewhere in the repo. Co-Authored-By: Claudistrator <savetheplanet@agentosaurus.com>

Wingie and others added 5 commits April 20, 2026 15:43

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(beta9): 5 bugs from Cubic review of upstream PR — status-checking, idempotency, NodePort Service, SDK batch-embeddings#3

fix(beta9): 5 bugs from Cubic review of upstream PR — status-checking, idempotency, NodePort Service, SDK batch-embeddings#3
Wingie wants to merge 5 commits intomainfrom
fix/cubic-review-findings-20260420

Wingie commented Apr 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Wingie commented Apr 20, 2026

Summary

Fixes

Tests added / updated

Verification

Notes

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant