Skip to content

fix(beta9): 5 bugs from Cubic review of upstream PR — status-checking, idempotency, NodePort Service, SDK batch-embeddings#3

Open
Wingie wants to merge 5 commits intomainfrom
fix/cubic-review-findings-20260420
Open

fix(beta9): 5 bugs from Cubic review of upstream PR — status-checking, idempotency, NodePort Service, SDK batch-embeddings#3
Wingie wants to merge 5 commits intomainfrom
fix/cubic-review-findings-20260420

Conversation

@Wingie
Copy link
Copy Markdown
Owner

@Wingie Wingie commented Apr 20, 2026

Summary

Addresses 5 bugs Cubic AI flagged on the (now-closed) upstream PR beam-cloud#1562. All fixes land with tests; full Go + SDK suites pass (62/62 SDK, was 61/62).

Branch off origin/main; one commit per fix for clean review.

Fixes

Commit Finding File
e463ced4 SDK batch-embedding field routing — embedding now receives a single vector, embeddings receives the batch list-of-lists sdk/src/beta9/inference.py
55adb935 UnloadModel status-checks worker response; non-200 logs error with ≤4 KiB body, returns error, leaves LoadState unchanged pkg/gateway/inference_router.go
cd7acd37 /inference/pull status-checks Ollama before decoding; decode errors return 502; includes isAllowedModelName() allowlist (blocks ..) + MaxBytesReader(4096) pkg/agent/control.go
241e932a KeepaliveLoop Start/Stop guarded with sync.Once + atomic started flag; Stop-before-Start no longer hangs; Stop is idempotent; concurrent Stop is safe pkg/agent/keepalive.go
14d47d1f Added Service/gateway-nodeport (type NodePort, ports 31994 + 31993, selector app=gateway) alongside the existing ClusterIP Service manifests/k3d/beta9.yaml

Tests added / updated

  • sdk/tests/test_cubic_fixes.py::test_batch_embeddings — pre-existing, now passes
  • pkg/gateway/inference_router_test.go — new — TestUnloadModelNonOK (200/500/404 table) + TestUnloadModelNodeNotFound
  • pkg/agent/control_test.go — new — TestInferencePullStatus (200/404/500) + TestInferencePullDecodeFailureReturnsBadGateway + TestInferencePullRejectsInvalidModelName + TestInferencePullMethodNotAllowed
  • pkg/agent/keepalive_test.go — new — TestKeepaliveLoopIdempotent + TestKeepaliveLoopConcurrentStop (32 goroutines) + TestKeepaliveLoopStopBeforeStart

Verification

  • go build ./... — clean
  • go test ./... -count=1 -timeout=300s — all packages PASS
  • go vet ./... — clean on touched files (pre-existing warnings in pkg/worker/worker.go and pkg/gateway/services/stub.go unchanged)
  • cd sdk && python -m pytest tests/ -x --tb=short62 passed (was 61 passed / 1 failed before this PR)

Notes

  • The NodePort Service (commit 14d47d1) was validated via Python YAML parse (the build host had no kubectl). Worth re-verifying with kubectl apply --dry-run=client -f manifests/k3d/beta9.yaml before apply.
  • Cubic's line numbers drifted vs current origin/main; fixes are by function identity, not line — the underlying bugs were unchanged.
  • /inference/pull allowlist and body cap are inlined here (not pulled from security-hardening-b9agent branch) since this PR is based on origin/main for a clean diff.
  • Fix context (all 40 Cubic findings + cross-check against our earlier audits): https://github.com/Wingie/flowstate-agents/blob/main/wip-specs/beta9/cubic-review-findings-1562-2026-04-20.md

Co-Authored-By: Claudistrator savetheplanet@agentosaurus.com

Wingie and others added 5 commits April 20, 2026 15:43
Cubic review finding (sdk/src/beta9/inference.py:307):
Batch embed() results were being assigned to the singular `embedding`
field while `embeddings` (the documented plural list-of-lists) stayed
empty. tests/test_cubic_fixes.py::test_batch_embeddings exercises this
path and asserts `len(result.embeddings) == 3`.

Fix: track whether caller passed a string (single) or a list (batch),
and populate:
  - `embedding` = flat vector for single-input calls (back-compat)
  - `embeddings` = full list-of-lists for batch calls

Co-Authored-By: Claudistrator <savetheplanet@agentosaurus.com>
Cubic review finding (pkg/gateway/inference_router.go:463):
UnloadModel previously cleared the registry's LoadState to Idle no
matter what the worker returned. Any 4xx/5xx response would leave the
registry inconsistent with the worker's actual in-memory model set —
a cache-coherency bug.

Fix:
- On non-200 response: log error with status + up to 4 KiB of body,
  return an error, and leave the model's LoadState unchanged.
- Only on 200: drain body and transition to LoadStateIdle.

Added table-driven TestUnloadModelNonOK + TestUnloadModelNodeNotFound
in pkg/gateway/inference_router_test.go covering 200/500/404 paths
and the missing-node early return.

Co-Authored-By: Claudistrator <savetheplanet@agentosaurus.com>
…broke

Cubic review finding (pkg/agent/control.go:227):
handleInferencePull forwarded to Ollama's /api/pull, looped over the
NDJSON body with a decoder, and returned StatusOK unconditionally. A
non-200 from Ollama or a malformed stream both produced a false
success response.

Fix:
- Check resp.StatusCode != 200 before streaming; non-OK now returns
  502 with a 512-byte body preview from Ollama for diagnosis.
- Track decoder errors; if decode fails (or the stream is empty) the
  handler returns 502 instead of pretending success.
- Success case now returns 201 Created to distinguish from the pre-fix
  false-success 200.

Inline P0-A hardening (required by spec, not yet on origin/main):
- MaxBytesReader caps request body at 4 KiB.
- isAllowedModelName() allowlist [A-Za-z0-9._/:-] + length <= 128 and
  blocks ".." path-traversal sequences before they reach Ollama.

Added TestInferencePullStatus (200/404/500 upstream),
TestInferencePullDecodeFailureReturnsBadGateway,
TestInferencePullRejectsInvalidModelName, TestInferencePullMethodNotAllowed
in pkg/agent/control_test.go. ControlServer gained an optional
ollamaBaseURL field so httptest servers can stand in for the daemon.

Co-Authored-By: Claudistrator <savetheplanet@agentosaurus.com>
Cubic review finding (pkg/agent/keepalive.go, Start ~L113 / Stop):
Calling Stop() twice — or Start() after Stop() — closed an already
closed channel and produced a "close of closed channel" panic. The
bug is reachable during shutdown races and during retries after a
transient failure.

Fix:
- Added `startOnce` + `stopOnce` sync.Once guards plus an atomic
  `started` flag. Start is a no-op after the first call; subsequent
  Stops are no-ops too.
- Stop-before-Start is also safe: when started == 0 we return without
  waiting on doneCh (which would otherwise never close).

Tests in pkg/agent/keepalive_test.go:
- TestKeepaliveLoopIdempotent — Start, Stop, Stop, Start, Stop
- TestKeepaliveLoopConcurrentStop — 32 goroutines stopping in parallel
- TestKeepaliveLoopStopBeforeStart — Stop never blocks pre-Start

Co-Authored-By: Claudistrator <savetheplanet@agentosaurus.com>
Cubic review finding (manifests/k3d/beta9.yaml around L361):
The manifest's comment declared "using NodePort instead" but the only
gateway Service was ClusterIP on 1994/1993/9090. Tailscale peers and
external workers hitting :31994/:31993 on the OCI node saw connection
refused because no NodePort Service ever materialised — matches
empty output of `ss -tlnp | grep 31994` on the live host.

Fix: add `Service/gateway-nodeport` (type: NodePort) alongside the
existing ClusterIP Service, with:
  - nodePort: 31994 -> targetPort 1994 (HTTP)
  - nodePort: 31993 -> targetPort 1993 (gRPC)
  - selector: app=gateway

Kept the ClusterIP Service unchanged for in-cluster traffic.
Validated with `python3 -c "import yaml; yaml.safe_load_all(...)"`;
kubectl dry-run not available on the build host but the Service
matches the Kubernetes v1 Service schema used elsewhere in the repo.

Co-Authored-By: Claudistrator <savetheplanet@agentosaurus.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant