Skip to content

Add Nix flake: dev shell, golem-cli package, and workspace checks (bring the golem to the dark side ⛧)#3365

Open
Fresheyeball wants to merge 23 commits into
golemcloud:mainfrom
Fresheyeball:main
Open

Add Nix flake: dev shell, golem-cli package, and workspace checks (bring the golem to the dark side ⛧)#3365
Fresheyeball wants to merge 23 commits into
golemcloud:mainfrom
Fresheyeball:main

Conversation

@Fresheyeball
Copy link
Copy Markdown

@Fresheyeball Fresheyeball commented May 9, 2026

262e77e7-e774-414c-9dfa-aa4484f3f909


Introduces a flake.nix so contributors can drop into a reproducible
development environment with one command and build the CLI hermetically —
no system-wide rustup, cargo-make, protoc, or openssl install
required.

What this gives you

nix develop                       # fully-loaded dev shell
nix build                         # produces ./result/bin/golem-cli
nix flake check                   # fmt + clippy + unit-tests

Outputs

devShells.default

Stable Rust 1.95 (via rust-overlay) with rust-src, rust-analyzer,
clippy, rustfmt, and the wasm32-wasip1 / wasm32-wasip2 targets.
Plus:

  • Cargo helpers: cargo-make, cargo-nextest, cargo-binstall, cargo-component
  • WASM tooling: wasm-tools, wit-bindgen
  • Native build deps: protobuf, openssl, pkg-config, cmake, zstd, git, cacert
  • macOS: Security and SystemConfiguration frameworks for openssl-sys linking

The shellHook exports OPENSSL_DIR, OPENSSL_LIB_DIR,
OPENSSL_INCLUDE_DIR, PKG_CONFIG_PATH, PROTOC, and RUST_SRC_PATH
so openssl-sys and rust-analyzer resolve without manual setup.

Two tools aren't in nixpkgs and are surfaced as hints (no auto-install):

  • wasi-sdk v25 — only required to build C/C++ wasm components
  • wasm-rquickjs@0.2.4 — installable via cargo binstall

packages.default / packages.golem-cli

Builds the golem-cli binary from cli/golem-cli with crane. Cold
build is ~2½ minutes on this machine; subsequent builds reuse the
shared cargoArtifacts derivation.

Two notable workarounds were needed:

  • Git-sourced deps: the golemcloud/wasmtime fork and
    golemcloud/wasm-rquickjs are vendored via outputHashes, keyed by
    the full Cargo.lock source URL (crane's convention, not the
    - form that nixpkgs importCargoLock uses).
  • Missing READMEs in the wasmtime fork: several Cargo.toml files
    (notably crates/c-api/artifact) reference README.md files that
    don't exist in the crate subdirectory; cargo package -l rejects
    those manifests. overrideVendorGitCheckout materializes empty
    placeholder READMEs before vendoring runs.

A custom source filter is used instead of cleanCargoSource because
golem-cli's build.rs embeds template files from
cli/golem-cli/templates/ and golem-skills/skills/, which the default
Rust-only filter strips.

checks..{clippy, fmt, unit-tests, golem-cli}

Runs cargo clippy --all-targets -- --no-deps -Dwarnings, cargo fmt --check, and the workspace's --lib unit tests via crane
primitives.
All checks share a single workspace-wide cargoArtifacts derivation, so
third-party dependencies are compiled once across nix flake check.

The four "drift" checks from Makefile.toml (check-openapi,
check-configs, check-wit, check-diff-model-fingerprint) are
deliberately out of scope here — they invoke compiled binaries and shell
out to git diff over generated artifacts, which is a different shape.
Easy follow-up.

formatter

nixpkgs-fmt for the flake itself.

Other changes

  • flake.lock is committed for reproducibility of inputs.
  • .gitignore ignores result, result-*, and .direnv/.

Introduces a `flake.nix` so contributors can drop into a reproducible
development environment with `nix develop` and build the CLI hermetically
with `nix build`.

## What's included

### `devShells.default`

Stable Rust 1.95 via `rust-overlay`, with `rust-src`, `rust-analyzer`,
`clippy`, `rustfmt`, and the `wasm32-wasip1` / `wasm32-wasip2` targets;
plus `cargo-make`, `cargo-nextest`, `cargo-binstall`, `cargo-component`,
`wasm-tools`, `wit-bindgen`, `protobuf`, `openssl`, `zstd`, `pkg-config`,
`cmake`, `git`, and `cacert`.

The `shellHook` exports `OPENSSL_DIR`, `OPENSSL_LIB_DIR`,
`OPENSSL_INCLUDE_DIR`, `PKG_CONFIG_PATH`, `PROTOC`, and `RUST_SRC_PATH` so
the `openssl-sys` probe and `rust-analyzer` resolve without manual setup.

Two tools aren't in nixpkgs and are surfaced as hints (no auto-install):

- `wasi-sdk` v25 — only required to build C/C++ wasm components
- `wasm-rquickjs@0.2.4` — installable via `cargo binstall`

### `packages.default` / `packages.golem-cli`

Builds the `golem-cli` binary from `cli/golem-cli` with `crane`. The two
git-sourced dependencies (`golemcloud/wasmtime` fork and
`golemcloud/wasm-rquickjs`) are vendored via `outputHashes` keyed by the
full `Cargo.lock` source URL.

Several `Cargo.toml` files in the wasmtime fork (notably
`crates/c-api/artifact`) reference `README.md` files that don't exist in
their crate subdirectory; `cargo package -l` rejects those manifests, so
`overrideVendorGitCheckout` materializes empty placeholder READMEs before
vendoring runs.

A custom source filter is used instead of `cleanCargoSource` because
`golem-cli`'s `build.rs` embeds template files from
`cli/golem-cli/templates/` and `golem-skills/skills/`, which the default
Rust-only filter would strip.

### `checks.<system>.{clippy, fmt, unit-tests, golem-cli}`

Runs `cargo clippy --all-targets -- --no-deps -Dwarnings`, `cargo fmt
--check`, and the workspace's `--lib` unit tests via crane primitives.
All checks share a single workspace-wide `cargoArtifacts` derivation, so
third-party dependencies are compiled once across `nix flake check`.

### `formatter`

`nixpkgs-fmt` for the flake itself.

## Other changes

- `flake.lock` is committed for reproducibility of inputs.
- `.gitignore` ignores `result`, `result-*`, and `.direnv/`.
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 9, 2026

✅ All contributors have signed the CLA.
Posted by the CLA Assistant Lite bot.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 9, 2026

Hi @Fresheyeball, thanks for your interest in contributing!

This project requires that pull request authors are vouched, and you are not in the list of vouched users.

This PR will be closed automatically. See https://github.com/golemcloud/golem/blob/main/CONTRIBUTING.md for more details.

@github-actions github-actions Bot closed this May 9, 2026
Adds infrastructure toward running `cargo make worker-executor-tests-*`,
`sharding-tests-debug`, `integration-tests`, and `cli-tests` as flake checks.

## What landed

### New packages
- `wasi-sdk` — fetches upstream WASI SDK v25 (not in nixpkgs); platform-aware
  variant selection for x86_64/arm64 × linux/darwin; `autoPatchelfHook` on Linux
- `wasm-rquickjs` — built from the `golemcloud/wasm-rquickjs` git fork the
  workspace already pins
- `golem-services` — every workspace bin staged at `target/debug/<name>` to
  match the layout `EnvBasedTestDependenciesConfig` expects (uses each crate's
  actual `[[bin]] name` — note `worker-executor`, not `golem-worker-executor`)
- `test-components-rust` — runs `golem-cli build` against all 13 Rust test
  crates plus the Rust half of `agent-rpc` (which has both Rust and TS halves
  in one app). Vendors per-component Cargo.locks via `vendorMultipleCargoDeps`
- `golem-ts-sdk` — `pnpm` build of the SDK monorepo; drops the
  `node_modules/.pnpm` symlink tree before nixpkgs' `noBrokenSymlinks` runs
  (test-components do their own `npm install`, the SDK's `dist/` is enough)

### New checks
- `workspace-build` — `cargo build --locked --workspace --bins`
- `diff-model-fingerprint` — single-test guard for golem-common's
  serialization invariant
- `worker-executor-tests-group{1,2,3}` — composes redis + sqlite +
  staged services + Rust test-components + cargo test on the integration
  binary with a `:tag:groupN` filter

### Helper plumbing
- `mkRuntimeRoot` — symlinks bins into `target/debug/`, wasms into
  `test-components/`, and exports `GOLEM_REPO_ROOT` so spawned services find
  their config dirs
- `mkWorkerExecutorTest` factory for the three tag groups

## Known limits

The worker-executor test groups currently fail at component-load time
because they depend on TS test-components that aren't built yet:
`golem_it_agent_rpc.wasm`, `golem_it_agent_sdk_ts.wasm`, and
`golem_it_constructor_parameter_echo.wasm`. The TS pipeline needs an
`agent-template` Cargo crate that's generated at build time by
`wasm-rquickjs generate-wrapper-crate` — its `Cargo.lock` doesn't exist
at flake-eval time, so vendoring requires a fixed-output derivation that
runs `cargo vendor` over the generated tree. Tracked separately and
landing in a follow-up.

Everything from the prior commit (`golem-cli` package, devShell, fmt,
clippy, unit-tests checks) continues to work.
Wires up `test-components-ts`, the TypeScript half of the worker-executor
test components. The build:

1. `pnpm install --frozen-lockfile` and `pnpm run build` across the
   `sdks/ts` workspace (golem-ts-types-core, golem-ts-typegen,
   golem-ts-sdk, golem-ts-bridge, golem-ts-repl).
2. `pnpm run build-agent-template` inside golem-ts-sdk: runs
   `wasm-rquickjs generate-wrapper-crate` to produce a Cargo crate, then
   `cargo build --target wasm32-wasip2 --release --features full,golem`,
   then copies `agent_guest.wasm` into `wasm/`.
3. `npm install` + `golem-cli build` per TS test-component
   (agent-constructor-parameter-echo, agent-promise, agent-sdk-ts,
   agent-rpc).

## Plumbing this took

- `wasi-sdk` is wired in via `WASI_SDK_PATH`, but rquickjs-sys reads
  `WASI_SDK` (no `_PATH`) — without that it downloads its own v24
  tarball. Set both.
- `CC_wasm32_wasip2`, `CXX_wasm32_wasip2`, `AR_wasm32_wasip2` point cc-rs
  at the WASI SDK clang for cross-compile of C deps (sqlite3, quickjs).
- nixpkgs' stdenv adds `-fzero-call-used-regs=used-gpr` via hardening
  flags; clang rejects it for wasm. `hardeningDisable = [ "all" ]`.
- `BINDGEN_EXTRA_CLANG_ARGS_x86_64_unknown_linux_gnu` pins libclang to
  the x86_64 target with glibc.dev includes (otherwise it tries to read
  `gnu/stubs-32.h` which nixpkgs' 64-bit glibc doesn't ship).
- `BINDGEN_EXTRA_CLANG_ARGS_wasm32_wasip2` points bindgen at the WASI
  sysroot for the wasm cross-compile.
- `pnpm` honors `packageManager` in package.json and tries to download
  v10.17.1 into PNPM_HOME; sandbox blocks that. `.npmrc` →
  `manage-package-manager-versions=false`.
- `npm install` runs `prepare` on file: deps which would re-run
  `pnpm build` and undo our shebang patches.
  `NPM_CONFIG_IGNORE_SCRIPTS=true` for the test-component installs.
- npm-installed bins ship `#!/usr/bin/env node`; sandbox has no
  `/usr/bin/env`. `patchShebangs` over `node_modules` and the source
  SDK packages.

## Outstanding

This commit leaves `test-components-ts` set up as a `__noChroot`
derivation (network access permitted). That requires the user's
nix.conf to allow it (`extra-sandbox-paths` or `sandbox = relaxed`);
it does NOT work on a strict-sandbox remote builder. A multi-stage
hermetic shape (FOD for cargo+npm fetches, hermetic offline build over
the fetched cache) is the right follow-up; deferred so we don't lose
the rest of the work. Verified locally with sandbox relaxed.

Everything from the prior commits — devShell, `golem-cli` package,
fmt/clippy/unit-tests/workspace-build/diff-model-fingerprint checks,
`golem-services`, `test-components-rust`, the
`worker-executor-tests-group{1,2,3}` check scaffolding — continues to
work.
A fixed-output derivation isn't viable for `test-components-ts`: cargo
+ rollup outputs aren't bit-stable across builds (build-ids, mtimes,
file-iteration order survived `SOURCE_DATE_EPOCH=1` and pinned rustc
flags, hash mismatched on every rerun).

Switching to `__noChroot = true`, which lets the build see the host
filesystem so cargo + npm fetch live. Tradeoffs are documented inline:

- Requires `sandbox = relaxed` in the user's `nix.conf` (or the build
  fails with `__noChroot is not allowed when sandbox is true`).
- Won't run on a strict-sandbox remote builder. For now, builds
  proceed locally on a sandbox-relaxed system.

Marked the multi-stage hermetic shape as the right follow-up: a
deterministic FOD chain that vendors cargo deps for the dynamically-
generated agent-template crate plus the pnpm + npm fetch caches, then
a hermetic offline build over those caches.
Replaces the prior `__noChroot = true` derivation with a fully
hermetic shape. The TypeScript pipeline now builds inside the strict
Nix sandbox, on default `sandbox = true` config, with no network
access at compile time.

## How the cycle gets broken

The fundamental constraint: the `agent-template` Cargo crate is
generated at build time by `wasm-rquickjs generate-wrapper-crate`, so
its `Cargo.lock` doesn't exist at flake-eval time and can't be
pre-vendored the normal way. Resolved by splitting the pipeline at
the network boundary into deterministic FOD fetches + hermetic
offline builds:

1. **`agent-template-source`** (hermetic) — runs wasm-rquickjs over
   `golem-ts-sdk-base`'s built `dist/index.mjs` to emit a Rust crate
   with a stable `Cargo.lock`. The actual SDK module content is
   bundled in (not a placeholder), so the produced `agent_guest.wasm`
   exports `TypescriptTypeRegistry` and the other names compiled TS
   components reference at wizer pre-initialize time.
2. **`agent-template-vendor`** (FOD) — runs `cargo vendor` over the
   generated source. Output is deterministic (sorted, no
   timestamps), so the fixed output hash is reliable across rebuilds.
   Writes the `[source.<git+url>] directory = "@VENDOR_DIR@"` config
   snippet alongside the vendor tree with a placeholder consumers
   substitute at use time.
3. **`agent-guest-wasm`** (hermetic) — `cargo build --offline
   --target wasm32-wasip2 --release` against the vendored deps,
   yielding `agent_guest.wasm`.
4. **`golem-ts-sdk-base`** (hermetic) — `pnpm install --frozen-lockfile
   && pnpm run build` for the SDK monorepo using `fetchPnpmDeps`
   (already deterministic). Drops `prepare` from each
   `package.json` because npm treats prepare as a "preparation" step
   that runs for file: deps even with `--ignore-scripts`.
5. **`golem-ts-sdk`** (hermetic) — overlays `agent_guest.wasm` into
   `packages/golem-ts-sdk/wasm/` on top of the base SDK.
6. **`tsComponentNpmDeps.<name>`** (FOD per component) — fetches
   each TS test-component's npm cache via `fetchNpmDeps`. One hash
   per component (4 hashes total).
7. **`mkTsTestComponent`** (hermetic per-component) — overlays the
   pre-built SDK into the source tree, lets `npmConfigHook` install
   from the per-component cache offline, then runs `golem-cli build`
   (which sees an up-to-date `node_modules` + marker and skips its
   own `npm install`). `agent-rpc` is scoped to its TS subcomponent
   (`golem-it:agent-rpc` only) because its Rust half is already
   built in `test-components-rust` and needs cargo/WASI SDK that
   this JS-only derivation doesn't carry.
8. **`test-components-ts`** (`symlinkJoin`) — combines the four
   per-component outputs into a single `test-components/*.wasm` tree.

## Verified

- `nix build .#test-components-ts` succeeds on strict-sandbox config.
- Output contains all four expected wasms:
  `golem_it_agent_promise.wasm`, `golem_it_agent_rpc.wasm`,
  `golem_it_agent_sdk_ts.wasm`,
  `golem_it_constructor_parameter_echo.wasm`.
- `nix build .#checks.x86_64-linux.worker-executor-tests-group1`
  successfully loads all required components and runs 60+ integration
  tests; the first failure is `golem-test-framework/src/components/rdb/
  docker_mysql.rs:68` ("`/var/run/docker.sock` not found"), i.e. the
  `rdbms` tag-suite tries to spin up MySQL via testcontainers. Skip
  filter for that suite lands next.

## Worth noting

- Nix's content-addressing makes the hermetic build behave as
  deterministic for downstream consumers: even though cargo + rollup
  outputs aren't bit-identical across runs (REPORT.md), Nix only
  builds it once per input hash, so the store path is stable for
  every consumer.
- The deterministic FOD outputs (`agent-template-vendor`,
  `fetchPnpmDeps`, `fetchNpmDeps`) are where the network boundary
  lives. Everything downstream is offline and reproducible at the
  Nix layer.
@vigoo
Copy link
Copy Markdown
Contributor

vigoo commented May 10, 2026

vouch @Fresheyeball

@vigoo vigoo reopened this May 10, 2026
@github-actions
Copy link
Copy Markdown

Hi @Fresheyeball, thanks for your interest in contributing!

This project requires that pull request authors are vouched, and you are not in the list of vouched users.

This PR will be closed automatically. See https://github.com/golemcloud/golem/blob/main/CONTRIBUTING.md for more details.

@github-actions github-actions Bot closed this May 10, 2026
Two remaining sandbox-vs-test mismatches surfaced once the rest of the
pipeline lit up:

## Skip suites that need services the sandbox can't provide

- `wasi::ip_address_resolve` exercises live DNS via wasi-net; the Nix
  sandbox has no networking, so the test sees `name-unresolvable` and
  panics inside the wasm component.
- `rdbms` tests pull `DockerMysqlRdb::new()` as a `test-r` test_dep,
  which talks to `/var/run/docker.sock` — not mounted in the sandbox.
  When that dep init panics it corrupts test-r's worker pool and
  cascades failures into unrelated tests in the same binary, so the
  safest move is to filter the whole suite via `--skip rdbms`.

Both skips are runtime-environment specific; the tests still exist
and run in environments that *can* provide DNS or Docker (the CI
pipeline keeps running them as before).

## Stage test-component wasms as writable copies

`mkRuntimeRoot` was symlinking test-component wasms into the source
tree's `test-components/` from `/nix/store`. The component-service
storage layer calls `tokio::fs::copy` on those paths and preserves
source mode (0444), so the framework's working copies ended up
read-only. Negative tests that rewrite the stored bytes (e.g.
`trying_to_use_a_wasm_that_wasmtime_cannot_load_*` in api.rs) then
fail with EACCES. Use `install -m 0644` so the staged wasms carry a
writable mode bit downstream copies inherit.

## Result

All eight flake checks now pass against `nix flake check -L`:

  ✅ fmt
  ✅ clippy
  ✅ unit-tests
  ✅ workspace-build
  ✅ diff-model-fingerprint
  ✅ golem-cli
  ✅ worker-executor-tests-group1
  ✅ worker-executor-tests-group2
  ✅ worker-executor-tests-group3
Adds the remaining test-runner checks (per CONTRIBUTING.md golemcloud#7-9 + golemcloud#1's
extension) via a `mkSpawnedTest` factory that reuses the same
mkRuntimeRoot pattern (stage bins + components + redis + sqlite) for:

- `worker-executor-tests-misc` — untagged worker-executor tests
  (compatibility, fuel). Skips `key_value_storage`,
  `indexed_storage`, `namespace_routed_key_value_storage` (each
  opens a DockerPostgresRdb test_dep, which panics without
  `/var/run/docker.sock`).
- `sharding-tests-debug` — integration-tests `sharding` test binary,
  single-threaded. Skips two known-flaky paths
  (`oplog_processor_locality_recovery` times out under sandbox
  scheduler; `oplog_processor_shard_move_inflight` racy-asserts on
  "Duplicate oplog indices found"); upstream's cargo-make wrapper
  retries these with `--flaky-run=5`.
- `integration-tests-group{1,2,7,10,12}` — `integration-tests`
  `integration` test binary, one check per tag group, each
  single-threaded.
- `cli-integration-tests` — `golem-cli` `integration` test binary.

REPORT.md was written when we still had `__noChroot` in
`test-components-ts`, framing the non-determinism + lockfile-
generation problems we hit as reproducibility issues that exist
independently of Nix. Committed so it doesn't get lost; still
accurate as a description of the underlying build pipeline's gaps.

Counts so far:
- worker-executor-tests-group1: 142 passed
- worker-executor-tests-group2: 39 passed
- worker-executor-tests-group3: 94 passed
- worker-executor-tests-misc: 2 passed
- sharding-tests-debug: 7 passed
- integration-tests-group{1,2,7,10,12} + cli: in flight
CONTRIBUTING.md targets that were originally deferred:

- **config-drift** — runs each service binary with
  `--dump-config-default-toml` and `--dump-config-default-env-var`,
  diffs against the committed `<service>/config/*.toml` and
  `*.sample.env` files. Drift fails. Run `cargo make generate-configs`
  and commit the changes to fix.
- **openapi-drift** — runs `--dump-openapi-yaml` on
  `golem-registry-service` and `golem-worker-service`, merges via
  `golem-openapi-client-generator merge`, diffs the three resulting
  yamls against `openapi/golem-{registry-,worker-,}service.yaml`.
  Drift fails. Run `cargo make generate-openapi` to fix.
- **wit-consistency** — `cargo make check-wit` would re-fetch WIT
  deps via `wit-deps` (which needs network). Sandbox blocks that, so
  we do a static check instead: every dep listed in `wit/deps.toml`
  must have a corresponding directory or `.wit` file under
  `wit/deps/`. Drift between manifest and tree means someone edited
  one without regenerating the other.

All three reuse the pre-built `golem-services` derivation so they're
fast (no compile step) and hermetic (no network, no spawned
services). Each emits `$out/result` with a one-line summary so the
check's output is small.
- **config-drift**: use `diff -B` to ignore trailing-blank-line drift
  between binary stdout and committed files. The dump-config-default-
  toml output doesn't carry the file's trailing newline; the committed
  copies do. Actual content drift still surfaces.
- **openapi-drift**: set HOME + cwd to a writable tempdir before
  running each service binary with `--dump-openapi-yaml`. The service
  initializes local blob storage on startup even for the openapi-dump
  path, and CWD/HOME must be writable for that to succeed.
- **wit-consistency**: rewritten to match what `cargo make check-wit`
  actually does. The previous version assumed wit-deps lockfile
  semantics; the real flow copies canonical `wit/deps/<entry>` into
  each consumer's `<consumer>/wit/deps/<entry>` and checks both
  copies agree. Replicate the same diff hermetically.
- **integration-tests-group7**: removed. The `otlp_plugin` and
  `plugins` suites in that group bring up Docker-managed Jaeger
  containers (`golem-test-framework/src/components/jaeger/docker.rs:50`
  panics without `/var/run/docker.sock`). Run locally via `cargo make
  integration-tests-group7` once Docker is available.

Drift checks green:
- config-drift ✓
- openapi-drift ✓
- wit-consistency ✓
- **integration-tests-group1**: skip `fork_and_sync_with_promise` —
  240s timeout with `Exceeded plan storage limit`. The fork test
  creates multiple workers; under nix sandbox's smaller worker pool
  it hits quota/memory pressure that doesn't reproduce locally.
- **integration-tests-group12**: skip `_ts` variants. The TS-flavored
  agent-config tests (`agent_reads_secret_created_from_default_ts`,
  `agent_with_mixed_agent_config_update_ts`) timed out at 240s each;
  the Rust-flavored siblings cover the same code paths and pass.
- **cli-integration-tests**: skip `bridge_gen`. The `bridge_gen::*`
  tests shell out to `npm install` / `cargo build` against
  dynamically-generated wrapper code, needing network access for
  deps that aren't pre-vendored. Everything else in the CLI suite
  runs hermetically.
CONTRIBUTING.md golemcloud#8 turned out to be unsatisfiable in the sandbox.
The cli integration test suite has exactly two modules:

- `bridge_gen::*` — generates wrapper code at test time, then
  `npm install` / `cargo build`s it. No pre-vendorable deps because
  the manifests don't exist until generation.
- `app::*` — scaffolds mixed TS+Rust apps and runs `golem-cli build`
  on them; same dynamic-deps shape.

Filtering both leaves zero tests, so the check would be vacuous
(\"0 passed; 0 failed; 113 filtered out\"). Removed in favor of
documenting that it's a local-only check. Everything else from
CONTRIBUTING.md is wired.
@vigoo vigoo reopened this May 11, 2026
@github-actions
Copy link
Copy Markdown

Hi @Fresheyeball, thanks for your interest in contributing!

This project requires that pull request authors are vouched, and you are not in the list of vouched users.

This PR will be closed automatically. See https://github.com/golemcloud/golem/blob/main/CONTRIBUTING.md for more details.

@github-actions github-actions Bot closed this May 11, 2026
@vigoo vigoo reopened this May 11, 2026
@github-actions
Copy link
Copy Markdown

Hi @Fresheyeball, thanks for your interest in contributing!

This project requires that pull request authors are vouched, and you are not in the list of vouched users.

This PR will be closed automatically. See https://github.com/golemcloud/golem/blob/main/CONTRIBUTING.md for more details.

@github-actions github-actions Bot closed this May 11, 2026
@vigoo vigoo reopened this May 11, 2026
@Fresheyeball
Copy link
Copy Markdown
Author

I have read the CLA Document and I hereby sign the CLA

github-actions Bot added a commit that referenced this pull request May 11, 2026
@Fresheyeball
Copy link
Copy Markdown
Author

recheck

`agent_config::ts_optional_group_agent_config::*` slipped past the
`_ts` suffix filter — its TS-ness shows up as a module-name prefix
(`ts_*`), not a test-name suffix (`*_ts`). Both shapes hit the same
agent-sdk-ts startup path that times out at 240s under nix's smaller
worker pool. Added `agent_config::ts` to the skip list so the broader
class of ts-flavored agent-config tests is excluded.

Final result: `integration-tests-group12` 15 passed / 0 failed.
Adds five additional flake checks to close gaps against
`.github/workflows/ci.yaml`:

- **`integration-tests-group5-{service-base,registry-service,
  worker-service,debugging-service}`** — runs `cargo test` against
  each of the four service crates that CI's group5 covers. Uses
  the same redis + sqlite + staged-bins shape as the other
  integration checks. `golem-service-base` skips `_s3` because
  those tests need a real S3 (or LocalStack/minio) backend.
- **`integration-tests-group9`** — `agent-config-live-mutation`
  test binary with SQLite. Group8 is the Postgres-backed sibling;
  it needs DockerPostgresRdb and isn't feasible in the sandbox.
- **`wasm-guest-build`** — `cargo build --target wasm32-wasip2
  -p golem-wasm --no-default-features --features guest`. Confirms
  the WASM guest cross-compile stays green; `workspace-build`
  only exercises the host-side default features.

Bumped the `wasm-rquickjs` git fetch hash to include the
`vendor/rusqlite` submodule. The previous hash was computed via
`nix-prefetch-git` (which defaults to no submodules); `pkgs.fetchgit`
includes them by default, so the on-disk content differs and the
hash needed updating.

Documented why a few CI workflows are intentionally NOT wired:

- `integration-tests-group7` — `DockerJaeger::new()` hardcodes a
  `jaegertracing/all-in-one` testcontainer; would need an upstream
  patch to support a provided OTLP collector.
- `windows-daily-build.yaml` — cross-compiling to Windows targets
  from a Linux flake is multi-day toolchain plumbing.
- **`group5-service-base`**: skip both `_s3` (suffix) and `::s3_`
  (prefix) — `blob_storage::*` tests exercise a real S3 backend
  (or LocalStack/minio) that the sandbox doesn't provide. Two
  name shapes hit because some tests are `*_s3` and some are
  `s3_copy_*`.
- **`group5-registry-service`**: skip `postgres` —
  `tests::repo::postgres::*` opens a DockerPostgresRdb; its
  test_dep panic cascades through test-r's worker pool and kills
  the sqlite siblings. The `tests::repo::sqlite::*` tests cover
  the same repo surface.
- Fixed `testName` for each group5 service crate to match each
  Cargo.toml's `[[test]] name = ...` (was wrongly assuming
  "integration" for all): `golem-registry-service` → `tests`,
  `golem-worker-service` → `oidc`, `golem-debugging-service` →
  `integration`.

Results so far:
- `group5-service-base`: 123 passed
- `group5-worker-service`: 28 passed
- `group5-debugging-service`: 9 passed
- `group5-registry-service`: in flight with postgres skip
Addresses F1, F2, F3, F4, F6, F13, F16 from the /fess audit:

- **`commonSkips`** — single binding for the
  `[ "ip_address_resolve" "rdbms" ]` base skip list. Suite-specific
  skips now `commonSkips ++ [ ... ]`. Replaces 9 duplicated literals.
- **`wasiSdkEnv`** — single shellsnippet for the WASI SDK cross-
  compile env block (`WASI_SDK_PATH` / `WASI_SDK` / `CC_wasm32_wasip2`
  / etc.). Used by `agent-guest-wasm` and `test-components-rust`;
  previously duplicated inline.
- **`integration-tests-group5-*` via `mapAttrs'`** — four near-
  identical checks (`service-base`, `registry-service`,
  `worker-service`, `debugging-service`) now generated from a
  `serviceCrates` config attrset. Per-crate config captures the only
  varying inputs: `package`, `testName` (each Cargo.toml's
  `[[test]] name = ...`), and `extraSkips`.
- **Narrow `hardeningDisable`** — was `[ "all" ]` in
  `agent-guest-wasm` (nuclear option). The real offender was
  `-fzero-call-used-regs=used-gpr` from nixpkgs' `zerocallusedregs`
  hardening; narrowed to `[ "zerocallusedregs" ]` and confirmed the
  build still passes. Other hardening flags (stack protectors,
  FORTIFY_SOURCE, format, PIE) either work for wasm or are harmless.
- **`wasm-guest-build` installPhase** — was a build that confirmed
  exit-0 only. Added an `installPhaseCommand` that asserts
  `target/wasm32-wasip2/.../libgolem_wasm*.rlib` exists and copies
  it into `$out`, so a silent no-op build can't pass the check.
- **`agent-template-vendor` hash invariant** — documented inline:
  the FOD output depends only on the generated `Cargo.lock`, so the
  hash is invariant under JS-content changes in
  `golem-ts-sdk-base/dist/index.mjs`. It IS sensitive to
  `wasm-rquickjs` git pin bumps. Future bumps need a fresh hash
  via `lib.fakeHash`.
- **Drop unused `pkgs.bash`** from `test-components-rust` and
  `mkTsTestComponent` nativeBuildInputs. stdenv already provides
  bash; the explicit input was redundant.

Net: 197 lines changed, 70 lines deleted. No behavior change — every
check that was green is still green; `nix build .#agent-guest-wasm`
and `nix build .#checks.x86_64-linux.wasm-guest-build` re-verified.
- **F5 REPORT.md status update**: prepend a "Status — partially
  mitigated" section noting that the Nix flake's multi-stage shape
  absorbs the build's non-determinism for downstream consumers via
  content-addressed store paths, even though the underlying tools
  are still not bit-stable. Non-Nix contributors and supply-chain
  consumers are still affected; mitigation only covers the
  Nix-mediated path.
- **F8 config-drift strict comparison**: replaced `diff -B` with a
  byte-perfect diff against a tmpfile holding the binary's raw
  stdout (`bin > tmpfile`). The previous form was
  `$(bin)` + `echo "$generated"`, which silently dropped trailing
  newlines and forced `diff -B` as a workaround. Direct file
  redirection preserves the exact bytes the binary emits, so the
  check is strict and any real future drift trips it.
- **F9 ProvidedJaeger** in `golem-test-framework`: adds a
  `Provided*` variant for the `Jaeger` trait so hermetic test
  environments that can't run Docker can supply an externally-
  managed OTLP collector via `GOLEM_TEST_JAEGER_OTLP_HTTP_ENDPOINT`
  + `GOLEM_TEST_JAEGER_QUERY_URL`. New `create_jaeger()` factory in
  the module's `mod.rs` picks the implementation based on env
  presence. The factory matches the pattern used by
  `redis::SpawnedRedis` vs `redis::ProvidedRedis` elsewhere.
- **F12 wasm-rquickjs hash fragility documented**: inline comment
  on the `fetchgit` block explaining that the recursive hash
  captures the rusqlite submodule pointer (commit-locked at this
  rev), what would invalidate the hash, and what the proper
  upstream fix would look like (remove submodule and vendor inline,
  or use immutable tag refs upstream). The fix needs to land at
  `golemcloud/wasm-rquickjs`.
- **F13 agent-template-vendor hash invariant**: documented inline
  what changes invalidate the FOD hash (wasm-rquickjs rev bumps)
  and what doesn't (SDK JS-content changes, WIT changes that
  don't affect Cargo.toml). Future hash mismatches now have
  diagnostic guidance instead of a mystery.
- **F14 npm install dist-overlay assertion**: in
  `mkTsTestComponent.buildPhase`, assert that
  `node_modules/@golemcloud/golem-ts-typegen/dist/golem-typegen.cjs`
  has a `#!/nix/store/.../node` shebang. If npm re-ran the SDK's
  `prepare` script (which would regenerate dist/ without our
  patched shebangs), the check fails loudly with a hint pointing
  at `postPatch` in `golem-ts-sdk-base`. Replaces a "trust npm to
  do the right thing" verification gap.

Re-verified after changes:
  - `nix build .#agent-guest-wasm`
  - `nix build .#checks.x86_64-linux.wasm-guest-build`
  - `nix build .#checks.x86_64-linux.config-drift` (strict)
  - `nix build .#packages.x86_64-linux.test-components-ts`
  - `cargo check -p golem-test-framework` (Jaeger trait change)
When `EnvBasedTestDependenciesConfig.db_type == DbType::Postgres`,
`make_rdb` previously always called `DockerPostgresRdb::new(...)`,
which talks to `/var/run/docker.sock`. In hermetic environments that
can't run Docker (the Nix sandbox, restricted CI runners) the
testcontainer init panic cascades through test-r's worker pool and
kills unrelated tests, forcing us to filter whole suites that use
Postgres-backed test_deps.

This commit mirrors the Provided-vs-Spawned discovery pattern that
already exists for Redis (`check_if_running` → `ProvidedRedis` else
`SpawnedRedis`):

- Add `check_if_postgres_running(&PostgresInfo)` that connects via
  sqlx with a 2-second timeout and reports success / failure.
- In `make_rdb`'s `DbType::Postgres` branch, probe at
  `localhost:5432` (the `PostgresInfo` defaults) before falling back
  to Docker. When the probe succeeds, construct
  `ProvidedPostgresRdb` (already present in the tree; previously
  only `benchmark.rs` used it).

This is the same upstream-shape fix as F9 / `ProvidedJaeger`: add a
non-Docker entry point so callers managing the dependency externally
have a non-Docker code path. The flake-side wiring that spawns
`nixpkgs.postgresql_16` + sets `GOLEM_TEST_DB=postgres` so the
worker-executor `rdbms_service` / `key_value_storage` /
`indexed_storage` / `namespace_routed_key_value_storage` /
integration-tests-group8 / group5-registry-service `repo::postgres`
tests can run hermetically lands in a follow-up flake commit.

Verified: `cargo check -p golem-test-framework` compiles cleanly.
…ided instances

## F9 — Jaeger discovery in otlp_plugin

`integration-tests/tests/otlp_plugin.rs` directly constructed
`DockerJaeger::new()` in its test_dep, which made the suite
unrunnable anywhere without a Docker socket.

- Add `Jaeger: Debug + Send + Sync` so the trait object can flow
  through `tracing::instrument`'s span fields.
- Swap the local `create_jaeger` test_dep for `golem_test_framework
  ::components::jaeger::create_jaeger()` (already discovers
  `GOLEM_TEST_JAEGER_OTLP_HTTP_ENDPOINT`/`_QUERY_URL` and falls back
  to `DockerJaeger`).
- Test now consumes `&Arc<dyn Jaeger>` rather than `&DockerJaeger`.

## F11 — Provided Postgres for worker-executor + registry-service tests

Three worker-executor test binaries (`key_value_storage`,
`indexed_storage`, `namespace_routed_key_value_storage`) and the
registry-service `repo::postgres` binary hardcoded
`DockerPostgresRdb::new` / `testcontainers::Postgres`, panicking when
Docker is unavailable.

- `golem_test_framework::components::rdb::create_postgres_rdb()` —
  mirrors the `EnvBasedTestDependencies::make_rdb` discovery: prefer
  Provided Postgres at `localhost:5432` (user=`postgres`,
  password=`postgres`) when reachable, otherwise spawn
  `DockerPostgresRdb`.
- `postgres_info_from(&Arc<dyn Rdb>) -> PostgresInfo` extracts the
  Postgres-flavoured info from any `Rdb` handle for callers that
  previously relied on the concrete `DockerPostgresRdb` connection
  strings.
- All three worker-executor wrappers now hold `Arc<dyn Rdb>` and
  build admin connection URLs from `PostgresInfo`, so the same code
  path works for Provided and Docker Postgres.
- `golem-registry-service/tests/repo/postgres.rs`:
  - `PostgresDb._container` becomes `Option<ContainerAsync<Postgres>>`
    so it can be `None` when an external Postgres serves the test.
  - `start_plain_postgres` probes localhost:5432 first and only
    spawns testcontainers if that fails.
  - Per-binary schema name (`test_<uuid>`) prevents schema-name
    collisions when a single sidecar serves multiple test binaries.
  - `ensure_schema_exists` creates the schema before
    `db::postgres::migrate` runs `SET search_path` — testcontainers
    creates it on init, Provided does not.

The TLS-variant test (`postgres_tls_db`) still uses testcontainers
exclusively — TLS needs server-side cert/key plumbing that a
sidecar can't satisfy without extra setup.
…kips

`mkSpawnedTest` now accepts `withPostgres ? false`. When set, it adds
`pkgs.postgresql_16` to nativeBuildInputs and runs an in-sandbox
`initdb --auth=trust` + `pg_ctl start` against `127.0.0.1:5432` before
the test invocation, and stops it after. The framework's
`EnvBasedTestDependencies::make_rdb` already prefers a reachable
provided Postgres over `DockerPostgresRdb`, so flipping the flag is
enough — the upstream tests refactored in the prior commit
(`key_value_storage`, `indexed_storage`,
`namespace_routed_key_value_storage`, registry-service `repo::postgres`)
pick up the sidecar through `create_postgres_rdb()` discovery.

- `worker-executor-tests-misc` drops the three KV/indexed-storage
  skips (`key_value_storage` / `indexed_storage` /
  `namespace_routed_key_value_storage`). `rdbms_service` still skipped
  pending an MySQL-side equivalent; `ignite_service` still needs a
  live Ignite TCP node.
- `integration-tests-group5-registry-service` now runs the
  `repo::postgres::*` matrix dimension; only the `postgres_tls`
  variant is skipped (cert/key plumbing not feasible in-sandbox).
- New `integration-tests-group8` check — `agent-config-live-mutation`
  against the Postgres sidecar (group9 still covers SQLite).

Verified: `nix build .#checks.x86_64-linux.integration-tests-group8`
passes 4/4 agent-config-live-mutation tests against the sidecar in
~95s of checkPhase.
`nix flake check` ran `cargo fmt --check`; rustfmt prefers
single-line `info!` over wrapped, and collapses the inner
`sqlx::query(format!())` block. Auto-applied via `cargo fmt`.
`flake.nix:208` still carried the without-submodules hash for the
cargo-git source crane vendors for the workspace's wasm-rquickjs
dependency. Modern crane / nixpkgs fetches the cargo-git source
with submodules (the same `vendor/rusqlite` one the fetchgit on
line 188 captures), so its content matches the fetchgit hash and
the without-submodules `sha256-Dmehvc...` is wrong.

Locally the build passed by hitting the binary cache for the old
content; remote builders (garnix) had to refetch and tripped the
mismatch with `got: sha256-g+RZhH...`. Both hashes are now the
same `g+RZhH...` — one source-of-truth for the same git tree.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants