Skip to content

fix(io): validate cloud object URI buckets#1259

Open
fallintoplace wants to merge 5 commits into
apache:mainfrom
fallintoplace:fix/cloud-key-authority-validation
Open

fix(io): validate cloud object URI buckets#1259
fallintoplace wants to merge 5 commits into
apache:mainfrom
fallintoplace:fix/cloud-key-authority-validation

Conversation

@fallintoplace

@fallintoplace fallintoplace commented Jun 22, 2026

Copy link
Copy Markdown
Contributor

Summary

  • parse cloud object locations without URL-unescaping, so raw % and spaces in keys keep working
  • keep default key extraction compatible for cross-authority object paths
  • add opt-in strict authority validation with object-store.strict-authority-validation
  • wire that strict mode through S3/GCS/OSS and the real ADLS extractor
  • use the parsed object location for WalkDir path reconstruction, keep the root trailing slash, and reject malformed authority-only roots instead of walking the whole bucket
  • add regression coverage for default compatibility, strict S3/GCS rejection, ADLS strict rejection, raw key characters, WalkDir roots, and malformed roots

Why

The original hard check caught wrong-bucket paths, but it also risked rejecting valid Iceberg layouts where metadata and data live under different authorities. This keeps the stricter guard available for deployments that want it, while preserving the default behavior until the FileIO can load per-authority buckets.

Testing

  • go test ./io/gocloud -run 'TestDefaultKeyExtractor|TestBlobFileIORejectsWrongBucketObjectPaths|TestBlobFileIOWalkDir|TestAdlsKeyExtractor' -count=1
  • go test ./io -count=1
  • go run github.com/golangci/golangci-lint/v2/cmd/golangci-lint@v2.11.4 run --timeout=10m ./io/gocloud ./io
  • git diff --check

@fallintoplace fallintoplace requested a review from zeroshade as a code owner June 22, 2026 23:12
Comment thread io/gocloud/blob.go
// KeyExtractor extracts the object key from an input path
type KeyExtractor func(path string) (string, error)

var errEmptyObjectKey = errors.New("object key is empty")

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be public so that consumers can use errors.Is?

Comment thread io/gocloud/blob.go
Comment on lines +104 to +111
} else {
_, after, found := strings.Cut(location, "://")
if found {
location = after
}

key = strings.TrimPrefix(location, bucketName+"/")
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the case where the url.Parse succeeds but the condition at line 87 fails? can't we still just use parsed to figure this out instead of needing to do strings.Cut?

Comment thread io/gocloud/blob.go Outdated

func validatesURIAuthority(scheme string) bool {
switch scheme {
case "", "s3", "s3a", "s3n", "oss", "gs":

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't this missing azure authorities and others?

@fallintoplace fallintoplace force-pushed the fix/cloud-key-authority-validation branch from f76f5e4 to be4e016 Compare June 23, 2026 17:54

@tanmayrauth tanmayrauth left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice fix — the wrong-bucket rejection is clearly an improvement, and the regression tests cover it well. One thing I want to confirm before this lands:

Switching the extractor to url.Parse (blob.go:81) means every location is now parsed as a URL up front, including the string-trim fallback branch. The old code never parsed, so it tolerated object keys with a literal %. Now anything that isn't a valid URL escape gets rejected outright — and because the parse happens before the scheme gate, it hits relative keys and non-cloud schemes too, not just s3/gs:

  • s3://bucket/data/100%off/f.parquet -> old: key "data/100%off/f.parquet"; new: error invalid URL escape "%of"
  • data/100%off/file.parquet -> same, even with no scheme
  • abfs://c@acct/data/50%done.pq -> same

Literal % in a key is legal in S3/GCS, so this looks like a read-path tightening that would reject paths previously written/readable. Is that intended scope, or should the fallback branch keep handling non-parseable keys the old way?

Smaller note: EscapedPath() at blob.go:95 re-encodes raw characters, so s3://bucket/data/city=New York/f.parquet becomes ...city=New%20York/.... Symmetric within iceberg-go since write and read share the extractor, but worth calling out as a deliberate behavior change.

@fallintoplace fallintoplace force-pushed the fix/cloud-key-authority-validation branch from be4e016 to dcdb57d Compare June 23, 2026 18:43

@laskoviymishka laskoviymishka left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The authority validation here closes a genuinely nasty gap — before this, a manifest path like s3://other-bucket/data/file.parquet got silently rewritten into the configured bucket and read/written/deleted from the wrong place, and TestBlobFileIORejectsWrongBucketObjectPaths proves that's now blocked for S3/GCS/OSS.

I'd hold this before merging, though, because as written the strict check may break legitimately cross-bucket tables on the default read path. Iceberg supports write.data.path pointing at a different bucket than the table location, so tables written by Java/PyIceberg/Spark embed full s3://data-bucket/... URIs in manifests while metadata lives in s3://metadata-bucket/.... If the scan reuses one IO keyed to the metadata bucket, this now rejects files that Java S3FileIO and PyIceberg read without restriction. I'd keep the validation but make strict equality opt-in (or load a per-authority IO), not the default. I left that scan-path assumption as a question inline — if it turns out we already load an FS per data path, it drops to a nit.

The other thing is Azure: the description reads as if cross-container writes are covered, but production Azure goes through adlsKeyExtractor, which has no authority check — the Azure test only passes because it wires up defaultKeyExtractor. So that protection isn't actually there yet.

Things I'd like to settle in this PR before merge:

  • Confirm or repair the cross-bucket write.data.path read path — opt-in strict mode or a per-authority IO rather than a hard default reject.
  • Either add authority validation to adlsKeyExtractor and retarget the Azure test at it, or drop the test and state Azure is out of scope.
  • Parse root once in WalkDir and share the objectLocation for both key extraction and prefix reconstruction.
  • Keep the trailing slash on the root "." entry so the walked root matches its stored form.
  • Decide whether the malformed-authority-URI → full-bucket-walk fallback is intended; I'd reject it instead.

Once those are sorted, happy to take another pass.

Comment thread io/gocloud/blob.go
key := strings.TrimPrefix(location, bucketName+"/")
key := parsed.key
if parsed.hasAuthority {
if parsed.authority != bucketName {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My one real hesitation about merging this: the hard authority check runs on the default read path, and I think it breaks a supported Iceberg layout.

A table can set write.data.path = s3://data-bucket/ while its location and metadata live in s3://metadata-bucket/.... Manifests then carry full URIs like s3://data-bucket/data/00000.parquet. As I read it, the scan opens those through the single table IO keyed to the metadata bucket — so parsed.authority is data-bucket, bucketName is metadata-bucket, and we now reject a file that Java S3FileIO and PyIceberg read without complaint (neither restricts which bucket a path resolves to).

I'd keep the validation but make it not the default on reads — either construct a per-authority IO when a data-file URI's authority differs from the table's, or gate strict equality behind an opt-in property. Before I lean too hard on this though: can you confirm the scan path really does reuse one IO across differing data-path authorities? If it already loads an FS per data path, this concern mostly evaporates and I'd drop it to a nit. wdyt?

Comment thread io/gocloud/blob_test.go Outdated

bfs := &blobFileIO{
Bucket: bucket,
keyExtractor: defaultKeyExtractor("container@account.dfs.core.windows.net"),

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test doesn't actually exercise the Azure path. Production Azure/ADLS schemes are registered with adlsKeyExtractor (register.go), which extracts the path but never compares the authority — so WriteFile("abfs://other-container@account.../data/file.parquet") still silently lands in the opened container with the old bug intact.

The test only rejects because it wires up defaultKeyExtractor here, so it's proving the S3 extractor's behavior under an Azure-looking string rather than the real Azure code path. As written it implies a cross-container protection that doesn't exist.

I'd either add a configured authority + mismatch check to adlsKeyExtractor and point this test at it, or drop the test and note in the PR that ADLS authority validation is out of scope for now.

Comment thread io/gocloud/blob.go Outdated
}

parsed.Path = ""
location, err := splitObjectLocation(root)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We parse root twice here — once via preprocess (which runs keyExtractor) and again directly through splitObjectLocation — then use the second parse's uriPrefix to reconstruct the callback paths.

That's fine for defaultKeyExtractor, but for any extractor whose authority boundary differs from the naive up-to-/?# cut — adlsKeyExtractor today, or any custom one — the prefix used for reconstruction can diverge from what the extractor actually accepted. The two parsers disagreeing is exactly the kind of bug that stays invisible until someone plugs in a non-default extractor.

I'd parse once and share the objectLocation for both the key extraction and the prefix reconstruction, so there's a single source of truth for the authority boundary.

Comment thread io/gocloud/blob.go Outdated
return fn(parsed.JoinPath(path).String(), d, err)
if location.hasAuthority {
if path == "." {
path = strings.TrimSuffix(location.uriPrefix, "/")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The root-entry path changed shape here. The old code returned parsed.JoinPath(".")s3://bucket/ for the root entry; this returns strings.TrimSuffix(uriPrefix, "/")s3://bucket (no trailing slash), while every child is uriPrefix + paths3://bucket/.... So the root entry no longer matches the form callers passed in as the walk root.

Current callers all filter with !d.IsDir(), so the root "." entry is skipped and nothing breaks today — but it's a silent behavior change waiting for the first caller that compares a walked entry against the stored root. I'd use uriPrefix consistently for the "." case so the root keeps its trailing slash.

Comment thread io/gocloud/blob.go
return objectLocation{}, fmt.Errorf("URI authority is empty: %s", location)
}

key := ""

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A malformed authority-only URI quietly degrades into a full bucket walk, which surprised me. For s3://my-bucket?foo/bar, IndexAny stops at ?, the authority is my-bucket, and since rest[authorityEnd] is ? not /, key stays empty. That bubbles up as ErrEmptyObjectKey, which WalkDir then treats as a bucket root and promotes to walkPath = "." — walking the entire bucket from a malformed input.

Iceberg locations shouldn't carry a query or fragment, so this is an edge, but silently turning a bad URI into a full walk feels risky. I'd either reject a non-/ continuation after the authority explicitly, or at least keep that case from reaching the walk-everything fallback.

Comment thread io/gocloud/blob.go Outdated
return fmt.Errorf("invalid URL %s: %w", root, err)
}
if !errors.Is(err, ErrEmptyObjectKey) {
return &fs.PathError{Op: "walkdir", Path: root, Err: err}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small one: the rest of this file uses space-separated op names ("write file", "new writer"), so I'd make this "walk dir" to match.

While we're here — the authority-mismatch error is a plain fmt.Errorf with no sentinel, so callers can't errors.Is it the way they can ErrEmptyObjectKey. If we expect anyone to branch on it, an exported ErrWrongAuthority wrapped with %w would be more consistent. Not blocking.

Comment thread io/gocloud/blob_test.go Outdated
err := bfs.WriteFile(tt.path, []byte("content"))
require.ErrorContains(t, err, "does not match configured authority")

exists, err := bucket.Exists(ctx, tt.oldKey)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This oldKey existence check is vacuously true — nothing ever seeds oldKey, so Exists returns false even if the write had been misrouted somewhere else, and the require.ErrorContains above already guarantees no write happened.

If we want this to actually prove "not misrouted to the old key", seed oldKey with a known sentinel first and assert it's unchanged after the failed write. Otherwise I'd just drop the Exists block — the ErrorContains carries the assertion on its own.

Comment thread io/gocloud/blob_test.go Outdated
if test.wantErrContains != "" {
assert.ErrorContains(t, err, test.wantErrContains, "Expected error for input: %s", test.input)
if test.wantErrIs != nil {
assert.True(t, errors.Is(err, test.wantErrIs), "expected errors.Is(%v)", test.wantErrIs)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assert.ErrorIs(t, err, test.wantErrIs) reads better here and prints the actual error chain on failure, versus assert.True over a bare bool.

Separately, the gating assertions in this table are assert, not require — on a nil-when-error-expected case the ErrorContains fails but execution continues into this errors.Is check (and on the success path NoError continues into Equal), so one root cause produces two failures. I'd make the gate a require.

Comment thread io/gocloud/blob.go Outdated
// e.g., s3://bucket/path/file -> path/file
// ErrEmptyObjectKey is returned when a location names a bucket or authority
// without an object key.
var ErrEmptyObjectKey = errors.New("object key is empty")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ErrEmptyObjectKey is exported, but its only consumer is the internal errors.Is in WalkDir — bucket-root handling is otherwise transparent to callers. Exporting it is a permanent API commitment, so I'd either unexport it to errEmptyObjectKey, or document it explicitly as part of the KeyExtractor author contract ("custom extractors must wrap this for bucket-root inputs") — which also ties into the adlsKeyExtractor gap.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@laskoviymishka Can you please check again when you have time? Thank you for your attention.

@fallintoplace fallintoplace force-pushed the fix/cloud-key-authority-validation branch from dcdb57d to 03fb63f Compare June 24, 2026 16:41

@laskoviymishka laskoviymishka left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pushing this, replacing the hard wrong-bucket rejection with an opt-in strict mode is the right thing, and it settles the worry from the last round about valid cross-authority layouts getting bounced. The objectLocation refactor reads well, and the errEmptyObjectKey sentinel makes the WalkDir empty-root handling a lot cleaner than the old url.Parse path.

For the next step, the part I'd like us to work through is the non-strict (default) path, since that's where most users land. Two related things show up there: a cross-authority URI silently becomes a key inside the configured bucket instead of erroring, and WalkDir reconstructs paths with the authority doubled (s3://other-bucket/other-bucket/...). The old code surfaced both as a clean not-found from the SDK, so it's more a small regression on the default path than a new gap — and since Iceberg tables routinely split data and metadata across buckets, it's worth getting right. None of the new tests exercise it yet.

A couple of smaller things to settle alongside that:

  • whether ObjectStoreStrictAuthorityValidation should be a top-level exported property at all, given there's no Java/PyIceberg/Rust counterpart — a user setting it expecting portable behavior would be silently ignored everywhere but the gocloud backend
  • dropping the variadic extractObject for a required parameter, which also removes the nil-extractor fallback the doubled-authority path leans on

I left the specifics inline. None of it is far off — once the default-path behavior and the property question are sorted (plus a cross-authority WalkDir test to lock it in), I think we're close. Happy to take another pass; could land the round after, might even be ready then.

Comment thread io/gocloud/blob.go Outdated

func createBlobFS(ctx context.Context, bucket *blob.Bucket, keyExtractor KeyExtractor) icebergio.IO {
return &blobFileIO{Bucket: bucket, keyExtractor: keyExtractor, ctx: ctx}
func createBlobFS(ctx context.Context, bucket *blob.Bucket, keyExtractor KeyExtractor, extractObject ...objectLocationExtractor) icebergio.IO {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd make extractObject a required parameter rather than variadic.

The variadic reads as composable but isn't — only [0] is ever used, a second arg is silently dropped, and createBlobFS(ctx, bucket, ext, nil) is legal and quietly installs a nil extractor that sends every call down the fallback path below. The only callers relying on the optionality are a few old 3-arg test helpers; I'd rather update those four call sites than keep a fallback that papers over a real semantic difference.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your attention, I will have a look tomorrow evening.

Comment thread io/gocloud/blob.go Outdated
return objectLocation{}, err
}

key, err := bfs.preprocess(root)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This fallback computes the key two incompatible ways and they disagree on cross-authority input.

For s3://other-bucket/path, splitObjectLocation leaves location.uriPrefix = "s3://other-bucket/" with key = "path", but bfs.preprocess(root) (non-strict defaultKeyExtractor) then overwrites location.key with legacyAuthorityKey = "other-bucket/path" while uriPrefix still points at other-bucket. WalkDir reconstructs s3://other-bucket/other-bucket/path — authority doubled. If we make extractObject required (see the createBlobFS comment) this branch goes away entirely; if it stays, the key needs to come from splitObjectLocation too so prefix and key agree.

Comment thread io/gocloud/blob.go Outdated
if path == "." {
path = location.uriPrefix
} else {
path = location.uriPrefix + path

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same doubled-authority bug surfaces here on the non-strict path even with extractObject set: defaultObjectLocationExtractor folds the authority into parsed.key via legacyAuthorityKey but leaves uriPrefix as the original s3://other-bucket/, so walkPath = "other-bucket/data", the walk yields other-bucket/data/file.parquet, and this line prepends s3://other-bucket/ to give s3://other-bucket/other-bucket/data/file.parquet.

A table case would lock this down: write a file at other-bucket/data/file.parquet, walk s3://other-bucket/data non-strict against a FileIO bound to my-bucket, and assert no returned path contains a duplicated authority. Every current WalkDir test uses a matching authority or strict mode, so this path is unguarded.

Comment thread io/gocloud/blob.go

if key == "" {
return "", fmt.Errorf("URI path is empty: %s", location)
parsed.key = legacyAuthorityKey(parsed)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In non-strict mode a cross-authority URI silently becomes a key inside the configured bucket, which I think is riskier than the old behavior.

s3://data-bucket/snap-1.avro against a FileIO bound to meta-bucket writes an object at key data-bucket/snap-1.avro inside meta-bucket and reports success, where the SDK previously returned a clear not-found. I'd either make strict the default with an explicit object-store.allow-cross-authority opt-out, or at minimum emit a structured warning here so a misconfigured FileIO leaves a trace instead of silently misplacing data. Which way would you prefer — strict-by-default, or keep the legacy fold but make it loud?

Comment thread io/config.go
// ObjectStoreStrictAuthorityValidation rejects fully-qualified object paths
// whose URI authority differs from the bucket or container used to create
// the FileIO.
ObjectStoreStrictAuthorityValidation = "object-store.strict-authority-validation"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sits at the top level next to s3.*/gcs.*/adls.*, all of which are mirrored in Java and PyIceberg — but object-store.strict-authority-validation is Go-only. Java's S3FileIO/ADLSFileIO and PyIceberg's fsspec create per-authority FileIO instances and have no cross-authority knob, so a user who sets this in table properties expecting portable behavior gets it honored only by the gocloud backend and silently ignored elsewhere.

I'd either move it into the gocloud package unexported and wire it through the functional option you already have, or, if it stays exported, add a doc comment that it's honored only by the gocloud FileIO and ignored by Java/PyIceberg/Rust. wdyt?

@zeroshade zeroshade left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The opt-in strict authority validation and the splitObjectLocation hardening read well. One consistency gap that I don't think the earlier rounds covered (separate from the cross-authority / doubled-authority threads): the ADLS extractor doesn't reject malformed authority-only URIs the way the new generic parser now does. Details inline.

Comment thread io/gocloud/azure.go
}

uriPath := matches[3]
key := strings.TrimPrefix(uriPath, "/")

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

adlsURIPattern is ^(abfss?|wasbs?)://([^/?#]+)(.*)?$, so group 3 captures everything after the authority including a leading ? or #. For abfs://container@account.dfs.core.windows.net?prefix=data, uriPath is ?prefix=data and key becomes ?prefix=data - non-empty, so it's accepted as the object key. splitObjectLocation now explicitly rejects this shape for s3/gs (must be followed by an object path), and since objectLocation() also feeds WalkDir, the two paths disagree on the same malformed input. Suggest rejecting a non-empty uriPath that doesn't begin with / here too, to keep ADLS consistent with the generic parser (a matching abfs://...?prefix=data test would lock it in).

@fallintoplace fallintoplace force-pushed the fix/cloud-key-authority-validation branch from 24db6e2 to ddcfc5a Compare June 27, 2026 17:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants