docs: write architecture document #2

Open
frrist wants to merge 3 commits into main from frrist/docs/architecture

@frrist frrist commented Jun 17, 2026

Member

What this is

The design doc set for the Ingot S3-over-Forge redesign, bundled as one PR so the architecture
and the analysis that backs it are reviewed together. No code changes — these four docs are the
shared understanding we need before building.

docs/
├── aggregation-gate.md            ← start here (5 min): why this is gated on Piri
├── architecture.md                ← the target design (the main artifact)
├── rfc-pdp-minimum-piece-size.md  ← the PDP gas/economics + Piri policy that the size knobs rest on
└── pdp-cost-calculator.html       ← interactive model behind the RFC (open in a browser)

Start here

Suggested reading order

  1. aggregation-gate.md — the one-page framing. Thesis: none of the S3 design runs on Forge until Piri stops size-pooling unrelated objects into ~128 MiB pieces. Read this first; it tells you why the other two docs exist and what the two asks of the rest of the stack are.
  2. architecture.md — the target architecture, layer by layer (S3 → catalog → data → Forge/chain → cross-cutting), then a system contract, topology, gaps, and a Postgres schema appendix. This is the artifact most of your review time should go to.
  3. rfc-pdp-minimum-piece-size.md — open to the Decision summary (top). That page is the whole argument; §1–§10 are the supporting analysis and the Appendices (behind the "Evidence" banner) are the raw gas audit — skim unless you're checking the numbers.
  4. pdp-cost-calculator.html — play with piece size / churn / base fee / sender count to pressure-test the economics yourself.

Where your input is most wanted

The docs call these out explicitly; these are the decisions we want ratified, not just read:

  • Aggregation model (aggregation-gate.md, architecture.md §6, rfc §7): the proposal is no cross-object aggregation — store each object as its own piece(s) — with the size floor (min_aggregate_size, ~8 MiB) demoted to the fallback if the contracts can't take that piece volume. The open question for FilOZ / FOC: can the PDP contracts/proving take object=piece at PiB scale? (See the top-level note "Should Piri do cross-object aggregation at all?")
  • Versioned key encoding (architecture.md §3 callout): composite-key vs per-key version index; and versionId as a direct locator (scan-free, but leaks version ordering) vs opaque. We picked the direct locator — weigh the trade-off.
  • The two asks (aggregation-gate.md): (1) rethink Piri's aggregation — target no cross-object aggregation (object=piece), with a generational background aggregator as the relief valve and the size floor as the fallback; (2) prove small pieces are operationally viable in the PDP contracts at PiB scale. Do you agree this is the gate, and the sequencing?

Correctness — what was verified against source

Every gas/contract claim in the RFC was checked (by Claude!) against the actual contracts
(PDPVerifier.sol and FilecoinWarmStorageService.sol, re-verified against FWSS v1.3.0 / PDPVerifier v3.4.0 after an upstream contract update) and the calibnet
gas-benchmark fixtures. The load-bearing (dammit claude) findings held: no proof-slashing exists (the new 0.1 FIL cleanup deposit is refundable, not a penalty), add cost is size-independent (~120,700 gas), every on-chain hot path is bounded (no O(live pieces) loop), and the calibnet anchors matched exactly. Corrections that came out of the audit and are already baked in:

  • ⚠️ The addPieces batch cap was removed upstream (FWSS v1.3.0 / PDPVerifier v3.4.0). EXTRA_DATA_MAX_SIZE / MAX_ADD_PIECES_EXTRA_DATA_SIZE are gone; batch size is now bound by the FVM PiecesAdded event-size + per-tx gas. So the registration-throughput numbers (pieces/tx, txns/PiB, sender fleet) are an open measurement, not fixed — the earlier ~13-pieces/tx / ~11M-txn / ~12-sender figures are withdrawn pending re-measurement. (Good news for "many small pieces": the contract wall that would have throttled them is gone.)
  • Storage price is 2.5 USDFC/TiB/mo (not 0.9), now an immutable constant (lib/PriceListUSDFC.sol); v1.3.0 also adds a per-dataset USDFC fee layer. Doesn't move the margin conclusion.
  • Citation fixes (nextPieceId now :800, standalone-submit jobqueue.go:205, challenge-window 60 mainnet / 20 calibnet / 10 devnet, etc.).

Scope / non-goals

  • Docs only — no implementation in this PR. The architecture explicitly supersedes the bootstrap MVP currently in the tree.
  • Catalog GC, the allocate-by-size security model, and the batch allocate/accept wire format are named as known gaps, not solved here (architecture.md §11).
  • Some RFC code references point at sibling repos (piri, filecoin-services/FilOzone) by design, since the gas model is derived from them.

frrist added 2 commits June 17, 2026 12:59
…inst source

Move the architecture doc under docs/ and add the supporting design set so
the whole thing is reviewable as one PR:

- docs/rfc-pdp-minimum-piece-size.md — PDP gas/economics RFC + Piri policy
- docs/aggregation-gate.md — "read first" brief framing the aggregation gate
- docs/pdp-cost-calculator.html — interactive cost model
@frrist frrist self-assigned this Jun 18, 2026

@bajtos bajtos left a comment

This is a well-written, detailed proposal. Love it!

I am afraid I ran out of energy & concentration by the time I got to the middle of docs/architecture.md. I'll continue the review of section 7 and onwards tomorrow.

Comment thread docs/aggregation-gate.md Outdated
- **`min_aggregate_size` (floor, ~8 MiB).** Any blob **≥ floor** becomes its **own** on-chain piece,
so deleting it removes a whole piece (O(1)) with no survivors to re-hash. Any blob **< floor** joins
a **cross-object** sub-floor aggregate built up to the floor — but **lazily**: never repacked on
delete; dead bytes linger and compact periodically (or never). They are cheap bytes.

never repacked on delete; dead bytes linger and compact periodically (or never).

We need to consider requirements for compliance with GDPR and other regulations.

In fil-one/RFC#9 (comment), we are relying on Forge to remove deleted files from PDP within 24 hours after deletion.

That means we cannot let dead bytes linger forever, we need to compact them within that 24 hour window.

Member Author

Good point, and I think the key is to separate two obligations that this line is conflating:

  1. Compliance / "the customer's data is gone" — met by cryptoshredding, not by byte removal. The committed encryption RFC is explicit that true byte-removal is a slow, future mechanism and that the deletion path is (a) remove the entry from the bucket, (b) destroy the region key:
  1. Deletes should be fast. When an object is deleted, its region keys are cryptoshredded. Removing the underlying blob bytes from the network is a separate, future mechanism; this RFC is about the keys... The first step to deleting is removing the entry from the bucket. The second is deleting the region key, truly preventing access.

source (LN 28)

Cryptoshredding is realistically our only way to hit a 24h window anyway — a Piri operator can't be forced to delete bytes on demand, but we can destroy the key that makes those bytes readable.

  2. Cost: compacting/removing the dead bytes (and retiring the on-chain piece) so we stop paying to store and prove them. This is the part "linger... or never" was talking about, and it's a cost concern with a looser SLA, not the compliance clock.

So the full delete path as I see it:

  • Immediate: remove the key from the MST — matches S3 GET/HEAD → NoSuchKey semantics.
  • Within 24h: cryptoshred the region key — the data is unreadable; this is what satisfies the compliance obligation.
  • Looser, somewhat unbounded: remove the bytes from Piri's MinIO store and retire the on-chain piece — this reduces our cost (we pay for proven bytes until they're gone), not a compliance gate.
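
As a rough illustration of the three stages above, here is a small in-memory model. It is a sketch only: `DeleteModel`, its dictionaries, and its method names are hypothetical stand-ins for the MST catalog, the region-key store, and Piri's blob store, not Ingot's actual API.

```python
from dataclasses import dataclass, field


@dataclass
class DeleteModel:
    """Toy model of the staged delete path; every name here is hypothetical."""

    catalog: dict = field(default_factory=dict)      # S3 key -> blob id (stands in for the MST)
    region_keys: dict = field(default_factory=dict)  # blob id -> encryption key
    blobs: dict = field(default_factory=dict)        # blob id -> ciphertext held by the SP
    pending_shred: list = field(default_factory=list)
    pending_retire: list = field(default_factory=list)

    def put(self, s3_key, blob_id, key, ciphertext):
        self.catalog[s3_key] = blob_id
        self.region_keys[blob_id] = key
        self.blobs[blob_id] = ciphertext

    def delete_object(self, s3_key):
        """Stage 1 (immediate): drop the catalog entry so GET/HEAD miss."""
        blob_id = self.catalog.pop(s3_key)
        self.pending_shred.append(blob_id)

    def cryptoshred(self):
        """Stage 2 (within 24h): destroy keys; the ciphertext becomes unreadable."""
        while self.pending_shred:
            blob_id = self.pending_shred.pop()
            del self.region_keys[blob_id]
            self.pending_retire.append(blob_id)

    def retire_bytes(self):
        """Stage 3 (looser SLA): remove the bytes so they are no longer stored or proven."""
        while self.pending_retire:
            self.blobs.pop(self.pending_retire.pop(), None)

    def readable(self, blob_id):
        """Data is recoverable only while both the bytes and the key exist."""
        return blob_id in self.blobs and blob_id in self.region_keys
```

The point of the model is that `readable` turns false at stage 2, while the bytes themselves only disappear at stage 3, which is why the two stages can carry different deadlines.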

One thing I'll flag as an open cross-doc item: your link is to a discussion comment asserting a hard 24h byte-removal SLA, while the committed RFC text treats byte removal as a slow non-goal. We should reconcile which is the real requirement (immediate cryptoshred vs. hard 24h byte deletion) across this doc, RFC #9, and the appliance plan, because the answer changes how aggressive the compaction process has to be. (My gut says we can't force fast compaction, so cryptoshred feels like a hard requirement, and iirc it is how Aurora operates.)

One more connection: if we go the way I'm proposing in the top-level note ("Should Piri do cross-object aggregation at all?" — one object = its own piece(s)), the "dead bytes linger" problem largely evaporates, because deleting an object retires its own whole piece(s) — there's no shared aggregate to compact around.

I mostly agree with what you wrote.

The problem with cryptoshredding is that we want to have two copies of the content-encryption-key:

  • One is stored by Ingot and easy to delete
  • Another is stored in the blob via FEE. This copy is wrapped by a key owned by FilOne and has coarser granularity (per bucket or per tenant).

The consequence: As long as the blob is stored in Piri, FilOne has the means to recover the stored data.

If we want to implement proper cryptoshredding, then we need to change FilOne/Hilt to create a new per-object wrapper key instead of sharing the same per-bucket/per-tenant key.

I don't have a strong opinion about which path we should take (per-object keys in Hilt vs. removing deleted blobs within a reasonable timeframe). What's important to me is meeting the compliance requirements.

Let's add @Peeja to the loop since she is designing the encryption.

Comment thread docs/aggregation-gate.md Outdated
shows the **economics** hold (registration linear, proving logarithmic, proof-fee size-neutral,
**no slashing**, registration gas-gatable). The remaining proof is **operational** — that Piri can
*register, prove, and delete* sub-256-MiB pieces routinely at PiB scale:
- **throughput** — ~11M `addPieces` txns per PiB at 8 MiB (the deployed verifier's ~13-piece/tx cap,

Wow, that's a lot of transactions!

  • Do we have an estimate of how much we will pay in gas fees to onboard PiB of 8 MiB pieces?
  • How quickly can we submit these 11M transactions on the chain?

@frrist frrist Jun 18, 2026

Member Author

Good questions, and the answer splits ~cleanly: gas dollars are (probably) not the problem; transaction throughput is.

  • Gas cost: registration (addPieces) is ~120,700 EVM gas/piece (size-independent), ~8× in FEVM. For a whole PiB at 8 MiB (as an example) that's on the order of a fraction of a FIL to low tens of dollars at a quiet base fee, negligible against storage revenue. And registration is deferrable: Piri gas-gates it (defer-and-retry, no deadline, no slashing for being late), so we ride cheap-gas windows and never pay spike prices for it. So I'm not worried about the fee. But there is a tradeoff for operators here, since not registering during times of high gas fees means not being paid for data stored.
  • Throughput: this is the real wall. And this all depends on how many roots we can add in a single transaction, and how many of those we can send per epoch. The calculator in here attempts to model this.

So "how fast" is a function of how many addPieces transactions we can send, and how many pieces we can stuff in each transaction.
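
To make that dependency concrete, here is a back-of-the-envelope sketch. The ~120,700 gas/piece figure comes from the RFC; the function names and the `pieces_per_tx` / `txs_per_epoch` inputs are placeholders for illustration (the example uses the withdrawn ~13 pieces/tx draft figure and one transaction per epoch only to have numbers to plug in), not measurements. The only outside fact is Filecoin's 30-second epoch.

```python
MIB = 2**20
PIB = 2**50
EPOCH_SECONDS = 30           # Filecoin epoch duration
GAS_PER_PIECE_EVM = 120_700  # size-independent add cost quoted in the RFC


def registration_load(piece_size_bytes, pieces_per_tx, txs_per_epoch, total_bytes=PIB):
    """Return (pieces, transactions, days) needed to register total_bytes of data."""
    pieces = -(-total_bytes // piece_size_bytes)  # ceiling division
    txs = -(-pieces // pieces_per_tx)
    epochs = -(-txs // txs_per_epoch)
    days = epochs * EPOCH_SECONDS / 86_400
    return pieces, txs, days


def registration_gas(pieces, gas_per_piece=GAS_PER_PIECE_EVM):
    """Total EVM gas to register `pieces` pieces (before any FEVM multiplier)."""
    return pieces * gas_per_piece


if __name__ == "__main__":
    # Placeholder inputs: 8 MiB pieces, 13 pieces/tx, 1 tx per epoch.
    pieces, txs, days = registration_load(8 * MIB, pieces_per_tx=13, txs_per_epoch=1)
    print(f"{pieces:,} pieces, {txs:,} txns, ~{days:,.0f} days at 1 tx/epoch")
    print(f"{registration_gas(pieces):,} EVM gas in total")
```

Doubling either `pieces_per_tx` or `txs_per_epoch` halves the onboarding time, which is why the measured batch size and the sender count are the numbers that matter.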

To be upfront: this throughput magnitude is exactly the assumption I'm asking FilOZ to stress-test in the top-level note ("Should Piri do cross-object aggregation at all?"). Good news since I first wrote this: FilOZ already removed the addPieces batch cap in v1.3.0, so the registration wall is softer than the early ~13-pieces/tx draft assumed — but the measured pieces/tx (and whether to shard across multiple data sets) is still the open question that decides whether we can drop aggregation entirely or have to keep a floor.

Comment thread docs/aggregation-gate.md
Comment on lines +93 to +94
workload literature (IBM COS / SNIA traces); **instrument Ingot from day one** so the floor/ceiling
become measured, not guessed.

Suggested change
- workload literature (IBM COS / SNIA traces); **instrument Ingot from day one** so the floor/ceiling
- become measured, not guessed.
+ **instrument Ingot from day one** so the floor/ceiling become measured, not guessed.

I agree we should instrument Ingot and start collecting object sizes from day one. However, we need data from real users, not from our internal testing. I think that means this instrumentation belongs to the R1 milestone.

Thoughts?

Pull initial shapes from object-store workload literature (IBM COS / SNIA traces)

Please ask Aurora if they can find these numbers for us. Either for FilOne customers only, or for their entire customer base, whatever is easier for them. I think this will give us some good data points too!

Member Author

Agree on both counts. The instrumentation hooks are cheap and I'd put them in the code from day one, but you're right that the tuning decision (floor/ceiling values) is only meaningful against real billable workload (which we lack w/ Forge), so it's R1-gated; internal test data would just bias us. (Hypothetically we could punt this decision to the SP, but it's a complicated tuning decision I doubt they will want to make.)

On Aurora — I strongly agree we should ask, and their numbers (FilOne customers, or their whole base, whichever is easier for them) would be the best single data point we could get, since they're a real object-store operator running their own (non-Forge) stack. I don't have a direct line to them though, so I'll need a hand getting connected — who's the right person to make that intro / own the ask? Is this an ask you could take on (pretty please)?

I didn't realise you are not in the shared Slack channel. I added you and started the discussion with Aurora here: https://filecoinproject.slack.com/archives/C0AGASAA4FN/p1781856109595929

Comment thread docs/aggregation-gate.md

Today Piri pools accepted blobs into ~128 MiB pieces (`MinAggregateSize`, up to a 256 MiB ceiling)
across **unrelated** objects, registers them in one node-wide proof set, and can delete only a
**whole** piece. So deleting one 8 MiB S3 object means: take down a ~128 MiB aggregate that also holds

I am wondering how you chose 8 MiB as the new floor, rather than 1, 2, 4, 16, or 32 MiB?

Member Author

Honestly, 8 MiB is a principled starting point, not a derived optimum. It's the AWS SDK default: both boto3 and the aws CLI default multipart_threshold and multipart_chunksize to 8 MiB, so any object large enough to matter arrives already split into ~8 MiB parts by the client. Setting the floor there makes "one client part ≈ one on-chain piece" land for the dominant upload path, while smaller whole objects fall below the floor and get aggregated.

The trade across the range: a smaller floor (1/2/4) puts more objects above the floor (better delete granularity, each its own piece) but multiplies piece count and therefore registration transactions; a larger floor (16/32) means fewer pieces but more sub-floor aggregation and bigger repack units when a member is deleted. 8 sits at the natural client boundary.
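
A rough way to see that trade is to count pieces for a sample workload at different floors. The sketch below is a simplification under stated assumptions: `piece_counts` is a hypothetical helper, sub-floor blobs are treated as perfectly packed into aggregates of exactly one floor each (no padding, no bin-packing loss), and the sample sizes are made up for illustration.

```python
MIB = 2**20


def piece_counts(blob_sizes, floor_bytes):
    """Return (own_piece_blobs, sub_floor_aggregates) for a given floor.

    Blobs >= floor each become their own piece; smaller blobs are pooled
    into aggregates built up to the floor (idealized packing).
    """
    own = sum(1 for size in blob_sizes if size >= floor_bytes)
    pooled_bytes = sum(size for size in blob_sizes if size < floor_bytes)
    aggregates = -(-pooled_bytes // floor_bytes)  # ceiling division
    return own, aggregates


if __name__ == "__main__":
    # Made-up workload: ten 8 MiB parts, sixteen 1 MiB objects, four 3 MiB objects.
    blobs = [8 * MIB] * 10 + [1 * MIB] * 16 + [3 * MIB] * 4
    for floor_mib in (1, 2, 4, 8, 16, 32):
        own, aggregates = piece_counts(blobs, floor_mib * MIB)
        print(f"floor={floor_mib:>2} MiB  own={own:>2}  aggregates={aggregates:>2}  "
              f"total pieces={own + aggregates}")
```

With a real object-size distribution in place of the made-up list, the same loop shows where piece count (registration load) and the share of aggregated objects (repack exposure) cross over.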

That said — I'm now questioning whether we want a floor at all. The cleaner model is one object = its own piece(s) with no cross-object aggregation, which makes "8 vs 1/2/4/16/32" moot. I've written that up as a top-level note ("Should Piri do cross-object aggregation at all?"); short version: 8 MiB only matters if FilOZ tells us we must aggregate, in which case the floor is a function of the measured object-size distribution we don't have yet (the instrumentation/Aurora thread). Absent that constraint, there's no floor to pick.

Honestly, 8 MiB is a principled starting point, not a derived optimum. It's the AWS SDK default: both boto3 and the aws CLI default multipart_threshold and multipart_chunksize to 8 MiB, so any object large enough to matter arrives already split into ~8 MiB parts by the client. Setting the floor there makes "one client part ≈ one on-chain piece" land for the dominant upload path, while smaller whole objects fall below the floor and get aggregated.

Thank you for the explanation; now the choice makes sense to me!

Comment thread docs/architecture.md
Comment on lines +18 to +19
Ingot is an **S3 gateway over the Forge network**. It runs as an embeddable Go library (a host like
piri/guppy/sprue imports it in-process) and as a standalone daemon (`ingot serve`). It presents each

Ingot is an S3 gateway over the Forge network. It runs as an embeddable Go library (a host like piri/guppy/sprue imports it in-process) and as a standalone daemon (ingot serve).

Interesting! That means we can further simplify our R0 or even R1 scope by running Ingot inside the Piri process.

@hannahhoward @alanshaw Thoughts?

Member Author

Right, and that's by design — Ingot was always meant to be importable as a library by one of these services, with ingot serve as the standalone option. Folding it in as an optional module of Piri is a real win on a couple of fronts: it simplifies installation/deployment (one process to ship and run; one-process is a super hot topic right now 😂 ), shaves the LAN-RPC latency between Ingot and Piri, and lets Ingot's local "spool" store reuse Piri's existing object store instead of keeping a second on-disk copy.

The one thing I'd be precise about so we don't over-claim: it's a packaging/perf optimization, not an architectural or scope reduction. The appliance already co-locates ingot + piri at the provider, and the control plane still flows ingot → sprue → piri (and → Hilt in R1) regardless of whether they share a process — so it doesn't remove a network hop on the upload path or simplify the auth story.

Net: worth doing for the deploy/spool simplification, but it doesn't cut R0/R1 work in a structural way.

The one thing I'd be precise about so we don't over-claim: it's a packaging/perf optimization, not an architectural or scope reduction.

💯

Comment thread docs/architecture.md Outdated
Comment on lines +252 to +259
1. At `accept`, Piri publishes the blob's location to the indexer, which caches it right away — so
a normal GET resolves at once. But that cached **hit** has only a ~1 h TTL.
2. The durable, long-term resolution comes from an IPNI advertisement that propagates **separately
and more slowly**. If the hit expires before IPNI has caught up, there is a window where the
indexer can find nothing.
3. The catch: a lookup that lands in that window returns *not found*, and the indexer **caches that
failure too** — a *negative* cache entry, also ~1 h. So one ill-timed miss makes a
fully-stored object look missing for up to an hour, even after IPNI catches up.
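
The three steps above can be reproduced with a toy cache model. This is a sketch under assumptions: `IndexerModel` and its methods are hypothetical, the ~1 h TTL is applied to both hits and misses as the text describes, and re-caching a durable hit is an assumption about behavior, not a statement about the real indexer.

```python
class IndexerModel:
    """Toy indexer with positive and negative caching (hypothetical)."""

    def __init__(self, ttl_seconds=3600):
        self.ttl = ttl_seconds
        self.cache = {}       # key -> (found, expires_at)
        self.durable = set()  # keys whose IPNI advertisement has propagated

    def publish(self, key, now):
        """Piri publishes at accept: a positive hit is cached right away."""
        self.cache[key] = (True, now + self.ttl)

    def ipni_caught_up(self, key):
        """The slower, durable IPNI record becomes available."""
        self.durable.add(key)

    def lookup(self, key, now):
        entry = self.cache.get(key)
        if entry is not None and now < entry[1]:
            return entry[0]  # served from cache, whether hit or miss
        found = key in self.durable
        self.cache[key] = (found, now + self.ttl)  # misses are cached too
        return found


if __name__ == "__main__":
    idx = IndexerModel()
    idx.publish("blob", now=0)
    print(idx.lookup("blob", now=10))    # cached hit
    print(idx.lookup("blob", now=3700))  # hit expired, IPNI not ready: miss gets cached
    idx.ipni_caught_up("blob")
    print(idx.lookup("blob", now=4000))  # still masked by the negative entry
    print(idx.lookup("blob", now=7301))  # negative entry expired: resolves again
```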

In the R0/R1/R2 roadmap, the plan is to remove indexer from the system in the initial versions. Do we need to account for that and update this section to describe how we will be handling reads without the indexer?

Member Author

I'm treating that roadmap as a tentative plan, and I'm not yet convinced dropping the indexer makes any of this land faster — when we remove a system we still have to build its replacement (here, a local location table) and we take on the credible-exit / external-discoverability cost of doing so. So removing the indexer reads to me as a lateral move, not a shortcut. I'd rather keep this doc indexer-based as the target and document the no-indexer R0/R1 path as a deliberate, reversible reduction, rather than rewrite the read path around its absence.

Comment thread docs/architecture.md
Comment on lines +264 to +269
- **Read cache (optional, recommended):** beyond that floor, the local store serves hot reads
directly, skipping the indexer→Piri round-trip. Read-after-write retains *recently written* data;
a cache retains *recently read* data, so the two may use distinct eviction policies over a shared,
bounded, size-configurable store. The alternative — a near-stateless Ingot that resolves every read
through the indexer — trades latency for simpler horizontal scaling; it is a supported mode, but
the read-after-write floor holds regardless.

Does it make sense to remove the read cache from the R0/R1 scope to allow us to ship faster?

Member Author

I'd keep it, as I suspect the performance benefits will be worth it, and we'll need it anyway before a production deployment to get a reasonable TTFB. But it's listed as optional/recommended here because my gut says it's a few hours to implement, and gut feelings are sometimes wrong. Will drop if it proves complicated; left here for posterity.

Comment thread docs/architecture.md

`min` is the central lever because on-chain cost is **count-driven, not size-driven** (gas RFC). A
small `min` means most objects are their own piece, so most deletes are O(1) (below); the price is
more pieces — more `addPieces` transactions, a one-time registration tax, and an O(log) proving

Same comment as above - what is the argument for the log function in O(log)?

Member Author

Same answer as the earlier O(log) thread: it's O(log(cumulative piece count)) = O(log(dataset_bytes / piece_size)) — the depth of the sumtree descent each proof challenge performs (floor(log2(N)) + 1). Not O(log(piece_size)), not O(log(total_size)) on its own. And the count is monotonic (nextPieceId never decrements), so under churn it's log of adds-ever, ratcheting up permanently. I'll disambiguate both occurrences in the doc.
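
For a feel of the magnitudes, here is a minimal sketch of that depth formula. The helper names are hypothetical, and the count passed in is the cumulative number of pieces ever added, not the live count.

```python
MIB = 2**20
PIB = 2**50


def proof_depth(cumulative_pieces):
    """Depth of the sumtree descent: floor(log2(N)) + 1 for N >= 1."""
    if cumulative_pieces < 1:
        raise ValueError("need at least one piece")
    # For a positive integer N, N.bit_length() equals floor(log2(N)) + 1.
    return cumulative_pieces.bit_length()


def depth_for_dataset(dataset_bytes, piece_size_bytes):
    """Depth assuming every piece has the same size and none was ever removed."""
    return proof_depth(dataset_bytes // piece_size_bytes)


if __name__ == "__main__":
    for piece_mib in (8, 128):
        print(f"1 PiB at {piece_mib} MiB pieces -> depth "
              f"{depth_for_dataset(PIB, piece_mib * MIB)}")
```

Going from 128 MiB to 8 MiB pieces multiplies the piece count by 16 but adds only 4 levels to the descent, which is the sense in which proving cost is logarithmic.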

Comment thread docs/architecture.md Outdated
Comment on lines +317 to +318
aggregates `≤ max`, and the batch submitter respects the contract's `extraData` cap — the binding
gate is the PDPVerifier's `EXTRA_DATA_MAX_SIZE` = 2048 B (~13 pieces/tx), not the larger WSS limit

Is it worth discussing with the FilOz/FOC folks why PDP enforces this relatively low limit, and whether it's feasible to increase it to match the FWSS limits?

Comment thread docs/architecture.md
reaches zero across **all** spaces (Piri keeps per-`(digest, space)` allocation rows to count on).
At zero, an own-piece blob takes Regime A and an aggregated blob takes Regime B.

**Bucket teardown.** All blobs share one node-wide PDP data set, so `deleteDataSet` would tear down

Do we have any concerns about the one node-wide PDP data set as we grow the amount of data stored?

If a node is storing 100TiB of data, spread across 1M objects >= 8MiB:

  • The data set has 1M roots, does the on-chain data structure support this?

The PDP proofs are probabilistic. I asked Claude to compare one 100TiB-sized proof set vs 100x 1TiB-sized proof sets, and it says there is a meaningful difference.

Summary:

The probabilistic guarantee is "you can't cheat much for long without detection," not "every byte is checked every period." But "eventually" scales with how small the missing fraction is — and a single 100 TiB set makes "eventually" very long for small losses.

More details:

  • One proof set, 100 TiB, 5 challenges/period: If the provider drops 1% of your data (1 TiB), per period: 1 − 0.99^5 ≈ 4.9% chance of catching it. They'd expect to get away with it for ~20 periods on average. Drop 0.1% (100 GiB) and per-period detection falls to ~0.5%.
  • 100 proof sets, 1 TiB each, 5 challenges each = 500 challenges/period total:
    Now you have 500 independent challenges across your data every period. If the provider drops 1% spread across everything: 1 − 0.99^500 ≈ 99.3% caught per period. Even 0.1% missing gives 1 − 0.999^500 ≈ 39% per period.
  • There's a localization problem beyond raw probability. With 1M roots in one set and only 5 challenges, an adversary who deletes one entire root (say one of your 1M pieces, ~100 MiB) is missing 1/1,000,000 of the leaves. Per-period detection ≈ 5/1,000,000 = 0.0005%. That root could be gone for thousands of years of proving periods before a challenge happens to land on it. The data is effectively unprotected at fine granularity.
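
The percentages above follow from one formula, sketched here so the numbers can be re-derived. It assumes each challenge lands independently and uniformly over the data, which is the simplification the summary itself uses and not a statement about how PDPVerifier actually samples; the function names are hypothetical.

```python
def detection_probability(missing_fraction, challenges):
    """Chance that at least one of `challenges` uniform samples hits missing data."""
    return 1.0 - (1.0 - missing_fraction) ** challenges


def expected_periods_to_detect(missing_fraction, challenges):
    """Mean number of proving periods until the first detection (geometric)."""
    return 1.0 / detection_probability(missing_fraction, challenges)


if __name__ == "__main__":
    cases = [
        ("one set, 1% missing, 5 challenges", 0.01, 5),
        ("one set, 0.1% missing, 5 challenges", 0.001, 5),
        ("100 sets, 1% missing, 500 challenges", 0.01, 500),
        ("100 sets, 0.1% missing, 500 challenges", 0.001, 500),
        ("one root of 1M missing, 5 challenges", 1e-6, 5),
    ]
    for label, fraction, k in cases:
        p = detection_probability(fraction, k)
        print(f"{label}: {p:.4%} per period, "
              f"~{expected_periods_to_detect(fraction, k):,.0f} periods expected")
```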

@frrist frrist Jun 18, 2026

Member Author

Two halves to this, and I want to answer them differently.

On the security/probabilistic half (the 5-challenges-over-1M-roots dilution argument): I'd rather not pull that thread here. This doc treats the PDP/PDPVerifier contract security model as a vetted assumption, and how hard it is to cheat over a proving window is covered by the contract security analysis, not this doc.

On the operational scaling half, though, your instinct is right and it's very much live, so I've pulled it into a top-level note ("Should Piri do cross-object aggregation at all?"). Structurally, I believe ~1M roots is fine: every on-chain hot path is bounded (O(batch) / O(5) / O(log cumulative) / O(removals≤2000)). The real open question you're circling is how many proof sets a node should maintain. Today it's one data set per node for operational simplicity, but each set is a proof the node must produce every period, so sharding into N sets is N proofs/day and shifts the node's SLA and gas usage, since there are more proofs per day. The flip side is that a cap on pieces-per-set + sharding could help delete throughput, localization, and per-tenant rails. I don't have the answer; it's exactly the kind of thing we need FilOZ to weigh in on, and it's tied to whether we can drop cross-object aggregation entirely. Let's track it in that note.

@bajtos bajtos Jun 19, 2026

On the security/probabilistic half (the 5-challenges-over-1M-roots dilution argument): I'd rather not pull that thread here. This doc treats the PDP/PDPVerifier contract security model as a vetted assumption, and how hard it is to cheat over a proving window is covered by the contract security analysis, not this doc.

I agree to not question the PDPVerifier contract's security model.

What I am questioning: Did FilOz consider the security implications of very large ProofSets (like 100TiB+)? If they did, what is their recommendation for the maximum proofset size that still keeps PDPVerifier security promises?

I think it's important to include this point in our discussions with FilOz.

@alanshaw alanshaw left a comment

Member

Can this please be a PR against the RFC repo?

I haven't yet reviewed rfc-pdp-minimum-piece-size.md but I have review fatigue so I'm going to stop there for today and publish the comments I have.

Comment thread docs/aggregation-gate.md Outdated
- **`min_aggregate_size` (floor, ~8 MiB).** Any blob **≥ floor** becomes its **own** on-chain piece,
so deleting it removes a whole piece (O(1)) with no survivors to re-hash. Any blob **< floor** joins
a **cross-object** sub-floor aggregate built up to the floor — but **lazily**: never repacked on
delete; dead bytes linger and compact periodically (or never). They are cheap bytes.

Member

I'd like to not never repack. I'm unsure how we square this off, i.e. if we allow storage nodes to prove over pieces that no customer is paying for, it's basically on us to pick up the bill. Furthermore, how do we distinguish an uncollected garbage piece from a completely random piece the storage node inserted to earn more $$$?

Member Author

Agreed: "never" was too strong, and I think the cleanest way to "not never repack" is to not cross-object aggregate in the first place. I've written this up as a top-level note ("Should Piri do cross-object aggregation at all?"). If each object is its own piece(s), there's nothing to repack on delete (you retire the object's own piece), and the concerns somewhat dissolve:

  • Unpaid pieces: there's no shared aggregate accumulating dead bytes from other objects, so the proven set maps 1:1 to objects we're paying for.
  • Garbage vs. fabricated: these split on trust, which I want to be careful about. The garbage half (dead bytes from our own deletes) is honest-operator GC, i.e. non-adversarial, so our own bookkeeping tells us which pieces are no longer referenced and we schedule them for retirement on the normal delete path. The fabricated-piece half is adversarial and harder. If we or a client ran Ingot, we could reconcile on-chain roots against blob_refs. But that does not work if the SP runs Ingot, since:
    1. blob_refs is keyed by the blob's sha256, but the on-chain root is a CommP, and there's no way to verify a CommP corresponds to a given sha256 — that binding is a UCAN attestation (trust), not a computation
    2. the SP operator is the one running Ingot, so reconciling against their reverse index is letting the adversary audit themselves.

The trust-correct anchor has to be FilOne-side: Sprue issues every accept, so FilOne holds a central, SP-independent record of what it authorized (space, digest, size), and FilOne is the FilecoinPay payer, so the real backstop is reconciling the on-chain dataset size/accounting against the authorized total (which needs no per-piece CommP↔sha256 identity) and halting the rail on divergence. That covers inflate-to-earn; it doesn't by itself cover byte-substitution, but that's always been the case for this deployment configuration, which requires manual download and verification by a trusted party.
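
A minimal sketch of that size reconciliation, under stated assumptions: the `Accept` record, the two function names, and the `tolerance_bytes` allowance for piece padding are all hypothetical, and counting each digest once across spaces is a guess at how cross-space dedup would show up on chain. It is not Sprue's or FilecoinPay's actual API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Accept:
    """One authorization from the (hypothetical) central accept log."""
    space: str
    digest: str  # sha256 of the blob, hex-encoded
    size: int    # authorized bytes


def authorized_total(accepts):
    """Sum authorized bytes, counting each digest once across spaces."""
    size_by_digest = {}
    for accept in accepts:
        size_by_digest[accept.digest] = accept.size
    return sum(size_by_digest.values())


def should_halt_rail(accepts, onchain_bytes, tolerance_bytes=0):
    """Return True when the on-chain dataset is larger than what was authorized.

    tolerance_bytes is a placeholder for piece padding overhead (assumption).
    """
    return onchain_bytes > authorized_total(accepts) + tolerance_bytes


log = [
    Accept("space-a", "aa11", 1000),
    Accept("space-b", "aa11", 1000),  # cross-space dedup: same bytes, second claim
    Accept("space-a", "bb22", 500),
]
print(authorized_total(log))        # 1500
print(should_halt_rail(log, 1500))  # False
print(should_halt_rail(log, 4000))  # True
```

The check only compares totals, so it can flag inflation but says nothing about which piece is the extra one.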

@bajtos bajtos Jun 19, 2026

From what I remember, the TX adding a new root to a PDP ProofSet must contain extraData signed by the Filecoin Pay payer (FilOne). This extraData also contains the CommP of the added root. In other words, the SP cannot add arbitrary roots to PDP behind our back; every such addition must be signed by FilOne's wallet.

Also, since PDP is using CommPv2, we know what size we are signing over when FilOne is authorising a PDP addRoot transaction.

The point I want to make: I think we don't need to compare the total PDP proofset size against the authorised total, because we should be able to verify the size on a per-object basis.

I don't fully understand how Forge interacts with PDP, so I may be wrong.

Comment thread docs/aggregation-gate.md Outdated
~8 MiB floor (per-SP / per-dataset), keeping aggregation adaptive: blobs ≥ floor become their own
piece, only the sub-floor tail pools — and **lazily** (never repacked on delete). The RFC §7 carries
the spec. This is mostly a constant change plus a lazy-compaction path — Piri's aggregator is already
size-driven — not a new grouping primitive.

Member

I'm starting to wonder if we even bother with a minimum size...

While there's probably some aspect of churn related to size, my suspicion is that time is the dominant factor for churn. In other words, the longer a piece lives for, the less likely it is to be deleted.

What if Piri just put every piece on the chain as is, but had a background process to aggregate old pieces together? It allows new pieces to churn easily with minimal cost, it only aggregates pieces that are unlikely to be removed (and thus reduces re-aggregation work/cost), and it keeps the total number of proving roots low.

We can also implement this in 2 steps. Step 1 is just put every piece on chain as is. Ship it, then...Step 2 is add the generational aggregator background process.
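
A rough sketch of what the Step 2 background pass could look like, assuming a hypothetical `Piece` record with an `added_at` timestamp and a greedy packing policy. The age threshold and target size are placeholder parameters, not values from the RFC.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Piece:
    piece_id: int
    size: int        # bytes
    added_at: float  # unix seconds when the piece went on chain


def plan_generation(pieces, now, min_age_s, target_size):
    """Group cold, small pieces into aggregates of at most target_size bytes.

    Pieces younger than min_age_s stay untouched so they can churn cheaply.
    Pieces already at or above target_size are left as their own root.
    """
    cold = sorted(
        (p for p in pieces if now - p.added_at >= min_age_s and p.size < target_size),
        key=lambda p: p.added_at,
    )
    batches, current, current_size = [], [], 0
    for piece in cold:
        if current and current_size + piece.size > target_size:
            batches.append(current)
            current, current_size = [], 0
        current.append(piece)
        current_size += piece.size
    if current:
        batches.append(current)
    # A batch of one piece would not reduce the root count, so drop it.
    return [batch for batch in batches if len(batch) > 1]


pieces = [
    Piece(1, 4, 0), Piece(2, 5, 10), Piece(3, 3, 20),
    Piece(4, 2, 990),  # too young to aggregate
    Piece(5, 20, 0),   # already large, stays as-is
]
plan = plan_generation(pieces, now=1000, min_age_s=100, target_size=10)
print([[p.piece_id for p in batch] for batch in plan])  # [[1, 2]]
```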

Member Author

This is the comment that's reshaping my thinking, and I've written the full version up as a top-level note ("Should Piri do cross-object aggregation at all?"). Let me anchor the discussion there, but the short version: I think you're right, and I'd go further than "maybe no minimum" to no cross-object aggregation at all. Your "time dominates churn" prior is backed by the gas RFC (on-chain cost favors bigger pieces at every churn level, §3.2), so the only reason to aggregate is throughput/root-count, not cost. If the contracts can take the piece volume, the cleanest answer is one object = its own piece(s), no co-tenancy, deletion always O(1).

In that world the two steps map cleanly: Step 1 is object=piece (no floor) — register everything as-is; Step 2, the generational background aggregator, becomes the optional relief valve for root-count/throughput, aggregating only old, cold pieces unlikely to be deleted (so it rarely triggers a repack).

Two caveats I'd flag rather than gloss:

  1. "every piece as-is" maximizes piece count, which makes Miro's ~11M-txn / send wall worse for a heavy tail of tiny objects, so whether we can do this at all depends on the contract-practicality question we need to raise with FilOZ in the top-level note (and on raising the extraData cap / sharding across data sets).
  2. Generational aggregation reduces the live-root count, but because nextPieceId is monotonic it doesn't undo the proving-depth ratchet, and registration txns are paid per-object up front regardless — so it's relief, not a free lunch.

Net: this is my preferred direction and I'd ship it iff FilOZ confirms the piece volume is practical; if they say it isn't, that's exactly when the size floor comes back as the fallback.

> I'd go further than "maybe no minimum" to no cross-object aggregation at all.

Sounds great to me! 💯

> Generational aggregation reduces the live-root count, but because nextPieceId is monotonic it doesn't undo the proving-depth ratchet, and registration txns are paid per-object up front regardless — so it's relief, not a free lunch.

I am wondering if we can work around the monotonic nextPieceId by creating a new proofset for each generational aggregation run?

  1. Flag the current proofset as "gc_pending".
  2. Create a new proofset that will be used for new uploads from now on.
  3. Wait until all TXs adding pieces to the gc_pending proofset are settled
  4. Perform generational aggregation for pieces in the gc_pending proofset
    1. Create a new proofset to hold the aggregated roots
    2. Iterate over all roots in the processed proofset
      • Large root - move to the new proofset as-is
      • Aggregate small roots into bigger ones
    3. Schedule the processed proofset for deletion

Anyhow, I agree generational aggregation is post-MVP (post-M2), so we don't need to worry about it right now.

Comment thread docs/architecture.md
invisible to storage and retrieval — it matters only for proving and deletion.
- **The catalog names content; the indexer locates it.** Ingot's manifest pins body **digests**
(stable across any chain-side reshuffling); the indexer maps a digest to its current physical
location. A read joins the two.

@alanshaw alanshaw Jun 18, 2026

Member

Just to clarify, my mental model of this currently is that a body digest is the hash of the entire object (single part) OR the hash of a root UnixFS node that links to all shards of the object (multi part).

Location claims are created by Piri nodes for each shard, telling us where the shard(s) for an object are.

A (sharded DAG) index is created by Ingot for the object - the index lists the shard digests that comprise the blob. The new nodes property (used only for multipart uploads) carries the root UnixFS node and it tells us the order the shards need to be served for retrieval. See RFC for details.

An object read queries the indexer for the body digest and gets back location claims and an index, allowing retrieval to take place.

What we could do, to remove the indexer would be:

  1. Assume all blobs are stored at a specific Piri node URL (configurable probably).
  2. Encode the (ordered) shard hashes in the object manifest.

This does, however, make the MST and blobs undiscoverable without Ingot, which actively prevents (or at least makes hard) a credible exit.

Comment thread docs/architecture.md
└ body
├ size 629_145_600 (600 MiB total)
├ sha256 <whole-body digest> (integrity)
└ blobs[] ── ordered, contiguous, covers [0, size) ──┐

Member

Ah I see, the plan is to store these in the object manifest anyway? What do we get from this? We have to go to the indexer anyway to get the locations...

Consider calling this "shards" to align with what they are referred to in Forge.

Comment thread docs/architecture.md
durable. The catalog plane is the per-operation delta.

Because many tiny blocks share a CAR, catalog retrieval uses the indexer's **index-claim /
sharded-dag-index path** (block CID → byte range within its CAR shard). This two-level lookup is

Member

Of course, we still need to index the MST blocks...

Comment thread docs/architecture.md Outdated

To stay out of that window, a blob's local copy is retained until the object is **published** —
confirmed by an independent, cache-cold indexer probe actually resolving the digest (or a fixed
margin past the TTL) — not merely until `accept` returns.

Member

Um, so this is basically the reason the indexer exists. Items written to the cache are given ample time for IPNI to catch up (7 days) so this shouldn't be necessary.

That said a belt and braces approach might be a good idea. We have a library for getting notifications that an IPNI chain has "done a sync" you might want to take a look at: https://github.com/storacha/go-libstoracha/blob/main/ipnipublisher/notifier/notifier.go

Comment thread docs/architecture.md
- A **within-space** re-PUT short-circuits in Sprue, which replays the existing claim — no PUT, no
new accept, same location returned. The reference index simply gains another version.
- A **cross-space** dedup hit (bytes present from another space) returns no upload URL but still
mints a fresh per-space location claim.

Member

Suggested change
- mints a fresh per-space location claim.
+ issues a fresh per-space location commitment.

Comment thread docs/architecture.md Outdated
Object bodies resolve straight from a digest: `accept` publishes an `/assert/location` commitment
keyed by the blob's own digest with a whole-blob range, and the manifest's ordered blob list carries
the byte-range map for split/multipart objects. So body retrieval needs only a digest → location
lookup, no per-CAR sharded-dag-index.

Member

I guess so. I just don't know why you wouldn't use an index? You have to go to the indexer anyway and we should probably try to put as little information in the object manifest as possible as we don't have a good garbage collection story for the catalog.

If we ever want to access that object outside of the catalog then it is impossible - you can't ask the indexer about that object CID. 🤷‍♂️ IDK maybe not important, just seems weird to break compatibility here.

How about as a compromise we don't create indexes when the object is single part, but if there is > 1 shard then we do?

@frrist

frrist commented Jun 18, 2026

Member Author

Should Piri do cross-object aggregation at all?

A theme runs through a lot of these comments, so I want to pull it out here and link the individual threads back to it.

The assumption I want to challenge. Piri has aggregated blobs into larger on-chain pieces since it was first written, and I think we've all just assumed that was necessary (for gas/throughput reasons) without ever actually stress-testing it. Reading this review, it struck me that almost every hard problem in this doc traces back to one thing: cross-object aggregation.

Deletion complexity, compaction, "never repack," proving over pieces nobody's paying for, telling uncollected garbage from a fabricated piece, the GDPR compaction window, the size floor itself — they exist only because we pool blobs from different objects into a shared piece. Remove that and the whole cluster collapses.

The proposal (the simple world). Drop cross-object aggregation. Store each S3 object as its own content-addressed blob(s): one on-chain piece if it fits under a ceiling, or a handful of object-owned pieces if it's larger (split at the proving-code ceiling; per-part for multipart, which keeps us compatible with the encryption RFC's per-part envelopes). No floor, no co-tenancy — a piece belongs to exactly one object. Then:

  • Deletion is ~O(1) per object, always — delete the object's piece(s), no survivors to re-hash, no compaction, no SLA window to chase.
  • The garbage-vs-fabricated-piece question splits cleanly: garbage (our own dead bytes from deletes) is honest-operator GC; fabricated pieces are adversarial and can't be caught by Ingot's blob_refs (sha256↔CommP isn't verifiable — it's a UCAN attestation and the SP runs Ingot, so it can't audit itself), so that one anchors on FilOne-side accounting (Sprue's central accept log) + payment leverage. Still an open problem, but a tractable, correctly-anchored one.
  • Per-tenant attribution gets cleaner, and the entire floor / "Regime B" compaction machinery in this doc just disappears.

The catch, and the real point: this hinges on a contract-practicality question we don't have the answer to. Object=piece maximizes piece count, so it only works if the PDP contracts and proving can take that volume in practice. We've never actually asked, though we meant to at the FilDev summit (🤦). The open questions for FilOZ / FOC, as the experts:

  1. Is there a practical ceiling on the number of pieces in a single proof set? Where does it bite first — registration throughput (the addPieces batch is no longer contract-capped as of FWSS v1.3.0 — it's now bound by the FVM PiecesAdded event-size + per-tx gas, so what's the measured pieces/tx?), the monotonic nextPieceId proving-depth ratchet, the MAX_ENQUEUED_REMOVALS=2000/period delete cap, or the view-getter scans?
  2. How many proof sets should a node maintain? Today one Piri node = one data set, for operational simplicity — but every data set is a proof the node has to produce each period, so N sets = N proofs/day and it shifts the node's SLA, proving burden, and most importantly cost via extra gas. Is a cap on pieces-per-set + sharding across multiple sets the right move (it would help delete throughput)? If so, what's the right number?
  3. What's the registration-throughput ceiling now that you've removed the extraData cap? As of FWSS v1.3.0 / PDPVerifier v3.4.0 the addPieces byte cap (EXTRA_DATA_MAX_SIZE / MAX_ADD_PIECES_EXTRA_DATA_SIZE) is gone — the limit is now the FVM PiecesAdded event-size + per-tx gas. A measured pieces/tx (and whether that event-size limit is worth lifting at the protocol level) is what we need to size the sender fleet for "many small pieces."

The decision rule. I want to build around the assumption that object=piece is viable, because if it is, this gets dramatically simpler. But I'm treating it as an assumption to stress-test, not a settled fact:

  • If FilOZ confirms the contracts can take it → we drop cross-object aggregation; the floor and compaction in this doc go away; Alan's generational background-aggregator becomes an optional relief valve for root-count/throughput (it only ever aggregates old, cold pieces unlikely to be deleted, so it rarely triggers a repack); and Miro's multi-dataset sharding is how we scale it.
  • If FilOZ says there's a hard practical limit → we keep the size floor exactly as this doc outlines, as the documented fallback.

So the floor stays in the doc — but as the fallback, not the default. The default I want to pursue is "no cross-object aggregation," pending FilOZ telling us whether the assumption holds. This is also why I'd treat the aggregation strategy (and, relatedly, the indexer scope) as open in both this doc and the appliance plan until we have that answer.

How this relates to the S3 sharding RFC. I think these are two halves of one model, not competing proposals. The RFC's flat-file sharding strategy is the object→shard half: for each PutObject/UploadPart it shards data at 256 MB into ordered, content-addressed blobs (I think we can tune this limit as needed). This thesis is the shard→piece half: each of those shards, being its own object-owned blob, becomes its own on-chain PDP piece. The join is a single number: the RFC's 256 MB shard threshold, this thesis's blob ceiling, and Piri's MaxMemtreeSize = 256<<20 (pkg/pdp/proof/proof.go:114) are the same constant for the same reason (the commP memtree is built in RAM for now; we will be improving this with code from the Curio implementation). So under no-aggregation, shard == blob == piece, 1:1, and per-object deletion is per-shard == per-piece == O(1). The thesis is therefore the Shard RFC plus one line: each shard is its own piece. And because retrieval addresses bytes by digest, the aggregation decision is invisible to its read path entirely; the only thing the two halves must negotiate is the multi-shard ordering record (see the §8 / arch:467 thread).
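
To make the shard == blob == piece arithmetic concrete, here is a small sketch that splits an object at the ceiling. The function name `plan_shards` is hypothetical; the ceiling is taken as `256 << 20` to match the MaxMemtreeSize value quoted above, and the 600 MiB object is the example size from the architecture doc.

```python
MAX_SHARD_SIZE = 256 << 20  # 268_435_456 bytes, same value as MaxMemtreeSize above


def plan_shards(object_size, ceiling=MAX_SHARD_SIZE):
    """Return ordered (offset, length) pairs covering [0, object_size).

    Under no-aggregation each pair would be one blob and one on-chain piece.
    An empty object yields an empty list in this sketch.
    """
    if object_size < 0 or ceiling <= 0:
        raise ValueError("object_size must be >= 0 and ceiling must be > 0")
    shards = []
    offset = 0
    while offset < object_size:
        length = min(ceiling, object_size - offset)
        shards.append((offset, length))
        offset += length
    return shards


shards = plan_shards(629_145_600)  # the 600 MiB example
for offset, length in shards:
    print(offset, length)
# 0 268435456
# 268435456 268435456
# 536870912 92274688
```

Deleting that object would then retire exactly three pieces, with no other object's bytes involved.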

- Aggregation: pivot to "no cross-object aggregation (object=piece),
  floor as fallback pending FilOZ" across the gate brief + RFC §7,
  matching the top-level review note.
- v1.3.0: remove the EXTRA_DATA_MAX_SIZE batch-cap framing (cap was
  removed upstream); neutralize the throughput numbers to "bound by
  the FVM PiecesAdded event-size + gas, pending measurement"; repoint
  pricing to lib/PriceListUSDFC.sol; note the new USDFC fee layer;
  tighten "no slashing" wording; refresh moved citations.
- architecture: add scope note vs the appliance release plan; fix §5
  cache TTLs to production values + demote the read-after-write floor;
  retitle §8 to "when bodies need a sharded-dag-index" + single/multi
  split; fold MST credible-exit rationale into §4; clarify O(log).
Comment thread docs/aggregation-gate.md
Comment on lines +70 to +72
- **the delete path is real end-to-end** — the contract accepts pieces down to 32 B, but Piri's
whole-root delete (`schedulePieceDeletions`) is **wired but unsigned today** — it needs its
extraData signature finished and exercised;

question: Can you say more? I didn't follow this line at all.

Member Author

Yeah, sorry. This is a very subtle reference to the fact that Piri's RemoveRoot method is incomplete.

Comment thread docs/aggregation-gate.md
Comment on lines +73 to +74
- **proving stays bounded under churn** — the monotonic `nextPieceId` ratchet must not drift proving
cost or the chain-reconciliation view-getters into trouble at high churn.

question: I didn't follow this either. I know what the nextPieceId is, but I'm not following why it's mentioned here, and I'm having trouble parsing "must not drift proving cost or the chain-reconciliation view-getters into trouble at high churn". Must not drift it into trouble?

Member Author

Whoops, yeah, I need to clear this up. What I am trying to indicate is that some methods on the PDPVerifier (and, iirc, a view contract over it) iterate from N to nextPieceId (where N is 0, and sometimes > 0). Since nextPieceId only ever increases, even with deletes, iteration becomes prohibitively expensive over time as the dataset churns.
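
A toy model of that effect, not contract code: it assumes a scan costs one step per id between its start and nextPieceId, and that every replacement deletes one live piece and adds one new one. The function names are made up for illustration.

```python
from collections import deque


def simulate_churn(live_target, replacements):
    """Fill a dataset to live_target pieces, then replace pieces one at a time.

    Returns (live_piece_count, next_piece_id). Ids are never reused, which is
    the monotonic behaviour described above.
    """
    next_piece_id = 0
    live = deque()
    for _ in range(live_target):
        live.append(next_piece_id)
        next_piece_id += 1
    for _ in range(replacements):
        live.popleft()              # delete the oldest live piece
        live.append(next_piece_id)  # add a fresh one under a new id
        next_piece_id += 1
    return len(live), next_piece_id


def scan_length(start, next_piece_id):
    """Ids visited by a scan from start up to next_piece_id (assumed linear)."""
    return max(0, next_piece_id - start)


live_count, next_id = simulate_churn(live_target=10, replacements=1000)
print(live_count, next_id, scan_length(0, next_id))  # 10 1010 1010
```

The live set stays at 10 pieces while a full scan grows to 1010 ids, which is the drift the doc line was trying to warn about.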

Comment thread docs/aggregation-gate.md
On-chain dollar cost is negligible at **every** piece size — so this is **not** a cost optimization,
it is a **capability gap**. The two real risks are exactly the two asks: (a) the off-chain repack on
churn — which **one-object-one-piece eliminates** (no survivors to re-hash); and (b) throughput — more
pieces means a sender fleet. "Piece = object" is simultaneously the cost-optimal *and* the only

question: I'm not sure what that means. What's a "sender fleet"?

Member Author

Another whoops 😅 this leaked in. It's a separate idea in Piri I have been playing with that fell out of this whole exploration; I will provide a better definition and motivation for it in the doc.

Basically, it's the idea of multiple blockchain accounts that Piri submits transactions from in parallel, so registrations aren't all stuck single-file behind one account's queue. Because an account's messages must apply in strictly increasing nonce order, one stuck transaction stalls every higher nonce behind it: an underpriced tx during a gas spike, a revert, a dropped message, a bad gas estimate. At millions of registrations under variable base fee, that single-file fragility is the problem. I might be overthinking this here, but it seems like a real issue that could pop up at high throughput.
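
A minimal sketch of the idea, assuming a hypothetical `SenderFleet` that tracks one nonce counter per account and skips accounts flagged as stalled. It models only the assignment logic; it does not sign or submit anything, and none of these names exist in Piri.

```python
class SenderFleet:
    """Round-robin registrations across accounts, each with its own nonce lane."""

    def __init__(self, accounts):
        self.accounts = list(accounts)
        self.next_nonce = {account: 0 for account in self.accounts}
        self.stalled = set()
        self._cursor = 0

    def mark_stalled(self, account):
        """Flag an account whose lowest pending nonce is stuck."""
        self.stalled.add(account)

    def mark_recovered(self, account):
        self.stalled.discard(account)

    def assign(self, registration):
        """Pick the next healthy account and reserve its next nonce."""
        for _ in range(len(self.accounts)):
            account = self.accounts[self._cursor]
            self._cursor = (self._cursor + 1) % len(self.accounts)
            if account not in self.stalled:
                nonce = self.next_nonce[account]
                self.next_nonce[account] += 1
                return account, nonce, registration
        raise RuntimeError("every sender account is stalled")


fleet = SenderFleet(["acct-a", "acct-b", "acct-c"])
first = fleet.assign("piece-0")   # ("acct-a", 0, "piece-0")
second = fleet.assign("piece-1")  # ("acct-b", 0, "piece-1")
fleet.mark_stalled("acct-b")      # e.g. an underpriced tx during a gas spike
third = fleet.assign("piece-2")   # ("acct-c", 0, "piece-2")
fourth = fleet.assign("piece-3")  # ("acct-a", 1, "piece-3")
fifth = fleet.assign("piece-4")   # ("acct-c", 1, "piece-4"), acct-b is skipped
```

With a single account, the stalled transaction would block every later registration; here it only blocks its own lane.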

Comment thread docs/aggregation-gate.md
bytes in large objects), object lifespan by size class, multipart fraction and the actual part sizes
clients use, and the delete/overwrite/version/lifecycle mix. Pull initial shapes from object-store
workload literature (IBM COS / SNIA traces); **instrument Ingot from day one** so the floor/ceiling
become measured, not guessed.

question: What are our knobs to adjust in response to signals? Does instrumenting after launch still give us room to respond to observation?

Member Author

Currently we have no knobs; reaction would be via an operator reconfiguring their node (say, for a lower floor if we even go that route, a different aggregation threshold, re-pack strategies, etc.) or via an update we ship, with a preference for the former.

Comment thread docs/architecture.md
Comment on lines +36 to +38
- **The catalog names content; the indexer locates it.** Ingot's manifest pins body **digests**
(stable across any chain-side reshuffling); the indexer maps a digest to its current physical
location. A read joins the two.

suggestion: Let's reuse a noun here instead of introducing "body", to clarify what we're talking about. This is the blob digest, right?

Member Author

👍 yes the blob digest.

Comment thread docs/architecture.md
Comment on lines +243 to +246
**Object → blobs.** A body is hashed (sha256 for content addressing, md5 for the ETag) in a single
streaming pass and written to the local store. It becomes an ordered list of content-addressed
blobs, each `≤ max_blob_size`: one blob for objects within the ceiling, a coarse split (e.g. 256 MiB
granularity, not fine chunking) for larger ones. Each blob is uploaded to Piri by digest ([§7](#7-cross-cutting-durability-concurrency-retrieval)).

thought: Two things:

  • The stream will also encrypt, yielding a FEE envelope of the blob on disk and a region-wrapped CEK to store in the DB. Doesn't necessarily need to be explicit here, but I want to make sure that's in your head too.
  • It sounds like this is writing the whole thing out to disk before moving on, but we can upload each part blob as it's complete on disk. I had been thinking of this as bounding the space Ingot needs on disk, because we can throw out blobs after they're sent to Piri, but it turns out we want to hold onto them anyway for read-after-write and as a cache, so…that's less important. 😛
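
A sketch of the single streaming pass described in the quoted doc text, under simplifying assumptions: it hashes and splits only, ignores encryption, keeps nothing on disk, and uses a small `max_blob_size` so the example stays tiny. `ingest` and its return shape are hypothetical names, not Ingot's API.

```python
import hashlib
import io


def ingest(stream, max_blob_size, chunk_size=1 << 20):
    """Hash a body once while cutting it into blobs of at most max_blob_size.

    Each blob's digest is final as soon as the blob fills up, so a caller
    could start uploading it before the rest of the body has been read.
    """
    body_sha256 = hashlib.sha256()  # content address of the whole body
    body_md5 = hashlib.md5()        # ETag
    blobs = []                      # ordered (sha256 hex, length) pairs
    blob_sha256, blob_len, total = hashlib.sha256(), 0, 0
    while True:
        chunk = stream.read(min(chunk_size, max_blob_size - blob_len))
        if not chunk:
            break
        body_sha256.update(chunk)
        body_md5.update(chunk)
        blob_sha256.update(chunk)
        blob_len += len(chunk)
        total += len(chunk)
        if blob_len == max_blob_size:
            blobs.append((blob_sha256.hexdigest(), blob_len))
            blob_sha256, blob_len = hashlib.sha256(), 0
    if blob_len or not blobs:  # trailing partial blob, or an empty body
        blobs.append((blob_sha256.hexdigest(), blob_len))
    return {
        "size": total,
        "sha256": body_sha256.hexdigest(),
        "etag": body_md5.hexdigest(),
        "blobs": blobs,
    }


data = bytes(range(256)) * 10  # 2560 bytes
result = ingest(io.BytesIO(data), max_blob_size=1000, chunk_size=300)
print([length for _, length in result["blobs"]])  # [1000, 1000, 560]
```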

Comment thread docs/architecture.md
Comment on lines +306 to +310
- **`min_aggregate_size`** — the deletion-granularity knob (today a hardcoded 128 MiB). A blob
**≥ min** becomes its **own** on-chain piece. A blob **< min** is folded with other small blobs
into a shared aggregate piece (built up to ~min).
- **`max_blob_size`** — the largest blob Piri accepts (currently 256 MiB, liftable). Larger objects
are split into `≤ max` blobs by the data layer.

question: Are these sizes in plaintext, or in FEE envelopes?

Comment thread docs/architecture.md
`min` is the central lever because on-chain cost is **count-driven, not size-driven** (gas RFC). A
small `min` means most objects are their own piece, so most deletes are O(1) (below); the price is
more pieces — more `addPieces` transactions, a one-time registration tax, and an O(log) proving
ratchet under churn — all bounded, none a per-live-piece cost. The RFC's proposed knee is **8 MiB**

question: …🦵❓

Comment thread docs/architecture.md
Comment on lines +388 to +390
Postgres writes, and the guarded root swap — so it is small and bounded (sub-millisecond order). No
object is ever held whole in RAM. The catalog manifest is written only after its body blobs are
durable and accepted in Piri, so a crash never leaves a catalog entry pointing at non-durable data.

question: The entire manifest is held in RAM, right? How pathological does a bucket have to be for that to be a problem?

Comment thread docs/architecture.md
Comment on lines +488 to +496
The **multi-shard** case (an object split at the ceiling, or a multipart object) does need an ordering
record, and here there is a deliberate choice. Ingot's manifest carries the ordered
`{digest, offset, length}` list (Ingot-private, lean, no extra index block). The sharding RFC's
alternative is a UnixFS File root node + a sharded-dag-index `nodes` property, which preserves
**credible exit**: `guppy retrieve <root-cid>` reassembles *and decrypts* the whole object with no Ingot
in the loop (guppy can decrypt via the FEE tenant-recipient/recovery path), where the flat manifest
cannot. So the open call for multi-shard objects is **flat manifest (lean, Ingot-only) vs. UnixFS root +
index (stock-tooling plaintext recovery)** — a deliberate trade, not settled here. The compromise:
**skip the index for single-shard objects; build it only for >1 shard.**

question: In other words, for credible exit, multipart manifests need to exist in the space somewhere and not just in a Postgres DB?
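
For reference, this is roughly what the flat-manifest option buys on the read path: a sketch that maps a byte range onto the ordered `{digest, offset, length}` list from the quoted text. The function name and the half-open range convention are assumptions, and the digests are placeholders.

```python
def resolve_range(manifest, start, end):
    """Map the half-open body range [start, end) onto per-shard reads.

    manifest is the ordered list of {"digest", "offset", "length"} entries.
    Returns (digest, offset_within_shard, length) tuples in body order.
    """
    reads = []
    for entry in manifest:
        shard_start = entry["offset"]
        shard_end = shard_start + entry["length"]
        lo = max(start, shard_start)
        hi = min(end, shard_end)
        if lo < hi:  # the requested range overlaps this shard
            reads.append((entry["digest"], lo - shard_start, hi - lo))
    return reads


manifest = [
    {"digest": "d0", "offset": 0, "length": 100},
    {"digest": "d1", "offset": 100, "length": 100},
    {"digest": "d2", "offset": 200, "length": 50},
]
print(resolve_range(manifest, 90, 210))
# [('d0', 90, 10), ('d1', 0, 100), ('d2', 0, 10)]
```

Nothing in this lookup is available to a tool that only has the indexer, which is the credible-exit gap the question is pointing at.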

@hannahhoward hannahhoward left a comment

This is broadly excellent.

My only hard change is I'd like to have a pretty serious conversation about my proposed alternate form for doing versioning, which I think more closely mirrors the domain model for how S3 versioning is supposed to work.

Beyond that, it's just amazingly well thought out.

I do want to call attention to the fact that a lot of the thinking goes into questions we may not need to answer immediately:

  • The first version of deletion is just crypto-shredding the regional encryption key.
  • The absolute top-level goal in R0 is to stand up an end-to-end Forge provider in FilOne. No versioning, no real deletion, no various other things.

So my main direction is to focus on "how can I deliver this incrementally" to meet R0 (which is just staging and can be completely wiped), R1 (which needs to be data format complete, but need not be feature complete), and R2 (where we really want everything working).

Comment thread docs/architecture.md
Comment on lines +133 to +137
> - **Composite key (chosen) vs. per-key version index.** Keying the MST by `(key, version)` keeps all
> versions in one sorted space and amortizes well for many-versioned keys. The alternative keys the
> MST by object key alone, with the leaf pointing at an explicit newest-first version index — which
> removes the inversion, the escaping, and the key-budget pressure, but rewrites and grows that
> index block on every new version. Revisit if hot keys accumulate very many versions.

I want to second this and make a concrete proposal, after honestly rabbit-holing for an entire day on this design and how S3 versioning is supposed to work.

Before I get to what I propose, let me cover what I learned about S3 versioning.

In particular, in S3 a version's id is different from its ordinal order:

  • seq — the per-bucket monotonic ordinal (buckets.next_version_seq). Its only job is ordering (recency). Internal; never leaves the server.
  • version_id — a string, the S3 client handle (x-amz-version-id). Its job is identity. It's opaque, and for suspended/unversioned writes it is the literal string "null".

Unfortunately this distinction is not just an information leaking concern.

For numbered versions they can coincide (version_id can just be seq rendered, information concerns notwithstanding). They diverge for the null version: its version_id is frozen at "null" by S3, but it still has a real ordering (a seq).

Blame S3 for a wildly complicated concept.

invertedVersionId collapses the two — it's derived from the ordinal but handed back as the client version_id — and the null version is where that breaks: its id has to be "null", not an encoded ordinal, and there is only ever one null per key, replaced in place (verified against real S3 - I encourage you to make a bucket and mess around with turning on versioning, suspending and re-enabling multiple times etc). Keeping seq and version_id as separate fields makes that case fall out instead of needing a sentinel.

Ok now to the concrete proposal built on @Peeja 's idea:

  1. Top MST stays keyed by the plain object key
  2. Each leaf is:
    node = {seq, version_id, CatalogCID}
    leaf = { current: node, prev: MST<seq -> node >, null_seq?: seq }

current gives the fastest current path: GET / ListObjects read one leaf, with no descent. prev is a per-key sub-tree of older versions keyed by seq; because it's scoped to one bucket key, the prev key is just the ordinal. There is no fancy, error-prone string math to produce an object-key + version composite; "order an integer" is all that's left. null_seq records whether the prev tree holds a null version, because S3 allows exactly one "null" in a key's version history.

Write rule — depending on versioning state, push the displaced current into prev or discard it; and since S3 allows only one null per key, creating a new null also evicts any existing one:

retain = (versioning == Enabled) || (displaced.version_id != "null")
if retain:
    prev = put(prev, displaced.seq, displaced)
    if displaced.version_id == "null": null_seq = displaced.seq
else:
    gc_release(displaced)
if new.version_id == "null":
    if null_seq: prev = remove(prev, null_seq)
    null_seq = undefined
current = new

| Mode | Displaced | Retain |
| --- | --- | --- |
| Enabled | numbered or null | push (S3 keeps a re-enabled null as noncurrent "null") |
| Suspended | numbered | push |
| Suspended | null | discard (replace the null in place) |
| Unversioned | null | discard |

prev is written only when we retain history: always for Enabled, never for Unversioned, and only at the numbered→null boundary for Suspended. The lingering null is often noncurrent (it lands in prev when an Enabled write displaces it), so the leaf tracks its seq (null_seq) and a later null write can find and evict it.
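
For anyone who wants to poke at the write rule, here is a minimal runnable Python sketch of it. It is an illustration under assumptions, not the implementation: a plain `dict` stands in for the per-key `MST<seq -> node>`, a `released` list stands in for `gc_release`, and the names `Node`, `Leaf`, and `put_version` are hypothetical. It also tests `null_seq is not None` rather than truthiness, so a seq of 0 would not be mistaken for "no null".

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Node:
    seq: int
    version_id: str
    catalog_cid: str


@dataclass
class Leaf:
    current: Node
    prev: Dict[int, Node] = field(default_factory=dict)  # stand-in for MST<seq -> node>
    null_seq: Optional[int] = None  # seq of the null version held in prev, if any


def put_version(leaf: Leaf, new: Node, versioning: str, released: List[Node]) -> None:
    """Apply the write rule for one PUT; versioning is 'enabled', 'suspended' or 'unversioned'."""
    displaced = leaf.current
    retain = versioning == "enabled" or displaced.version_id != "null"
    if retain:
        leaf.prev[displaced.seq] = displaced
        if displaced.version_id == "null":
            leaf.null_seq = displaced.seq
    else:
        released.append(displaced)  # stand-in for gc_release(displaced)
    if new.version_id == "null":
        if leaf.null_seq is not None:
            # Only one null per key: evict the noncurrent one from prev.
            # The proposal does not say whether this entry is also gc_release'd.
            leaf.prev.pop(leaf.null_seq)
        leaf.null_seq = None
    leaf.current = new
```

Walking a key through unversioned → enabled → suspended writes with this sketch reproduces each row of the table: the null is replaced in place while unversioned or suspended, pushed to prev when an Enabled write displaces it, and evicted from prev when the next null arrives.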

Optional (to speed up lists): lift is_delete + render fields (etag/size/last_modified) into the leaf and prev entries, so ListObjects/ListObjectVersions render straight from the tree without fetching manifests, still verifiable. Left out above to keep the base shape minimal.
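
A rough Python sketch of what that optional shape could look like, assuming the render fields are simply extra attributes on each entry. The `Entry` type and both function names are hypothetical, a `dict` again stands in for the prev sub-tree, and the output tuples are simplified placeholders for the real S3 list responses.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List, Tuple


@dataclass(frozen=True)
class Entry:
    seq: int
    version_id: str
    catalog_cid: str
    is_delete: bool = False   # lifted: delete marker flag
    etag: str = ""            # lifted render fields
    size: int = 0
    last_modified: str = ""


def list_objects(current_by_key: Dict[str, Entry]) -> List[Tuple[str, str, int]]:
    """ListObjects from leaves only: keys in sorted order, delete markers hidden."""
    return [
        (key, e.etag, e.size)
        for key, e in sorted(current_by_key.items())
        if not e.is_delete
    ]


def list_object_versions(current: Entry, prev: Dict[int, Entry]) -> Iterator[Tuple[str, bool, bool]]:
    """Yield (version_id, is_latest, is_delete) newest-first for one key."""
    yield (current.version_id, True, current.is_delete)
    for seq in sorted(prev, reverse=True):
        older = prev[seq]
        yield (older.version_id, False, older.is_delete)
```

Neither function dereferences `catalog_cid`, which is the point of lifting the fields: the manifest is fetched only when a body is actually read.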

Cost: Enabled writes do a prev insert + top resplice each time — more blocks per write than a flat splice, but scoped to keys you're actually versioning. One thing to confirm: rebuild assumes logstore keeps per-op records / op-roots that point at created manifests, not just the latest forge_root_cid.

Comment thread docs/architecture.md

This layer is how blobs become proven storage, and how the size knobs govern deletion.

**Storage.** `allocate` → `PUT` parks a blob in MinIO keyed by its digest. `accept` commits it to

MinIO or "storage backend" -- I do think there's a world where some people just wanna throw ZFS at this.

@bajtos bajtos left a comment

I finished reading the version from yesterday and have a few more comments. Next, I'll read the discussion and the updated version.

Comment thread docs/architecture.md

Ingest, hashing, and the Piri upload all run **outside any lock** (upload is keyed by digest, not
bucket). The critical section does no large-body work — a manifest write, an MST splice, a few
Postgres writes, and the guarded root swap — so it is small and bounded (sub-millisecond order). No

> sub-millisecond order

Are you sure we can perform a few Postgres writes in sub-millisecond time?

Comment thread docs/architecture.md
-- catalog plane ships).
CREATE TABLE ingot.buckets (
name text PRIMARY KEY,
space text NOT NULL, -- Forge space DID

Can we use the name space_did or forge_space_did to capture the information from the code comment in the variable name? (Similarly to how we have root_cid and not root).

Comment thread docs/architecture.md
-- all. `space` is denormalized from buckets for a direct claim query.
CREATE TABLE ingot.blob_refs (
digest bytea NOT NULL, -- sha256 multihash of the blob
bucket text NOT NULL,

What is bucket here?

Is it the string bucket name (referencing ingot.buckets.name)? If yes, then I think bucket_name would be a better column name.

Do we have any concerns about the storage and performance implications of using text instead of integers? By integers, I mean a blob_refs.bucket_id INT foreign key referencing a buckets.id column that we would need to add.

Comment thread docs/architecture.md
digest bytea NOT NULL, -- sha256 multihash of the blob
bucket text NOT NULL,
object_key text NOT NULL,
version_id text NOT NULL,

Why text and not bigint, considering that next_version_seq already has bigint type?

Comment thread docs/architecture.md
bucket text NOT NULL,
object_key text NOT NULL,
version_id text NOT NULL,
space text NOT NULL, -- = buckets.space (denormalized)

space_did if we agree to rename the column (https://github.com/fil-forge/ingot/pull/2/changes#r3440973094)

Comment thread docs/architecture.md
Comment on lines +615 to +616
versioning text NOT NULL DEFAULT 'unversioned'
CHECK (versioning IN ('unversioned','enabled','suspended')),

Have you considered using an enum type instead? I think that would be more efficient in terms of storage size and query performance.

Comment thread docs/architecture.md
-- manifest without the client resupplying them.
CREATE TABLE ingot.multipart_sessions (
upload_id text PRIMARY KEY,
bucket text NOT NULL,

What is bucket here? Is it the string bucket name (referencing ingot.buckets.name)? Do we have any concerns about the storage and performance implications of using text instead of integers for this?

Comment thread docs/architecture.md
Comment on lines +662 to +663
state text NOT NULL DEFAULT 'open'
CHECK (state IN ('open','completing','aborting')),

Have you considered using an enum type instead? I think that would be more efficient in terms of storage size and query performance.

Comment thread docs/architecture.md
Comment on lines +678 to +679
state text NOT NULL DEFAULT 'parked'
CHECK (state IN ('parked','accepted')),

Have you considered using an enum type instead? I think that would be more efficient in terms of storage size and query performance.

Comment thread docs/architecture.md
-- this iteration — no catalog GC yet; a future collector consumes it.
CREATE TABLE ingot.gc_candidates (
cid bytea PRIMARY KEY, -- superseded MST node CID
bucket text,

What is bucket here? Is it the string bucket name (referencing ingot.buckets.name)? Do we have any concerns about the storage and performance implications of using text instead of integers for this?

Comment thread docs/aggregation-gate.md
Today Piri pools accepted blobs into ~128 MiB pieces (`MinAggregateSize`, up to a 256 MiB ceiling)
across **unrelated** objects, registers them in one node-wide proof set, and can delete only a
**whole** piece. So deleting one 8 MiB S3 object means: take down a ~128 MiB aggregate that also holds
~16 **other** objects, re-hash the survivors off-chain, re-register them under a new root, and

Why do the objects need to be re-hashed? Shouldn't we only need to compute the new root hash?
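
To illustrate the distinction the question draws, here is a small Python sketch over a generic binary SHA-256 Merkle tree. This is an assumption-laden toy: it is not Filecoin's piece-commitment construction, and whether Piri keeps per-object hashes around after aggregation is exactly the open question. It only shows that, if the survivors' leaf hashes are cached, a new root can be derived from those hashes without reading the object bytes again.

```python
import hashlib
from typing import List


def leaf_hash(data: bytes) -> bytes:
    """Hash the object bytes once; this is the expensive, data-sized step."""
    return hashlib.sha256(data).digest()


def merkle_root(leaves: List[bytes]) -> bytes:
    """Fold leaf hashes pairwise into a root, duplicating the last node on odd levels."""
    if not leaves:
        raise ValueError("need at least one leaf")
    level = list(leaves)
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        level = [
            hashlib.sha256(level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0]


# 17 stand-in objects; their hashes are computed once and cached.
objects = [bytes([i]) * 1024 for i in range(17)]
cached = [leaf_hash(obj) for obj in objects]
old_root = merkle_root(cached)

# Delete object 3: the new root is built from cached hashes only.
survivors = cached[:3] + cached[4:]
new_root = merkle_root(survivors)
```

In this toy model, deleting one object costs a handful of 64-byte hashes rather than a pass over the surviving data; a real piece tree with alignment and padding rules may not allow the same shortcut.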

Comment thread docs/aggregation-gate.md

## The model that fixes it: one object = one piece (no cross-object aggregation)

The proposal is to **stop pooling unrelated objects into shared pieces** and store each S3 object as

What happens if an object is smaller than 8 MiB? Would it be padded, or would it be aggregated with unrelated objects into a shared piece, as happens today?

Comment thread docs/aggregation-gate.md
so the floor-fallback is a small change; the object=piece target mostly means *not* aggregating across
objects.

**2 — Prove small pieces are viable in the PDP contracts, at scale.** The RFC's *measured* gas already

Is there an ongoing discussion about this with the FWSS team? Would you mind linking that discussion in case there is one?

Comment thread docs/architecture.md
blobs, each `≤ max_blob_size`: one blob for objects within the ceiling, a coarse split (e.g. 256 MiB
granularity, not fine chunking) for larger ones. Each blob is uploaded to Piri by digest ([§7](#7-cross-cutting-durability-concurrency-retrieval)).

**The local store (spool + cache).** Each blob is written locally before upload — both because the

question: Is Sprue aware when an Ingot node also doubles as a Piri node? Will it prioritize such a node for blob slot allocation over other nodes, since no bytes need to move across the wire?

Comment thread docs/architecture.md
GET key[?versionId] (after §3 precondition checks)
MST → manifest → ordered [blobDigest, byteRange] list (catalog via the index-claim path)
per covering blob:
in local cache/store → serve locally

Is Ingot going to check UCAN delegations for local reads?

@bajtos

bajtos commented Jun 19, 2026

> The proposal (the simple world). Drop cross-object aggregation. Store each S3 object as its own content-addressed blob.
>
> The catch, and the real point: this hinges on a contract-practicality question we don't have the answer to. Object=piece maximizes piece count, so it only works if the PDP contracts and proving can take that volume in practice.

FWIW, Filecoin Onchain Cloud uses exactly this design (or at least did back in Q1 2026): one file = one PDP piece. So PDP and FWSS were designed on the assumption that there will be one piece per file, with files as small as ~32 bytes.

Of course, the question is whether FOC has ambitions to handle the storage scale FilOne is aiming for, or whether their design works for FOC only because they operate at a much smaller scale.
