On-disk persistence layer #33
Open
zzet wants to merge 234 commits into
The persistence layer is about to grow a second and third backend
(on-disk bbolt + on-disk SQLite), eventually a remote one. To let the
rest of gortex stay backend-agnostic, lift the surface the codebase
actually consumes out of *Graph into a graph.Store interface and have
*Graph satisfy it via a compile-time assertion.
The interface mirrors the 28 public methods on *Graph as they exist
today, in their current slice-shaped signatures, so this commit is
strictly additive: every existing caller keeps working unchanged. A
few notes on the shape:
- Slice-shaped reads (AllNodes / AllEdges / FindNodesByName / …)
materialise their result in memory. Fine for the in-memory store;
disk and remote backends will want iterator variants added
alongside as those implementations come online — they don't have
to replace these.
- Memory-estimate methods (RepoMemoryEstimate /
AllRepoMemoryEstimates) are inherently in-memory specific. Disk
and remote backends return whatever they can compute and callers
treat the result as advisory.
- *Graph.ResolveMutex() is intentionally NOT on the interface. It's
an in-memory implementation detail (resolver coordination) that
does not generalise to disk / remote backends. Resolver callers
keep operating on *Graph directly until that coordination is
reshaped.
The compile-time assertion `var _ Store = (*Graph)(nil)` is the
load-bearing check: if anyone's edit to *Graph drifts a signature, the
build breaks here instead of at runtime when a different backend gets
swapped in.
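A minimal sketch of that assertion pattern follows. The three-method `Store` and the toy `Graph` here are stand-ins invented for illustration; the real interface is described above as mirroring 28 methods.

```go
package main

import "fmt"

// Node is an illustrative stand-in for the real graph node type.
type Node struct{ ID string }

// Store is a deliberately tiny, hypothetical subset of graph.Store.
type Store interface {
	AddNode(n *Node)
	GetNode(id string) *Node
	NodeCount() int
}

// Graph is a toy in-memory implementation.
type Graph struct{ nodes map[string]*Node }

func NewGraph() *Graph { return &Graph{nodes: make(map[string]*Node)} }

func (g *Graph) AddNode(n *Node)         { g.nodes[n.ID] = n }
func (g *Graph) GetNode(id string) *Node { return g.nodes[id] }
func (g *Graph) NodeCount() int          { return len(g.nodes) }

// Compile-time assertion: if any *Graph method signature drifts away
// from Store, this line stops compiling. It allocates nothing at runtime.
var _ Store = (*Graph)(nil)

func main() {
	var s Store = NewGraph()
	s.AddNode(&Node{ID: "pkg.Func"})
	fmt.Println(s.NodeCount(), s.GetNode("pkg.Func") != nil)
}
```

Changing, say, `GetNode` to return `(*Node, bool)` on `*Graph` alone would make the `var _ Store` line fail to build, which is the drift the commit message is guarding against.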
No behaviour change, no caller change, no test change. Graph
package tests still pass with -race.
…line
Adds internal/graph/storetest, a reusable conformance test suite that
every graph.Store implementation MUST pass. Codifies the union of
behaviour the rest of gortex depends on from *graph.Graph today, so
new backends (on-disk bbolt, on-disk SQLite, remote) can prove
drop-in compatibility before being wired into the daemon.
31 subtests cover:
- point lookups (GetNode, GetNodeByQualName)
- name + scope queries (FindNodesByName, FindNodesByNameInRepo,
GetFileNodes, GetRepoNodes)
- edge adjacency (GetOutEdges, GetInEdges) + idempotency +
line-disambiguation
- bulk reads (AllNodes, AllEdges) + counts + Stats / RepoStats /
RepoPrefixes
- mutations: AddNode, AddBatch, AddEdge, RemoveEdge, ReindexEdge,
SetEdgeProvenance
- eviction: EvictFile, EvictRepo (+ "no nodes" edge cases)
- structural invariants: EdgeIdentityRevisions, VerifyEdgeIdentities
- memory estimation: RepoMemoryEstimate, AllRepoMemoryEstimates
- Meta map round-trip
- empty-store invariants
- concurrent AddNode from 8 goroutines (race-safe)
Backends invoke via:
    storetest.RunConformance(t, func(t *testing.T) graph.Store {
        return openMyBackend(t)
    })
memory_conformance_test.go proves the in-memory *graph.Graph passes
the full suite — 31/31 subtests green with -race. This is the
canonical baseline; on-disk backends will land alongside in follow-up
commits and slot into the same harness.
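The shape of such a harness can be sketched as below. This is a simplified illustration, not the storetest package: it reports failures as strings instead of going through `*testing.T` so that it runs as a plain program, and the `Store` subset, `memStore`, and the three checks are assumptions chosen to echo subtest names from the list above.

```go
package main

import "fmt"

type Node struct{ ID, Name string }

// Store is a hypothetical three-method subset of graph.Store.
type Store interface {
	AddNode(n *Node)
	GetNode(id string) *Node
	NodeCount() int
}

// check is one named conformance case; each runs against a fresh store.
type check struct {
	name string
	run  func(s Store) error
}

var checks = []check{
	{"EmptyStore", func(s Store) error {
		if s.NodeCount() != 0 {
			return fmt.Errorf("fresh store has %d nodes", s.NodeCount())
		}
		return nil
	}},
	{"AddGetNode", func(s Store) error {
		s.AddNode(&Node{ID: "a", Name: "A"})
		if n := s.GetNode("a"); n == nil || n.Name != "A" {
			return fmt.Errorf("GetNode(a) = %v", n)
		}
		return nil
	}},
	{"AddNodeIdempotent", func(s Store) error {
		s.AddNode(&Node{ID: "a"})
		s.AddNode(&Node{ID: "a"})
		if s.NodeCount() != 1 {
			return fmt.Errorf("NodeCount = %d, want 1", s.NodeCount())
		}
		return nil
	}},
}

// RunConformance calls open once per check, so no state leaks between
// cases, and returns a description of every failing check.
func RunConformance(open func() Store) []string {
	var failed []string
	for _, c := range checks {
		if err := c.run(open()); err != nil {
			failed = append(failed, c.name+": "+err.Error())
		}
	}
	return failed
}

// memStore is a toy backend used to exercise the harness.
type memStore struct{ nodes map[string]*Node }

func newMemStore() Store { return &memStore{nodes: map[string]*Node{}} }

func (m *memStore) AddNode(n *Node)         { m.nodes[n.ID] = n }
func (m *memStore) GetNode(id string) *Node { return m.nodes[id] }
func (m *memStore) NodeCount() int          { return len(m.nodes) }

func main() {
	failed := RunConformance(newMemStore)
	fmt.Println("failing checks:", len(failed))
}
```

Taking a factory rather than a store instance is what lets every backend slot into the same suite: the harness decides when a fresh store is needed, and the backend only has to know how to open one.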
A few methods are documented as "permissive" in the suite
(EdgeIdentityRevisions allows zero, VerifyEdgeIdentities allows nil,
memory-estimate methods only check NodeCount) because they're
inherently in-memory-specific. Disk and remote backends return
whatever they can compute and callers treat the result as advisory —
matches the contract documented on the Store interface itself.
…Store

The first non-memory backend for the persistence layer extracted in
8221a40. Embeds bbolt v1.4.3 (already a transitive dep, promoted to
direct here), keeps gortex deployable as a single binary, and adds a
real on-disk option for any deployment that wants graph state to
survive daemon restarts without paying the full snapshot/restore
cycle every time.

## Schema

Ten top-level bbolt buckets:

    nodes              key=nodeID                  value=gob(Node)
    edges              key=edgeKeyBytes            value=gob(Edge)
    idx_node_kind      key=kind\x00nodeID          value=empty
    idx_node_file      key=filePath\x00nodeID      value=empty
    idx_node_repo      key=repoPrefix\x00nodeID    value=empty
    idx_node_name      key=name\x00nodeID          value=empty
    idx_node_qualname  key=qualName                value=nodeID
    idx_edge_out       key=fromID\x00edgeKeyBytes  value=empty
    idx_edge_in        key=toID\x00edgeKeyBytes    value=empty
    meta               misc counters

`edgeKeyBytes` encodes (from, to, kind, file, line) with 2-byte
big-endian length prefixes on each variable-length component plus a
4-byte big-endian line: uniquely decodable so RemoveEdge /
ReindexEdge locate exact rows, lexicographically scannable so
adjacency prefix walks are O(k) in the matches.

The four scoped node indexes use the standard
"{attr}\x00{nodeID} → empty" pattern so a Seek on the attr-prefix
enumerates every matching nodeID in O(k). idx_node_qualname is a flat
unique lookup (1:1).

The `meta` bucket holds the 8-byte big-endian edge-identity-revisions
counter, bumped from putEdgeTx and SetEdgeProvenance to mirror the
in-memory store's revision semantics.

## Concurrency

All writes go through `db.Update` (bbolt single-writer); all reads
through `db.View` (unlimited concurrent readers under MVCC).
SetEdgeProvenance also takes a small in-memory `provMu` to make its
read-modify-write atomic against concurrent provenance bumps. The
conformance suite's 8-goroutine concurrent AddNode test passes under
`-race`.

## Encoding

Node and Edge are gob-encoded, the same codec the existing
FileStore-based snapshot uses, so Meta map[string]any round-trips
without surprises and we inherit gob's forward-compatibility for
unknown-field-during-decode (matters when an older daemon reads a
newer-schema DB).

## Conformance

`storetest.RunConformance` passes 30/30 subtests with `-race`:
AddGetNode, AddGetEdge, AddNodeIdempotent, AddEdgeIdempotent,
AddEdgeLineDisambiguates, AddBatch, RemoveEdge, EvictFile,
EvictFile_NoNodes, EvictRepo, EvictRepo_NoNodes, NodeAndEdgeCount,
AllNodesAndEdges, FindNodesByName, FindNodesByNameInRepo,
GetFileNodes, GetRepoNodes, GetNodeByQualName, Stats, RepoStats,
RepoPrefixes, SetEdgeProvenance, ReindexEdge, Concurrency,
EdgeIdentityRevisions, VerifyEdgeIdentities, RepoMemoryEstimate,
AllRepoMemoryEstimates, MetaPreserved, EmptyStore.

Nothing skipped or weakened, including EdgeIdentityRevisions (real
counter persisted in `meta`) and VerifyEdgeIdentities (cross-checks
every edge bucket row against both adjacency indexes).

## Dependencies

Zero new deps. `go.etcd.io/bbolt v1.4.3` was already an indirect
transitive; this commit promotes it to a direct require because the
new package imports it.
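As an illustration of the `edgeKeyBytes` layout described in the Schema section, the sketch below encodes and decodes a key with 2-byte big-endian length prefixes and a trailing 4-byte big-endian line. The function names, the choice to treat all four of from/to/kind/file as length-prefixed strings, and the error value are assumptions; the actual store_bolt code may differ.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"errors"
	"fmt"
)

var errBadKey = errors.New("malformed edge key")

// encodeEdgeKey writes each string as <uint16 BE length><bytes> and ends
// with the line as uint32 BE. Fields longer than 65535 bytes would not
// fit the prefix; this sketch does not guard against that.
func encodeEdgeKey(from, to, kind, file string, line uint32) []byte {
	key := make([]byte, 0, 4*2+len(from)+len(to)+len(kind)+len(file)+4)
	for _, field := range []string{from, to, kind, file} {
		key = binary.BigEndian.AppendUint16(key, uint16(len(field)))
		key = append(key, field...)
	}
	return binary.BigEndian.AppendUint32(key, line)
}

// decodeEdgeKey reverses encodeEdgeKey and rejects truncated input.
func decodeEdgeKey(key []byte) (fields [4]string, line uint32, err error) {
	for i := range fields {
		if len(key) < 2 {
			return fields, 0, errBadKey
		}
		n := int(binary.BigEndian.Uint16(key))
		key = key[2:]
		if len(key) < n {
			return fields, 0, errBadKey
		}
		fields[i], key = string(key[:n]), key[n:]
	}
	if len(key) != 4 {
		return fields, 0, errBadKey
	}
	return fields, binary.BigEndian.Uint32(key), nil
}

func main() {
	a := encodeEdgeKey("pkg.A", "pkg.B", "calls", "a.go", 10)
	b := encodeEdgeKey("pkg.A", "pkg.B", "calls", "a.go", 20)

	fields, line, err := decodeEdgeKey(a)
	fmt.Println(fields, line, err)

	// Same endpoints, different line: the keys share a prefix and the
	// big-endian line keeps them in numeric order under byte comparison.
	fmt.Println(bytes.Compare(a, b) < 0)
}
```

Because the line is big-endian, two edges that differ only by line sort next to each other in the bucket, which is what makes the line-disambiguation case a cheap adjacent lookup rather than a scan.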
…ph.Store

The second on-disk backend for the persistence layer extracted in
8221a40. Built on modernc.org/sqlite (the transpiled pure-Go SQLite
driver) so the single-binary deployment story stays intact: no CGO
beyond what tree-sitter already pulls in. Sits behind the same
graph.Store interface as the in-memory and bbolt backends and passes
the identical conformance suite.

Why two on-disk backends: bbolt and SQLite have different sweet spots
(bbolt = faster point lookups, simpler model; SQLite = richer query
surface, mature tooling). The Store interface lets us ship both and
let the deployment pick. Cross-backend benchmarking comes in a
follow-up commit.

## Schema

Two tables:

    nodes  (PK on id, secondary indexes on name, kind, file_path,
           partial index on repo_prefix where non-empty, partial
           UNIQUE on qual_name where non-empty)
    edges  (synthetic INTEGER PK AUTOINCREMENT,
           UNIQUE(from_id, to_id, kind, file_path, line),
           secondary indexes on (from_id, kind) and (to_id, kind)
           for the hot adjacency walks)

Meta rides as a gob-encoded BLOB on both tables; NULL when empty so
the common case stays zero-cost.

The UNIQUE constraint on edges (from, to, kind, file, line) gives
INSERT OR IGNORE semantics matching the in-memory store's logical
edge-key dedup without needing application-level checks.

The two partial indexes (repo_prefix where non-empty, qual_name where
non-empty) skip the empty-string default values that the zero-valued
Node struct produces, keeping those indexes tight.

## Connection management

- DSN PRAGMAs: journal_mode=WAL, synchronous=NORMAL,
  busy_timeout=5000.
- SetMaxOpenConns(1) plus a Go-side write mutex serialises writes and
  sidesteps SQLITE_BUSY under the 8-goroutine conformance Concurrency
  test.
- All hot queries use prepared *sql.Stmt built once in Open and
  closed in Close.
- AddBatch wraps the inserts in a single BEGIN/COMMIT transaction,
  which is the 10-100x speedup that matters at indexing scale.

## EdgeIdentityRevisions / VerifyEdgeIdentities

- EdgeIdentityRevisions: in-process atomic.Int64, bumped only when
  SetEdgeProvenance actually changes the stored origin (mirrors the
  in-memory store, where the counter is also per-process).
- VerifyEdgeIdentities: returns nil. The in-memory invariant is "same
  *Edge pointer in both adjacency views"; the SQL store has one row
  per edge so the invariant is structurally trivial.

## Conformance

`storetest.RunConformance` passes 30/30 subtests with `-race`.
Total: 93 tests across all three backends (in-memory + bolt + sqlite)
green.

## Dependencies

One new direct dep: `modernc.org/sqlite v1.50.1` (latest release,
tagged 2026-05-10). Transitives: modernc.org/libc, mathutil, memory,
github.com/ncruces/go-strftime, github.com/remyoudompheng/bigfft, all
standard for this driver. Pure Go end-to-end; no additional CGO.
The agent-generated first cut of store_bolt used gob with a fresh
gob.Encoder per record. Each fresh encoder emits the Node / Edge
type-definition prologue (~200-400 bytes of metadata) at the start of
its byte stream because it has no remembered type state — across the
hundreds of thousands of nodes and edges a large repo's graph holds,
that's hundreds of MB of redundant per-record metadata flowing
through the BTree on bulk load and a proportional commit-time
penalty.
Compounded by AddBatch doing all writes in a single Update over the
full input — bbolt has to rebalance every dirty page in the tx at
commit, so commit cost scales O(N log N) with batch size and dominates
once N gets large.
The combined result of those two paper cuts: AddBatch of a
121 097-node, 515 232-edge graph from gortex itself took 4-5 minutes
on a clean box and never finished on linux/drivers. Not viable as a
benchmarkable backend, let alone production.
Two fixes in this commit:
1. Replace gob with a hand-rolled length-prefixed binary codec.
   Schema (versioned with a 1-byte tag for future migration):
     Node: ID, Kind, Name, QualName, FilePath, Language, RepoPrefix,
           WorkspaceID, ProjectID, AbsoluteFilePath (varint-prefixed
           strings), StartLine, EndLine (varint), Meta (varint-len
           + gob blob, len=0 when empty).
     Edge: From, To, Kind, FilePath, Line, Confidence (8-byte f64),
           ConfidenceLabel, Origin, Tier, CrossRepo (u8), Meta.
   Meta keeps gob (handles map[string]any free-form), but only the
   small blob pays the prologue and only when meta is actually
   populated; the common "no meta" record pays zero codec overhead.
   Encode reuses a sync.Pool'd []byte to avoid alloc churn.
2. Chunk AddBatch into 5 000-mutation transactions instead of a
   single giant Update. Each chunk commits independently; readers
   see writes in chunk granularity rather than as one atomic batch,
   but the indexer only calls AddBatch from a single goroutine during
   cold-index so that's not a correctness concern. 5 000 is the
   empirical sweet spot where dirty-set commit cost amortises
   without ballooning.
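To make the first fix concrete, here is a rough sketch of a version-tagged, varint-prefixed record codec using `encoding/binary`. It covers only a handful of Node fields and omits Meta and the pooled buffer; the field subset, function names, and error values are assumptions for illustration, not the store_bolt implementation.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

const codecVersion = 1 // 1-byte tag at the front of every record

// Node carries an illustrative subset of the fields listed above.
type Node struct {
	ID, Kind, Name     string
	StartLine, EndLine uint64
}

var errTruncated = errors.New("truncated record")

func appendString(dst []byte, s string) []byte {
	dst = binary.AppendUvarint(dst, uint64(len(s)))
	return append(dst, s...)
}

func encodeNode(n Node) []byte {
	buf := []byte{codecVersion}
	for _, s := range []string{n.ID, n.Kind, n.Name} {
		buf = appendString(buf, s)
	}
	buf = binary.AppendUvarint(buf, n.StartLine)
	return binary.AppendUvarint(buf, n.EndLine)
}

func readUvarint(b []byte) (uint64, []byte, error) {
	v, n := binary.Uvarint(b)
	if n <= 0 { // 0: buffer too small, <0: overflow
		return 0, nil, errTruncated
	}
	return v, b[n:], nil
}

func readString(b []byte) (string, []byte, error) {
	n, rest, err := readUvarint(b)
	if err != nil {
		return "", nil, err
	}
	if uint64(len(rest)) < n {
		return "", nil, errTruncated
	}
	return string(rest[:n]), rest[n:], nil
}

func decodeNode(b []byte) (Node, error) {
	var n Node
	if len(b) == 0 || b[0] != codecVersion {
		return n, errors.New("unsupported codec version")
	}
	b = b[1:]
	var err error
	for _, dst := range []*string{&n.ID, &n.Kind, &n.Name} {
		if *dst, b, err = readString(b); err != nil {
			return n, err
		}
	}
	for _, dst := range []*uint64{&n.StartLine, &n.EndLine} {
		if *dst, b, err = readUvarint(b); err != nil {
			return n, err
		}
	}
	return n, nil
}

func main() {
	in := Node{ID: "n1", Kind: "func", Name: "Foo", StartLine: 10, EndLine: 42}
	raw := encodeNode(in)
	out, err := decodeNode(raw)
	fmt.Println(len(raw), out == in, err)
}
```

Unlike a fresh `gob.Encoder` per record, nothing here describes the type on the wire: each record carries only its version byte, its lengths, and its payload, so short records stay short.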
Measured on the gortex repo itself (1 955 files, 121 097 nodes,
515 232 edges):
bbolt AddBatch: 4-5 min (stuck, killed) → 18.6 s (real-world fast).
The remaining gap vs in-memory (883 ms) and SQLite (13.4 s) is
fundamental on-disk write cost — bbolt's BTree commit + the index
fan-out (each node touches 4 index buckets; each edge touches 2)
costs what it costs.
The 31 storetest.RunConformance subtests still pass with -race,
identical to the original implementation. Codec roundtrip is exact
for every field including Meta.
Disk size note: 914 MB at 121 k nodes / 515 k edges (≈1.4 KB/item).
SQLite stores the same data in 387 MB; the gap is bbolt's per-bucket
page allocation across 10 buckets — addressable later by collapsing
index buckets if disk size becomes load-bearing, but not in this
commit.
A standalone bench that loads the same in-memory reference graph into
every graph.Store implementation and reports load time, on-disk size,
heap residency, and query-mix p50/p95. Lets us validate that a backend
choice is the right tradeoff for a given workload instead of guessing.
Procedure:
1. Index the target repo once with the in-memory indexer to build a
reference *graph.Graph (ground truth shared across all runs).
2. Sample a deterministic-ish query workload from the reference
graph: N point lookups, N adjacency walks (split out/in), N/4
name searches, N/4 file-node scans.
3. For each backend (in-memory, bbolt, sqlite): open a fresh store,
bulk-load via AddBatch (timed), run the workload (timed), force
GC and sample HeapInuse, close and measure on-disk size.
4. Emit a markdown comparison table.
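The timing half of step 3 can be sketched as follows: time each query individually, then read p50/p95 off the sorted samples. The nearest-rank percentile definition and the helper names are assumptions; the bench may compute its percentiles differently.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// percentile returns the nearest-rank p-th percentile (1..100) of samples.
// Integer arithmetic avoids floating-point rounding at the rank boundary.
func percentile(samples []time.Duration, p int) time.Duration {
	if len(samples) == 0 {
		return 0
	}
	sorted := append([]time.Duration(nil), samples...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	rank := (p*len(sorted) + 99) / 100 // ceil(p*n/100)
	if rank < 1 {
		rank = 1
	}
	if rank > len(sorted) {
		rank = len(sorted)
	}
	return sorted[rank-1]
}

// timeQueries runs query n times and records each call's wall time.
func timeQueries(n int, query func(i int)) []time.Duration {
	out := make([]time.Duration, n)
	for i := range out {
		start := time.Now()
		query(i)
		out[i] = time.Since(start)
	}
	return out
}

// micros builds a sample slice from microsecond values (test helper).
func micros(vals ...int) []time.Duration {
	out := make([]time.Duration, len(vals))
	for i, v := range vals {
		out[i] = time.Duration(v) * time.Microsecond
	}
	return out
}

func main() {
	lookup := map[int]int{1: 1, 2: 4, 3: 9}
	samples := timeQueries(1000, func(i int) { _ = lookup[i%3+1] })
	fmt.Println("p50:", percentile(samples, 50), "p95:", percentile(samples, 95))
}
```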
Result on the gortex repo itself (1 955 files, 121 097 nodes,
515 232 edges):
| backend | load | disk | heap | qp50 | qp95 |
|---------|--------:|---------:|-------:|------:|--------:|
| memory | 883 ms | — | 746 MB | <1µs | 2 µs |
| bbolt | 18.6 s | 914.0 MB | 747 MB | 13 µs | 626 µs |
| sqlite | 13.4 s | 386.7 MB | 31 MB | 20 µs | 1.35 ms |
Headline reads:
- In-memory wins on load + query latency by 1-2 orders of magnitude
(no encoding, no commits) — confirms the existing default is right
for repos that fit in RAM.
- SQLite wins on disk footprint (2.4× smaller than bbolt) and Go
heap (24× less — only the connection pool resides; rows stay on
disk) — the right answer for "doesn't fit in RAM" deployments.
- bbolt wins on hot-path query latency vs sqlite (13 µs vs 20 µs p50;
tail is in the same ballpark). Right when read latency matters
more than disk size.
- Both disk backends are sub-ms p50 — comfortably below "feels
instant" for interactive use.
Usage:
    go run ./bench/store-bench -root <path> -queries N
    go run ./bench/store-bench -root <path> -skip-bolt    # memory + sqlite only
    go run ./bench/store-bench -root <path> -skip-sqlite  # memory + bolt only
Notes for future readers: heap numbers in the table are HeapInuse
(bytes in in-use spans, counting the unused slots inside them), which over-reports vs
true live allocation. The right metric for "what would a daemon
really hold" is HeapAlloc, but HeapInuse stays consistent across
backends and matches what ps reports — kept for that reason. The
in-memory and bbolt rows both include the reference graph (held by
the bench's main()), so their delta is what the backend itself adds
on top of the reference; the sqlite row presumably saw GC reclaim the
intermediate parse trees between the bolt and sqlite runs.
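For readers who want to reproduce the two heap figures discussed here, a small sketch using `runtime.ReadMemStats` is below. The `heapSample` type and `sampleHeap` helper are hypothetical names; only the `runtime` calls and the `HeapAlloc` / `HeapInuse` fields are from the standard library.

```go
package main

import (
	"fmt"
	"runtime"
)

// heapSample pairs the two metrics compared in the notes above.
type heapSample struct {
	Alloc uint64 // HeapAlloc: bytes of allocated heap objects
	Inuse uint64 // HeapInuse: bytes in in-use spans
}

// sampleHeap forces a collection first so garbage from earlier phases
// does not inflate the reading, then snapshots the runtime counters.
func sampleHeap() heapSample {
	runtime.GC()
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	return heapSample{Alloc: ms.HeapAlloc, Inuse: ms.HeapInuse}
}

func main() {
	// Hold some memory so there is something to measure.
	retained := make([][]byte, 0, 1024)
	for i := 0; i < 1024; i++ {
		retained = append(retained, make([]byte, 16<<10))
	}

	s := sampleHeap()
	fmt.Printf("HeapAlloc=%d KiB HeapInuse=%d KiB\n", s.Alloc>>10, s.Inuse>>10)

	runtime.KeepAlive(retained) // keep the slices live until after sampling
}
```

Anything still referenced when `sampleHeap` runs is counted, which is why a reference graph held by the caller shows up in every backend's row.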
Closes the gap between "we extracted a Store interface" and "the
indexer actually uses it". Previously the Store interface existed
(8221a40) and three backends implemented it, but every consumer of
the graph (indexer.New, resolver.New, NewCrossRepo, the temporal /
gRPC / external resolver passes, the contracts bind/wrapper passes,
the modules import linker, the semantic enricher) still typed its
parameter as *graph.Graph. That made the disk backends unreachable
from production code paths and reduced the cross-backend benchmark to
"how fast can we migrate one in-memory graph into another store"
instead of "how fast does the real indexer run with this backend".

This commit rewrites the affected signatures in place:
*graph.Graph → graph.Store across the indexer, resolver, contracts,
modules, semantic, and related packages. No call sites change
behaviour: *graph.Graph already satisfies graph.Store (via the
compile-time assertion in store.go), so existing callers that hand in
a *graph.Graph keep working unchanged. Disk and remote backends are
now also legal arguments everywhere a graph used to flow.

One small interface change: ResolveMutex() is now a Store method. The
resolver's cross-package coordination (cross-repo, temporal,
external, edge-mutation passes) needs the same serialisation
regardless of backend, so the in-memory-specific carve-out from the
original interface no longer makes sense. Memory store keeps its
existing graph-wide resolveMu; bbolt and sqlite each grew a dedicated
resolveMu separate from their internal write mutexes; the two protect
different invariants and shouldn't share a lock.

What works now that didn't before:
- indexer.New(boltStore, …): full indexing pipeline through bbolt
- indexer.New(sqliteStore, …): full indexing pipeline through sqlite
- resolver.New(anyStore): resolver works against any backend
- All downstream passes (contracts, semantic, modules, clones,
  test-edge, search-index build) take the Store interface

Conformance: all 3 backends still pass the 93-subtest storetest
suite. The 1 166 tests across indexer / resolver / contracts /
semantic / modules / storetest / store_bolt / store_sqlite pass with
the new signatures. go vet ./... clean.

Follow-up commit (bench/store-bench rewrite) will replace the
"migrate in-memory graph into store" pattern with "drive the full
indexer per backend" to produce the apples-to-apples comparison the
old harness only approximated.
…etEdgeProvenanceBatch)
The resolver applies per-edge ReindexEdge / SetEdgeProvenance inside
tight loops over thousands of edges per pass (the main worker-join
mutation loop, cross-package guard, cross-repo / temporal / external
/ relative-imports / module-attribution / grpc-stub-call passes — 13
call sites in total). For the in-memory store each call is a couple
of map updates; for bbolt and sqlite each call is an ACID round-trip
(transaction begin, page mutations, WAL/journal append, fsync,
commit). The first end-to-end bench through the bolt-backed indexer
got stuck in the resolver pass for 22+ minutes — exactly because
~10k single-edge ReindexEdge calls were committing one at a time.
Adds two batched siblings of the per-edge methods. The interface
stays simple — callers pass the whole batch slice in one call; each
backend chooses its own chunk-size internally and runs one tx per
chunk:
    ReindexEdges(batch []EdgeReindex)
    SetEdgeProvenanceBatch(batch []EdgeProvenanceUpdate) (changed int)
Backend implementations:
- Memory: straight loop through the existing per-edge methods.
Zero behaviour change for in-memory callers.
- bbolt: chunks at reindexChunkSize=5000 (same constant /
rationale as addBatchChunkSize) and wraps each chunk in one
db.Update. The setEdgeProvenanceTx helper is factored out of
SetEdgeProvenance so the batch variant can call it inside a
shared Tx; bumpEdgeIdentityRevisions still fires per actual
change so the persisted counter matches the per-edge contract.
- sqlite: chunks at the same 5000 boundary, opens one BEGIN/COMMIT
per chunk, and re-uses prepared statements across the chunk
(tx.Stmt wraps the Store's pooled stmts so the SQL parse step
happens once per Store, not per call). edgeIdentityRevs.Add
fires once per chunk by the actual change count.
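The "one transaction per chunk" idea shared by the two disk backends can be sketched generically. `applyInChunks` and its `commit` callback are hypothetical; in a real backend the callback would wrap something like a bbolt `db.Update` or a SQL BEGIN/COMMIT.

```go
package main

import (
	"errors"
	"fmt"
)

const reindexChunkSize = 5000 // same value the text gives for both backends

// applyInChunks hands batch to commit in slices of at most size items and
// reports how many commits ran. It stops at the first failing chunk, so
// earlier chunks stay committed (chunk-level, not batch-level, atomicity).
func applyInChunks[T any](batch []T, size int, commit func(chunk []T) error) (int, error) {
	if size <= 0 {
		return 0, errors.New("chunk size must be positive")
	}
	commits := 0
	for start := 0; start < len(batch); start += size {
		end := start + size
		if end > len(batch) {
			end = len(batch)
		}
		if err := commit(batch[start:end]); err != nil {
			return commits, err
		}
		commits++
	}
	return commits, nil
}

func main() {
	edges := make([]int, 12001) // stand-in for []EdgeReindex
	total := 0
	commits, err := applyInChunks(edges, reindexChunkSize, func(chunk []int) error {
		total += len(chunk) // a real backend would mutate inside one tx here
		return nil
	})
	fmt.Println(commits, total, err)
}
```

With 12 001 mutations and a chunk size of 5 000 this runs three commits instead of 12 001, and an empty or nil batch runs none.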
Conformance: two new storetest subtests cover batch semantics
(round-trip across all three backends including the chunk boundary)
and empty-batch / nil-batch invariants. 99 conformance subtests
across the three backends now green with -race, up from 93.
Caller migration follows in a separate commit so the surface area
changes (Store methods) and the consumer changes (resolver call
sites) read cleanly in git history.
Migrates all 13 call sites in the resolver from the per-edge
ReindexEdge / SetEdgeProvenance calls to the new batched siblings
landed in the previous commit. Each pass now accumulates its
mutations into a local []EdgeReindex / []EdgeProvenanceUpdate slice
and hands the whole batch to the Store at the end of the loop, so a
single resolver pass over N mutated edges produces at most ⌈N/5000⌉
backend commits instead of one commit per mutated edge.

Sites covered:

    resolver.go::ResolveAll        (the worker-join apply loop)
    resolver.go::ResolveFile       (per-file single-threaded apply)
    resolver.go                    (override-hierarchy provenance upgrades)
    cross_pkg_guard.go             (revert weak-tier cross-package binds)
    cross_repo.go::ResolveAll      (full-graph cross-repo resolution)
    cross_repo.go::ResolveForRepo  (per-repo cross-repo resolution)
    cross_repo.go::resolveEdge     (signature change: accepts *batch)
    relative_imports.go            (Python / Dart relative import lift)
    grpc_stub_calls.go             (gRPC stub → handler binding)
    temporal_calls.go              (Temporal activity / workflow dispatch)
    external_calls.go              (external-call synthesis)
    module_attribution.go          (rewrite + DependsOnModule materialise)

No behaviour change for the in-memory Store: graph.ReindexEdges /
SetEdgeProvenanceBatch are loop wrappers around the existing per-edge
methods on *graph.Graph. The win is entirely on disk backends, where
the resolver was previously committing one transaction per mutated
edge.

Expected impact (extrapolated from the killed 22-min bolt bench run):
the resolver pass through bbolt drops from minutes to ≤1s plus the
actual page-mutation cost; sqlite similar. The bench follow-up commit
re-measures end-to-end and confirms.

823 resolver + indexer + graph + storetest tests pass.
Replaces the "build one in-memory reference graph, AddBatch into each
backend" pattern with "construct each backend separately and run the
real indexer.IndexCtx pipeline against the source repo". The
previous shape measured migration cost (one shared graph copied into
each store) and structurally couldn't expose the disk backends'
per-pass commit characteristics — every backend got the same one-Tx
AddBatch and nothing else. This shape measures what a daemon would
actually pay on a cold start through each backend: parse → resolve
→ search-index build → contracts → clones → stub resolution →
external-call synthesis.
Notable changes:
- Each backend gets its own indexer.New(store, registry, cfg,
logger), its own IndexCtx call, its own query workload sampled
from its own populated state.
- The shared "reference graph" is gone; heap measurements are no
longer contaminated by a previous backend's resident state.
- Heap reporting now includes both HeapAlloc (live bytes — honest
"what would the daemon really hold") and HeapInuse (span
footprint — what ps would show). The earlier table only had
HeapInuse and was misleading at that.
This is possible because indexer.New now takes graph.Store (commit
b091850), so the same Indexer code path runs against any backend.
It is usable in practice because the resolver's per-edge mutation
calls were batched (preceding commits), so the disk-backend indexer
pass no longer hangs for tens of minutes.
Result table re-runs land in the next commits.
…d / EdgesWithUnresolvedTarget)
The pre-Store idiom across the codebase was
    for _, e := range g.AllEdges() {
        if e.Kind == X { ... }
    }
Cheap on the in-memory graph (return existing slice, filter in Go),
catastrophic through disk backends — every call materialised the
whole table only to throw away >99% of the rows. On a 122 k-node
gortex graph the resolver alone fires 34 AllEdges/AllNodes scans per
pass; the same workload through the bolt-backed Store took 141 s,
through sqlite 503 s, almost all of it spent in those scans.
Three predicate-shaped Store methods that push the filter into the
backend:
    EdgesByKind(kind EdgeKind) iter.Seq[*Edge]
    NodesByKind(kind NodeKind) iter.Seq[*Node]
    EdgesWithUnresolvedTarget() iter.Seq[*Edge]
Backend implementations:
- Memory (*Graph): iterate the existing AllEdges/AllNodes slice
and filter inline — same algorithmic cost as the pre-existing
hand-written loop, so in-memory callers see zero regression.
- bbolt (*store_bolt.Store): new secondary buckets
    idx_edge_kind   key=kind\x00edgeKeyBytes  value=empty
    idx_edge_unres  key=edgeKeyBytes          value=empty (sparse,
                    populated only for edges whose target carries
                    the unresolved:: prefix)
plus reuse of the existing idx_node_kind for NodesByKind. Predicate
method = one prefix-scan over the relevant index bucket + decode of
only matching rows. putEdgeTx maintains both new indexes;
reindexEdgeTx / RemoveEdge / EvictFile/Repo clean them up.
- sqlite (*store_sqlite.Store): indexed SELECT against existing
(kind) and (to_id) indexes; the unresolved scan is a half-open
range query (to_id >= 'unresolved::' AND to_id < 'unresolved:;')
so SQLite uses the to_id b-tree to seek directly to the relevant
slice.
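The half-open range in the sqlite bullet works because `;` is the byte immediately after `:`, so `'unresolved:;'` is the smallest string greater than every string starting with `'unresolved::'` under byte-wise comparison. A small sketch of deriving that bound for an arbitrary prefix (the function names are hypothetical):

```go
package main

import "fmt"

// prefixUpperBound returns the smallest string that sorts after every
// string beginning with prefix, under byte-wise comparison. It reports
// false when no such bound exists (empty prefix or all bytes 0xFF).
func prefixUpperBound(prefix string) (string, bool) {
	b := []byte(prefix)
	for i := len(b) - 1; i >= 0; i-- {
		if b[i] < 0xFF {
			b[i]++
			return string(b[:i+1]), true
		}
	}
	return "", false
}

// inPrefixRange is the Go-side equivalent of
// `to_id >= lo AND to_id < hi` for the given prefix.
func inPrefixRange(s, prefix string) bool {
	hi, ok := prefixUpperBound(prefix)
	if !ok {
		return s >= prefix
	}
	return s >= prefix && s < hi
}

func main() {
	hi, _ := prefixUpperBound("unresolved::")
	fmt.Println(hi)
	fmt.Println(inPrefixRange("unresolved::pkg.Foo", "unresolved::"))
	fmt.Println(inPrefixRange("unresolved:x", "unresolved::"))
}
```

A range predicate of this shape lets an ordered index seek to the first matching key and stop at the bound, which a `LIKE 'prefix%'` filter does not always get depending on collation settings.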
iter.Seq[T] (Go 1.23+) is the iterator shape so callers use
range-over-func; implementations honour early stop when yield
returns false.
storetest.RunConformance grows 3 subtests covering happy-path
yields, empty-result cases, and early-stop semantics. All 36
conformance subtests pass across all 3 backends (108 tests total)
with -race.
Caller migration follows in the next commit so the API change and
the consumer change read separately in git history.
…methods
Replaces the per-pass `for _, e := range r.graph.AllEdges() { if cond { ... } }`
pattern across the resolver with calls to the predicate-shaped Store
methods landed in the previous commit. Disk backends now scan only
the matching rows instead of pulling the whole table back and
filtering in Go.
Sites migrated:
    resolver.go::ResolveAll                   EdgesWithUnresolvedTarget
    resolver.go::buildDirIndexes              NodesByKind(KindFile)
    resolver.go::buildDepModuleIndex          NodesByKind(KindContract)
    resolver.go::buildProvidesForIndex        EdgesByKind(EdgeProvides)
    resolver.go::buildReachabilityIndex       NodesByKind(KindFile)
                                              EdgesByKind(EdgeImports)
    resolver.go::InferImplements (Ifaces)     NodesByKind(KindInterface)
    resolver.go::InferImplements (members)    EdgesByKind(EdgeMemberOf)
    resolver.go::InferOverrides               EdgesByKind(EdgeMemberOf)
    resolver.go (name-only fallback)          NodesByKind(KindFile)
    cross_repo.go::ResolveAll                 EdgesWithUnresolvedTarget
    cross_repo.go::buildDirIndexes            NodesByKind(KindFile)
    cross_repo.go::buildDepModuleIndex        NodesByKind(KindContract)
    cross_repo.go::buildReachableReposIndex   EdgesByKind(EdgeImports)
    cross_repo.go (name-only fallback)        NodesByKind(KindFile)
    cross_pkg_guard.go (closure seed)         NodesByKind(KindFile)
                                              EdgesByKind(EdgeImports)
    relative_imports.go                       EdgesByKind(EdgeImports)
    grpc_stub_calls.go                        EdgesByKind(EdgeCalls)
    temporal_calls.go (stub resolution)       EdgesByKind(EdgeCalls)
    temporal_calls.go (register index)        EdgesByKind(EdgeCalls)
    temporal_calls.go (Java annotation)       EdgesByKind(EdgeAnnotated)
    module_attribution.go (rewrites)          EdgesByKind(EdgeImports)
    module_attribution.go (file langs)        NodesByKind(KindFile)
Expected impact (extrapolated from the 503-second sqlite resolver
pass that prompted the predicate-API design): 30+ full-table SELECTs
collapse to 30+ predicate-targeted scans whose row count is
proportional to the result set, not the table. For the cold-index
through bbolt and sqlite this is the single largest perf lever
remaining.
832 resolver / indexer / graph / storetest / store_bolt / store_sqlite
tests pass with -race. Behaviour-preserving — in-memory call sites
see the same nodes/edges in the same order they did before (the
predicate methods iterate the same backing buckets the pre-existing
filter loops walked).
Sites left on AllEdges/AllNodes: the indexer's clone detection,
search-index snapshot, contracts cache walk, and module linker —
these are genuinely "I need every node/edge" passes (TRULY_NEEDS_ALL
per the audit). The few BY_KIND_SET sites in the resolver
(external_calls.go, parentKinds walk in InferOverrides) still use
AllEdges + Go-side kind-set check — they could be addressed with a
future EdgesByKindIn variant if benchmarks demand it.
…) + sqlite deadlock fix
Two related pieces of work shipped together because they share the
sqlite store as their primary win surface.
## Batched lookup methods on Store
    GetNodesByIDs(ids []string) map[string]*Node
    FindNodesByNames(names []string) map[string][]*Node
The resolver fires ~3-10 per-edge GetNode / FindNodesByName calls
inside its worker fan-out. Across 10-30k pending edges that's
100k-300k individual queries. On the in-memory backend that's free
(map lookups); on sqlite each prepared-stmt Exec costs ~1ms through
modernc.org/sqlite's pure-Go executor, so 100k+ point lookups
translate to hundreds of seconds of wall time per resolver pass.
The batched siblings collapse those calls into one (or chunked) bulk
operation:
- memory: loop the existing per-id methods — no change in cost,
but provides the API surface.
- bbolt: one View transaction with multi-Get (nodes) or
multi-prefix-scan over idx_node_name (names). Connection
contention isn't a concern under bolt's MVCC reads.
- sqlite: chunked `SELECT … WHERE id IN (?,?,…)` /
`WHERE name IN (?,?,…)` queries (chunk size 5000 to stay well
under SQLITE_MAX_VARIABLE_NUMBER). 100k point lookups become
~20 chunked SELECTs.
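The sqlite strategy can be sketched without a database by generating the statements alone. The `query` type, the function name, and the selected columns are assumptions; the snippet only shows how N ids turn into ⌈N/chunk⌉ parameterised `IN (…)` queries.

```go
package main

import (
	"fmt"
	"strings"
)

// query is a SQL string plus its positional arguments, ready to hand to
// something like (*sql.DB).Query(q.SQL, q.Args...).
type query struct {
	SQL  string
	Args []any
}

// chunkedInQueries splits ids into groups of at most chunk and builds one
// `WHERE id IN (?,?,…)` statement per group.
func chunkedInQueries(ids []string, chunk int) []query {
	if chunk <= 0 {
		chunk = 1
	}
	var out []query
	for start := 0; start < len(ids); start += chunk {
		end := start + chunk
		if end > len(ids) {
			end = len(ids)
		}
		part := ids[start:end]
		placeholders := strings.TrimSuffix(strings.Repeat("?,", len(part)), ",")
		args := make([]any, len(part))
		for i, id := range part {
			args[i] = id
		}
		out = append(out, query{
			SQL:  "SELECT id, name FROM nodes WHERE id IN (" + placeholders + ")",
			Args: args,
		})
	}
	return out
}

func main() {
	ids := make([]string, 100000)
	for i := range ids {
		ids[i] = fmt.Sprintf("node-%d", i)
	}
	fmt.Println(len(chunkedInQueries(ids, 5000)))
}
```

Callers would merge the rows from every chunk into one result map; ids that match no row are simply absent from it, which lines up with the "missing entries" case the conformance subtests are described as covering.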
Two new storetest conformance subtests cover the new methods: empty
input, missing entries, duplicates, presence checks. 114 conformance
subtests across all 3 backends pass with -race (up from 108).
## sqlite predicate-iterator deadlock fix
While benching the predicate API (commit 2a6b74a) I tripped a
single-connection deadlock: an EdgesByKind iterator holds the lone
sqlite connection through its rows-cursor, and any callback in the
yield body that re-enters the store (e.g. GetNode to resolve a
cross-package edge) blocks forever waiting on the same connection.
Fix: materialise the SELECT result into a slice inside the iterator
function and yield from the slice, releasing the connection BEFORE
the body runs. The "predicate-shaped" win is structural (row count,
not memory), so trading streaming memory for a deadlock-free
callback is unambiguously the right tradeoff. queryEdgesSQL /
queryNodesSQL helpers added so each predicate method stays a
single-statement implementation.
The bench's resolver pass on the SQLite-backed gortex graph dropped
from 347s (v3, with the deadlock-prone streaming impl avoided by
not actually entering callbacks) to 337s — small once we measured
end-to-end, but the alternative was "hangs forever on any backend
backed by a single-conn pool." The bigger win lands in the next
commit (resolver per-pass cache) plus the MaxOpenConns bump after
that.
The resolver's worker fan-out (resolveEdge across NumCPU goroutines)
calls store.GetNode for edge endpoints and store.FindNodesByName for
resolution candidates — ~3-10 calls per pending edge × 10-30k
pending edges = 100k+ point lookups per pass. On the in-memory
backend that's effectively free; on sqlite each prepared-stmt query
is ~1ms through modernc.org/sqlite's pure-Go executor, so the
worker phase wall is per-call cost × N.
Pre-warm a per-pass node-by-id / nodes-by-name cache before the
worker fan-out. ResolveAll now:
1. Collects every e.From id and every identifierFromTarget(e.To)
name across the pending slice.
2. Calls store.GetNodesByIDs(allIDs) + store.FindNodesByNames(
allNames) — two batched queries that hit dedicated indexes on
each backend.
3. Folds the candidate nodes returned by the name lookup back into
the id cache so downstream guard code that calls GetNode on a
candidate ID hits the cache too.
4. Stashes both maps on the Resolver struct, cleared via defer on
return so outside-pass callers degrade to direct store calls.
cachedGetNode / cachedFindNodesByName are positive-only fast paths
— a cache miss falls through to the underlying store. They've
replaced direct r.graph.GetNode / r.graph.FindNodesByName calls in
the worker hot path (resolveFunctionCall's candidate scan, the
EdgeReads→EdgeReferences promotion, cross_pkg_guard's
edgeCallerFile / target lookup).
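A stripped-down sketch of the pre-warm plus fall-through pattern is below. The `countingStore`, the `Resolver` fields, and the simplified `ResolveAll` signature are assumptions made so the round-trip counts are observable; only the id cache is shown, not the name cache.

```go
package main

import "fmt"

type Node struct{ ID string }

// countingStore is a hypothetical backend that counts round-trips so the
// effect of the cache can be observed.
type countingStore struct {
	nodes      map[string]*Node
	pointCalls int
	batchCalls int
}

func (s *countingStore) GetNode(id string) *Node {
	s.pointCalls++
	return s.nodes[id]
}

func (s *countingStore) GetNodesByIDs(ids []string) map[string]*Node {
	s.batchCalls++
	out := make(map[string]*Node, len(ids))
	for _, id := range ids {
		if n, ok := s.nodes[id]; ok {
			out[id] = n
		}
	}
	return out
}

type Resolver struct {
	store     *countingStore
	nodeCache map[string]*Node // nil outside a pass
}

// cachedGetNode is positive-only: a hit returns the cached node, a miss
// (including "no pass in progress") falls through to the store.
func (r *Resolver) cachedGetNode(id string) *Node {
	if n, ok := r.nodeCache[id]; ok {
		return n
	}
	return r.store.GetNode(id)
}

// ResolveAll pre-warms the cache with one batched call, runs the per-edge
// work against it, and clears it on return.
func (r *Resolver) ResolveAll(pendingIDs []string) (found int) {
	r.nodeCache = r.store.GetNodesByIDs(pendingIDs)
	defer func() { r.nodeCache = nil }()
	for _, id := range pendingIDs {
		if r.cachedGetNode(id) != nil {
			found++
		}
	}
	return found
}

func newDemoResolver() *Resolver {
	return &Resolver{store: &countingStore{nodes: map[string]*Node{
		"a": {ID: "a"}, "b": {ID: "b"},
	}}}
}

func main() {
	r := newDemoResolver()
	found := r.ResolveAll([]string{"a", "b", "a", "missing"})
	fmt.Println(found, r.store.batchCalls, r.store.pointCalls)
}
```

Because misses are not cached, an id that is absent from the store still costs one point lookup each time it is asked for; that is the price of keeping the cache positive-only.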
Measured on the gortex-scale bench (122k nodes / 518k edges):
sqlite total: 399s → 384s (−4%)
bbolt total: 124s → 146s (parsing noise; cache wiring itself
is no-op on a backend whose direct
store calls were already µs)
The headline number is modest because the cache only covers the
worker phase. Subsequent serial post-passes inside ResolveAll
(resolveRelativeImports, attributeNonGoModuleImports) keep doing
per-edge work outside the cache. Those are a follow-up target if
sqlite needs to be pushed further; the connection-pool bump that
follows in the next commit pulled a much bigger win out of the
parallel phase that this commit now actually parallelises.
… actually parallelise
The agent-generated first cut of the SQLite store set
db.SetMaxOpenConns(1) "because SQLite is single-writer regardless"
and to dodge SQLITE_BUSY in the conformance Concurrency test. The
trade-off ate the resolver's parallel worker fan-out — every
goroutine doing GetNode / FindNodesByName / GetOutEdges queued
behind THE single connection, collapsing the worker phase to a
single CPU.
bbolt's read txns are concurrent under MVCC, so the same worker
fan-out actually parallelises and finishes its share in ~µs. SQLite
forced single-threaded execution at ms-per-call cost; the gap that
made sqlite ~3× slower than bbolt on the gortex bench was this,
not modernc.org/sqlite's per-statement overhead alone.
Fix: db.SetMaxOpenConns(runtime.NumCPU()). The DSN pragmas (WAL,
synchronous=NORMAL, busy_timeout=5000) are already on every new
connection — they're embedded in the DSN string, so the
"only-one-connection-saw-the-PRAGMA" justification the original
comment cited was already moot. WAL mode allows concurrent readers
across multiple connections by design.
Write contention is unaffected:
- writeMu (the Go-side mutex on Store) still serialises every
mutating method, so the conformance Concurrency test's 8
AddNode goroutines never collide at the SQLite level.
- SQLite's internal write lock + busy_timeout=5000 covers the
case where a write tries to land while a long-running read txn
holds the WAL.
Measured on the gortex bench (123k nodes / 514k edges):
sqlite total: 384s → 290s (-24%)
sqlite resolve: 337s → 243s (-28%)
The single biggest sqlite win on the entire branch. Conformance:
76 tests (including the 8-goroutine Concurrency test) pass under
-race. bbolt unchanged. In-memory unchanged.
Total trajectory across the predicate-API + batched-mutation +
batched-lookup-cache + this commit:
v2 baseline (per-edge tx, full-table scans): 503s
v3 (predicate API + batched mutations): 399s (-21%)
v4 (+ per-pass batched-lookup cache): 384s (-24%)
v5 (+ connection pool fix): 290s (-42%)
…-ID pre-load)
Two follow-on optimisations targeting the serial post-pass phases
inside ResolveAll. Both replace per-edge / per-candidate store lookups
with pre-loaded maps, same pattern as the per-pass cache landed in
13b2c15 but for code paths the worker cache doesn't cover.
## attributeNonGoModuleImports
The dup-check `hasDependsOnModule(fileID, moduleID)` called GetOutEdges
per pending import rewrite: ~10-30k pending rewrites × one SQL SELECT
each = tens of thousands of per-file queries on a disk-backed store.
Replace with one EdgesByKind(EdgeDependsOnModule) scan that builds
map[fileID][moduleID]struct{} upfront; the dup check becomes a
constant-time map hit. The same module-seed materialise loop batches
its presence check via GetNodesByIDs instead of per-seed GetNode.
## resolveRelativeImports
resolvePythonRelativeImport / resolveDartRelativeImport each call
GetNode on 1-2 candidate file IDs per import edge; for an import-heavy
repo that's thousands of per-candidate queries on every pass. Replace
the per-call store reads with a once-per-pass NodesByKind(KindFile)
scan that fills a set of every file-node ID; the candidate-existence
check is now a map lookup. The two resolver functions become closures
over that set for the duration of the pass and degrade to the
store-backed versions outside.
## Bench
These changes did NOT measurably shift the gortex-scale numbers
(sqlite total 290s → 292s = parsing noise; resolve 243s → 243s). The
two post-passes weren't the dominant cost on this workload; the time
is going somewhere else inside ResolveAll that I haven't yet
pinpointed. Logging them as correct-but-not-dominant optimisations;
the next round needs profiling, not speculation.
423 resolver / indexer / graph / storetest tests pass under -race.
Behaviour-preserving on every backend.
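The pre-loaded dup check from the attributeNonGoModuleImports section can be sketched as follows. The `Edge` struct and the value of `EdgeDependsOnModule` are simplified assumptions for illustration; only the map shape and the function name hasDependsOnModule come from the commit message.

```go
package main

import "fmt"

// Edge is a simplified stand-in for graph.Edge.
type Edge struct {
	From, To, Kind string
}

// EdgeDependsOnModule uses a made-up string value for the sketch.
const EdgeDependsOnModule = "depends_on_module"

// buildDependsOnIndex folds one scan of edges into
// map[fileID][moduleID]struct{} so later dup checks are map hits.
func buildDependsOnIndex(edges []Edge) map[string]map[string]struct{} {
	idx := make(map[string]map[string]struct{})
	for _, e := range edges {
		if e.Kind != EdgeDependsOnModule {
			continue
		}
		mods, ok := idx[e.From]
		if !ok {
			mods = make(map[string]struct{})
			idx[e.From] = mods
		}
		mods[e.To] = struct{}{}
	}
	return idx
}

// hasDependsOnModule is the constant-time replacement for a per-file
// GetOutEdges query. Indexing a missing file yields a nil inner map,
// which is safe to read in Go.
func hasDependsOnModule(idx map[string]map[string]struct{}, fileID, moduleID string) bool {
	_, ok := idx[fileID][moduleID]
	return ok
}

func main() {
	idx := buildDependsOnIndex([]Edge{
		{From: "a.py", To: "module::pypi:requests", Kind: EdgeDependsOnModule},
		{From: "a.py", To: "a.py::helper", Kind: "calls"},
	})
	fmt.Println(hasDependsOnModule(idx, "a.py", "module::pypi:requests"))
	fmt.Println(hasDependsOnModule(idx, "b.py", "module::pypi:requests"))
}
```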
…ph.Store
Adds a third on-disk backend for the persistence layer, alongside
bbolt (708be69) and SQLite (1e0bdaa). Cayley is a quad store with
multiple query-language frontends (Gremlin / MQL / GraphQL); we use it
specifically because it stays pure-Go, so the binary that the existing
in-memory + bbolt + sqlite stack ships in keeps its CGO-free disk
path. cayley v0.7.7; quad v1.3.0.
## Quad layout
Each Node is stored under an IRI subject `node:<id>`. Each Edge under
a composite IRI `edge:<from>|<to>|<kind>|<file>|<line>`; the composite
makes the (From, To, Kind, FilePath, Line) identity tuple deduplicate
naturally so AddEdge stays idempotent on same-line repeats while
disambiguating different-line repeats. Every Node / Edge expands into
one quad per non-zero field with predicate IRIs like `kind` / `name` /
`startLine` / `from` / `to` / `confidence` / `origin` / `meta`.
Numeric fields use `quad.Int` / `quad.Float` / `quad.Bool` so types
survive round-trip; `meta map[string]any` is gob-encoded into a
`quad.String` (bytes-safe). Two label discriminators (`kind:node`,
`kind:edge`) let a single scan partition by entity type.
## Storage + concurrency
cayley's KV-bolt backend (`cayley/graph/kv/bolt`) registered via blank
import; `Open(path)` runs `graph.InitQuadStore("bolt", path, nil)`
then `graph.NewQuadStore("bolt", path, nil)`. Mutations flow through
`qs.ApplyDeltas` with `IgnoreOpts{IgnoreDup: true, IgnoreMissing:
true}` so re-adds and stale removes never error. Batched mutations
(AddBatch, ReindexEdges, SetEdgeProvenanceBatch) chunk by 5000.
The store keeps the canonical bytes in cayley + rebuilds an in-memory
mirror on Open for hot reads; every mutation updates both layers under
the same `sync.RWMutex` write critical section so readers always see a
consistent view. The mirror lets the predicate-shaped reads
(EdgesByKind, NodesByKind, EdgesWithUnresolvedTarget, GetNodesByIDs,
FindNodesByNames) run at in-memory speed without having to translate
every Cayley path query.
## Race-detector caveat
`go test -race` trips `fatal error: checkptr: converted pointer
straddles multiple allocations` deep inside
`github.com/boltdb/bolt@v1.3.1`. cayley v0.7.7 pins the legacy boltdb,
which predates the move to `go.etcd.io/bbolt` that store_bolt uses.
Not a bug in our code; documented in the package doc on store.go.
Tests pass cleanly without -race (`go test -count=1
./internal/graph/store_cayley/...`, 38/38 subtests green) and with
race when checkptr is muted (`-gcflags=all=-d=checkptr=0`).
Conformance is identical to bbolt and SQLite: every behaviour the rest
of gortex depends on from *graph.Graph is exercised and matches.
## Nothing waived
All 37 conformance subtests pass: idempotency, line-disambiguation,
EvictFile/Repo completeness, 8-goroutine Concurrency, batched
mutations, predicate-iterator early-stop. No methods skipped, no
weakened tests.
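To illustrate why the composite edge IRI from the quad layout makes AddEdge idempotent, here is a small sketch. The `Edge` struct and the `edgeIRI` helper are hypothetical, and the sketch assumes none of the fields contain the `|` separator.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Edge is a simplified stand-in carrying only the identity tuple.
type Edge struct {
	From, To, Kind, FilePath string
	Line                     int
}

// edgeIRI builds edge:<from>|<to>|<kind>|<file>|<line>. Two edges with
// the same identity tuple map to the same subject, so writing both is
// a duplicate the quad store can ignore.
func edgeIRI(e Edge) string {
	parts := []string{e.From, e.To, e.Kind, e.FilePath, strconv.Itoa(e.Line)}
	return "edge:" + strings.Join(parts, "|")
}

func main() {
	a := Edge{From: "a", To: "b", Kind: "calls", FilePath: "f.go", Line: 10}
	b := a      // same-line repeat
	c := a      // different-line repeat
	c.Line = 11
	fmt.Println(edgeIRI(a) == edgeIRI(b)) // same subject
	fmt.Println(edgeIRI(a) == edgeIRI(c)) // distinct subject
}
```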
…h.Store
Adds a fourth on-disk backend: KuzuDB, an embedded property-graph
database with Cypher as its query language and the first
non-relational disk backend in the persistence layer. KuzuDB's columnar storage +
Cypher fit graph workloads natively in a way that bbolt's KV and
SQLite's relational shape don't try to. kuzu v0.11.3 via
`github.com/kuzudb/go-kuzu`.
## Schema
One `Node` table (PK `id`, columns mirroring graph.Node: `kind`,
`name`, `qual_name`, `file_path`, `start_line` / `end_line` INT64,
`language`, `repo_prefix`, `workspace_id`, `project_id`, `meta`)
and one `Edge` rel table (`FROM Node TO Node`, identity columns
`kind` / `file_path` / `line`, plus `confidence` DOUBLE,
`confidence_label`, `origin`, `tier`, `cross_repo` INT64, `meta`).
Two structural quirks from KuzuDB's data model dictate the
implementation:
1. KuzuDB rel tables can't carry their own primary key, so edge
dedup on the (from, to, kind, file_path, line) identity tuple
is enforced via `MERGE` rather than INSERT-or-replace.
2. The Go binding's BLOB column path has bugs (BLOB read goes
through `strlen()`, so NUL bytes in a gob-encoded payload
truncate; BLOB write coerces `[]byte` to `UINT8[]` rather than
BLOB). Workaround: gob-encode meta then base64-encode into a
STRING column. Documented inline; remove the base64 wrap when
the binding fixes its BLOB path.
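The gob-then-base64 workaround can be sketched with the standard library alone. The helper names `encodeMeta` / `decodeMeta` are assumptions for illustration; the actual store code is not shown in the commit message. The sketch relies on gob's built-in registration of basic types such as string and int for the `any` values.

```go
package main

import (
	"bytes"
	"encoding/base64"
	"encoding/gob"
	"fmt"
)

// encodeMeta gob-encodes the map, then base64-wraps it so the result
// is plain ASCII and safe to store in a STRING column (no NUL bytes).
func encodeMeta(meta map[string]any) (string, error) {
	var buf bytes.Buffer
	if err := gob.NewEncoder(&buf).Encode(meta); err != nil {
		return "", err
	}
	return base64.StdEncoding.EncodeToString(buf.Bytes()), nil
}

// decodeMeta reverses encodeMeta.
func decodeMeta(s string) (map[string]any, error) {
	raw, err := base64.StdEncoding.DecodeString(s)
	if err != nil {
		return nil, err
	}
	var meta map[string]any
	if err := gob.NewDecoder(bytes.NewReader(raw)).Decode(&meta); err != nil {
		return nil, err
	}
	return meta, nil
}

func main() {
	enc, err := encodeMeta(map[string]any{"receiver": "Store", "arity": 2})
	if err != nil {
		panic(err)
	}
	dec, err := decodeMeta(enc)
	if err != nil {
		panic(err)
	}
	fmt.Println(dec["receiver"], dec["arity"])
}
```

Values of custom struct types stored under `any` would need `gob.Register` first; the sketch sticks to basic types to avoid that.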
## Endpoint stub behaviour
KuzuDB rel tables require both endpoints to exist in the node
table — but the in-memory store happily holds edges whose endpoints
are unresolved placeholders (the resolver creates these for
`unresolved::*` targets). The KuzuDB AddEdge therefore MERGE-stubs
the endpoints with empty columns before MERGEing the rel; later
AddNode calls overwrite the stub columns in place. Faithful match
to in-memory semantics for the only conformance-test path that
exercises this (`EdgesWithUnresolvedTarget`).
## Platform / CGO
CGO required. The Go binding ships `libkuzu.dylib` / `libkuzu.so` /
`libkuzu_shared.dll` inside the module's `lib/dynamic/<platform>/`
directory and points the linker + runtime loader at them via
LDFLAGS + `-Wl,-rpath`. No system-side install needed. Validated
on macOS arm64; the Linux + Windows binaries are bundled.
## Notes on batched writes
The Go binding doesn't expose an explicit transaction API, so the
batched mutators (AddBatch, ReindexEdges, SetEdgeProvenanceBatch)
loop their per-call mutators under one `writeMu` acquisition rather
than batching into a Cypher `UNWIND $rows AS row …` statement. The
conformance suite only verifies post-batch totals, and the indexer-
scale UNWIND fast path can be layered on without changing
semantics — flagged as the natural next perf win once cold-start
benchmarks expose where wins land.
## Conformance
All 37 RunConformance subtests pass under `-race`: idempotency,
line-disambiguation, EvictFile/Repo, 8-goroutine Concurrency,
batched mutations, predicate-iterator early-stop, MetaPreserved
(round-trips through the base64-wrapped gob blob). VerifyEdge-
Identities is a documented no-op — the rel table carries one
canonical row per edge, so the in-memory store's "same pointer in
both adjacency views" invariant has nothing structural to verify
(same justification bbolt + SQLite use).
Nothing waived. Nothing skipped. go vet clean. Wider tree builds
clean.
… of graph.Store
Adds a fifth on-disk backend — DuckDB is an embedded columnar OLAP
engine with mature SQL + a query planner that uses real indexes
properly. Round-trips the same conformance suite as the four
existing backends. CGO via `github.com/marcboeker/go-duckdb/v2`
v2.4.3.
The motivation versus the SQLite backend: DuckDB's columnar storage
+ native bulk-insert (Appender) API + indexed query planner give a
different performance profile than SQLite's row-oriented engine.
Analytical queries (counts, group-bys, scan-heavy aggregations)
push down better; bulk loads stream through the Appender at speeds
SQLite's prepared-INSERT path can't match. The cross-backend bench
will tell us where this lands relative to bbolt and SQLite.
## Schema
Two tables, indexed for the query shapes the resolver hits:
nodes(id VARCHAR PK, kind, name, qual_name, file_path,
start_line INTEGER, end_line INTEGER, language,
repo_prefix, workspace_id, project_id,
absolute_file_path, meta BLOB)
+ indexes on name, kind, file_path, repo_prefix, qual_name
edges(edge_id BIGINT PK, from_id, to_id, kind,
file_path, line INTEGER, confidence DOUBLE,
confidence_label, origin, tier,
cross_repo BOOLEAN, meta BLOB)
+ edges_by_from(from_id, kind), edges_by_to(to_id, kind),
UNIQUE(from_id, to_id, kind, file_path, line)
DuckDB doesn't have AUTOINCREMENT, so edge_id is allocated by an
atomic.Int64 seeded from `SELECT MAX(edge_id)` on Open.
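A sketch of the edge_id allocation described above, assuming the seed has already been read from the database. The `edgeIDAllocator` type is hypothetical, and treating an empty table's NULL `MAX(edge_id)` as 0 is an assumption of the sketch, not something the commit states.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// edgeIDAllocator hands out monotonically increasing ids without a
// database round-trip per edge.
type edgeIDAllocator struct {
	next atomic.Int64
}

// newEdgeIDAllocator seeds the counter with the highest id already on
// disk (0 for an empty table, by assumption).
func newEdgeIDAllocator(maxExisting int64) *edgeIDAllocator {
	a := &edgeIDAllocator{}
	a.next.Store(maxExisting)
	return a
}

// Next returns a fresh id; safe for concurrent callers.
func (a *edgeIDAllocator) Next() int64 { return a.next.Add(1) }

func main() {
	a := newEdgeIDAllocator(100)
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			a.Next()
		}()
	}
	wg.Wait()
	fmt.Println("highest allocated id:", a.next.Load())
}
```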
## Bulk insert via Appender
`AddBatch` leases a raw `driver.Conn` via `db.Conn(ctx).Raw(...)`,
opens one `duckdb.NewAppenderFromConn` per table, streams rows
through `AppendRow`, and `Close()`s the appender (which auto-
flushes). DuckDB has no INSERT OR REPLACE / OR IGNORE, so the
implementation pre-deletes colliding logical keys inside a
transaction before the Appender writes — keeps the idempotency
contract intact.
This is the columnar fast path. Per-row prepared INSERT also works
(used by AddNode / AddEdge) but at indexer scale the Appender
shaves an order of magnitude off the load wall.
## Concurrency
`db.SetMaxOpenConns(runtime.NumCPU())` — DuckDB supports concurrent
readers natively, and writes serialize through the Store-level
`writeMu` so the 8-goroutine conformance Concurrency test passes
without races. ResolveMutex returns a dedicated `*sync.Mutex`.
## Prepared-statement bug worth knowing
duckdb-go-bindings v0.1.21 (vendored by go-duckdb v2.4.3) has a
prepared-statement bug where any GROUP BY / DISTINCT / aggregate
statement *prepared before rows exist* returns mangled (single-
character) string columns when later executed against populated
data. Reproduced with a minimal three-column repro.
Workaround: aggregate methods (Stats, RepoStats, RepoPrefixes,
RepoMemoryEstimate, AllRepoMemoryEstimates) run inline via
`s.db.Query(...)` instead of being pre-prepared. Point-lookup
statements (INSERT, DELETE, SELECT by id / name / kind / file /
repo) that aren't aggregates stay prepared — those work fine.
Documented inline on the Store struct.
## Conformance
All 37 RunConformance subtests pass under `-race`: idempotency,
line-disambiguation, EvictFile/Repo, 8-goroutine Concurrency,
batched mutations, predicate-iterator early-stop, MetaPreserved.
Nothing waived. go vet clean. Wider tree builds clean.
… filter
Extends the cross-backend bench harness to drive all five disk
backends through the real indexer pipeline:
  -only memory,bolt,sqlite,kuzu,cayley,duckdb   (any subset)
  --skip-kuzu / --skip-cayley / --skip-duckdb   (additive skips)
dirSize() helper sums every regular file under a backend's data
directory. kuzu and cayley both produce a directory of catalog + data
+ wal files rather than a single .db, so the reported disk size
matches what an operator would see in their data dir.
Same per-backend protocol as the existing three: fresh Open into a
t.TempDir, idx.IndexCtx through the real pipeline, sample its own
query workload from the populated state, report (load, disk, heap
alloc + inuse, p50/p95). No shared reference graph across backends;
heap is per-backend honest.
go build clean. Smoke run memory + bolt completed (exit 0). The full
6-backend run lands in the next bench-output commit alongside the
comparison + the per-backend perf findings.
The -only flag was only consulted for the three new
(kuzu/cayley/duckdb) backends. The original three (memory/bolt/sqlite)
still checked their per-backend -skip-* flag, so `-only kuzu` would
still run memory+bolt+sqlite first (8+ min on gortex). Hoisted the
want-* resolution above all six backend blocks so the flag does what
its name promises.
… path
DuckDB's Appender enforces UNIQUE on (from,to,kind,file,line) for
edges and on id for nodes. The pre-delete pass before the appender
write handles cross-batch duplicates, but the indexer's per-file
AddBatch slice can legitimately contain the same logical key
twice — e.g. a file declaring the same identifier (`buf`) in
multiple function scopes produces multiple Node entries with id
`<file>::buf`. The original implementation crashed mid-bench:
panic: duplicate key "bench/baselines/adapters.go::buf"
could not close appender: appended and not yet flushed data
has been invalidated due to error
Dedupe the input slice in-place before the Appender write —
last-write-wins, matching the per-row AddNode's `INSERT OR
REPLACE` semantics. The seen-map indexes positions in the
validated slice so we update in place when a duplicate id appears
later in the same batch.
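The in-place, last-write-wins dedup with a position-tracking seen-map could look like the sketch below. The `Node` struct and function name are illustrative assumptions; the real code operates on the validated batch slice.

```go
package main

import "fmt"

// Node is a simplified stand-in for graph.Node.
type Node struct {
	ID   string
	Name string
}

// dedupeNodesLastWins compacts nodes in place. seen maps an id to its
// position in the output, so a later duplicate overwrites the earlier
// entry instead of being appended.
func dedupeNodesLastWins(nodes []Node) []Node {
	seen := make(map[string]int, len(nodes))
	out := nodes[:0] // reuse the backing array; writes never pass the read index
	for _, n := range nodes {
		if pos, ok := seen[n.ID]; ok {
			out[pos] = n
			continue
		}
		seen[n.ID] = len(out)
		out = append(out, n)
	}
	return out
}

func main() {
	batch := []Node{
		{ID: "adapters.go::buf", Name: "first"},
		{ID: "adapters.go::other", Name: "other"},
		{ID: "adapters.go::buf", Name: "second"},
	}
	for _, n := range dedupeNodesLastWins(batch) {
		fmt.Println(n.ID, n.Name)
	}
}
```

The first-seen position is kept while the last-seen value wins, so the relative order of distinct ids in the batch is preserved.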
Conformance: 38 subtests still pass under -race.
…eProvenanceBatch
The agent-generated first cut looped per-record MERGE through the
Go binding for every batched mutator. Each Cypher Execute through
go-kuzu costs ~5ms (parse + plan + execute + CGO round-trip), and
the indexer fires ~124k nodes + ~524k edges per cold gortex pass,
so the per-call shape hung the bench in parsing at >23 minutes
with no end in sight.
Three batched mutators now drive Cypher's UNWIND construct:
AddBatch
UNWIND $rows AS row
MERGE (n:Node {id: row.id})
SET n.kind = row.kind, n.name = row.name, ...
then for edges:
UNWIND $rows AS row
MERGE (a:Node {id: row.from})
MERGE (b:Node {id: row.to})
MERGE (a)-[e:Edge {kind, file_path, line}]->(b)
SET e.confidence, e.origin, e.tier, e.cross_repo, e.meta
ReindexEdges
phase 1: UNWIND $rows AS row MATCH … DELETE e (old keys)
phase 2: standard UNWIND-driven edge insert (new keys)
SetEdgeProvenanceBatch
UNWIND $rows AS row
MATCH (a:Node {id: row.from})-[e:Edge {kind, file_path, line}]->(b:Node {id: row.to})
WHERE e.origin <> row.origin
SET e.origin = row.origin, e.tier = row.tier
RETURN row.from, row.to, ...
The RETURN gives back exactly the rows that the WHERE filter
let through to the SET; we use that to update the caller's
*Edge pointer in-place (per-call SetEdgeProvenance contract)
and to count the actual changes for the identity-revision
counter bump.
Chunk size: kuzuBatchChunkSize = 5000 — same shape as the bbolt
and SQLite backends, picked to amortise parse+plan+execute cost
without ballooning the Cypher parameter list past what the binding
likes.
Conformance: 38 subtests (one per RunConformance subtest + the
parent) still pass under -race. Parse phase on a single-backend
kuzu smoke went 23+ min hang → 9.3 min. The remaining 9-min wall
is the resolver's per-call point-lookup hot path (cachedGetNode
falling through to kuzu's per-call MATCH for misses) — a future
follow-up matching the per-pass batched-lookup cache work that
landed for SQLite.
The cold-start indexer fires ~2000 small AddBatch calls during its
parse phase (one per source file, ~30 nodes / ~100 edges each). On
backends where every AddBatch round-trips through a query parser
(Kuzu / DuckDB / Cayley) that per-call cost dominates wall time:
in the previous smokes, parsing alone took 9.3 minutes on
Kuzu+UNWIND, 4.5 minutes on DuckDB (Appender open/close churn), and
13+ minutes on Cayley (per-quad mirror sync).
This commit lands the optional-interface seam that lets each
backend expose a native bulk-load fast path without changing the
per-call AddBatch contract every other caller sees:
type BulkLoader interface {
BeginBulkLoad()
FlushBulk() error
}
Backends that don't implement BulkLoader (in-memory *Graph, bbolt,
sqlite — all already optimal at the per-call path) continue to
serve AddBatch inline. Backends that do implement it (kuzu / duckdb
/ cayley in follow-up commits) buffer rows in memory during the
bracket and commit them through the engine's native primitive
(COPY FROM, long-lived Appender, batched ApplyDeltas with deferred
mirror rebuild) at FlushBulk time.
Indexer side wires the probe + bracket in IndexCtx:
- Type-asserts idx.graph against graph.BulkLoader.
- Guard NodeCount == 0 && EdgeCount == 0 — bulk-load is only
safe on an empty store (the contract documented on the
BulkLoader interface). Incremental / re-index paths fall
through to the per-call AddBatch path uniformly.
- BeginBulkLoad before the parse worker pool starts, FlushBulk
after wg.Wait() and before the resolver passes. Reads inside
the bracket are not supported by the contract; the resolver
runs strictly after FlushBulk so it sees the committed graph.
- FlushBulk gets its own `flushing bulk load` progress stage so
the bench can attribute the cost separately from parsing.
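The probe + bracket wiring listed above can be sketched as follows. Only the BulkLoader method set and the empty-store guard come from the commit message; the `Store` interface, `bufferingStore`, and `indexAll` are hypothetical, and the parse worker pool is reduced to a sequential loop.

```go
package main

import "fmt"

// Store is a hypothetical, minimal slice of graph.Store.
type Store interface {
	NodeCount() int
	EdgeCount() int
	AddBatch(nodes []string)
}

// BulkLoader mirrors the optional interface from the commit message.
type BulkLoader interface {
	BeginBulkLoad()
	FlushBulk() error
}

// bufferingStore is a toy backend: inside the bracket AddBatch only
// buffers; FlushBulk commits everything in one step.
type bufferingStore struct {
	committed, buffer []string
	bulk              bool
	flushes           int
}

func (s *bufferingStore) NodeCount() int { return len(s.committed) }
func (s *bufferingStore) EdgeCount() int { return 0 }
func (s *bufferingStore) BeginBulkLoad() { s.bulk = true }

func (s *bufferingStore) AddBatch(nodes []string) {
	if s.bulk {
		s.buffer = append(s.buffer, nodes...)
		return
	}
	s.committed = append(s.committed, nodes...)
}

func (s *bufferingStore) FlushBulk() error {
	s.committed = append(s.committed, s.buffer...)
	s.buffer, s.bulk = nil, false
	s.flushes++
	return nil
}

// indexAll probes for BulkLoader and only brackets the parse phase
// when the store is empty; otherwise AddBatch runs inline.
func indexAll(s Store, files [][]string) error {
	bl, ok := s.(BulkLoader)
	useBulk := ok && s.NodeCount() == 0 && s.EdgeCount() == 0
	if useBulk {
		bl.BeginBulkLoad()
	}
	for _, f := range files { // stands in for the parse worker pool
		s.AddBatch(f)
	}
	if useBulk {
		if err := bl.FlushBulk(); err != nil {
			return err
		}
	}
	// Resolver passes would run here, after the flush.
	return nil
}

func main() {
	s := &bufferingStore{}
	if err := indexAll(s, [][]string{{"a", "b"}, {"c"}}); err != nil {
		panic(err)
	}
	fmt.Println("nodes:", s.NodeCount(), "flushes:", s.flushes)
}
```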
Implements graph.BulkLoader on the Kuzu backend. When the indexer
brackets its parse phase with BeginBulkLoad / FlushBulk:
AddBatch routes nodes/edges into in-memory buffers instead of
running its per-batch UNWIND-MERGE statement. The buffer lock is
held only across the slice append, so the indexer's parse workers
still fan out in parallel with minimal contention.
FlushBulk dedupes the buffers globally (last-write-wins on node
ID and on the edge identity tuple), auto-stubs edge endpoints not
present in the node buffer (the rel-table foreign-key constraint
requires both endpoints to exist; the per-call AddEdge handles
this with mergeStubNodeLocked, but COPY has no per-row hook), and
commits everything through one COPY Node + one COPY Edge —
bypassing Cypher parse + plan + MERGE cost on the hot path
entirely.
Wire format is tab-separated values, not RFC-4180 CSV. Kuzu's
COPY parser does NOT honour quoted strings containing the
delimiter — a quoted field with embedded commas is split naively.
TSV sidesteps the problem because tabs never appear in code
identifiers, qualified names, file paths, or base64-encoded meta
blobs; the sanitizeTSV helper exists purely as a safety net for a
malformed extractor output and replaces stray tabs/CR/LF with
spaces. File extension stays `.csv` because Kuzu's binder rejects
`.tsv` (`Cannot load from file type tsv`) — DELIM='\t' on the
COPY statement is what actually configures the parser.
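A possible shape for the sanitizeTSV safety net, written from the description above rather than from the actual source. The `tsvRow` helper is a hypothetical addition showing how a row would be assembled for the COPY input file.

```go
package main

import (
	"fmt"
	"strings"
)

// tsvReplacer swaps the characters that would break a tab-separated
// row (field and record delimiters) for plain spaces.
var tsvReplacer = strings.NewReplacer("\t", " ", "\r", " ", "\n", " ")

// sanitizeTSV makes one field safe to embed in a TSV row.
func sanitizeTSV(s string) string { return tsvReplacer.Replace(s) }

// tsvRow joins sanitized fields with tabs and terminates the record.
func tsvRow(fields ...string) string {
	clean := make([]string, len(fields))
	for i, f := range fields {
		clean[i] = sanitizeTSV(f)
	}
	return strings.Join(clean, "\t") + "\n"
}

func main() {
	// The embedded tab in the name is replaced, so the row still has
	// exactly three fields.
	fmt.Printf("%q\n", tsvRow("pkg/file.go::Foo", "function", "Foo\tBar"))
}
```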
Gortex-scale smoke (1978 files, 124k nodes, 524k edges):
parsing 1/1978 → 0.13s
flushing bulk load → 2.59s (parse buffer fill)
bulk flush complete → 5.12s (the COPY pass)
resolving references → 7.92s
Parse + flush total 5.12s, down from 9.3 minutes on the UNWIND
path (~110x speedup). The resolver is the new bottleneck: its
per-call point-lookup MATCHes dominate the remaining wall and are
the subject of a follow-up Cypher-side resolver delegation.
Conformance: 38 subtests still pass under -race.
Implements graph.BulkLoader on the DuckDB backend. The per-batch
AddBatch path already used DuckDB's native Appender, but the indexer's
per-file shape opened+closed ~2000 Appender pairs across the parse
phase: each open/close pays a fresh transaction begin, the pre-DELETE
pass for cross-batch idempotency, and the Appender flush. On the
previous gortex smoke that loop took 4.5 minutes of parsing alone.
When the indexer brackets its parse phase with BeginBulkLoad /
FlushBulk:
AddBatch routes nodes/edges into in-memory buffers instead of opening
an Appender per call. Buffer lock held only across the slice append.
FlushBulk dedupes the buffers globally (last-write-wins on node ID and
edge identity tuple, mirroring the within-batch dedup AddBatch already
does), then streams everything through one Appender per table,
skipping the per-batch DELETE pre-pass entirely. BulkLoad's
empty-store contract means no rows can collide; the global dedup means
the appender's UNIQUE constraint never trips from within the buffer
either.
Conformance: 38 subtests still pass under -race.
…build
Implements graph.BulkLoader on the Cayley backend. The per-record
AddBatch path was the catastrophic case in the previous bench —
parsing took >13 minutes on gortex and was killed before the stage
ever turned over. Two costs dominated:
- Per-record applyDeltas: ~10 quad inserts × 130 records × 2000
files = 2.6M ApplyDeltas calls, each opening + committing one
bolt transaction.
- Per-record mirror sync: every addNodeLocked / addEdgeLocked
updated the 11 in-memory dedup / lookup indexes (nodesByName,
nodesByQual, nodesByFile, nodesByRepo, nodesByKind, outEdges,
inEdges, edgesByKind, allEdges, unresolvedES) row-by-row.
When the indexer brackets its parse phase with BeginBulkLoad /
FlushBulk:
AddBatch routes nodes/edges into in-memory buffers — no quads,
no mirror updates, no bolt transactions. Buffer lock held only
across the slice append.
FlushBulk dedupes the buffers, builds all deltas at once
(cayleyBulkApplyChunk = 20000 quads per ApplyDeltas), runs them
through the quad store in big chunks, then calls rebuildMirror()
exactly once — turning N small-txn + N small-mirror-syncs into a
small fixed number of large-txn + one mirror-scan.
Conformance: 38 subtests still pass without -race (the boltdb/bolt
dependency tied into Cayley triggers a pre-existing checkptr false
positive under -race that is not introduced by this change).
… at end
Continues the BulkLoader work. The previous shape bracketed only the
parse phase: AddBatch buffered, FlushBulk committed before the
resolver ran, and the resolver then hammered the disk store with
~100k+ per-call point lookups. That collapsed parse from minutes to
seconds but left resolve at ~11 min on DuckDB and ~9+ min on Kuzu /
Cayley before the smokes were killed.
The fix is structural rather than per-call. When the backing Store
implements graph.BulkLoader AND the store is empty (the cold-start
contract), the entire IndexCtx pipeline runs against an in-memory
*Graph shadow. Parse fills the shadow at native AddBatch speed; the
resolver and every post-resolve sub-pass (interface inference, test
edges, clone detection, gRPC stubs, external-call synthesis) do their
reads and writes against the shadow at nanosecond latency. A single
defer at function entry, gated on the named return error, dumps the
final shadow state to the disk backend via one BulkLoader cycle.
Reads against the disk store during indexing return nothing; this is
the documented BulkLoader contract. Bench is the only consumer of the
disk store during this window and it reads only after IndexCtx
returns. Incremental and re-index paths fall through to the per-call
AddBatch path against the disk store directly because they don't
start from an empty store.
Gortex-scale results (1980+ files, ~125k nodes, ~515k edges):
Backend  | bulk-only-buffer | in-mem-shadow | speedup
---------|-----------------:|--------------:|-------:
duckdb   |             747s |        10.67s |     70x
kuzu     |        >540s (k) |         6.64s |    80x+
cayley   |        >540s (k) |       104.65s |     5x+
DuckDB and Kuzu now outright beat bbolt's 135s on the same workload.
Cayley's 100s sits almost entirely in the FlushBulk phase: Cayley's
per-quad ApplyDeltas + mirror rebuild remain the write-side floor at
this backend's wire format.
Scope caveat: the shadow holds the full graph in RAM during indexing.
Gortex / vscode / rate_checkers_detector all fit; Linux kernel and
Firefox are larger than the in-memory store's existing limits (~8.6GB
peak RSS on drivers/ alone per prior profiling) and would OOM. A
memory-budgeted spillover or a NodeCount-threshold config switch is
the obvious follow-up for those workloads.
… swap
Bolt and sqlite both implement graph.BulkLoader as marker-only (empty
BeginBulkLoad + nil-returning FlushBulk). Their AddBatch paths are
already chunked-transaction and don't need a separate bulk fast path.
What they were missing was the interface bit that lets the indexer's
in-memory shadow swap activate for them. Without the marker the swap
probe took the per-call path against the disk store and burned minutes
on per-mutator round-trips during the resolver pass.
Gortex-scale rebench (1988 files, ~125k nodes, ~515k edges):
Backend  | before BulkLoader marker | after  | speedup
---------|-------------------------:|-------:|-------:
bbolt    |                  130.47s | 25.96s |      5x
sqlite   |                  283.04s | 16.05s |     18x
Sqlite is now second-fastest disk backend behind Kuzu (5.38s) and
ahead of DuckDB (14.81s). The shadow swap replaces ~2000 per-file
AddBatch calls + ~100k+ per-call resolver lookups with one big
AddBatch at the end and an in-memory resolver pass, exactly the shape
both backends needed.
Conformance: 38 subtests still pass on each, under -race.
The shadow swap is unconditionally bounded by available RAM. The
in-memory *Graph at gortex's ~125k nodes / 515k edges sits around
600MB peak; at Linux drivers/ (~35k files) prior profiling captured
8.6GB peak RSS; at the full Linux kernel or Firefox (~60k+ source
files, ~10M+ edges) the shadow's heap dwarfs the per-call cost it was
meant to save and pushes the process toward OOM.
The threshold guard refuses the swap above shadowMaxFileCount():
default 50,000 source files (the safe ceiling on a 32 GB dev machine),
overridable via GORTEX_SHADOW_MAX_FILES. Above the threshold IndexCtx
falls through to the per-call path against the disk store directly:
slower per cold IndexCtx but bounded RAM. Below the threshold
(covering gortex / vscode / rate_checkers and every public OSS repo we
currently bench), the shadow path runs and delivers the 5-18x
cold-start speedup.
  GORTEX_SHADOW_MAX_FILES=0          # force disk-only path always
  GORTEX_SHADOW_MAX_FILES=200000     # raise ceiling for big-RAM box
  GORTEX_SHADOW_MAX_FILES=<invalid>  # fall back to default
The probe also moved from "before file walk" to "after file walk" so
the file count is available for the threshold check. The defer-based
persist hook is unchanged.
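The threshold rules above can be sketched as a small parser plus a guard. shadowMaxFileCount and GORTEX_SHADOW_MAX_FILES are named in the commit; `parseShadowMaxFiles`, `useShadow`, and the choice to treat negative values as invalid are assumptions of this sketch.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// defaultShadowMaxFiles is the 50,000-file default from the commit.
const defaultShadowMaxFiles = 50000

// parseShadowMaxFiles maps the raw env value to a ceiling: unset or
// unparsable falls back to the default, 0 disables the shadow path.
// Negative values are treated as invalid (an assumption).
func parseShadowMaxFiles(raw string) int {
	raw = strings.TrimSpace(raw)
	if raw == "" {
		return defaultShadowMaxFiles
	}
	n, err := strconv.Atoi(raw)
	if err != nil || n < 0 {
		return defaultShadowMaxFiles
	}
	return n
}

// shadowMaxFileCount reads the override from the environment.
func shadowMaxFileCount() int {
	return parseShadowMaxFiles(os.Getenv("GORTEX_SHADOW_MAX_FILES"))
}

// useShadow reports whether the in-memory shadow swap may run for a
// repo with fileCount source files.
func useShadow(fileCount, maxFiles int) bool {
	return maxFiles > 0 && fileCount <= maxFiles
}

func main() {
	limit := shadowMaxFileCount()
	fmt.Println("ceiling:", limit, "shadow for 1978 files:", useShadow(1978, limit))
}
```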
Persist per-file mtimes to the ladybug store during indexing and read them back on startup, so a daemon that completed a warmup takes the reconcile path instead of re-walking every repo. Adds the FileMtime table + Load/BulkSet mtime methods, the snapshot plumbing, and contracts hydrated from the persisted graph.
- multi-repo FTS isolation (per-repo prefix wipe)
- backend Cypher resolver default-on; unresolved-stub normalisation
  via graph.IsUnresolvedTarget/UnresolvedName across resolver/query/mcp
- tier-0 in-memory name cache for SearchSymbols
- WHERE-form PK reads (GetNode/GetOutEdges/file subgraph) to dodge the
  empty-result-under-concurrent-writers planner bug
- get_file_summary resolves file members via GetFileNodes (file_path
  accelerator) instead of the never-persisted defines/contains edges
- build ladybug unconditionally (drop the noladybug stub + build tag)
A pooled liblbug connection whose last statement errored (most often a
COPY that hit a duplicated-primary-key exception during warmup) is
left with corrupt internal transaction/mutex state. executeOrQuery
used to return it to the pool; the next Prepare on that handle
panicked with "mutex lock failed: Invalid argument", crashing the
daemon on an unrelated goroutine.
- connPool.discard closes the errored connection and opens a fresh
  replacement so the pool stays at size; executeOrQuery now discards
  (never returns) a connection whose op failed.
- global panic firewall in wrapToolHandler: any tool handler panic is
  converted to a tool error instead of unwinding past the mcp-go loop
  and taking down the daemon and every MCP session.
The GetFileNodes-based file subgraph pulls in every node anchored to the file — including locals, params, closures, generic params, and builtins. get_file_summary's contract is "symbols a file defines", so broaden the post-fetch strip (stripNonDefinitionNodes) to drop those body-internal kinds alongside the file node and imports. Restores the top-level-definition view the old defines-edge query produced by construction.
Warm restarts (reopened, already-populated ladybug store) crashed in several distinct liblbug CGo faults and replayed the full cold-warmup cost on every start. Root-caused and fixed each: - Bulk COPY into an index-bearing table errored mid-COPY, poisoned the pooled connection, and crashed in lbug_connection_destroy. Drop the FTS / vector index before the DELETE+COPY in BulkUpsertSymbolFTS and BulkUpsertEmbeddings; the Build* paths recreate it afterward. - A re-track bulk-COPY'd over already-persisted node rows (duplicate PK SIGSEGV): the shadow-swap firstIndex sentinel is per-Indexer, so it is true on every restart. Evict the repo before the shadow COPY when the store already holds its rows. - EvictRepo only deleted nodes by the repo_prefix column, but edge-endpoint stubs in a repo's namespace (gortex/unresolved::X) are written by mergeStubNodeLocked with an empty repo_prefix. The evict missed them, so a re-track's INSERT-only COPY collided on the leftover stub, failed, and — the repo's real rows already evicted — dropped the whole repo from the graph. Also evict by id-prefix (<prefix>/). - The first per-edge write to a reopened store hangs forever in lbug_connection_prepare. Route repos that changed during downtime through the shadow/bulk re-track path (HasChangesSinceMtimes) instead of per-edge IncrementalReindex; gated to disk-backed stores so the in-memory backend keeps in-place eviction of offline-deleted files. - Reads racing a COPY faulted: writeMu is now an RWMutex (reads RLock, writes exclusive Lock), so no read runs during a write. Speed: skip the global resolution passes (RunDeferredPassesAll / RunGlobalResolve / graph-wide derivations) and per-repo search-index rebuilds when no file changed — the persisted graph already carries the resolved/derived edges, native FTS, and native HNSW vectors. No-change warm restart drops from 30-500s (+ crash) to ~6s. 
Also fix a FileMtime primary-key collision: file_id was the bare relative path, so repos sharing paths (src/parser.c, grammar.js across tree-sitter grammars) collided on MERGE and all but the last writer loaded zero mtimes, full-re-indexing (and crash-looping) every restart. Prefix file_id with the repo prefix; strip on load.
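The key-scoping fix can be illustrated with a minimal sketch. The helper names `fileID` and `relPathFromID` and the `/` separator are assumptions made for this example, not the identifiers used in the store.

```go
package main

import (
	"fmt"
	"strings"
)

// fileID builds a FileMtime key scoped to one repo (hypothetical helper).
// The "/" separator is an assumption for this sketch.
func fileID(repoPrefix, relPath string) string {
	return repoPrefix + "/" + relPath
}

// relPathFromID strips the repo prefix on load. The second result is false
// when the key belongs to a different repo.
func relPathFromID(repoPrefix, id string) (string, bool) {
	return strings.CutPrefix(id, repoPrefix+"/")
}

func main() {
	a := fileID("tree-sitter-c", "src/parser.c")
	b := fileID("tree-sitter-go", "src/parser.c")
	fmt.Println(a != b) // the two repos no longer share a primary key

	rel, ok := relPathFromID("tree-sitter-c", a)
	fmt.Println(rel, ok)
}
```

With the bare relative path as the key, both repos above would have written the same row, which is the collision the commit describes.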
attributeGoExternalCalls built the KindModule id via
StubID(repo, StubKindModule, "go", importPath), which joins parts with
"::" and emitted module::go::<path>. The convention (and every
consumer — tools_analyze_external_calls + the attribution tests) is the
single-colon module::go:<path>, matching module::npm:<pkg>. Pass the
ecosystem+path as one segment ("go:"+importPath). Fixes 3 failing
TestAttributeGoExternalCalls tests (pre-existing since the per-repo
stub-prefix migration).
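A small sketch of the separator mismatch, assuming only what the message states (parts are joined with "::"). `joinStubID` is a hypothetical stand-in for StubID that omits the repo argument, and the import path is a placeholder.

```go
package main

import (
	"fmt"
	"strings"
)

// joinStubID is a hypothetical stand-in for StubID: it joins its parts
// with "::" and ignores the per-repo prefix for brevity.
func joinStubID(parts ...string) string {
	return strings.Join(parts, "::")
}

func main() {
	importPath := "example.com/pkg" // placeholder import path

	// Ecosystem and path as separate parts: double colon between them.
	fmt.Println(joinStubID("module", "go", importPath)) // module::go::example.com/pkg

	// Ecosystem and path as one segment: the single-colon convention.
	fmt.Println(joinStubID("module", "go:"+importPath)) // module::go:example.com/pkg
}
```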
find_usages/get_callers missed EVERY method caller (s.Foo(), the dominant Go call shape). Parsers emit such calls as unresolved::*.<method> (golang.go:646); upgradeUnresolvedStubs leaves stub.name = "*.<method>" so the name-equality backend rules never match, and the Go-side resolver's EdgesWithUnresolvedTarget scan (literal 'unresolved::' prefix) never sees the repo-prefixed <repo>::unresolved::*.<method> form — so in multi-repo mode method callers were invisible. Add backend rule ResolveMethodCalls (in the ResolveAllBulk chain): bind a *.<method> stub to a concrete method node when EXACTLY ONE method in the caller's repo carries that name (segment after the last '.' of the qualified <Recv>.<method> Name). Uniqueness guard = no false edges; ambiguous names (String/Close/Get) stay unresolved for a future receiver-type-aware pass (edges carry a receiver_type meta hint). Validated against real Kuzu: unique binds, ambiguous stays, GetInEdges surfaces the caller.
Method nodes store the BARE method name in the `name` column
("querySelect"; receiver lives in meta.receiver / enclosing) — NOT the
qualified "Store.querySelect" form search_symbols displays. The first
cut matched `name ENDS WITH concat('.', mname)`, which a bare name
never satisfies (no leading dot) → 0 matches at scale (and the unit
test passed only because its fixture used qualified names, baking in
the wrong assumption).
Match `target.name = mname` (exact, indexed) after stripping `*.`.
Live-verified against the real store: resolved 15937 method-call edges,
Store.querySelect callers 4 -> 99, no false edges (ambiguous names like
Close stay unresolved). Test fixture corrected to bare names.
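The matching rule lives in a backend query, but its logic can be sketched in plain Go. This is an illustration under assumptions: `resolveMethodStub` and the `methodsByName` index are invented for the example and stand in for the per-repo lookup on the bare `name` column.

```go
package main

import (
	"fmt"
	"strings"
)

// resolveMethodStub binds a "*.<method>" stub name to a method node id only
// when exactly one method in the caller's repo carries that bare name.
// methodsByName is a hypothetical index: bare method name -> node ids.
func resolveMethodStub(stubName string, methodsByName map[string][]string) (string, bool) {
	mname, ok := strings.CutPrefix(stubName, "*.")
	if !ok {
		return "", false // not a method-call stub
	}
	ids := methodsByName[mname]
	if len(ids) != 1 {
		return "", false // unknown or ambiguous: leave unresolved
	}
	return ids[0], true
}

func main() {
	index := map[string][]string{
		"querySelect": {"store.go::Store.querySelect"},
		"Close":       {"store.go::Store.Close", "pool.go::Pool.Close"},
	}
	fmt.Println(resolveMethodStub("*.querySelect", index))
	fmt.Println(resolveMethodStub("*.Close", index))
}
```

Matching on the bare name is what makes the lookup an exact comparison; a suffix match on a qualified name cannot succeed when the stored name has no receiver part.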
The loop did `if err != nil { return total, err }` — directly
contradicting its own docstring ("non-fatal... continues so a buggy
rule can't block the others"). One rule erroring on a large graph thus
silently skipped every rule after it (e.g. ResolveMethodCalls,
ResolveExternalCallStubs). Now it runs every rule and returns a
combined, rule-named error. The Store has no logger, so the failing
rule names ride on the returned error for the caller to surface
(the resolver.go call site still discards it — a separate latent trap
worth fixing: `_ = n` should log).
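A minimal sketch of the corrected loop shape, assuming a simplified rule type; `rule` and `runAll` are illustrative names, not the store's own. It uses the standard library's `errors.Join` (Go 1.20+) to combine the rule-named errors.

```go
package main

import (
	"errors"
	"fmt"
)

// rule is a simplified stand-in for one resolver rule.
type rule struct {
	name string
	run  func() (int, error)
}

var errBoom = errors.New("boom")

// runAll runs every rule even if an earlier one fails, and returns the
// summed count plus a combined error that names each failing rule.
func runAll(rules []rule) (int, error) {
	total := 0
	var errs []error
	for _, r := range rules {
		n, err := r.run()
		total += n
		if err != nil {
			errs = append(errs, fmt.Errorf("rule %s: %w", r.name, err))
		}
	}
	return total, errors.Join(errs...) // nil when no rule failed
}

func main() {
	total, err := runAll([]rule{
		{"a", func() (int, error) { return 1, nil }},
		{"b", func() (int, error) { return 0, errBoom }},
		{"c", func() (int, error) { return 2, nil }},
	})
	fmt.Println(total)
	fmt.Println(err)
}
```

A caller that discards the returned error loses the rule names, which is the call-site concern the message raises.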
…disk backends

reach.compute walked incoming edges one node at a time (GetInEdges + GetNode per node). On disk backends that is one Cypher query + cgo crossing per reachable node, turning a single AnalyzeImpact live walk into a multi-minute / timeout call. Batch each BFS level through GetInEdgesByNodeIDs + GetNodesByIDs so it costs one round-trip per depth instead of O(reachable-nodes). Output is unchanged (tiers still sorted by id).

reach.Lookup also cached its result by mutating Node.Meta in place, which only persists on the in-memory backend (pointer identity); on disk backends GetNode returns a per-call reconstruction, so the cache was discarded after every query and recomputed forever. Round-trip the stamped node back through the store (AddNode in Lookup, batched AddBatch in BuildIndex), matching the releases/churn enrichers.

The fast-path perf gate asserted a 1.3x speedup over the live walk; batching made the live walk fast too, so on the in-memory backend the gap collapses to ~1.0x (the precompute win now lands on disk backends). Updated the gate to keep the sub-ms absolute guarantee plus a fast-path-regression guard instead of the obsolete relative premise.
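The level-batched walk can be sketched against a toy store. Everything here is an assumption for illustration: `memStore`, `inEdgesByNodeIDs` (standing in for a batched call such as GetInEdgesByNodeIDs, whose real signature is not shown in this message) and `reachByDepth` are invented names, and the call counter only exists to make the round-trip count visible.

```go
package main

import (
	"fmt"
	"sort"
)

type edge struct{ From, To string }

// memStore is a toy store that counts batched fetches.
type memStore struct {
	in    map[string][]edge // incoming edges keyed by target id
	calls int
}

// inEdgesByNodeIDs returns incoming edges for a whole frontier in one call.
func (m *memStore) inEdgesByNodeIDs(ids []string) map[string][]edge {
	m.calls++
	out := make(map[string][]edge, len(ids))
	for _, id := range ids {
		out[id] = m.in[id]
	}
	return out
}

// reachByDepth walks incoming edges breadth-first with one batched fetch per
// level, returning the newly reached ids per depth, sorted by id.
func reachByDepth(s *memStore, root string, maxDepth int) [][]string {
	seen := map[string]bool{root: true}
	frontier := []string{root}
	var tiers [][]string
	for d := 0; d < maxDepth && len(frontier) > 0; d++ {
		byNode := s.inEdgesByNodeIDs(frontier)
		var next []string
		for _, id := range frontier {
			for _, e := range byNode[id] {
				if !seen[e.From] {
					seen[e.From] = true
					next = append(next, e.From)
				}
			}
		}
		sort.Strings(next)
		if len(next) > 0 {
			tiers = append(tiers, next)
		}
		frontier = next
	}
	return tiers
}

func main() {
	s := &memStore{in: map[string][]edge{
		"c":  {{"b1", "c"}, {"b2", "c"}},
		"b1": {{"a", "b1"}},
		"b2": {{"a", "b2"}},
	}}
	fmt.Println(reachByDepth(s, "c", 5), s.calls)
}
```

The number of fetches tracks the depth of the walk rather than the number of reachable nodes, which is the cost change the message describes.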
Both enrichers stamped node Meta in place — coverage_pct/coverage in coverage.EnrichGraph, last_authored in blame.EnrichGraph — but never wrote the symbol node back through the store (blame wrote back only the person KindTeam node, not the blamed symbol). On the in-memory backend that persists via pointer identity; on disk backends the stamp is discarded the moment AllNodes' slice goes out of scope, so analyze:coverage_gaps / ownership / stale_code and health_score's coverage + recency axes were silently empty even after a successful `gortex enrich coverage|blame`. Collect the stamped nodes and round-trip them via AddBatch, matching releases/churn which already do this. Verified on the ladybug backend: blame now persists last_authored on 597/597 nodes (was 0).
The semantic providers (goanalysis / scip / lsp) stamped semantic_type and return_type via EnrichNodeMeta, and ResolveTemporalCalls stamped temporal_role / temporal_name — all in place, with no write-back. On the in-memory backend that persists via pointer identity; on disk backends (Ladybug) the node is a per-call GetNode / AllNodes reconstruction, so the stamps were silently discarded, leaving type-aware features and temporal role queries empty on the default backend. These passes run at warmup / via RunGlobalGraphPasses, after the bulk-load buffer is flushed, so the in-place mutation is not captured by the bulk COPY either. Collect the stamped nodes per provider and AddBatch them; stampTemporalRole now takes the store and re-upserts each node. Same write-back idiom as reach / coverage / blame / releases / churn. Closes the last instances of the in-place-Meta-mutation bug class found by a backend-parity sweep.
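The in-place-Meta bug class and the write-back idiom can be shown with a toy backend that, like a disk store, reconstructs nodes on every read. `copyStore`, `allNodes`, `addBatch` and `enrich` are hypothetical names for this sketch; only the collect-then-AddBatch pattern comes from the messages above.

```go
package main

import "fmt"

type node struct {
	ID   string
	Meta map[string]string
}

// copyStore mimics a disk backend: every read returns a fresh reconstruction,
// so mutating a returned node does not change what is stored.
type copyStore struct{ rows map[string]map[string]string }

func cloneMeta(m map[string]string) map[string]string {
	out := make(map[string]string, len(m))
	for k, v := range m {
		out[k] = v
	}
	return out
}

func (s *copyStore) allNodes() []*node {
	out := make([]*node, 0, len(s.rows))
	for id, meta := range s.rows {
		out = append(out, &node{ID: id, Meta: cloneMeta(meta)})
	}
	return out
}

// addBatch upserts the given nodes, standing in for the store's batch write.
func (s *copyStore) addBatch(nodes []*node) {
	for _, n := range nodes {
		s.rows[n.ID] = cloneMeta(n.Meta)
	}
}

// enrich stamps a Meta key on every node, then round-trips the stamped nodes
// through the store so the stamp survives on copy-on-read backends.
func enrich(s *copyStore, key, value string) int {
	var stamped []*node
	for _, n := range s.allNodes() {
		n.Meta[key] = value
		stamped = append(stamped, n)
	}
	s.addBatch(stamped)
	return len(stamped)
}

func main() {
	s := &copyStore{rows: map[string]map[string]string{"f": {}}}

	s.allNodes()[0].Meta["last_authored"] = "2024-01-01" // in place only
	fmt.Println(len(s.rows["f"]))                        // stamp was discarded

	enrich(s, "last_authored", "2024-01-01")
	fmt.Println(s.rows["f"]["last_authored"]) // stamp persisted
}
```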
golangci-lint (v2.11.4) was red on 9 issues:
- enrich_churn.go: SA9003 empty branch. Dropped the no-op os.Getwd() guard (and the now-unused os import); the comment it carried moves to the return.
- githooks/install.go: QF1012. fmt.Fprintf(&out, …) over WriteString(fmt.Sprintf(…)).
- store_ladybug/file_index.go: removed unused remove() and reset() (removeFile/removeFiles remain, they are the live eviction path).
- daemon.go / daemon_snapshot.go: removed the unused metadata-snapshot cluster: startPeriodicMetadataSnapshots, saveSnapshotMetadata, saveSnapshotMetadataTo, loadSnapshotMetadata, loadSnapshotMetadataFrom. It was a self-contained, never-called path superseded by the live warm-restart durability (graph -> store.lbug + FileMtimes -> FileMtime sidecar + reconcile janitor).

`make lint` now reports 0 issues; go build ./... and the touched packages' tests pass.
…dows

liblbug native libs are no longer committed. scripts/fetch-lbug.sh fetches them (pinned LBUG_VERSION=0.17.0) for make / CI / release:
- linux + darwin: STATIC (liblbug.a linked in -> self-contained binary; libstdc++ forced static via -Wl,-Bstatic so the binary carries no runtime libstdc++.so dependency).
- windows: DYNAMIC. lbug's windows build is MSVC and can't be static-linked from mingw; the .exe links lbug_shared.dll directly (-l:lbug_shared.dll) and ships the DLL + mingw and VC++ runtime alongside.

cgo_shared.go now points at lib/static/<os>-<arch>/ (unix) and lib/dynamic/windows/ (windows). The committed darwin dylib and the old download_lbug.sh are removed; .gitignore ignores the fetched lib tree.

CI: every job that builds cmd/gortex or runs go test ./... fetches liblbug first (ci.yml test/build-windows/build-onnx, init-smoke), so the link is validated natively on all three OSes.

Release: .goreleaser.yml builds the unix targets only (static); a new native-windows job in release.yml builds the dynamic .exe, bundles the runtime DLLs (hard-failing if any is missing), zips, cosign-signs and appends to the release. Scoop manifest is a follow-up (windows is no longer a goreleaser artifact).

Validated on darwin: static build is self-contained (no liblbug runtime dep) and the store_ladybug suite passes against the static lib. Linux and windows links are validated by CI on their native runners.
The windows release is now a zip containing gortex.exe + lbug_shared.dll + the mingw and VC++ runtime DLLs (gortex links liblbug dynamically on windows). install.ps1 moved only gortex.exe into the install dir, so the installed binary couldn't start (missing DLLs). It now installs the whole archive — exe + DLLs together — since windows resolves DLLs from the executable's own directory. The windows zip is built by the separate native-windows release job, so it isn't in goreleaser's checksums.txt and install.ps1 was silently skipping SHA-256 verification on windows. The windows job now appends the zip's sha256 to the release checksums.txt, restoring verification. install.sh (unix) is unchanged — static linking keeps the tar.gz a single self-contained binary.
Untrack the throwaway benchmark drivers and the lbug probe command.
The files stay on local disk (git rm --cached) and are now gitignored
so they neither nag in status nor get re-added. None were imported or
built by anything tracked.
Removed: bench/{all-tools-bench,daemon-bench,edge-diff,
ladybug-bundle-probe,multi-repo-bench,node-diff,store-bench,
unresolved-audit}, bench/run-linux{,-rest}.sh, cmd/lbug-probe.
237cce3 to
09af007
Compare
…tic builds CI (linux) failed: TestLadybugStoreConformance/SymbolBundleSearcher -> libfts.lbug_extension: undefined symbol _ZTIN4lbug7catalog12IndexAuxInfoE. liblbug loads its FTS (and other) extensions via dlopen at runtime; those extensions resolve liblbug's C++ symbols FROM THE HOST PROCESS. With a shared liblbug those symbols are globally visible, but static-linked they aren't in the binary's dynamic symbol table, so the extension can't find them. Add -rdynamic to the unix (static) cgo LDFLAGS — the portable driver flag (clang -> -export_dynamic, gcc -> --export-dynamic), on cgo's allowlist — to export them. Windows is dynamic, so unaffected. Verified on darwin: builds and the FTS conformance test passes. Linux is validated by CI.
store.go had grown to 2346 lines mixing lifecycle, writes, reads, stats, row decoding, query plumbing, the meta codec, and the bulk loader. Split it into same-package files along those seams (zero behavior change: pure decl moves, verified by the full test suite):

  store.go        lifecycle/core (Store, Open, Close)   245
  store_meta.go   encode/decodeMeta
  store_write.go  Add/upsert/Reindex/Evict/provenance
  store_read.go   point + predicate + batched reads
  store_stats.go  counts, Stats, memory estimates
  store_rows.go   row<->struct decoders + projection cols
  store_query.go  runWriteLocked/querySelect/executeOrQuery
  store_bulk.go   BulkLoader (BeginBulkLoad/FlushBulk/COPY/TSV)

ResolveUniqueNames moves to backend_resolver.go beside its kin. Interspersed consts (kuzuBatchChunkSize, perNodeByteEstimate, node/edgeReturnCols) and the BulkLoader/BackendResolver interface assertions travel with their consumers.
Adds a SchemaMeta(k,v) table and a version-gated migration mechanism so schema changes can ship without blowing away the warm cache. Open reads schema_version on the raw setup conn (before the pool exists) and applies ordered steps above the stored version: additive ALTERs (ALTER TABLE ... ADD IF NOT EXISTS ..., empirically confirmed against liblbug v0.13.1) preserve the cache; a step that ALTER cannot express (Meta-payload reshape, table restructure) sets a rebuild flag surfaced via NeedsRebuild() so the caller re-indexes. Forward-only — no down migrations; you never roll an embedded derived cache back, you rebuild. Deliberately NOT a golang-migrate/Flyway framework: the graph tables are a re-buildable cache, so this is the embedded-store user_version + switch pattern (~5 small funcs, no deps). The ladder is empty (currentSchema Version=1, the baseline); a pre-versioning DB is detected and stamped v1. Wiring NeedsRebuild() into the daemon warmup lands with the first rebuild-requiring step.
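The version-gated ladder can be sketched as follows. This is an assumption-laden illustration of the "user_version + switch" idea: `step`, `migrate`, and the field names are invented, and the real steps would issue ALTER statements against the store rather than call closures.

```go
package main

import "fmt"

// step is one rung of a hypothetical forward-only migration ladder.
type step struct {
	version int          // schema version reached once this step is applied
	rebuild bool         // true when ALTER cannot express the change
	apply   func() error // additive change; nil for rebuild-only steps
}

// migrate applies every step above the stored version, in order, and reports
// the new version plus whether the caller must re-index.
func migrate(stored int, ladder []step) (int, bool, error) {
	version, needsRebuild := stored, false
	for _, st := range ladder { // ladder is assumed sorted ascending
		if st.version <= stored {
			continue
		}
		if st.apply != nil {
			if err := st.apply(); err != nil {
				return version, needsRebuild, fmt.Errorf("schema step %d: %w", st.version, err)
			}
		}
		if st.rebuild {
			needsRebuild = true
		}
		version = st.version
	}
	return version, needsRebuild, nil
}

func main() {
	ladder := []step{
		{version: 2, apply: func() error { return nil }}, // additive column
		{version: 3, rebuild: true},                      // payload reshape
	}
	fmt.Println(migrate(1, ladder))
	fmt.Println(migrate(1, nil)) // empty ladder: nothing to do
}
```

An empty ladder leaves the version untouched and never requests a rebuild, matching the baseline state described above.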
…rExpander

Engine.bfs issued one edge fetch plus a GetNode per neighbour (twice when workspace-scoped) against the read-through ladybug store, and the edge fetch carried no LIMIT, so a high-degree hub dragged its entire adjacency across the cgo boundary. A graded smart_context fanning this over hub symbols hung for minutes and grew the heap into the tens of GB while holding the store, freezing concurrent reads (and daemon status).

Add an optional graph.FrontierExpander capability implemented by the ladybug store: one Cypher per BFS level returns the frontier's edges of the requested kinds plus the neighbour node columns, meta-free, with a server-side LIMIT (frontierRowCap) and unresolved/external targets filtered in-query. Rewrite bfs to use it for directed walks (the in-memory backend and bidirectional/overlay walks keep the per-node path), cap allEdges by the node limit, drop the duplicate per-neighbour GetNode, and re-hydrate full-detail neighbours in one batched GetNodesByIDs.

A 2000-fan-in hub now returns GetCallers as 64 nodes / 63 edges in ~16ms; a live multi-repo graded smart_context that previously hung at ~40 GB returns in seconds at a flat ~4.8 GB footprint. Covered by frontier_test.go (Cypher correctness) and frontier_scale_test.go (bounding).

Also fix two pre-existing errcheck issues (unchecked Store.Close) in migrate_test.go.
…ine can't split the COPY row
writeSymbolVecTSV wrote it.NodeID raw into the tab-delimited file that 'COPY SymbolVec ... DELIM=\t' reads. A node id carrying a raw tab or newline — e.g. a ws:: WebSocket-contract node (fmt.Sprintf("ws::%s", event) over a raw regex submatch) or a string-literal-derived node — split the physical row, so the continuation line had a single field and the COPY aborted the whole batch with "expected 2 values per row, but got 1". The vector index for that batch was silently lost.
Route the id through sanitizeTSV (tab/CR/LF -> space), the same canonicalisation writeNodesTSV and copyBulkLocked already apply to the Node primary key, so SymbolVec.id stays byte-equal to the persisted Node.id and the SimilarTo join still matches. A lossless escape would be wrong here: it would round-trip the raw newline back into SymbolVec.id, breaking the join against the sanitized Node id. The Node/Edge bulk writers already sanitize every field; the vector writer was the lone gap.
vector_escape_test.go round-trips a tab+newline id through BulkUpsertEmbeddings -> BuildVectorIndex -> SimilarTo: it fails pre-fix with the COPY exception and passes after, retrievable under the sanitized id.
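The canonicalisation is simple enough to sketch. `sanitizeField` below is a stand-in written for this example that follows the mapping stated above (tab/CR/LF -> space); it is not the store's sanitizeTSV itself, and the vector literal is a placeholder.

```go
package main

import (
	"fmt"
	"strings"
)

// fieldSanitizer replaces the three characters that can split a TSV row.
var fieldSanitizer = strings.NewReplacer("\t", " ", "\r", " ", "\n", " ")

// sanitizeField makes a value safe to write as one tab-delimited field.
// The mapping is lossy on purpose: the result must equal the already
// sanitized primary key stored elsewhere, not round-trip the raw id.
func sanitizeField(s string) string {
	return fieldSanitizer.Replace(s)
}

func main() {
	id := "ws::chat\tmessage\n" // id carrying a raw tab and newline
	row := sanitizeField(id) + "\t" + "[0.1,0.2]"

	fmt.Println(len(strings.Split(row, "\t"))) // fields in the row
	fmt.Println(strings.Count(row, "\n"))      // embedded line breaks
}
```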
… can't hit the non-empty-PK rejection

BulkUpsertEmbeddings cleared the table with 'MATCH (v:SymbolVec) DELETE v' then COPYed back. Kuzu COPY into a node table is only legal into an empty table or one that already carries a materialized PK hash index; DELETE empties rows logically but leaves the table non-empty for COPY, and whether the PK hash index is present at COPY time depends on uncontrolled auto-checkpoint timing. So the 2nd+ bulk upsert failed non-deterministically with 'COPY into a non-empty primary-key node table without a hash index is not supported'. It fires in production on any reindex / warm-restart reconcile that re-enters buildSearchIndex, not just tests. SymbolVec is uniquely exposed: it is the only PK table created lazily right before its first COPY (absent from the static schema DDL), so its PK index isn't checkpointed by warmup the way Node/Edge/SymbolFTS are.

Drop the vector index first (DROP TABLE has no cascade and is rejected while the HNSW index references the table), then DROP TABLE IF EXISTS, reset s.vec.dim to 0 so ensureSymbolVecSchemaLocked recreates instead of short-circuiting on cur==dim, recreate the table, and COPY into the fresh empty table. An empty table is unconditionally a valid COPY target, so the racy state class is removed. Pool-safe: each statement borrows its own pooled connection, serialized by the writeMu write lock held across the call. Also drop the index before DROP TABLE in ensureSymbolVecSchemaLocked's dim-change branch (same latent index-reference hazard).

vector_recopy_test.go loops the wipe-and-rewrite (bulk -> BuildVectorIndex -> bulk -> ...) in one store. Pre-fix the full -tags ladybug vector suite at -count=8 produced 16 failures; post-fix it is 48/48 (and 36/36 under -race). These vector tests are //go:build ladybug and not in the default 'make test' gate, which is why the flake went unnoticed.
Note: BulkUpsertSymbolFTS shares the same DELETE-then-COPY hazard but its per-repo clear in multi-repo mode means DROP TABLE is unsafe there (would wipe sibling repos); that path needs a separate remedy and is left for a follow-up.
4d42c2b to
28e65a9
Compare
Wires store_ladybug.NeedsRebuild() into the daemon warm-restart loop. When a schema migration crosses a rung ALTER cannot satisfy (a Meta-payload reshape), the on-disk rows are in the old shape and an incremental reconcile would trust stale data. The warmup loop now drops prior FileMtimes for such a backend so every repo takes the full TrackRepoCtx path (and is marked changed, so the global resolve/derivation passes re-run too) — mirroring the existing snapshotPartial override. Uses an optional-interface check (storeNeedsRebuild), so non-implementing backends (in-memory) are unaffected; a compile-time assertion in backend_ladybug.go keeps the concrete store and the check in sync. Strict no-op today: the ladder is empty, so NeedsRebuild() is always false. A note in migrate.go flags the crash-mid-rebuild/version-stamp consideration for whoever ships the first rebuild migration.
… a non-empty per-repo COPY can't be rejected
BulkUpsertSymbolFTS cleared the corpus then COPYed it back. In multi-repo mode the clear is per-repo (MATCH (f) WHERE f.id STARTS WITH $p DELETE f) and intentionally keeps sibling repos' rows, so SymbolFTS is non-empty by design. Kuzu COPY into a node table is only legal when the table is empty or already carries a materialized PK hash index, whose presence depends on auto-checkpoint timing, so the COPY failed non-deterministically with 'COPY into a non-empty primary-key node table without a hash index is not supported'. This is the same class as the SymbolVec re-COPY bug and fires on multi-repo reindex / warm-restart reconcile, but DROP TABLE + recreate (the SymbolVec remedy) is unsafe here — it would wipe the sibling repos.
Replace the COPY with a single 'LOAD FROM <csv> (header=false, delim=tab) MERGE (f:SymbolFTS {id: column0}) SET f.tokens = column1'. LOAD FROM scans the file as a row source and MERGEs straight into SymbolFTS — a DML write with no empty-table precondition — in one statement, no staging table. Measured on a 20k-row corpus (liblbug 0.17.0): direct COPY into empty 74ms; staging COPY-into-temp + MERGE 193ms; LOAD FROM + MERGE 91ms. So it is ~2x faster than staging and within ~23% of a raw COPY while removing the rejection entirely. (CHECKPOINT before COPY was tried and made it deterministically worse, 8/8 fail.)
fts_recopy_test.go drives the per-repo non-empty re-bulk repeatedly (pre-fix the full -tags ladybug run at -count=4 failed 3/4; deterministic after). fts_timing_test.go is the 3-way COPY/staging/LOAD-FROM perf comparison. Both are //go:build ladybug and excluded from the default 'make test' gate.
…File doesn't race Rule.matcher was lazily compiled and cached in matchPattern with no synchronisation. applyCoverageDomains matches files across goroutines against one shared []Rule (MatchFile takes &rules[i]), so concurrent first calls raced on r.matcher and on the half-published GitIgnore — go test -race flagged it at parser.go:35/36. Parse now precompiles rule.matcher in its single goroutine; matchPattern is read-only (returns the cached matcher, or a throwaway compile for a Rule built outside Parse) and never writes the field, so the concurrent MatchFile hot path only reads. No lock is added — Rule is a value type copied by append, which would trip copylocks. Cost is negligible: a CODEOWNERS file is small and compiled once per file, not per source file. parser_race_test.go drives 64 goroutines x MatchFile over a shared rule list; pre-fix it tripped the race detector, clean after.
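The precompile-then-read-only pattern can be sketched as below. All names are hypothetical, and the standard library's `path.Match` stands in for the gitignore-style matcher the real parser compiles.

```go
package main

import (
	"fmt"
	"path"
	"sync"
)

type matcher func(file string) bool

// rule is a plain value type: no lock inside, so copying it is safe.
type rule struct {
	pattern string
	matcher matcher // set once in parse, only read afterwards
}

// compile builds a matcher; path.Match is a stand-in for the real compiler.
func compile(pattern string) matcher {
	return func(file string) bool {
		ok, _ := path.Match(pattern, file)
		return ok
	}
}

// parse precompiles every matcher in a single goroutine.
func parse(patterns []string) []rule {
	rules := make([]rule, 0, len(patterns))
	for _, p := range patterns {
		rules = append(rules, rule{pattern: p, matcher: compile(p)})
	}
	return rules
}

// match never writes to r, so concurrent calls on a shared rule only read.
func (r *rule) match(file string) bool {
	m := r.matcher
	if m == nil {
		m = compile(r.pattern) // throwaway for rules built outside parse
	}
	return m(file)
}

func main() {
	rules := parse([]string{"*.go"})
	hits := make([]bool, 64)
	var wg sync.WaitGroup
	for i := range hits {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			hits[i] = rules[0].match("main.go") // shared rule, read-only
		}(i)
	}
	wg.Wait()
	fmt.Println(hits[0], hits[63])
}
```

Because the hot path never assigns to the cached field, no mutex is needed and the rule stays copyable.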
…tic builds so the dlopen'd FTS extension resolves The prior -rdynamic fix exports symbols into the dynamic symbol table, but -rdynamic cannot export a symbol that was never linked in. liblbug's dlopen'd FTS (and other) extensions resolve liblbug's C++ RTTI (typeinfo/vtable for e.g. lbug::catalog::IndexAuxInfo) FROM THE HOST PROCESS; those are weak COMDAT objects in liblbug.a that gortex's plain-C API never references, so demand-driven archive selection drops them and -rdynamic has nothing to export. On linux, wrap -llbug in -Wl,--whole-archive / -Wl,--no-whole-archive so every liblbug object (and thus every weak typeinfo/vtable) is linked into the binary, exactly as a shared liblbug would expose them; -rdynamic then puts them in the dynamic symbol table for the extension to bind at load. darwin needs none of this — ld64 pulls the typeinfo objects in on its own, so -rdynamic alone suffices. The matching CGO_LDFLAGS_ALLOW='-Wl,--(no-)?whole-archive' (cgo doesn't allowlist --whole-archive) is already wired into the Makefile / ci.yml / init-smoke.yml / goreleaser build paths.
…go packages cgo_shared.go now passes -Wl,--whole-archive in its #cgo LDFLAGS (forces liblbug's weak C++ RTTI into static builds so the dlopen'd FTS extension resolves). That flag is not on cgo's #cgo LDFLAGS allowlist, so govulncheck — which loads and compiles the cgo packages through the Go toolchain — failed with 'invalid flag in #cgo LDFLAGS: -Wl,--whole-archive'. ci.yml, init-smoke.yml and goreleaser already export CGO_LDFLAGS_ALLOW; the security workflow did not. Set CGO_LDFLAGS_ALLOW at the workflow level, the same value ci.yml uses. Checked the other workflows: release.yml builds via the goreleaser-cross container driven by .goreleaser.yml (which carries its own env), and bench-arm.yml benches no liblbug-importing package, so neither needs the flag.
Summary
Benchmarks of candidate persistence-layer backends for running gortex without keeping the whole graph in memory, targeting huge repositories in a local (non-remote-server) setup.
Results
gortex scale (~2000 files, ~125k nodes, ~520k edges):
Declined due to performance issues
On the Linux scale, cozo confirmed the issue with extremely slow queries, so the cozo option was declined too.
Kuzu is a public archive, so it was declined too.