TODO

Project

  • Move project to a dedicated organization
  • Create website
    • Build something like hardwood.dev, but for Vortex files

Performance

  • Benchmark publishing — drop CI workflow, add bench-publish script; see ADR-0006.
  • Performance tests must be peer-reviewed
  • Run performance tests on other machines (I only have access to an Apple M5)
  • Vector API adoption — deferred; see ADR-0005 for adoption criteria and candidate loops.

Security

Contract: the reader memory-maps and parses untrusted binary input. Every malformed input must throw VortexException, never ArrayIndexOutOfBoundsException, NegativeArraySizeException, OutOfMemoryError, StackOverflowError, a raw FlatBuffer runtime exception, or a Protobuf parser exception. Each entry below is either a known gap, a contract audit, or supporting infra.

Per-encoding adversarial tests

Each encoding's decode(DecodeContext) should be exercised against:

  • bufferIndices[i] >= ctx.bufferCount() → centralize check in DecodeContext.buffer(i).
  • Crafted metadata that decodes but disagrees with the buffer payload.
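
A minimal sketch of the centralized buffer-index check, assuming a heavily simplified `DecodeContext` that only models buffer lookup. The class shapes here are hypothetical stand-ins; the real `DecodeContext` and `VortexException` may look different.

```java
import java.nio.ByteBuffer;
import java.util.List;

// Hypothetical stand-in for the project's exception type.
class VortexException extends RuntimeException {
    VortexException(String message) { super(message); }
}

// Hypothetical, simplified DecodeContext: only the buffer lookup is modeled.
final class DecodeContext {
    private final List<ByteBuffer> buffers;

    DecodeContext(List<ByteBuffer> buffers) {
        this.buffers = List.copyOf(buffers);
    }

    int bufferCount() {
        return buffers.size();
    }

    // Single choke point: every encoding resolves buffer indices here, so an
    // out-of-range index never surfaces as IndexOutOfBoundsException.
    ByteBuffer buffer(int index) {
        if (index < 0 || index >= buffers.size()) {
            throw new VortexException(
                "buffer index " + index + " out of range [0, " + buffers.size() + ")");
        }
        // Read-only view with its own position, so one decoder cannot
        // mutate the input or disturb another decoder's cursor.
        return buffers.get(index).asReadOnlyBuffer();
    }
}
```

With the check in one place, the per-encoding adversarial test reduces to asserting that `ctx.buffer(ctx.bufferCount())` throws `VortexException`.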

Per-encoding gotchas:

  • VarBin: offsets non-monotonic, negative, past data-buffer length.
  • Dict: codes[i] >= values.length; codes ptype declared u8 but values count > 256.
  • Bitpacked: bit_width < 0 || bit_width > 64; packed_len < n * bit_width / 8.
  • ALP: dim < 0, f_or_d byte out of enum range; exceptions_count > n.
  • Sparse: indices unsorted or indices[i] >= length; values count does not match indices count.
  • Chunked: zero children with non-zero row_count; child layout self-referencing (already protected by depth limit, but add explicit test).
  • Struct: fieldNames.size() != children.size(); field name UTF-8 invalid.
  • RLE / RunEnd: run_ends non-monotonic; last run_end != row_count.
  • Constant: protobuf scalar value missing or type-mismatched against declared DType.
  • Zoned: zone-map min > max; zone count ≠ child chunk count.
  • Pco: bits_per_offset > 64; bin_count == 0 with non-empty page; per-page n greater than DEFAULT_MAX_PAGE_N; ANS state values inconsistent with weight table.
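
Three of the gotchas above (VarBin, Dict, Bitpacked) can be sketched as standalone validators. This is an illustration only: `EncodingChecks`, its method names, and the `VortexException` stub are hypothetical. The Bitpacked check rounds the required size up to whole bytes and ignores any block padding the real format may use; both are assumptions.

```java
// Hypothetical stand-in for the project's exception type.
class VortexException extends RuntimeException {
    VortexException(String message) { super(message); }
}

// Illustrative validators; names and signatures are assumptions, not the project's API.
final class EncodingChecks {
    private EncodingChecks() {}

    // VarBin: offsets must be non-negative, non-decreasing, and within the data buffer.
    static void checkVarBinOffsets(long[] offsets, long dataLength) {
        long previous = 0;
        for (int i = 0; i < offsets.length; i++) {
            long offset = offsets[i];
            if (offset < previous) {
                throw new VortexException(
                    "VarBin offset[" + i + "]=" + offset + " is negative or decreasing");
            }
            if (offset > dataLength) {
                throw new VortexException(
                    "VarBin offset[" + i + "]=" + offset + " exceeds data length " + dataLength);
            }
            previous = offset;
        }
    }

    // Dict: every code must index into the values array.
    static void checkDictCodes(int[] codes, int valueCount) {
        for (int i = 0; i < codes.length; i++) {
            if (codes[i] < 0 || codes[i] >= valueCount) {
                throw new VortexException(
                    "Dict code[" + i + "]=" + codes[i] + " outside [0, " + valueCount + ")");
            }
        }
    }

    // Bitpacked: bit width must be in [0, 64] and the payload must hold all rows.
    static void checkBitPacked(int bitWidth, long rowCount, long packedLength) {
        if (bitWidth < 0 || bitWidth > 64) {
            throw new VortexException("Bitpacked bit_width out of range: " + bitWidth);
        }
        if (rowCount < 0 || packedLength < 0) {
            throw new VortexException("Bitpacked negative row count or length");
        }
        long requiredBits;
        try {
            // A hostile row count must not wrap around silently.
            requiredBits = Math.multiplyExact(rowCount, (long) bitWidth);
        } catch (ArithmeticException overflow) {
            throw new VortexException(
                "Bitpacked size overflows: " + rowCount + " rows x " + bitWidth + " bits");
        }
        // Round up to whole bytes without risking overflow on requiredBits + 7.
        long requiredBytes = requiredBits / 8 + (requiredBits % 8 == 0 ? 0 : 1);
        if (packedLength < requiredBytes) {
            throw new VortexException(
                "Bitpacked payload too short: need " + requiredBytes + " bytes, got " + packedLength);
        }
    }
}
```

Each validator runs before any array is indexed, so a malformed input is rejected with `VortexException` rather than failing later with an unchecked runtime exception.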

Resource caps

  • Implement ResourceLimits + ReadOptions — see ADR-0004 for design, defaults, and enforcement points. Also covers Pco page/bin caps.
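
As a rough illustration of one enforcement point, the sketch below validates a declared element count against a byte cap before any array is allocated. The single-field `ResourceLimits` record and the `checkedLength` method are hypothetical; the actual fields, defaults, and enforcement points are whatever ADR-0004 specifies.

```java
// Hypothetical stand-in for the project's exception type.
class VortexException extends RuntimeException {
    VortexException(String message) { super(message); }
}

// Hypothetical shape: a single allocation cap. ADR-0004 defines the real design.
record ResourceLimits(long maxAllocationBytes) {
    ResourceLimits {
        if (maxAllocationBytes < 0) {
            throw new IllegalArgumentException("maxAllocationBytes must be non-negative");
        }
    }

    // Validates a count read from untrusted metadata and returns it as an
    // array length, or throws before any memory is reserved.
    int checkedLength(long declaredCount, int bytesPerElement) {
        if (bytesPerElement <= 0) {
            throw new IllegalArgumentException("bytesPerElement must be positive");
        }
        if (declaredCount < 0 || declaredCount > Integer.MAX_VALUE) {
            throw new VortexException("declared element count out of range: " + declaredCount);
        }
        // Cannot overflow: both factors are below 2^31, so the product fits in a long.
        long bytes = declaredCount * bytesPerElement;
        if (bytes > maxAllocationBytes) {
            throw new VortexException(
                "allocation of " + bytes + " bytes exceeds limit " + maxAllocationBytes);
        }
        return (int) declaredCount;
    }
}
```

A decoder would then write `new long[limits.checkedLength(n, Long.BYTES)]` instead of `new long[(int) n]`, which turns a potential `NegativeArraySizeException` or `OutOfMemoryError` into a `VortexException`.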

Fuzz infrastructure

  • Jazzer + JUnit 5 — add com.code-intelligence:jazzer-junit test dep. Two modes: regression (./mvnw test, replays saved corpus + crashes) and fuzz (JAZZER_FUZZ=1, nightly profile). See research notes in branch worktree-security-fuzz commit history.
  • Seed corpus from integration fixtures — drop existing .vortex test files into reader/src/test/resources/fuzz-corpus/full-file/. Per-encoding sub-corpora extracted via a small tool that walks fixtures and dumps each segment to core/src/test/resources/fuzz-corpus/<encoding>/.
  • Fuzz targets: VortexReader.open(byte[]), PostscriptParser.parseBlobs, and one @FuzzTest per encoding, targeting its Encoding.decode. Crash oracle: ignore = {VortexException.class}.
  • Differential fuzz (Java vs Rust) — round-trip random bytes through Java decode and vortex-jni; assert both throw or both return identical row count + values. Reuse RustWritesJavaReadsIntegrationTest harness.
  • OSS-Fuzz submission — Jazzer is a first-class OSS-Fuzz engine; submit the project once the corpus + targets stabilize. Free continuous fuzzing.
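
The crash oracle described above (only `VortexException` is an acceptable failure) can be illustrated without Jazzer. The sketch below is a plain-Java stand-in that uses `java.util.Random` instead of coverage-guided mutation; `CrashOracle`, `ToyDecoders`, and the `VortexException` stub are all hypothetical names introduced for the example.

```java
import java.util.Random;
import java.util.function.Consumer;

// Hypothetical stand-in for the project's exception type.
class VortexException extends RuntimeException {
    VortexException(String message) { super(message); }
}

final class CrashOracle {
    private CrashOracle() {}

    // Returns null when the target honors the contract for this input,
    // otherwise the throwable that escaped.
    static Throwable check(Consumer<byte[]> target, byte[] input) {
        try {
            target.accept(input);
            return null; // decoded successfully
        } catch (VortexException expected) {
            return null; // rejected cleanly
        } catch (Throwable violation) {
            return violation; // e.g. ArrayIndexOutOfBoundsException
        }
    }

    // Feeds pseudo-random inputs and returns the first violation, or null.
    static Throwable fuzz(Consumer<byte[]> target, long seed, int iterations, int maxLength) {
        Random random = new Random(seed);
        for (int i = 0; i < iterations; i++) {
            byte[] input = new byte[random.nextInt(maxLength + 1)];
            random.nextBytes(input);
            Throwable violation = check(target, input);
            if (violation != null) {
                return violation;
            }
        }
        return null;
    }
}

// Toy decoders: the first byte declares an index into the input.
final class ToyDecoders {
    private ToyDecoders() {}

    // Buggy: trusts the declared index.
    static byte unchecked(byte[] input) {
        if (input.length == 0) return 0;
        return input[input[0] & 0xFF];
    }

    // Fixed: validates the declared index first.
    static byte checked(byte[] input) {
        if (input.length == 0) throw new VortexException("empty input");
        int index = input[0] & 0xFF;
        if (index >= input.length) throw new VortexException("index past end: " + index);
        return input[index];
    }
}
```

`CrashOracle.check(ToyDecoders::unchecked, new byte[]{5})` returns an `ArrayIndexOutOfBoundsException`, while the same input against `ToyDecoders::checked` returns `null`. In the real setup, Jazzer would supply the inputs and the oracle rule would be expressed through its own configuration.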

Build

  • Use JPMS; watch out for "dfa1" in the package name

Tooling

  • Optional vortex-arrow bridge module for Arrow ecosystem interop — see ADR-0016

API

  • Error messages: structural sanitization of VortexException. Phase E (bounds typing via IoBounds) has shipped; Phases A–D (the Sanitize helper + VortexError catalog) remain. See ADR-0003 for design and phasing.
  • Use domain primitives (UInt32, UInt64, etc.) as value classes via Project Valhalla instead of raw long/int

Compute

  • Compute primitives (encoded-domain specialization & façade): the remaining ADR-0013 follow-ups now that the fused kernels have shipped. See ADR-0013.
    • Done: §4 Predicate; §5 RowFilter unified over Predicate; §6 zone-map aggregate push-down in both tiers.
    • §6, whole-zone tier: the ZoneReducer fold is wired into VortexAggregatePushDownRule, which rewrites a whole-table MIN/MAX/COUNT/SUM/AVG to a single-row Values and is auto-registered over a bare jdbc:calcite: connection.
    • §6, boundary/residual tier: a SUM/MIN/MAX/COUNT with a WHERE folds fully-selected zones from stats and decodes only the straddling boundary chunks via the fused Compute.filteredAggregate.
    • Removed: §1–§3 (Mask, the FilterKernel/MapKernel/ReduceKernel interfaces) were built, then removed in favor of the fused single-pass kernels, which need no intermediate bitmap.
    • Next: (1) benchmark the fused-kernel speedup (ComputeKernelBenchmark, 100M rows) to record §6's payoff; (2) encoded-domain kernel specialization: push the predicate into the ALP/FoR/Dict integer domain without decoding (its own perf ADR); (3) the columnar transducer façade (ADR-0019).
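
The two-tier zone-map push-down can be sketched for a single query shape, `SUM(x) WHERE x >= threshold`. This is a toy model under stated assumptions: `Zone`, `SumResult`, and `ZoneAggregate` are hypothetical names, zones hold plain `long[]` values, and nothing here reflects the actual `ZoneReducer` or `Compute.filteredAggregate` signatures.

```java
import java.util.List;

// Hypothetical zone: precomputed stats plus the values they summarize.
record Zone(long min, long max, long sum, long[] values) {
    static Zone of(long... values) {
        long min = Long.MAX_VALUE;
        long max = Long.MIN_VALUE;
        long sum = 0;
        for (long v : values) {
            min = Math.min(min, v);
            max = Math.max(max, v);
            sum += v;
        }
        return new Zone(min, max, sum, values);
    }
}

// The total plus counters showing how each zone was handled.
record SumResult(long sum, int zonesFromStats, int zonesDecoded) {}

final class ZoneAggregate {
    private ZoneAggregate() {}

    // SUM(x) WHERE x >= threshold
    static SumResult sumAtLeast(List<Zone> zones, long threshold) {
        long sum = 0;
        int fromStats = 0;
        int decoded = 0;
        for (Zone zone : zones) {
            if (zone.max() < threshold) {
                continue; // no row can match: skip without decoding
            }
            if (zone.min() >= threshold) {
                sum += zone.sum(); // every row matches: fold from stats
                fromStats++;
                continue;
            }
            // Boundary zone: filter and sum in one pass, no intermediate bitmap.
            for (long v : zone.values()) {
                if (v >= threshold) {
                    sum += v;
                }
            }
            decoded++;
        }
        return new SumResult(sum, fromStats, decoded);
    }
}
```

For zones `{1,2,3}`, `{4,5,6}`, `{7,8,9}` and a threshold of 5, the first zone is skipped, the second is decoded as a boundary zone (5 + 6), and the third is folded from its stats (24), giving a sum of 35 with only one zone decoded.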

Encodings

See docs/compatibility.md for the full encoding support table and S3 fixture status.