diff --git a/.github/copilot-instructions.md b/.github/copilot-instructions.md new file mode 100644 index 000000000..cb6004c7f --- /dev/null +++ b/.github/copilot-instructions.md @@ -0,0 +1,64 @@ +When performing a code review, check that: + +## SemVer and API Compatibility + + The workspace obeys SemVer. Removing or changing public API signatures (functions, types, re-exports) requires a version bump, suggested migration impact in the PR description. + +## Dependency and Build Hygiene +- Check for changes that introduce new dependencies or increase build times. If a dependency is added, it must be justified in the PR description. + + +## Error Handling + +- Do not introduce `panic!` paths for recoverable errors — propagate with `Result` instead. +- Keep error types small. Avoid large enums/structs that blow up the stack; look for ways to reduce field sizes (e.g., compute derivable fields, use enums instead of `&'static str`). +- Prefer `ANNError::new(ANNErrorKind::…, e)` over the old `log_*`-style constructors, which force eager string formatting and double-log errors. +- When using `thiserror`, rely on `#[from]` for automatic `Error::source` chaining — do not format the inner error in the `#[error("…")]` display string. +- Include relevant context values (e.g., the kind, key, or dimension) in error messages for debuggability. + +## Documentation + +- Doc comments and README examples must match actual API signatures and serialized shapes. +- Stale examples that fail to compile or deserialize are treated as bugs. +- Do not leave dead references to APIs that no longer exist. +- When changing a function signature or removing a parameter, update all doc comments that mention the old signature. + +## Constants and Assumptions + +- Do not hardcode magic values — make them configurable with sensible defaults and document the rationale. +- Add assertions for invariants that callers or maintainers would otherwise have to discover by reading the implementation. + +## SIMD and Platform Portability + +- Do not assume specific SIMD lane widths (e.g., `f32s::LANES == 8`). Code must be correct on AVX2, AVX-512, and ARM/NEON. +- When touching architecture-specific intrinsics, verify cross-platform behavior per `diskann-wide/README.md`. + +## Testing +- Do not drop existing unit tests without a strong reason. +- Keep test helpers close to the code they exercise, typically in a `mod tests` at the bottom of the file or in an adjacent test module, guarded with `#[cfg(test)]`. +- Do not add tests for derived traits (`Clone`, `Debug`, `PartialEq`) or enums unless they have explicit behavior beyond the derive. +- Test edge cases like empty inputs (e.g., empty iterators) to lock in defined behavior and prevent divide-by-zero or NaN results. + +## Rayon and Parallelism + +- Never use the global Rayon thread pool. Always execute parallel work within the provided `RayonThreadPool` or `RayonThreadPoolRef`. +- Preserve deadlock-avoidance intent when modifying nested parallel loops. +- Be aware that combining blocking synchronization (e.g., mutex acquisition) with Rayon work-stealing can cause deadlocks. + +## Unsafe Code and Safety + +- Every `unsafe` block must have a `// SAFETY:` comment directly above it explaining why the operation is sound. +- Safety comments must be specific and verifiable — state the concrete precondition that makes the operation safe (e.g., `// SAFETY: i + width <= len ensures this read is in-bounds`). Do not use vague justifications like `// SAFETY: this is safe`. +- Safety contracts on `unsafe fn` signatures must be internally consistent — if the documented precondition says `scratch.len() >= n`, ensure the implementation does not write beyond `n` elements (e.g., due to rounding up to a panel/block size). +- Prefer safe abstractions over raw `unsafe` when possible. Use `unsafe` only when there is a measurable performance benefit or when interfacing with FFI/intrinsics. +- For pointer arithmetic, prefer `offset_from` to express bounds rather than `wrapping_add` unless wrapping behavior is intentionally needed — and document why. +- When calling SIMD intrinsics or FFI, list the specific preconditions being satisfied (alignment, length, non-null, valid initialization). + +## License Headers +- Each file has a license header. +``` +/* + * Copyright (c) Microsoft Corporation. + * Licensed under the MIT license. + */ +``` \ No newline at end of file diff --git a/.github/instructions.md b/.github/instructions.md deleted file mode 100644 index 784882c81..000000000 --- a/.github/instructions.md +++ /dev/null @@ -1,11 +0,0 @@ -When performing a code review, check that: -- No unit tests were eliminated without a strong reason. -- Additional dependencies introduced have a strong justification. -- Changes are not likely to increase build times. -- Each file has a license header. -``` -/* - * Copyright (c) Microsoft Corporation. - * Licensed under the MIT license. - */ -``` diff --git a/AGENTS.md b/AGENTS.md new file mode 100644 index 000000000..b916bc394 --- /dev/null +++ b/AGENTS.md @@ -0,0 +1,182 @@ +# DiskANN Repository - Agent Onboarding Guide + +**Last Updated**: 2026-05-04 (based on v0.50.1, Rust 1.92) + + +## Crate Organization + +**Tier 1: Foundation** +- `diskann-wide/` - Low-level SIMD, bit manipulation, type width abstractions +- `diskann-vector/` - Vector primitives and operations +- `diskann-platform/` - Platform-specific utilities + +**Tier 2: Core Libraries** +- `diskann-linalg/` - Linear algebra operations +- `diskann-utils/` - Shared utilities (Reborrow, MatrixView traits) +- `diskann-quantization/` - Vector quantization + +**Tier 3: Algorithm & Proivders** +- `diskann/` - Core indexing logic +- `diskann-providers/` - Hodge-podge of stuff, will be dismantled +- `diskann-disk/` - Disk-based provider for the index with io_uring support +- `diskann-label-filter/` - Inverted index for filtered search +- `diskann-garnet/` - Garnet (Redis-compatible) Provider and FFI endpoints for vector sets + +**Tier 4: Infrastructure & Tools** +- `diskann-benchmark-runner/` - Test runner infrastructure +- `diskann-benchmark-core/` - Benchmark framework +- `diskann-benchmark-simd/` - SIMD-specific benchmarks +- `diskann-benchmark/` - Benchmark definitions and runners +- `diskann-tools/` - CLI utilities (autotuner, etc.) +- `vectorset/` - Garnet client for benchmarking vector set workloads + +--- + +### Internal Dependencies + + +- Tier 1 and Tier 2 crates may be added as dependencies of any internal crate +- `diskann` may be added as a dependency of any equal or higher tier internal crate except those below +- Do not add Tier 3 crates as dependencies of these Tier 4 crates: + - `diskann-benchmark-runner` + - `diskann-benchmark-core` (`diskann` is allowed) + - `diskann-benchmark-simd` + +--- + +## Boundaries + +### Never + +- Modify files in `diskann/tests/generated/` by hand — these are auto-generated baselines. Regenerate with `DISKANN_TEST=overwrite`. +- Modify `rust-toolchain.toml`, `.github/workflows/`, or `.codecov.yml` without explicit approval. +- Use the global Rayon thread pool — use `RayonThreadPool`/`RayonThreadPoolRef` (enforced by `clippy.toml` disallowed methods). +- Use `rand::thread_rng` — use the project's `random.rs` utilities instead (enforced by `clippy.toml`). +- Use `vfs::PhysicalFS::new` or `VirtualStorageProvider::new_physical()` in tests — use `VirtualStorageProvider::new_overlay()`. +- Remove or weaken existing tests without a strong, documented reason. +- Commit secrets, credentials, or API keys. + +### Ask First + +- Adding new workspace dependencies to `Cargo.toml` — justify the addition. +- Changing public API signatures in any `diskann-*` crate — requires SemVer analysis (may need a major version bump). +- Modifying tier dependency rules (e.g., adding a Tier 3 dependency to a Tier 4 benchmark crate). +- Changing `clippy.toml` or `rustfmt.toml` lint/formatting configuration. + +### Always + +- Include the MIT license header in every new source file. +- Run `cargo fmt --all` and `cargo clippy --workspace --all-targets -- -Dwarnings` before committing. +- Update doc comments when changing function signatures or removing parameters. +- Add a `// SAFETY:` comment above every `unsafe` block with specific, verifiable preconditions. + +--- + + + + +## Error Handling + +There are three regimes of error handling and the strategy to use depends on the regime. + +### Low-Level + +Low-level crates should use bespoke, precise, non-allocating error types. +Use `thiserror` for boilerplate. Chain with `std::error::Error::source`. +`diskann::ANNError` is not a suitable low-level error type. + +```rust +// ✅ Good — bespoke error type with thiserror, uses #[from] for source chaining +#[derive(Debug, thiserror::Error)] +pub enum SgemmError { + #[error("dimension overflow: {expected_rows}×{expected_cols} exceeds usize")] + DimensionOverflow { expected_rows: usize, expected_cols: usize }, + + #[error(transparent)] + Allocation(#[from] AllocatorError), +} + +// ❌ Bad — single crate-level enum, formats inner error in display string +#[derive(Debug, thiserror::Error)] +pub enum MyLibError { + #[error("sgemm failed: {0}")] // Don't format the inner error here + Sgemm(#[source] SgemmError), + #[error("io failed: {0}")] + Io(#[from] std::io::Error), + // ... 20 more variants — too broad +} +``` + +### Mid-Level (diskann algorithms) + +Use `diskann::ANNError` and its context machinery. This type: + +- Is 16 bytes with niche optimization for `Option`, allowing return in registers. +- Records stack trace of its first construction under `RUST_BACKTRACE=1`. +- Precisely records file and line of creation via `#[track_caller]`. +- Has context layering machinery to add additional information as an error is passed up the stack. +- Is backed internally by `anyhow::Error`. + +When converting to `ANNError`, use `#[track_caller]` for better source reporting. +Prefer `ANNError::new(ANNErrorKind::…, e)` over the old `log_*`-style constructors, which force eager string formatting and double-log errors. + +```rust +// ✅ Good — deferred formatting, precise kind, track_caller +#[track_caller] +fn process_vectors(data: &[f32]) -> Result<(), ANNError> { + validate(data).map_err(|e| ANNError::new(ANNErrorKind::IndexError, e))?; + Ok(()) +} + +// ❌ Bad — eager string formatting, double-logs on creation +fn process_vectors(data: &[f32]) -> Result<(), ANNError> { + validate(data).map_err(|e| ANNError::log_index_error(format!("validation failed: {e}")))?; + Ok(()) +} +``` + +Traits with associated error types should consider constraining with `diskann::error::ToRanked` instead of `Into` if non-critical errors should be supported. +`ANNError` is the mid-level propagated error type; use `ToRanked` and `RankedError` to distinguish transient/recoverable failures from fatal ones. + +### High Level (tooling) + +At this level `anyhow::Error` is appropriate for binaries and CLI entry points. + Note that some tooling helpers still use `ANNError` for compatibility. + +### Do Not + +Do not use a single crate-level error enum. Problems: + +- Provides no documentation on how an individual function could fail +- Encourages **worse** error messages than bespoke types +- Generates large structs that blow up the stack +- Branch-heavy `Drop` implementations which bloat code + + +## Document unsafe usage + +```rust +// ✅ Good — specific, verifiable precondition +// SAFETY: `i + width <= len` ensures this read is in-bounds. +let val = unsafe { ptr.add(i).read() }; + +// ✅ Good — FFI with listed preconditions +// SAFETY: `a` and `b` are non-null, properly aligned, and do not alias. +// `m`, `n`, `k` match the actual matrix dimensions. +unsafe { ffi::sgemm(m, n, k, a.as_ptr(), b.as_ptr(), c.as_mut_ptr()) }; + +// ❌ Bad — vague, unverifiable +// SAFETY: this is safe +let val = unsafe { ptr.add(i).read() }; +``` + + +## CI Pipeline + +CI workflow is defined in [`.github/workflows/ci.yml`](.github/workflows/ci.yml). Key jobs include: +- Format and clippy checks +- Tests on multiple platforms (Linux, Windows) +- [Code coverage](.codecov.yml) +- Architecture compatibility (SDE) + +**Note**: CI uses `cargo-nextest` for running tests. See [`.cargo/nextest.toml`](.cargo/nextest.toml) for test configuration (timeouts, retries, etc.). diff --git a/agents.md b/agents.md deleted file mode 100644 index 0feeca54d..000000000 --- a/agents.md +++ /dev/null @@ -1,233 +0,0 @@ -# DiskANN Repository - Agent Onboarding Guide - -**Last Updated**: 2026-02-11 (based on v0.45.0, Rust 1.92) - -This guide helps coding agents understand how to work efficiently with the DiskANN repository. - ---- - -## Table of Contents - -1. [Repository Overview](#repository-overview) -2. [Repository Structure](#repository-structure) -3. [Testing](#testing) -4. [Code Quality & Linting](#code-quality--linting) - ---- - -## Repository Overview - -**DiskANN** is a Rust implementation of scalable approximate nearest neighbor (ANN) search algorithms. The project is a rewrite from C++ to Rust. - -- **Language**: Rust (Edition 2021), toolchain version in [`rust-toolchain.toml`](rust-toolchain.toml) -- **License**: MIT (see [`LICENSE.txt`](LICENSE.txt)) -- **Version**: See [`Cargo.toml`](Cargo.toml) -- **Architecture**: Cargo workspace with 15+ crates -- **Legacy Code**: Older C++ code is on the `cpp_main` branch (not maintained) - -### Key Resources -- **Contributing**: See [`CONTRIBUTING.md`](CONTRIBUTING.md) (requires CLA) -- **Code of Conduct**: See [`CODE_OF_CONDUCT.md`](CODE_OF_CONDUCT.md) - ---- - -## Repository Structure - -The repository uses a Cargo workspace with crates organized into functional tiers. See [`Cargo.toml`](Cargo.toml) for: -- Workspace members and their dependencies -- Shared dependency versions -- Build profiles (release, ci) -- Workspace-level lints - -### Crate Organization - -**Tier 1: Foundation** -- `diskann-wide/` - Low-level SIMD, bit manipulation, type width abstractions -- `diskann-vector/` - Vector primitives and operations -- `diskann-platform/` - Platform-specific utilities - -**Tier 2: Core Libraries** -- `diskann-linalg/` - Linear algebra operations -- `diskann-utils/` - Shared utilities (Reborrow, MatrixView traits) -- `diskann-quantization/` - Vector quantization (PQ, SQ) - -**Tier 3: Algorithm & Storage** -- `diskann/` - Core ANN graph algorithm and in-memory indexing (CENTRAL crate) -- `diskann-providers/` - Storage abstraction layer -- `diskann-disk/` - Disk-based indexing with io_uring support -- `diskann-label-filter/` - Inverted index for filtered search - -**Tier 4: Infrastructure & Tools** -- `diskann-benchmark-runner/` - Test runner infrastructure -- `diskann-benchmark-core/` - Benchmark framework -- `diskann-benchmark-simd/` - SIMD-specific benchmarks -- `diskann-benchmark/` - Benchmark definitions and runners -- `diskann-tools/` - CLI utilities (autotuner, etc.) - ---- - -## Dependencies - -### Internal - -- Tier 1 and Tier 2 crates may be added as dependencies of any internal crate -- `diskann` may be added as a dependency of any equal or higher tier internal crate except those below -- Do not add Tier 3 crates as dependencies of these Tier 4 crates: - - `diskann-benchmark-runner` - - `diskann-benchmark-core` (`diskann` is allowed) - - `diskann-benchmark-simd` - ---- - -## Testing - -### Test Execution - -```bash -# Run all tests -cargo test - -# Run tests for specific crate -cargo test -p diskann - -# Run specific test -cargo test -p diskann -- --exact test_name - -# Run with CI profile (faster) -cargo test --profile ci - -# Run doc tests -cargo test --doc -``` - -**Note**: CI uses `cargo-nextest` for running tests. See [`.cargo/nextest.toml`](.cargo/nextest.toml) for test configuration (timeouts, retries, etc.). - -### Test Baseline Caching System - -DiskANN uses a baseline caching system for regression detection. See [`diskann/README.md`](diskann/README.md) for a high-level overview of how the baseline system works. For implementation and API details, refer directly to: -- [`diskann/src/test/cache.rs`](diskann/src/test/cache.rs) — core baseline caching APIs -- [`diskann/src/test/cmp.rs`](diskann/src/test/cmp.rs) — `VerboseEq` and related helpers for better test error messages - -### AVX-512, Aarch64, and multi-platform - -When touching architecture-specific intrinsics, run cross-platform validation per `diskann-wide/README.md`: - -- Testing AVX-512 code on non-AVX-512 capable x86-64 machines. -- Testing Aarch64 code on x86-64 machines. -- Testing code compiled for and running on the `x86-64` CPU (no AVX/AVX2) does not execute unsupported instructions. - ---- - -## Code Quality & Linting - -### Error Handling - -There are three regimes of error handling and the strategy to use depends on the regime. - -#### Low-Level - -Low-level crates should use bespoke, precise, non-allocating error types. Use `thiserror` for boilerplate. Chain with `std::error::Error::source`. - -`diskann::ANNError` is not a suitable low-level error type. - -#### Mid-Level (diskann algorithms) - -Use `diskann::ANNError` and its context machinery. This type: - -- Has a small size and `Drop` implementation, so is efficient in function ABIs. -- Records stack trace of its first creation under `RUST_BACKTRACE=1`. -- Precisely records line numbers of creation. -- Has a context layering machinery to add additional information as an error is passed up the stack. - -When converting to `ANNError`, use `#[track_caller]` for better source reporting. - -Traits with associated error types should consider constraining with `diskann::error::ToRanked` instead of `Into` if non-critical errors should be supported. -`diskann::ANNError` should be used only for unrecoverable errors. - -#### High Level (tooling) - -At this level `anyhow::Error` is an appropriate type to use. - -#### Do Not - -Do not use a single crate-level error enum. Problems: - -- Provides no documentation on how an individual function could fail -- Encourages **worse** error messages than bespoke types -- Generates large structs that blow up the stack -- Branch-heavy `Drop` implementations which bloat code - -### Formatting - -**Note**: `rustfmt` is not installed by default. Run `rustup component add rustfmt` if needed. - -```bash -# Check formatting (matches CI) -cargo fmt --all --check - -# Apply formatting to all crates -cargo fmt --all -``` - -See [`rustfmt.toml`](rustfmt.toml) for formatting configuration. - -### Clippy (Linting) - -**Note**: `clippy` is not installed by default. Run `rustup component add clippy` if needed. - -```bash -# Basic clippy check -cargo clippy --workspace --all-targets - -# CI-style check (warnings as errors) -cargo clippy --workspace --all-targets --config 'build.rustflags=["-Dwarnings"]' - -# Check with no default features (for specific crates) -cargo clippy -p diskann --no-default-features -``` - -See [`clippy.toml`](clippy.toml) for linting rules, including: -- Disallowed methods (rayon global thread pool, rand::thread_rng, etc.) -- Required documentation for unsafe blocks - -### Code Coverage - -Code coverage of changes is required for PRs. See [`.codecov.yml`](.codecov.yml) for coverage policy and thresholds. - -### CI Pipeline - -CI workflow is defined in [`.github/workflows/ci.yml`](.github/workflows/ci.yml). Key jobs include: -- Format and clippy checks -- Tests on multiple platforms (Linux, Windows) -- Code coverage -- Architecture compatibility (SDE) - -### Test Patterns - -**DO**: -- Look for existing setup/execution infrastructure -- Factor out common patterns - -**DON'T**: -- Add tests for derived traits (Clone, Debug, PartialEq) -- Add tests for enums unless they have explicit functionality - ---- - -## Pre-commit Checklist - -Before committing changes, always run: - -```bash -# Format all code -cargo fmt --all - -# Run clippy with warnings as errors -cargo clippy --workspace --all-targets -- -D warnings -``` - ---- - -**End of Agent Onboarding Guide** - -*This guide should be updated when major changes occur to the repository structure or development workflows.* diff --git a/diskann-wide/AGENTS.md b/diskann-wide/AGENTS.md new file mode 100644 index 000000000..6b6d9c781 --- /dev/null +++ b/diskann-wide/AGENTS.md @@ -0,0 +1,7 @@ +### Multi-platform support + +When touching architecture-specific intrinsics, run cross-platform validation per `diskann-wide/README.md` and test: + +- AVX-512 code on non-AVX-512 capable x86-64 machines. +- Aarch64 code on x86-64 machines. +- code compiled for and running on the `x86-64` CPU (no AVX/AVX2) does not execute unsupported instructions. diff --git a/diskann/AGENTS.md b/diskann/AGENTS.md new file mode 100644 index 000000000..411b61bc9 --- /dev/null +++ b/diskann/AGENTS.md @@ -0,0 +1,15 @@ +### Test Baseline Caching System + +DiskANN uses a baseline caching system for regression detection. +Test results are serialized as JSON into `diskann/tests/generated/` and compared against on subsequent runs. +Any difference is flagged as a test failure. + +- To regenerate baselines: run tests with `DISKANN_TEST=overwrite` +- Before checking in: delete `diskann/tests/generated/` first, then regenerate to prune unused baselines +- Regenerated JSON files should be inspected via `git diff` during review + +The APIs are **`pub(crate)`** (internal to the `diskann` crate only): +- [`diskann/src/test/cache.rs`](diskann/src/test/cache.rs) — `get_or_save_test_results`, `TestRoot`, `TestPath` +- [`diskann/src/test/cmp.rs`](diskann/src/test/cmp.rs) — `VerboseEq` trait, `verbose_eq!` macro, `assert_eq_verbose!` + +See [`diskann/README.md`](diskann/README.md) for additional details. \ No newline at end of file diff --git a/diskann/README.md b/diskann/README.md index 1ca0587b8..1ff46e940 100644 --- a/diskann/README.md +++ b/diskann/README.md @@ -6,7 +6,7 @@ Stay tuned for more updates! ### Test Baselines -Developers are strongly encouraged to consider the [caching infrastructure]() +Developers are strongly encouraged to consider the [caching infrastructure](src/test/cache.rs) when writing index tests to provide an early warning of algorithmic changes. This infrastructure serializes test results into a file in `diskann/tests/generated` @@ -23,7 +23,7 @@ during the review process to flag regressions early. Before checking in new test results, it's a good idea to completely delete `diskann/tests/generated` to ensure that unused baselines get removed from the repository. -The API for registering and retrieving test results is in `diskann/src/tests/cache` +The API for registering and retrieving test results is in [`diskann/src/test/cache.rs`](src/test/cache.rs) and consists of: * `fn get_or_save_test_results(test_name: &str, results: &R) -> R`: Get the results for @@ -37,11 +37,11 @@ and consists of: The above API will return the previously saved baseline in the normal test mode, which can be compared with the `results` argument. -When comparing baselines, developers should use the `diskann::tests::cmp::VerboseEq` +When comparing baselines, developers should use the `diskann::test::cmp::VerboseEq` which provides more diagnostics regarding the source of structural inequality than the standard libraries `PartialEq` trait. Additional utilities include -* `diskann::tests::cmp::verbose_eq!`: A trait for automatically implementing `VerboseEq`. +* `diskann::test::cmp::verbose_eq!`: A macro for automatically implementing `VerboseEq`. This macro can be used until a proper `derive` macro is implemented: ```rust use diskann::test::cmp::verbose_eq;