Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
64 changes: 64 additions & 0 deletions .github/copilot-instructions.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,64 @@
When performing a code review, check that:

## SemVer and API Compatibility

The workspace obeys SemVer. Removing or changing public API signatures (functions, types, re-exports) requires a version bump, suggested migration impact in the PR description.

## Dependency and Build Hygiene
- Check for changes that introduce new dependencies or increase build times. If a dependency is added, it must be justified in the PR description.


## Error Handling

- Do not introduce `panic!` paths for recoverable errors — propagate with `Result` instead.
- Keep error types small. Avoid large enums/structs that blow up the stack; look for ways to reduce field sizes (e.g., compute derivable fields, use enums instead of `&'static str`).
- Prefer `ANNError::new(ANNErrorKind::…, e)` over the old `log_*`-style constructors, which force eager string formatting and double-log errors.
- When using `thiserror`, rely on `#[from]` for automatic `Error::source` chaining — do not format the inner error in the `#[error("…")]` display string.
- Include relevant context values (e.g., the kind, key, or dimension) in error messages for debuggability.

## Documentation

- Doc comments and README examples must match actual API signatures and serialized shapes.
- Stale examples that fail to compile or deserialize are treated as bugs.
- Do not leave dead references to APIs that no longer exist.
- When changing a function signature or removing a parameter, update all doc comments that mention the old signature.

## Constants and Assumptions

- Do not hardcode magic values — make them configurable with sensible defaults and document the rationale.
- Add assertions for invariants that callers or maintainers would otherwise have to discover by reading the implementation.

## SIMD and Platform Portability

- Do not assume specific SIMD lane widths (e.g., `f32s::LANES == 8`). Code must be correct on AVX2, AVX-512, and ARM/NEON.
- When touching architecture-specific intrinsics, verify cross-platform behavior per `diskann-wide/README.md`.

## Testing
- Do not drop existing unit tests without a strong reason.
- Keep test helpers close to the code they exercise, typically in a `mod tests` at the bottom of the file or in an adjacent test module, guarded with `#[cfg(test)]`.
- Do not add tests for derived traits (`Clone`, `Debug`, `PartialEq`) or enums unless they have explicit behavior beyond the derive.
- Test edge cases like empty inputs (e.g., empty iterators) to lock in defined behavior and prevent divide-by-zero or NaN results.

## Rayon and Parallelism

- Never use the global Rayon thread pool. Always execute parallel work within the provided `RayonThreadPool` or `RayonThreadPoolRef`.
- Preserve deadlock-avoidance intent when modifying nested parallel loops.
- Be aware that combining blocking synchronization (e.g., mutex acquisition) with Rayon work-stealing can cause deadlocks.

## Unsafe Code and Safety

- Every `unsafe` block must have a `// SAFETY:` comment directly above it explaining why the operation is sound.
- Safety comments must be specific and verifiable — state the concrete precondition that makes the operation safe (e.g., `// SAFETY: i + width <= len ensures this read is in-bounds`). Do not use vague justifications like `// SAFETY: this is safe`.
- Safety contracts on `unsafe fn` signatures must be internally consistent — if the documented precondition says `scratch.len() >= n`, ensure the implementation does not write beyond `n` elements (e.g., due to rounding up to a panel/block size).
- Prefer safe abstractions over raw `unsafe` when possible. Use `unsafe` only when there is a measurable performance benefit or when interfacing with FFI/intrinsics.
- For pointer arithmetic, prefer `offset_from` to express bounds rather than `wrapping_add` unless wrapping behavior is intentionally needed — and document why.
- When calling SIMD intrinsics or FFI, list the specific preconditions being satisfied (alignment, length, non-null, valid initialization).

## License Headers
- Each file has a license header.
```
/*
* Copyright (c) Microsoft Corporation.
* Licensed under the MIT license.
*/
```
11 changes: 0 additions & 11 deletions .github/instructions.md

This file was deleted.

182 changes: 182 additions & 0 deletions AGENTS.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,182 @@
# DiskANN Repository - Agent Onboarding Guide

**Last Updated**: 2026-05-04 (based on v0.50.1, Rust 1.92)


## Crate Organization

**Tier 1: Foundation**
- `diskann-wide/` - Low-level SIMD, bit manipulation, type width abstractions
- `diskann-vector/` - Vector primitives and operations
- `diskann-platform/` - Platform-specific utilities

**Tier 2: Core Libraries**
- `diskann-linalg/` - Linear algebra operations
- `diskann-utils/` - Shared utilities (Reborrow, MatrixView traits)
- `diskann-quantization/` - Vector quantization

**Tier 3: Algorithm & Proivders**
- `diskann/` - Core indexing logic
- `diskann-providers/` - Hodge-podge of stuff, will be dismantled
- `diskann-disk/` - Disk-based provider for the index with io_uring support
- `diskann-label-filter/` - Inverted index for filtered search
- `diskann-garnet/` - Garnet (Redis-compatible) Provider and FFI endpoints for vector sets

**Tier 4: Infrastructure & Tools**
- `diskann-benchmark-runner/` - Test runner infrastructure
- `diskann-benchmark-core/` - Benchmark framework
- `diskann-benchmark-simd/` - SIMD-specific benchmarks
- `diskann-benchmark/` - Benchmark definitions and runners
- `diskann-tools/` - CLI utilities (autotuner, etc.)
- `vectorset/` - Garnet client for benchmarking vector set workloads

---

### Internal Dependencies


- Tier 1 and Tier 2 crates may be added as dependencies of any internal crate
- `diskann` may be added as a dependency of any equal or higher tier internal crate except those below
- Do not add Tier 3 crates as dependencies of these Tier 4 crates:
- `diskann-benchmark-runner`
- `diskann-benchmark-core` (`diskann` is allowed)
- `diskann-benchmark-simd`

---

## Boundaries

### Never

- Modify files in `diskann/tests/generated/` by hand — these are auto-generated baselines. Regenerate with `DISKANN_TEST=overwrite`.
- Modify `rust-toolchain.toml`, `.github/workflows/`, or `.codecov.yml` without explicit approval.
- Use the global Rayon thread pool — use `RayonThreadPool`/`RayonThreadPoolRef` (enforced by `clippy.toml` disallowed methods).
- Use `rand::thread_rng` — use the project's `random.rs` utilities instead (enforced by `clippy.toml`).
- Use `vfs::PhysicalFS::new` or `VirtualStorageProvider::new_physical()` in tests — use `VirtualStorageProvider::new_overlay()`.
Comment on lines +49 to +55
- Remove or weaken existing tests without a strong, documented reason.
- Commit secrets, credentials, or API keys.

### Ask First

- Adding new workspace dependencies to `Cargo.toml` — justify the addition.
- Changing public API signatures in any `diskann-*` crate — requires SemVer analysis (may need a major version bump).
- Modifying tier dependency rules (e.g., adding a Tier 3 dependency to a Tier 4 benchmark crate).
- Changing `clippy.toml` or `rustfmt.toml` lint/formatting configuration.

### Always

- Include the MIT license header in every new source file.
- Run `cargo fmt --all` and `cargo clippy --workspace --all-targets -- -Dwarnings` before committing.
- Update doc comments when changing function signatures or removing parameters.
- Add a `// SAFETY:` comment above every `unsafe` block with specific, verifiable preconditions.

---




## Error Handling

There are three regimes of error handling and the strategy to use depends on the regime.

### Low-Level

Low-level crates should use bespoke, precise, non-allocating error types.
Use `thiserror` for boilerplate. Chain with `std::error::Error::source`.
`diskann::ANNError` is not a suitable low-level error type.

```rust
// ✅ Good — bespoke error type with thiserror, uses #[from] for source chaining
#[derive(Debug, thiserror::Error)]
pub enum SgemmError {
#[error("dimension overflow: {expected_rows}×{expected_cols} exceeds usize")]
DimensionOverflow { expected_rows: usize, expected_cols: usize },

#[error(transparent)]
Allocation(#[from] AllocatorError),
}

// ❌ Bad — single crate-level enum, formats inner error in display string
#[derive(Debug, thiserror::Error)]
pub enum MyLibError {
#[error("sgemm failed: {0}")] // Don't format the inner error here
Sgemm(#[source] SgemmError),
#[error("io failed: {0}")]
Io(#[from] std::io::Error),
// ... 20 more variants — too broad
}
```

### Mid-Level (diskann algorithms)

Use `diskann::ANNError` and its context machinery. This type:

- Is 16 bytes with niche optimization for `Option<ANNError>`, allowing return in registers.
- Records stack trace of its first construction under `RUST_BACKTRACE=1`.
- Precisely records file and line of creation via `#[track_caller]`.
- Has context layering machinery to add additional information as an error is passed up the stack.
- Is backed internally by `anyhow::Error`.

When converting to `ANNError`, use `#[track_caller]` for better source reporting.
Prefer `ANNError::new(ANNErrorKind::…, e)` over the old `log_*`-style constructors, which force eager string formatting and double-log errors.

```rust
// ✅ Good — deferred formatting, precise kind, track_caller
#[track_caller]
fn process_vectors(data: &[f32]) -> Result<(), ANNError> {
validate(data).map_err(|e| ANNError::new(ANNErrorKind::IndexError, e))?;
Ok(())
}

// ❌ Bad — eager string formatting, double-logs on creation
fn process_vectors(data: &[f32]) -> Result<(), ANNError> {
validate(data).map_err(|e| ANNError::log_index_error(format!("validation failed: {e}")))?;
Ok(())
}
```

Traits with associated error types should consider constraining with `diskann::error::ToRanked` instead of `Into<ANNError>` if non-critical errors should be supported.
`ANNError` is the mid-level propagated error type; use `ToRanked` and `RankedError` to distinguish transient/recoverable failures from fatal ones.

### High Level (tooling)

At this level `anyhow::Error` is appropriate for binaries and CLI entry points.
Note that some tooling helpers still use `ANNError` for compatibility.

### Do Not

Do not use a single crate-level error enum. Problems:

- Provides no documentation on how an individual function could fail
- Encourages **worse** error messages than bespoke types
- Generates large structs that blow up the stack
- Branch-heavy `Drop` implementations which bloat code


## Document unsafe usage

```rust
// ✅ Good — specific, verifiable precondition
// SAFETY: `i + width <= len` ensures this read is in-bounds.
let val = unsafe { ptr.add(i).read() };

// ✅ Good — FFI with listed preconditions
// SAFETY: `a` and `b` are non-null, properly aligned, and do not alias.
// `m`, `n`, `k` match the actual matrix dimensions.
unsafe { ffi::sgemm(m, n, k, a.as_ptr(), b.as_ptr(), c.as_mut_ptr()) };

// ❌ Bad — vague, unverifiable
// SAFETY: this is safe
let val = unsafe { ptr.add(i).read() };
```


## CI Pipeline

CI workflow is defined in [`.github/workflows/ci.yml`](.github/workflows/ci.yml). Key jobs include:
- Format and clippy checks
- Tests on multiple platforms (Linux, Windows)
- [Code coverage](.codecov.yml)
- Architecture compatibility (SDE)

**Note**: CI uses `cargo-nextest` for running tests. See [`.cargo/nextest.toml`](.cargo/nextest.toml) for test configuration (timeouts, retries, etc.).
Loading
Loading