(diversity) Integrate determinant diversity in tokp full precision search and disk search#1011
(diversity) Integrate determinant diversity in tokp full precision search and disk search#1011narendatha wants to merge 28 commits intomainfrom
Conversation
Replace kind()-based string equality checks with explicit is_match() and get() phase-shape helpers on plugin structs. This avoids fragile ordering assumptions and makes each plugin responsible for recognising its own phase shape.
Co-authored-by: Copilot <copilot@github.com>
…plugins # Conflicts: # diskann-benchmark/src/backend/index/benchmarks.rs # diskann-benchmark/src/backend/index/product.rs # diskann-benchmark/src/backend/index/scalar.rs # diskann-benchmark/src/backend/index/search/plugins.rs # diskann-benchmark/src/backend/index/spherical.rs # diskann-benchmark/src/inputs/graph_index.rs # diskann-benchmark/src/inputs/mod.rs
Co-authored-by: Copilot <copilot@github.com>
…ojection issue This commit explores approaches to wire real candidate vectors into async determinant-diversity post-processing. Current state: IN COMPILATION ERROR (intentional for analysis) Attempted approaches: 1. Initial shim-trait FullPrecisionVectorAccessor with async get_full_precision_vector() - Resulted in 'implementation not general enough' at search_with() call 2. Removed explicit for<'a> post_processor::DeterminantDiversity bound - Still fails - the constraint is inherent in search_with() signature itself Root cause analysis: - search_with() requires: PP: for<'a> SearchPostProcess<S::SearchAccessor<'a>, T, O> - This means post-processor must work for ANY accessor lifetime 'a - But query = queries.row(query_idx) is borrowed for specific loop iteration lifetime - These are fundamentally incompatible - a borrowed value can't satisfy for<'a> generically Compiler errors (3 total): - 'not general enough': implementation needed for or<'a> but found specific '0 - 'does not live long enough': queries lifetime too short for 'static requirement Files modified: - diskann-benchmark/src/backend/index/benchmarks.rs: * Removed explicit for<'a> post_processor::DeterminantDiversity constraint * Narrowed plugin impl to FullPrecisionProvider<f32> - diskann-benchmark/src/backend/index/post_processor/determinant_diversity.rs: * Added shim trait FullPrecisionVectorAccessor * Async method get_full_precision_vector(&mut self, id) -> impl Future<...> Next steps to investigate: - Move determinant-diversity outside search_with() as post-processing reranking - This avoids HRTB entirely by applying after candidates are returned - Benchmark impact: measure recall/QPS with external reranking vs baseline Related context: - Disk index determinant-diversity works correctly (uses real vectors, shows 51-53% QPS cost) - Shared algorithm fixed (distance-to-similarity scoring direction) - Branch already merged with origin/main
Co-authored-by: Copilot <copilot@github.com>
…educe duplication - Use for<'a, 'b> SearchStrategy bound (user-provided fix) to break HRTB lifetime projection issue in the search_with post-processor constraint - Wire FullPrecisionVectorAccessor shim trait so async det-div post-processor fetches real candidate vectors instead of placeholder distances - Populate QPS/latency metrics in async det-div benchmark path (previously all 'missing') - Extract run_topk_timed helper to eliminate ~100 lines of duplicated loop/timing/recall machinery from DeterminantDiversity::run - Update async-determinant-diversity.json example tag (async-index-build -> graph-index-build) - Fix clippy::manual_async_fn in FullPrecisionVectorAccessor shim trait
Codecov Report❌ Patch coverage is ❌ Your patch status has failed because the patch coverage (51.04%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage. Additional details and impacted files@@ Coverage Diff @@
## main #1011 +/- ##
==========================================
- Coverage 90.63% 89.25% -1.39%
==========================================
Files 460 463 +3
Lines 85424 86021 +597
==========================================
- Hits 77427 76781 -646
- Misses 7997 9240 +1243
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
- Use for<'a, 'b> SearchStrategy bound to resolve HRTB lifetime mismatch - Wire FullPrecisionVectorAccessor shim trait for async det-div to fetch real vectors - Implement DeterminantDiversity post-processor for async graph-index path - Extract run_topk_timed helper to eliminate ~100 lines of code duplication - Wire post_processor parameter to disk-index search pipeline - Update search parameter handling and result counting for post-processed results - Add TopkPostProcessor input type and necessary imports - Populate QPS/latency metrics in async det-div benchmark path
…rosoft/DiskANN into u/narendatha/det_div_plugins
hildebrandmw
left a comment
There was a problem hiding this comment.
Thanks Naren! My biggest concern is diskann-benchmark. I would really prefer if we generalized diskann-benchmark-core to handle post-processing, which would reduce bugs and be more broadly applicable.
On the implementation front, there's a bit of work to do regarding being more opinionated on the nature of inputs, establishing invariants of the diversity search with a bespoke struct (which could help with the previous point), reducing code duplication, testing the implementation, and removing allocations with a Matrix.
I'll leave reviewing the diskann-disk changes to others.
| .is_some_and(|pp| matches!(pp, TopkPostProcessor::DeterminantDiversity { .. })) | ||
| } | ||
|
|
||
| pub(crate) fn get(phase: &SearchPhase) -> anyhow::Result<(&TopkSearchPhase, f32, f32)> { |
There was a problem hiding this comment.
Can we return a richer struct rather than to f32? Ideally - the parsed code would validate that a determinant diversity post-processor can be constructed from the parameters (using the actual constructor).
| /// Encapsulates the outer thread/run/L loops, the per-rep timing harness, per-query latency | ||
| /// collection, and [`SearchResults`] construction. The caller supplies only the actual | ||
| /// per-query search as a closure `search_fn(knn_params, query, output) -> Result<()>`. | ||
| fn run_topk_timed( |
There was a problem hiding this comment.
The code in diskann-benchmark-core abstracting the loop nest exists for several reasons:
- It keeps the behavior more or less the same for all users.
- More importantly, it very carefully minimizes monomorphizations to help keep compile times under control.
Introducing bespoke implementations like this can very quickly get us back to a place where benchmarks take forever to compile and the set of metrics gathered by different search kinds diverges.
I understand that the KNN runner in diskann-benchmark-core predates parametrizable post-process routines, but we should update it with an additional defaulted generic. I understand that's more work, but it improves the core abstraction and will pay dividends when it comes to exploring additional post-processing routines.
| ))?); | ||
| let groundtruth = datafiles::load_groundtruth(datafiles::BinFile(&topk.groundtruth))?; | ||
|
|
||
| let results = run_topk_timed(topk, &queries, &groundtruth, |params, query, output| { |
There was a problem hiding this comment.
Case and point as to why we should use diskann-benchmark-core: This runs all the searches single-threaded, but the loop nest loops over the number of threads. Not only will this make the benchmark take forever to run, it's also misleading and means that we're measuring overhead related to tokio runtime construction.
|
|
||
| fn kind(&self) -> &'static str { | ||
| Self::kind().as_str() | ||
| SearchPhaseKind::Topk.as_str() |
There was a problem hiding this comment.
We should probably have the plugin types implement an as_str() method instead to make sure they stay up-to-date properly and completely avoid SearchPhaseKind here.
| impl CheckDeserialization for TopkPostProcessor { | ||
| fn check_deserialization(&mut self, _checker: &mut Checker) -> Result<(), anyhow::Error> { | ||
| match self { | ||
| TopkPostProcessor::DeterminantDiversity { power, eta } => { |
There was a problem hiding this comment.
Shouldn't we have one struct that validates the inputs in the lower level crate and always use that for input validation? That way, there's no risk of these ad-hoc validations ever being wrong.
| * Licensed under the MIT license. | ||
| */ | ||
|
|
||
| use diskann_vector::{MathematicalValue, PureDistanceFunction, distance::InnerProduct}; |
There was a problem hiding this comment.
A module level doc explaining what this is and how it works would be really helpful.
| } | ||
|
|
||
| #[cfg(test)] | ||
| mod tests { |
There was a problem hiding this comment.
Should we have tests that check the algorithmic behavior?
| .collect() | ||
| } | ||
|
|
||
| fn post_process_greedy_orthogonalization_f32<Id: Copy>( |
There was a problem hiding this comment.
These two routines are very similar. Is it possible to merge them?
| for (_, distance_to_query, v) in &candidates { | ||
| let scale = | ||
| distance_to_similarity(*distance_to_query, distance_range).powf(power) * inv_sqrt_eta; | ||
| let residual: Vec<f32> = v.iter().map(|&x| x * scale).collect(); |
There was a problem hiding this comment.
Maybe use a Matrix to store the residuals? This will provide better locality in the processing loop and cut down on the number of allocations.
|
|
||
| mod determinant_diversity_post_process; | ||
| pub(crate) mod postprocess; | ||
| pub use determinant_diversity_post_process::determinant_diversity_post_process; |
There was a problem hiding this comment.
Since this is shared with the disk index, maybe move it up instead of down in the async_ abyss? I know diskann-providers is kind of a nightmare at the moment 😢
Integrate determinant-diversity into topk full-precision search and disk search
Summary
Wires the determinant-diversity post-processing algorithm as an optional
SearchPluginfor two search paths:inmem::FullAccessor)DiskIndexSearcherpost-processor)Determinant-diversity reranks search results to increase geometric diversity among the top-k, at a configurable quality/speed tradeoff.
Motivation
Standard nearest neighbor search often returns highly similar/redundant results. For some scenarios like RAG search, it's more valuable to retrieve diverse, complementary documents that cover different aspects of the query. This PR implements a principled approach based on determinant maximization of the Gram matrix of query-scaled vectors.
Algorithm
The algorithm scales each candidate vector by
(similarity_to_query)^powerand then greedily selects vectors to maximize the determinant of the Gram matrix (equivalent to maximizing the volume of the parallelepiped spanned by the vectors).Two variants are implemented:
Both variants run in O(n k d) time where n = candidates, k = results to return, d = dimension.
Performance Characteristics
Tested on OpenAI embeddings dataset (3072 dimensions, 10K sample):
The recall reduction is expected - Det_Div search intentionally selects diverse vectors rather than the absolute nearest neighbors.
Diversity Improvement
We measure diversity using log-determinant of the Gram matrix of scaled vectors (higher = vectors span more volume = more diverse).
Parameter Tuning Guide
power(default: 2.0)Controls the trade-off between relevance and diversity:
eta(default: 0.01)Ridge regularization parameter that controls numerical robustness and relevance preference:
Recommended Settings
poweretaChanges
Core algorithm (
diskann-providers)determinant_diversity_post_process()indiskann-providers/src/model/graph/provider/async_/determinant_diversity_post_process.rseta:eta == 0: greedy orthogonalizationeta > 0: eta-weighted residual selectionAsync graph-index path (
diskann-benchmark)FullPrecisionVectorAccessorshim trait + impl forinmem::FullAccessor<f32, ...>so the post-processor can fetch real candidate vectorsDeterminantDiversityplugin that satisfiesglue::SearchPostProcessusingfor<'a, 'b> SearchStrategyHRTB bound (credit: @hildebrandmw)run_topk_timedhelper to expose QPS/latency/recall metrics in the benchmark outputDisk-index path (
diskann-disk,diskann-benchmark)DeterminantDiversityAndFilterprocessor toDiskSearchPostProcessorenumsearcher.search()to accept an optionalSearchPostProcessorKind::DeterminantDiversity { power, eta }DiskSearchPhaseJSON input to acceptpost_processorresult_countfrom search stats to correctly bound post-processed resultsInput/config
TopkPostProcessorserde enum indiskann-benchmark/src/inputs/post_processor.rswith validation (power > 0,eta >= 0)async-determinant-diversity.json,disk-index-determinant-diversity.jsonBenchmark JSON examples
Notes