cuSBF is a high-performance GPU implementation of the Super Bloom filter, optimized for high-throughput batch k-mer insertion and query on nucleotide (DNA) and protein sequences (or any other sequence type as long as a valid alphabet is provided).
It exploits the streaming nature of sequence-derived k-mers by using minimizers to group consecutive k-mers sharing the same minimiser into super-k-mers, assigning all k-mers of a super-k-mer to the same 256-bit memory shard. This amortizes random memory accesses across consecutive k-mer queries, reducing memory-bandwidth pressure. The findere scheme further reduces false positives dramatically by inserting overlapping s-mers and requiring a full run of consecutive s-mer matches.
- CUDA-accelerated batch k-mer insert and query from sequences
- Configurable k-mer length, minimiser width, s-mer width, and hash function count
- Minimizer-based shard selection for cache-efficient streaming queries
- Findere false-positive reduction via overlapping s-mer membership
- Header-only library design
- FASTA/FASTQ stream and file support
Benchmarks (Config<31, 28, 16, 4>) comparing cuSBF against NVIDIA's cuco bloom_filter on an NVIDIA RTX PRO 6000 Blackwell GPU with random DNA sequence inputs. Throughput is reported in billions of k-mers per second (Gk-mers/s). Timings include GPU-side sequence encoding for cuco, so both methods consume the same input sequence.
| Dataset Size | Operation | cuSBF | cuco Bloom | Speed-up |
|---|---|---|---|---|
| ~4M k-mers | Insert | 62.9 Gk-mers/s | 31.5 Gk-mers/s | 2.0× |
| ~4M k-mers | Query | 109 Gk-mers/s | 44.2 Gk-mers/s | 2.5× |
| ~268M k-mers | Insert | 46.6 Gk-mers/s | 5.9 Gk-mers/s | 7.9× |
| ~268M k-mers | Query | 101 Gk-mers/s | 13.2 Gk-mers/s | 7.7× |
The findere scheme (s-mer width) provides strong false-positive reduction at equivalent memory. For Config<31, 28, 16, 4> on ~4.6M inserted k-mers queried against 10⁹ random k-mers:
| Bits/k-mer | cuSBF FPR | cuco Bloom FPR |
|---|---|---|
| 58 | 0.0035% | 0.064% |
| 116 | 0.00036% | 0.017% |
| 231 | 0.000092% | 0.0054% |
Benchmarks can be reproduced with:
./build/benchmarks/gpu-filter-comparison --benchmark_filter="CuSBF"
./build/benchmarks/fpr-fastx-sweep- CUDA Toolkit >= 13.1
- C++20 compatible host compiler
- Meson build system
- NVIDIA GPU with compute capability 8.0+ (Ampere, Lovelace, Hopper, Blackwell)
meson setup build
ninja -C buildBenchmarks and tests are built automatically when this is a standalone project, but skipped when used as a subproject. Control with Meson feature options:
| Option | Behaviour |
|---|---|
-Dbenchmarks=auto (default) |
Build benchmarks when standalone, skip when subproject |
-Dbenchmarks=enabled |
Always build benchmarks |
-Dbenchmarks=disabled |
Never build benchmarks |
-Dtests=auto (default) |
Build tests when standalone, skip when subproject |
-Dtests=enabled |
Always build tests |
-Dtests=disabled |
Never build tests |
-Dexamples=auto (default) |
Build examples when standalone, skip when subproject |
-Dexamples=enabled |
Always build examples |
-Dexamples=disabled |
Never build examples |
-Dparam_sweep=disabled (default) |
Never build parameter-sweep benchmark (s, m) for DNA or protein |
-Dparam_sweep=enabled |
Build parameter-sweep benchmark (s, m) for DNA or protein |
-Dparam_sweep_alphabet=dna |
Alphabet for param_sweep: dna or protein |
Important
The parameter sweep is disabled by default for a reason, there are 208 binaries for the entire sweep when using the DNA alphabet.
Examples:
# Standalone: build everything except parameter sweep (default)
meson setup build
# Standalone: skip benchmarks and tests
meson setup build -Dbenchmarks=disabled -Dtests=disabled
# Subproject: force benchmarks and tests on
meson setup build -Dbenchmarks=enabled -Dtests=enabled
# Parameter-sweep benchmark (DNA, default)
meson setup build -Dparam_sweep=enabled
# Parameter-sweep benchmark (protein)
meson setup build -Dparam_sweep=enabled -Dparam_sweep_alphabet=protein#include <cusbf/BloomFilter.cuh>
// Configure the filter: k-mer length, s-mer width, minimizer width, hash count
using Config = cusbf::Config<31, 28, 16, 4>;
// Create a filter with the desired capacity (in bits)
cusbf::Filter<Config> filter(1 << 24); // ~16M bits
// Insert k-mers from a DNA sequence (synchronous)
filter.insertSequence("ACGTACGTACGTACGTACGTACGTACGTACGT");
// Query k-mers (returns vector<uint8_t>, 1 = present, 0 = absent)
auto hits = filter.containsSequence("ACGTACGTACGTACGTACGTACGTACGTACGT");
// Device-resident API (async, no synchronization)
thrust::device_vector<char> d_seq = ...;
thrust::device_vector<uint8_t> d_results(numKmers);
filter.insertSequenceDevice(cusbf::device_span<const char>(d_seq));
filter.containsSequenceDevice(
cusbf::device_span<const char>(d_seq),
cusbf::device_span<uint8_t>(d_results)
);
// Dense record batch API: one concatenated payload plus explicit record ranges.
std::string batchSequence = "ACGTACGTTGCATGCA";
std::vector<cusbf::RecordRange> batchRanges{
{0, 8},
{8, 8},
};
auto batchSummary = filter.queryRecordBatch(
{
batchSequence,
cuda::std::span<const cusbf::RecordRange>{batchRanges.data(), batchRanges.size()},
},
[](const cusbf::RecordQueryView& record) {
// record.sequence is the original record, record.hits is one bool per k-mer
}
);
// FASTA/FASTQ file insertion and aggregate query (chunked internally for large inputs)
filter.insertFastxFile("reference.fasta");
auto summary = filter.queryFastxFile("queries.fastq");
// Streaming FASTX query emits each record as soon as its chunk completes
auto streamed = filter.queryFastxFileRecords(
"queries.fastq",
[](const cusbf::FastxRecordView& record) {
// record.header, record.sequence, record.queriedKmers, record.hits
}
);
// Detailed FASTX query keeps owning per-record copies in source order
auto detailed = filter.queryFastxFileDetailed("queries.fastq");
// Inspect filter state
double load = filter.loadFactor(); // fraction of set bits
uint64_t bits = filter.filterBits();The Config template accepts the following parameters:
| Parameter | Description | Default |
|---|---|---|
K |
k-mer length (max depends on alphabet) | - |
S |
s-mer width for findere Bloom hash seed (1-K) | - |
M |
Minimiser width for shard selection (1-K) | - |
HashCount |
Number of independent Bloom hash functions (1-16) | 4 |
CudaBlockSize |
CUDA threads per block | 256 |
Alphabet |
Symbol encoding (DNA or protein) | DnaAlphabet |
#include <cusbf/BloomFilter.cuh>
using ProteinConfig = cusbf::Config<12, 10, 6, 4, 256, cusbf::ProteinAlphabet>;
cusbf::Filter<ProteinConfig> filter(1 << 24);
filter.insertSequence("ACDEFGHIKLMNPQRSTVWY");
auto hits = filter.containsSequence("ACDEFGHIKLMNPQRSTVWY");- E. Conchon-Kerjan, T. Rouzé, L. Robidou, F. Ingels, and A. Limasset, “Super Bloom: Fast and precise filter for streaming k-mer queries,” bioRxiv, 2026, doi: 10.64898/2026.03.17.712354.
- D. Jünger, K. Kristensen, Y. Wang, X. Yu, and B. Schmidt, “Optimizing Bloom Filters for Modern GPU Architectures.” 2025. [Online]. Available: https://arxiv.org/abs/2512.15595