Add VMP prepared to DFT variant by fraret · Pull Request #60 · tfhe/spqlios-arithmetic

fraret · 2026-06-11T12:56:58Z

Intended to be a more performant alternative to vmp_apply_dft_to_dft when one can prepare the input vector (e.g. PIR). The prepared vector is in reim4 DFT layout instead of the reim DFT layout that the current version uses, which can result in significant performance gains in certain memory-bound VMP workflows.
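To make the layout difference concrete, here is a minimal sketch of a reim to reim4 conversion for a single vector. It assumes that reim stores the m real parts followed by the m imaginary parts, and that reim4 groups the same values in blocks of 4 real parts followed by 4 imaginary parts. The helper name reim_to_reim4 is hypothetical and is not a function of the library.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of spqlios-arithmetic.
 * Assumed layouts:
 *   src (reim):  m real parts, then m imaginary parts (2*m doubles).
 *   dst (reim4): m/4 blocks of [re0 re1 re2 re3 im0 im1 im2 im3]. */
void reim_to_reim4(size_t m, double* dst, const double* src) {
  assert(m % 4 == 0); /* a block holds exactly 4 complex values */
  for (size_t blk = 0; blk < m / 4; ++blk) {
    for (size_t j = 0; j < 4; ++j) {
      dst[8 * blk + j] = src[4 * blk + j];         /* real parts */
      dst[8 * blk + 4 + j] = src[m + 4 * blk + j]; /* imaginary parts */
    }
  }
}
```

Under these assumptions, a kernel that works on 4 complex values at a time finds both halves of its operand in one contiguous 64-byte block in reim4, whereas in reim the two halves sit m doubles apart.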

Motivated by my findings while implementing a variation on the Hypets' tutorial OnionPIR protocol. On my desktop machine, the reim4_extract_1blk_from_contiguous_reim_avx (reim to reim4 copy) inside vmp_apply_dft_to_dft accounted for the majority of execution time.
In a PIR protocol like OnionPIR, the vectors we use in a VMP are the columns of the database, and therefore it makes sense to have them prepared in the most efficient format.

In my current experiments, using vmp_apply_prepared_to_dft instead of vmp_apply_dft_to_dft can result in a performance gain of 1.5x to 2.5x (depending on the machine used) on the overall PIR server execution time.


ngama75 · 2026-06-11T13:38:55Z

That's very strange! The product should be an order of magnitude slower: what are the dimensions of the vector and the matrix that you are using in the vector-matrix product? (And how many products are you doing with the same prepared matrix?)

Also, I think that there is an AVX variant of reim4_extract_1blk_from_contiguous_reim_ref in the code: if it is not used, it may be a bug.

fraret · 2026-06-11T14:15:06Z

> Also, I think that there is an AVX variant of reim4_extract_1blk_from_contiguous_reim_ref in the code: if it is not used, it may be a bug.

I put the wrong variant in the PR description, sorry. The AVX variant is correctly used (and it is the bottleneck on my desktop, at ~50% of my OnionPIR execution time).

> That's very strange! The product should be an order of magnitude slower: what are the dimensions of the vector and the matrix that you are using in the vector-matrix product? (And how many products are you doing with the same prepared matrix?)

I am not sure why the product should be slower. Just to be clear, the version I added is the usual VMP dft to dft, which takes a prepared matrix, except that the input vector is in reim4 DFT layout instead of reim DFT layout (PMAT x reim4 -> reim). Then, instead of calling reim4_extract_1blk_from_contiguous_reim_avx to copy a block to the tmp space, we can read directly from the input vector.
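The difference between the two paths can be sketched for a single output column as follows. This is a simplified scalar illustration, not the code of the PR: all function names are hypothetical, the input is assumed to be nrows contiguous reim vectors, and both the prepared vector and the matrix column are assumed to be stored block by block, with nrows reim4 blocks per block index.

```c
#include <assert.h>
#include <stddef.h>

/* All names are hypothetical; layouts are assumptions stated in the text. */

/* Copy block `blk` of `nrows` contiguous reim vectors into dst (nrows*8 doubles). */
void extract_1blk(size_t m, size_t nrows, size_t blk, double* dst, const double* src) {
  for (size_t r = 0; r < nrows; ++r) {
    const double* re = src + r * 2 * m + 4 * blk; /* rows are 2*m doubles apart */
    const double* im = re + m;                    /* halves are m doubles apart */
    for (size_t j = 0; j < 4; ++j) {
      dst[8 * r + j] = re[j];
      dst[8 * r + 4 + j] = im[j];
    }
  }
}

/* Complex dot product over nrows reim4 blocks, stored into block `blk` of a reim output. */
void dot_1blk(size_t m, size_t nrows, size_t blk, double* out, const double* a, const double* b) {
  for (size_t j = 0; j < 4; ++j) {
    double re = 0.0, im = 0.0;
    for (size_t r = 0; r < nrows; ++r) {
      const double ar = a[8 * r + j], ai = a[8 * r + 4 + j];
      const double br = b[8 * r + j], bi = b[8 * r + 4 + j];
      re += ar * br - ai * bi;
      im += ar * bi + ai * br;
    }
    out[4 * blk + j] = re;
    out[m + 4 * blk + j] = im;
  }
}

/* Path 1: reim input, every block is first gathered into tmp (nrows*8 doubles). */
void vmp_col_reim(size_t m, size_t nrows, double* out, const double* vec_reim,
                  const double* mat, double* tmp) {
  assert(m % 4 == 0);
  for (size_t blk = 0; blk < m / 4; ++blk) {
    extract_1blk(m, nrows, blk, tmp, vec_reim);
    dot_1blk(m, nrows, blk, out, tmp, mat + blk * nrows * 8);
  }
}

/* One-time preparation: store the gathered blocks back to back. */
void prepare_vec(size_t m, size_t nrows, double* vec_prep, const double* vec_reim) {
  assert(m % 4 == 0);
  for (size_t blk = 0; blk < m / 4; ++blk)
    extract_1blk(m, nrows, blk, vec_prep + blk * nrows * 8, vec_reim);
}

/* Path 2: prepared input, each block is read in place with no copy. */
void vmp_col_prepared(size_t m, size_t nrows, double* out, const double* vec_prep,
                      const double* mat) {
  assert(m % 4 == 0);
  for (size_t blk = 0; blk < m / 4; ++blk)
    dot_1blk(m, nrows, blk, out, vec_prep + blk * nrows * 8, mat + blk * nrows * 8);
}
```

Both paths perform the same arithmetic in the same order; the second one only moves the strided gather out of the product and into a preparation step that can be done once per database column.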

In my tests:

N = 4096
The matrix is prepared once and then used as many times as there are columns in the DB (128 or 1024 for my tests)
I am using nrows values between 512 and 4096 for testing (ell_tilde = 4 and either 128 or 1024 DB rows depending on laptop/desktop due to RAM limitations)
For ncols, it should be the glwegadget number of limbs, which is 12
The vectors are also of depth either 512 or 4096

For those I get the very noticeable performance improvements (1.5x to 2.5x).
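For a sense of scale, the working-set sizes implied by these parameters can be estimated with the sketch below. It assumes that one DFT'd polynomial of ring dimension N occupies N doubles (N/2 complex values) and that the matrix is stored without padding; the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers for back-of-the-envelope sizing only. */

/* Assumption: one DFT'd polynomial of ring dimension n takes n doubles. */
uint64_t dft_poly_bytes(uint64_t n) {
  assert(n % 8 == 0); /* whole reim4 blocks: 4 complex values = 8 doubles */
  return n * sizeof(double);
}

/* nrows as described above: ell_tilde rows per database row. */
uint64_t vmp_nrows(uint64_t ell_tilde, uint64_t db_rows) { return ell_tilde * db_rows; }

/* Bytes of one input vector made of nrows DFT'd polynomials. */
uint64_t input_vec_bytes(uint64_t n, uint64_t nrows) { return nrows * dft_poly_bytes(n); }

/* Bytes of an nrows x ncols matrix of DFT'd polynomials (padding ignored). */
uint64_t matrix_bytes(uint64_t n, uint64_t nrows, uint64_t ncols) {
  return nrows * ncols * dft_poly_bytes(n);
}
```

With N = 4096, one DFT'd polynomial is 32 KiB under this assumption, so an input vector is 16 MiB for nrows = 512 and 128 MiB for nrows = 4096, and a 12-column matrix is 192 MiB or 1.5 GiB respectively. If the rows of a reim input are stored contiguously, consecutive rows of the same block are then 32 KiB apart, which fits the picture of a memory-bound gather.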

I have also tried it in a smaller micro-benchmark for a half-external product, with

N = 16384 (2^14)
ell_tilde = nrows = 15
ncols = 30
vectors of ell_tilde (15) length/depth

On my laptop these parameters give me a 1.2x speedup for a GLWEGadget x bivariate polynomial half-external product.

ngama75 · 2026-06-11T15:52:30Z

Wow, I did not expect that the spqlios product would be that efficient. But if we count the cycles, it can make sense: 30 SIMD complex products (approx. 200 AVX2 SIMD floating-point multiplications) can indeed, in certain cases, be faster than 2 random-access memory reads... [which seem to be happening right now]

Ok, if it results in a 2.5x end-to-end speedup in the use-case, we will have to go in the direction of the PR.
It just means that it is a new layout, and in the overall PIR use-case, it cannot be fully opaque.

I need to think about it a little bit.

fraret requested review from MGeorgie and ngama75 June 11, 2026 12:56

fraret wants to merge 1 commit into main from vmp_prepared_vec
