Add VMP prepared to DFT variant #60

Open
fraret wants to merge 1 commit into main from vmp_prepared_vec

@fraret fraret commented Jun 11, 2026

Collaborator

Intended to be a more performant alternative to vmp_apply_dft_to_dft when one can prepare the input vector (e.g. PIR). The prepared vector is in reim4 DFT layout instead of the reim DFT layout that the current version uses, which can result in significant performance gains in certain memory-bound VMP workflows.

Motivated by my findings while implementing a variation on the Hypets' tutorial OnionPIR protocol. On my desktop machine, the reim4_extract_1blk_from_contiguous_reim_avx (reim to reim4 copy) inside vmp_apply_dft_to_dft accounted for the majority of execution time.
In a PIR protocol like OnionPIR, the vectors we use in a VMP are the columns of the database, so it makes sense to keep them prepared in the most efficient format.
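To make the reim versus reim4 distinction concrete, here is a minimal sketch of the two layouts for one vector of m complex values. It assumes that reim stores all m real parts followed by all m imaginary parts, and that reim4 stores the data in blocks of 4 complex values (4 real parts, then 4 imaginary parts). The helper names are hypothetical and are not part of the library.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed layouts for m complex values (m a multiple of 4):
 *   reim : [re_0 .. re_{m-1} | im_0 .. im_{m-1}]
 *   reim4: for each block b: [re_{4b} .. re_{4b+3} | im_{4b} .. im_{4b+3}]
 * All names below are illustrative, not library functions. */

/* Position of the real / imaginary part of value j in the reim layout. */
size_t reim_re(size_t m, size_t j) { (void)m; return j; }
size_t reim_im(size_t m, size_t j) { return m + j; }

/* Position of the real / imaginary part of value j in the reim4 layout. */
size_t reim4_re(size_t j) { return 8 * (j / 4) + (j % 4); }
size_t reim4_im(size_t j) { return 8 * (j / 4) + 4 + (j % 4); }

/* Repack one reim vector (2*m doubles) into reim4 (2*m doubles). */
void reim_to_reim4_sketch(size_t m, double *dst, const double *src) {
  assert(m % 4 == 0);
  for (size_t j = 0; j < m; ++j) {
    dst[reim4_re(j)] = src[reim_re(m, j)];
    dst[reim4_im(j)] = src[reim_im(m, j)];
  }
}
```

In reim, the two halves of one block sit m doubles apart, whereas in reim4 they are adjacent, so a kernel that consumes one block at a time can load it with two consecutive 4-double reads.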

In my current experiments, using vmp_apply_prepared_to_dft instead of vmp_apply_dft_to_dft can result in a performance gain of 1.5x to 2.5x (depending on the machine used) on the overall PIR server execution time.

Intended to be a more performant alternative to vmp_apply_dft_to_dft
when one can prepare the input vector (eg. PIR). The prepared vec is in
reim4 dft vs the reim dft it would usually be, which can result in
significant performance gains in certain memory-bound vmp workflows.
@fraret fraret requested review from MGeorgie and ngama75 June 11, 2026 12:56

ngama75 commented Jun 11, 2026

Contributor

That's very strange! The product should be an order of magnitude slower: what are the dimensions of the vector and the matrix that you are using in the vector-matrix product? (And how many products are you doing with the same prepared matrix?)

also, I think that there is an avx variant of reim4_extract_1blk_from_contiguous_reim_ref in the code: if it is not used, it may be a bug.

fraret commented Jun 11, 2026

Collaborator Author

> also, I think that there is an avx variant of reim4_extract_1blk_from_contiguous_reim_ref in the code: if it is not used, it may be a bug.

I put the wrong variant in the PR description, sorry. The AVX variant is correctly used (and is the bottleneck on my desktop, at ~50% of my OnionPIR execution time).

> That's very strange! The product should be an order of magnitude slower: what are the dimensions of the vector and the matrix that you are using in the vector-matrix product? (And how many products are you doing with the same prepared matrix?)

I am not sure why the product should be slower. Just to be clear, the version I added is the usual VMP DFT-to-DFT, which takes a prepared matrix, but with the input vector in reim4 DFT layout instead of reim DFT layout (PMAT x reim4 -> reim). Then, instead of calling reim4_extract_1blk_from_contiguous_reim_avx to copy a block to the tmp space, we can read directly from the input vector.
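The difference between the two paths can be sketched as follows. This is only an illustration under stated assumptions: the functions are hypothetical stand-ins, not the library's signatures, and the prepared vector is assumed to be stored block-major (for each block of 4 complex values, the 8 doubles of every row back to back). The PR's actual prepared layout may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Current path (sketch): for one block index, copy the 4 real and 4 imaginary
 * parts of each of `rows` contiguous reim vectors (2*m doubles each) into tmp.
 * Every row touches two separate places in memory. */
void gather_blk_from_reim_sketch(size_t m, size_t rows, size_t blk,
                                 double *tmp, const double *reim_vec) {
  assert(m % 4 == 0 && blk < m / 4);
  for (size_t r = 0; r < rows; ++r) {
    const double *row = reim_vec + r * 2 * m;
    memcpy(tmp + 8 * r, row + 4 * blk, 4 * sizeof(double));         /* re */
    memcpy(tmp + 8 * r + 4, row + m + 4 * blk, 4 * sizeof(double)); /* im */
  }
}

/* One-time preparation (assumed block-major layout): run the gather once per
 * block and keep the result, rows*8 doubles per block. */
void prepare_reim4_sketch(size_t m, size_t rows, double *prepared,
                          const double *reim_vec) {
  for (size_t blk = 0; blk < m / 4; ++blk)
    gather_blk_from_reim_sketch(m, rows, blk, prepared + blk * rows * 8, reim_vec);
}

/* Prepared path: no copy, just a pointer to rows*8 contiguous doubles. */
const double *blk_from_prepared_sketch(size_t rows, size_t blk,
                                       const double *prepared) {
  return prepared + blk * rows * 8;
}
```

When the same input vector is reused across many products, the gather cost is paid once at preparation time instead of once per block in every product.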

In my tests:

  • N = 4096
  • The matrix is prepared and then used as many times as columns in the DB (128 or 1024 for my tests)
  • I am using nrows values between 512 and 4096 for testing (ell_tilde = 4 and either 128 or 1024 DB rows depending on laptop/desktop due to RAM limitations)
  • For ncols, it should be the glwegadget number of limbs, which is 12
  • The vectors are also of depth either 512 or 4096

For those I get the very noticeable performance improvements (1.5x to 2.5x).
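A rough size estimate helps explain why this workload would be memory-bound. The sketch below assumes each DFT'd polynomial is stored as N doubles (N/2 complex values) and that the input vector holds nrows such polynomials; both are assumptions for illustration, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes of one DFT'd polynomial, assuming N doubles per polynomial. */
size_t reim_vec_bytes(size_t n) { return n * sizeof(double); }

/* Bytes of the whole input vector, assuming nrows polynomials. */
size_t input_vec_bytes(size_t n, size_t nrows) {
  return nrows * reim_vec_bytes(n);
}

/* Number of separate 4-double chunks read when gathering one reim4 block
 * from nrows reim vectors: one chunk of real parts and one of imaginary
 * parts per row. */
size_t gather_chunks_per_blk(size_t nrows) { return 2 * nrows; }
```

Under these assumptions and with 8-byte doubles, N = 4096 gives 32 KiB per polynomial, so an input vector of 512 rows occupies 16 MiB and one of 4096 rows occupies 128 MiB. Gathering a single block from the reim layout then reads 1024 or 8192 small chunks spread across that range, while a block-contiguous layout would read the same data sequentially.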

I have also tried it in a smaller micro-benchmark for a half-external product, with

  • N = 16384 (2^14)
  • ell_tilde = nrows = 15
  • ncols = 30
  • vectors of ell_tilde (15) length/depth

On my laptop these parameters give me a 1.2x speedup for a GLWEGadget x bivariate polynomial half-external product.

ngama75 commented Jun 11, 2026

Contributor

Wow, I did not expect that the spqlios product would be that efficient. But if we count the cycles, it can make sense: 30 SIMD complex products (approx. 200 AVX2 SIMD floating-point multiplications) can indeed, in certain cases, be faster than 2 random-access memory reads... [which seem to be happening right now]
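For readers unfamiliar with the term, one "SIMD complex product" on a reim4 block can be sketched in scalar C as below. This is an illustrative reference under the assumption that a block holds 4 real parts followed by 4 imaginary parts; the function name is hypothetical, and the actual instruction mix of the AVX kernel is not shown in this thread.

```c
#include <assert.h>
#include <stddef.h>

/* acc += a * b, lane by lane, for 4 complex values in the assumed reim4
 * layout: x[0..3] are real parts, x[4..7] are imaginary parts. */
void reim4_mul_add_sketch(double *acc, const double *a, const double *b) {
  for (size_t k = 0; k < 4; ++k) {
    const double ar = a[k], ai = a[4 + k];
    const double br = b[k], bi = b[4 + k];
    acc[k] += ar * br - ai * bi;     /* real part of the product */
    acc[4 + k] += ar * bi + ai * br; /* imaginary part of the product */
  }
}
```

Because all 4 lanes perform the same arithmetic on adjacent doubles, a vectorizing compiler or hand-written intrinsics can process the whole loop with a handful of 256-bit operations, which is why the arithmetic can end up cheaper than the scattered memory reads that feed it.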

Ok, if it results in a 2.5x end-to-end speedup in the use-case, we will have to go in the direction of the PR.
It just means that it is a new layout, and in the overall PIR use-case, it cannot be fully opaque.

I need to think about it a little bit.
