Add VMP prepared to DFT variant #60
Intended to be a more performant alternative to vmp_apply_dft_to_dft when one can prepare the input vector (e.g. PIR). The prepared vector is in reim4 DFT layout rather than the reim DFT layout it would usually be in, which can result in significant performance gains in certain memory-bound VMP workflows.
That's very strange! The product should be an order of magnitude slower: what are the dimensions of the vector and the matrix that you are using in the vector-matrix product? (And how many products are you doing with the same prepared matrix?) Also, I think that there is an AVX variant of [...]
I put the wrong variant in the PR description, sorry. The AVX variant is correctly used (and it is the bottleneck on my desktop, at ~50% of my OnionPIR execution time).
I am not sure why the product should be slower. Just to be clear, the version I added is the usual VMP DFT-to-DFT, which takes a prepared matrix, but with the input vector in reim4 DFT instead of reim DFT (PMAT x reim4 -> reim). Then instead of calling [...] In my tests:
For those I get very noticeable performance improvements (1.5x to 2.5x). I have also tried it in a smaller micro-benchmark for a half-external product, with [...]
On my laptop these parameters give me a 1.2x speedup for a GLWEGadget x bivariate polynomial half-external product.
Wow, I did not expect that the spqlios product would be that efficient. But if we count the cycles, it can make sense: 30 SIMD complex products (approx. 200 AVX2 SIMD floating-point multiplications) can indeed in certain cases be faster than 2 random-access memory reads... [which seem to be happening right now] OK, if it results in a 2.5x end-to-end speedup in the use case, we will have to go in the direction of the PR. I need to think about it a little bit.
Intended to be a more performant alternative to `vmp_apply_dft_to_dft` when one can prepare the input vector (e.g. PIR). The prepared vector is in reim4 DFT instead of the reim DFT that the current version uses, which can result in significant performance gains in certain memory-bound VMP workflows.
Motivated by my findings while implementing a variation on the Hypets' tutorial OnionPIR protocol. On my desktop machine, the `reim4_extract_1blk_from_contiguous_reim_avx` (reim to reim4 copy) inside `vmp_apply_dft_to_dft` accounted for the majority of execution time.

In a PIR protocol like OnionPIR, the vectors we use in a VMP are the columns of the database, and therefore it makes sense to have them prepared in the most efficient format.
In my current experiments, using `vmp_apply_prepared_to_dft` instead of `vmp_apply_dft_to_dft` can result in a performance gain of 1.5x to 2.5x (depending on the machine used) on the overall PIR server execution time.
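
To illustrate the copy that preparing the vector avoids, here is a minimal scalar sketch of a reim to reim4 conversion. It rests on two assumptions that are not stated in this PR: that reim stores the `m` real parts followed by the `m` imaginary parts, and that reim4 stores blocks of 8 doubles (4 real parts, then the 4 matching imaginary parts). `reim_to_reim4_sketch` is a hypothetical name for illustration only; it is not a library function and not the AVX routine mentioned above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper for illustration only (not part of the library API).
 * Assumed reim layout:  re[0..m) followed by im[0..m).
 * Assumed reim4 layout: for each block of 4 complex values,
 *                       4 real parts followed by the 4 matching imaginary parts. */
static void reim_to_reim4_sketch(size_t m, const double *reim, double *reim4) {
  assert(m % 4 == 0); /* the sketch only handles whole blocks of 4 */
  for (size_t blk = 0; blk < m / 4; ++blk) {
    for (size_t i = 0; i < 4; ++i) {
      /* real parts are read from the first half of the reim buffer */
      reim4[8 * blk + i] = reim[4 * blk + i];
      /* imaginary parts are read m doubles further, i.e. a second, distant stream */
      reim4[8 * blk + 4 + i] = reim[m + 4 * blk + i];
    }
  }
}
```

Under these assumptions, each output block gathers from two memory regions that are `m` doubles apart. Running this once when the database columns are loaded, rather than inside every product, is the idea behind taking an already prepared vector.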