NDFT blocked recurrence #222

Draft
jenskeiner wants to merge 1 commit into develop from feature/nfft-direct-blocked-recurrence

@jenskeiner jenskeiner commented Jun 21, 2026

Contributor

This PR aims to improve both the speed and the accuracy of the 1D direct NFFT (NDFT) forward and adjoint transforms, in the serial as well as the threaded code paths.

I originally set out only to improve performance, but then noticed that the forward transform's error grows like O(sqrt(N)) and the adjoint transform's like O(N), while the error bound used in our tests is O(eps), independent of N.

A rounding error analysis pointed to two sources of error: 1) the computation of the argument passed to sin/cos, and 2) the final summation over the frequencies. Source 1) can be reduced by using an FMA to reduce the argument to [-pi,pi] before passing it to sin/cos. This change removed the N-dependence of the errors entirely.
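
To illustrate the argument-reduction idea, here is a minimal Python sketch. It is not the code from this PR. It assumes the phase is 2*pi*k*x with an integer frequency k and a node x in [-1/2, 1/2), and the function names are hypothetical. The single-rounding FMA is emulated with exact rational arithmetic. In C one would call `fma` from `<math.h>` instead.

```python
import math
from fractions import Fraction


def fma(a: float, b: float, c: float) -> float:
    """a*b + c with one final rounding, emulated via exact rationals."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))


def reduced_turns_naive(k: int, x: float) -> float:
    """Fractional part of k*x, computed from the already rounded product."""
    kx = k * x                 # rounding error grows with |k*x|
    return kx - round(kx)      # the subtraction cannot undo that error


def reduced_turns_fma(k: int, x: float) -> float:
    """Fractional part of k*x, with the product kept exact inside the FMA."""
    n = float(round(k * x))    # nearest integer to the rounded product
    return fma(float(k), x, -n)  # k*x - n, rounded only once


def phase(turns: float) -> float:
    """Convert a reduced number of turns to an angle in about [-pi, pi]."""
    return 2.0 * math.pi * turns


if __name__ == "__main__":
    k, x = 1_000_003, 0.1
    print(phase(reduced_turns_naive(k, x)), phase(reduced_turns_fma(k, x)))
```

In the naive variant the absolute error of the reduced argument is that of the rounded product k*x and therefore scales with |k|. In the FMA variant the only rounding is applied to a value of magnitude about 1/2 or less, so the error no longer depends on the size of k.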

After confirming the accuracy improvements, I applied a second optimization that avoids the repeated, costly sin/cos evaluations. The phase advances by a constant increment from one loop iteration to the next, so successive terms differ only by a constant factor that can be pulled out of the loop. This has to be done block-wise so that the accuracy gained by the first set of changes is not lost again. There is now a tunable parameter, the block size B=32, that trades accuracy (smaller B) for speed (larger B).
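
The block-wise recurrence can be sketched as follows. This is an illustrative Python sketch, not the implementation in this PR. It assumes the forward sum is over fhat_k * exp(-2*pi*i*k*x) for consecutive integer frequencies starting at k0, and the names `twiddle`, `ndft_reference` and `ndft_blocked` are hypothetical.

```python
import math
from fractions import Fraction

B = 32  # block size quoted in the PR description


def twiddle(k: int, x: float) -> complex:
    """exp(-2*pi*i*k*x), with k*x reduced to about [-1/2, 1/2] first."""
    n = round(k * x)
    r = float(Fraction(k) * Fraction(x) - n)  # exact product, one rounding
    a = 2.0 * math.pi * r
    return complex(math.cos(a), -math.sin(a))


def ndft_reference(fhat, x: float, k0: int) -> complex:
    """One sin/cos evaluation per frequency (slow, used for comparison)."""
    total = 0j
    for i, c in enumerate(fhat):
        total += c * twiddle(k0 + i, x)
    return total


def ndft_blocked(fhat, x: float, k0: int, block: int = B) -> complex:
    """Restart from an accurate twiddle every `block` terms, and
    advance by a constant rotation q in between."""
    q = twiddle(1, x)  # constant factor from one frequency to the next
    total = 0j
    for start in range(0, len(fhat), block):
        w = twiddle(k0 + start, x)        # accurate restart per block
        for c in fhat[start:start + block]:
            total += c * w
            w *= q                        # error grows only within the block
    return total


if __name__ == "__main__":
    N = 256
    fhat = [complex(math.cos(k), math.sin(2 * k)) for k in range(N)]
    print(abs(ndft_blocked(fhat, 0.3, -N // 2) - ndft_reference(fhat, 0.3, -N // 2)))
```

Each block costs one sin/cos pair plus `block` complex multiplications. The rounding error of the recurrence accumulates over at most `block` steps before the next restart, which is the accuracy/speed trade-off described above.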

I have attached an HTML write-up, generated by my coding assistant, that describes the changes and the rationale:
direct-1d-recurrence-optimization.html

This was tested on ARM64 (macOS) and, in CI, on x64 (Linux). On a platform that does not provide single-rounding FMA semantics, the error could become N-dependent again.

codspeed-hq Bot commented Jun 21, 2026

Merging this PR will improve performance by ×3.7

⚠️ Different runtime environments detected

Some benchmarks with significant performance changes were compared across different runtime environments, which may affect the accuracy of the results.

⚡ 57 improved benchmarks
✅ 75 untouched benchmarks
🆕 24 new benchmarks

Performance Changes

| Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|
| WallTime | `codspeed-macro_gcc_kaiserbessel_double/nfft_adjoint_direct_1d[256/800]` | 10.5 ms | 1.3 ms | ×8.2 |
| WallTime | `codspeed-macro_gcc_kaiserbessel_double/nfft_adjoint_direct_1d[512/1600]` | 41.5 ms | 5.1 ms | ×8.2 |
| WallTime | `codspeed-macro_gcc_kaiserbessel_double/nfft_adjoint_direct_1d[64/200]` | 660.5 µs | 82 µs | ×8.1 |
| WallTime | `codspeed-macro_gcc_kaiserbessel_double/nfft_adjoint_direct_1d[128/400]` | 2,641.8 µs | 330.4 µs | ×8 |
| WallTime | `codspeed-macro_gcc_kaiserbessel_double/nfft_forward_direct_1d[512/1600]` | 41.3 ms | 5.4 ms | ×7.6 |
| WallTime | `codspeed-macro_gcc_kaiserbessel_double/nfft_forward_direct_1d[256/800]` | 10.4 ms | 1.4 ms | ×7.6 |
| WallTime | `codspeed-macro_gcc_kaiserbessel_double/nfft_forward_direct_1d_omp[512/1600]` | 2,999.7 µs | 398.2 µs | ×7.5 |
| WallTime | `codspeed-macro_gcc_kaiserbessel_double/nfft_forward_direct_1d[128/400]` | 2,634.5 µs | 352.1 µs | ×7.5 |
| WallTime | `codspeed-macro_gcc_kaiserbessel_double/nfft_forward_direct_1d[64/200]` | 652.2 µs | 87.2 µs | ×7.5 |
| WallTime | `codspeed-macro_gcc_kaiserbessel_double/nfft_forward_direct_1d_omp[256/800]` | 774.7 µs | 110.1 µs | ×7 |
| WallTime | `codspeed-macro_gcc_kaiserbessel_double/nfft_adjoint_direct_1d[32/100]` | 160.4 µs | 24 µs | ×6.7 |
| WallTime | `codspeed-macro_gcc_kaiserbessel_double/nfft_forward_direct_1d[32/100]` | 160.1 µs | 25.3 µs | ×6.3 |
| WallTime | `codspeed-macro_gcc_kaiserbessel_double/nfft_forward_direct_1d_omp[128/400]` | 205.1 µs | 34.3 µs | ×6 |
| WallTime | `codspeed-macro_gcc_kaiserbessel_double/nfft_adjoint_direct_1d_omp[512/1600]` | 3,306.2 µs | 602.8 µs | ×5.5 |
| WallTime | `codspeed-macro_gcc_kaiserbessel_float/nfft_adjoint_direct_1d[512/1600]` | 21.6 ms | 4.4 ms | ×4.9 |
| WallTime | `codspeed-macro_gcc_kaiserbessel_double/nfft_adjoint_direct_1d_omp[256/800]` | 855.9 µs | 183.7 µs | ×4.7 |
| WallTime | `codspeed-macro_gcc_kaiserbessel_float/nfft_adjoint_direct_1d[256/800]` | 5 ms | 1.1 ms | ×4.5 |
| WallTime | `codspeed-macro_gcc_kaiserbessel_float/nfft_forward_direct_1d_omp[512/1600]` | 1,556.9 µs | 352.2 µs | ×4.4 |
| WallTime | `codspeed-macro_gcc_kaiserbessel_float/nfft_forward_direct_1d[512/1600]` | 21.2 ms | 4.8 ms | ×4.4 |
| WallTime | `codspeed-macro_gcc_kaiserbessel_double/nfft_forward_direct_1d_omp[64/200]` | 57.5 µs | 14 µs | ×4.1 |
| ... | ... | ... | ... | ... |

ℹ️ Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.

Comparing feature/nfft-direct-blocked-recurrence (b139c5e) with develop (56ea7dc)

jenskeiner force-pushed the feature/nfft-direct-blocked-recurrence branch from dc341fa to b139c5e on June 22, 2026 11:06
jenskeiner deployed to benchmarks with GitHub Actions on June 22, 2026 11:06 (Active)
jenskeiner changed the title from "Feature/nfft direct blocked recurrence" to "NDFT blocked recurrence" on Jun 22, 2026