NDFT blocked recurrence by jenskeiner · Pull Request #222 · NFFT/nfft

jenskeiner · 2026-06-21T19:07:14Z

This PR improves both the efficiency and the accuracy of the 1D direct NFFT forward and adjoint transforms, in the serial as well as the threaded case.

I originally set out only to improve performance, but then noticed that the error of the forward transform grows like O(sqrt(N)) and that of the adjoint transform like O(N), whereas the error bound used in our tests is just O(eps), independent of N.

A rounding-error analysis identified two sources of error: 1) the calculation of the argument to sin/cos, and 2) the final summation over the frequencies. Source 1) can be reduced by using an FMA to reduce the argument to [-pi, pi] before passing it to sin/cos. This change removed the N-dependence of the errors entirely.

After confirming the accuracy improvements, I applied another optimization that avoids the repeated, costly sin/cos evaluations. The phase changes by a constant increment from one loop iteration to the next, so each complex exponential can be obtained from the previous one by multiplying with a constant factor. This has to be done block-wise so as not to undo the accuracy improvements from the first set of changes. There is now a tunable parameter, the block size B = 32, that trades accuracy (smaller B) for speed (larger B).

I have attached an HTML write-up, generated by my coding assistant, that describes the changes and the rationale:
direct-1d-recurrence-optimization.html

This was tested on ARM64 (macOS) and, in CI, on x64 (Linux). On a platform that does not provide single-rounding FMA semantics, the error bound could become N-dependent again.

codspeed-hq · 2026-06-21T20:21:08Z

Merging this PR will improve performance by ×3.7

⚠️ Different runtime environments detected

Some benchmarks with significant performance changes were compared across different runtime environments, which may affect the accuracy of the results.

Open the report in CodSpeed to investigate

⚡ 57 improved benchmarks
✅ 75 untouched benchmarks
🆕 24 new benchmarks

Performance Changes

| | Mode | Benchmark | `BASE` | `HEAD` | Efficiency |
|---|---|---|---|---|---|
| ⚡ | WallTime | `codspeed-macro_gcc_kaiserbessel_double/nfft_adjoint_direct_1d[256/800]` | 10.5 ms | 1.3 ms | ×8.2 |
| ⚡ | WallTime | `codspeed-macro_gcc_kaiserbessel_double/nfft_adjoint_direct_1d[512/1600]` | 41.5 ms | 5.1 ms | ×8.2 |
| ⚡ | WallTime | `codspeed-macro_gcc_kaiserbessel_double/nfft_adjoint_direct_1d[64/200]` | 660.5 µs | 82 µs | ×8.1 |
| ⚡ | WallTime | `codspeed-macro_gcc_kaiserbessel_double/nfft_adjoint_direct_1d[128/400]` | 2,641.8 µs | 330.4 µs | ×8 |
| ⚡ | WallTime | `codspeed-macro_gcc_kaiserbessel_double/nfft_forward_direct_1d[512/1600]` | 41.3 ms | 5.4 ms | ×7.6 |
| ⚡ | WallTime | `codspeed-macro_gcc_kaiserbessel_double/nfft_forward_direct_1d[256/800]` | 10.4 ms | 1.4 ms | ×7.6 |
| ⚡ | WallTime | `codspeed-macro_gcc_kaiserbessel_double/nfft_forward_direct_1d_omp[512/1600]` | 2,999.7 µs | 398.2 µs | ×7.5 |
| ⚡ | WallTime | `codspeed-macro_gcc_kaiserbessel_double/nfft_forward_direct_1d[128/400]` | 2,634.5 µs | 352.1 µs | ×7.5 |
| ⚡ | WallTime | `codspeed-macro_gcc_kaiserbessel_double/nfft_forward_direct_1d[64/200]` | 652.2 µs | 87.2 µs | ×7.5 |
| ⚡ | WallTime | `codspeed-macro_gcc_kaiserbessel_double/nfft_forward_direct_1d_omp[256/800]` | 774.7 µs | 110.1 µs | ×7 |
| ⚡ | WallTime | `codspeed-macro_gcc_kaiserbessel_double/nfft_adjoint_direct_1d[32/100]` | 160.4 µs | 24 µs | ×6.7 |
| ⚡ | WallTime | `codspeed-macro_gcc_kaiserbessel_double/nfft_forward_direct_1d[32/100]` | 160.1 µs | 25.3 µs | ×6.3 |
| ⚡ | WallTime | `codspeed-macro_gcc_kaiserbessel_double/nfft_forward_direct_1d_omp[128/400]` | 205.1 µs | 34.3 µs | ×6 |
| ⚡ | WallTime | `codspeed-macro_gcc_kaiserbessel_double/nfft_adjoint_direct_1d_omp[512/1600]` | 3,306.2 µs | 602.8 µs | ×5.5 |
| ⚡ | WallTime | `codspeed-macro_gcc_kaiserbessel_float/nfft_adjoint_direct_1d[512/1600]` | 21.6 ms | 4.4 ms | ×4.9 |
| ⚡ | WallTime | `codspeed-macro_gcc_kaiserbessel_double/nfft_adjoint_direct_1d_omp[256/800]` | 855.9 µs | 183.7 µs | ×4.7 |
| ⚡ | WallTime | `codspeed-macro_gcc_kaiserbessel_float/nfft_adjoint_direct_1d[256/800]` | 5 ms | 1.1 ms | ×4.5 |
| ⚡ | WallTime | `codspeed-macro_gcc_kaiserbessel_float/nfft_forward_direct_1d_omp[512/1600]` | 1,556.9 µs | 352.2 µs | ×4.4 |
| ⚡ | WallTime | `codspeed-macro_gcc_kaiserbessel_float/nfft_forward_direct_1d[512/1600]` | 21.2 ms | 4.8 ms | ×4.4 |
| ⚡ | WallTime | `codspeed-macro_gcc_kaiserbessel_double/nfft_forward_direct_1d_omp[64/200]` | 57.5 µs | 14 µs | ×4.1 |
| ... | ... | ... | ... | ... | ... |

ℹ️ Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.

Comparing `feature/nfft-direct-blocked-recurrence` (b139c5e) with `develop` (56ea7dc)

jenskeiner had a problem deploying to benchmarks June 21, 2026 19:09 — with GitHub Actions Error

jenskeiner force-pushed the feature/nfft-direct-blocked-recurrence branch from e1b769f to 261dfdb Compare June 21, 2026 20:05

jenskeiner temporarily deployed to benchmarks June 21, 2026 20:05 — with GitHub Actions Inactive

jenskeiner temporarily deployed to benchmarks June 21, 2026 20:42 — with GitHub Actions Inactive

jenskeiner temporarily deployed to benchmarks June 22, 2026 09:34 — with GitHub Actions Inactive

jenskeiner temporarily deployed to benchmarks June 22, 2026 10:16 — with GitHub Actions Inactive

Blocked recurrence for nfft direct forward/adjoint transform.

b139c5e

jenskeiner force-pushed the feature/nfft-direct-blocked-recurrence branch from dc341fa to b139c5e Compare June 22, 2026 11:06

jenskeiner deployed to benchmarks June 22, 2026 11:06 — with GitHub Actions Active

jenskeiner temporarily deployed to benchmarks June 22, 2026 11:06 — with GitHub Actions Inactive

jenskeiner requested review from DanielPotts, michaelquellmalz and skunis June 22, 2026 11:10

jenskeiner changed the title ~~Feature/nfft direct blocked recurrence~~ NDFT blocked recurrence Jun 22, 2026

jenskeiner wants to merge 1 commit into `develop` from `feature/nfft-direct-blocked-recurrence`
