NDFT blocked recurrence#222
Draft
jenskeiner wants to merge 1 commit into
Draft
Conversation
e1b769f to
261dfdb
Compare
dc341fa to
b139c5e
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This PR seeks to improve the efficiency as well as the accuracy of the 1D direct nfft forward/adjoint transforms for serial and threaded cases.
I was originally just trying to improve the performance but then noticed that the forward transform's error grows like O(sqrt(N)) and that of the adjoint transform like O(N) while the error bound we use in our tests is just O(eps), independent of N.
After some rounding error analysis, the two sources of errors were 1) the calculation of the argument to sin/cos, and 2) the final summation over the frequencies. 1) can be reduced by using an FMA to reduce the argument to [-pi,pi] before feeding into sin/cos. This change has removed the N-dependency of the errors entirely.
After confirming the accuracy improvements, I applied another optimization that avoids the repeated costly sin/cos evaluations, by pulling out a constant factor added to the phase from one loop iteration to the next. This has to be done block-wise to not remove the accuracy improvements from the first set of changes again. There's now a tunable parameter B=32, the block size, that trades accuracy (B smaller) for speed (B larger).
I have attached a HTML write-up generated by my coding assistant that describes the changes and the rationale:
direct-1d-recurrence-optimization.html
This was tested on ARM64 (macOS) as well as in CI on x64 (Linux) architectures. If a platform does not support the single-rounding FMA semantics, the error bound could still be N-dependent again.