Avoid per-call malloc/free in predict()#95
Open
JackDanger wants to merge 1 commit intocjlin1:masterfrom
Open
Conversation
predict() is invoked once per row in do_predict, cross_validation, and embedders. nr_class is almost always small, so the unconditional Malloc+free of dec_values is pure overhead in the hot loop. Use a 64-double stack buffer for the common case and only fall back to the heap for larger multiclass. Suppress -fstack-protector on the function: macOS clang and gcc -fstack-protector-strong otherwise insert a canary load+compare on every call, which on multiclass models can dominate the dot product itself. The buffer is internal, fixed-size, and bounded by the nr_class check, so the canary protects nothing here. Companion: hoist dec_values out of do_predict's per-row loop, mirroring the existing prob_estimates pattern, and call predict_values directly. No API or ABI change; output is bit-identical (md5 match on heart_scale). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
56e1d7a to
d162b2a
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Hi there! First time contributing!
I noticed that
predict()does amallocandfreeofdec_valueson every call. For services that score one row at a time (which is most of how folks use the library, I imagine), that runs in a very tight loop, and on my setup it ends up costing more than the dot product itself for small models.Quick local numbers (Apple M-series, clang -O3, median of 3 runs,
predict()throughput):The biggest win is on small/medium multiclass. Binary on Apple Silicon barely moves because the system allocator's small-size fast path is already excellent — I'd expect glibc to look more like the multiclass row but I haven't measured it.
Happy to iterate if you want it different.
Disclosure: I used Claude Opus 4.7 to find and fix this performance problem.