Avoid per-call malloc/free in predict()#95

Open
JackDanger wants to merge 1 commit into cjlin1:master from JackDanger:perf/predict-no-malloc
JackDanger commented May 9, 2026

Hi there! First time contributing!

I noticed that predict() does a malloc and free of dec_values on every call. For services that score one row at a time (which is most of how folks use the library, I imagine), that runs in a very tight loop, and on my setup it ends up costing more than the dot product itself for small models.

Quick local numbers (Apple M-series, clang -O3, median of 3 runs, predict() throughput):

                                                  before          after
heart_scale, binary, nr_class=2                  31.8  Mrows/s    32.1  Mrows/s    (≈ flat — system allocator is fast for 16B)
synthetic,   multiclass, nr_class=32              1.6  Mrows/s     4.3  Mrows/s    (≈2.7× — main win)
synthetic,   multiclass, nr_class=128 (heap)      1.2  Mrows/s     1.2  Mrows/s    (unchanged, as expected on the fallback)
synthetic,   binary, n=1000 features              0.30 Mrows/s     0.29 Mrows/s    (dot product dominates)

The biggest win is on small/medium multiclass. Binary on Apple Silicon barely moves because the system allocator's small-size fast path is already excellent — I'd expect glibc to look more like the multiclass row but I haven't measured it.
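
For anyone who wants to reproduce numbers of this shape, here is a minimal sketch of a median-of-3 throughput harness. It is not the harness used for the table above; `median_of_3`, `mrows_per_sec`, and `median_throughput` are hypothetical names, and the callable passed in stands in for a call to predict() on one row.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>

// Median of three throughput samples.
double median_of_3(double a, double b, double c) {
    return std::max(std::min(a, b), std::min(std::max(a, b), c));
}

// Times `rows` calls of score_one_row(i) and returns millions of rows per second.
// score_one_row is a hypothetical stand-in for scoring a single row.
template <class F>
double mrows_per_sec(F &&score_one_row, std::size_t rows) {
    assert(rows > 0);
    volatile double sink = 0.0;  // keeps the compiler from discarding the loop
    auto t0 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < rows; i++)
        sink = sink + score_one_row(i);
    auto t1 = std::chrono::steady_clock::now();
    double secs = std::chrono::duration<double>(t1 - t0).count();
    return static_cast<double>(rows) / secs / 1e6;
}

// Runs the measurement three times and reports the median.
template <class F>
double median_throughput(F &&score_one_row, std::size_t rows) {
    double a = mrows_per_sec(score_one_row, rows);
    double b = mrows_per_sec(score_one_row, rows);
    double c = mrows_per_sec(score_one_row, rows);
    return median_of_3(a, b, c);
}
```

Taking the median rather than the mean keeps a single slow run (for example, one disturbed by another process) from skewing the reported figure.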

Happy to iterate if you want it different.

Disclosure: I used Claude Opus 4.7 to find and fix this performance problem.

predict() is invoked once per row in do_predict, cross_validation, and
embedders. nr_class is almost always small, so the unconditional
Malloc+free of dec_values is pure overhead in the hot loop.

Use a 64-double stack buffer for the common case and only fall back to
the heap for larger multiclass. Suppress -fstack-protector on the
function: macOS clang and gcc -fstack-protector-strong otherwise insert
a canary load+compare on every call, which on multiclass models can
dominate the dot product itself. The buffer is internal, fixed-size,
and bounded by the nr_class check, so the canary protects nothing here.
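
A minimal sketch of the stack-buffer-with-heap-fallback pattern described above, not the actual diff. `predict_sketch`, `fill_scores`, `STACK_CLASSES`, and the `NO_STACK_PROTECTOR` macro are hypothetical names; the `no_stack_protector` attribute is an assumption about how the canary could be suppressed, guarded so that compilers without it still build.

```cpp
#include <cassert>
#include <cstdlib>

// Assumption: suppress the stack canary only where the compiler offers the attribute.
#if defined(__has_attribute)
#  if __has_attribute(no_stack_protector)
#    define NO_STACK_PROTECTOR __attribute__((no_stack_protector))
#  endif
#endif
#ifndef NO_STACK_PROTECTOR
#  define NO_STACK_PROTECTOR
#endif

enum { STACK_CLASSES = 64 };  // buffer size taken from the description above

// Hypothetical stand-in for the real scoring routine: fills dec_values[0..nr_class).
static void fill_scores(int nr_class, const double *x, double *dec_values) {
    for (int i = 0; i < nr_class; i++)
        dec_values[i] = x[0] * (i + 1);
}

// Returns the index of the largest score, or -1 if the heap fallback fails.
NO_STACK_PROTECTOR
int predict_sketch(int nr_class, const double *x) {
    assert(nr_class >= 1);
    double stack_buf[STACK_CLASSES];
    double *dec_values = stack_buf;          // common case: no allocation at all
    bool on_heap = nr_class > STACK_CLASSES; // larger multiclass: fall back to malloc
    if (on_heap) {
        dec_values = static_cast<double *>(std::malloc(sizeof(double) * nr_class));
        if (dec_values == nullptr) return -1;
    }
    fill_scores(nr_class, x, dec_values);
    int best = 0;
    for (int i = 1; i < nr_class; i++)
        if (dec_values[i] > dec_values[best]) best = i;
    if (on_heap) std::free(dec_values);      // free only what was malloc'd
    return best;
}
```

The `nr_class > STACK_CLASSES` check is what bounds every write to the fixed-size buffer, which is the argument for why the canary adds no protection in this function.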

Companion: hoist dec_values out of do_predict's per-row loop, mirroring
the existing prob_estimates pattern, and call predict_values directly.
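
The hoisting idea can be sketched as follows. This is an illustration under assumptions, not the code in this commit: `score_row` is a hypothetical stand-in for predict_values, and `predict_all` stands in for the per-row loop in do_predict.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for predict_values(): writes nr_class scores into
// dec_values and returns the index of the largest one.
static int score_row(int nr_class, double x, double *dec_values) {
    int best = 0;
    for (int i = 0; i < nr_class; i++) {
        dec_values[i] = -(x - i) * (x - i);  // toy score: closest class index wins
        if (dec_values[i] > dec_values[best]) best = i;
    }
    return best;
}

// Scores every row while reusing one scratch buffer for the whole loop.
std::vector<int> predict_all(int nr_class, const std::vector<double> &rows) {
    assert(nr_class >= 1);
    std::vector<int> labels;
    labels.reserve(rows.size());
    std::vector<double> dec_values(nr_class);  // allocated once, outside the loop
    for (double x : rows)
        labels.push_back(score_row(nr_class, x, dec_values.data()));
    return labels;
}
```

Because the caller owns the buffer, the per-row cost is only the scoring itself, regardless of how large nr_class is.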

No API or ABI change; output is bit-identical (md5 match on heart_scale).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
JackDanger force-pushed the perf/predict-no-malloc branch from 56e1d7a to d162b2a on May 9, 2026 18:54
JackDanger marked this pull request as ready for review on May 9, 2026 21:52