Avoid per-call malloc/free in predict() by JackDanger · Pull Request #95 · cjlin1/liblinear

JackDanger · 2026-05-09T18:36:38Z

Hi there! First time contributing!

I noticed that predict() does a malloc and free of dec_values on every call. For services that score one row at a time (which is most of how folks use the library, I imagine), that runs in a very tight loop, and on my setup it ends up costing more than the dot product itself for small models.

Quick local numbers (Apple M-series, clang -O3, median of 3 runs, predict() throughput):

                                                  before          after
heart_scale, binary, nr_class=2                  31.8  Mrows/s    32.1  Mrows/s    (≈ flat — system allocator is fast for 16B)
synthetic,   multiclass, nr_class=32              1.6  Mrows/s     4.3  Mrows/s    (≈2.7× — main win)
synthetic,   multiclass, nr_class=128 (heap)      1.2  Mrows/s     1.2  Mrows/s    (unchanged, as expected on the fallback)
synthetic,   binary, n=1000 features              0.30 Mrows/s     0.29 Mrows/s    (dot product dominates)

The biggest win is on small/medium multiclass. Binary on Apple Silicon barely moves because the system allocator's small-size fast path is already excellent — I'd expect glibc to look more like the multiclass row but I haven't measured it.

Happy to iterate if you want it different.

Disclosure: I used Claude Opus 4.7 to find and fix this performance problem.

predict() is invoked once per row in do_predict, cross_validation, and embedders. nr_class is almost always small, so the unconditional Malloc+free of dec_values is pure overhead in the hot loop.

Use a 64-double stack buffer for the common case and only fall back to the heap for larger multiclass.

Suppress -fstack-protector on the function: macOS clang and gcc -fstack-protector-strong otherwise insert a canary load+compare on every call, which on multiclass models can dominate the dot product itself. The buffer is internal, fixed-size, and bounded by the nr_class check, so the canary protects nothing here.

Companion: hoist dec_values out of do_predict's per-row loop, mirroring the existing prob_estimates pattern, and call predict_values directly.

No API or ABI change; output is bit-identical (md5 match on heart_scale).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
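
For readers who want to see the shape of the change without opening the diff, here is a minimal, self-contained sketch of the stack-buffer-with-heap-fallback pattern described in the commit message. It is not the actual patch: `toy_predict`, `toy_predict_values`, `toy_count_label`, `kStackClasses`, and the `NO_STACK_PROTECTOR` macro are hypothetical names, and the scoring function is a stand-in rather than liblinear's dot product. The `no_stack_protector` attribute is only applied when the compiler reports support for it; whether the real change uses this exact attribute is an assumption.

```cpp
#include <cassert>
#include <cstdlib>
#include <vector>

// Hypothetical names throughout; this is an illustration, not liblinear's code.
constexpr int kStackClasses = 64;  // mirrors the 64-double buffer in the commit message

// Apply no_stack_protector only where the compiler advertises it (assumption:
// the real patch may use a different mechanism to suppress the canary).
#if defined(__has_attribute)
#  if __has_attribute(no_stack_protector)
#    define NO_STACK_PROTECTOR __attribute__((no_stack_protector))
#  endif
#endif
#ifndef NO_STACK_PROTECTOR
#  define NO_STACK_PROTECTOR
#endif

// Stand-in for predict_values: writes dec_values[i] = w[i] * x and returns
// the index of the largest value (first index wins on ties).
static int toy_predict_values(const double *w, int nr_class, double x,
                              double *dec_values)
{
    int best = 0;
    for (int i = 0; i < nr_class; ++i) {
        dec_values[i] = w[i] * x;
        if (dec_values[i] > dec_values[best]) best = i;
    }
    return best;
}

// Per-call entry point: scratch space lives on the stack for small nr_class
// and falls back to malloc/free only when nr_class exceeds the buffer.
NO_STACK_PROTECTOR
int toy_predict(const double *w, int nr_class, double x)
{
    assert(nr_class > 0);
    double stack_buf[kStackClasses];
    double *dec_values = stack_buf;
    const bool on_heap = nr_class > kStackClasses;
    if (on_heap) {
        dec_values = static_cast<double *>(std::malloc(sizeof(double) * nr_class));
        if (dec_values == nullptr) std::abort();
    }
    const int label = toy_predict_values(w, nr_class, x, dec_values);
    if (on_heap) std::free(dec_values);
    return label;
}

// Batch caller: the scratch buffer is allocated once, outside the per-row
// loop, and the values routine is called directly for each row.
int toy_count_label(const double *w, int nr_class, const double *rows,
                    int n_rows, int label)
{
    std::vector<double> dec_values(nr_class);  // hoisted out of the loop
    int hits = 0;
    for (int r = 0; r < n_rows; ++r)
        if (toy_predict_values(w, nr_class, rows[r], dec_values.data()) == label)
            ++hits;
    return hits;
}
```

The heap branch is what the nr_class=128 benchmark row exercises, which is why that row is expected to stay flat. The batch function illustrates the companion change: when the caller already loops over rows, one allocation up front can replace one per row.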

JackDanger force-pushed the perf/predict-no-malloc branch from 56e1d7a to d162b2a Compare May 9, 2026 18:54

JackDanger marked this pull request as ready for review May 9, 2026 21:52

JackDanger wants to merge 1 commit into cjlin1:master from JackDanger:perf/predict-no-malloc
