Optimize Transform evaluation with re-entrant TLS scratch buffers #345
Use re-entrancy-safe thread-local storage for intermediate results.
|
@ikrommyd FYI |
CVMFS benchmarks
Top 25 slowest-loading corrections, sorted by mean time:
|
Otherwise, when the vector grows beyond capacity, all existing references become invalid
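The remark above describes a standard property of std::vector: once size() would exceed capacity(), the storage is reallocated and every reference or pointer to an existing element dangles. The following is a minimal sketch, not code from this PR; all function names are hypothetical. It shows the storage moving, and two common ways to keep references stable: reserving enough capacity up front, or holding the slots in a std::deque, whose push_back/emplace_back leaves references to existing elements valid.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <vector>

// Hypothetical helper: true if growing past capacity moved the element storage.
bool vector_storage_moves_on_growth() {
    std::vector<int> v;
    v.push_back(1);
    // Fill up to the current capacity so the next push_back must reallocate.
    while (v.size() < v.capacity()) v.push_back(0);
    // Store the address as an integer so no dangling pointer is ever inspected.
    const auto before = reinterpret_cast<std::uintptr_t>(v.data());
    v.push_back(2);  // size would exceed capacity: reallocation
    const auto after = reinterpret_cast<std::uintptr_t>(v.data());
    return before != after;
}

// Hypothetical helper: references survive as long as size() stays <= capacity().
bool reserve_keeps_references_valid() {
    std::vector<int> v;
    v.reserve(8);
    v.push_back(10);
    int& first = v[0];
    for (int i = 0; i < 7; ++i) v.push_back(i);  // size reaches 8, never exceeds it
    return &first == &v[0] && first == 10;
}

// Hypothetical helper: a deque never relocates existing elements on emplace_back.
bool deque_keeps_references_valid() {
    std::deque<std::vector<int>> pool;
    pool.emplace_back(std::vector<int>{7, 7, 7});
    std::vector<int>& slot0 = pool.front();
    for (int i = 0; i < 1000; ++i) pool.emplace_back();
    return &slot0 == &pool.front() && slot0.size() == 3 && slot0[0] == 7;
}
```

For a pool of scratch vectors that is extended while an outer caller still holds a reference to its slot, either mitigation avoids the dangling reference; which one fits depends on whether a maximum nesting depth is known in advance.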
|
With this PR I'm seeing this reduction in the number of allocations in my analysis: 400093149 -> 369875546 (about 7.6% fewer). After: |
|
I did not notice any observable performance changes. The analysis takes the same time to run within statistical errors. Now I don't know if you benchmarked the transform node in particular and found it faster, but I assume that the actual analysis' runtime is heavily dominated by other factors so the reduction in the number of allocations is not a noticeable performance improvement in runtime. |
|
Ok, thanks for testing. I'm surprised it only dropped a little; maybe I'm mistaken about where the allocations are happening. Will actually measure rather than shoot from the hip here... |
|
Yeah, so the above stats come from wrapping the processor call
memray can trace native call stacks and attach them to Python call stacks (it has a harder time on macOS than on Linux), but when you do
Perhaps there's a way to tell memray "I want the flamegraph only for this module" or something, but I wasn't able to find it in their API. Another solution would be to instrument correctionlib highlevel.py with a memray tracker so that only memory allocations in this module are tracked, and you'd probably be able to see the native stacks more easily in the flamegraph.
Edit: I will try with
Developed with LLM help, if it is not already obvious from the below 😄
Summary
Eliminate hot-path allocations in evaluate loops
This branch removes several std::vector heap allocations that occurred on every call to the core evaluate paths:
A regression test for nested Transform correctness is included.
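As a rough illustration of the kind of per-call allocation the summary refers to, the sketch below contrasts nothing more than the general technique: a buffer that is cleared and refilled keeps its capacity, so only the first call pays for a heap allocation. The names `fill_scratch` and `count_reallocations` are hypothetical and do not appear in the PR; the element type `double` is chosen only to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a per-call temporary.
// Copies `input` into `scratch`, reusing whatever capacity `scratch` already has.
inline void fill_scratch(std::vector<double>& scratch,
                         const std::vector<double>& input) {
    scratch.clear();  // size becomes 0, capacity is left unchanged
    scratch.insert(scratch.end(), input.begin(), input.end());
}

// Counts how often the scratch storage had to grow over `calls` calls.
// A growth is detected as a change in capacity().
std::size_t count_reallocations(std::size_t calls,
                                const std::vector<double>& input) {
    std::vector<double> scratch;  // lives across calls, unlike a local temporary
    std::size_t reallocations = 0;
    for (std::size_t i = 0; i < calls; ++i) {
        const std::size_t cap_before = scratch.capacity();
        fill_scratch(scratch, input);
        if (scratch.capacity() != cap_before) ++reallocations;
    }
    return reallocations;
}
```

With a non-empty input of fixed size, the storage grows once on the first call and is reused afterwards, whereas constructing a fresh std::vector inside each call would allocate every time.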
Details for TransformScratch
TransformScratch (anonymous namespace in src/correction.cc) with:
std::vector<std::vector<Variable::Type>>)
Transform::evaluate to:
values into the slot
This keeps nested Transform evaluation on the same thread safe while reusing vector capacity across calls.
Design alternatives considered (pros/cons)
Correction::evaluate API
Transform copies if Transform mutates and restores in place.
node_evaluate
Transform.
Correction::evaluate, pass mutable state downward
Transform nodes.
Transform
Transform recursion on the same thread.
Transform node.
Transform-heavy trees.
Notes
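The TransformScratch scheme described under Details can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in src/correction.cc: `ScratchPool`, `Lease`, `thread_scratch` and `evaluate_nested` are hypothetical names, `Value` is an assumed stand-in for `Variable::Type`, and the outer container is a std::deque rather than the std::vector named above, swapped in here so that adding a slot for a deeper nesting level cannot invalidate the reference an outer level still holds.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>
#include <variant>
#include <vector>

// Assumed stand-in for Variable::Type.
using Value = std::variant<int, double, std::string>;

// Hypothetical per-thread pool: one reusable buffer per active nesting level.
class ScratchPool {
public:
    // RAII lease: claims the slot for the current depth, releases it on scope exit.
    class Lease {
    public:
        explicit Lease(ScratchPool& pool) : pool_(pool), index_(pool.depth_) {
            if (index_ == pool_.slots_.size()) pool_.slots_.emplace_back();
            ++pool_.depth_;  // only after the slot exists, so a throw leaves depth intact
        }
        ~Lease() { --pool_.depth_; }
        Lease(const Lease&) = delete;
        Lease& operator=(const Lease&) = delete;
        std::vector<Value>& buffer() { return pool_.slots_[index_]; }
    private:
        ScratchPool& pool_;
        std::size_t index_;
    };
    std::size_t slot_count() const { return slots_.size(); }
    std::size_t depth() const { return depth_; }
private:
    std::deque<std::vector<Value>> slots_;  // deque: growth keeps outer slots in place
    std::size_t depth_ = 0;
};

inline ScratchPool& thread_scratch() {
    thread_local ScratchPool pool;  // one pool per thread, no locking needed
    return pool;
}

// Toy recursive "transform": each level copies its inputs into its own slot,
// replaces element 0, and evaluates the next level on the modified copy.
double evaluate_nested(const std::vector<Value>& values, int levels) {
    const double current = std::get<double>(values[0]);
    if (levels == 0) return current;
    ScratchPool::Lease lease(thread_scratch());
    std::vector<Value>& local = lease.buffer();
    local.assign(values.begin(), values.end());  // reuses capacity from earlier calls
    local[0] = current + 1.0;
    const double result = evaluate_nested(local, levels - 1);  // re-enters the pool
    // The inner levels used deeper slots, so this level's buffer is untouched.
    assert(std::get<double>(local[0]) == current + 1.0);
    return result;
}
```

Each nesting level gets its own slot, indexed by the current depth, so a nested call on the same thread never overwrites a buffer that an outer call is still reading. After the outermost call returns, the depth is back to zero and the slots, with their capacity, are ready for reuse by the next call.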