Fix reply memory leak in retryable command retry loop#13
Open
groeneai wants to merge 1 commit into
Open
Conversation
This was referenced Jun 14, 2026
_mongoc_retryable_cmd_run reuses *reply across retry attempts. After an attempt fails with a retryable error, the loop calls select_retry_server() with reply while it still owns that attempt's allocation. Of the three select_retry_server callbacks, only the cursor one (_retryable_cursor_commmand_select_retry_server) forwards reply to mongoc_cluster_stream_for_reads as a real out-param. On its server-selection-failure path, _mongoc_cluster_stream_for_optype re-initializes reply (bson_init / bson_copy_to) without freeing the prior contents. The loop's own bson_destroy(reply) sits after the !server_description break, so it is not reached when selection fails, and the attempt's reply is leaked. The read and write callbacks (_retryable_read_select_retry_server, _retryable_write_select_retry_server) ignore reply and pass NULL downstream, so they never overwrite it and never leak. Fix it where the bug is: reset reply (destroy + init) inside the cursor callback, just before it hands reply to stream_for_reads. The retry loop is left exactly as upstream, so the read and write paths keep returning the failed first attempt's reply (with its errorLabels and server error fields) to the caller, which they relied on before. Observed in ClickHouse CI as a LeakSanitizer 128-byte direct leak: bson_malloc <- bson_copy_to <- _mongoc_cluster_run_opmsg_recv <- ... <- _mongoc_retryable_cmd_run, reached through a find cursor on a retryable read. The trigger (retryable read error followed by a retry server-selection failure) is a rare failover race. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
5aad6ab to
6b6b676
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
_mongoc_retryable_cmd_run(src/libmongoc/src/mongoc/mongoc-retryable-cmd.c) reuses*replyacross retry attempts. Per attempt:cmd->execute(...)populatesreply(for a cursor read this ends inbson_copy_to(&body, reply)in_mongoc_cluster_run_opmsg_recv, allocatingreply->own_buffer).replyfor retryability/error labels.cmd->select_retry_server(..., reply, ...)whilereplystill owns the previous attempt's allocation.select_retry_server(cursor path:_retryable_cursor_commmand_select_retry_server->mongoc_cluster_stream_for_reads->_mongoc_cluster_stream_for_optype) treatsreplyas an out-param and, on its server-selection-failure path, overwrites it (bson_copy_to(&first_reply, reply)/bson_init(reply)) without freeing the prior contents. The loop's ownbson_destroy(reply)sits after theif (!server_description) break;, so when selection fails it is never reached and the attempt'sreplyallocation is leaked.Trigger: a retryable read error followed by a retry server-selection failure (a rare failover/topology race).
Fix
Destroy and re-init
replyimmediately beforeselect_retry_serveroverwrites it, and drop the now-redundant trailingbson_destroy. This matches the existingbson_destroy; bson_initreset idiom inmongoc-cluster.c(OIDC reauth path). The reply-inspection reads all run before the destroy, so retry decisions are unchanged.Verification
Observed in downstream (ClickHouse) CI as a LeakSanitizer 128-byte direct leak:
Reproduced deterministically with a standalone AddressSanitizer/LeakSanitizer harness that drives the real
_mongoc_retryable_cmd_runwith mockexecute/select_retry_servercallbacks matching the ownership pattern above: the leak is reported without this change and is gone with it.