Fix reply memory leak in retryable command retry loop by groeneai · Pull Request #13 · ClickHouse/mongo-c-driver

groeneai · 2026-06-14T19:29:32Z

Problem

_mongoc_retryable_cmd_run (src/libmongoc/src/mongoc/mongoc-retryable-cmd.c) reuses *reply across retry attempts. Per attempt:

cmd->execute(...) populates reply (for a cursor read this ends in bson_copy_to(&body, reply) in _mongoc_cluster_run_opmsg_recv, allocating reply->own_buffer).
The loop inspects reply for retryability/error labels.
On a retry it calls cmd->select_retry_server(..., reply, ...) while reply still owns the previous attempt's allocation.

select_retry_server (cursor path: _retryable_cursor_commmand_select_retry_server -> mongoc_cluster_stream_for_reads -> _mongoc_cluster_stream_for_optype) treats reply as an out-param and, on its server-selection-failure path, overwrites it (bson_copy_to(&first_reply, reply) / bson_init(reply)) without freeing the prior contents. The loop's own bson_destroy(reply) sits after the if (!server_description) break;, so when selection fails it is never reached and the attempt's reply allocation is leaked.

Trigger: a retryable read error followed by a retry server-selection failure (a rare failover/topology race).

Fix

Destroy and re-init reply immediately before select_retry_server overwrites it, and drop the now-redundant trailing bson_destroy. This matches the existing bson_destroy; bson_init reset idiom in mongoc-cluster.c (OIDC reauth path). The reply-inspection reads all run before the destroy, so retry decisions are unchanged.

Verification

Observed in downstream (ClickHouse) CI as a LeakSanitizer 128-byte direct leak:

Direct leak of 128 byte(s) in 1 object(s) allocated from:
    bson_malloc           src/libbson/src/bson/memory.c:106
    bson_copy_to          src/libbson/src/bson/bson.c:2106
    _mongoc_cluster_run_opmsg_recv  src/libmongoc/src/mongoc/mongoc-cluster.c:3313
    ...
    _mongoc_retryable_cmd_run       src/libmongoc/src/mongoc/mongoc-retryable-cmd.c
    _mongoc_cursor_run_command / _prime / mongoc_cursor_next

Reproduced deterministically with a standalone AddressSanitizer/LeakSanitizer harness that drives the real _mongoc_retryable_cmd_run with mock execute/select_retry_server callbacks matching the ownership pattern above: the leak is reported without this change and is gone with it.

_mongoc_retryable_cmd_run reuses *reply across retry attempts. After an attempt fails with a retryable error, the loop calls select_retry_server() with reply while it still owns that attempt's allocation. Of the three select_retry_server callbacks, only the cursor one (_retryable_cursor_commmand_select_retry_server) forwards reply to mongoc_cluster_stream_for_reads as a real out-param. On its server-selection-failure path, _mongoc_cluster_stream_for_optype re-initializes reply (bson_init / bson_copy_to) without freeing the prior contents. The loop's own bson_destroy(reply) sits after the !server_description break, so it is not reached when selection fails, and the attempt's reply is leaked. The read and write callbacks (_retryable_read_select_retry_server, _retryable_write_select_retry_server) ignore reply and pass NULL downstream, so they never overwrite it and never leak. Fix it where the bug is: reset reply (destroy + init) inside the cursor callback, just before it hands reply to stream_for_reads. The retry loop is left exactly as upstream, so the read and write paths keep returning the failed first attempt's reply (with its errorLabels and server error fields) to the caller, which they relied on before. Observed in ClickHouse CI as a LeakSanitizer 128-byte direct leak: bson_malloc <- bson_copy_to <- _mongoc_cluster_run_opmsg_recv <- ... <- _mongoc_retryable_cmd_run, reached through a find cursor on a retryable read. The trigger (retryable read error followed by a retry server-selection failure) is a rare failover race. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

This was referenced Jun 14, 2026

Fix memory leak in mongo-c-driver retryable command retry loop ClickHouse/ClickHouse#107448

Open

Fix TSan data race on NuRaft peer config_ ClickHouse/ClickHouse#107429

Closed

groeneai force-pushed the fix-retryable-cmd-reply-leak branch from 5aad6ab to 6b6b676 Compare June 14, 2026 20:57

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix reply memory leak in retryable command retry loop#13

Fix reply memory leak in retryable command retry loop#13
groeneai wants to merge 1 commit into
ClickHouse:ClickHouse/2.3.0from
groeneai:fix-retryable-cmd-reply-leak

groeneai commented Jun 14, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

groeneai commented Jun 14, 2026

Problem

Fix

Verification

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant