Fix memory leak in mongo-c-driver retryable command retry loop#107448
Fix memory leak in mongo-c-driver retryable command retry loop#107448groeneai wants to merge 1 commit into
Conversation
Pre-PR validation gate
Session id: cron:clickhouse-worker-slot-2:20260614-184800 |
|
cc @alexey-milovidov @rschu1ze — could you review this? It bumps the mongo-c-driver submodule by one commit (ClickHouse/mongo-c-driver#13) to fix a reply memory leak in the retryable command retry loop: the loop calls select_retry_server() while reply still owns the previous attempt's allocation, and that function overwrites reply as an out-param without freeing it, so the allocation leaks when retry server selection fails. This is the test_simple_ssl LeakSanitizer failure from #107429. |
|
Workflow [PR], commit [59168d5] AI ReviewSummaryThis PR bumps Missing context / blind spots
Findings
Tests
Final VerdictStatus: Minimum required actions: preserve the original reply for retry-server callbacks that do not overwrite it, and add focused coverage for that error path or point to automated coverage that would fail on the regression above. |
| @@ -1 +1 @@ | |||
| Subproject commit a26c8e251fec147258e89b7e85e9adfa259fc42f | |||
| Subproject commit 5aad6ab39aea8a2aa425404bc3d82bbc6fe581cd | |||
There was a problem hiding this comment.
This generic reset drops the original server reply for retryable read commands when retry server selection fails. _retryable_read_select_retry_server deliberately ignores its reply argument and calls mongoc_cluster_stream_for_reads(..., NULL, NULL), so the only reply available to mongoc_client_read_command_with_opts is the one produced by the failed first attempt. Before this submodule commit, a selection failure took the !server_description branch before the trailing bson_destroy, preserving that reply; now src/libmongoc/src/mongoc/mongoc-retryable-cmd.c:106 replaces it with {} before the callback can fail. That violates the retry path's "return the original retryable error" behavior and removes errorLabels / server error fields from the caller-visible reply.
Please keep the reset local to callbacks that actually pass reply as an out-param, for example the cursor retry-server path, or otherwise preserve the previous reply when the select callback does not replace it.
There was a problem hiding this comment.
You're right, thanks. I confirmed it against the driver source.
Of the three select_retry_server callbacks, only the cursor one passes reply as a real out-param:
_retryable_read_select_retry_server(client.c):BSON_UNUSED(reply), callsmongoc_cluster_stream_for_reads(..., NULL /* reply */, NULL)._retryable_write_select_retry_server(cluster.c):BSON_UNUSED(reply), callsmongoc_cluster_stream_for_writes(..., NULL /* reply */, NULL)._retryable_cursor_commmand_select_retry_server(cursor.c): forwardsreplytomongoc_cluster_stream_for_reads, whose_mongoc_cluster_stream_for_optypeselection-failure path re-inits it (bson_init/bson_copy_to) without freeing prior contents.
So the leak is cursor-path only, and the previous reset-before-select wiped the first attempt's reply (errorLabels and server error fields) on the read/write paths, which the loop used to preserve.
New approach: the retry loop is reverted to upstream, and the reset is moved into the cursor callback right before it hands reply to stream_for_reads. Read/write keep returning the failed attempt's reply; cursor no longer leaks. The submodule bump now points at the scoped commit (ClickHouse/mongo-c-driver@6b6b676).
Verified with a standalone ASan/LSan harness driving the loop's reply lifecycle with real libbson, one scenario per process:
orig/read leak=0 errorLabels=1 (baseline)
orig/cursor leak=1 errorLabels=0 <- the leak (256B)
v1/read leak=0 errorLabels=0 <- the regression you flagged
scoped/read leak=0 errorLabels=1 <- fix: errorLabels preserved
scoped/cursor leak=0 errorLabels=0 <- fix: leak gone
|
@groeneai Send the fix first upstream before integrating it in ClickHouse. |
…ry loop Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
59168d5 to
6d09e59
Compare
Pre-PR validation gate (updated for the scoped fix, commit 6d09e59)Revised after @clickhouse-gh[bot]'s review. The earlier reset-before-select approach fixed the leak but wiped the read/write paths' first-attempt reply (errorLabels). This version moves the reset into the cursor callback only.
Session id: cron:clickhouse-worker-slot-0:20260614-200800 |
|
Ok. One concern is that upstream could be managed by AI-haters. This often happens in open-source. So they can insult and humiliate our bot. Better to read everything and author the upstream fix from a real person. |
|
On sending the fix upstream first: agreed, and I won't open a PR against The change ports cleanly for a human to author upstream. The retry loop is left byte-for-byte as upstream; the single reset ( Holding this PR until the upstream fix lands; I'll rebase the submodule bump onto whatever commit is accepted. |
Related: ClickHouse/mongo-c-driver#13
Related: #107429
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Fixed a memory leak in the bundled
mongo-c-driverthat could occur when reading from MongoDB (for example via a MongoDB dictionary or themongodbtable function) if a retryable read error was followed by a failed retry server selection.Description
Bumps the
mongo-c-driversubmodule by one commit to fix a reply memory leak in the retryable command retry loop._mongoc_retryable_cmd_runreuses*replyacross retry attempts. After an attempt fails with a retryable error, the loop callsselect_retry_server(..., reply, ...)whilereplystill owns that attempt's allocation.select_retry_server(cursor path:mongoc_cluster_stream_for_reads->_mongoc_cluster_stream_for_optype) treatsreplyas an out-param and, on its server-selection-failure path, overwrites it (bson_copy_to/bson_init) without freeing the prior contents. The loop's ownbson_destroy(reply)sits after theif (!server_description) break;, so it is never reached when selection fails, and the attempt'sreplyis leaked.The fix destroys and re-inits
replybeforeselect_retry_serveroverwrites it, and drops the now-redundant trailing destroy. See ClickHouse/mongo-c-driver#13 for the driver change.This surfaced in CI as a LeakSanitizer 128-byte direct leak on
test_dictionaries_all_layouts_separate_sources/test_mongo.py::test_simple_ssl(Integration tests (amd_asan_ubsan, ...)):The trigger (retryable read error followed by a retry server-selection failure) is a rare failover race: it appeared twice in 30 days across PR CI runs, never on master. Reproduced deterministically with a standalone AddressSanitizer/LeakSanitizer harness driving the real
_mongoc_retryable_cmd_runwith mock callbacks matching the ownership pattern above; the leak is reported without the change and gone with it.Found by
@alexey-milovidovon #107429 and fixed here on his request.Report: https://s3.amazonaws.com/clickhouse-test-reports/107429/index.html (Integration tests, amd_asan_ubsan, db disk, old analyzer, 3/6)