Skip to content

Fix uninitialized / NULL pointer use in dwarf_parser#103

Open
xtrusia wants to merge 1 commit into
taodd:mainfrom
xtrusia:fix-dwarf-parser-uninit-deref
Open

Fix uninitialized / NULL pointer use in dwarf_parser#103
xtrusia wants to merge 1 commit into
taodd:mainfrom
xtrusia:fix-dwarf-parser-uninit-deref

Conversation

@xtrusia
Copy link
Copy Markdown
Contributor

@xtrusia xtrusia commented May 20, 2026

Reproduction and verification: dwarf_parser uninitialized / NULL deref fixes

Summary

  • What changesfind_param, translate_param_location, and
    translate_expr now return bool; failures on the newly checked
    NULL / lookup / frame-base / CFA paths propagate up.
    handle_function drops the whole function probe (erases from
    func2pc / func2vf) when a parameter is unresolved.
  • Why it's safe — the patch only adds handling on previously
    unguarded failure branches; the success-path control flow is
    unchanged. BPF consumers already treat a missing func2pc entry
    as "skip uprobe", so the erase composes with downstream code.
  • Verification scope — Issue 1 is reproduced deterministically
    and both baseline and patched builds were run on the natural happy
    path with byte-identical JSON output. Issue 2 and Issue 3 are
    confirmed by inspection only — the test environment (Ceph 19.2.3 +
    full debug info) does not exercise their failure branches.

Verification of the fix-dwarf-parser-uninit-deref branch against main
(baseline a8793a9). The patch addresses three latent failure-propagation
gaps that all sit on the same live-DWARF parsing path; this note records
how the patched path was exercised end-to-end against the baseline.

The patched call path

All three issues live on the same chain, which only runs when cephtrace
is invoked with -j <out.json> or when the embedded / JSON cache misses:

handle_function
  └─ translate_param_location
       ├─ find_param  (dwarf_getscopevar)
       └─ translate_expr
            ├─ DW_OP_fbreg     (recurses via dwarf_getlocation_addr)
            └─ DW_OP_call_frame_cfa  (recurses via dwarf_cfi_addrframe)

The patch is purely defensive: it makes find_param,
translate_param_location, and translate_expr all return a boolean,
propagates the newly checked NULL / lookup / frame-base / CFA
failures
through their call sites, and has handle_function drop
the whole function probe (erasing it from func2pc / func2vf) when
a parameter cannot be resolved. (The default and no-argument-atom
arms of translate_expr's switch still return true; the patch
only newly fails on the four conditions named above.) BPF consumers already treat a
missing func2pc entry as "skip uprobe" (see attach_uprobe /
fill_map_hprobes in src/osdtrace.cc and
src/radostrace.cc), so the erase composes safely
with downstream code.

Issue 1 — uninitialized expr in translate_param_location

src/dwarf_parser.cc#L220-L241

Dwarf_Op *expr;
size_t len;
int r = dwarf_getlocation_addr(&loc_attr, pc, &expr, &len, 1);
if (r != 1 || len <= 0) {
  cerr << "Get param location expr failed for symbol " << symbol << endl;
}
VarLocation varloc;
translate_expr(fb_attr, expr, pc, varloc);   // expr is uninitialized
return varloc;                                // when r != 1

find_param upstream has the same shape:

Dwarf_Die DwarfParser::find_param(Dwarf_Die *func, string symbol) {
  Dwarf_Die vardie;
  dwarf_getscopevar(func, 1, symbol.c_str(), 0, NULL, 0, 0, &vardie);
  return vardie;   // vardie is uninitialized when dwarf_getscopevar fails
}

So when the symbol is not found, the chain hands an uninitialized
vardie (and downstream an uninitialized loc_attr / expr) into
translate_expr and the rest of handle_function.

Issue 2 — NULL cfa_ops in translate_expr for DW_OP_call_frame_cfa

src/dwarf_parser.cc#L623-L645

If both cfi_debug and cfi_eh fail to produce a frame for pc,
cfa_ops stays NULL but the recursive call still dereferences it:

if (cfi_debug != NULL) { ... }
if (cfa_ops == NULL) {
  dwarf_cfi_addrframe(cfi_eh, pc, &frame);   // no NULL check on cfi_eh
  ...
}
translate_expr(fb_attr, cfa_ops, pc, varloc);   // cfa_ops may be NULL

Issue 3 — uninitialized fb_expr in translate_expr for DW_OP_fbreg

src/dwarf_parser.cc#L610-L621

Dwarf_Op *fb_expr;
size_t fb_exprlen;
int res = dwarf_getlocation_addr(fb_attr, pc, &fb_expr, &fb_exprlen, 1);
if (res != 1) {
  cerr << "translate_expr get fb_expr failed" << endl;
}
translate_expr(fb_attr, fb_expr, pc, varloc);   // fb_expr uninitialized

fb_attr itself can be NULL here (translate_fields passes NULL
through), which the baseline also does not check.

Reproduction environment

  • Host: juju model maas-default:embedded-dwarf-test, machine 0
    (node-12, 10.0.0.112, Noble 24.04)
  • Target: Ceph 19.2.3 from the Ubuntu archive, with the matching
    ceph-osd-dbgsym, librados2-dbgsym, librbd1-dbgsym
  • Build deps installed: g++, clang, libelf-dev, libc6-dev,
    libc6-dev-i386, libdw-dev, libssl-dev, make
  • cephtrace baseline checked out at a8793a9 (main HEAD); the patched
    variant is fetched as origin/fix-dwarf-parser-uninit-deref

Step 1 — Natural reproduction (happy path, baseline vs patched)

Both baseline and patched builds were run with the real probe
definitions against Ceph 19.2.3 + matching dbgsym:

# baseline
$ sudo ./osdtrace.baseline        -x -j /tmp/base_osd.json    --skip-version-check
$ sudo ./radostrace.baseline         -j /tmp/base_rados.json  --skip-version-check

# patched (osdtrace)
$ sudo ./osdtrace.patched_natural -x -j /tmp/patched_nat.json --skip-version-check
baseline (osdtrace) patched (osdtrace)
exit 0 0
stderr — bug-branch messages none none
stderr — Skip probe messages none none
process functions take 948.631 ms 879.419 ms
output JSON size 30 253 bytes 30 253 bytes

Neither build emits the bug-branch cerr messages
(Get param location expr failed, translate_expr get fb_expr failed,
dwarf_frame_cfa add ... failed); the failure-only branches the patch
newly guards are never entered. The two JSONs are equal in size,
confirming the patch does not regress the happy path. radostrace.baseline
also exits 0 cleanly; the patched radostrace was not separately
rebuilt because it shares the same dwarf_parser code path and the
osdtrace comparison already covers it.

Step 2 — Deterministic reproduction (probe mismatch)

The realistic trigger for Issue 1 is "cephtrace's probe spec does not
match the target binary" — exactly what happens when cephtrace is
pointed at an unsupported Ceph version (renamed parameter, changed
signature, etc.). This is modeled by swapping one probe variable name
in osd_probes:

sed -i 's/{"op", "px", "request", "header", "type"}/{"ZZNONEXISTENT", "px", "request", "header", "type"}/' \
    src/osdtrace.cc

The rest of src/osdtrace.cc is untouched. The string ZZNONEXISTENT
is not a real parameter of OSD::enqueue_op / OSD::dequeue_op, so
dwarf_getscopevar will return a "not found" sentinel for those two
probes. Both variants are then rebuilt and run with identical CLI
invocation:

sudo ./osdtrace.repro_<variant> -x -j /tmp/repro.json --skip-version-check

Baseline result — infinite loop

PID    ETIME    %CPU  CMD
3359   30:59    67.3  osdtrace.repro_baseline -x -j /tmp/repro_base.json ...
3663   15:42    36.5  osdtrace.repro_baseline -x -j /tmp/repro_base.json ...
3796    9:52    29.0  osdtrace.repro_baseline -x -j /tmp/repro_base.json ...
3899    4:41    24.7  osdtrace.repro_baseline -x -j /tmp/repro_base.json ...

-rw-rw-r-- 1 ubuntu ubuntu 28350637170 /tmp/repro_base.err
                       ^^^^^^^^^^^^^^ 28 GB

$ tail /tmp/repro_base.err
unexpected type px
unexpected type px
unexpected type px
unexpected type px
unexpected type px
...

Four invocations were live, each spinning between 24 % and 67 % CPU and
having never produced an output JSON. stderr was 28 GB of the single
line unexpected type px repeated forever. The processes had to be
killed with pkill -9 to recover the disk.

The exact code path:

  1. find_param calls dwarf_getscopevar("ZZNONEXISTENT", ...) which
    returns the not-found sentinel and leaves vardie uninitialized.

  2. Baseline find_param ignores the return value and hands the
    uninitialized vardie back to translate_param_location.

  3. dwarf_attr_integrate(&vardie, DW_AT_location, &loc_attr) reads
    garbage from vardie but happens to survive — the stack slot held
    a value that does not segfault on this run.

  4. dwarf_getlocation_addr returns r != 1; baseline prints the
    warning and falls through to translate_expr with an
    uninitialized expr pointer. Again the run survives by luck.

  5. Back in handle_function, dwarf_die_type(&vardie, &typedie)
    walks the still-garbage vardie and populates typedie with more
    garbage.

  6. translate_fields is called with that garbage typedie.
    dwarf_tag(typedie) returns a value that matches none of the
    switch cases, so control reaches the default: arm in
    translate_fields:

    default:
      clog << "unexpected type " << fields[i] << endl;
      break;

    Neither i nor typedie is updated, so the outer
    while (i < fields.size()) never makes progress — fields[1] == "px"
    is reprinted forever.

So Issue 1's missing failure propagation handed garbage data to
translate_fields, which then tripped a separate, latent
infinite-loop hazard in translate_fields (the missing return in
that switch's non-matching arms). The patch did not touch that
switch, but it prevents the upstream garbage from ever reaching it.
Fixing the translate_fields switch itself (so it cannot loop on
an unknown dwarf_tag from any source) is a separate follow-up.

Patched result — graceful skip

Same probe tampering, same command, only dwarf_parser.h / .cc
swapped for the patched version:

$ git show origin/fix-dwarf-parser-uninit-deref:src/dwarf_parser.h > src/dwarf_parser.h
$ git show origin/fix-dwarf-parser-uninit-deref:src/dwarf_parser.cc > src/dwarf_parser.cc
$ make osdtrace
$ sudo ./osdtrace.repro_patched -x -j /tmp/repro_patched.json --skip-version-check
EXIT=0

# stderr (10 lines total):
Start to parse ceph dwarf info
Found executable ceph-osd at: /usr/bin/ceph-osd
Start to parse dwarf info
Couldn't find parameter ZZNONEXISTENT
Skip probe on OSD::dequeue_op: parameter ZZNONEXISTENT unresolved
Couldn't find parameter ZZNONEXISTENT
Skip probe on OSD::enqueue_op: parameter ZZNONEXISTENT unresolved
process functions take 937.834
Detected package version: 19.2.3-0ubuntu0.24.04.3
Dwarf parsing data exported to /tmp/repro_patched.json

# output JSON written normally:
-rw-r--r-- 1 root root 22710 /tmp/repro_patched.json

find_param now reports failure; translate_param_location returns
false; handle_function erases the unresolved function from both
func2pc and func2vf and returns 0 for that function. The other
probes finish normally and the JSON is written.

The reported time is from process functions take 937.834 in
dwarf_parser.cc#L763-L765, which
computes (clock() diff) / CLOCKS_PER_SEC * 1000, i.e. milliseconds.
So the parsing phase finished in roughly 0.94 s on this run.

Side-by-side

baseline (a8793a9) patched (fix-dwarf-parser-uninit-deref)
exit hangs (had to pkill -9) 0, clean exit
process-functions phase unbounded (>30 min observed) ~0.94 s (process functions take 937.834 ms)
stderr size 28 GB <5 KB
stderr last line repeated unexpected type px (no looping output)
output JSON not written written (22 710 bytes)
other probes processed none (stuck before that point) all processed normally
host impact 4 stuck processes, 28 GB disk none

Conclusion

  • The patch correctly contains Issue 1 by propagating failure from
    find_param and translate_param_location through
    handle_function, which erases the unresolved function probe.
  • Because that erase prevents the garbage vardie / typedie chain
    from ever reaching translate_fields, the patch also incidentally
    neutralizes the previously unpatched translate_fields
    infinite-loop hazard for this failure mode.
  • Step 1 confirms the happy path is unchanged: baseline and patched
    produce equal-size JSONs (30 253 bytes) and neither hits the
    failure-only branches.
  • Issue 2 (DW_OP_call_frame_cfa with no usable CFI) and Issue 3
    (DW_OP_fbreg with a failing dwarf_getlocation_addr) were not
    observed in this environment, because Ceph 19.2.3's debug info
    provides usable .eh_frame and frame_base location lists. The
    patch's NULL / uninitialized guards on those code paths are
    reviewed by inspection only; no regression observed.
  • The translate_fields infinite-loop hazard observed during
    baseline repro is not fixed by this patch; this patch only
    prevents the upstream garbage from ever reaching that switch. A
    dedicated fix to the translate_fields switch is left for a
    separate follow-up.

translate_param_location and translate_expr could dereference
uninitialized or null pointers when elfutils location lookups fail.
Make find_param, translate_param_location and translate_expr report
failure, and drop the function probe when a parameter is unresolved.

Signed-off-by: Seyeong Kim <seyeong.kim@canonical.com>
@xtrusia xtrusia changed the title Fix uninitialized pointer use in dwarf_parser Fix uninitialized / NULL pointer use in dwarf_parser May 20, 2026
@taodd taodd requested a review from Copilot May 20, 2026 04:58
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR hardens the DWARF parsing pipeline by adding explicit failure propagation for previously unchecked lookup / location-expression / frame-base / CFA paths, preventing uninitialized or NULL-pointer usage during probe generation. It also changes probe handling so that unresolved parameters cause the affected function probe to be dropped from the exported maps.

Changes:

  • Change find_param, translate_param_location, and translate_expr to return bool and propagate failures up the call chain.
  • Add defensive checks for missing DW_AT_location, empty/invalid location expressions, missing frame base for DW_OP_fbreg, and unresolved CFA for DW_OP_call_frame_cfa.
  • In handle_function, skip the entire function probe (erase from func2pc/func2vf) when a parameter cannot be resolved.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
src/dwarf_parser.h Updates method signatures to return bool and add out-params for safe failure propagation.
src/dwarf_parser.cc Implements the new bool-returning APIs, adds NULL/uninitialized guards, and erases probes when parameter resolution fails.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/dwarf_parser.cc
Comment on lines 418 to +422
VarLocation varloc;
translate_expr(NULL, expr, pc, varloc);
if (!translate_expr(NULL, expr, pc, varloc)) {
clog << "failed to translate location of " << fields[i] << endl;
return;
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants