Prerequisites
Proposal
Hey guys
Very cool blog post!
I think it's cool to see how much thought and effort goes into making great tools, especially on the linux (/ huge diffs) end
Personally, I experienced some crashes (OOMs) when trying it on ginormous diffs
This got me wondering if there was any easy win on the matter, and I tried a few things that I'd like to share and discuss
1. Detaching parsed strings
Detaching makes sense on its own, but I don't understand why it is currently done line by line?
As I understand it, detaching only needs to break the reference from a parsed line back to the giant patch buffer so the GC can reclaim it. That reference can be broken at any granularity, so copying each file's text once is enough, and every line then slices into that small per-file copy instead of the whole patch
I tried exactly that on this branch: it produces a byte-identical parsed model (same file/line/char counts + content hash), but replaces ~22.8M detach calls with ~77k, about 300× fewer (on linux v6...v7: 76,872 files, 22.8M lines). Fewer allocations and GC pauses, so the parse step gets noticeably faster (how much depends on the CPU, since detachString does an encode/decode round-trip whose per-call cost varies a lot between machines)
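A minimal sketch of the per-file idea, for illustration only (this is not the code from the branch). `detachText` and `splitFileLines` are hypothetical names, and the encode/decode round-trip is just an assumption that mirrors the round-trip described for detachString:

```typescript
// Hypothetical helper: force a fresh copy of `text` that shares no memory
// with the string it was cut from, via a UTF-8 encode/decode round-trip.
// Caveat: a lone surrogate would come back as U+FFFD.
function detachText(text: string): string {
  const bytes = new TextEncoder().encode(text);
  return new TextDecoder("utf-8", { ignoreBOM: true }).decode(bytes);
}

// Hypothetical per-file parse step: detach once, then slice every line out
// of the small per-file copy rather than out of the whole patch.
function splitFileLines(patch: string, start: number, end: number): string[] {
  const fileText = detachText(patch.slice(start, end)); // one detach per file
  const lines: string[] = [];
  let lineStart = 0;
  while (lineStart < fileText.length) {
    const newline = fileText.indexOf("\n", lineStart);
    const lineEnd = newline === -1 ? fileText.length : newline + 1;
    // A slice may keep `fileText` alive, but never the giant `patch`.
    lines.push(fileText.slice(lineStart, lineEnd));
    lineStart = lineEnd;
  }
  return lines;
}

const samplePatch = "+first line\n-second line\n+third\n context line";
const sampleFileLines = splitFileLines(samplePatch, 0, samplePatch.length);
console.log(sampleFileLines.length); // 4
```

The trade-off is that a single surviving line can keep its whole file's text alive, which is still far smaller than the full patch.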
2. Storing line content as a byte arena
Detaching frees the source patch, but the parsed line copies are still millions of individual js strings, and that's where a big part of the remaining memory goes:
- per-object overhead: even after detaching, every line is still its own v8 string object (tens of millions of them), each carrying V8's fixed per-object header (~16+ bytes: map pointer, hash, length) plus pointer/offset fields, on top of its actual bytes
- UTF-16 inflation: a js string flips to 2 bytes/char the moment it holds one character above U+00FF (e.g. an emoji or CJK glyph). detaching per file makes a whole file one string, so one such char taints the entire file. UTF-8 keeps ASCII at 1 byte and only spends extra on those chars
- heap fragmentation: churning tens of millions of small, varied-size allocations leaves the GC heap full of little holes between still-live objects. V8 can't hand a memory page back to the OS unless the entire page is free, so the process keeps far more RSS resident than the live data actually needs, and a GC doesn't fix it (it's also why this shows up in real browser RSS but barely in Node's (or Bun in my case) cleaner allocator)
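To put rough numbers on the UTF-16 point, here is a small sketch. `estimatedStringBytes` is a hypothetical, simplified model of V8's one-byte vs two-byte string storage (it ignores object headers and any internal string representations), so treat the output as an estimate:

```typescript
// Simplified model: 1 byte per UTF-16 code unit while every unit fits in
// Latin-1 (<= 0xFF), otherwise 2 bytes per unit for the whole string.
function estimatedStringBytes(text: string): number {
  for (let i = 0; i < text.length; i++) {
    if (text.charCodeAt(i) > 0xff) return text.length * 2;
  }
  return text.length;
}

// Actual UTF-8 size of the same text.
function utf8Bytes(text: string): number {
  return new TextEncoder().encode(text).length;
}

const asciiFile = "x".repeat(10_000);
// One emoji: 2 UTF-16 code units, 4 UTF-8 bytes.
const taintedFile = asciiFile + "\u{1F600}";

console.log(estimatedStringBytes(asciiFile), utf8Bytes(asciiFile));     // 10000 10000
console.log(estimatedStringBytes(taintedFile), utf8Bytes(taintedFile)); // 20004 10004
```

Under this model a single emoji roughly doubles the string's footprint, while the UTF-8 size only grows by the 4 bytes of the emoji itself.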
So I tried storing each file's lines as a single UTF-8 byte arena (Uint8Array), plus an Int32Array of per-line byte offsets so a line is just decode(bytes[offset[i]..offset[i+1]]), decoded on-demand, instead of a string[] (branch). Interestingly this makes fix #1 above redundant -> the arena copies each line into its own buffer, which already detaches it from the patch, so this branch doesn't bother detaching at all. (The deployed demo is just this + a dockerfile, ignore the docker stuff, it's just something the clanker made for me to deploy)
It's important to use bytes and not chars: ASCII stays at 1 byte/char, a single wide character never taints a whole file to UTF-16, and the data lives off the V8 heap.
Two small bits about this: the decoder uses ignoreBOM so a line that genuinely starts with a U+FEFF byte-order mark isn't silently stripped, and it keeps a plain string[] fallback for the rare files with a lone surrogate that wouldn't survive a UTF-8 round-trip
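A minimal sketch of that layout, assuming a plain data shape. `LineArena`, `buildArena`, `readLine` and `hasLoneSurrogate` are hypothetical names, not the branch's API, and a real version would probably fill the buffer while streaming instead of encoding every line up front:

```typescript
// Hypothetical shape: one UTF-8 buffer per file plus line boundaries.
interface LineArena {
  bytes: Uint8Array;   // all lines of the file, concatenated as UTF-8
  offsets: Int32Array; // offsets[i]..offsets[i + 1] is line i, in bytes
}

const utf8Encoder = new TextEncoder();
// ignoreBOM: true keeps a leading U+FEFF in the decoded line.
const utf8Decoder = new TextDecoder("utf-8", { ignoreBOM: true });

// True when `text` holds a lone surrogate, which UTF-8 cannot round-trip.
function hasLoneSurrogate(text: string): boolean {
  for (let i = 0; i < text.length; i++) {
    const unit = text.charCodeAt(i);
    if (unit >= 0xd800 && unit <= 0xdbff) {
      const next = text.charCodeAt(i + 1); // NaN past the end
      if (next >= 0xdc00 && next <= 0xdfff) { i++; continue; }
      return true; // high surrogate without a low one
    }
    if (unit >= 0xdc00 && unit <= 0xdfff) return true; // stray low surrogate
  }
  return false;
}

// Returns null when the caller should fall back to a plain string[].
function buildArena(lines: string[]): LineArena | null {
  if (lines.some(hasLoneSurrogate)) return null;
  const encoded = lines.map((line) => utf8Encoder.encode(line));
  const offsets = new Int32Array(lines.length + 1);
  encoded.forEach((chunk, i) => { offsets[i + 1] = offsets[i] + chunk.length; });
  const bytes = new Uint8Array(offsets[lines.length]);
  encoded.forEach((chunk, i) => bytes.set(chunk, offsets[i]));
  return { bytes, offsets };
}

// Decode one line on demand; subarray() is a view, not a copy.
function readLine(arena: LineArena, index: number): string {
  const view = arena.bytes.subarray(arena.offsets[index], arena.offsets[index + 1]);
  return utf8Decoder.decode(view);
}

const sampleLines = ["+hello\n", "\uFEFFbom line\n", "+caf\u00e9 \u{1F600}\n"];
const sampleArena = buildArena(sampleLines);
if (sampleArena !== null) {
  console.log(readLine(sampleArena, 0) === sampleLines[0]); // true
}
```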
I deployed my forked main (for baseline comparison) and a byte-arena branch on the same server, to compare on identical hardware:
Results
Measuring baseline vs perf on the linux v6...v7 compare (isolated Chrome, reading the renderer's OS-level RSS; M2 MacBook Air 16gb; averaged over 3 runs):
| metric | baseline | byte arena | Δ |
| --- | --- | --- | --- |
| peak renderer RSS | 2752 MB | 1911 MB | −31% |
| retained RSS (after forced GC)* | 1688 MB | 849 MB | −50% |
* we can force GC in devtools > memory > 🗑️
The ranges across runs never overlapped: the worst byte-arena run beats the best baseline run. Parsed output is byte-identical
You can see the same thing live in Chrome's Task Manager (window -> task manager -> memory footprint; force a GC first):

The top tab is the perf branch (2.1 GB footprint, ~454 MB JS), the one right under is the baseline (3.4 GB, ~690 MB JS). However, usedJSHeapSize (and the devtools "JavaScript VM instance") under-counts this and can even read higher with the change, because the bytes move off the V8 heap into the Uint8Array. The correct number is the total renderer footprint shown here
These two changes have some drawbacks though:
- I don't like that the new line container (LineList) is a class, because a class instance is fragile across every serialization boundary. Structured clone (postMessage to the highlight worker), IndexedDB, structuredClone(), any cache -> they all drop the prototype, so you have to manually revive it at each boundary or it silently breaks (this actually bit me as a blank render when scrolling past the virtualization in the worker path until I added a revive()). A plain { bytes, offsets } data object + free functions would probably get the same memory win without that, and could probably also keep the API non-breaking
- it changes a public type: additionLines/deletionLines go from string[] to a LineList (x[i] → x.get(i), x[i] = j → x.set(i, j), etc), which is breaking for current users
- per-read decode costs microseconds, and only for the visible (virtualized) lines. Small diffs (≤ 500 lines) see a few µs of extra parse time. This could very probably be avoided entirely with a size threshold
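To illustrate the serialization point, here is a small sketch comparing a class wrapper with a plain data object across structuredClone(). `LineListClass`, `PlainLines` and `lineAt` are hypothetical stand-ins, not the types from the branch, and it assumes a runtime where structuredClone is available as a global:

```typescript
const sketchDecoder = new TextDecoder("utf-8", { ignoreBOM: true });

// Class flavour: behaviour lives on the prototype.
class LineListClass {
  constructor(readonly bytes: Uint8Array, readonly offsets: Int32Array) {}
  get(index: number): string {
    return sketchDecoder.decode(
      this.bytes.subarray(this.offsets[index], this.offsets[index + 1]),
    );
  }
}

// Plain flavour: data only, behaviour lives in a free function.
interface PlainLines {
  bytes: Uint8Array;
  offsets: Int32Array;
}

function lineAt(data: PlainLines, index: number): string {
  return sketchDecoder.decode(
    data.bytes.subarray(data.offsets[index], data.offsets[index + 1]),
  );
}

const demoBytes = new TextEncoder().encode("+a\n-b\n");
const demoOffsets = Int32Array.of(0, 3, 6);

// Structured clone copies own properties but not the prototype.
const clonedInstance = structuredClone(new LineListClass(demoBytes, demoOffsets));
const clonedData = structuredClone({ bytes: demoBytes, offsets: demoOffsets });

console.log(clonedInstance instanceof LineListClass); // false: get() is gone
console.log(lineAt(clonedData, 1) === "-b\n");        // true: still readable
```

The cloned class instance still carries its bytes and offsets, so a revive() step can rebuild it, but the plain object needs no such step at any boundary.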
Btw, the clanker helped to prototype parts of these, so just take this as an experiment. If you'd like I can tidy things up to make a proper version later
So yeah, just sharing thoughts and the experiment, feel free to try it and see how it goes for you!
More raw numbers (isolated Chrome over CDP, OS-level RSS, local builds)
Same measurement as above, against local builds instead of the deployments. Linux peak RSS is noisy run-to-run (the stream is long and the peak lands at different GC moments), so the retained-after-GC row is the stabler measure; every run in both setups points the same way
linux v6.0...v7.0 (678 MB patch, 76,872 files, ~22.8M lines):
| metric | baseline | byte arena | Δ |
| --- | --- | --- | --- |
| peak renderer RSS | 2436 MB | 1258 MB | −48% |
| retained RSS (after forced GC) | 1482 MB | 627 MB | −58% |
| stream + parse span | 62.1 s | 54.3 s | −13% |
bun v1.2.15...v1.3.14* (76 MB patch, ~2M lines):
| metric | baseline | byte arena | Δ |
| --- | --- | --- | --- |
| peak renderer RSS | 744 MB | 616 MB | −17% |
| retained RSS (after forced GC) | 557 MB | 468 MB | −16% |
| stream + parse span | 5.7 s | 5.1 s | −10% |
*: why this specifically? bun PR from the landing page is big but I like them bigger
(the "span" includes the throttled download, identical for both arms, so it understates the parse-CPU difference)
Motivation and context
Context: I want to discuss possible improvements to diffs and ultimately diffshub.com. I'm not sure if issues is the place for this, since the contributing guidelines don't say where discussions should take place. Lmk if we should move, happy to transfer