Feature: Improve huge-diff performance using byte arenas (#760)

### Prerequisites

- [x] I have [searched](https://github.com/pierrecomputer/pierre/issues?q=is%3Aissue) for duplicate or closed feature requests
- [x] I have read the [contributing guidelines](https://github.com/pierrecomputer/pierre/blob/main/.github/CONTRIBUTING.md)

### Proposal

Hey guys

Very cool [blog post](https://pierre.computer/writing/on-rendering-diffs)!

It's cool to see how much thought and effort goes into making great tools, especially on the Linux (/ huge diffs) end

Personally, I experienced some crashes (OOMs) when trying it on ginormous diffs

This got me wondering if there was any easy win on the matter, and I tried a few things that I'd like to share and discuss

## 1. Detaching parsed strings

Detaching makes sense on its own, but I don't understand why it is currently done [line by line](https://github.com/pierrecomputer/pierre/blob/main/packages/diffs/src/utils/parsePatchFiles.ts#L784-L787)?

As I understand it, detaching only needs to break the reference from a parsed line back to the giant patch buffer so the GC can reclaim it. That reference can be broken at any granularity, so copying each _file's_ text once is enough, and every line then slices into that small per-file copy instead of the whole patch

I tried exactly that on [this branch](https://github.com/pierrecomputer/pierre/compare/main...clemg:pierre:clemg/detach-string-per-file): it produces a byte-identical parsed model (same file/line/char counts + content hash), but replaces ~22.8M detach calls with ~77k, about 300× fewer (on `linux v6...v7`: 76,872 files, 22.8M lines). Fewer allocations and GC pauses, so the parse step gets noticeably faster (how much depends on the CPU, since `detachString` does an encode/decode round-trip whose per-call cost varies a lot between machines)
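To make the per-file idea concrete, here is a minimal sketch. It is not the code from the branch: `detach`, `FileSpan`, and `linesPerFile` are hypothetical names, and it assumes the file boundaries inside the patch are already known.

```typescript
const encoder = new TextEncoder();
const decoder = new TextDecoder("utf-8", { ignoreBOM: true });

// Hypothetical helper: build a fresh string that shares no memory with its parent.
// An encode/decode round-trip is one way to force the copy; it assumes `s` has no
// lone surrogates (those would come back as U+FFFD).
function detach(s: string): string {
  return decoder.decode(encoder.encode(s));
}

// Assumed shape: char offsets of one file's text inside the patch.
interface FileSpan {
  start: number; // index of the file's first char
  end: number; // index one past the file's last char
}

// Detach once per file, then cut lines out of the small per-file copy.
function linesPerFile(patch: string, files: FileSpan[]): string[][] {
  return files.map(({ start, end }) => {
    const fileText = detach(patch.slice(start, end)); // one copy per file
    // Lines derived from fileText can at worst retain the per-file copy,
    // never the whole patch.
    return fileText.split("\n");
  });
}
```

The number of `detach` calls then scales with the file count instead of the line count, which is where the ~77k vs ~22.8M figure comes from.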

## 2. Storing line content as a byte arena

Detaching frees the source patch, but the parsed line copies are still millions of individual JS strings, and that's where a big part of the remaining memory goes:

- per-object overhead: even after detaching, every line is still its own V8 string object (tens of millions of them), each carrying V8's fixed per-object header (~16+ bytes: map pointer, hash, length) plus pointer/offset fields, on top of its actual bytes
- UTF-16 inflation: a JS string flips to 2 bytes/char the moment it holds one character above U+00FF (e.g. an emoji or CJK glyph). Detaching per file makes a whole file one string, so one such char taints the entire file. UTF-8 keeps ASCII at 1 byte and only spends extra on those chars
- heap fragmentation: churning tens of millions of small, varied-size allocations leaves the GC heap full of little holes between still-live objects. V8 can't hand a memory page back to the OS unless the _entire_ page is free, so the process keeps far more RSS resident than the live data actually needs, and a GC doesn't fix it (it's also why this shows up in real browser RSS but barely under Node's, or in my case Bun's, cleaner allocator)
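The UTF-16 inflation point is easy to check with some arithmetic. The sketch below only estimates the payload of a flat string (headers, padding, and rope/sliced representations are ignored), and `estimatedV8PayloadBytes` and `utf8Bytes` are hypothetical helper names:

```typescript
// Rough payload estimate for a flat V8 string: 1 byte per code unit if every
// code unit is <= 0xFF (one-byte representation), else 2 bytes per code unit.
// This is an estimate under stated assumptions, not a heap measurement.
function estimatedV8PayloadBytes(s: string): number {
  for (let i = 0; i < s.length; i++) {
    if (s.charCodeAt(i) > 0xff) return s.length * 2;
  }
  return s.length;
}

// Size of the same text when stored as UTF-8 bytes.
function utf8Bytes(s: string): number {
  return new TextEncoder().encode(s).length;
}

const ascii = "x".repeat(1000);
const tainted = ascii + "中"; // one CJK glyph (U+4E2D) appended

console.log(estimatedV8PayloadBytes(ascii)); // 1000
console.log(estimatedV8PayloadBytes(tainted)); // 2002: the whole string doubles
console.log(utf8Bytes(tainted)); // 1003: only the glyph costs extra (3 bytes)
```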

So I tried storing each file's lines as a single UTF-8 byte arena (`Uint8Array`), plus an `Int32Array` of per-line byte offsets, so a line is just `decode(bytes[offset[i]..offset[i+1]])`, decoded on demand, instead of a `string[]` ([branch](https://github.com/pierrecomputer/pierre/compare/main...clemg:pierre:clemg/diff-line-list-byte-arena)). Interestingly, this makes fix #1 above redundant: the arena copies each line into its own buffer, which already detaches it from the patch, so this branch doesn't bother detaching at all. (The deployed demo is just this + a Dockerfile; ignore the Docker stuff, it's just clanker-made so I could deploy)
It's important to use bytes and not chars: ASCII stays at 1 byte/char, a single non-Latin-1 char never taints a whole file to UTF-16, and the data lives off the V8 heap.
Two small bits about this: the decoder uses `ignoreBOM` so a line that genuinely starts with a U+FEFF byte-order mark isn't silently stripped, and it keeps a plain `string[]` fallback for the rare files with a lone surrogate that wouldn't survive a UTF-8 round-trip
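Roughly, the layout described above looks like this. It's a simplified sketch and not the branch's actual code: `LineArena`, `buildArena`, `getLine`, and `hasLoneSurrogate` are illustrative names, and a real parser would fill the arena while streaming instead of from a ready-made `string[]`.

```typescript
// Hypothetical shape of the per-file arena.
interface LineArena {
  bytes: Uint8Array; // all lines of one file, UTF-8 encoded, back to back
  offsets: Int32Array; // offsets[i]..offsets[i + 1] is the byte range of line i
}

const encoder = new TextEncoder();
// ignoreBOM: true keeps a leading U+FEFF in the decoded output instead of stripping it.
const decoder = new TextDecoder("utf-8", { ignoreBOM: true });

// True if `s` contains a surrogate code unit without its partner.
function hasLoneSurrogate(s: string): boolean {
  for (let i = 0; i < s.length; i++) {
    const c = s.charCodeAt(i);
    if (c >= 0xd800 && c <= 0xdbff) {
      const next = i + 1 < s.length ? s.charCodeAt(i + 1) : 0;
      if (next >= 0xdc00 && next <= 0xdfff) {
        i++; // valid pair, skip the low surrogate
        continue;
      }
      return true; // high surrogate with no low surrogate after it
    }
    if (c >= 0xdc00 && c <= 0xdfff) return true; // stray low surrogate
  }
  return false;
}

// Returns an arena, or the original string[] when some line would not
// survive a UTF-8 round-trip (the fallback path).
function buildArena(lines: string[]): LineArena | string[] {
  if (lines.some(hasLoneSurrogate)) return lines;
  const encoded = lines.map((line) => encoder.encode(line));
  const offsets = new Int32Array(lines.length + 1);
  encoded.forEach((chunk, i) => {
    offsets[i + 1] = offsets[i] + chunk.length;
  });
  const bytes = new Uint8Array(offsets[lines.length]);
  encoded.forEach((chunk, i) => bytes.set(chunk, offsets[i]));
  return { bytes, offsets };
}

// Decode a single line on demand; subarray() is a view, so nothing is copied first.
function getLine(arena: LineArena, i: number): string {
  return decoder.decode(arena.bytes.subarray(arena.offsets[i], arena.offsets[i + 1]));
}
```

With this shape a file costs two typed arrays regardless of its line count, and only the lines that actually get read are ever materialized as strings.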

I deployed my forked `main` (for baseline comparison) and a byte-arena branch on the same server, to compare on identical hardware:

- main baseline: [https://diffshub-baseline.clemg.fr](https://diffshub-baseline.clemg.fr)
- perf branch (byte arena): [https://diffshub-perf.clemg.fr](https://diffshub-perf.clemg.fr)

## Results

Measuring baseline vs perf on the `linux v6...v7` compare (isolated Chrome, reading the renderer's OS-level RSS; M2 MacBook Air, 16 GB; averaged over 3 runs):

| metric                           | baseline | byte arena |    Δ |
| -------------------------------- | -------: | ---------: | ---: |
| peak renderer RSS                |  2752 MB |    1911 MB | −31% |
| retained RSS (after forced GC)\* |  1688 MB |     849 MB | −50% |

##### \* we can force GC in devtools > memory > 🗑️

The ranges across runs never overlapped: the worst byte-arena run beats the best baseline run. Parsed output is byte-identical

You can see the same thing live in Chrome's Task Manager (window -> task manager -> memory footprint; force a GC first):

![Chrome Task Manager: perf tab 2.1 GB / 454 MB JS vs baseline tab 3.4 GB / 690 MB JS](https://github.com/user-attachments/assets/22ca741d-5840-4039-91d3-aaf1d123a703)

The top tab is the perf branch (2.1 GB footprint, ~454 MB JS), the one right under is the baseline (3.4 GB, ~690 MB JS). However, `usedJSHeapSize` (and the devtools "JavaScript VM instance") _under-counts_ this and can even read higher with the change, because the bytes move off the V8 heap into the `Uint8Array`. The correct number is the total renderer footprint shown here

These two changes have some drawbacks though:

- I don't like that it's a class, because a class instance is fragile across every serialization boundary. Structured clone (`postMessage` to the highlight worker), IndexedDB, `structuredClone()`, any cache: they all drop the prototype, so you have to manually revive it at each boundary or it silently breaks (this actually bit me as a blank render when scrolling past the virtualization in the worker path until I added a `revive()`). A plain `{ bytes, offsets }` data object + free functions would probably get the same memory win without that, and could probably also keep the API non-breaking
- it changes a public type: `additionLines`/`deletionLines` go from `string[]` to a `LineList` (`x[i]` → `x.get(i)`, `x[i] = j` → `x.set(i, j)`, etc), which is breaking for current users
- per-read decode costs microseconds, and only for the visible (virtualized) lines. Small diffs (≤ 500 lines) see a few µs of extra parse time. This could most likely be sidestepped with a size threshold, so small diffs just keep a plain `string[]`
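The class fragility from the first bullet can be reproduced in a few lines. This is a sketch with made-up names (`LineListClass` stands in for the prototype's class, `LineData` and `lineAt` for the plain-data alternative); it needs a runtime with a global `structuredClone`.

```typescript
const lineDecoder = new TextDecoder("utf-8", { ignoreBOM: true });

// Hypothetical stand-in for the class-based line list.
class LineListClass {
  constructor(public bytes: Uint8Array, public offsets: Int32Array) {}
  get(i: number): string {
    return lineDecoder.decode(this.bytes.subarray(this.offsets[i], this.offsets[i + 1]));
  }
}

// Plain-data alternative: just fields, with behavior in a free function.
interface LineData {
  bytes: Uint8Array;
  offsets: Int32Array;
}

function lineAt(data: LineData, i: number): string {
  return lineDecoder.decode(data.bytes.subarray(data.offsets[i], data.offsets[i + 1]));
}

const bytes = new TextEncoder().encode("ab"); // two one-byte lines: "a" and "b"
const offsets = Int32Array.from([0, 1, 2]);

// Structured clone copies the own fields but not the prototype.
const clonedInstance = structuredClone(new LineListClass(bytes, offsets));
const clonedData = structuredClone<LineData>({ bytes, offsets });

console.log(clonedInstance instanceof LineListClass); // false: .get() is gone
console.log(lineAt(clonedData, 1)); // "b": plain data still works after the clone
```

The plain-data version has nothing to revive after `postMessage` or IndexedDB, since all of its behavior lives in the function and not on the object.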

Btw, the clanker helped to prototype parts of these, so just take this as an experiment. If you'd like I can tidy things up to make a proper version later

So yeah, just sharing thoughts and the experiment, feel free to try it and see how it goes for you!

---

<details>
<summary>More raw numbers (isolated Chrome over CDP, OS-level RSS, local builds)</summary>

Same measurement as above, against local builds instead of the deployments. Linux peak RSS is noisy run-to-run (the stream is long and the peak lands at different GC moments), so the retained-after-GC row is the stabler measure; every run in both setups points the same way

**`linux v6.0...v7.0`** (678 MB patch, 76,872 files, ~22.8M lines):

| metric                         | baseline | byte arena |    Δ |
| ------------------------------ | -------: | ---------: | ---: |
| peak renderer RSS              |  2436 MB |    1258 MB | −48% |
| retained RSS (after forced GC) |  1482 MB |     627 MB | −58% |
| stream + parse span            |   62.1 s |     54.3 s | −13% |

**`bun v1.2.15...v1.3.14`**\* (76 MB patch, ~2M lines):

| metric                         | baseline | byte arena |    Δ |
| ------------------------------ | -------: | ---------: | ---: |
| peak renderer RSS              |   744 MB |     616 MB | −17% |
| retained RSS (after forced GC) |   557 MB |     468 MB | −16% |
| stream + parse span            |    5.7 s |      5.1 s | −10% |

##### \*: why this specifically? bun PR from the landing page is big but I like them bigger

(the "span" includes the throttled download, identical for both arms, so it understates the parse-CPU difference)

</details>

### Motivation and context

Context: I want to discuss possible improvements to diffs and ultimately diffshub.com. I'm not sure issues are the right place for this, since the contributing guidelines don't say where discussions should take place. Lmk if we should move, happy to transfer
