WIP: v0.4 — DO NOT MERGE (tracking) by mathieu17g · Pull Request #76 · JuliaData/XML.jl

mathieu17g · 2026-06-22T17:44:29Z

WIP — do not merge. Tracking + collaboration PR for the v0.4 line (supersedes #54).
v0.4-dev = #54's rewrite + the cursor stack + ongoing work.

Highlights

A ground-up rewrite of XML.jl's internals initiated by @joshday — a token-based streaming parser, a pull/cursor (StAX-style) API, XPath support, and substantial parse/read speedups.

Done so far

Still ahead

Performance — confirm and document the parse/read speedup (refresh the README benchmarks; re-measure a representative downstream against the final v0.4) before quoting final figures.
Downstream compatibility (XLSX first) — XLSX is the most prominent dependent needing code changes. Its v0.4 adaptation already exists as @TimG1964's cursor-xml-optimisation branch (an independent effort; @joshday's #361 was an earlier exploration), and its full test suite passes against this v0.4-dev. What remains is to open it as an XLSX PR and wire that branch into this PR's downstream CI — replacing the current red-by-design check against the registered XLSX (pinned to XML 0.3) — so XLSX's own suite runs against this XML 0.4 on every push here. Other broadly-used dependents may get the same treatment as we go.
Register v0.4; close #54 (WIP XML.jl v0.4: Rewrite of internals, streaming tokenizer, XPath support, and bug fixes) and #58 (perf: avoid per-call ctx allocation in next_no_xml_space) as subsumed.

Issues addressed

XML character references are not unescaped/escaped #17 — XML character references are now unescaped/escaped
XPath support #30 — XPath support
Inconsistent type for attributes where nodes have no attributes #33 — Inconsistent type for attributes where nodes have no attributes
Simple XML.write followed by XML.parse fails #35 — Simple XML.write followed by XML.parse no longer fails
get not defined to match getindex #50 — get defined to match getindex
Question: Why the choice not to escape & to &amp; ? #52 — escape now unconditionally escapes '&'
Incorrect unescape result. #53 — Incorrect unescape result (double-unescaping)

Breaking changes & impact on dependent packages

The low-level streaming API (Raw, next/prev, single-argument parent/depth, nodes_equal, escape!/unescape!) is removed in favour of the token parser and cursor API. The high-level DOM API is largely preserved, but note some structural and behavioural changes:

Node is now parametric (Node{S}); attributes is a Vector{Pair} and children may be nothing.
parse/read decode entities into values, so value() returns &, not &amp;.
write auto-escapes text and attributes (double-escape risk if you pre-escape).
Duplicate attributes now error.
Parsing is stricter by default (:structural); a non-document input such as a standalone DTD file now needs wellformed = :lenient.
Inter-element whitespace is preserved as Text nodes, so children(root)[1] is not necessarily the first child element.

Dependents that only use the high-level DOM mostly need a [compat] bump to 0.4 plus a spot-check of those behavioural changes; those using the removed low-level API (notably XLSX.jl) need code changes. See MIGRATING_TO_v0.4.md for the full upgrade guide.

Performance

A preliminary measurement puts parse(_, Node) at roughly 64–84% faster than 0.3.x on v0.4-dev (corpus-dependent; the spread tracks how much inter-element whitespace a document carries, since v0.4 preserves it where 0.3.x dropped it). These figures are being confirmed and documented before they're quoted as final (see Still ahead). Cursor retains the forward-streaming speed of the old LazyNode walk; LazyNode trades speed for roughly an order-of-magnitude lower resident memory.

Drops the underscore prefixes from internal names (module is unexported, the clutter was only needed back when these names leaked into XML.jl). Replaces the name-byte predicate with a 256-entry const lookup table. Also fixes a 1-based indexing off-by-one in read_doctype_body: the '<!--' detection guarded with `pos >= 2` while reading `codeunit(data, pos - 2)`, which is codeunit 0 when pos == 2. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
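A minimal sketch of the two changes in this commit, using only Base. The table contents (ASCII letters, digits, `_`, `:`, `-`, `.`, and every byte >= 0x80) and the names `IS_NAME_BYTE`, `is_name_byte`, and `two_back` are illustrative assumptions, not the package's actual definitions.

```julia
# Hypothetical 256-entry lookup table: entry b + 1 answers "may byte b appear in a name?".
const IS_NAME_BYTE = let table = fill(false, 256)
    for r in (UInt8('a'):UInt8('z'), UInt8('A'):UInt8('Z'), UInt8('0'):UInt8('9'), 0x80:0xff)
        for b in r
            table[b + 1] = true   # arrays are 1-based, so byte 0x00 lives at index 1
        end
    end
    for c in "_:-."
        table[UInt8(c) + 1] = true
    end
    table
end

is_name_byte(b::UInt8) = IS_NAME_BYTE[b + 1]

# Reading codeunit(data, pos - 2) is only valid when pos - 2 >= 1, i.e. pos >= 3.
# A `pos >= 2` guard lets pos == 2 through and asks for code unit 0.
function two_back(data::AbstractString, pos::Int)
    pos >= 3 || return nothing
    return codeunit(data, pos - 2)
end
```

With the corrected guard, `two_back("<!--", 2)` returns `nothing` instead of attempting an out-of-bounds read.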

tag, value, keys, and attributes on LazyNode now return SubString{String} views into the source rather than allocating fresh Strings, so traversing a large document lazily does not duplicate its text data. Introduces a small _as_substring helper to promote the String that `unescape` can return into a SubString so Attributes stays homogeneous. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

_write_xml now inspects children before reformatting: if any Text child has non-whitespace content (or any CData child exists), the element is treated as mixed content and its whitespace is preserved verbatim. Otherwise the writer drops the whitespace-only Text nodes the parser emits for round-tripping source formatting and generates fresh indentation. Same filter is applied at the Document level. Also adds an unescape(::SubString{String}) specialization that returns the input unchanged when it contains no '&', avoiding an allocation on the lazy scanning path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The medium-file workloads show a ~10–25% regression vs the numbers captured at 4a728ee ("Revamp benchmarks"). v0.4-vs-v0.3.8 remains a 70–80% improvement, so this is a post-release follow-up, not a release blocker. Suspected culprit is the eager Pair{S,S}[] alloc per TOKEN_OPEN_TAG introduced in 2f71f9a — see follow-up issue. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

mathieu17g · 2026-06-24T20:58:39Z

What landed in `0001d6e..26a47d7` — parser correctness + test harness

Well-formedness enforcement — new wellformed = :lenient | :structural | :strict option (024ce24, 1aa1d9f)
The parser was accepting many ill-formed documents — multiple roots, non-whitespace text outside the root, empty/invalid names, and at the content level -- inside comments, empty/invalid PI targets, and out-of-range character references. It now rejects them, with the level selectable and :structural the default; :strict adds the content-level checks. The level is a Val type parameter, so a mode never pays for checks above its level — :lenient runs none at all.
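The following sketch illustrates how a `Val` level parameter can select checks by dispatch, so that a lower level compiles to a no-op. The function names (`check_roots`, `check_comment`, `check_document`) and the error messages are hypothetical; only the three level symbols come from the description above.

```julia
# Hypothetical check functions, dispatched on the well-formedness level.
check_roots(::Val{:lenient}, nroots::Int) = nothing        # :lenient runs no checks
function check_roots(::Val, nroots::Int)                    # :structural and :strict
    nroots > 1 && throw(ArgumentError("multiple root elements"))
    return nothing
end

check_comment(::Val, content::AbstractString) = nothing     # below :strict: no content checks
function check_comment(::Val{:strict}, content::AbstractString)
    occursin("--", content) && throw(ArgumentError("\"--\" inside a comment"))
    return nothing
end

# The keyword value is lifted into the type domain once, at the entry point.
function check_document(nroots::Int, comments; wellformed::Symbol = :structural)
    level = Val(wellformed)
    check_roots(level, nroots)
    foreach(c -> check_comment(level, c), comments)
    return true
end
```

Because the more specific `Val{:lenient}` and `Val{:strict}` methods win dispatch, each level only ever calls the checks written for it.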

Non-ASCII names (0001d6e)
<café>, <日本語/>, and non-ASCII attribute/PI/DTD names threw before — the name-byte table was ASCII-only, and a few token slices/accessors used byte arithmetic that broke mid-multibyte-character. They parse now. (Two @test_broken cases in the borrowed suites started passing once this landed and were promoted to real assertions.)

Test-harness hardening (ac86c55, f20bc4f, bfddb19)

The borrowed libxml2 suite (~240 cases) was in the repo but not included — it ran zero assertions. Now wired in.
The W3C conformance suite only @warned on mismatches, so it stayed green no matter what. It now asserts: every well-formed doc must parse, plus a no-regression floor on rejected ill-formed docs.
The W3C suite now runs at :strict — 577/577 valid docs still parse (zero false-positives from the new checks) and not-well-formed rejections rose 156 → 169.

BOM handling + regression guards (26a47d7)
UTF-16 without a BOM now raises a clear not well-formed (XML 1.0 §4.3.3) error instead of a cryptic downstream failure; regression tests pin BOM decoding, that error, and escape on SubString (#60).

- CHANGELOG: Added/Changed/Deprecated/Removed/Fixed for the v0.4 line, with issue/PR links
- MIGRATING_TO_v0.4.md: 0.3 -> 0.4 migration guide (removed APIs, structural + behavioral changes, stricter-by-default parsing, reader guidance)
- README: link to the migration guide
- LazyNode docstring: memory/speed trade-offs + when to use Node / Cursor / LazyNode

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

mathieu17g · 2026-06-27T14:38:18Z

Breaking changes documented + migration guide added (71c8d7b)

CHANGELOG.md — full Added / Changed / Deprecated / Removed / Fixed for the v0.4 line, with issue/PR links.
MIGRATING_TO_v0.4.md (new) — the 0.3 → 0.4 upgrade path for dependents.

unescape ran two replace passes: numeric character references first, then the five predefined entities. A numeric reference resolving to '&' (e.g. &#38;) was then re-read by the second pass as the start of &amp; and decoded a second time, so unescape("&#38;amp;") returned "&" instead of "&amp;": silent data loss, and a violation of unescape's own "processed exactly once" docstring. Resolve every reference in a single replace over one combined regex, so a substituted character is never re-scanned. Affects all three readers (Node/LazyNode/Cursor), which decode parsed content through unescape. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
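A minimal sketch of the difference between the two-pass and single-pass approaches, using only Base. The regex, the `PREDEFINED` table, and the function names are assumptions for illustration and are not taken from the package source; out-of-range code points are not handled here.

```julia
const PREDEFINED = Dict("amp" => "&", "lt" => "<", "gt" => ">", "quot" => "\"", "apos" => "'")

# Decode one matched reference such as "&#38;", "&#x26;" or "&amp;".
function decode_reference(m::AbstractString)
    body = m[2:end-1]                      # drop the ASCII '&' and ';'
    if startswith(body, "#x")
        return string(Char(parse(UInt32, body[3:end]; base = 16)))
    elseif startswith(body, "#")
        return string(Char(parse(UInt32, body[2:end])))
    else
        return PREDEFINED[String(body)]
    end
end

# Buggy shape: the output of pass 1 is scanned again by pass 2.
function unescape_two_pass(s::AbstractString)
    s = replace(s, r"&#x[0-9A-Fa-f]+;|&#[0-9]+;" => decode_reference)
    return replace(s, r"&(amp|lt|gt|quot|apos);" => decode_reference)
end

# Fixed shape: one combined regex, so substituted text is never re-scanned.
unescape_once(s::AbstractString) =
    replace(s, r"&(#x[0-9A-Fa-f]+|#[0-9]+|amp|lt|gt|quot|apos);" => decode_reference)
```

On the input `"&#38;amp;"` the two-pass version yields `"&"` while the single-pass version yields `"&amp;"`.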

parse(_, Node) dropped a leading U+FEFF (BOM-as-character) from an in-memory string via _drop_bom, but parse(_, LazyNode) and Cursor(::AbstractString) did not. The same BOM'd string therefore parsed to an Element root through Node yet surfaced the BOM as a leading Text node through LazyNode and Cursor, leaving the three readers disagreeing on identical input. Add a type-preserving _drop_bom(::AbstractString) and apply the strip at the LazyNode and Cursor string entry points too. The generic method keeps the cursor's string-view type (SubString today, StringView when mmap lands), so the parametric Cursor constructor stays type-stable; the subtree primitive Cursor(data, startpos) is untouched, since it walks from mid-document. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The :strict "--"-in-comment check (XML 1.0 §2.5) only inspected the comment's COMMENT_CONTENT token. For <!-- foo --->, the tokenizer scans to the first "-->", so the content token is " foo -" and the offending "--" straddles the content's last character and the close delimiter, invisible to a content-only occursin("--", ...). So <!-- foo ---> was wrongly accepted at :strict. Also reject when the content token ends with "-", which (because content runs up to the first "-->") happens exactly when the source is "...--->". A "-" not abutting the close stays well-formed. One more W3C not-wf comment document is now correctly rejected at :strict (170/940, was 169). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
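The straddling case can be reproduced with a few lines of Base Julia. This is a sketch under stated assumptions: `comment_content` mimics the "scan to the first `-->`" behaviour described above, and both function names are hypothetical.

```julia
# Extract the content of a comment the way the description says the tokenizer does:
# everything between "<!--" and the first "-->".
function comment_content(src::AbstractString)
    startswith(src, "<!--") || throw(ArgumentError("not a comment"))
    r = findnext("-->", src, 5)
    r === nothing && throw(ArgumentError("unterminated comment"))
    return SubString(src, 5, prevind(src, first(r)))
end

# Content-only check (the old behaviour) versus content plus trailing-'-' check.
content_only_ok(content) = !occursin("--", content)
strict_comment_ok(content) = !occursin("--", content) && !endswith(content, "-")
```

For a source of `<!-- foo --->` the content is `" foo -"`, which passes the content-only check but fails the stricter one; a lone `-` in the middle of the content passes both.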

XML 1.0 §3.1 AttValue ::= '"' ([^<&"] | Reference)* '"' forbids a raw '<' in an attribute value; it must be written &lt;. The tokenizer reads the value verbatim up to the closing quote, so a literal '<' was silently accepted at every well-formedness level, even :strict, while a raw '<' in text is rejected (it opens a tag), an asymmetry §3.1 does not permit. Reject a literal '<' in an attribute value at :structural and :strict (checked on the raw, pre-decode value, so the &lt; reference is unaffected); :lenient still accepts it. '>' remains allowed, as §3.1 permits. Three more W3C not-wf documents are now correctly rejected at :strict (173/940, was 170). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

XML 1.0 §2.1 places the DOCTYPE in the prolog: at most one, before the root element ([1] document ::= prolog element Misc*; Misc and content exclude doctypedecl). v0.4 accepted a DOCTYPE after the root, a second DOCTYPE, and a DOCTYPE nested in element content at every level including :strict, because _check_document_wellformed only counted roots and top-level text and the DOCTYPE_CONTENT branch pushed a DTD node with no placement guard. At :structural/:strict, reject a DOCTYPE that follows the root or repeats (checked in _check_document_wellformed over the ordered top-level children) and one nested in element content (guarded at the DOCTYPE_CONTENT branch via the children_stack depth). A single prolog DOCTYPE is unaffected; :lenient opts out. Seven more W3C not-wf documents are now correctly rejected at :strict (180/940, was 173). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

XML 1.0 §2.1 requires exactly one root element. v0.4's :structural/:strict checked for at-most-one root (multi-root) and non-whitespace top-level text, but not for the absence of a root: a document of only prolog markup — a comment, an XML/PI declaration, or a DOCTYPE — parsed without error, though it is not well-formed (libxml2: "no root element"). Reject nroots == 0 when the top level contains any markup (Comment / PI / Declaration / CData / DTD). Truly-empty ("") and whitespace-only input stay lenient (an empty Document), preserving the documented empty-input behavior; :lenient opts out entirely. complex_dtd.xml is a prolog + internal-subset DOCTYPE with no root, i.e. a DTD fixture rather than a well-formed document, so its structure test now reads it with :lenient. One more W3C not-wf document is correctly rejected at :strict (181/940, was 180). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

When a comment/CDATA/PI/DOCTYPE open token consumed through end-of-input ("<!--", "<![CDATA[", "<?pi", "<!DOCTYPE"), the tokenizer's top-of-iterate EOF check returned `nothing` before the body reader could run. So the Node parser silently accepted the truncated construct (an empty Document, even at :strict), and the lazy readers' value() — which re-tokenizes on access — indexed the `nothing` and threw an opaque MethodError (getindex(::Nothing, ::Int64)). End the token stream on EOF only when not mid content-body: in M_COMMENT / M_CDATA / M_PI / M_DOCTYPE, fall through to the body reader so it raises the precise "unterminated ..." ArgumentError. M_DEFAULT (clean end) and the tag modes (which emit a partial token the parser rejects as an unclosed tag) keep returning `nothing`, so the tokenizer's partial-token behavior for "<div" / "</div" is unchanged. Closes both the silent-accept and the opaque MethodError. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

_normalize_bom reinterprets the post-BOM bytes of a UTF-16 stream as UInt16 code units. An odd byte count (a truncated stream) made reinterpret throw a cryptic "cannot reinterpret an UInt8 array to UInt16" ArgumentError. Guard both the LE and BE arms with an isodd check raising a clear "malformed UTF-16: odd number of bytes after the BOM (XML 1.0 §4.3.3)" error, mirroring the existing UTF-16-without-a-BOM message. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
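A sketch of the guard for the little-endian arm, using only Base. `decode_utf16le` is a hypothetical stand-in for the relevant part of `_normalize_bom`, and the bytes passed in are assumed to be those that follow the BOM.

```julia
# Decode UTF-16LE bytes (BOM already removed) into a String.
function decode_utf16le(bytes::Vector{UInt8})
    # Without this guard, reinterpret throws a far less helpful ArgumentError.
    isodd(length(bytes)) &&
        throw(ArgumentError("malformed UTF-16: odd number of bytes after the BOM (XML 1.0 §4.3.3)"))
    units = ltoh.(reinterpret(UInt16, bytes))   # little-endian code units to host order
    return transcode(String, units)
end
```

A truncated stream such as `UInt8[0x68, 0x00, 0x69]` now fails with the explicit message rather than the reinterpret error.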

skip_quoted was the only input-error path in the tokenizer throwing a bare ErrorException ("Unterminated quoted string") with no position context; every other tokenizer error goes through err(msg, pos), which raises an ArgumentError carrying "XML tokenizer error at position N: ...". Code that catches ArgumentError to handle tokenizer errors (e.g. an unterminated quoted string in a DOCTYPE internal subset, skip_quoted's sole caller) would miss this one path. Switch it to err("unterminated quoted string", pos). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

XML 1.0 §2.6 treats the whitespace right after the PITarget as a separator that is not part of the content, but content then runs verbatim up to "?>", so trailing whitespace IS content. The three readers used strip() on the PI content token, dropping trailing whitespace too: <?target hello ?> read back as "hello", and write∘parse was lossy. (DTD content already used lstrip.) Use lstrip at the three PI-content sites (Node parser, LazyNode, Cursor): the leading separator is still removed, trailing whitespace is kept, and whitespace-only / empty content still collapses to nothing. Round-trip faithful. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
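The effect of swapping `strip` for `lstrip` can be seen on the content token alone. In this sketch the two helper names are hypothetical, and the token `" hello "` is what follows the target in `<?target hello ?>` up to the closing `?>`.

```julia
# Old behaviour: both ends trimmed, so trailing whitespace is lost.
function pi_content_strip(token::AbstractString)
    s = strip(token)
    return isempty(s) ? nothing : s
end

# New behaviour: only the leading separator is removed.
function pi_content_lstrip(token::AbstractString)
    s = lstrip(token)
    return isempty(s) ? nothing : s
end
```

Whitespace-only and empty tokens still collapse to `nothing` under both versions, matching the commit message.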

The NodeType constructors accepted any string as a tag/target, so Element(""), Element("1bad"), Element("a b"), and Element("a<b") all built nodes that serialize to malformed XML (</>, <1bad/>, <a b/>, <a<b/>) — XML the package's own parser rejects (parse("<1bad/>") throws while Element("1bad") did not). Validate the name against XML 1.0 §2.3 (a NameStartChar followed by NameChars, using the tokenizer's lenient per-character rule, so non-ASCII and namespaced names like "café"/"ns:item" still construct). An invalid name now raises a clear error, so a constructed node round-trips to the same node through write/parse. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
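A simplified sketch of a per-character name check in the spirit of this commit. The exact rule the tokenizer applies is not shown in this PR, so the character classes below (letters, `_`, `:`, any non-ASCII character to start; plus digits, `-`, `.` afterwards) are an assumption, as are the function names.

```julia
# Assumed lenient rule: any non-ASCII character is allowed in a name.
is_name_start(c::Char) = isletter(c) || c == '_' || c == ':' || !isascii(c)
is_name_char(c::Char) = is_name_start(c) || isdigit(c) || c == '-' || c == '.'

function is_valid_name(s::AbstractString)
    isempty(s) && return false
    return is_name_start(first(s)) && all(is_name_char, Iterators.drop(s, 1))
end
```

Under this rule the four malformed names from the commit message are rejected, while `"café"` and `"ns:item"` are accepted.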

in_close_tag was write-only: initialized false, set true after a CLOSE_TAG, and reset in the TAG_CLOSE branch, but never read to emit a token or take a branch. The TAG_CLOSE branch existed only to perform that reset, so dropping the variable lets the `>` delimiter token fall through harmlessly. No behavior change (covered by the existing parse tests). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…nsupported axes

Two cleanups in xpath():
- Remove dead scaffolding: `root = node.nodetype === Document ? node : node` (identical ternary arms) and an if/else whose branches were both `current = Node{S}[node]`. Collapsed to the single equivalent, with a note that an absolute path from a non-Document node is evaluated relative to that node (Node has no parent link to re-anchor at the document root).
- Fail fast on unsupported axes. The name lexer sweeps ':' into a name, so an axis specifier like child:: / descendant:: / following-sibling:: lexed as a single literal element name and silently matched nothing (child::a returned 0 where the equivalent `a` returns 2). A name containing "::" now raises a clear "unsupported XPath axis" error; a single ':' (a namespaced name like ns:item) is unaffected.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The XPath quick-start bound `root = doc[end]`, but the example document's trailing newline parses to a Text node as the document's last child, so doc[end] was that Text node, not <root> — every one of the four example queries returned empty. Bind `root = doc[1]` (the root element; the leading newline does not produce a Text node). A test pins the documented example and the doc[end] pitfall so it can't silently rot again. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

A Comment/CData/ProcessingInstruction value containing "-->", "]]>", or "?>" serialized to XML that re-parses split into multiple nodes, which is silent corruption: write(Comment("a-->b")) == "<!--a-->b-->", which re-parses as two children, not one comment. The parser never yields such content, so this only arises from hand-construction. Reject it at construction, completing the constructor-name validation from the previous commit: a constructed node can no longer serialize to malformed XML. DTD is excluded (its internal subset legitimately contains '>'); the individual delimiter characters (e.g. "a-b->c", "a]b]>c") remain valid content. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

XML 1.0 §2.8 requires the XML declaration (<?xml ...?>) to be the very first thing in the document. The tokenizer keys XML_DECL on spelling alone, so a <?xml?> after content, a repeated one, or one nested in an element all became Declaration nodes and parsed without error at every level. At :structural/:strict, reject a Declaration that is not the first top-level child (checked in _check_document_wellformed — this also catches a duplicate, since the second is never first) and one nested in element content (guarded at the XML_DECL_CLOSE branch via children_stack depth), mirroring the DOCTYPE placement checks. A leading <?xml?> followed by the root is unaffected; :lenient opts out. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

:strict range-checked character references (&#...;) against XML 1.0 §2.2 but never the raw characters themselves, so a literal NUL or other C0 control passed even at :strict while its &#...; reference form was rejected: a reference-vs-raw asymmetry. Add a _check_chars_strict scan over the five content kinds (text, attribute value, comment, CDATA, PI content); it rejects a raw character outside the §2.2 Char range with a clear "U+XXXX outside the legal XML range" error. The scan is gated on Val{W} === :strict (DCE'd off the :structural/:lenient paths, so they pay nothing). W3C not-wf rejections rise 188 -> 235; all 577 valid docs still parse. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
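The §2.2 production is `Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]`. A minimal sketch of a raw-character scan against it follows; `is_xml_char` and `first_illegal` are hypothetical names, not the package's `_check_chars_strict`, and the input is assumed to be valid UTF-8.

```julia
# True when c is inside the XML 1.0 §2.2 Char range.
function is_xml_char(c::Char)
    u = UInt32(c)
    return u == 0x9 || u == 0xA || u == 0xD ||
           0x20 <= u <= 0xD7FF || 0xE000 <= u <= 0xFFFD || 0x10000 <= u <= 0x10FFFF
end

# Index of the first character outside the range, or nothing if all are legal.
first_illegal(s::AbstractString) = findfirst(!is_xml_char, s)
```

A caller at the :strict level could turn a non-`nothing` result into the "U+XXXX outside the legal XML range" error described above.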

parse_dtd is a best-effort, non-validating DTD extractor, but it does not expand parameter-entity references (%name;) in declarations: _dtd_read_name can't advance over '%', so a content model like <!ELEMENT e %text;> surfaced an opaque internal BoundsError / position error — and it failed that way on the repo's own complex_dtd.xml fixture. Wrap the implementation: on a parse failure where the DTD contains a '%', raise a clear message explaining that parameter-entity references aren't expanded (and that the raw DTD node still round-trips). Other failures rethrow unchanged, and a PE-free DTD is unaffected. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@testset

runtests.jl's 46 top-level @testsets are each a *root* testset: a top-level testset throws at finish() when it has a failure or error, which aborts every later top-level testset — so the first failing testset hid all the rest. Push a single parent testset (`_ROOT_TS`) that the file's testsets (and the 7 now-wrapped include()s) nest under, then finish() it at the end: one aggregated summary, still exits 1 on failure. Each @testset stays a separate top-level statement, which avoids wrapping the whole ~3700-line body in one `@testset begin … end` block — that compiles as a single giant thunk and is pathologically slow (minutes vs. ~25s). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

… floor test_w3c.jl filters the suite to ENTITIES="none", but read the attribute with get(attrs, "ENTITIES", "") — so the 317 tests that *omit* the attribute (which the catalog DTD defaults to "none") were silently dropped from scope. Default it to "none", which surfaces them: 577->776 valid/invalid and 940->1257 not-wf in-scope tests. All 776 well-formed docs still parse at :strict (no regression). Update the stale no-regression floor on rejected not-wf docs from 169 to 367 — the current count, pushed up by the :strict raw-character-range and DOCTYPE / XML-declaration placement checks added in this fix loop. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
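The filter bug comes down to the default passed to `get`. In this sketch the three attribute dictionaries are made-up stand-ins for catalog entries; only the `ENTITIES` attribute and its `"none"` default come from the commit message.

```julia
# Made-up catalog entries; the second omits ENTITIES, which the DTD defaults to "none".
tests = [
    Dict("ID" => "t1", "ENTITIES" => "none"),
    Dict("ID" => "t2"),
    Dict("ID" => "t3", "ENTITIES" => "both"),
]

# Old filter: a missing attribute reads as "" and the test is silently dropped.
in_scope_old = filter(t -> get(t, "ENTITIES", "") == "none", tests)

# Fixed filter: a missing attribute reads as the DTD default, "none".
in_scope_new = filter(t -> get(t, "ENTITIES", "none") == "none", tests)
```

Here the old filter keeps only `t1`, while the fixed one keeps `t1` and `t2`.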

@attr

CHANGELOG, migration guide, and docstrings for the behavior changes this fix loop introduced plus the audit's document-disposition findings:
- Stricter default parsing: :structural now also rejects a rootless document, a literal '<' in an attribute value, and a misplaced/duplicate/nested DOCTYPE or XML declaration; :strict adds the comment trailing-'-' and raw §2.2 char-range checks. A standalone DTD file or prolog-only fragment now needs :lenient.
- Inter-element whitespace is preserved, so children(root)[1] may be a Text node.
- LazyNode/Cursor do not check well-formedness and take no wellformed kwarg.
- Node constructors validate names and reject content containing its own close delimiter.
- Document the parent()/depth()/siblings()/xpath-`..` value-identity limitation (immutable Node has no identity; a path-based redesign is deferred to a later release).
- Node{SubString{String}} returns raw (un-decoded) entities; prev! removed; the XPath @attr step returns Text nodes (not strings).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

mathieu17g · 2026-06-28T06:17:46Z

Pre-registration audit + fixes (`a49f091..c9df019`, 21 commits)

A pre-registration audit of the rewrite surfaced correctness and completeness gaps beyond the earlier fixes; all are now closed, test-first.

Well-formedness — measured against the pinned W3C suite at :strict: 367 / 1257 in-scope not-well-formed docs rejected, 776 / 776 well-formed parsed.

:structural now also rejects a document with no root element, a literal < in an attribute value, and a misplaced / duplicate / nested DOCTYPE or XML declaration.
:strict now also rejects -- (or a trailing -) in comments and any character — reference or raw literal — outside the XML 1.0 §2.2 range.

Correctness & robustness

unescape is single-pass: a numeric reference resolving to & is no longer re-scanned as a named entity (unescape("&#38;amp;") → "&amp;", not "&").
A leading U+FEFF BOM in an in-memory string is stripped consistently by Node, LazyNode, and Cursor.
Truncated input (unterminated comment / CDATA / PI / DOCTYPE) and odd-length UTF-16 now raise clear errors instead of being silently accepted or crashing a reader's value().
Node constructors validate names and reject content containing its own close delimiter, so a constructed node can no longer serialize to malformed XML.
XPath: unsupported axes (child::, descendant::, …) now error instead of silently returning the wrong result; the broken README example is fixed.
parse_dtd reports a clear error on parameter-entity references instead of an opaque internal failure.
Processing-instruction content keeps its trailing whitespace on round-trip.

Docs & test infra — migration guide / CHANGELOG / docstrings updated for the new behaviours; the suite now runs under one parent testset so a failure in one set no longer aborts the rest; a W3C-catalog filter bug that silently dropped 317 in-scope tests is fixed.

Deferred to post-registration — parent / depth / siblings and the XPath .. axis locate a node by value, so they can return a result for the wrong one of two value-identical siblings (immutable Node has no identity). Documented for v0.4; a path-based redesign is planned for a later release.

Performance is now measured: parse(_, Node) is 64–84% faster than 0.3.x (corpus-dependent).

The error wrapper raised "parameter-entity references not supported" on any literal '%' in the DTD value, masking the true cause for an already- invalid DTD (e.g. a '%' inside an entity value alongside a malformed NOTATION). Gate the message on an actual %name; reference and append the underlying error; otherwise rethrow the original unchanged. Surfaced by the fix-diff verification audit of 71c8d7b..c9df019. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…udit
- CHANGELOG: the constructor close-delimiter guard prevents node-splitting on write, not all malformed output (a Comment containing "--" still constructs and is caught only on re-parse at :strict), so soften the claim.
- src/XML.jl + test: clarify that "" yields an empty Document while a whitespace-only input yields a Document with one whitespace Text child; pin both with assertions.
- MIGRATING: note that :strict adds a character-range scan and is slower than :structural on text-heavy input; and that top-level CDATA in the prolog/epilog is accepted even at :strict (XML 1.0 Misc excludes it).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

mathieu17g · 2026-06-28T13:45:52Z

Two minor follow-ups:

337c124 — parse_dtd diagnostic. The "parameter-entity references not supported" message now fires only on a real %name; reference (was: any literal %) and appends the underlying error.
f9bbc65 — documentation accuracy. Softened the CHANGELOG constructor-validation note; clarified empty ("") vs whitespace-only parse results; documented two :strict traits (slower character scan on text-heavy input; top-level CDATA currently accepted).

Suite green (2519 tests).

TimG1964 · 2026-06-29T09:22:42Z

I have made a PR to XLSX that is wired up to this v0.4-dev branch so you should be good to confirm downstream tests now. Let me know if you need more.

… it's fast'
- # Benchmarks: a markdown table (XML.jl vs EzXML / LightXML / XMLDict) from the bare-machine benchmarks.jl run, dropping the misleading 'Read file' row (= parse + ~0.6 ms I/O); dated, with library versions.
- # How it's fast (new): the theory (lexer = DFA over a regular token grammar; nesting = visibly pushdown / nested word; 'optimal' = matches the asymptotic lower bounds) plus the by-regime decomposition (stream / full-DOM / stages) and the vs-0.3.9 comparison. Theory fact-checked, Julia internals verified empirically; concepts linked to Wikipedia/docs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

joshday and others added 30 commits March 5, 2026 09:34

Rewrite XML parser with tokenizer and XPath

6dacef3

remove dead code

97384c3

more test files

1844b16

Add validation tests and remove legacy DTD/raw code

b6f4d47

Update CI actions and add validation tests

21f647d

update ci

c673427

Add XMark benchmark generator and expand benchmarks

46c5a31

Add LazyNode type and StringViews extension

33bcf35

Refactor simple_value checks and use direct attrs iteration

d011424

Refactor tokenizer into XMLTokenizer and add LazyNode

754f8fa

Add benchmarks, StringViews tests, simplify XML module

8483fed

Add GC.gc before tmpfile cleanup for Windows

eb5caeb

Bump version to v0.4.0

b914bfe

Use mktempdir for temp file cleanup in StringViews tests

d76c484

Remove StringViews extension and simplify tokenizer

41836ae

Replace printstyled with print in show methods

b670267

Revamp benchmarks and expand test suite

4a728ee

Add Attributes type and performance optimizations

2f71f9a

Add sourcetext, write, eachchildnode for LazyNode

6c4e8f3

Namespace token kinds and document API

60725db

Add LazyNode perf APIs and XLSX-pattern benchmarks

9d129b8

Refresh XLSX-pattern benchmark snapshot

fb583c4

Add AbstractTrees package extension

cfc1f81

Use byte-level Base.write in XML serializer

895e994

Skip unescape scan when tokenizer saw no entities

ff84960

Use findnext for tokenizer text/attr scans

18d88b1

mathieu17g and others added 21 commits June 27, 2026 21:40

mathieu17g and others added 2 commits June 28, 2026 15:22

mathieu17g wants to merge 76 commits into main from v0.4-dev

4 participants
