WIP: v0.4 — DO NOT MERGE (tracking)#76

Draft
mathieu17g wants to merge 76 commits into main from v0.4-dev

mathieu17g commented Jun 22, 2026

Collaborator

WIP — do not merge. Tracking + collaboration PR for the v0.4 line (supersedes #54).
v0.4-dev = #54's rewrite + the cursor stack + ongoing work.

Highlights

A ground-up rewrite of XML.jl's internals initiated by @joshday — a token-based streaming parser, a pull/cursor (StAX-style) API, XPath support, and substantial parse/read speedups.

Done so far

  • Branch assembled: #54's rewrite (WIP XML.jl v0.4: Rewrite of internals, streaming tokenizer, XPath support, and bug fixes) + the cursor stack + main's 0.3.x fixes & infrastructure, combined by merging main in (merge-only, so the cursor branch stays usable for downstreams that build on it).
  • Non-ASCII names — element, attribute, PI, and DTD names with multi-byte UTF-8 (e.g. <café>, <日本語/>) now parse correctly.
  • Well-formedness enforcement: wellformed = :lenient | :structural | :strict (default :structural), a compile-time type parameter so a mode never pays for checks above its level (:lenient runs none at all). :structural rejects multiple roots, a document with no root, non-whitespace text outside the root, empty/invalid names, a literal < in an attribute value, and a misplaced/duplicate/nested DOCTYPE or XML declaration. :strict additionally rejects -- (or a trailing -) in comments, empty/invalid PI targets, and characters (references and raw literals) outside the XML 1.0 §2.2 range. Against the pinned W3C conformance suite at :strict this rejects 367 of 1257 in-scope not-well-formed documents while parsing all 776 well-formed ones (the rest need DTD/entity validation, out of scope for a non-validating parser).
  • Pre-registration audit + fixes — a pre-registration audit of the rewrite, fixed test-first. 21 commits close the surfaced gaps: a single-pass unescape (no double-decode of a numeric ref to &), consistent BOM stripping across all three readers, the well-formedness tightening above, clear errors on truncated input and odd-length UTF-16, Node-constructor name/content validation, processing-instruction whitespace fidelity, XPath hardening (unsupported axes now error), parse_dtd error quality, and dead-code cleanup. Findings dispositioned "document" went into the migration guide.
  • BOM handling — UTF-16 LE/BE and UTF-8 BOM decode on read; a leading BOM is stripped on parse (now by all three readers); UTF-16 without a BOM — which XML 1.0 §4.3.3 forbids — and odd-length UTF-16 raise clear errors instead of failing obscurely.
  • Test harness — the borrowed libxml2 suite (240+ cases) is now actually run (it was present but never included); the W3C conformance suite asserts (every well-formed doc must parse; a no-regression floor on rejected ill-formed docs) at :strict. The whole suite now runs under one parent testset so a failure in any one set no longer aborts the rest, and a W3C-catalog filter bug that silently dropped 317 in-scope tests is fixed.
  • Regression guards: escape on SubString (#60, "escape should work with AbstractString"), BOM decoding, and the UTF-16-without-BOM error are pinned by tests.
  • Breaking-changes documentation — the CHANGELOG [Unreleased] section carries the full Added / Changed / Deprecated / Removed / Fixed breakdown for the v0.4 line, and a standalone MIGRATING_TO_v0.4.md gives the 0.3 → 0.4 upgrade path: a removed-API → replacement table, the structural and behavioural changes (incl. the stricter defaults, preserved inter-element whitespace, and reading a standalone DTD/prolog-only input with :lenient), and which reader to use.
  • Julia floor raised to 1.10 (LTS).
  • Infrastructure carried from main — CI bumps, codecov, the W3C-suite cache, and the 0.3.9 CHANGELOG fold-in.

Still ahead

  • Performance — confirm and document the parse/read speedup (refresh the README benchmarks; re-measure a representative downstream against the final v0.4) before quoting final figures.
  • Downstream compatibility (XLSX first) — XLSX is the most prominent dependent needing code changes. Its v0.4 adaptation already exists as @TimG1964's cursor-xml-optimisation branch (an independent effort; @joshday's #361 was an earlier exploration), and its full test suite passes against this v0.4-dev. What remains is to open it as an XLSX PR and wire that branch into this PR's downstream CI — replacing the current red-by-design check against the registered XLSX (pinned to XML 0.3) — so XLSX's own suite runs against this XML 0.4 on every push here. Other broadly-used dependents may get the same treatment as we go.
  • Register v0.4; close #54 (WIP XML.jl v0.4: Rewrite of internals, streaming tokenizer, XPath support, and bug fixes) and #58 (perf: avoid per-call ctx allocation in next_no_xml_space) as subsumed.

Issues addressed

Breaking changes & impact on dependent packages

The low-level streaming API (Raw, next/prev, single-argument parent/depth, nodes_equal, escape!/unescape!) is removed in favour of the token parser and cursor API. The high-level DOM API is largely preserved, but note some structural and behavioural changes:

  • Node is now parametric (Node{S}); attributes is a Vector{Pair} and children may be nothing.
  • parse/read decode entities into values, so value() returns &, not &amp;.
  • write auto-escapes text and attributes (double-escape risk if you pre-escape).
  • Duplicate attributes now error.
  • Parsing is stricter by default (:structural); a non-document input such as a standalone DTD file now needs wellformed = :lenient.
  • Inter-element whitespace is preserved as Text nodes, so children(root)[1] is not necessarily the first child element.

Dependents that only use the high-level DOM mostly need a [compat] bump to 0.4 plus a spot-check of those behavioural changes; those using the removed low-level API (notably XLSX.jl) need code changes. See MIGRATING_TO_v0.4.md for the full upgrade guide.
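
The double-escape risk noted in the list above can be reproduced with a minimal, Base-only sketch. `escape_sketch` is a hypothetical stand-in for the escaping a writer applies, not XML.jl's own `escape`; it only assumes that the five predefined entities are substituted in one pass.

```julia
# Hypothetical stand-in for the escaping a writer applies; not XML.jl's `escape`.
# Multi-pair `replace` (Julia 1.7+) scans the input once, so an inserted '&'
# is not escaped again within the same call.
escape_sketch(s::AbstractString) = replace(s,
    "&" => "&amp;", "<" => "&lt;", ">" => "&gt;", "\"" => "&quot;", "'" => "&apos;")

raw   = "R&D < 5"
once  = escape_sketch(raw)    # what an auto-escaping writer emits for raw text
twice = escape_sketch(once)   # what it emits if the caller pre-escaped
```

Under these assumptions, text that a dependent escaped by hand under 0.3 would come out escaped twice once the writer escapes too, so the pre-escaping call is the thing to remove when migrating.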

Performance

A preliminary measurement puts parse(_, Node) at roughly 64–84% faster than 0.3.x on v0.4-dev (corpus-dependent; the spread tracks how much inter-element whitespace a document carries, since v0.4 preserves it where 0.3.x dropped it). These figures are being confirmed and documented before they're quoted as final (see Still ahead). Cursor retains the forward-streaming speed of the old LazyNode walk; LazyNode trades speed for roughly an order-of-magnitude lower resident memory.

joshday and others added 30 commits March 5, 2026 09:34
Drops the underscore prefixes from internal names (module is unexported,
the clutter was only needed back when these names leaked into XML.jl).
Replaces the name-byte predicate with a 256-entry const lookup table.

Also fixes a 1-based indexing off-by-one in read_doctype_body: the
'<!--' detection guarded with `pos >= 2` while reading
`codeunit(data, pos - 2)`, which is codeunit 0 when pos == 2.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
tag, value, keys, and attributes on LazyNode now return
SubString{String} views into the source rather than allocating
fresh Strings, so traversing a large document lazily does not
duplicate its text data.

Introduces a small _as_substring helper to promote the String that
`unescape` can return into a SubString so Attributes stays homogeneous.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
_write_xml now inspects children before reformatting: if any Text
child has non-whitespace content (or any CData child exists), the
element is treated as mixed content and its whitespace is preserved
verbatim. Otherwise the writer drops the whitespace-only Text nodes
the parser emits for round-tripping source formatting and generates
fresh indentation. Same filter is applied at the Document level.

Also adds an unescape(::SubString{String}) specialization that
returns the input unchanged when it contains no '&', avoiding an
allocation on the lazy scanning path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The medium-file workloads show a ~10–25% regression vs the numbers
captured at 4a728ee ("Revamp benchmarks"). v0.4-vs-v0.3.8 remains
a 70–80% improvement, so this is a post-release follow-up, not a
release blocker. Suspected culprit is the eager Pair{S,S}[] alloc
per TOKEN_OPEN_TAG introduced in 2f71f9a — see follow-up issue.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mathieu17g commented Jun 24, 2026

Collaborator Author

What landed in 0001d6e..26a47d7 — parser correctness + test harness

Well-formedness enforcement — new wellformed = :lenient | :structural | :strict option (024ce24, 1aa1d9f)
The parser was accepting many ill-formed documents — multiple roots, non-whitespace text outside the root, empty/invalid names, and at the content level -- inside comments, empty/invalid PI targets, and out-of-range character references. It now rejects them, with the level selectable and :structural the default; :strict adds the content-level checks. The level is a Val type parameter, so a mode never pays for checks above its level — :lenient runs none at all.
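
How a `Val` type parameter keeps a lower level from paying for a higher level's checks can be sketched with plain dispatch. The function names below are hypothetical and only illustrate the technique; they are not XML.jl's internals, and the two checks shown are a small subset of the ones the comment lists.

```julia
# Hypothetical check functions; names are illustrative, not XML.jl internals.
# The level lives in the type (Val{:lenient}, Val{:structural}, Val{:strict}),
# so method selection happens at compile time.

check_roots(::Val{:lenient}, nroots::Integer) = nothing   # :lenient runs no checks
function check_roots(::Val, nroots::Integer)              # :structural and :strict
    nroots == 1 || throw(ArgumentError("expected exactly one root element, found $nroots"))
    return nothing
end

check_comment(::Val, content::AbstractString) = nothing   # :lenient and :structural
function check_comment(::Val{:strict}, content::AbstractString)
    occursin("--", content) && throw(ArgumentError("\"--\" is not allowed inside a comment"))
    return nothing
end

# Runs whichever checks the given level selects; returns true if none throws.
function check_document(level::Val, nroots::Integer, comments)
    check_roots(level, nroots)
    foreach(c -> check_comment(level, c), comments)
    return true
end
```

For example, `check_document(Val(:structural), 1, ["a--b"])` passes because the comment check is a no-op at that level, while the same call with `Val(:strict)` throws.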

Non-ASCII names (0001d6e)
<café>, <日本語/>, and non-ASCII attribute/PI/DTD names threw before — the name-byte table was ASCII-only, and a few token slices/accessors used byte arithmetic that broke mid-multibyte-character. They parse now. (Two @test_broken cases in the borrowed suites started passing once this landed and were promoted to real assertions.)

Test-harness hardening (ac86c55, f20bc4f, bfddb19)

  • The borrowed libxml2 suite (~240 cases) was in the repo but not included — it ran zero assertions. Now wired in.
  • The W3C conformance suite only @warned on mismatches, so it stayed green no matter what. It now asserts: every well-formed doc must parse, plus a no-regression floor on rejected ill-formed docs.
  • The W3C suite now runs at :strict: 577/577 valid docs still parse (zero false positives from the new checks) and not-well-formed rejections rose 156 → 169.

BOM handling + regression guards (26a47d7)
UTF-16 without a BOM now raises a clear not well-formed (XML 1.0 §4.3.3) error instead of a cryptic downstream failure; regression tests pin BOM decoding, that error, and escape on SubString (#60).

- CHANGELOG: Added/Changed/Deprecated/Removed/Fixed for the v0.4 line, with issue/PR links
- MIGRATING_TO_v0.4.md: 0.3 -> 0.4 migration guide (removed APIs, structural + behavioral changes,
  stricter-by-default parsing, reader guidance)
- README: link to the migration guide
- LazyNode docstring: memory/speed trade-offs + when to use Node / Cursor / LazyNode

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@mathieu17g

Collaborator Author

Breaking changes documented + migration guide added (71c8d7b)

  • CHANGELOG.md — full Added / Changed / Deprecated / Removed / Fixed for the v0.4 line, with issue/PR links.
  • MIGRATING_TO_v0.4.md (new) — the 0.3 → 0.4 upgrade path for dependents.

mathieu17g and others added 21 commits June 27, 2026 21:40
unescape ran two replace passes: numeric character references first, then the
five predefined entities. A numeric reference resolving to '&' (e.g. &#38;)
was then re-read by the second pass as the start of &amp; and decoded a
second time, so unescape("&#38;amp;") returned "&" instead of "&amp;" — silent
data loss, and a violation of unescape's own "processed exactly once" docstring.

Resolve every reference in a single replace over one combined regex, so a
substituted character is never re-scanned. Affects all three readers
(Node/LazyNode/Cursor), which decode parsed content through unescape.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
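
The single-pass approach this commit describes can be sketched as follows. This is a minimal illustration, not the package's `unescape`: the names `REFERENCE`, `decode_reference`, and `unescape_once` are hypothetical, and the sketch does no range checking on numeric references.

```julia
# Hypothetical single-pass decoder; a sketch, not XML.jl's `unescape`.
const PREDEFINED = Dict("amp" => '&', "lt" => '<', "gt" => '>', "quot" => '"', "apos" => '\'')

# One alternation covers hex references, decimal references, and the five named entities.
const REFERENCE = r"&(?:#x([0-9A-Fa-f]+)|#([0-9]+)|(amp|lt|gt|quot|apos));"

# Decode one matched reference; exactly one capture group is non-nothing.
function decode_reference(ref::AbstractString)
    hex, dec, name = match(REFERENCE, ref).captures
    hex !== nothing && return string(Char(parse(UInt32, hex; base = 16)))
    dec !== nothing && return string(Char(parse(UInt32, dec)))
    return string(PREDEFINED[name])
end

# `replace` calls `decode_reference` once per match and never re-scans its output,
# so the '&' produced by "&#38;" cannot combine with a following "amp;".
unescape_once(s::AbstractString) = replace(s, REFERENCE => decode_reference)
```

With this shape, `unescape_once("&#38;amp;")` yields `"&amp;"`: the scan resumes after the substituted character, and the remaining `amp;` is not a reference on its own.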
parse(_, Node) dropped a leading U+FEFF (BOM-as-character) from an in-memory
string via _drop_bom, but parse(_, LazyNode) and Cursor(::AbstractString) did
not. The same BOM'd string therefore parsed to an Element root through Node yet
surfaced the BOM as a leading Text node through LazyNode and Cursor, leaving the
three readers disagreeing on identical input.

Add a type-preserving _drop_bom(::AbstractString) and apply the strip at the
LazyNode and Cursor string entry points too. The generic method keeps the
cursor's string-view type (SubString today, StringView when mmap lands), so the
parametric Cursor constructor stays type-stable; the subtree primitive
Cursor(data, startpos) is untouched, since it walks from mid-document.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
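
A BOM strip that returns the same string-view type whether or not a BOM is present might look like the sketch below. `drop_bom_sketch` is a hypothetical name, and returning a `SubString` in both branches is this sketch's assumption about how to keep the result type stable; the commit's `_drop_bom` may be written differently.

```julia
# Hypothetical sketch: strip a leading U+FEFF and always return a SubString view,
# so callers see one concrete result type regardless of the input's content.
function drop_bom_sketch(s::AbstractString)
    if startswith(s, '\ufeff')
        # Skip the first character (the BOM is 3 bytes in UTF-8, 1 character).
        return SubString(s, nextind(s, firstindex(s)))
    end
    return SubString(s)
end
```

Applied to both `"\ufeff<a/>"` and `"<a/>"`, this gives a view equal to `"<a/>"`, which is the agreement between readers that the commit is after.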
The :strict "--"-in-comment check (XML 1.0 §2.5) only inspected the comment's
COMMENT_CONTENT token. For <!-- foo --->, the tokenizer scans to the first
"-->", so the content token is " foo -" and the offending "--" straddles the
content's last character and the close delimiter — invisible to a content-only
occursin("--", ...). So <!-- foo ---> was wrongly accepted at :strict.

Also reject when the content token ends with "-", which (because content runs
up to the first "-->") happens exactly when the source is "...--->". A "-" not
abutting the close (e.g. <!-- a - b -->) stays well-formed. One more W3C not-wf
comment document is now correctly rejected at :strict (170/940, was 169).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
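
The straddling case is easy to reproduce outside the parser. The sketch below uses hypothetical helpers (`comment_content`, `strict_comment_ok`) that mimic the behaviour described in this commit: content runs up to the first `-->`, and the strict check looks at both `--` inside the content and a trailing `-`.

```julia
# Hypothetical helpers mirroring the described tokenizer behaviour; not XML.jl code.

# Content of a comment: everything between "<!--" and the first "-->".
function comment_content(src::AbstractString)
    startswith(src, "<!--") || throw(ArgumentError("not a comment"))
    close = findnext("-->", src, 5)
    close === nothing && throw(ArgumentError("unterminated comment"))
    return SubString(src, 5, prevind(src, first(close)))
end

# XML 1.0 §2.5: "--" must not occur in a comment. A trailing '-' means the
# source was "...--->", where the "--" straddles content and close delimiter.
strict_comment_ok(content::AbstractString) =
    !occursin("--", content) && !endswith(content, "-")
```

For `<!-- foo --->` the content token is `" foo -"`, which contains no `--` but ends with `-`, so only the second condition catches it.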
XML 1.0 §3.1 AttValue ::= '"' ([^<&"] | Reference)* '"' forbids a raw '<' in an
attribute value; it must be written &lt;. The tokenizer reads the value verbatim
up to the closing quote, so a literal '<' was silently accepted at every
well-formedness level, even :strict — while a raw '<' in text is rejected (it
opens a tag), an asymmetry §3.1 does not permit.

Reject a literal '<' in an attribute value at :structural and :strict (checked
on the raw, pre-decode value, so the &lt; reference is unaffected); :lenient
still accepts it. '>' remains allowed, as §3.1 permits. Three more W3C not-wf
documents are now correctly rejected at :strict (173/940, was 170).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
XML 1.0 §2.1 places the DOCTYPE in the prolog: at most one, before the root
element ([1] document ::= prolog element Misc*; Misc and content exclude
doctypedecl). v0.4 accepted a DOCTYPE after the root, a second DOCTYPE, and a
DOCTYPE nested in element content at every level including :strict, because
_check_document_wellformed only counted roots and top-level text and the
DOCTYPE_CONTENT branch pushed a DTD node with no placement guard.

At :structural/:strict, reject a DOCTYPE that follows the root or repeats
(checked in _check_document_wellformed over the ordered top-level children) and
one nested in element content (guarded at the DOCTYPE_CONTENT branch via the
children_stack depth). A single prolog DOCTYPE is unaffected; :lenient opts out.
Seven more W3C not-wf documents are now correctly rejected at :strict (180/940,
was 173).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
XML 1.0 §2.1 requires exactly one root element. v0.4's :structural/:strict
checked for at-most-one root (multi-root) and non-whitespace top-level text, but
not for the absence of a root: a document of only prolog markup — a comment, an
XML/PI declaration, or a DOCTYPE — parsed without error, though it is not
well-formed (libxml2: "no root element").

Reject nroots == 0 when the top level contains any markup (Comment / PI /
Declaration / CData / DTD). Truly-empty ("") and whitespace-only input stay
lenient (an empty Document), preserving the documented empty-input behavior;
:lenient opts out entirely.

complex_dtd.xml is a prolog + internal-subset DOCTYPE with no root, i.e. a DTD
fixture rather than a well-formed document, so its structure test now reads it
with :lenient. One more W3C not-wf document is correctly rejected at :strict
(181/940, was 180).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
When a comment/CDATA/PI/DOCTYPE open token consumed through end-of-input
("<!--", "<![CDATA[", "<?pi", "<!DOCTYPE"), the tokenizer's top-of-iterate EOF
check returned `nothing` before the body reader could run. So the Node parser
silently accepted the truncated construct (an empty Document, even at :strict),
and the lazy readers' value() — which re-tokenizes on access — indexed the
`nothing` and threw an opaque MethodError (getindex(::Nothing, ::Int64)).

End the token stream on EOF only when not mid content-body: in M_COMMENT /
M_CDATA / M_PI / M_DOCTYPE, fall through to the body reader so it raises the
precise "unterminated ..." ArgumentError. M_DEFAULT (clean end) and the tag
modes (which emit a partial token the parser rejects as an unclosed tag) keep
returning `nothing`, so the tokenizer's partial-token behavior for "<div" /
"</div" is unchanged. Closes both the silent-accept and the opaque MethodError.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
_normalize_bom reinterprets the post-BOM bytes of a UTF-16 stream as UInt16
code units. An odd byte count (a truncated stream) made reinterpret throw a
cryptic "cannot reinterpret an UInt8 array to UInt16" ArgumentError. Guard both
the LE and BE arms with an isodd check raising a clear "malformed UTF-16: odd
number of bytes after the BOM (XML 1.0 §4.3.3)" error, mirroring the existing
UTF-16-without-a-BOM message.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
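
The guard can be illustrated with a self-contained decoder. `decode_with_bom` is a hypothetical name, and the sketch builds the UTF-16 code units byte by byte rather than via `reinterpret`, so it is not the package's `_normalize_bom`; input without a BOM is simply assumed to be UTF-8 here.

```julia
# Hypothetical BOM-aware decoder; a sketch, not XML.jl's `_normalize_bom`.
function decode_with_bom(bytes::Vector{UInt8})
    if length(bytes) >= 3 && bytes[1:3] == [0xEF, 0xBB, 0xBF]
        return String(bytes[4:end])                      # UTF-8 BOM: drop it
    elseif length(bytes) >= 2 && (bytes[1:2] == [0xFF, 0xFE] || bytes[1:2] == [0xFE, 0xFF])
        little = bytes[1] == 0xFF                        # FF FE = little-endian
        payload = @view bytes[3:end]
        # UTF-16 code units are 2 bytes each; an odd count means a truncated stream.
        isodd(length(payload)) && throw(ArgumentError(
            "malformed UTF-16: odd number of bytes after the BOM (XML 1.0 §4.3.3)"))
        units = UInt16[little ? UInt16(payload[i]) | (UInt16(payload[i + 1]) << 8) :
                                (UInt16(payload[i]) << 8) | UInt16(payload[i + 1])
                       for i in 1:2:length(payload)]
        return transcode(String, units)
    end
    return String(copy(bytes))                           # no BOM: assumed UTF-8 in this sketch
end
```

Checking the length before assembling code units is what turns a low-level conversion failure into an error that names the actual problem.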
skip_quoted was the only input-error path in the tokenizer throwing a bare
ErrorException ("Unterminated quoted string") with no position context; every
other tokenizer error goes through err(msg, pos), which raises an ArgumentError
carrying "XML tokenizer error at position N: ...". Code that catches
ArgumentError to handle tokenizer errors (e.g. an unterminated quoted string in
a DOCTYPE internal subset, skip_quoted's sole caller) would miss this one path.
Switch it to err("unterminated quoted string", pos).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
XML 1.0 §2.6 treats the whitespace right after the PITarget as a separator that
is not part of the content, but content then runs verbatim up to "?>", so
trailing whitespace IS content. The three readers used strip() on the PI content
token, dropping trailing whitespace too: <?target hello   ?> read back as
"hello", and write∘parse was lossy. (DTD content already used lstrip.)

Use lstrip at the three PI-content sites (Node parser, LazyNode, Cursor): the
leading separator is still removed, trailing whitespace is kept, and
whitespace-only / empty content still collapses to nothing. Round-trip faithful.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
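
The difference between the two behaviours fits in a few lines. Both helpers below are hypothetical and operate on the raw text that follows the PI target, which is an assumption about what the content token holds.

```julia
# Hypothetical helpers; `raw` is the text between the PI target and "?>".
# Old behaviour: strip both ends, losing trailing whitespace.
pi_content_strip(raw::AbstractString) = (c = strip(raw); isempty(c) ? nothing : c)

# New behaviour: remove only the leading separator; trailing whitespace is content.
pi_content_lstrip(raw::AbstractString) = (c = lstrip(raw); isempty(c) ? nothing : c)
```

For `<?target hello   ?>` the raw text is `" hello   "`: the first helper yields `"hello"`, the second `"hello   "`, and both still return `nothing` for whitespace-only input.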
The NodeType constructors accepted any string as a tag/target, so Element(""),
Element("1bad"), Element("a b"), and Element("a<b") all built nodes that
serialize to malformed XML (</>, <1bad/>, <a b/>, <a<b/>) — XML the package's
own parser rejects (parse("<1bad/>") throws while Element("1bad") did not).

Validate the name against XML 1.0 §2.3 (a NameStartChar followed by NameChars,
using the tokenizer's lenient per-character rule, so non-ASCII and namespaced
names like "café"/"ns:item" still construct). An invalid name now raises a clear
error, so a constructed node round-trips to the same node through write/parse.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
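
A lenient per-character name check of the kind described might look like this. The exact character classes are an assumption of this sketch (letters, `_`, `:`, and any non-ASCII character may start a name; digits, `-`, and `.` may follow); the tokenizer's actual rule may differ, and the function names are hypothetical.

```julia
# Hypothetical lenient name rule; the character classes are this sketch's assumption.
is_name_start(c::Char) = isletter(c) || c == '_' || c == ':' || !isascii(c)
is_name_char(c::Char)  = is_name_start(c) || isdigit(c) || c == '-' || c == '.'

# A name is a NameStartChar followed by zero or more NameChars (XML 1.0 §2.3).
is_valid_name(s::AbstractString) =
    !isempty(s) && is_name_start(first(s)) && all(is_name_char, s)
```

The four rejected examples from the commit message (`""`, `"1bad"`, `"a b"`, `"a<b"`) fail this check, while `"café"` and `"ns:item"` pass.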
in_close_tag was write-only: initialized false, set true after a CLOSE_TAG, and
reset in the TAG_CLOSE branch, but never read to emit a token or take a branch.
The TAG_CLOSE branch existed only to perform that reset, so dropping the variable
lets the `>` delimiter token fall through harmlessly. No behavior change (covered
by the existing parse tests).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…nsupported axes

Two cleanups in xpath():
- Remove dead scaffolding: `root = node.nodetype === Document ? node : node`
  (identical ternary arms) and an if/else whose branches were both
  `current = Node{S}[node]`. Collapsed to the single equivalent, with a note
  that an absolute path from a non-Document node is evaluated relative to that
  node (Node has no parent link to re-anchor at the document root).
- Fail fast on unsupported axes. The name lexer sweeps ':' into a name, so an
  axis specifier like child:: / descendant:: / following-sibling:: lexed as a
  single literal element name and silently matched nothing (child::a returned 0
  where the equivalent `a` returns 2). A name containing "::" now raises a clear
  "unsupported XPath axis" error; a single ':' (a namespaced name like ns:item)
  is unaffected.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The XPath quick-start bound `root = doc[end]`, but the example document's
trailing newline parses to a Text node as the document's last child, so doc[end]
was that Text node, not <root> — every one of the four example queries returned
empty. Bind `root = doc[1]` (the root element; the leading newline does not
produce a Text node). A test pins the documented example and the doc[end]
pitfall so it can't silently rot again.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A Comment/CData/ProcessingInstruction value containing "-->", "]]>", or "?>"
serialized to XML that re-parses split into multiple nodes — silent corruption:
write(Comment("a-->b")) == "<!--a-->b-->", which re-parses as two children, not
one comment. The parser never yields such content, so this only arises from
hand-construction.

Reject it at construction, completing the constructor-name validation from the
previous commit: a constructed node can no longer serialize to malformed XML.
DTD is excluded (its internal subset legitimately contains '>'); the individual
delimiter characters (e.g. "a-b->c", "a]b]>c") remain valid content.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
XML 1.0 §2.8 requires the XML declaration (<?xml ...?>) to be the very first
thing in the document. The tokenizer keys XML_DECL on spelling alone, so a
<?xml?> after content, a repeated one, or one nested in an element all became
Declaration nodes and parsed without error at every level.

At :structural/:strict, reject a Declaration that is not the first top-level
child (checked in _check_document_wellformed — this also catches a duplicate,
since the second is never first) and one nested in element content (guarded at
the XML_DECL_CLOSE branch via children_stack depth), mirroring the DOCTYPE
placement checks. A leading <?xml?> followed by the root is unaffected;
:lenient opts out.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
:strict range-checked character references (&#0;) against XML 1.0 §2.2 but never
the raw characters themselves, so a literal NUL or other C0 control passed even
at :strict while its &#...; reference form was rejected — a reference-vs-raw
asymmetry.

Add a _check_chars_strict scan over the five content kinds (text, attribute
value, comment, CDATA, PI content); it rejects a raw character outside the §2.2
Char range with a clear "U+XXXX outside the legal XML range" error. The scan is
gated on Val{W} === :strict (DCE'd off the :structural/:lenient paths, so they
pay nothing). W3C not-wf rejections rise 188 -> 235; all 577 valid docs still
parse.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
parse_dtd is a best-effort, non-validating DTD extractor, but it does not expand
parameter-entity references (%name;) in declarations: _dtd_read_name can't
advance over '%', so a content model like <!ELEMENT e %text;> surfaced an opaque
internal BoundsError / position error — and it failed that way on the repo's own
complex_dtd.xml fixture.

Wrap the implementation: on a parse failure where the DTD contains a '%', raise a
clear message explaining that parameter-entity references aren't expanded (and
that the raw DTD node still round-trips). Other failures rethrow unchanged, and a
PE-free DTD is unaffected.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
runtests.jl's 46 top-level @testsets are each a *root* testset: a top-level
testset throws at finish() when it has a failure or error, which aborts every
later top-level testset — so the first failing testset hid all the rest.

Push a single parent testset (`_ROOT_TS`) that the file's testsets (and the 7
now-wrapped include()s) nest under, then finish() it at the end: one aggregated
summary, still exits 1 on failure. Each @testset stays a separate top-level
statement, which avoids wrapping the whole ~3700-line body in one
`@testset begin … end` block — that compiles as a single giant thunk and is
pathologically slow (minutes vs. ~25s).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… floor

test_w3c.jl filters the suite to ENTITIES="none", but read the attribute with
get(attrs, "ENTITIES", "") — so the 317 tests that *omit* the attribute (which
the catalog DTD defaults to "none") were silently dropped from scope. Default it
to "none", which surfaces them: 577->776 valid/invalid and 940->1257 not-wf
in-scope tests. All 776 well-formed docs still parse at :strict (no regression).

Update the stale no-regression floor on rejected not-wf docs from 169 to 367 —
the current count, pushed up by the :strict raw-character-range and DOCTYPE /
XML-declaration placement checks added in this fix loop.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
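
The filter bug comes down to the default passed to `get`. The sketch below uses a made-up three-entry catalog (the `ID` values are invented for illustration) to show how an empty-string default drops entries that omit the attribute.

```julia
# Made-up catalog entries for illustration; only the ENTITIES handling matters.
catalog = [
    Dict("ID" => "t1", "ENTITIES" => "none"),
    Dict("ID" => "t2"),                        # attribute omitted: DTD default is "none"
    Dict("ID" => "t3", "ENTITIES" => "both"),
]

# Before: a missing attribute reads as "", which never equals "none".
in_scope_old = filter(t -> get(t, "ENTITIES", "") == "none", catalog)

# After: a missing attribute takes the catalog DTD's default.
in_scope_new = filter(t -> get(t, "ENTITIES", "none") == "none", catalog)

ids(ts) = [t["ID"] for t in ts]
```

Here `ids(in_scope_old)` contains only `"t1"`, while `ids(in_scope_new)` also includes `"t2"`.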
CHANGELOG, migration guide, and docstrings for the behavior changes this fix
loop introduced plus the audit's document-disposition findings:

- Stricter default parsing: :structural now also rejects a rootless document, a
  literal '<' in an attribute value, and a misplaced/duplicate/nested DOCTYPE or
  XML declaration; :strict adds the comment trailing-'-' and raw §2.2 char-range
  checks. A standalone DTD file or prolog-only fragment now needs :lenient.
- Inter-element whitespace is preserved, so children(root)[1] may be a Text node.
- LazyNode/Cursor do not check well-formedness and take no wellformed kwarg.
- Node constructors validate names and reject content containing its own close
  delimiter.
- Document the parent()/depth()/siblings()/xpath-`..` value-identity limitation
  (immutable Node has no identity; a path-based redesign is deferred to a later
  release).
- Node{SubString{String}} returns raw (un-decoded) entities; prev! removed; the
  XPath @attr step returns Text nodes (not strings).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@mathieu17g

Collaborator Author

Pre-registration audit + fixes (a49f091..c9df019, 21 commits)

A pre-registration audit of the rewrite surfaced correctness and completeness gaps beyond the earlier fixes; all are now closed, test-first.

Well-formedness — measured against the pinned W3C suite at :strict: 367 / 1257 in-scope not-well-formed docs rejected, 776 / 776 well-formed parsed.

  • :structural now also rejects a document with no root element, a literal < in an attribute value, and a misplaced / duplicate / nested DOCTYPE or XML declaration.
  • :strict now also rejects -- (or a trailing -) in comments and any character — reference or raw literal — outside the XML 1.0 §2.2 range.

Correctness & robustness

  • unescape is single-pass: a numeric reference resolving to & is no longer re-scanned as a named entity (unescape("&#38;amp;") → "&amp;", not "&").
  • A leading U+FEFF BOM in an in-memory string is stripped consistently by Node, LazyNode, and Cursor.
  • Truncated input (unterminated comment / CDATA / PI / DOCTYPE) and odd-length UTF-16 now raise clear errors instead of being silently accepted or crashing a reader's value().
  • Node constructors validate names and reject content containing its own close delimiter, so a constructed node can no longer serialize to malformed XML.
  • XPath: unsupported axes (child::, descendant::, …) now error instead of silently returning the wrong result; the broken README example is fixed.
  • parse_dtd reports a clear error on parameter-entity references instead of an opaque internal failure.
  • Processing-instruction content keeps its trailing whitespace on round-trip.

Docs & test infra — migration guide / CHANGELOG / docstrings updated for the new behaviours; the suite now runs under one parent testset so a failure in one set no longer aborts the rest; a W3C-catalog filter bug that silently dropped 317 in-scope tests is fixed.

Deferred to post-registration: parent / depth / siblings and the XPath .. axis locate a node by value, so they can return a result for the wrong one of two value-identical siblings (immutable Node has no identity). Documented for v0.4; a path-based redesign is planned for a later release.

Performance is now measured: parse(_, Node) is 64–84% faster than 0.3.x (corpus-dependent).

mathieu17g and others added 2 commits June 28, 2026 15:22
The error wrapper raised "parameter-entity references not supported" on
any literal '%' in the DTD value, masking the true cause for an already-
invalid DTD (e.g. a '%' inside an entity value alongside a malformed
NOTATION). Gate the message on an actual %name; reference and append the
underlying error; otherwise rethrow the original unchanged.

Surfaced by the fix-diff verification audit of 71c8d7b..c9df019.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…udit

- CHANGELOG: the constructor close-delimiter guard prevents node-splitting
  on write, not all malformed output (a Comment containing "--" still
  constructs and is caught only on re-parse at :strict) — soften the claim.
- src/XML.jl + test: clarify that "" yields an empty Document while a
  whitespace-only input yields a Document with one whitespace Text child;
  pin both with assertions.
- MIGRATING: note that :strict adds a character-range scan and is slower
  than :structural on text-heavy input; and that top-level CDATA in the
  prolog/epilog is accepted even at :strict (XML 1.0 Misc excludes it).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@mathieu17g

Collaborator Author

Two minor follow-ups:

  • 337c124: parse_dtd diagnostic. The "parameter-entity references not supported" message now fires only on a real %name; reference (was: any literal %) and appends the underlying error.
  • f9bbc65 — documentation accuracy. Softened the CHANGELOG constructor-validation note; clarified empty ("") vs whitespace-only parse results; documented two :strict traits (slower character scan on text-heavy input; top-level CDATA currently accepted).

Suite green (2519 tests).

@TimG1964

Collaborator

I have made a PR to XLSX that is wired up to this v0.4-dev branch so you should be good to confirm downstream tests now. Let me know if you need more.

… it's fast'

- # Benchmarks: a markdown table (XML.jl vs EzXML / LightXML / XMLDict) from the
  bare-machine benchmarks.jl run, dropping the misleading 'Read file' row
  (= parse + ~0.6 ms I/O); dated, with library versions.
- # How it's fast (new): the theory — lexer = DFA over a regular token grammar;
  nesting = visibly pushdown / nested word; 'optimal' = matches the asymptotic
  lower bounds — plus the by-regime decomposition (stream / full-DOM / stages)
  and the vs-0.3.9 comparison. Theory fact-checked, Julia internals verified
  empirically; concepts linked to Wikipedia/docs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

4 participants