WIP: v0.4 — DO NOT MERGE (tracking)#76
Drops the underscore prefixes from internal names (module is unexported, the clutter was only needed back when these names leaked into XML.jl). Replaces the name-byte predicate with a 256-entry const lookup table. Also fixes a 1-based indexing off-by-one in read_doctype_body: the '<!--' detection guarded with `pos >= 2` while reading `codeunit(data, pos - 2)`, which is codeunit 0 when pos == 2. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
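The table-for-predicate swap described above can be sketched as follows. This is a minimal illustration, not the package's code: the names `is_name_byte_slow`, `NAME_BYTE_TABLE`, and `is_name_byte` are hypothetical, and the exact set of accepted bytes is an assumption (a lenient ASCII rule plus every non-ASCII byte).

```julia
# Assumed (lenient) rule for illustration: ASCII letters, digits, and '_', ':', '-', '.'
# count as name bytes, as does any non-ASCII byte (part of a multi-byte UTF-8 character).
is_name_byte_slow(b::UInt8) =
    (UInt8('a') <= b <= UInt8('z')) || (UInt8('A') <= b <= UInt8('Z')) ||
    (UInt8('0') <= b <= UInt8('9')) ||
    b == UInt8('_') || b == UInt8(':') || b == UInt8('-') || b == UInt8('.') ||
    b >= 0x80

# Build the 256-entry table once, at load time; entry i + 1 answers for byte value i.
const NAME_BYTE_TABLE = ntuple(i -> is_name_byte_slow(UInt8(i - 1)), 256)

# Bytes are 0-based but Julia indexes from 1, hence the `+ 1`.
@inline is_name_byte(b::UInt8) = @inbounds NAME_BYTE_TABLE[Int(b) + 1]
```

The `+ 1` in the lookup is the same 1-based indexing concern as the `codeunit(data, pos - 2)` fix: any index derived by subtraction needs a guard that keeps it at 1 or above.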
tag, value, keys, and attributes on LazyNode now return
SubString{String} views into the source rather than allocating
fresh Strings, so traversing a large document lazily does not
duplicate its text data.
Introduces a small _as_substring helper to promote the String that
`unescape` can return into a SubString so Attributes stays homogeneous.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
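A small sketch of the view-based approach, using only Base. The helper name `as_substring` stands in for the `_as_substring` helper mentioned above and is an assumption about its shape, as is the sample document.

```julia
# Hypothetical stand-in for the `_as_substring` helper named above.
as_substring(s::SubString{String}) = s
as_substring(s::String) = SubString(s)

src = "<item id=\"42\">hello</item>"

# A SubString records its parent string plus an offset and length, so the tag
# text below is a view into `src` rather than a fresh copy.
tag = SubString(src, 2, 5)

# Keeping both sides of every pair as SubString{String} keeps the vector homogeneous,
# even when a value arrives as a freshly allocated String (e.g. after unescaping).
attrs = Pair{SubString{String},SubString{String}}[]
push!(attrs, SubString(src, 7, 8) => as_substring("42"))
```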
_write_xml now inspects children before reformatting: if any Text
child has non-whitespace content (or any CData child exists), the
element is treated as mixed content and its whitespace is preserved
verbatim. Otherwise the writer drops the whitespace-only Text nodes
the parser emits for round-tripping source formatting and generates
fresh indentation. Same filter is applied at the Document level.
Also adds an unescape(::SubString{String}) specialization that
returns the input unchanged when it contains no '&', avoiding an
allocation on the lazy scanning path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The medium-file workloads show a ~10–25% regression vs the numbers captured at 4a728ee ("Revamp benchmarks"). v0.4-vs-v0.3.8 remains a 70–80% improvement, so this is a post-release follow-up, not a release blocker. Suspected culprit is the eager Pair{S,S}[] alloc per TOKEN_OPEN_TAG introduced in 2f71f9a — see follow-up issue. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
What landed in
- CHANGELOG: Added/Changed/Deprecated/Removed/Fixed for the v0.4 line, with issue/PR links
- MIGRATING_TO_v0.4.md: 0.3 -> 0.4 migration guide (removed APIs, structural + behavioral changes, stricter-by-default parsing, reader guidance)
- README: link to the migration guide
- LazyNode docstring: memory/speed trade-offs + when to use Node / Cursor / LazyNode
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Breaking changes documented + migration guide added (
unescape ran two replace passes: numeric character references first, then the five predefined entities. A numeric reference resolving to '&' (e.g. &#38;) was then re-read by the second pass as the start of &amp; and decoded a second time, so unescape("&#38;amp;") returned "&" instead of "&amp;": silent data loss, and a violation of unescape's own "processed exactly once" docstring.
Resolve every reference in a single replace over one combined regex, so a substituted character is never re-scanned. Affects all three readers (Node/LazyNode/Cursor), which decode parsed content through unescape.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
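The single-pass idea can be sketched with Base's `replace` and a function replacement. This is an illustration under assumptions, not the package's implementation: `REFERENCE`, `NAMED`, `decode_reference`, and `unescape_once` are hypothetical names, and out-of-range numeric references simply throw here.

```julia
const NAMED = Dict("lt" => "<", "gt" => ">", "amp" => "&", "apos" => "'", "quot" => "\"")

# One combined pattern: hex reference, decimal reference, or a predefined entity.
const REFERENCE = r"&(?:#x[0-9A-Fa-f]+|#[0-9]+|lt|gt|amp|apos|quot);"

# Called once per match with the matched text, e.g. "&#38;" or "&lt;".
function decode_reference(ref::AbstractString)
    body = String(SubString(ref, 2, lastindex(ref) - 1))   # strip '&' and ';'
    if startswith(body, "#x")
        return string(Char(parse(UInt32, body[3:end]; base = 16)))
    elseif startswith(body, "#")
        return string(Char(parse(UInt32, body[2:end])))
    else
        return NAMED[body]
    end
end

# `replace` resumes scanning after each match, so substituted text is never re-read.
# Input without '&' is returned unchanged, which skips the allocation.
unescape_once(s::AbstractString) =
    occursin('&', s) ? replace(s, REFERENCE => decode_reference) : s
```

Because the output of one substitution is never fed back into the matcher, a numeric reference that decodes to an ampersand cannot combine with the text that follows it to form a second reference.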
parse(_, Node) dropped a leading U+FEFF (BOM-as-character) from an in-memory string via _drop_bom, but parse(_, LazyNode) and Cursor(::AbstractString) did not. The same BOM'd string therefore parsed to an Element root through Node yet surfaced the BOM as a leading Text node through LazyNode and Cursor, leaving the three readers disagreeing on identical input. Add a type-preserving _drop_bom(::AbstractString) and apply the strip at the LazyNode and Cursor string entry points too. The generic method keeps the cursor's string-view type (SubString today, StringView when mmap lands), so the parametric Cursor constructor stays type-stable; the subtree primitive Cursor(data, startpos) is untouched, since it walks from mid-document. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The :strict "--"-in-comment check (XML 1.0 §2.5) only inspected the comment's
COMMENT_CONTENT token. For <!-- foo --->, the tokenizer scans to the first
"-->", so the content token is " foo -" and the offending "--" straddles the
content's last character and the close delimiter — invisible to a content-only
occursin("--", ...). So <!-- foo ---> was wrongly accepted at :strict.
Also reject when the content token ends with "-", which (because content runs
up to the first "-->") happens exactly when the source is "...--->". A "-" not
abutting the close (e.g. <!-- a - b -->) stays well-formed. One more W3C not-wf
comment document is now correctly rejected at :strict (170/940, was 169).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
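The two-part check described above reduces to a pair of string tests on the content token. A minimal sketch; the function name is hypothetical, and `content` is assumed to be the text between "<!--" and the first "-->".

```julia
# `content` is the text between "<!--" and the first "-->".
function comment_is_strict_wellformed(content::AbstractString)
    occursin("--", content) && return false   # "--" inside the body
    endswith(content, "-") && return false    # "-" abutting the close, i.e. "--->" in the source
    return true
end
```

For `<!-- foo --->` the content token is " foo -", which contains no "--" but does end in "-", so only the second test catches it.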
XML 1.0 §3.1 AttValue ::= '"' ([^<&"] | Reference)* '"' forbids a raw '<' in an attribute value; it must be written &lt;. The tokenizer reads the value verbatim up to the closing quote, so a literal '<' was silently accepted at every well-formedness level, even :strict, while a raw '<' in text is rejected (it opens a tag), an asymmetry §3.1 does not permit.
Reject a literal '<' in an attribute value at :structural and :strict (checked on the raw, pre-decode value, so the &lt; reference is unaffected); :lenient still accepts it. '>' remains allowed, as §3.1 permits. Three more W3C not-wf documents are now correctly rejected at :strict (173/940, was 170).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
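A sketch of checking the raw value before decoding. The function name and the use of a plain `Symbol` for the level are assumptions made for brevity; the PR describes the level as a compile-time type parameter.

```julia
# `raw` is the attribute value exactly as it appears between the quotes,
# before any entity decoding, so an escaped less-than sign never trips the check.
function check_attr_value(raw::AbstractString, level::Symbol)
    if level !== :lenient && occursin('<', raw)
        throw(ArgumentError("literal '<' in attribute value (XML 1.0 §3.1)"))
    end
    return raw
end
```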
XML 1.0 §2.1 places the DOCTYPE in the prolog: at most one, before the root element ([1] document ::= prolog element Misc*; Misc and content exclude doctypedecl). v0.4 accepted a DOCTYPE after the root, a second DOCTYPE, and a DOCTYPE nested in element content at every level including :strict, because _check_document_wellformed only counted roots and top-level text and the DOCTYPE_CONTENT branch pushed a DTD node with no placement guard. At :structural/:strict, reject a DOCTYPE that follows the root or repeats (checked in _check_document_wellformed over the ordered top-level children) and one nested in element content (guarded at the DOCTYPE_CONTENT branch via the children_stack depth). A single prolog DOCTYPE is unaffected; :lenient opts out. Seven more W3C not-wf documents are now correctly rejected at :strict (180/940, was 173). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
XML 1.0 §2.1 requires exactly one root element. v0.4's :structural/:strict
checked for at-most-one root (multi-root) and non-whitespace top-level text, but
not for the absence of a root: a document of only prolog markup — a comment, an
XML/PI declaration, or a DOCTYPE — parsed without error, though it is not
well-formed (libxml2: "no root element").
Reject nroots == 0 when the top level contains any markup (Comment / PI /
Declaration / CData / DTD). Truly-empty ("") and whitespace-only input stay
lenient (an empty Document), preserving the documented empty-input behavior;
:lenient opts out entirely.
complex_dtd.xml is a prolog + internal-subset DOCTYPE with no root, i.e. a DTD
fixture rather than a well-formed document, so its structure test now reads it
with :lenient. One more W3C not-wf document is correctly rejected at :strict
(181/940, was 180).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
When a comment/CDATA/PI/DOCTYPE open token consumed through end-of-input
("<!--", "<![CDATA[", "<?pi", "<!DOCTYPE"), the tokenizer's top-of-iterate EOF
check returned `nothing` before the body reader could run. So the Node parser
silently accepted the truncated construct (an empty Document, even at :strict),
and the lazy readers' value() — which re-tokenizes on access — indexed the
`nothing` and threw an opaque MethodError (getindex(::Nothing, ::Int64)).
End the token stream on EOF only when not mid content-body: in M_COMMENT /
M_CDATA / M_PI / M_DOCTYPE, fall through to the body reader so it raises the
precise "unterminated ..." ArgumentError. M_DEFAULT (clean end) and the tag
modes (which emit a partial token the parser rejects as an unclosed tag) keep
returning `nothing`, so the tokenizer's partial-token behavior for "<div" /
"</div" is unchanged. Closes both the silent-accept and the opaque MethodError.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
_normalize_bom reinterprets the post-BOM bytes of a UTF-16 stream as UInt16 code units. An odd byte count (a truncated stream) made reinterpret throw a cryptic "cannot reinterpret an UInt8 array to UInt16" ArgumentError. Guard both the LE and BE arms with an isodd check raising a clear "malformed UTF-16: odd number of bytes after the BOM (XML 1.0 §4.3.3)" error, mirroring the existing UTF-16-without-a-BOM message. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
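The guard can be illustrated with Base alone. A sketch for the little-endian arm only; `decode_utf16le` is a hypothetical name, and `bytes` is assumed to hold the stream with the BOM already removed.

```julia
# `bytes` holds the UTF-16LE payload that follows the BOM.
function decode_utf16le(bytes::Vector{UInt8})
    if isodd(length(bytes))
        throw(ArgumentError(
            "malformed UTF-16: odd number of bytes after the BOM (XML 1.0 §4.3.3)"))
    end
    # Safe now: the length is a multiple of 2, so the reinterpret cannot fail on size.
    units = ltoh.(reinterpret(UInt16, bytes))   # little-endian code units -> host order
    return transcode(String, units)
end
```

The big-endian arm would use `ntoh` in place of `ltoh`; the `isodd` check is the same in both.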
skip_quoted was the only input-error path in the tokenizer throwing a bare
ErrorException ("Unterminated quoted string") with no position context; every
other tokenizer error goes through err(msg, pos), which raises an ArgumentError
carrying "XML tokenizer error at position N: ...". Code that catches
ArgumentError to handle tokenizer errors (e.g. an unterminated quoted string in
a DOCTYPE internal subset, skip_quoted's sole caller) would miss this one path.
Switch it to err("unterminated quoted string", pos).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
XML 1.0 §2.6 treats the whitespace right after the PITarget as a separator that is not part of the content, but content then runs verbatim up to "?>", so trailing whitespace IS content. The three readers used strip() on the PI content token, dropping trailing whitespace too: <?target hello ?> read back as "hello", and write∘parse was lossy. (DTD content already used lstrip.) Use lstrip at the three PI-content sites (Node parser, LazyNode, Cursor): the leading separator is still removed, trailing whitespace is kept, and whitespace-only / empty content still collapses to nothing. Round-trip faithful. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
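The difference between the two trimming choices is small enough to show directly. A sketch; `pi_content` is a hypothetical name, and `raw` is assumed to be the token text that follows the PI target.

```julia
# `raw` is everything between the PI target and "?>".
function pi_content(raw::AbstractString)
    content = lstrip(raw)                        # drop only the separator after the target
    return isempty(content) ? nothing : content  # whitespace-only or empty -> no content
end
```

For `<?target hello ?>` the raw token is " hello ", which `lstrip` turns into "hello " while `strip` would have produced "hello" and lost the trailing space on a round trip.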
The NodeType constructors accepted any string as a tag/target, so Element(""),
Element("1bad"), Element("a b"), and Element("a<b") all built nodes that
serialize to malformed XML (</>, <1bad/>, <a b/>, <a<b/>) — XML the package's
own parser rejects (parse("<1bad/>") throws while Element("1bad") did not).
Validate the name against XML 1.0 §2.3 (a NameStartChar followed by NameChars,
using the tokenizer's lenient per-character rule, so non-ASCII and namespaced
names like "café"/"ns:item" still construct). An invalid name now raises a clear
error, so a constructed node round-trips to the same node through write/parse.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
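A sketch of a per-character name check of the kind described above. The function names are hypothetical, and the rule is an assumed lenient approximation of §2.3 (any non-ASCII character is accepted), not the full NameStartChar/NameChar tables.

```julia
# Assumed lenient rule: letters, '_', ':', or any non-ASCII character may start a name.
is_name_start(c::Char) = isletter(c) || c == '_' || c == ':' || !isascii(c)

# Later characters may additionally be ASCII digits, '-', or '.'.
is_name_char(c::Char) = is_name_start(c) || isdigit(c) || c == '-' || c == '.'

function is_valid_name(name::AbstractString)
    isempty(name) && return false
    is_name_start(first(name)) || return false
    return all(is_name_char, Iterators.drop(name, 1))
end
```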
in_close_tag was write-only: initialized false, set true after a CLOSE_TAG, and reset in the TAG_CLOSE branch, but never read to emit a token or take a branch. The TAG_CLOSE branch existed only to perform that reset, so dropping the variable lets the `>` delimiter token fall through harmlessly. No behavior change (covered by the existing parse tests). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…nsupported axes
Two cleanups in xpath():
- Remove dead scaffolding: `root = node.nodetype === Document ? node : node`
(identical ternary arms) and an if/else whose branches were both
`current = Node{S}[node]`. Collapsed to the single equivalent, with a note
that an absolute path from a non-Document node is evaluated relative to that
node (Node has no parent link to re-anchor at the document root).
- Fail fast on unsupported axes. The name lexer sweeps ':' into a name, so an
axis specifier like child:: / descendant:: / following-sibling:: lexed as a
single literal element name and silently matched nothing (child::a returned 0
where the equivalent `a` returns 2). A name containing "::" now raises a clear
"unsupported XPath axis" error; a single ':' (a namespaced name like ns:item)
is unaffected.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The XPath quick-start bound `root = doc[end]`, but the example document's trailing newline parses to a Text node as the document's last child, so doc[end] was that Text node, not <root> — every one of the four example queries returned empty. Bind `root = doc[1]` (the root element; the leading newline does not produce a Text node). A test pins the documented example and the doc[end] pitfall so it can't silently rot again. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A Comment/CData/ProcessingInstruction value containing "-->", "]]>", or "?>"
serialized to XML that re-parses split into multiple nodes — silent corruption:
write(Comment("a-->b")) == "<!--a-->b-->", which re-parses as two children, not
one comment. The parser never yields such content, so this only arises from
hand-construction.
Reject it at construction, completing the constructor-name validation from the
previous commit: a constructed node can no longer serialize to malformed XML.
DTD is excluded (its internal subset legitimately contains '>'); the individual
delimiter characters (e.g. "a-b->c", "a]b]>c") remain valid content.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
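The construction-time guard amounts to one substring test per node kind. A sketch with hypothetical names (`CLOSE_DELIMS`, `check_content`) and symbols standing in for the node types:

```julia
# Close delimiter for each content kind that has one.
const CLOSE_DELIMS = (comment = "-->", cdata = "]]>", pi = "?>")

function check_content(kind::Symbol, value::AbstractString)
    delim = CLOSE_DELIMS[kind]
    if occursin(delim, value)
        throw(ArgumentError("$kind content must not contain \"$delim\""))
    end
    return value
end
```

Only the full delimiter is rejected, so values such as "a-b->c" or "a]b]>c" pass unchanged.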
XML 1.0 §2.8 requires the XML declaration (<?xml ...?>) to be the very first thing in the document. The tokenizer keys XML_DECL on spelling alone, so a <?xml?> after content, a repeated one, or one nested in an element all became Declaration nodes and parsed without error at every level. At :structural/:strict, reject a Declaration that is not the first top-level child (checked in _check_document_wellformed — this also catches a duplicate, since the second is never first) and one nested in element content (guarded at the XML_DECL_CLOSE branch via children_stack depth), mirroring the DOCTYPE placement checks. A leading <?xml?> followed by the root is unaffected; :lenient opts out. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
:strict range-checked character references (&#...;) against XML 1.0 §2.2 but never
the raw characters themselves, so a literal NUL or other C0 control passed even
at :strict while its &#...; reference form was rejected — a reference-vs-raw
asymmetry.
Add a _check_chars_strict scan over the five content kinds (text, attribute
value, comment, CDATA, PI content); it rejects a raw character outside the §2.2
Char range with a clear "U+XXXX outside the legal XML range" error. The scan is
gated on Val{W} === :strict (DCE'd off the :structural/:lenient paths, so they
pay nothing). W3C not-wf rejections rise 188 -> 235; all 577 valid docs still
parse.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
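A sketch of the range test and of gating it by dispatch on a `Val`, so the non-strict methods do no work. The names `is_xml_char` and `check_chars` are hypothetical; the ranges are the Char production of XML 1.0 §2.2.

```julia
# XML 1.0 §2.2: Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
is_xml_char(c::Char) =
    c == '\t' || c == '\n' || c == '\r' ||
    ' ' <= c <= '\ud7ff' || '\ue000' <= c <= '\ufffd' ||
    '\U10000' <= c <= '\U10ffff'

# Fallback for every other level: no scan at all.
check_chars(::Val, s::AbstractString) = nothing

# Only the :strict method walks the characters.
function check_chars(::Val{:strict}, s::AbstractString)
    for c in s
        if !is_xml_char(c)
            code = uppercase(string(UInt32(c); base = 16, pad = 4))
            throw(ArgumentError("U+$code outside the legal XML range"))
        end
    end
    return nothing
end
```

Because the level is part of the argument type, the compiler picks the method statically when the level is known at compile time, which is the effect the commit describes for the :structural and :lenient paths.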
parse_dtd is a best-effort, non-validating DTD extractor, but it does not expand parameter-entity references (%name;) in declarations: _dtd_read_name can't advance over '%', so a content model like <!ELEMENT e %text;> surfaced an opaque internal BoundsError / position error — and it failed that way on the repo's own complex_dtd.xml fixture. Wrap the implementation: on a parse failure where the DTD contains a '%', raise a clear message explaining that parameter-entity references aren't expanded (and that the raw DTD node still round-trips). Other failures rethrow unchanged, and a PE-free DTD is unaffected. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
runtests.jl's 46 top-level @testsets are each a *root* testset: a top-level testset throws at finish() when it has a failure or error, which aborts every later top-level testset — so the first failing testset hid all the rest. Push a single parent testset (`_ROOT_TS`) that the file's testsets (and the 7 now-wrapped include()s) nest under, then finish() it at the end: one aggregated summary, still exits 1 on failure. Each @testset stays a separate top-level statement, which avoids wrapping the whole ~3700-line body in one `@testset begin … end` block — that compiles as a single giant thunk and is pathologically slow (minutes vs. ~25s). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… floor
test_w3c.jl filters the suite to ENTITIES="none", but read the attribute with get(attrs, "ENTITIES", ""), so the 317 tests that *omit* the attribute (which the catalog DTD defaults to "none") were silently dropped from scope. Default it to "none", which surfaces them: 577->776 valid/invalid and 940->1257 not-wf in-scope tests. All 776 well-formed docs still parse at :strict (no regression).
Update the stale no-regression floor on rejected not-wf docs from 169 to 367, the current count, pushed up by the :strict raw-character-range and DOCTYPE / XML-declaration placement checks added in this fix loop.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
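The filter bug comes down to the default passed to `get`. A sketch with plain Dicts standing in for the catalog's attribute collections; the sample entries and the `in_scope_*` names are made up for illustration.

```julia
# Two hypothetical catalog entries: one states ENTITIES, one relies on the DTD default.
explicit = Dict("ID" => "t1", "ENTITIES" => "none")
omitted  = Dict("ID" => "t2")

# Old filter: a missing attribute reads as "", so the entry falls out of scope.
in_scope_old(attrs) = get(attrs, "ENTITIES", "") == "none"

# Fixed filter: a missing attribute takes the DTD default of "none".
in_scope_new(attrs) = get(attrs, "ENTITIES", "none") == "none"
```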
CHANGELOG, migration guide, and docstrings for the behavior changes this fix
loop introduced plus the audit's document-disposition findings:
- Stricter default parsing: :structural now also rejects a rootless document, a
literal '<' in an attribute value, and a misplaced/duplicate/nested DOCTYPE or
XML declaration; :strict adds the comment trailing-'-' and raw §2.2 char-range
checks. A standalone DTD file or prolog-only fragment now needs :lenient.
- Inter-element whitespace is preserved, so children(root)[1] may be a Text node.
- LazyNode/Cursor do not check well-formedness and take no wellformed kwarg.
- Node constructors validate names and reject content containing its own close
delimiter.
- Document the parent()/depth()/siblings()/xpath-`..` value-identity limitation
(immutable Node has no identity; a path-based redesign is deferred to a later
release).
- Node{SubString{String}} returns raw (un-decoded) entities; prev! removed; the
XPath @attr step returns Text nodes (not strings).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Pre-registration audit + fixes (
The error wrapper raised "parameter-entity references not supported" on any literal '%' in the DTD value, masking the true cause for an already- invalid DTD (e.g. a '%' inside an entity value alongside a malformed NOTATION). Gate the message on an actual %name; reference and append the underlying error; otherwise rethrow the original unchanged. Surfaced by the fix-diff verification audit of 71c8d7b..c9df019. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
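A sketch of gating the hint on an actual reference rather than on any '%'. The names `PE_REF`, `has_pe_reference`, and `with_pe_hint` are hypothetical, the name pattern is an ASCII-only simplification, and `f` stands in for the real extractor.

```julia
# '%' followed by a name and ';' (ASCII-only name pattern, a simplification).
const PE_REF = r"%[A-Za-z_:][A-Za-z0-9_:.\-]*;"

has_pe_reference(dtd::AbstractString) = occursin(PE_REF, dtd)

# Run the extractor `f`; add the hint only when a real %name; reference is present.
function with_pe_hint(f, dtd::AbstractString)
    try
        return f(dtd)
    catch e
        has_pe_reference(dtd) || rethrow()   # unrelated failure: keep the original error
        throw(ArgumentError(
            "parameter-entity references are not expanded; underlying error: " *
            sprint(showerror, e)))
    end
end
```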
…udit
- CHANGELOG: the constructor close-delimiter guard prevents node-splitting on write, not all malformed output (a Comment containing "--" still constructs and is caught only on re-parse at :strict); soften the claim.
- src/XML.jl + test: clarify that "" yields an empty Document while a whitespace-only input yields a Document with one whitespace Text child; pin both with assertions.
- MIGRATING: note that :strict adds a character-range scan and is slower than :structural on text-heavy input; and that top-level CDATA in the prolog/epilog is accepted even at :strict (XML 1.0 Misc excludes it).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Two minor follow-ups:
Suite green (2519 tests).
I have made a PR to XLSX that is wired up to this v0.4-dev branch so you should be good to confirm downstream tests now. Let me know if you need more.
… it's fast'
- # Benchmarks: a markdown table (XML.jl vs EzXML / LightXML / XMLDict) from the bare-machine benchmarks.jl run, dropping the misleading 'Read file' row (= parse + ~0.6 ms I/O); dated, with library versions.
- # How it's fast (new): the theory (lexer = DFA over a regular token grammar; nesting = visibly pushdown / nested word; 'optimal' = matches the asymptotic lower bounds) plus the by-regime decomposition (stream / full-DOM / stages) and the vs-0.3.9 comparison. Theory fact-checked, Julia internals verified empirically; concepts linked to Wikipedia/docs.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
WIP — do not merge. Tracking + collaboration PR for the v0.4 line (supersedes #54).
v0.4-dev = #54's rewrite + the cursor stack + ongoing work.
Highlights
A ground-up rewrite of XML.jl's internals initiated by @joshday: a token-based streaming parser, a pull/cursor (StAX-style) API, XPath support, and substantial parse/read speedups.
Done so far
- main's 0.3.x fixes & infrastructure, combined by merging main in (merge-only, so the cursor branch stays usable for downstreams that build on it).
- <café>, <日本語/>) now parse correctly.
- wellformed = :lenient | :structural | :strict (default :structural), a compile-time type parameter so a mode never pays for checks above its level (:lenient runs none at all). :structural rejects multiple roots, a document with no root, non-whitespace text outside the root, empty/invalid names, a literal < in an attribute value, and a misplaced/duplicate/nested DOCTYPE or XML declaration. :strict additionally rejects -- (or a trailing -) in comments, empty/invalid PI targets, and characters (references and raw literals) outside the XML 1.0 §2.2 range. Against the pinned W3C conformance suite at :strict this rejects 367 of 1257 in-scope not-well-formed documents while parsing all 776 well-formed ones (the rest need DTD/entity validation, out of scope for a non-validating parser).
- unescape (no double-decode of a numeric ref to &), consistent BOM stripping across all three readers, the well-formedness tightening above, clear errors on truncated input and odd-length UTF-16, Node-constructor name/content validation, processing-instruction whitespace fidelity, XPath hardening (unsupported axes now error), parse_dtd error quality, and dead-code cleanup. Findings dispositioned "document" went into the migration guide.
- read; a leading BOM is stripped on parse (now by all three readers); UTF-16 without a BOM (which XML 1.0 §4.3.3 forbids) and odd-length UTF-16 raise clear errors instead of failing obscurely.
- :strict. The whole suite now runs under one parent testset so a failure in any one set no longer aborts the rest, and a W3C-catalog filter bug that silently dropped 317 in-scope tests is fixed.
- escape on SubString (escape should work with AbstractString. #60), BOM decoding, and the UTF-16-without-BOM error are pinned by tests.
- [Unreleased] section carries the full Added / Changed / Deprecated / Removed / Fixed breakdown for the v0.4 line, and a standalone MIGRATING_TO_v0.4.md gives the 0.3 → 0.4 upgrade path: a removed-API → replacement table, the structural and behavioural changes (incl. the stricter defaults, preserved inter-element whitespace, and reading a standalone DTD/prolog-only input with :lenient), and which reader to use.
- main: CI bumps, codecov, the W3C-suite cache, and the 0.3.9 CHANGELOG fold-in.
Still ahead
- cursor-xml-optimisation branch (an independent effort; @joshday's #361 was an earlier exploration), and its full test suite passes against this v0.4-dev. What remains is to open it as an XLSX PR and wire that branch into this PR's downstream CI, replacing the current red-by-design check against the registered XLSX (pinned to XML 0.3), so XLSX's own suite runs against this XML 0.4 on every push here. Other broadly-used dependents may get the same treatment as we go.
Issues addressed
Breaking changes & impact on dependent packages
The low-level streaming API (Raw, next/prev, single-argument parent/depth, nodes_equal, escape!/unescape!) is removed in favour of the token parser and cursor API. The high-level DOM API is largely preserved, but note some structural and behavioural changes:
- Node is now parametric (Node{S}); attributes is a Vector{Pair} and children may be nothing.
- parse/read decode entities into values, so value() returns &, not &amp;.
- write auto-escapes text and attributes (double-escape risk if you pre-escape).
- :structural); a non-document input such as a standalone DTD file now needs wellformed = :lenient.
- Text nodes, so children(root)[1] is not necessarily the first child element.
Dependents that only use the high-level DOM mostly need a [compat] bump to 0.4 plus a spot-check of those behavioural changes; those using the removed low-level API (notably XLSX.jl) need code changes. See MIGRATING_TO_v0.4.md for the full upgrade guide.
Performance
A preliminary measurement puts parse(_, Node) at roughly 64–84% faster than 0.3.x on v0.4-dev (corpus-dependent; the spread tracks how much inter-element whitespace a document carries, since v0.4 preserves it where 0.3.x dropped it). These figures are being confirmed and documented before they're quoted as final (see Still ahead). Cursor retains the forward-streaming speed of the old LazyNode walk; LazyNode trades speed for roughly an order-of-magnitude lower resident memory.