NeXus–NOMAD metainfo generator: Phase 3 (parser v2 + app v2) by lukaspie · Pull Request #810 · FAIRmat-NFDI/pynxtools

lukaspie · 2026-06-29T18:46:25Z

Builds on #801 (Phases 1 + 2). This branch is fully rebased onto it. For the full plan, all phases, and the Architecture Decision Records, see the data-modeling repo: https://github.com/FAIRmat-NFDI/data-modeling/tree/main/nexus-metainfo (specifically here for Phase 3 implementation).

Summary

Phase 1+2 generated the static Python metainfo (base classes + applications). This PR adds a new parser (nexus_parser_v2) that reads HDF5 directly into the generated Python classes via their a_nexus_* annotations, and a new search app (nexus_app_v2) that mirrors the old app, but uses the new names in the schema. Both are purely additive, new entry points; nothing existing is touched or removed (for now).

Testing this branch

To test with the new schema, install pynxtools from this branch and add these entry points to your nomad.yaml (in addition to or instead of the existing nexus_schema/nexus_parser/nexus_app):

plugins:
  entry_points:
    include:
      - "pynxtools.nomad.metainfo:nexus_base_classes"
      - "pynxtools.nomad.metainfo:nexus_applications"
      - "pynxtools.nomad.parsers:nexus_parser_v2"
      - "pynxtools.nomad.apps.app_v2:nexus_app_v2"

Upload any NeXus (.nxs/HDF5) file as usual; nexus_parser_v2 matches it and produces archives using the Phase 1+2 generated classes (Entry, Xps, Arpes, …), browsable in the "NeXus v2" app.

Key decisions

1. One archive per `NXentry`, plus one `Root` archive per file

A NeXus file can hold multiple NXentry groups. Each becomes its own archive: archive.data is the Entry/Xps/Arpes/… instance directly. There is no wrapper section, unlike parser v1's Root container. A separate Root(Experiment) archive is also produced per file, populated from the file's own NXroot attributes (file_name, file_time, creator, NeXus_version, …) and linking every NXentry archive in that file as an ExperimentStep (resolved lazily via m_context + generate_entry_id, since sibling entries aren't necessarily already persisted). NXroot is mapped to basesections.Experiment in BASESECTIONS_MAP for this. Entry names are disambiguated as "{file stem} - {entry name}" so that entry/entry1-named groups (the overwhelmingly common case) stay distinguishable across files; the Root archive itself is named "{file stem} (NeXus file)".
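The naming scheme above can be sketched in a few lines. This is only an illustration of the two format strings quoted in the description; the helper names `entry_archive_name` and `root_archive_name` are hypothetical and are not taken from parser_v2.py.

```python
from pathlib import PurePosixPath


def entry_archive_name(mainfile: str, entry_name: str) -> str:
    """Hypothetical helper: per-NXentry name, "{file stem} - {entry name}"."""
    return f"{PurePosixPath(mainfile).stem} - {entry_name}"


def root_archive_name(mainfile: str) -> str:
    """Hypothetical helper: per-file Root name, "{file stem} (NeXus file)"."""
    return f"{PurePosixPath(mainfile).stem} (NeXus file)"


# Two files that both contain a group called "entry" still get distinct names.
names = [
    entry_archive_name(path, "entry")
    for path in ("raw/scan_001.nxs", "raw/scan_002.nxs")
]
```

Prefixing with the file stem is what keeps the very common `entry` group name from colliding in search results.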

2. Annotation-driven matching replaces XML/NXDL resolution at parse time

parser_v2.py does not re-derive the schema from NXDL at parse time at all. Instead, it builds a _SectionIndex per generated Section class once, from the a_nexus_group/a_nexus_field annotations already baked into the Phase 1+2 classes (NeXusGroup.nx_class → SubSection, field concept name → Quantity), and uses NexusSchemaResolver only for HDF5-side concept matching (variadic names, NXdata signal/axes hints). Due to the new schema, paths read more naturally (e.g. data.instrument[0].energy_resolution.resolution, not data.ENTRY[0].INSTRUMENT[0].energy_resolution.resolution__field).
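As a rough sketch of the idea (not the actual `_SectionIndex` implementation), the index can be thought of as two lookup tables built once per class from its annotations. Everything below is a stand-in: `GroupAnnotation`, `FieldAnnotation`, `SectionIndexSketch` and `build_index` are hypothetical names, and plain dataclasses replace the real NOMAD metainfo objects.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class GroupAnnotation:
    """Stand-in for an a_nexus_group annotation (hypothetical shape)."""
    nx_class: str
    name: str


@dataclass(frozen=True)
class FieldAnnotation:
    """Stand-in for an a_nexus_field annotation (hypothetical shape)."""
    name: str


@dataclass
class SectionIndexSketch:
    # nx_class -> names of the sub-sections that accept that class
    subsections_by_nx_class: dict[str, list[str]] = field(default_factory=dict)
    # field concept name -> name of the quantity on the section
    quantity_by_concept: dict[str, str] = field(default_factory=dict)


def build_index(members: dict[str, object]) -> SectionIndexSketch:
    """Build the lookup tables once from a class's annotated members."""
    index = SectionIndexSketch()
    for attr_name, annotation in members.items():
        if isinstance(annotation, GroupAnnotation):
            index.subsections_by_nx_class.setdefault(
                annotation.nx_class, []
            ).append(attr_name)
        elif isinstance(annotation, FieldAnnotation):
            index.quantity_by_concept[annotation.name] = attr_name
    return index


# Assumed example members of an Entry-like class.
entry_members = {
    "instrument": GroupAnnotation(nx_class="NXinstrument", name="INSTRUMENT"),
    "sample": GroupAnnotation(nx_class="NXsample", name="SAMPLE"),
    "title": FieldAnnotation(name="title"),
}
index = build_index(entry_members)
```

With such an index, an HDF5 group whose `NX_class` attribute is `NXinstrument` resolves to the `instrument` sub-section by dictionary lookup, without consulting NXDL.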

3. A new search app, `nexus_app_v2`

Separate from the existing NeXus app, but with the same layout. The columns/menu/dashboard are built against the v2 paths (no __field/__group suffixes). Locked to section_defs.definition_qualified_name = pynxtools.nomad.metainfo.base_classes.Entry, so it only ever shows v2-parsed entries (and the application definitions that inherit from Entry).

4. Link quantities and conflict-renamed quantities are populated correctly at parse time

Two parser-level fixes were needed for the typed-link resolution and BaseSection-collision renaming (both already generator-side in #801) to actually produce correct values. First, a NeXusLink quantity whose target type was resolved to something other than str is now populated like a regular field (h5py already dereferences the link transparently) instead of storing the raw HDF5 link path as a string. Second, a quantity renamed for a BaseSection collision (e.g. name_quantity for an NXDL field literally named name) now also mirrors its value into the NOMAD-native quantity it shadows (e.g. name) via a _SectionIndex.shadow_map, so both the NXDL-faithful path and the BaseSection-native path (used by normalize(), search, etc.) carry the value.
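The mirroring step can be illustrated with a minimal sketch. It assumes the shadow map is a plain mapping from the renamed quantity to the native quantity it shadows; the function `set_with_shadow` is hypothetical, and a dict stands in for a metainfo section.

```python
def set_with_shadow(section: dict, shadow_map: dict, quantity: str, value) -> None:
    """Set a quantity and, if it shadows a native one, mirror the value there."""
    section[quantity] = value
    native = shadow_map.get(quantity)
    if native is not None:
        section[native] = value


# Assumed mapping: renamed NXDL-faithful quantity -> NOMAD-native quantity.
shadow_map = {"name_quantity": "name"}

sample = {}
set_with_shadow(sample, shadow_map, "name_quantity", "Au(111)")
set_with_shadow(sample, shadow_map, "temperature", 300.0)  # not shadowed
```

After these calls both `name_quantity` and `name` hold the same value, while `temperature` is written only once.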

5. `results.material` / chemical formula normalization wired up for v2 archives

Entry/Root instances now populate results.material (chemical formula, elements) during normalization, matching parser v1's behavior. This is needed for the new app's "Elements" menu (periodic table, formula filters) to return anything.

What is NOT in this PR (deferred, see #801's phase table)

| Item | Phase |
| --- | --- |
| Migrating `pynxtools-*` plugin apps to the v2 schema/parser | 4 |
| Porting `NORMALIZER_MAP` logic to `normalize()` methods | 5 |
| `metainfo_to_nxdl.py` round-trip exporter | 6 |
| Removing the old `schema.py`/`parser.py`/`NexusBaseSection`/`NORMALIZER_MAP` | 7 |

NamedConceptContext was missing a links field, so groups that define only <link> children (no fields) were not generating named concept classes.

- Add links list to NamedConceptContext
- Collect link children in _build_named_concept()
- Guard on concept.quantities or concept.links
- Template named concept block uses concept.links (not top-level links)
- NXmonopd now generates MonopdData(Data) with polar_angle and data as NeXusLink quantities pointing to the detector fields

…tiple BASESECTIONS_MAP bases

- _package.py: use dir(mod) instead of __all__ to include all generated sections from a module, including named concept classes. Previously __all__ only listed the primary class, leaving named concepts in the per-module auto-package with an unresolvable qualified name that crashed the JS frontend (_allBaseSections error).
- _mapping.py: BASESECTIONS_MAP values changed from single-tuple to list[str] (fully-qualified class names). NXentry now maps to both Measurement and EntryData so Entry(Object, basesections.Measurement, EntryData) is generated, making archive.data = Arpes() valid.
- nxdl_to_metainfo.py: _base_from_extends() updated to return list[str] of extra bases; added _split_fqn() helper; build_context() passes nomad_extra_bases list to template. render() now accepts out_path so ruff runs with the correct pyproject.toml context for isort fixes.
- base_class.py.j2: template iterates nomad_extra_bases, emits imports for non-basesections modules in correct isort position.
- entry.py: regenerated with Entry(Object, basesections.Measurement, EntryData).

…s; fix inheritance chains

Generator (nxdl_to_metainfo.py):

- Named concepts now generated when a group has app-specific sub-groups (children with nx_class absent from the base NXDL class), not only when it has own fields/links. Fixes missing MpesInstrument, XpsInstrument etc.
- Application-derived named concepts inherit from the parent application's concept class (XpsInstrument(MpesInstrument), AfmInstrument(SpmInstrument)) rather than the generic NX base class. Uses group_naming_at(1) to compute the parent concept name from the parent's actual group XML, handling variadic-to-specific slot specialization correctly.
- own_children() used throughout _build_named_concept so each concept only claims members defined at its own NXDL level.
- _section_fqn and all FQN strings use _METAINFO_PACKAGE_ROOT constant.

nexus_tree.py (NexusNode API additions):

- own_children(): children whose nxdl_base matches this node's own file
- children_at_definition(file): children at a specific NXDL definition level
- definition_file_at(idx): .base path of inheritance[idx]
- NexusGroup.group_naming_at(idx): (name, name_type, nx_class) at a specific inheritance level for concept-name computation without building full nodes

…classes LinkContext now carries target_quantity, resolved via NexusNode tree traversal of the NXDL concept path. The Jinja2 template emits the target's type/dimensionality/unit/shape instead of a bare type=str, so link quantities (e.g. NXlauetof time_of_flight) carry real numeric types and units. Regenerated all 20 affected application files.

Entry and all category="application" classes inherit EntryData via Entry. Group them under one "Experiment" category in NOMAD Oasis's "Create new entry from schema" dialog, configured once in the generator template.

Generator now computes eln_component/eln_default for every Quantity based on its Python type and shape. The Jinja2 template emits a_eln=ELNAnnotation(...) with the appropriate ELNComponentEnum variant for all scalar, non-Bytes quantities (str → StringEditQuantity, Datetime → DateTimeEditQuantity, bool → BoolEditQuantity, MEnum → EnumEditQuantity with single-value default, numeric → NumberEditQuantity with a_display unit). Arrays are excluded. SchemaAnnotation(enabled=False) added to NXentry and NXroot m_def to hide these base classes from the "Create from schema" dialog. All 280 generated files updated; 473 tests pass.
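The type-to-component rule described in this commit can be sketched as a small dispatch function. This is an assumption-laden illustration, not the generator code: `eln_component_for` is a hypothetical name, the component names are returned as plain strings rather than ELNComponentEnum members, and `datetime` stands in for the metainfo Datetime type.

```python
from __future__ import annotations

from datetime import datetime
from typing import Optional


def eln_component_for(
    py_type: type, shape: list, is_enum: bool = False
) -> Optional[str]:
    """Return the ELN component name for a quantity, or None if it gets none."""
    if shape:  # arrays are excluded
        return None
    if py_type is bytes:  # Bytes quantities are excluded
        return None
    if is_enum:
        return "EnumEditQuantity"
    if py_type is bool:  # checked before numerics: bool is a subclass of int
        return "BoolEditQuantity"
    if py_type is str:
        return "StringEditQuantity"
    if py_type is datetime:
        return "DateTimeEditQuantity"
    if py_type in (int, float):
        return "NumberEditQuantity"
    return None


scalar_float = eln_component_for(float, [])
array_float = eln_component_for(float, ["*"])
```

Checking `bool` before the numeric branch matters in Python, since `bool` would otherwise be treated as an integer type.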

New parallel parser stack using the generated Python metainfo (nexus_base_classes + nexus_applications) with annotation-based HDF5 navigation instead of XML/NXDL schema resolution at parse time.

- NexusParserV2 + NomadVisitorV2: walks HDF5 via NexusSchemaResolver, resolves quantities via per-class _SectionIndex (no __field/__group suffixes, no _rename_nx_for_nomad)
- One archive per NXentry; archive.data is the Entry/Arpes/Xps instance directly
- m_nx_data_path stored as a JSON HDF5-path -> archive-path map
- New nexus_app_v2 explore app
- handler.py: prescan() + on_prescan_group() hook for pre-pass metadata collection
- Acceptance-gate tests in test_parsing_v2.py

Every NeXus file now also produces a "root" child archive holding a Root(Experiment) instance that links all NXentry archives from that file as ExperimentSteps (resolved lazily in normalize() via m_context + generate_entry_id).

- BASESECTIONS_MAP: NXroot -> Experiment + EntryData
- Root entry_name = "{file stem} (NeXus file)"; results.eln.methods lists the unique technique names (Arpes, Xps, ...) across all NXentries in the file
- A NomadVisitorV2 pass over the root group populates Root's own NXroot attributes (file_name, file_time, creator, NeXus_version, ...)
- Per-entry entry_name is now "{file stem} - {entry name}" so entries named "entry"/"entry1" remain distinguishable across files

NeXusLink quantities whose target type was resolved by the generator (type != str) are now populated like regular fields instead of storing the HDF5 link path as a string -- h5py transparently dereferences the link so hdf_node already carries the target's real data. Conflict-renamed quantities (e.g. name_quantity for NXDL "name") now mirror their value into the NOMAD-native quantity they shadow (e.g. name), via a new _SectionIndex.shadow_map. Fixes the remaining NXlauetof assertions: name_group.time_of_flight and sample.name.

_SectionIndex.find_subsection used a naive .replace("GROUPNAME","").replace("NAME","") + startswith check for NXDL nameType="partial" groups, requiring the HDF5 group name to literally start with the NXDL name (e.g. "eventID..."). The canonical algorithm (get_nx_namefit) instead treats uppercase letter runs in the given name as wildcards, so "event1" is just as valid a match for "eventID" as "eventID1". Fixed to use get_nx_namefit(hdf_name, annotation.name, name_partial=True).

Also adds test_parsing_opt_v2.py, a v2 port of test_parsing_opt.py's storage-layout/dtype matrix (33 parametrized cases), asserting structural reachability of the NXdata group. The original mean/min/max/size/ndim value assertions are commented out pending FIELD_STATISTICS support for array-shaped quantities in NXdata-derived classes, deferred to Phase 4.
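To make the difference between the old prefix check and the wildcard rule concrete, here is a simplified, self-contained sketch. It is not get_nx_namefit and does not reproduce its behaviour in full; `naive_partial_match` and `partial_name_matches` are hypothetical names, and the sketch assumes each uppercase run must match at least one character from `[A-Za-z0-9_.]`.

```python
import re


def naive_partial_match(hdf_name: str, nxdl_name: str) -> bool:
    """The old behaviour as described: strip placeholders, then prefix-match."""
    prefix = nxdl_name.replace("GROUPNAME", "").replace("NAME", "")
    return hdf_name.startswith(prefix)


def partial_name_matches(hdf_name: str, nxdl_name: str) -> bool:
    """Simplified wildcard rule: each uppercase run matches any name characters."""
    pattern = "".join(
        r"[A-Za-z0-9_.]+" if run.isupper() else re.escape(run)
        for run in re.findall(r"[A-Z]+|[^A-Z]+", nxdl_name)
    )
    return re.fullmatch(pattern, hdf_name) is not None


old_result = naive_partial_match("event1", "eventID")   # rejected by the prefix check
new_result = partial_name_matches("event1", "eventID")  # accepted by the wildcard rule
```

Under the wildcard rule the fixed part `event` must still be present, so a misspelled group such as `evnt1` is rejected by both variants.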

lukaspie added 20 commits June 29, 2026 18:24

generate all base classes

32d4460

add all applications and contributed; only two packages

490e977

update documentation

a3bee69

clean up unneeded import of basesections

f02e997

Add Experiment category to entry-creatable NeXus classes

b3c3016

fix the paths in the NeXus app

c2a468c

clean up test files

5b42636

enable Root(Experiment) to create Entry for NeXus file

804b6f6

enable chemical formula normalization across Entry + Root instances

f16dae7

fix results.material normalizer + tests

54f6734

readd experiment_description in app

28cfb42

lukaspie changed the title ~~NeXus metainfo phase3 - parser & app~~ NeXus–NOMAD metainfo generator: Phase 3 (parser v2 + app v2) Jun 29, 2026

rebase issues

0759548

lukaspie wants to merge 21 commits into nexus-metainfo-phase2 from nexus-metainfo-phase3
