NeXus–NOMAD metainfo generator: Phase 3 (parser v2 + app v2)#810

Draft
lukaspie wants to merge 21 commits into nexus-metainfo-phase2 from nexus-metainfo-phase3

lukaspie (Collaborator) commented Jun 29, 2026

Builds on #801 (Phases 1 + 2). This branch is fully rebased onto it. For the full plan, all phases, and the Architecture Decision Records, see the data-modeling repo: https://github.com/FAIRmat-NFDI/data-modeling/tree/main/nexus-metainfo (specifically here for Phase 3 implementation).


Summary

Phase 1+2 generated the static Python metainfo (base classes + applications). This PR adds a new parser (nexus_parser_v2) that reads HDF5 directly into the generated Python classes via their a_nexus_* annotations, and a new search app (nexus_app_v2) that mirrors the old app, but uses the new names in the schema. Both are purely additive, new entry points; nothing existing is touched or removed (for now).

Testing this branch

To test with the new schema, install pynxtools from this branch and add these entry points to your nomad.yaml (in addition to or instead of the existing nexus_schema/nexus_parser/nexus_app):

plugins:
  entry_points:
    include:
      - "pynxtools.nomad.metainfo:nexus_base_classes"
      - "pynxtools.nomad.metainfo:nexus_applications"
      - "pynxtools.nomad.parsers:nexus_parser_v2"
      - "pynxtools.nomad.apps.app_v2:nexus_app_v2"

Upload any NeXus (.nxs/HDF5) file as usual; nexus_parser_v2 matches it and produces archives using the Phase 1+2 generated classes (Entry, Xps, Arpes, …), browsable in the "NeXus v2" app.


Key decisions

1. One archive per NXentry, plus one Root archive per file

A NeXus file can hold multiple NXentry groups. Each becomes its own archive: archive.data is the Entry/Xps/Arpes/… instance directly. There is no wrapper section, unlike parser v1's Root container. A separate Root(Experiment) archive is also produced per file, populated from the file's own NXroot attributes (file_name, file_time, creator, NeXus_version, …) and linking every NXentry archive in that file as an ExperimentStep (resolved lazily via m_context + generate_entry_id, since sibling entries aren't necessarily already persisted). NXroot is mapped to basesections.Experiment in BASESECTIONS_MAP for this. Entry names are disambiguated as "{file stem} - {entry name}" so that entry/entry1-named groups (the overwhelmingly common case) stay distinguishable across files; the Root archive itself is named "{file stem} (NeXus file)".
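As a rough illustration of the naming scheme above, the two name formats can be sketched in plain Python. The helper names `entry_display_name` and `root_display_name` and the example file path are hypothetical and not taken from this branch; only the two format strings come from the description.

```python
from pathlib import Path


def entry_display_name(mainfile: str, entry_name: str) -> str:
    """Name for a per-NXentry archive: "{file stem} - {entry name}"."""
    return f"{Path(mainfile).stem} - {entry_name}"


def root_display_name(mainfile: str) -> str:
    """Name for the per-file Root archive: "{file stem} (NeXus file)"."""
    return f"{Path(mainfile).stem} (NeXus file)"


# Two files that both contain a group called "entry" stay distinguishable.
names = [
    entry_display_name("uploads/scan_001.nxs", "entry"),
    entry_display_name("uploads/scan_002.nxs", "entry"),
    root_display_name("uploads/scan_001.nxs"),
]
```

Using the file stem as a prefix means the disambiguation needs no knowledge of sibling files, at the cost of longer entry names.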

2. Annotation-driven matching replaces XML/NXDL resolution at parse time

parser_v2.py does not re-derive the schema from NXDL at parse time. Instead, it builds a _SectionIndex per generated Section class once, from the a_nexus_group/a_nexus_field annotations already baked into the Phase 1+2 classes (NeXusGroup.nx_class → SubSection, field concept name → Quantity), and uses NexusSchemaResolver only for HDF5-side concept matching (variadic names, NXdata signal/axes hints). With the new schema, paths read more naturally (e.g. data.instrument[0].energy_resolution.resolution, not data.ENTRY[0].INSTRUMENT[0].energy_resolution.resolution__field).
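The idea of building a lookup index once from annotations can be sketched as follows. This is a simplified stand-in, not the `_SectionIndex` from this branch: `GroupAnnotation`, `FieldAnnotation`, `SectionIndexSketch`, and `build_index` are hypothetical names, and plain dictionaries replace the generated Section classes.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class GroupAnnotation:
    """Hypothetical stand-in for an a_nexus_group annotation."""
    nx_class: str  # e.g. "NXinstrument"


@dataclass(frozen=True)
class FieldAnnotation:
    """Hypothetical stand-in for an a_nexus_field annotation."""
    concept_name: str  # NXDL field concept name, e.g. "resolution"


@dataclass
class SectionIndexSketch:
    # NX class -> names of the sub-sections that accept groups of that class
    subsections: dict = field(default_factory=dict)
    # NXDL field concept name -> name of the quantity that stores it
    quantities: dict = field(default_factory=dict)


def build_index(group_annotations: dict, field_annotations: dict) -> SectionIndexSketch:
    """Build the lookup tables once per section class, from annotations only."""
    index = SectionIndexSketch()
    for subsection_name, annotation in group_annotations.items():
        index.subsections.setdefault(annotation.nx_class, []).append(subsection_name)
    for quantity_name, annotation in field_annotations.items():
        index.quantities[annotation.concept_name] = quantity_name
    return index


# Assumed example: a section with one sub-section and one quantity.
index = build_index(
    {"instrument": GroupAnnotation("NXinstrument")},
    {"resolution": FieldAnnotation("resolution")},
)
```

With such an index, resolving an HDF5 group or dataset is a dictionary lookup against data already present on the Python classes, so no NXDL XML has to be opened while parsing.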

3. A new search app, nexus_app_v2

Separate from the existing NeXus app, but with the same layout. The columns/menu/dashboard are built against the v2 paths (no __field/__group suffixes). Locked to section_defs.definition_qualified_name = pynxtools.nomad.metainfo.base_classes.Entry, so it only ever shows v2-parsed entries (and the application definitions that inherit from Entry).

4. Link quantities and conflict-renamed quantities are populated correctly at parse time

Two parser-level fixes were needed for the typed-link resolution and BaseSection-collision renaming (both already generator-side in #801) to actually produce correct values. First, a NeXusLink quantity whose target type was resolved to something other than str is now populated like a regular field (h5py already dereferences the link transparently) instead of storing the raw HDF5 link path as a string. Second, a quantity renamed for a BaseSection collision (e.g. name_quantity for an NXDL field literally named name) now also mirrors its value into the NOMAD-native quantity it shadows (e.g. name) via a _SectionIndex.shadow_map, so both the NXDL-faithful path and the BaseSection-native path (used by normalize(), search, etc.) carry the value.
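The mirroring behavior of the shadow map can be sketched with plain dictionaries. This is only an illustration under assumptions: `set_with_shadow` is a hypothetical helper, the section is modeled as a dict rather than a NOMAD section, and the sample value is made up.

```python
def set_with_shadow(section: dict, shadow_map: dict, quantity: str, value) -> None:
    """Store a value and mirror it into the native quantity it shadows, if any."""
    section[quantity] = value
    native_name = shadow_map.get(quantity)
    if native_name is not None:
        # The renamed quantity and the BaseSection-native one carry the same value.
        section[native_name] = value


# Mapping from conflict-renamed quantity to the native quantity it shadows.
shadow_map = {"name_quantity": "name"}

sample = {}
set_with_shadow(sample, shadow_map, "name_quantity", "example sample")
set_with_shadow(sample, shadow_map, "temperature", 300.0)  # no shadow entry
```

Quantities without an entry in the map are stored once, so the mirroring only affects names that actually collided with a BaseSection quantity.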

5. results.material / chemical formula normalization wired up for v2 archives

Entry/Root instances now populate results.material (chemical formula, elements) during normalization, matching parser v1's behavior. This is needed for the new app's "Elements" menu (periodic table, formula filters) to return anything.


What is NOT in this PR (deferred, see #801's phase table)

| Item | Phase |
| --- | --- |
| Migrating pynxtools-* plugin apps to the v2 schema/parser | 4 |
| Porting NORMALIZER_MAP logic to normalize() methods | 5 |
| metainfo_to_nxdl.py round-trip exporter | 6 |
| Removing the old schema.py/parser.py/NexusBaseSection/NORMALIZER_MAP | 7 |

lukaspie added 20 commits June 29, 2026 18:24
NamedConceptContext was missing a links field, so groups that define only
<link> children (no fields) were not generating named concept classes.

- Add links list to NamedConceptContext
- Collect link children in _build_named_concept()
- Guard on concept.quantities or concept.links
- Template named concept block uses concept.links (not top-level links)
- NXmonopd now generates MonopdData(Data) with polar_angle and data as
  NeXusLink quantities pointing to the detector fields
…tiple BASESECTIONS_MAP bases

- _package.py: use dir(mod) instead of __all__ to include all generated
  sections from a module, including named concept classes. Previously
  __all__ only listed the primary class, leaving named concepts in the
  per-module auto-package with an unresolvable qualified name that
  crashed the JS frontend (_allBaseSections error).

- _mapping.py: BASESECTIONS_MAP values changed from single-tuple to
  list[str] (fully-qualified class names). NXentry now maps to both
  Measurement and EntryData so Entry(Object, basesections.Measurement,
  EntryData) is generated, making archive.data = Arpes() valid.

- nxdl_to_metainfo.py: _base_from_extends() updated to return list[str]
  of extra bases; added _split_fqn() helper; build_context() passes
  nomad_extra_bases list to template. render() now accepts out_path so
  ruff runs with the correct pyproject.toml context for isort fixes.

- base_class.py.j2: template iterates nomad_extra_bases, emits imports
  for non-basesections modules in correct isort position.

- entry.py: regenerated with Entry(Object, basesections.Measurement, EntryData).
…s; fix inheritance chains

Generator (nxdl_to_metainfo.py):
- Named concepts now generated when a group has app-specific sub-groups
  (children with nx_class absent from the base NXDL class), not only when
  it has own fields/links. Fixes missing MpesInstrument, XpsInstrument etc.
- Application-derived named concepts inherit from the parent application's
  concept class (XpsInstrument(MpesInstrument), AfmInstrument(SpmInstrument))
  rather than the generic NX base class. Uses group_naming_at(1) to compute
  the parent concept name from the parent's actual group XML, handling
  variadic-to-specific slot specialization correctly.
- own_children() used throughout _build_named_concept so each concept only
  claims members defined at its own NXDL level.
- _section_fqn and all FQN strings use _METAINFO_PACKAGE_ROOT constant.

nexus_tree.py (NexusNode API additions):
- own_children(): children whose nxdl_base matches this node's own file
- children_at_definition(file): children at a specific NXDL definition level
- definition_file_at(idx): .base path of inheritance[idx]
- NexusGroup.group_naming_at(idx): (name, name_type, nx_class) at a specific
  inheritance level for concept-name computation without building full nodes
…classes

LinkContext now carries target_quantity, resolved via NexusNode tree
traversal of the NXDL concept path. The Jinja2 template emits the
target's type/dimensionality/unit/shape instead of a bare type=str,
so link quantities (e.g. NXlauetof time_of_flight) carry real numeric
types and units. Regenerated all 20 affected application files.
Entry and all category="application" classes inherit EntryData via
Entry. Group them under one "Experiment" category in NOMAD Oasis's
"Create new entry from schema" dialog, configured once in the
generator template.
Generator now computes eln_component/eln_default for every Quantity based on
its Python type and shape. The Jinja2 template emits a_eln=ELNAnnotation(...)
with the appropriate ELNComponentEnum variant for all scalar, non-Bytes
quantities (str → StringEditQuantity, Datetime → DateTimeEditQuantity,
bool → BoolEditQuantity, MEnum → EnumEditQuantity with single-value default,
numeric → NumberEditQuantity with a_display unit). Arrays are excluded.
SchemaAnnotation(enabled=False) added to NXentry and NXroot m_def to hide
these base classes from the "Create from schema" dialog. All 280 generated
files updated; 473 tests pass.
New parallel parser stack using the generated Python metainfo
(nexus_base_classes + nexus_applications) with annotation-based
HDF5 navigation instead of XML/NXDL schema resolution at parse time.

- NexusParserV2 + NomadVisitorV2: walks HDF5 via NexusSchemaResolver,
  resolves quantities via per-class _SectionIndex (no __field/__group
  suffixes, no _rename_nx_for_nomad)
- One archive per NXentry; archive.data is the Entry/Arpes/Xps
  instance directly
- m_nx_data_path stored as a JSON HDF5-path -> archive-path map
- New nexus_app_v2 explore app
- handler.py: prescan() + on_prescan_group() hook for pre-pass
  metadata collection
- Acceptance-gate tests in test_parsing_v2.py
Every NeXus file now also produces a "root" child archive holding a
Root(Experiment) instance that links all NXentry archives from that
file as ExperimentSteps (resolved lazily in normalize() via
m_context + generate_entry_id).

- BASESECTIONS_MAP: NXroot -> Experiment + EntryData
- Root entry_name = "{file stem} (NeXus file)"; results.eln.methods
  lists the unique technique names (Arpes, Xps, ...) across all
  NXentries in the file
- A NomadVisitorV2 pass over the root group populates Root's own
  NXroot attributes (file_name, file_time, creator, NeXus_version, ...)
- Per-entry entry_name is now "{file stem} - {entry name}" so entries
  named "entry"/"entry1" remain distinguishable across files
NeXusLink quantities whose target type was resolved by the generator
(type != str) are now populated like regular fields instead of storing
the HDF5 link path as a string -- h5py transparently dereferences the
link so hdf_node already carries the target's real data.

Conflict-renamed quantities (e.g. name_quantity for NXDL "name") now
mirror their value into the NOMAD-native quantity they shadow (e.g.
name), via a new _SectionIndex.shadow_map.

Fixes the remaining NXlauetof assertions: name_group.time_of_flight
and sample.name.
_SectionIndex.find_subsection used a naive .replace("GROUPNAME","").replace("NAME","")
+ startswith check for NXDL nameType="partial" groups, requiring the HDF5 group name
to literally start with the NXDL name (e.g. "eventID..."). The canonical algorithm
(get_nx_namefit) instead treats uppercase letter runs in the given name as wildcards,
so "event1" is just as valid a match for "eventID" as "eventID1". Fixed to use
get_nx_namefit(hdf_name, annotation.name, name_partial=True).

Also adds test_parsing_opt_v2.py, a v2 port of test_parsing_opt.py's storage-layout/
dtype matrix (33 parametrized cases), asserting structural reachability of the
NXdata group. The original mean/min/max/size/ndim value assertions are commented out
pending FIELD_STATISTICS support for array-shaped quantities in NXdata-derived
classes, deferred to Phase 4.
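
The difference between the two matching strategies in the last commit can be sketched as below. This is a simplified stand-in and not the pynxtools `get_nx_namefit` implementation: both function names are hypothetical, the wildcard rule is reduced to "each run of uppercase letters matches one or more name characters", and the real function's return value is not modeled.

```python
import re


def naive_partial_match(hdf_name: str, nxdl_name: str) -> bool:
    """Old behavior: strip placeholder words, then require a literal prefix."""
    prefix = nxdl_name.replace("GROUPNAME", "").replace("NAME", "")
    return hdf_name.startswith(prefix)


def wildcard_partial_match(hdf_name: str, nxdl_name: str) -> bool:
    """Sketch of the fixed behavior: uppercase runs act as wildcards."""
    parts = re.findall(r"[A-Z]+|[^A-Z]+", nxdl_name)
    pattern = "".join(
        r"[A-Za-z0-9_]+" if part.isupper() else re.escape(part)
        for part in parts
    )
    return re.fullmatch(pattern, hdf_name) is not None


results = {
    "naive event1": naive_partial_match("event1", "eventID"),
    "naive eventID1": naive_partial_match("eventID1", "eventID"),
    "wildcard event1": wildcard_partial_match("event1", "eventID"),
    "wildcard eventID1": wildcard_partial_match("eventID1", "eventID"),
    "wildcard sample1": wildcard_partial_match("sample1", "eventID"),
}
```

In this sketch the naive check rejects "event1" because it does not literally start with "eventID", while the wildcard check accepts both "event1" and "eventID1" and still rejects unrelated names.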
@lukaspie lukaspie changed the title NeXus metainfo phase3 - parser & app NeXus–NOMAD metainfo generator: Phase 3 (parser v2 + app v2) Jun 29, 2026