AWI-ESM3#1479
Open
JanStreffing wants to merge 312 commits into
…ted still to the actual output variables.
The OIFS branch feature/cmip7-rh-online (merged into local_combined_fixes 2026-04-17) computes CMIP/CF-conformant relative humidity online every timestep using Alduchov-Eskridge Magnus with the hard water/ice switch at 273.15 K, and sends `hur_cmip7` (model levels) and `hurs_cmip7` (2m) to XIOS. Until now those sends were dropped because no XIOS field declared the IDs.

field_def_cmip7.xml.j2:
- declare raw `hurs_cmip7` (fraction) in 2D_physical
- declare raw `hur_cmip7` (fraction) in 3D_ml
- add percent alias `near_surface_relative_humidity_pct__hurs` (name="hurs")
- repoint `relative_humidity_ml__hur` from `r` (FOEEWM mixed-phase QSAT, NOT CMIP-conformant) to `hur_cmip7`

file_def_oifs_cmip7_spinup.xml.j2:
- write `hurs` into atmos_mon (monthly surface)
- write `hurs` into atmos_day (daily surface)
- atmos_mon_ml already references relative_humidity_ml__hur, so it picks up CMIP7-conformant hur automatically

Addresses Felix Pithan's 2026-04-15 feedback: IFS-native r/r_pl uses mixed-phase QSAT interpolation between RTICE and RTWAT (not CMIP-CF), and post-hoc RH from monthly-mean ta/hus is biased due to nonlinear e_sat(T). Computing online and averaging downstream fixes both.
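To make the two points above concrete, here is a minimal Python sketch of a Magnus-form saturation vapour pressure with a hard water/ice switch at 273.15 K, and of the averaging bias from nonlinear e_sat(T). It is not the OIFS code. The coefficients are the Alduchov and Eskridge (1996) values as I recall them, and the function names, the choice to treat exactly 273.15 K as water, and the definition of RH as a vapour-pressure ratio are assumptions made for illustration.

```python
import math

T_FREEZE = 273.15  # K, hard water/ice switch named in the commit message
EPS = 0.622        # ratio of gas constants, dry air / water vapour

def e_sat(t_kelvin):
    """Saturation vapour pressure in Pa, Magnus form.

    Coefficients: Alduchov & Eskridge (1996); assumed here, not read from OIFS.
    Assumption: exactly 273.15 K is treated as liquid water.
    """
    t_c = t_kelvin - T_FREEZE
    if t_kelvin >= T_FREEZE:
        return 610.94 * math.exp(17.625 * t_c / (243.04 + t_c))  # over water
    return 611.21 * math.exp(22.587 * t_c / (273.86 + t_c))      # over ice

def relative_humidity(q, p, t_kelvin):
    """RH as a fraction from specific humidity q (kg/kg) and pressure p (Pa).

    Assumption: RH is defined as the ratio e / e_sat of vapour pressures.
    """
    e = q * p / (EPS + (1.0 - EPS) * q)
    return e / e_sat(t_kelvin)

# Nonlinearity: the mean of e_sat(T) is not e_sat of the mean T, so RH
# recomputed from time-mean ta/hus differs from the time mean of RH.
temps = [263.15, 283.15]
mean_of_esat = sum(e_sat(t) for t in temps) / len(temps)
esat_of_mean = e_sat(sum(temps) / len(temps))
```

With these two sample temperatures `mean_of_esat` is larger than `esat_of_mean`, which is the reason the commit computes RH every timestep and leaves the averaging to XIOS.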
- Add 12 {grass,crop,pasture,shrub,baresoil,veg}Frac[C3/C4]Phen_monthly
entries (phenology/LAI-weighted companions).
- Add 6 {grass,crop,pasture}FracC{3,4}_yearly entries (stand-area).
OpenMPI/UCX teardown after a successful run can emit benign 'srun: error:' cascades to stdout. With method=kill, observe_compute scancels the job before tidy runs — so Rename_XIOS_FESOM, Combine_LPJG, etc. never fire, and the experiment ends in a broken state despite the model having completed compute cleanly. Demoting to 'warn' lets the job complete and tidy run; real srun launch failures still abort via the existing 'slurmstepd: error: execve():' kill trigger right below this one.
Adds a new iolibraries arm 'system_libs_oneapi2025_ompi5' alongside the existing 'system_intel_libs' (intel-2022 + openmpi/4.1.2 + UCX 1.12). Not activated by default; selectable per setup via iolibraries.

What's needed for OMPI5/UCX 1.19 to actually run on Levante SLURM:
- SLURM_MPI_TYPE=pmix_v5 (default pmix_v3 mismatches OMPI5's PMIx 5.0.5 build and makes MPI_Init_thread abort on a NULL communicator)
- OMPI_MCA_io=romio341 (OMPI5 doesn't ship romio321)
- UCX_TLS=all (UCX 1.19 transport names differ; mm/dc_x no longer exist)
- UCX_UNIFIED_MODE=n (y leaves tl_bitmap=0; misbehaves under heterogeneous MPMD)
- gcc-13.4 lib64 prepended to LD_LIBRARY_PATH / LIBRARY_PATH / LDFLAGS (icpx 2025 emits GLIBCXX_3.4.32; system /lib64/libstdc++ tops out at 3.4.25, and mambaforge's libstdc++ leaks into ecbuild's rpath-link)
- TBBROOT/TBBMALLOC_DIR (2025 oneAPI TBB layout: tbb/2022.3/lib, no intel64/gcc4.8 subdir)
- AEC_ROOT (libaec install dir behind libaec/1.0.5-oneapi-2025.0.4)
- Per-iolibraries OMPI_MCA_io (was global; moved because the two MPI builds need different ROMIO versions)
- PATH wrapper dir for mpicc/mpif90/mpicxx (ectrans link.txt files don't inherit CMAKE_EXE_LINKER_FLAGS, so the wrappers prepend gcc-13.4 -rpath-link)
- launcher_flags: '-l --mpi=pmix_v5' on the iolibraries arm itself

Component env mirrors in amip.yaml and rnfmap.env.yaml: same Intel-style compiler flags as intel2022_openmpi (ifx 2025.3.2 accepts them all).
Follow-up to 0af184e ('do not scancel on srun: error:').

The blanket warn on srun:error stops scancels for the benign pmix_v3 finalize-race cascade, but it also swallows real crashes (OOM, segfault): a real failure would emit 'srun: error: task X: Out Of Memory' or 'Killed' AFTER the model already wrote partial state, and esm_tools would still advance the date file because no trigger fired kill. That happened on HR Test_15 chunk 2 where an LPJ-GUESS rank OOM'd at init, the experiment crashed silently, and the date file marched on to 'Experiment over'.

Fix: keep srun:error as warn (still rescues the benign cascade), but add specific kill triggers that ONLY match real-failure signatures:
- 'oom_kill event' (slurmstepd OOM notification)
- 'Out Of Memory' (srun OOM report)
- 'Segmentation fault' (SIGSEGV)
- 'Bus error' (SIGBUS)
- 'DUE TO TIME LIMIT' (walltime exceeded)
- 'Aborted (signal' (process abort)

Validated against the known-good LR test52 log (0 false positives for all 6 new patterns) and the failure logs from HR Test_15 chunk2 (oom_kill + Out Of Memory both match correctly).

LR_run_test54 with this config: COMPLETED 0:0, 26 srun:error lines (warn, no scancel), 0 trigger hits, full tidy pipeline ran.
…patterns

The blanket `Segmentation fault` kill trigger (added in e4f31eb) was too coarse at HR scale. Final_CMIP7_IO_Test_14 and Test_17 (~3281 ranks) both reached the final compute step (STEP 52560, H=17520:00) and only then hit a FESOM `ucs_mpool_cleanup` NULL-deref SIGSEGV on rank 1394 during MPI_Finalize, a known-benign UCX teardown race that nevertheless emits `Segmentation fault` to stdout. The trigger fired and scancel'd the job before tidy could run, losing 7+ hours of completed HR compute.

esm_tools check_error is substring-only (no regex), so we can't keep `Segmentation fault` kill and exclude the (nil) finalize variant. Practical fix: drop the broad segfault trigger and replace it with narrower patterns that real compute crashes actually emit:
- `forrtl: severe` (Intel Fortran runtime error)
- `MPI_ABORT was invoked` (explicit MPI abort from a rank)
- `BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES` (OpenMPI's abnormal-exit marker)

Trade-off: a raw C-level SIGSEGV during compute (rare in this codebase: FESOM/OIFS/lpj_guess are mostly Fortran or use the runtime guards above) will no longer auto-kill. Such failures still produce incomplete output, which is catchable by output-volume / final-timestep inspection.

Cross-checked against actual logs:
- Test_54 (clean LR): 0 hits on all new triggers (no false pos.)
- Test_17 (HR teardown): 0 hits → tidy would run, output recoverable
- Test_15 chunk 2 (OOM): oom_kill + Out Of Memory still match → kill
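The resulting trigger set can be illustrated with a small substring-only classifier. This is a sketch of the matching behaviour described above, not the esm_tools `check_error` implementation: the names `KILL_TRIGGERS`, `WARN_TRIGGERS` and `classify` are hypothetical, and checking kill patterns before warn patterns is an assumption.

```python
# Substring triggers taken from the commit messages above.
KILL_TRIGGERS = [
    "slurmstepd: error: execve():",
    "oom_kill event",
    "Out Of Memory",
    "Bus error",
    "DUE TO TIME LIMIT",
    "Aborted (signal",
    "forrtl: severe",
    "MPI_ABORT was invoked",
    "BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES",
]
WARN_TRIGGERS = ["srun: error:"]

def classify(line):
    """Return 'kill', 'warn' or 'ok' for one log line.

    Plain `in` tests only (no regex), mirroring the substring-only
    constraint. Assumption: kill patterns take precedence over warn.
    """
    if any(trigger in line for trigger in KILL_TRIGGERS):
        return "kill"
    if any(trigger in line for trigger in WARN_TRIGGERS):
        return "warn"
    return "ok"

# Illustrative log lines, written for this example rather than copied from a run.
oom_line = "srun: error: task 12: Out Of Memory"
teardown_line = "srun: error: task 1394: Segmentation fault"
```

Here `classify(oom_line)` gives `'kill'` while `classify(teardown_line)` gives only `'warn'`, which is the intended trade-off: the teardown segfault no longer cancels the job, at the cost of not auto-killing on a genuine segfault during compute.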
… aEVP for DARS2
Three updates so v3.4.2 doesn't lag develop:
1. fesom-2.7.4 → fesom-2.7.5
Upstream tag 2.7.5 adds 7 melt-pond improvements over 2.7.4:
- refactor: Icepack-style Stefan lid (replaces home-grown ipnd accumulator)
- feat: namelist-tunable tanh snow-cover albedo blend (h_snowscale)
- feat: expose meltpond tuning parameters via namelist.ice
- fix: meltpond geometry bookkeeping (units, dual branch, alid factor)
- fix: drop hsno reset so lid grows under snow (+revert pair)
2. lpj_guess-4.1.4 → lpj_guess-4.1.5
New tag pinned at lpj_guess_awiesm3 HEAD (67fd6d6) on jan.streffing/lpjg-4.1.
Adds since 4.1.4:
- 67fd6d6 CMIP output: stand-area *Frac + Phen vars
- 041cfbc CMIP output: per-PFT yearly treeFrac{BdlDcd,BdlEvg,NdlDcd,NdlEvg}
- 24da0de CMIP output: stand-area treeFrac, vegHeightGrass, +Phen vars
- 2ee5955 LUH3 / landcover bookkeeping (Fraction-error fix at spinup
->transient handover, div-by-zero guards, fraction-cap)
- 0fef730 auto-detect CO2/ndep first year from NetCDF time axis
3. aEVP (whichevp=2) default on DARS2 for v3.4.2
The May-11 27fda99 commit scoped aEVP to develop/develop-cc only,
keeping v3.4.0/v3.4.1 on mEVP. Since v3.4.2 is the next tagged release
and HR (DARS2) sithick drift is a real bug fixed by aEVP (Koldunov 2019
JAMES), propagate the same DARS2-only override to v3.4.2.
Adds the 2.7.5 / 4.1.5 choose_version arms in the corresponding component
yamls, and updates the v3.4.2 coupling spec to consume them.
Not touching the v3.4.2 oasis pin (5.2) — the OMPI4/UCX12 finalize race
that motivated the local MPI_INIT_THREAD revert turned out not to be
actually fixed by that revert; v3.4.2 keeps the upstream-tagged 5.2.
Mirror the awiesm3 v3.4.2 work for the AWI-CM-3 setup (no LPJ-GUESS).
New file: configs/couplings/awicm3_v3.4.2/awicm3_v3.4.2.yaml
components:
xios-2.5.2, rnfmap-v1.3, oifs-48r1v4, fesom-2.7.5, oasis3mct-5.2
Same upstream-tag pins as awiesm3_v3.4.2 minus lpj_guess (awicm3 has no
vegetation component).
configs/setups/awicm3/awicm3.yaml:
- Add 'v3.4.2' to supported_versions
- Add v3.4.2 arm in choose_version pointing to awicm3_v3.4.2 coupling
- Propagate aEVP (whichevp=2) on DARS2 from develop arm to v3.4.2 arm
Note: pre-existing v3.4.1 arm still references awicm3_v3.4.0 coupling
(was probably an oversight when v3.4.1 was added) — leaving that as-is
since v3.4.2 supersedes it for production.
… cadence

Regrid output quality fixes after validation runs on AWI-ESM3 CORE2/TCO95.

XML changes (both xios_xml/ and xios_xml_cmip7/ domain_def):
- drop interpolate_domain quantity="true" added in 4948546. quantity=true treats fields as extensive (sums area-fraction-weighted contributions); correct only for fluxes and only when source/target cells match in size. FESOM unstructured mesh has 5-20x resolution variation, and target cells in dense-mesh regions (e.g. Indonesia) contain ~12 source cells whose values summed produced impossible values (SST up to 316C). OIFS-to-regular has run without quantity for years; matching that gives intensive area-weighted average and physical per-cell values for all field types.
- add interpolate_domain renormalize="true". FESOM unstructured source has no land cells, so coastal target cells were partially covered (weights summing to <1) which diluted values toward 0 across every coastline. renormalize=true divides by sum-of-weights, restoring physical ocean values up to the actual land mask. Pure-land cells retain the 0 sentinel they had before.

File_def restructure (xios_xml_cmip7/file_def_fesom.xml.j2 only):
- non-CMIP7 file_def back to upstream-like (no regrid blocks).
- fesom_1d_reg: 13 daily 2D fields (mirrors fesom_1d) gated by fesom.regrid_surface_output.
- fesom_1mo_reg: 33 node + 6 element 2D fields at 1mo (mirrors fesom_1mo 2D set, 0D scalars excluded) gated same.
- fesom_1mo_reg_3d: full 3D set (cell-center + interface vertical + dMOC density coords) gated by new fesom.regrid_3d_output flag, default off, since 3D regrid is expensive at production resolutions.

grid_def additions (xios_xml_cmip7/grid_def_fesom.xml):
- grid_3d_nod_reg, grid_3d_elem_reg (cell-center vertical)
- grid_3d_nod_reg_nz1, grid_3d_elem_reg_nz1 (interface vertical)
- grid_rho_nod_reg, grid_rho_elem_reg (dMOC density coord)

Config flag changes (configs/components/fesom/fesom-2.7.yaml):
- regrid_surface_output: description updated to reflect native-cadence mirroring and CMIP7-only scope.
- regrid_3d_output: new flag, default false.
- regrid_surface_daily: removed. Was only needed when the colleague's original regrid blocks output only monthly; replaced by native-cadence mirroring.
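The effect of dropping quantity="true" and adding renormalize="true" in this commit can be reproduced with a toy weighting model. This is a sketch under simplifying assumptions, not XIOS's remapper: the function names are hypothetical, weights are taken as overlap area divided by target-cell area for the intensive case, and the extensive case sums each source value scaled by the fraction of the source cell that falls inside the target.

```python
def intensive_average(values, overlap_areas, target_area, renormalize=False):
    """Area-weighted average of source values inside one target cell.

    Weights are overlap_area / target_area, so they sum to < 1 when part of
    the target cell has no source coverage (e.g. land next to an ocean mesh).
    """
    weights = [a / target_area for a in overlap_areas]
    total = sum(w * v for w, v in zip(weights, values))
    if not renormalize:
        return total
    weight_sum = sum(weights)
    # Cells with no coverage at all keep the 0 sentinel.
    return total / weight_sum if weight_sum > 0 else 0.0

def extensive_sum(values, overlap_areas, source_areas):
    """Sum of source values scaled by the fraction of each source cell used."""
    return sum(v * a / s for v, a, s in zip(values, overlap_areas, source_areas))

# Dense-mesh target cell: 12 small source cells, each fully inside, SST = 26 degC.
dense = extensive_sum([26.0] * 12, [1.0] * 12, [1.0] * 12)

# Coastal target cell of area 1.0, half covered by two ocean cells at 20 degC.
diluted = intensive_average([20.0, 20.0], [0.3, 0.2], 1.0)
restored = intensive_average([20.0, 20.0], [0.3, 0.2], 1.0, renormalize=True)
```

In this toy setup `dense` comes out at 312, the same kind of unphysical value as the reported 316C, `diluted` is 10 because half the cell contributes nothing, and `restored` recovers 20.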
LR test runscript for regrid pipeline validation. CORE2 + TCO95 + L91, 2-day run. Enables fesom.cmip7_cmor_output (regrid blocks live in the CMIP7 file_def only) and fesom.regrid_surface_output. 3D regrid stays off via the default of fesom.regrid_3d_output.
Two HR (TCO319L137 + DARS2) 1-year variants used to characterize the
regrid path cost vs the unchanged HR baseline:
- *_1y_regrid.yaml: 2D regrid only, 1deg target (sets
fesom.regrid_surface_output: True; leaves regrid_3d_output false).
Used as Final_CMIP7_IO_Test_20.
- *_1y_regrid_3d_05deg.yaml: 2D + 3D regrid, 0.5deg target (sets
both regrid switches + xios.{fesom_,}regular_res_l{on,at} to
720/360). Used as Final_CMIP7_IO_Test_21.
Measured cost: Test_21 vs Test_20 = +0.4% wall time, +8 GB/year disk.
Test_21 vs HR baseline (no regrid) = similar. Cheap enough to make
both regrid switches default-on for CMIP7 (next commit).
Flip defaults based on HR perf characterization:
- fesom.regrid_surface_output: false -> true. CMIP7 file_def emits daily and monthly 2D regridded fields alongside the native unstructured output. <1% wall, <1 GB/year at 0.5deg.
- fesom.regrid_3d_output: false -> true. Adds full 3D regridded set (temp, salt, unod/vnod, w, bolus_*, N2, Kv, u/v on element grid, Av, dMOC density-coord). HR Test_21 vs Test_20 measured +0.4% wall and +8 GB/year, effectively free thanks to XIOS server async IO.
- awiesm3.yaml + awicm3.yaml: bump default xios.fesom_regular_res_lon/lat from 360/180 (1deg) to 720/360 (0.5deg). Resolves the FESOM-vs-OIFS asymmetry where coastal features and surface gradients were under-resolved at 1deg despite the source mesh being much finer. 0.5deg cost is what Test_21 measured.

Standalone-FESOM safe: the regrid file_def blocks only fire under the CMIP7 j2 templates (xios_xml_cmip7/), which require fesom.cmip7_cmor_output: true. Non-CMIP7 runs unaffected.

To opt out of regrid per experiment, set the flags back to false in the runscript yaml.
…guard)

New component versions both carry the regular-grid output work merged in this iteration.

FESOM 2.7.6 (tag d42b4857, merge of PR 917):
- Nadine's cell-bounds-aware XIOS domain attrs for unstructured nodes + elements (bounds_lon_1d, bounds_lat_1d with CCW sort + dateline unwrap + NMAX=10 cap).
- Drop the #if defined(__oasis) gate around mesh%x_corners allocation in oce_mesh.F90: XIOS regrid needs the corners and they aren't OASIS-specific. Fixes the standalone-FESOM CI segfault that motivated the gate removal.

XIOS 2.5.3 (tag from origin/main 97ac2a88):
- Cell-bounds patches in extern/remap/src for the FESOM unstructured source (Nadine).
- Magnitude guard in remap update()/move(): proj(centre) was producing NaN when child centroids cancel near-antipodally on global meshes, silently dropping polygons from the kd-tree and producing no nodes_reg weight file. Now falls back to the largest-leafCount child's centre (update) / heavier-weight side (move). Plus diagnostics module with per-rank counter dump at three checkpoints.
- Validated on TCO95+CORE2 and TCO319+DARS2 (HR Tests 19-21).

Component yaml: new "2.5.3" arm in xios.yaml (mirrors 2.5.2 modulo branch label) + "2.7.6" arm in fesom-2.7.yaml. Bump both awiesm3_v3.4.2 and awicm3_v3.4.2 coupling specs to use the new pins.
The aEVP choose-block for DARS2 in awiesm3.yaml + awicm3.yaml only set whichevp=2 but left delta_min at the namelist.ice default (1.0e-11). That default is too stiff for aEVP on HR — every HR runscript since the AWI-ESM3-VEG-HR-CMIP7-Spinup_cont2 has been setting delta_min=2e-9 explicitly to compensate. Move the override into the choose block so anyone using version v3.4.2 or develop with DARS2 gets the correct paired (whichevp, delta_min) without needing to remember it per runscript. No effect on non-DARS2 resolutions (each keeps its per-mesh whichEVP / delta_min default from fesom-2.7.yaml).
…chevp=2"

This reverts commit a0b489e.
The XIOS-side regridded surface output (enabled via
fesom.regrid_surface_output: true, present in HR Test_21+) was already
being copied to outdata/fesom by the default `*.fesom.*.nc` glob, but
left under its XIOS-default name `<var>.fesom.reg_<ystart>-<yend>.nc`.
Add a sibling renamer that normalises to `<var>.fesom.gr.<year>.nc`,
mirroring how `rename_xios_fesom.sh` normalises native XIOS output to
`<var>.fesom.<year>.nc`:
- `gr` matches the CMIP grid_label for regridded (regular lat/lon),
so the on-disk segment tells you the destination DRS grid_label
without translation.
- Visually distinct from native at `ls` time: `.fesom.` vs `.fesom.gr.`
- Disjoint glob from the native renamer: this script needs literal
".fesom.reg_" while the native one needs literal ".fesom_". Safe to
run concurrently.
- Idempotent: skips files already in target form, no-ops cleanly if
no `.fesom.reg_*` files are present (e.g. regrid_surface_output=false).
Pycmor side gets a clean parallel for the future `_gr` mirror tiers:
native: pattern: <var>\.fesom\.\d{4}\.nc
reg: pattern: <var>\.fesom\.gr\.\d{4}\.nc
which is what enables producing both gn and gr CMIP7 output from one
source experiment.
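The rename rule and its idempotence can be sketched as follows. This is written in Python for illustration and is not the renamer script added by this commit: the function name `gr_name` is hypothetical, and the sketch assumes that `<ystart>` and `<yend>` are plain four-digit years and that the start year is the one carried into the target name.

```python
import re

# Source form: <var>.fesom.reg_<ystart>-<yend>.nc
# Assumption: both stamps are four-digit years.
REG_NAME = re.compile(r"^(?P<var>.+?)\.fesom\.reg_(?P<ystart>\d{4})-(?P<yend>\d{4})\.nc$")

def gr_name(filename):
    """Return the normalised <var>.fesom.gr.<year>.nc name, or None.

    None means "leave the file alone": it is native output, already in
    target form, or not a FESOM file at all. That makes a second pass over
    the same directory a no-op.
    """
    match = REG_NAME.match(filename)
    if match is None:
        return None
    # Assumption: the start year names the file.
    return f"{match.group('var')}.fesom.gr.{match.group('ystart')}.nc"

# Hypothetical directory listing for illustration.
listing = ["sst.fesom.reg_1850-1850.nc", "sst.fesom.1850.nc", "sst.fesom.gr.1849.nc"]
plan = {name: gr_name(name) for name in listing}
```

Only the `.fesom.reg_` entry gets a new name in `plan`. The native file and the already-renamed file map to `None`, so the native renamer and this one never compete for the same file.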
FESOM 2.7.7 (tag cdbab44f) includes 9 commits since 2.7.6 that are all production-relevant for HR:
- Fix 0D CMIP diagnostic accumulation bug: diag scalars (siarean, siareas, siextentn, siextents, sivoln, sivols, volo, soga, thetaoga) were being sent to XIOS without proper accumulation, giving wrong monthly means.
- Fix mesh-partitioner segfault on large meshes (#922).
- Fix aEVP Wvel NaN allocation bug in MOD_ICE (alpha_evp_array wrongly allocated when whichevp=2), relevant for our DARS2 + aEVP path.

Add "2.7.7" arm in fesom-2.7.yaml and bump both awiesm3_v3.4.2 and awicm3_v3.4.2 coupling specs to the new pin.
With the CMIP7 PI control starting, maybe it's time to think about merging?
Since we start without automatic cmorization for now, the esm_tools development is not quite over.