
Feature/rocoto#168

Draft
DavidBurrows-NCO wants to merge 20 commits into main from feature/rocoto

@DavidBurrows-NCO
Collaborator

Description:

If merged, this PR will enable users to run the nested-EAGLE quickstart pipeline with Rocoto using UWTools for YAML to XML conversion.

This PR also adds an option to run pipeline tasks with or without batch submission. As suggested by @maddenp-cu, this lets Rocoto submit each pipeline task directly to Slurm without an intermediary step.

Resolves #99

A user would follow step 1 of the quickstart guide: make env cudascript=ursa. In step 2, run make config compose=base:ursa:quickstart > eagle.yaml instead. Then update app.base. Finally, run make workflow config=eagle.yaml, which converts eagle.yaml into eagle.xml and iterates through quickstart guide steps 4-8.

Current issues:

  1. For zarr, training, inference, and wx jobs,

test $? -eq 0 && touch runscript.zarr-gfs.done

does not report job failures back to Rocoto. If the exit status is 0 (success), a .done file is created and the script returns a successful status to Rocoto. However, if the exit status is greater than 0 (failure), nothing happens: the script just ends, Rocoto believes the job succeeded, and the workflow continues. We need to add something like

eagle-tools inference inference.yaml && touch runscript.inference.done || { echo "job failed"; exit 1; }

which should communicate a failure to Rocoto.
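The same idea can be written as a small reusable function. This is only a sketch: run_task is a hypothetical helper name, not something in this PR, and it assumes the runscripts are executed by a POSIX-style shell that supports local. It creates the .done marker only on success and hands the command's own exit status back to the caller.

```shell
# Hypothetical helper (not part of this PR): run a task command, create the
# .done marker only if the command succeeds, and return the command's exit
# status so the calling runscript can pass it on to Rocoto.
run_task() {
  local marker=$1
  shift
  "$@"
  local status=$?
  if [ "$status" -eq 0 ]; then
    # A failure to write the marker is also treated as a task failure.
    touch "$marker" || return 1
  else
    echo "job failed with exit status $status" >&2
  fi
  return "$status"
}

# Usage in a runscript, with the command from the example above:
# run_task runscript.inference.done eagle-tools inference inference.yaml || exit $?
```

Compared with the one-liner, this keeps the original exit code instead of collapsing every failure to 1, which may help when reading Rocoto logs.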

  2. When vx jobs are submitted via make, each job is submitted about 6 seconds after the previous one. With Rocoto, the jobs are submitted and launched nearly simultaneously, which leads to intermittent errors (from run/default/vx/prewxvx/global/prewxvx.log):

OSError: [Errno -101] NetCDF: HDF error: '/scratch4/NAGAPE/epic/David.Burrows/may20/EAGLErocoto/src/run/default/data/global_one_degree_with_mask.nc'

and

PermissionError: [Errno 13] Permission denied: '/scratch4/NCEPDEV/nems/David.Burrows/eagle/may4_full_workflow/EAGLE/src/run/default/data/global_one_degree_with_mask.nc'

I believe the errors stem from the global prewxvx jobs attempting to read /scratch4/NCEPDEV/nems/David.Burrows/eagle/may4_full_workflow/EAGLE/src/run/default/data/global_one_degree.nc simultaneously. I see a few ways to circumvent this: 1) run the vx jobs in serial, or as nested metatasks in mixed parallel/serial mode, which increases pipeline runtime; 2) split the vx jobs into grid2grid and grid2obs metatasks, which leaves a gap between launching the 2 global jobs and also increases pipeline runtime; 3) other thoughts?
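One more possibility, offered only as a sketch, is to reproduce the spacing that make provides by delaying each vx job according to its position in the metatask. stagger_seconds and JOB_INDEX are hypothetical names, and the 6-second default simply mirrors the delay observed with make rather than a tuned value. This would only make a collision less likely; it would not remove the underlying contention on the file.

```shell
# Hypothetical helper (not part of this PR): compute a start delay, in
# seconds, from a job's zero-based position so that vx jobs launched
# together do not all open the same file at the same moment.
stagger_seconds() {
  local index=$1
  local step=${2:-6}   # assumed spacing in seconds; defaults to 6
  echo $(( index * step ))
}

# Usage at the top of a vx runscript; JOB_INDEX is a hypothetical variable
# that the workflow would have to set for each job:
# sleep "$(stagger_seconds "$JOB_INDEX")"
```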

Tests run so far:

  • Followed the quickstart guide directly
  • Followed the guide as described for Rocoto above

Type of change:

  • Bug fix
  • New feature
  • Refactor / cleanup
  • Documentation
  • CI/CD or tooling
  • Other:

Area(s) affected

  • nested_eagle workflow
  • Verification / evaluation (via WXVX)
  • Data prep / UFS2ARCO
  • Config (YAML)
  • Plotting / post-processing
  • Infrastructure / Slurm scripts
  • Other:

Commit Requirements:

  • This PR addresses a relevant NOAA-EPIC/EAGLE issue (if not, create an issue); a person responsible for submitting the update has been assigned to the issue (link issue)
  • Fill out all sections of this template.
  • I have performed a self-review of my own code
  • My changes generate no new warnings
  • I have made corresponding changes to the system documentation if necessary

@github-actions

Link to ReadTheDocs sample build for this PR can be found at:
https://epic-eagle--168.org.readthedocs.build/en/168

@DavidBurrows-NCO DavidBurrows-NCO added enhancement New feature or request eagle-ursa labels May 28, 2026
@github-actions github-actions Bot added ci-running CI is running on this pull request and removed ci-running CI is running on this pull request labels May 28, 2026
@github-actions github-actions Bot added ci-running CI is running on this pull request and removed ci-running CI is running on this pull request labels May 28, 2026

Development

Successfully merging this pull request may close these issues.

Create a Rocoto workflow for the Quickstart case
