
feat(cwl): integration of CWL job submission and execution into DiracX #877

Draft
ryuwd wants to merge 19 commits into DIRACGrid:main from ryuwd:feat/cwl-job-submission

Conversation

Contributor

@ryuwd ryuwd commented Apr 2, 2026

End-to-end CWL job submission and execution for DiracX — from CLI to worker node and back.

Follows the plan in #858.

Goes with DIRACGrid/DIRAC#8506

CLI (diracx-cli)

  • dirac job submit cwl <workflow> [inputs...] — submit CWL jobs with local file sandbox upload, LFN references, and parametric --range expansion
  • dirac job submit cmd -- <command> — quick submission with auto-generated CWL (captures stdout/stderr to log files)
  • dirac job search — search jobs with conditions and rich table output
  • dirac job sandbox list|peek|get <job_id> — explore and retrieve output sandbox files
  • Submission pipeline: CWL/YAML parsing, input validation, sandbox scanning/grouping/upload, confirmation prompt, range expansion
  • CWL executor (dirac-cwl-run): custom cwltool executor with DIRAC-aware FsAccess and PathMapper for LFN resolution via replica maps
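As an illustration of the `dirac job submit cmd` bullet above, here is a minimal sketch of how a raw command could be wrapped in an auto-generated CWL `CommandLineTool` that captures stdout/stderr to log files. The function name `build_cmd_cwl`, the log file names, and the output IDs are assumptions for illustration, not taken from this PR.

```python
import shlex


def build_cmd_cwl(command: list) -> dict:
    """Wrap a raw command in a minimal CWL CommandLineTool document.

    Hypothetical helper: the log file names and output IDs are illustrative.
    """
    if not command:
        raise ValueError("command must not be empty")
    return {
        "cwlVersion": "v1.2",
        "class": "CommandLineTool",
        "baseCommand": command[0],
        "arguments": command[1:],
        "inputs": {},
        # CWL redirects the process streams into these files...
        "stdout": "stdout.log",
        "stderr": "stderr.log",
        # ...and the special output types "stdout"/"stderr" expose them as outputs.
        "outputs": {
            "stdout_log": {"type": "stdout"},
            "stderr_log": {"type": "stderr"},
        },
    }


tool = build_cmd_cwl(shlex.split("echo hello world"))
```

The resulting dict can be serialised to YAML or JSON and submitted like any other CWL document.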

Worker Node (diracx-api)

  • JobWrapper: full CWL job lifecycle — pre-process (sandbox download, LFN resolution, replica map building), async subprocess execution with live stderr streaming, post-process (output sandbox upload, output data registration)
  • ApplicationStatus reporting: cwltool lifecycle transitions (e.g. [job echo-tool] completed success, [workflow ] starting step greet) streamed as ApplicationStatus with rate-limited commits
  • Sandbox handling: SB:<se>|<s3_path>#<filename> URI scheme — #fragment identifies the file inside the tar archive; sandbox reference preserved with SB: prefix throughout the system
  • JobReport: accumulates status updates, flushes via HTTP with rate limiting
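To make the `SB:<se>|<s3_path>#<filename>` scheme above concrete, here is a minimal parsing sketch. `SandboxRef` and `parse_sb_uri` are hypothetical names, and the example URI is made up; the actual parser in this PR may differ.

```python
from typing import NamedTuple, Optional


class SandboxRef(NamedTuple):
    se: str                  # storage element name
    s3_path: str             # path of the tar archive in the sandbox store
    filename: Optional[str]  # file inside the archive (the #fragment), if any


def parse_sb_uri(uri: str) -> SandboxRef:
    """Split an SB: reference into its three components."""
    if not uri.startswith("SB:"):
        raise ValueError(f"not a sandbox reference: {uri!r}")
    body, _, fragment = uri[len("SB:"):].partition("#")
    se, sep, s3_path = body.partition("|")
    if not sep or not se or not s3_path:
        raise ValueError(f"malformed sandbox reference: {uri!r}")
    return SandboxRef(se, s3_path, fragment or None)


# Made-up example values
ref = parse_sb_uri("SB:SandboxSE|/S3/bucket/archive.tar.zst#input.txt")
```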

Server (diracx-logic, diracx-routers, diracx-core)

  • CWL-to-JDL translation: cwl_to_jdl() extracts dirac:Job hints from CWL, maps to JDL fields (CPUTime, Sites, Tags, I/O sandboxes, InputData, OutputData)
  • Auto stdout/stderr collection: CWL stdout:/stderr: fields automatically added to OutputSandbox in JDL
  • InputSandbox #fragment stripping: JDL InputSandbox contains bare SB: refs (no fragment) for server ownership checks; full URI with #filename preserved in CWL inputs for worker extraction
  • Range expansion: server-side parametric job expansion from --range spec
  • Models: JobHint, IOSource, OutputDataEntry, ReplicaMap, pre/post-process command framework
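Two of the server-side transformations listed above are simple enough to sketch. The helper names below are hypothetical and the code only handles literal `stdout:`/`stderr:` file names (CWL also allows expressions there); it is not the `cwl_to_jdl()` implementation.

```python
def collect_log_outputs(cwl: dict, output_sandbox: list) -> list:
    """Return a copy of output_sandbox with literal stdout/stderr names added."""
    result = list(output_sandbox)
    for field in ("stdout", "stderr"):
        name = cwl.get(field)
        # Skip missing fields and anything that is not a plain file name.
        if isinstance(name, str) and name not in result:
            result.append(name)
    return result


def strip_fragment(sb_ref: str) -> str:
    """Drop the #fragment so the JDL InputSandbox holds a bare SB: reference."""
    return sb_ref.split("#", 1)[0]


cwl_doc = {"class": "CommandLineTool", "stdout": "out.log", "stderr": "err.log"}
sandbox = collect_log_outputs(cwl_doc, ["result.json"])
bare = strip_fragment("SB:SandboxSE|/S3/bucket/archive.tar.zst#input.txt")
```

The full URI with its `#filename` would still be kept in the CWL inputs, since the worker needs it to pick the file out of the archive.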

Client (diracx-client)

  • Generated client extensions for CWL submission, workflow retrieval, and sandbox operations

Key design decisions

  • CWL-native: no JDL on the client side — CWL is the job description format, JDL is an internal detail
  • cwltool passthrough for status: ApplicationStatus shows verbatim cwltool lifecycle lines rather than a custom translation layer
  • SB: URI scheme: SB:<se>|<s3_path>#<relative_path> — logical reference (not PFN), server resolves to presigned URL
  • Replica map: JSON file passed to cwltool executor, maps LFN/SB paths to local files — decouples CWL execution from DIRAC data management

Test coverage

  • Unit tests: JobWrapper commands, output parsing, CWL hint extraction, sandbox path parsing, replica map injection, submission pipeline, input parsing, executor path mapping
  • Integration tests: full JobWrapper lifecycle with mocked services, real CWL execution, stderr streaming, ApplicationStatus filtering

Status

Under certification testing on diracx-cert.app.cern.ch. Actively fixing issues found during grid execution.

cc @aldbr


read-the-docs-community bot commented Apr 2, 2026

Documentation build overview

📚 diracx | 🛠️ Build #32275799 | 📁 Comparing dd8abd1 against latest (ea405cc)

  🔍 Preview build  

No files changed.

@ryuwd ryuwd force-pushed the feat/cwl-job-submission branch 2 times, most recently from 850b0a5 to 32a33a2, April 8, 2026 13:56
@ryuwd ryuwd changed the title from "feat(cwl): add CWL workflow submission endpoint and DB storage model" to "feat(cwl): integration of CWL job submission and execution into DiracX" Apr 10, 2026
raise RuntimeError(f"Could not set job statuses: {ret}")

async def commit(self):
    """Send all the accumulated information."""
Contributor Author

This should be debounced, or use a rate limiter in the JobReport class itself.

Contributor Author

This would replace the rate-limiting logic in the JobWrapper, which would be better placed here.
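One possible shape for moving the rate limiting into the report class itself is sketched below. `RateLimitedReport`, the `send` callback, the `force` flag, and the 30-second default are all assumptions for illustration; the real `JobReport` API is not shown in this thread.

```python
import time
from typing import Awaitable, Callable, List, Optional


class RateLimitedReport:
    """Accumulates status updates and flushes at most once per min_interval."""

    def __init__(
        self,
        send: Callable[[List[str]], Awaitable[None]],
        min_interval: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._send = send
        self._min_interval = min_interval
        self._clock = clock  # injectable so tests do not need to sleep
        self._pending: List[str] = []
        self._last_flush: Optional[float] = None

    def add(self, status: str) -> None:
        self._pending.append(status)

    async def commit(self, force: bool = False) -> bool:
        """Send pending updates; return True if a flush actually happened."""
        if not self._pending:
            return False
        now = self._clock()
        too_soon = (
            self._last_flush is not None
            and now - self._last_flush < self._min_interval
        )
        if too_soon and not force:
            return False  # keep accumulating; a later commit sends the batch
        batch, self._pending = self._pending, []
        try:
            await self._send(batch)
        except Exception:
            # Put the batch back so nothing is lost on a failed request.
            self._pending = batch + self._pending
            raise
        self._last_flush = now
        return True
```

With this shape the JobWrapper could call `commit()` after every status line, and a final `commit(force=True)` at job end would flush whatever is still pending.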

Comment thread diracx-api/src/diracx/api/job_wrapper.py Outdated
This class extends StdFsAccess to handle LFN: and SB: prefixed paths by
looking them up in the replica map and using the physical file path instead.

Key difference: LFN keys are stored WITHOUT prefix, SB keys are stored WITH prefix.
Contributor Author

This should be harmonised.
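For context, the asymmetric key convention from the docstring can be sketched as follows. The flat `dict` standing in for the replica map and the helper names are assumptions; the actual `ReplicaMap` model is likely richer.

```python
from typing import Dict


def replica_map_key(uri: str) -> str:
    """Current convention: LFN keys drop the prefix, SB keys keep it."""
    if uri.startswith("LFN:"):
        return uri[len("LFN:"):]
    return uri


def resolve(uri: str, replica_map: Dict[str, str]) -> str:
    """Map an LFN:/SB: reference to its physical location; pass others through."""
    if uri.startswith(("LFN:", "SB:")):
        return replica_map[replica_map_key(uri)]
    return uri


# Hypothetical replica map with made-up entries
replica_map = {
    "/vo/data/input.root": "/scratch/job/input.root",
    "SB:SandboxSE|/S3/bucket/archive.tar.zst#cfg.yaml": "/scratch/job/cfg.yaml",
}
lfn_path = resolve("LFN:/vo/data/input.root", replica_map)
sb_path = resolve("SB:SandboxSE|/S3/bucket/archive.tar.zst#cfg.yaml", replica_map)
```

If both key types kept their prefix, `replica_map_key` would reduce to the identity and the special case would disappear.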

Services: ServicesConfig = ServicesConfig()
"""Configuration for various DIRAC services."""
- SoftwareDistModule: str = "LocalSoftwareDist"
+ SoftwareDistModule: str = ""
Contributor Author

The previous default was causing errors in the Pilot. To be checked with @chaen.

Comment on lines +75 to +81
# TODO: Compute Adler32 checksum before upload
# TODO: Extract POOL/ROOT GUID if applicable
# TODO: Prefer local SEs (getSEsForSite) before remote ones
# TODO: Implement retry with exponential backoff on transient failures
# TODO: On complete failure, create a failover Request (RMS)
# for async recovery instead of raising immediately
# TODO: Report upload progress via job status updates
Contributor Author

@ryuwd ryuwd Apr 10, 2026

This command is still untested in cert. StoreOutputData still needs fuller implementation, discussion, and testing.
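For the first TODO above, a minimal sketch of a streaming Adler32 computation with the standard library is shown below. The function name and the zero-padded 8-character hex output format are assumptions; the format the file catalogue expects should be checked before use.

```python
import zlib


def adler32_hex(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Compute the Adler-32 checksum of a file without loading it in memory."""
    checksum = 1  # Adler-32 starts from 1, matching zlib's default
    with open(path, "rb") as fh:
        while chunk := fh.read(chunk_size):
            # Feed the running value back in to continue the checksum.
            checksum = zlib.adler32(chunk, checksum)
    # Assumed output format: lowercase hex, zero-padded to 8 characters.
    return f"{checksum & 0xFFFFFFFF:08x}"
```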

@ryuwd ryuwd force-pushed the feat/cwl-job-submission branch from 75e0b03 to f9133d2, April 14, 2026 14:47
ryuwd and others added 4 commits April 14, 2026 16:57
Implements streaming interpolation-drop compression for prmon time-series
data, porting the HSF/prmon algorithm to pure Python (no pandas) for use
in DIRACOS2 environments. Includes full TDD test suite with 5 tests.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
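The general idea of interpolation-drop compression can be illustrated with the sketch below: a sample is dropped when linear interpolation between the last kept sample and the next sample reproduces it within a tolerance. This is a simplified batch version written for illustration, with a hypothetical function name and tolerance parameter; it is not the prmon algorithm or the streaming implementation from this commit, and it assumes strictly increasing timestamps.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (timestamp, value)


def drop_interpolable(points: List[Point], tolerance: float = 0.0) -> List[Point]:
    """Drop samples that linear interpolation can reconstruct within tolerance."""
    if len(points) <= 2:
        return list(points)
    kept = [points[0]]  # always keep the first sample
    for i in range(1, len(points) - 1):
        t0, v0 = kept[-1]
        t, v = points[i]
        t1, v1 = points[i + 1]
        # Value predicted at t by the line from the last kept point to the next one.
        predicted = v0 + (v1 - v0) * (t - t0) / (t1 - t0)
        if abs(predicted - v) > tolerance:
            kept.append(points[i])
    kept.append(points[-1])  # always keep the last sample
    return kept


linear = drop_interpolable([(0, 0), (1, 1), (2, 2), (3, 3)])
kinked = drop_interpolable([(0, 0), (1, 1), (2, 2), (3, 0)])
```

A perfectly linear series collapses to its two endpoints, while a change of slope keeps the sample at the kink.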
…Reader

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
ryuwd and others added 5 commits April 15, 2026 15:05
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Per the CWL v1.2 spec, location is an IRI that identifies a file
resource (supports custom URI schemes like LFN: and SB:), while path
is a local filesystem path set after staging. Previously both URI
schemes and local paths were placed in path, which breaks when
cwltool normalises inputs via file_uri().

Readers now check location before path, writers place URI schemes
in location, and validation rejects LFN:/SB: in the path field
on both the client and server side.
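The writer and validation rules described in this commit message can be sketched on plain dicts as follows. The helper names are hypothetical and the real code works on CWL objects rather than dicts.

```python
URI_SCHEMES = ("LFN:", "SB:")


def make_file(ref: str) -> dict:
    """Writer rule: URI schemes go in location, local paths go in path."""
    if ref.startswith(URI_SCHEMES):
        return {"class": "File", "location": ref}
    return {"class": "File", "path": ref}


def validate_file(entry: dict) -> None:
    """Validation rule: reject LFN:/SB: references placed in the path field."""
    path = entry.get("path")
    if isinstance(path, str) and path.startswith(URI_SCHEMES):
        raise ValueError(
            f"{path!r} is a URI and belongs in 'location', not 'path'"
        )


lfn_file = make_file("LFN:/vo/data/input.root")
local_file = make_file("inputs/config.yaml")
validate_file(lfn_file)    # passes: the URI is in location
validate_file(local_file)  # passes: a plain local path
```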

Verify that DiracPathMapper produces correct target values (what
cwltool assigns to the File path field at runtime) for different
PFN types: file:// to local path, https:// and root:// passed
through as URLs, and SB: resolved via replica map.
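The expected targets listed in this commit message can be expressed as a small lookup sketch. `target_for_pfn` is a hypothetical stand-in, not the `DiracPathMapper` code, and the flat replica map whose SB: entries point at PFNs is an assumption.

```python
from typing import Dict, Optional
from urllib.parse import unquote, urlparse


def target_for_pfn(pfn: str, replica_map: Optional[Dict[str, str]] = None) -> str:
    """Return the value a File's path field would get for a given PFN."""
    if pfn.startswith("SB:"):
        # Assumption: the replica map maps the SB: reference to a PFN.
        if not replica_map or pfn not in replica_map:
            raise KeyError(f"no replica map entry for {pfn!r}")
        return target_for_pfn(replica_map[pfn], replica_map)
    parsed = urlparse(pfn)
    if parsed.scheme == "file":
        return unquote(parsed.path)  # file:// becomes a local path
    if parsed.scheme in ("https", "root"):
        return pfn  # remote URLs are passed through unchanged
    raise ValueError(f"unsupported PFN: {pfn!r}")


local = target_for_pfn("file:///scratch/job/input.root")
remote = target_for_pfn("root://eos.example.org//eos/vo/input.root")
```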

load_inputfile() converts input dicts into cwl_utils File objects
where location="SB:..." and path=None. The extract methods only
checked .path on objects, silently dropping SB: and LFN: references
stored in .location. This caused empty replica maps and sandbox
download failures on the worker.
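The bug and its fix can be illustrated with a stand-in object. `FileStub` mimics only the two attributes involved and `extract_refs` is a hypothetical name; the real extract methods operate on cwl_utils File objects.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class FileStub:
    """Stand-in for a cwl_utils File object; only the two fields used here."""
    location: Optional[str] = None
    path: Optional[str] = None


def extract_refs(files: Iterable[FileStub], prefix: str) -> List[str]:
    """Collect references with the given scheme, checking location before path."""
    refs = []
    for f in files:
        # Checking only f.path would miss objects where path is None.
        ref = f.location or f.path
        if ref and ref.startswith(prefix):
            refs.append(ref)
    return refs


inputs = [
    FileStub(location="SB:SandboxSE|/S3/bucket/archive.tar.zst#cfg.yaml"),
    FileStub(location="LFN:/vo/data/input.root"),
    FileStub(path="local.txt"),
]
sb_refs = extract_refs(inputs, "SB:")
lfn_refs = extract_refs(inputs, "LFN:")
```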