This repository was archived by the owner on Jun 14, 2026. It is now read-only.
During a full pe_us_data_rebuild_checkpoint pipeline build after merging origin/main into codex/fix-146-narrow-lazy-imports, the build reached 02_source_loading and then remained there for roughly 2h55m without any provider-level progress or manifest heartbeat.
Observed state:
- 01_run_profile completed.
- 02_source_loading started at 2026-06-03T15:49:38Z.
- The build never reached 03_source_planning.
- stage_artifacts/manifests/02_source_loading.json still showed status: running.
- updatedAt remained equal to startedAt.
- No completed outputs were present.
- Required outputs were still missing:
  - observation_frame_summary
  - source_descriptors
  - source_relationships
- After manual termination, the manifest remained in running state with no failure/interruption reason.
This means source loading is currently difficult to diagnose: after a long runtime, we cannot tell whether it is making expected progress, stuck on a specific provider, retrying a cache/download path, or spending time in a pathological slow path.
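The stale state described above (status still running, updatedAt never moving past startedAt) can at least be detected from outside the build. The following is a minimal sketch, assuming the manifest is a JSON object with the `status`, `startedAt`, and `updatedAt` fields mentioned in this issue; the helper names `parse_ts` and `is_stale` and the silence threshold are hypothetical and not part of the pipeline.

```python
from datetime import datetime, timedelta


def parse_ts(value: str) -> datetime:
    """Parse a timestamp such as 2026-06-03T15:49:38Z into an aware datetime."""
    # Replace the trailing "Z" so fromisoformat accepts it on older Pythons.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def is_stale(manifest: dict, now: datetime, max_silence: timedelta) -> bool:
    """Return True when a running stage has not touched its manifest recently."""
    if manifest.get("status") != "running":
        return False
    # Fall back to startedAt if the manifest has no updatedAt yet.
    last_update = parse_ts(manifest.get("updatedAt") or manifest["startedAt"])
    return now - last_update > max_silence


# Values taken from the observed state in this issue.
observed = {
    "status": "running",
    "startedAt": "2026-06-03T15:49:38Z",
    "updatedAt": "2026-06-03T15:49:38Z",
}
checked_at = parse_ts("2026-06-03T15:49:38Z") + timedelta(hours=2, minutes=55)
print(is_stale(observed, checked_at, max_silence=timedelta(minutes=10)))
```

A check like this only says that the stage has gone quiet. It cannot say which provider is responsible, which is why the manifest itself needs richer content.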
Recommended fix:
- Add provider-level source-loading progress events, at least:
  - provider started
  - provider completed
  - provider failed
  - elapsed time
  - row/entity counts where available
  - cache/download paths where relevant
- Heartbeat 02_source_loading.json periodically and after each provider, including the current provider and last successful provider.
- Persist partial per-provider summaries so reruns are diagnosable without restarting blind.
- Catch SIGTERM/KeyboardInterrupt in the stage runtime or stage writer and mark the active stage as failed/interrupted with timestamp and reason, instead of leaving it as running.
- Add unit tests for heartbeat updates and interrupted-stage failure recording.
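One possible shape for the provider events and the manifest heartbeat is sketched below. It is an illustration under assumptions, not the pipeline's actual stage writer: the class `StageHeartbeat`, its method names, and the manifest keys `currentProvider`, `lastSuccessfulProvider`, `providers`, `elapsedSeconds`, and `rows` are all hypothetical. Only `status`, `startedAt`, and `updatedAt` come from the observed manifest.

```python
import json
import os
import tempfile
import time
from datetime import datetime, timezone
from pathlib import Path


def utc_now() -> str:
    """Timestamp in the same style as the manifest's startedAt field."""
    return datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


class StageHeartbeat:
    """Hypothetical helper that rewrites a stage manifest on every provider event."""

    def __init__(self, manifest_path, stage):
        self.path = Path(manifest_path)
        self._clock = {}  # provider name -> monotonic start time (not persisted)
        now = utc_now()
        self.state = {
            "stage": stage,
            "status": "running",
            "startedAt": now,
            "updatedAt": now,
            "currentProvider": None,
            "lastSuccessfulProvider": None,
            "providers": {},  # partial per-provider summaries
        }
        self.beat()

    def beat(self):
        """Refresh updatedAt and atomically replace the manifest file."""
        self.state["updatedAt"] = utc_now()
        self.path.parent.mkdir(parents=True, exist_ok=True)
        fd, tmp_name = tempfile.mkstemp(dir=str(self.path.parent), suffix=".tmp")
        with os.fdopen(fd, "w") as handle:
            json.dump(self.state, handle, indent=2)
        os.replace(tmp_name, self.path)  # readers never see a half-written file

    def provider_started(self, name):
        self._clock[name] = time.monotonic()
        self.state["currentProvider"] = name
        self.state["providers"][name] = {"status": "running", "startedAt": utc_now()}
        self.beat()

    def _finish(self, name, status, **extra):
        started = self._clock.pop(name, None)
        elapsed = 0.0 if started is None else time.monotonic() - started
        summary = self.state["providers"].setdefault(name, {})
        summary.update(status=status, elapsedSeconds=round(elapsed, 3), **extra)
        self.state["currentProvider"] = None
        self.beat()

    def provider_completed(self, name, rows=None):
        self.state["lastSuccessfulProvider"] = name
        self._finish(name, "completed", rows=rows)

    def provider_failed(self, name, error):
        self._finish(name, "failed", error=str(error))
```

Calling `beat()` from a timer or background thread would cover the periodic part of the recommendation. Because every event rewrites the whole manifest, a killed build still leaves the last known provider and the completed summaries on disk.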
Notably, the build did not fail with a Python traceback or an obvious missing dependency. The first blocker was the lack of source-loading observability and of clean failure recording during a long-running full build.