This document covers cross-cutting operational concerns for developers and
operators running hdhriptv. It complements the README's user-facing
configuration reference and troubleshooting sections with internal
architecture details, behavioral nuances, and diagnostic guidance.
All logging is initialized by internal/logging/logging.go. The system
produces structured slog.TextHandler output in key=value format (not
JSON).
Every log record is dispatched to three simultaneous destinations through a
fanoutHandler:
| Destination | Level Range | Handler |
|---|---|---|
| stdout | DEBUG .. INFO | levelRangeHandler with upper bound |
| stderr | WARN .. ERROR | levelRangeHandler with no upper bound |
| log file | configured level+ | slog.TextHandler (same LevelVar filter) |
The levelRangeHandler wraps a standard slog.Handler and adds min/max
level filtering. The stdout handler has hasMax=true (capped at INFO), so
WARN and ERROR records never appear on stdout. The stderr handler has
hasMax=false, so it accepts everything from WARN upward. The file
handler uses the same LevelVar filter as the stdout/stderr handlers, so it
respects the configured log level.
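
The routing described above can be sketched as follows. This is a minimal illustration, not the code in internal/logging/logging.go: the type names `rangeHandler` and `fanout`, the helper `newSplitLogger`, and the demo function are hypothetical, and the file destination is omitted for brevity.

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"io"
	"log/slog"
)

// rangeHandler is a hypothetical stand-in for levelRangeHandler. It forwards
// a record to inner only when min <= level (and level <= max when hasMax).
type rangeHandler struct {
	inner  slog.Handler
	min    slog.Leveler
	max    slog.Level
	hasMax bool
}

func (h rangeHandler) Enabled(_ context.Context, l slog.Level) bool {
	return l >= h.min.Level() && (!h.hasMax || l <= h.max)
}
func (h rangeHandler) Handle(ctx context.Context, r slog.Record) error {
	return h.inner.Handle(ctx, r)
}
func (h rangeHandler) WithAttrs(a []slog.Attr) slog.Handler {
	h.inner = h.inner.WithAttrs(a)
	return h
}
func (h rangeHandler) WithGroup(name string) slog.Handler {
	h.inner = h.inner.WithGroup(name)
	return h
}

// fanout is a hypothetical stand-in for fanoutHandler: every record is offered
// to each child, and each child applies its own level filter.
type fanout []slog.Handler

func (f fanout) Enabled(ctx context.Context, l slog.Level) bool {
	for _, h := range f {
		if h.Enabled(ctx, l) {
			return true
		}
	}
	return false
}
func (f fanout) Handle(ctx context.Context, r slog.Record) error {
	var firstErr error
	for _, h := range f {
		if !h.Enabled(ctx, r.Level) {
			continue
		}
		if err := h.Handle(ctx, r.Clone()); err != nil && firstErr == nil {
			firstErr = err
		}
	}
	return firstErr
}
func (f fanout) WithAttrs(a []slog.Attr) slog.Handler {
	out := make(fanout, len(f))
	for i, h := range f {
		out[i] = h.WithAttrs(a)
	}
	return out
}
func (f fanout) WithGroup(name string) slog.Handler {
	out := make(fanout, len(f))
	for i, h := range f {
		out[i] = h.WithGroup(name)
	}
	return out
}

// newSplitLogger wires stdout (configured level .. INFO) and stderr (WARN and up).
func newSplitLogger(stdout, stderr io.Writer, level *slog.LevelVar) *slog.Logger {
	opts := &slog.HandlerOptions{Level: slog.LevelDebug}
	return slog.New(fanout{
		rangeHandler{inner: slog.NewTextHandler(stdout, opts), min: level, max: slog.LevelInfo, hasMax: true},
		rangeHandler{inner: slog.NewTextHandler(stderr, opts), min: slog.LevelWarn},
	})
}

// splitDemo logs one INFO and one WARN record and returns what each stream received.
func splitDemo() (stdoutText, stderrText string) {
	var out, errOut bytes.Buffer
	level := new(slog.LevelVar) // zero value is INFO
	logger := newSplitLogger(&out, &errOut, level)
	logger.Info("shared session created", "channel_id", 42)
	logger.Warn("source stalled", "channel_id", 42)
	return out.String(), errOut.String()
}

func main() {
	stdoutText, stderrText := splitDemo()
	fmt.Print("stdout: ", stdoutText, "stderr: ", stderrText)
}
```

Because each child filters independently, the INFO record reaches only the stdout buffer and the WARN record reaches only the stderr buffer.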
Container runtimes and systemd capture stdout and stderr as separate
streams. When using journalctl, entries from stderr appear with priority
err/warning while stdout entries appear with priority info/debug.
Log aggregation pipelines that merge both streams into one will see
interleaved output — the key=value format includes a level= field for
disambiguation.
- Naming: `hdhriptv-YYYYMMDD-HHMMSS.log` using the process startup timestamp.
- Directory: `LOG_DIR` is auto-created with `MkdirAll` mode `0o755`.
- File permissions: `0o644` (owner read/write, group/other read).
- No rotation: Each process startup creates a new file; old log files accumulate and are never automatically removed. Operators should configure external rotation (e.g., `logrotate`, cron cleanup) to manage disk usage.
- Close: The log file is closed when the process exits (deferred in `main.go`).
The configured level is stored in a slog.LevelVar, which supports
atomic updates. While the service does not currently expose a runtime
level-change API, the infrastructure is in place for potential future use.
Startup emits structured timing checkpoints with:
- `msg="startup phase complete"`
- `phase=<phase_name>`
- `duration=<elapsed_duration>`
Current phase names are:
- `sqlite_runtime_pragmas`
- `sqlite_migrate`
- `sqlite_open_total`
- `identity_settings_resolve`
- `app_version_sync`
- `automation_overrides_sync`
- `dvr_schedule_sync`
- `scheduler_load_from_settings`
These records are intended for startup latency baselining and regression tracking in log pipelines.
When at least one playlist source has a configured URL, startup emits a dedicated initial-sync phase event stream:
- `msg="initial playlist sync phase"`
- `initial_sync_phase=scheduled_after_listener_start|completed|failed`
Operational behavior:
- The initial sync is deferred until HTTP listener readiness succeeds (`/healthz`) so provider reloads do not run before listeners are accepting requests.
- DVR lineup reload during playlist sync is optimistic: provider reload failures are logged and reflected in run summary fields but do not fail the playlist sync run itself.
- After startup sync reconciliation, dynamic generated guide names are refreshed so dynamic channel lineup names stay aligned with current playlist metadata.
Useful fields:
- `attempt` on `completed` phase events (`initial_sync_phase=completed`).
- `duration` on both `completed` and `failed` phase events, including readiness-gate failures.
- `error` on `failed` events.
GetSystemUpdateID requests (POST /upnp/control/content-directory) use a
coalesced refresh path to avoid fan-out lineup reads under burst polling.
- Cache hit behavior is controlled by `--upnp-content-directory-update-id-cache-ttl` / `UPNP_CONTENT_DIRECTORY_UPDATE_ID_CACHE_TTL`.
- On cache miss, one caller becomes the in-flight refresh leader while other callers wait for that leader's result.
- If a leader refresh exits with `context canceled` or `deadline exceeded`, callers with still-live contexts retry using bounded exponential backoff (`5ms`, `10ms`, `20ms`, `40ms`, capped at 4 retries) instead of spinning in a tight loop.
- A retry-cap exhaustion returns an `Action Failed` SOAP fault and includes the canceled/deadline root cause in logs.
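
The retry schedule can be illustrated with a short sketch. It models only the backoff loop, not the leader/follower coalescing, and every name in it (`refreshWithRetry`, `retryBackoff`, `isCanceled`, `retryDemo`) is hypothetical rather than taken from the service.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

const maxRetries = 4 // matches the retry cap described in the text

// retryBackoff returns the delay before retry n (1-based): 5ms, 10ms, 20ms, 40ms.
func retryBackoff(attempt int) time.Duration {
	return (5 * time.Millisecond) << (attempt - 1)
}

// isCanceled reports whether err stems from a canceled or expired context.
func isCanceled(err error) bool {
	return errors.Is(err, context.Canceled) || errors.Is(err, context.DeadlineExceeded)
}

// refreshWithRetry calls refresh and retries only when the failure was a
// cancellation that did not come from the caller's own context. It returns
// the refreshed value, the number of retries performed, and the final error.
func refreshWithRetry(ctx context.Context, refresh func(context.Context) (uint32, error)) (uint32, int, error) {
	retries := 0
	for {
		id, err := refresh(ctx)
		if err == nil || !isCanceled(err) {
			return id, retries, err
		}
		if ctx.Err() != nil {
			return 0, retries, ctx.Err() // the caller itself is gone; do not retry
		}
		if retries == maxRetries {
			return 0, retries, fmt.Errorf("update-id refresh retries exhausted: %w", err)
		}
		retries++
		select {
		case <-time.After(retryBackoff(retries)):
		case <-ctx.Done():
			return 0, retries, ctx.Err()
		}
	}
}

// retryDemo simulates two canceled leader refreshes followed by a success.
func retryDemo() (uint32, int, error) {
	calls := 0
	return refreshWithRetry(context.Background(), func(context.Context) (uint32, error) {
		calls++
		if calls <= 2 {
			return 0, context.Canceled
		}
		return 7, nil
	})
}

func main() {
	id, retries, err := retryDemo()
	fmt.Println(id, retries, err) // 7 2 <nil>
}
```

Sleeping inside a `select` on `ctx.Done()` keeps a waiting caller responsive to its own cancellation, which is what distinguishes this from a tight retry loop.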
Observability:
- Prometheus counter: `hdhr_content_directory_update_id_refresh_retries_total` (increments on each retry attempt after a canceled/deadline refresh).
- Debug log: `msg="upnp content directory update-id refresh retrying after canceled refresh"` with `attempt`, `max_attempts`, `backoff`, and `error` fields.
make publish-github mirrors internal main to the public repo as squash
commits. It is intended for controlled release publishing, not day-to-day
development pushes.
| Variable | Default | Purpose |
|---|---|---|
| `INTERNAL_REMOTE` | `origin` | Internal source-of-truth remote. |
| `INTERNAL_REPO_URL` | `git@gitlab.lan:arodd/hdhriptv.git` | Canonical URL enforced for `INTERNAL_REMOTE`. |
| `PUBLIC_REMOTE` | `github` | Public publishing remote. |
| `PUBLIC_REPO_URL` | `git@github.com:arodd/hdhriptv.git` | Canonical URL enforced for `PUBLIC_REMOTE`. |
| `SYNC_BRANCH` | `main` | Branch to publish. |
| `PUBLIC_SYNC_TAG` | `public-sync/latest` | Marker tag on internal remote tracking last published internal commit. |
| `PUBLISH_GITHUB_COMMIT_MESSAGE` | empty | Optional custom squash commit subject. |
- Fetch internal/public refs and verify local `SYNC_BRANCH` exactly matches `INTERNAL_REMOTE/SYNC_BRANCH`.
- Read `PUBLIC_SYNC_TAG` from the internal remote to determine the last published internal commit.
- If the marker tag equals the current internal tip, exit with no-op.
- If the marker tag is present but not an ancestor of the current internal tip, fail safe and require operator intervention.
- If the `PUBLIC_REMOTE/SYNC_BRANCH` tree already equals the internal tip tree, update only `PUBLIC_SYNC_TAG` on the internal remote (recovery/idempotency path).
- Otherwise create a squash commit from the internal tip tree:
  - initial publish: the commit has no parent,
  - incremental publish: the commit's parent is the current public tip.
- Push the squash commit to `PUBLIC_REMOTE/SYNC_BRANCH`.
- Move and push `PUBLIC_SYNC_TAG` on the internal remote to the published internal tip.
- Local/internal divergence rejection
  - Symptom: `Refusing to publish: local main is out of sync...`
  - Action: reconcile local checkout with internal remote (`pull --ff-only` or reset to internal tip as appropriate), then rerun publish.
- Marker ancestor violation
  - Symptom: marker tag is not an ancestor of internal tip.
  - Action: audit marker history and fix the tag manually on internal remote before rerunning.
- Public push succeeded, marker push failed
  - Symptom: publish command fails during marker tag push after public branch moved.
  - Action: rerun `make publish-github` after fixing internal remote/tag push permissions. The command detects tree equality and updates marker only, avoiding duplicate public squash commits.
make release-github-sync-tag is the release path when binaries are published
on GitHub, while GitLab only receives a matching release tag (no GitLab release
object or assets).
| Variable | Default | Purpose |
|---|---|---|
| `RELEASE_TAG` | empty (required) | Release tag name (`vX.Y.Z`). |
| `RELEASE_TITLE` | empty | GitHub release title; defaults to `RELEASE_TAG`. |
| `RELEASE_NOTES_FILE` | empty | Optional notes preface merged ahead of auto-generated changelog highlights. |
| `RELEASE_DIST_DIR` | `dist` | Output directory for built binaries and checksums. |
| `GITHUB_RELEASE_REPO` | empty | Optional `owner/repo` override for release publishing. |
| `RELEASE_IMAGE` | `arodd/hdhriptv` | Container image repository pushed during release; publishes `<repo>:<RELEASE_TAG>` and `<repo>:latest` (set full registry/repo, for example `ghcr.io/<owner>/hdhriptv`, when not publishing to Docker Hub). |
| `BINFMT_IMAGE` | `tonistiigi/binfmt:qemu-v10.0.4` | Pinned binfmt helper image used for multi-arch emulation bootstrap; must be an explicit version tag or digest (floating `:latest` is rejected). |
| `BUILDX_BUILDER` | `hdhriptv-multiarch` | Docker Buildx builder name used for multi-arch container publishing. |
| `INTERNAL_REMOTE` | `origin` | Internal source-of-truth remote (GitLab). |
| `INTERNAL_REPO_URL` | `git@gitlab.lan:arodd/hdhriptv.git` | Canonical URL for `INTERNAL_REMOTE`. |
| `PUBLIC_REMOTE` | `github` | Public publishing remote (GitHub). |
| `PUBLIC_REPO_URL` | `git@github.com:arodd/hdhriptv.git` | Canonical URL for `PUBLIC_REMOTE`. |
| `SYNC_BRANCH` | `main` | Branch to mirror from internal to public before release tagging. |
| `PUBLISH_GITHUB_COMMIT_MESSAGE` | empty | Optional squash commit subject override used by `make publish-github`; default is `public(<SYNC_BRANCH>): release <RELEASE_TAG>` during `release-github-sync-tag`. |
Before running the release command:
- Authenticate Docker to the registry backing `RELEASE_IMAGE` (for example `docker login`, or `docker login ghcr.io` when using GHCR).
- Confirm `RELEASE_IMAGE` resolves to the intended registry/repository because the release publishes both `<RELEASE_IMAGE>:<RELEASE_TAG>` and `<RELEASE_IMAGE>:latest`.
- Verify clean tracked working tree and local/internal branch parity.
- Build release binaries and checksum file:
  - `dist/hdhriptv-linux-amd64`
  - `dist/hdhriptv-linux-arm64`
  - `dist/hdhriptv-darwin-amd64`
  - `dist/hdhriptv-darwin-arm64`
  - `dist/hdhriptv-windows-amd64.exe`
  - `dist/SHA256SUMS`
- Run `make publish-github` to synchronize public mirror commit.
  - Default squash commit subject is `public(<SYNC_BRANCH>): release <RELEASE_TAG>` unless overridden.
- Verify internal/public branch tree hashes match.
- Push `RELEASE_TAG` to internal remote at internal tip and to public remote at public tip.
- Build and push multi-arch container image tags:
  - `<RELEASE_IMAGE>:<RELEASE_TAG>`
  - `<RELEASE_IMAGE>:latest`
- Generate release notes from `CHANGELOG.md` entries added since the previous internal release tag (grouped by type), then merge optional `RELEASE_NOTES_FILE` content as a preface.
- Create/update GitHub release and upload binaries/checksums.
- Remote tag mismatch (safe-stop)
  - Symptom: command reports existing `RELEASE_TAG` on a remote points to a different commit than expected.
  - Action: inspect release history, then retag manually (or choose a new release tag) before rerunning.
- Release partially published to GitHub
  - Symptom: release exists but assets are incomplete/outdated.
  - Action: rerun `make release-github-sync-tag`; asset upload uses clobber and converges to current local build outputs.
- Container push/auth failure
  - Symptom: tag push succeeds but container publish fails (for example auth or registry reachability).
  - Action: fix registry auth/config and rerun `make release-github-sync-tag`; tag push is idempotent and container push retries the same `<RELEASE_IMAGE>:<RELEASE_TAG>` and `<RELEASE_IMAGE>:latest` targets.
- Public mirror sync failure
  - Symptom: underlying `make publish-github` step fails.
  - Action: resolve mirror/marker issue using the publish runbook above, then rerun the release command.
internal/stream exports stream telemetry metrics on the standard
Prometheus /metrics endpoint.
| Metric | Type | Labels | Purpose |
|---|---|---|---|
| `stream_slow_skip_events_total` | counter | none | Total skip-policy lag events (slow subscriber fell behind and skipped forward). |
| `stream_slow_skip_lag_chunks` | histogram | none | Lag depth distribution (chunks) when skip-policy events happen. |
| `stream_slow_skip_lag_bytes` | histogram | none | Estimated lag depth distribution (bytes) when skip-policy events happen. |
| `stream_subscriber_write_deadline_unsupported_total` | counter | none | Subscriber writes where write-deadline setup was unsupported and the stream fell back to best-effort writes without deadline enforcement. |
| `stream_subscriber_write_deadline_timeouts_total` | counter | none | Subscriber writes that hit write deadlines (`os.ErrDeadlineExceeded`/timeout-classified errors). |
| `stream_subscriber_write_short_writes_total` | counter | none | Subscriber writes that returned `io.ErrShortWrite`. |
| `stream_subscriber_write_blocked_seconds` | histogram | none | Time spent blocked in subscriber `ResponseWriter.Write` calls. |
| `stream_source_read_pause_events_total` | counter | `reason` | Source read pauses >= 1s (minimum threshold) grouped by finalize reason. |
| `stream_source_read_pause_seconds` | histogram | `reason` | Duration distribution for source read pauses >= 1s grouped by finalize reason. |
| `stream_startup_probe_read_worker_waits_total` | counter | none | Startup-probe detached-read attempts that waited because the global worker budget was saturated. |
| `stream_startup_probe_read_worker_acquire_timeouts_total` | counter | none | Startup-probe detached-read attempts that timed out/canceled while waiting for worker budget. |
stream_source_read_pause_*{reason=...} values:
- `recovered`: reads resumed and the pause closed normally.
- `pump_exit`: pump/run cycle exited while pause tracking was active.
- `ctx_cancel`: session context cancellation finalized an active pause.
| Metric | Type | Labels | Purpose |
|---|---|---|---|
| `stream_close_with_timeout_started_total` | counter | none | Number of bounded close attempts started. |
| `stream_close_with_timeout_retried_total` | counter | none | Number of deferred close attempts re-run from retry queue. |
| `stream_close_with_timeout_suppressed_total` | counter | none | Number of close attempts suppressed due to in-flight/budget limits. |
| `stream_close_with_timeout_suppressed_duplicate_total` | counter | none | Suppressed attempts due to duplicate close ownership. |
| `stream_close_with_timeout_suppressed_budget_total` | counter | none | Suppressed attempts due to global worker-budget saturation. |
| `stream_close_with_timeout_dropped_total` | counter | none | Deferred close retries dropped after retry queue overflow. |
| `stream_close_with_timeout_timeouts_total` | counter | none | Close attempts that exceeded bounded timeout. |
| `stream_close_with_timeout_late_completions_total` | counter | none | Timed-out closes that later completed before abandon deadline. |
| `stream_close_with_timeout_late_abandoned_total` | counter | none | Timed-out closes still blocked after abandon deadline. |
| `stream_close_with_timeout_release_underflow_total` | counter | none | Internal worker-slot release underflow guardrail hits. |
Metrics are most useful when tied to clear operator actions. The baseline below gives a practical starting point for dashboards and alerts; tune values to your traffic profile after collecting at least several days of baseline.
| Panel | Example query | Why it matters |
|---|---|---|
| Source read-pause rate by reason | `sum by (reason) (rate(stream_source_read_pause_events_total[5m]))` | Distinguishes upstream starvation (`recovered`) from expected shutdown/cancel churn (`ctx_cancel`, `pump_exit`). |
| Source read-pause duration p95 | `histogram_quantile(0.95, sum by (le, reason) (rate(stream_source_read_pause_seconds_bucket[5m])))` | Shows whether pauses are brief blips or sustained stalls likely to drain DVR client buffers. |
| Slow-skip event rate | `sum(rate(stream_slow_skip_events_total[5m]))` | Indicates subscribers repeatedly falling behind the publish window. |
| Write-pressure timeout/short-write/unsupported rate | `sum(rate(stream_subscriber_write_deadline_timeouts_total[5m]))`, `sum(rate(stream_subscriber_write_short_writes_total[5m]))`, and `sum(rate(stream_subscriber_write_deadline_unsupported_total[5m]))` | Detects downstream client/network write pressure and unsupported deadline fallback paths before widespread disconnects. |
| Bounded-close timeout/suppression | `sum(rate(stream_close_with_timeout_timeouts_total[5m]))` and `sum(rate(stream_close_with_timeout_suppressed_total[5m]))` | Surfaces shutdown/cleanup pressure and worker-budget saturation. |
| Late-abandoned / release-underflow counters | `increase(stream_close_with_timeout_late_abandoned_total[15m])`, `increase(stream_close_with_timeout_release_underflow_total[15m])` | Flags close-path invariants and potentially stuck close operations. |
| Per-source tuner utilization | `stream_virtual_tuner_utilization_ratio` by `playlist_source` | Identifies per-source pool saturation before global capacity is exhausted. |
| Per-source sync duration | `histogram_quantile(0.95, sum by (le, playlist_source) (rate(playlist_sync_source_duration_seconds_bucket[5m])))` | Detects slow-fetching sources that extend total sync window. |
| Per-source sync errors | `sum by (playlist_source) (rate(playlist_sync_source_errors_total[5m]))` | Isolates failing sources for targeted investigation. |
A ready-to-import dashboard bundle is available at:
deploy/grafana/hdhriptv-release-health-dashboard.json
It combines Prometheus and Loki release health signals into one view:
- target availability and process uptime
- warning/error log rates over deploy windows
- close-path invariant counters (timeouts, late-abandoned, release-underflow)
- write-pressure / slow-skip and source read-pause rates
- per-source tuner utilization and playlist-sync latency/error trends
- release lifecycle logs (`starting server`, `phase=app_version_sync`, `job finished`)
Import instructions and datasource mapping details are in:
deploy/grafana/README.md
| Severity | Condition | Action |
|---|---|---|
| Page | `increase(stream_close_with_timeout_late_abandoned_total[10m]) > 0` | Investigate immediately: at least one close stayed blocked past abandon window; correlate with `shared session slate AV close error` logs and stream/tuner churn. |
| Page | `increase(stream_close_with_timeout_release_underflow_total[10m]) > 0` | Treat as invariant violation; inspect recent close suppression/timeout logs and deploy health before user impact expands. |
| Warning | `sum(rate(stream_source_read_pause_events_total{reason="recovered"}[5m])) > 0.1` for 15m | Upstream starvation is recurring; inspect provider/source health and failover behavior. |
| Warning | `sum(rate(stream_subscriber_write_deadline_timeouts_total[5m])) > 0` for 10m | Downstream write pressure is active; review subscriber network paths and lag policy settings. |
| Warning | `sum(rate(stream_subscriber_write_deadline_unsupported_total[5m])) > 0` for 10m | Write-deadline fallback is active; investigate transport/proxy paths that do not support deadlines. |
| Warning | `sum(rate(stream_slow_skip_events_total[5m])) > 0` for 10m | Slow-client lag compensation is frequently engaged; verify buffer and subscriber lag settings. |
stream_source_read_pause_events_total and
stream_source_read_pause_seconds are now reason-labeled vectors.
- For total event rate across all reasons, use: `sum(rate(stream_source_read_pause_events_total[5m]))`
- For per-reason breakdown, use: `sum by (reason) (rate(stream_source_read_pause_events_total[5m]))`
- For duration percentiles by reason, aggregate histogram buckets by both `le` and `reason` before `histogram_quantile(...)`.
internal/stream/profile_probe.go provides stream metadata detection used
by recovery filler resolution matching and tuner status diagnostics.
type streamProfile struct {
Width int
Height int
FrameRate float64
VideoCodec string
AudioCodec string
AudioSampleRate int
AudioChannels int
BitrateBPS int64
}

Invokes ffprobe with JSON output (-of json -show_streams -show_format)
against a stream URL. Key parameters:
| Parameter | Default | Purpose |
|---|---|---|
| `ffprobePath` | `ffprobe` | Executable path (`--ffprobe-path` / `FFPROBE_PATH`) |
| `timeout` | 4 s | Context deadline for the probe |
| `analyzeduration` | 1,500,000 us | ffprobe analysis window |
| `probesize` | 1,000,000 bytes | ffprobe input read limit |
The function:
- Runs ffprobe and captures JSON output.
- Selects the first video and first audio stream from the result (`selectProfileStreams`).
- Parses frame rate from `avg_frame_rate` (falling back to `r_frame_rate`), handling both decimal and fractional (`N/D`) formats.
- Resolves bitrate using a priority chain: video `bit_rate` > format `bit_rate` > stream tag `variant_bitrate` (`firstPositiveInt64`).
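
The frame-rate parsing and bitrate priority chain might look roughly like the sketch below. The helper names (`parseFrameRate`, `frameRate`, `firstPositive`) and their signatures are assumptions made for illustration; the actual functions in profile_probe.go may differ.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseFrameRate accepts a decimal ("29.97") or fractional ("30000/1001")
// value and returns 0 for anything unusable, such as "0/0".
func parseFrameRate(s string) float64 {
	s = strings.TrimSpace(s)
	if num, den, ok := strings.Cut(s, "/"); ok {
		n, errN := strconv.ParseFloat(num, 64)
		d, errD := strconv.ParseFloat(den, 64)
		if errN != nil || errD != nil || d == 0 {
			return 0
		}
		return n / d
	}
	v, err := strconv.ParseFloat(s, 64)
	if err != nil || v < 0 {
		return 0
	}
	return v
}

// frameRate prefers avg_frame_rate and falls back to r_frame_rate.
func frameRate(avg, r string) float64 {
	if v := parseFrameRate(avg); v > 0 {
		return v
	}
	return parseFrameRate(r)
}

// firstPositive returns the first value greater than zero, or 0 if none is.
func firstPositive(values ...int64) int64 {
	for _, v := range values {
		if v > 0 {
			return v
		}
	}
	return 0
}

func main() {
	// avg_frame_rate is unusable here, so r_frame_rate wins.
	fmt.Printf("%.3f\n", frameRate("0/0", "30000/1001")) // 29.970
	// Video bit_rate missing, format bit_rate missing, tag value present.
	fmt.Println(firstPositive(0, 0, 2500000)) // 2500000
}
```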
Operational note:
- Startup logs include `ffprobe_path` and `ffmpeg_path` resolution so operators can verify which executables were selected at runtime.
When RECOVERY_FILLER_MODE=slate_av, the recovery filler uses the probed
profile to match the slate resolution, frame rate, and audio parameters to
the active source. If the profile has odd dimensions (e.g., 853x480), they
are normalized to even values for libx264/yuv420p encoder safety. If no
profile is available, defaults are 1280x720 @ 29.97 fps, 48 kHz stereo.
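
As a small illustration of the even-dimension rule, the sketch below clears the low bit of each dimension. Rounding down rather than up is an assumption, and `evenDown` and `slateDimensions` are hypothetical names.

```go
package main

import "fmt"

// Defaults quoted in the text for a missing profile.
const (
	defaultWidth  = 1280
	defaultHeight = 720
)

// evenDown clears the lowest bit so odd values drop to the next even value.
// Rounding down (rather than up) is an assumption of this sketch.
func evenDown(v int) int { return v &^ 1 }

// slateDimensions returns encoder-safe dimensions for a probed width/height,
// falling back to the defaults when no usable profile is available.
func slateDimensions(width, height int) (int, int) {
	w, h := evenDown(width), evenDown(height)
	if w <= 0 || h <= 0 {
		return defaultWidth, defaultHeight
	}
	return w, h
}

func main() {
	fmt.Println(slateDimensions(853, 480)) // 852 480
	fmt.Println(slateDimensions(0, 0))     // 1280 720
}
```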
Computes a running bitrate estimate from bytes pushed since source selection:
bps = (bytesPushed - sourceBytesBaseline) * 8 / elapsed_seconds
This estimate is exposed in tuner status as current_bitrate_bps for
telemetry purposes only. Keepalive pacing guardrail calculations use the
separate recoveryKeepaliveExpectedRate field.
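
A worked instance of the formula, as a sketch: the function name `estimateBitrateBPS` and the zero result for non-positive elapsed time are assumptions, not confirmed behavior.

```go
package main

import (
	"fmt"
	"time"
)

// estimateBitrateBPS applies bps = (bytesPushed - baseline) * 8 / elapsed_seconds.
// Returning 0 when no time has elapsed or no bytes were pushed is an assumption.
func estimateBitrateBPS(bytesPushed, baseline int64, elapsed time.Duration) int64 {
	if elapsed <= 0 || bytesPushed <= baseline {
		return 0
	}
	return int64(float64(bytesPushed-baseline) * 8 / elapsed.Seconds())
}

func main() {
	// 1,250,000 bytes pushed in 2 s is 10,000,000 bits / 2 s.
	fmt.Println(estimateBitrateBPS(1_250_000, 0, 2*time.Second)) // 5000000
}
```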
Probed profiles are persisted to the database via UpdateSourceProfile so
that admin diagnostics/history retain last-seen stream characteristics across
process restarts. Current live shared-session filler sizing uses the active
session's in-memory probe profile. The auto-prioritize analyzer
(internal/analyzer/ffmpeg.go) uses a similar but separate probing
subsystem for source quality ranking.
internal/stream/url_sanitize.go implements credential and token redaction
for all user-visible URL output.
Parses the URL, then strips:
| Component | Action |
|---|---|
| User (userinfo) | Removed (`parsed.User = nil`) |
| RawQuery | Removed (`parsed.RawQuery = ""`) |
| Fragment | Removed (`parsed.Fragment = ""`) |
| Scheme | Preserved |
| Host | Preserved |
| Path | Preserved |
If url.Parse fails, a fallback path (fallbackSanitizeStreamURL)
manually strips query/fragment and removes userinfo by finding @ in the
authority portion.
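
A minimal sketch of this redaction policy using net/url follows. The function names `sanitizeURL` and `fallbackSanitize` are placeholders, and the fallback logic is one plausible reading of the description rather than the project's code.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// sanitizeURL drops userinfo, query, and fragment, keeping scheme, host, path.
func sanitizeURL(raw string) string {
	parsed, err := url.Parse(raw)
	if err != nil {
		return fallbackSanitize(raw)
	}
	parsed.User = nil
	parsed.RawQuery = ""
	parsed.ForceQuery = false
	parsed.Fragment = ""
	parsed.RawFragment = ""
	return parsed.String()
}

// fallbackSanitize handles inputs url.Parse rejects: it cuts at the first
// '?' or '#', then removes everything up to the last '@' in the authority.
func fallbackSanitize(raw string) string {
	if i := strings.IndexAny(raw, "?#"); i >= 0 {
		raw = raw[:i]
	}
	schemeEnd := strings.Index(raw, "://")
	if schemeEnd < 0 {
		return raw
	}
	authStart := schemeEnd + len("://")
	authEnd := len(raw)
	if i := strings.IndexByte(raw[authStart:], '/'); i >= 0 {
		authEnd = authStart + i
	}
	if at := strings.LastIndexByte(raw[authStart:authEnd], '@'); at >= 0 {
		raw = raw[:authStart] + raw[authStart+at+1:]
	}
	return raw
}

func main() {
	fmt.Println(sanitizeURL("http://user:pass@example.com:8080/live/1.ts?token=abc#frag"))
	// A malformed percent-escape in the password makes url.Parse fail,
	// so this input takes the fallback path.
	fmt.Println(sanitizeURL("http://user:p%zz@host.example/live?token=x"))
}
```

Both calls print a URL with only scheme, host, and path remaining.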
- Log output: All stream URLs logged during session lifecycle use `sanitizeStreamURLForLog`.
- Tuner status API (`/api/admin/tuners`): Live `source_stream_url` fields and all `session_history[*].sources[*].stream_url` entries use `sanitizeStreamURLForStatus` (same redaction policy).
- Client stream status: The `ClientStreamStatus` source URL is sanitized when building the snapshot.
The sanitization functions are intentionally aliased
(sanitizeStreamURLForLog and sanitizeStreamURLForStatus both delegate
to SanitizeURLForLog) so the redaction policy is consistent across all
surfaces and can be changed in one place.
internal/stream/status.go provides the structured status snapshot served
by GET /api/admin/tuners and rendered in /ui/tuners.
Optional query behavior:
- `GET /api/admin/tuners?resolve_ip=1` enables reverse-DNS enrichment for client addresses and populates `client_host` on:
  - `client_streams[*]`
  - `session_history[*].subscribers[*]`
- The default (`resolve_ip` omitted/false) skips reverse lookups.
- Reverse lookups are bounded in two ways to protect request latency:
  - per-lookup timeout: `2s` per unique IP.
  - total resolve budget: `8s` for the full request's `resolve_ip` pass.
- Reverse lookups run sequentially and are memoized:
  - within a single response payload (duplicate IPs are resolved once),
  - across requests via a bounded in-process cache (successful lookups cached for `~2m`; failed lookups cached for `~15s`).
  - cache cardinality is capped (`4096` entries by default); expired entries are swept periodically and oldest entries are evicted first when the cap is reached.
- Cache tuning knobs (max entries, negative TTL, sweep interval) are currently compile-time defaults in `internal/http/admin_routes.go` and are not exposed as runtime flags/environment variables.
- With debug logging enabled, resolve stats are aggregated and emitted periodically (`~30m` by default) as `admin tuner resolve_ip summary`, covering counters since the previous summary (`resolve_requests`, `cache_hits`, `cache_misses`, `cache_hit_rate`, `lookup_calls`, `lookup_errors`).
- For deployments with many unique clients or slow upstream PTR infrastructure, use a local caching resolver (`systemd-resolved`, `dnsmasq`, `unbound`, etc.) to keep reverse lookups fast and stable.
| Type | Purpose |
|---|---|
| `TunerStatusSnapshot` | Top-level response: tuner count, in-use/idle counts, churn summary, tuner list, client streams, session history |
| `TunerStatus` | One active tuner lease with linked shared-session state (source, recovery, keepalive, subscriber details) |
| `ClientStreamStatus` | One connected subscriber mapped to its backing tuner session |
| `ChurnSummary` | Aggregated recovery/reselection counters across all active shared sessions |
| `SharedSessionHistory` | Lifecycle record for one shared session (active or recently closed) |
Maps tuner kind and subscriber count to a human-readable state string:
| Kind | Subscribers | Has session | State |
|---|---|---|---|
| `probe` | any | any | `probe` |
| `client` | > 0 | any | `active_subscribers` |
| `client` | 0 | yes | `idle_grace_no_subscribers` |
| `client` | 0 | no | `allocating_session` |
| other | > 0 | any | active_subscribers |
| other | 0 | any | unknown |
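
The table reads naturally as a single switch. The sketch below encodes it directly; the function name `tunerState` and its parameters are illustrative, not the signature used in internal/stream/status.go.

```go
package main

import "fmt"

// tunerState maps (kind, subscriber count, session presence) to the state
// strings in the table above. Case order matters: probe wins outright, then
// any tuner with subscribers is active regardless of kind.
func tunerState(kind string, subscribers int, hasSession bool) string {
	switch {
	case kind == "probe":
		return "probe"
	case subscribers > 0:
		return "active_subscribers"
	case kind == "client" && hasSession:
		return "idle_grace_no_subscribers"
	case kind == "client":
		return "allocating_session"
	default:
		return "unknown"
	}
}

func main() {
	fmt.Println(tunerState("client", 0, true))  // idle_grace_no_subscribers
	fmt.Println(tunerState("client", 0, false)) // allocating_session
	fmt.Println(tunerState("other", 0, true))   // unknown
}
```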
In-memory retention of active and recently closed session lifecycle records. Key operational details:
- Retention: 256 entries (configurable via `--session-history-limit` / `SESSION_HISTORY_LIMIT`; default `256`).
- Eviction: Oldest entries are removed when the retention cap is reached. `session_history_truncated_count` in the API response tracks how many entries have been evicted since process start.
- Per-session timeline bounds: Each entry exposes `source_history_limit` / `source_history_truncated_count` and `subscriber_history_limit` / `subscriber_history_truncated_count` so operators can distinguish overall session-entry eviction from in-session source/subscriber timeline trimming.
- Content: Each entry includes `opened_at`, optional `closed_at`, `active` flag, `terminal_status`, `peak_subscribers`, `recovery_cycle_count`, `same_source_reselect_count`, and nested `sources` / `subscribers` timelines.
- URL sanitization: All `sources[*].stream_url` values in history entries are sanitized before inclusion in the snapshot.
- Lifetime: History is in-memory only and is lost on process restart.
Aggregated from all active SharedSessionStats:
- `recovering_session_count`: sessions that have entered recovery in the current process lifetime (non-zero `recovery_cycle` or non-empty `recovery_reason`), not strictly "currently recovering".
- `sessions_with_reselect_count`: sessions where `same_source_reselect_count > 0`.
- `sessions_over_reselect_threshold`: sessions exceeding the alert threshold (default `3`).
- `total_recovery_cycles`, `total_same_source_reselect_count`: sums across all sessions.
- `max_same_source_reselect_count` with `max_reselect_channel_id` / `max_reselect_guide_number`: identifies the worst-churning channel.
Shutdown is orchestrated in cmd/hdhriptv/main.go and follows a strict
ordering to ensure in-flight work drains before dependencies are closed.
The process listens for SIGINT and SIGTERM via
signal.NotifyContext. When received, the context is canceled and
shutdown begins.
- HTTP servers call `Shutdown` with a 10 s timeout. If graceful shutdown times out, active connections are force-closed.
- UDP discovery server is closed.
- UPnP/SSDP server is closed (if enabled).
- Background source prober is closed. This closes the prober session-close queue and blocks until queued probe session closes have drained.
- Stream handler is closed via `CloseWithContext(shutdownCtx)`. This cancels all active shared sessions, waits for session goroutines to drain within the remaining 10 s shutdown deadline, releases tuner leases, and ensures async source-health persistence completes. If the deadline expires before full convergence, shutdown logs a warning and continues teardown.
- Admin handler is closed. This closes the `closeCh` channel to signal background workers, acquires `dynamicSyncMu` and `dynamicBlockSyncMu` to establish a happens-before barrier with enqueue paths, then calls `workerWg.Wait()` to block until all background workers finish.
- Main goroutine calls `wg.Wait()` to wait for listener goroutines.
- Job runner is closed (deferred). This waits for in-flight `FinishRun` persistence.
- Store is closed (deferred). This closes the SQLite connection.
The AdminHandler uses a closeCh / workerWg pattern for background
worker lifecycle:
- `closeCh` is a `chan struct{}` closed exactly once via `closeOnce.Do`. Background workers derive their context from this channel via `workerContext()`.
- `workerWg` tracks active background goroutines (dynamic channel sync, dynamic block sync). Each enqueue path calls `workerWg.Add(1)` under its respective mutex, and each worker defers `workerWg.Done()`.
- `Close()` closes the channel, then acquires both sync mutexes to establish a happens-before barrier: after the lock/unlock pairs, no new `Add(1)` can occur because enqueue methods check `closeCh` under the mutex.
- `Close()` then calls `workerWg.Wait()` to block until all background workers finish.
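
The pattern can be reduced to a few lines. The sketch below uses a single mutex and hypothetical names (`handler`, `enqueue`, `closeDemo`); the real AdminHandler has two enqueue mutexes and derives a context rather than passing the channel directly.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

type handler struct {
	closeOnce sync.Once
	closeCh   chan struct{}
	syncMu    sync.Mutex // guards the enqueue path, in the role of dynamicSyncMu
	workerWg  sync.WaitGroup
}

func newHandler() *handler { return &handler{closeCh: make(chan struct{})} }

// enqueue starts a background worker unless Close has begun. The closeCh
// check and workerWg.Add(1) both happen under syncMu so Close can fence them.
func (h *handler) enqueue(work func(stop <-chan struct{})) bool {
	h.syncMu.Lock()
	defer h.syncMu.Unlock()
	select {
	case <-h.closeCh:
		return false // shutting down: refuse new work
	default:
	}
	h.workerWg.Add(1)
	go func() {
		defer h.workerWg.Done()
		work(h.closeCh)
	}()
	return true
}

// Close signals workers, fences the enqueue path, then waits for workers.
func (h *handler) Close() {
	h.closeOnce.Do(func() { close(h.closeCh) })
	// Lock/unlock barrier: any enqueue that observed an open closeCh has
	// already called Add(1) by the time this lock is acquired.
	h.syncMu.Lock()
	h.syncMu.Unlock()
	h.workerWg.Wait()
}

// closeDemo enqueues one worker, closes, then tries to enqueue again.
func closeDemo() (accepted, rejected bool, finished int32) {
	h := newHandler()
	var n atomic.Int32
	accepted = h.enqueue(func(stop <-chan struct{}) {
		<-stop // run until shutdown is signaled
		n.Add(1)
	})
	h.Close()
	rejected = !h.enqueue(func(<-chan struct{}) { n.Add(1) })
	return accepted, rejected, n.Load()
}

func main() {
	fmt.Println(closeDemo()) // true true 1
}
```

Without the lock/unlock pair, an enqueue racing with `Close()` could call `Add(1)` after `Wait()` had already returned, which is the hazard the barrier is there to prevent.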
- Stream handler must close before admin handler: Stream sessions may interact with channel/source state that admin sync workers also touch.
- Admin handler must close before store: Background sync workers perform database operations.
- Job runner must close before store: The runner's `FinishRun` writes terminal job state to the database.
- Store closes last: All subsystems that perform database I/O must drain before `store.Close()`.
All log output uses slog.TextHandler, which produces key=value pairs:
time=2026-02-17T10:30:15.123Z level=INFO msg="shared session created" channel_id=42 guide_number=101
- Not JSON: Pipelines expecting JSON-structured logs will need a `key=value` parser (e.g., logfmt).
- Consistent field names: Correlation fields like `channel_id`, `guide_number`, `tuner_id`, `source_id`, `subscriber_id`, `run_id`, `job_name` appear consistently across related events.
- Level field: Always present as `level=TRACE|DEBUG|INFO|WARN|ERROR`.
- Timestamp: Always present as `time=` in RFC 3339 format.
- Multi-stream: stdout carries `TRACE`/`DEBUG`/`INFO`; stderr carries `WARN`/`ERROR`. The log file respects the configured log level. When aggregating from containers, prefer the log file or merge both streams.
Each process startup creates a new log file. Over time, log files
accumulate without automatic cleanup. In long-running deployments, configure
external rotation or periodic deletion of old hdhriptv-*.log files.
The database at DB_PATH uses WAL mode. The main .db file, -wal, and
-shm files should all reside on the same filesystem. Backup procedures
should capture all three files (or stop the service first for a clean
single-file backup).
The in-memory session history retains up to 256 entries by default
(SESSION_HISTORY_LIMIT) and does not survive a process restart. Each entry
includes source and subscriber timelines, but per-session
timeline limits (source_history_limit / subscriber_history_limit) prevent
active sessions from growing those nested arrays without bound. If trimming
occurs, source_history_truncated_count and
subscriber_history_truncated_count expose how many oldest per-session rows
were evicted.
slate_av recovery filler spawns an ffmpeg subprocess during each recovery
window. In deployments with many concurrent recovering sessions, this means
multiple ffmpeg processes may run simultaneously, each producing realtime
MPEG-TS output. The keepalive guardrail (2.5x expected bitrate for 1.5 s)
limits runaway output, and the fallback chain (slate_av -> psi ->
null) provides graceful degradation if ffmpeg fails.
When RECOVERY_FILLER_MODE=slate_av, ffmpeg drawtext rendering expects a
Sans-compatible font. The project Docker image installs ttf-dejavu to
provide that dependency. For custom images/hosts, install ttf-dejavu (or an
equivalent Sans font package) to avoid fallback failures like Cannot find a valid font for the family Sans.
- On startup, if at least one playlist source has a configured URL (via the playlist_sources table, the persisted playlist.url setting, or the PLAYLIST_URL environment variable), the service performs an initial playlist sync job covering all enabled sources.
- Startup initial playlist sync is scheduled only after the primary HTTP listener answers /healthz readiness probes to avoid startup-window DVR reload races against an unopened listener.
- DVR lineup reload after playlist sync is optimistic: reload failures do not fail playlist sync and are surfaced through logs and job summary fields (dvr_lineup_reload_status=unknown, dvr_lineup_reload_skip_reason=reload_error).
- Startup sync emits structured log records (msg="initial playlist sync phase") with initial_sync_phase=scheduled_after_listener_start|completed|failed and includes attempt/duration metadata on completion.
- Periodic refresh uses the automation schedule in jobs.playlist_sync.cron. All enabled sources share a single cron schedule.
- --refresh-schedule / REFRESH_SCHEDULE and the UI/API automation settings update the same persisted schedule keys.
- Source refresh worker concurrency is controlled by --playlist-sync-source-concurrency / PLAYLIST_SYNC_SOURCE_CONCURRENCY (default 1, max 16). 1 keeps sequential behavior; higher values enable bounded parallel source refresh.
- Scheduler overlap contention is handled by a coalesced deferred catch-up policy with bounded exponential backoff (instead of dropping overlapping scheduled ticks).
- Per-source manual sync: POST /api/admin/jobs/playlist-sync/run?source_id=N syncs a single source without affecting other sources.
- Playlist upsert behavior (per source):
  - upserts existing entries for the current source
  - inserts new entries with the current playlist_source_id
  - marks stale entries inactive only within the current source (source-scoped deactivation)
  - items from other sources are never affected by a single-source refresh
- Published channel behavior:
  - channels are ordered explicitly
  - guide numbers are contiguous and start at --traditional-guide-start (default 100)
  - startup validates existing traditional channel numbering against --traditional-guide-start and auto-renumbers when drift is detected
  - add/remove/reorder operations renumber guide numbers to keep them dense
  - /ui/catalog supports a rapid source-add workflow:
    - choose one target channel once in the toolbar
    - row actions switch to Add Source for rapid attachment
    - clear target selection to return to row-level Create Channel actions
  - dynamic channel creation is toolbar-driven from current filter context
  - dynamic channel blocks materialize generated channels into reserved guide ranges (10000+) and are managed separately from traditional channel ordering
  - lineup-changing reorder/materialization paths enqueue a shared DVR lineup reload queue (trailing-edge debounce=20s, max_wait=300s) so rapid mutation bursts coalesce instead of triggering one reload per request
  - each channel may have multiple ordered sources (priority_index)
  - each channel also has a dynamic_rule:
    - if enabled, matching catalog items are synchronized into channel sources asynchronously
    - when disabled, pending/in-flight dynamic sync is canceled
    - newer enabled updates also preempt/cancel stale in-flight sync runs so broad scans converge to the latest rule quickly
    - dynamic_rule.search_query is required when enabled
    - dynamic_rule.search_query uses the same OR-capable include/exclude semantics as /api/items (| / OR plus -term / !term; no OR separator preserves legacy AND behavior)
    - only dynamic_query-managed associations are automatically removed when no longer matched; manual associations are preserved
  - non-matching source attachments require explicit override (allow_cross_channel)
  - lineup responses include only enabled channels
  - source health cooldown uses a bounded fail ladder (10s, 30s, 2m, 10m, 1h cap)
  - successful source startup clears cooldown and resets failure state (fail_count, last_fail_at, last_fail_reason)
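The bounded fail ladder can be pictured as a lookup keyed by consecutive failure count. The following is an illustrative Python sketch, not the project's Go code; the name cooldown_for and the rule that the Nth consecutive failure selects the Nth rung are assumptions made for the example.

```python
from datetime import timedelta

# Ladder values taken from the text: 10s, 30s, 2m, 10m, then a 1h cap.
FAIL_LADDER = [
    timedelta(seconds=10),
    timedelta(seconds=30),
    timedelta(minutes=2),
    timedelta(minutes=10),
    timedelta(hours=1),
]


def cooldown_for(fail_count: int) -> timedelta:
    """Return the cooldown after fail_count consecutive startup failures.

    Assumption: failure N maps to rung N (1-based), and any count past the
    last rung stays at the 1h cap. A count of zero or less means no cooldown,
    which mirrors the reset that follows a successful source startup.
    """
    if fail_count <= 0:
        return timedelta(0)
    index = min(fail_count, len(FAIL_LADDER)) - 1
    return FAIL_LADDER[index]
```

Under these assumptions a source that keeps failing is retried at most once per hour, and a single success returns it to immediate eligibility.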
- Run playlist sync when catalog content changed, channel sources need to be reconciled, or lineup entries seem stale.
- Run auto-prioritize when channel source order quality needs to be recalculated from fresh probes.
- Use playlist sync first, then auto-prioritize, after major playlist/provider updates.
- Schedule playlist sync more frequently than auto-prioritize in most deployments because sync is correctness-focused and analyze/reorder is probe-heavy.
- If providers enforce strict concurrent session caps, reduce auto-prioritize frequency and verify
AUTO_PRIORITIZE_PROBE_TUNE_DELAY.
- Reorder/materialization APIs return 204 No Content after enqueueing reload work, not after provider-side DVR lineup reload completion.
- Under normal conditions, expect propagation within debounce + max_wait + dvr_lineup_reload_timeout (default: 20s + 300s + 30s worst case, typically much faster).
- Correlate these queue lifecycle logs in order:
  - admin dvr lineup reload queued
  - admin dvr lineup reload started
  - admin dvr lineup reload completed (healthy) or admin dvr lineup reload failed / admin dvr lineup reload canceled (degraded)
- If you only see repeated queued / follow-up queued events without completed, inspect due_at, due_in_ms, and reason_summary fields to confirm ongoing churn is intentionally coalescing runs.
- For failed events, use the structured error field and verify DVR connectivity/auth with POST /api/admin/dvr/test before retrying channel mutations.
- If lineup remains stale after errors are resolved, trigger a fresh reconciliation cycle with POST /api/admin/jobs/playlist-sync/run and watch for a new successful lineup reload sequence.
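To reason about when a queued reload should become due, it can help to model trailing-edge debounce with a max-wait ceiling. The sketch below is a generic Python model of that technique using the documented defaults; reload_due_at is a hypothetical name, and whether hdhriptv computes due_at exactly this way is an assumption.

```python
def reload_due_at(
    first_mutation: float,
    last_mutation: float,
    debounce: float = 20.0,
    max_wait: float = 300.0,
) -> float:
    """Return the time (seconds) at which a coalesced reload becomes due.

    Each new mutation pushes the due time out by `debounce` (trailing edge),
    but never past `first_mutation + max_wait`, so a continuous burst of
    mutations cannot postpone the reload indefinitely.
    """
    if last_mutation < first_mutation:
        raise ValueError("last_mutation precedes first_mutation")
    trailing_edge = last_mutation + debounce
    ceiling = first_mutation + max_wait
    return min(trailing_edge, ceiling)
```

In this model a single mutation is reloaded 20s later, while a burst lasting several minutes is reloaded at the 300s ceiling, which stays inside the conservative worst-case bound quoted above.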
- Use forward sync (POST /api/admin/dvr/sync) to push hdhriptv channel mapping intent into DVR custom lineup mapping.
- Use reverse sync (POST /api/admin/dvr/reverse-sync) to pull provider-side custom mapping back into hdhriptv.
- Use per-channel reverse sync (POST /api/channels/{channelID}/dvr/reverse-sync) to import a single channel without touching the rest.
- By default, DVR mapping/sync workflows target traditional channels only; set include_dynamic=true on sync APIs only when you intentionally want dynamic generated channels included.
- In configured_only mode, forward sync updates only configured channels.
- In mirror_device mode, forward sync also clears unmatched provider mappings to mirror current hdhriptv state.
- Forward/reverse sync endpoints are intended for the channels DVR provider. Jellyfin provider mode is for lineup refresh orchestration (ReloadLineup) and does not implement custom mapping APIs.
- Configure primary sync/mapping workflows with provider=channels, and configure post-playlist-sync reload fan-out with active_providers.
- Use per-provider base URLs:
  - channels_base_url for Channels DVR.
  - jellyfin_base_url for Jellyfin API root.
  - legacy base_url maps to Channels DVR base URL.
- Jellyfin post-sync reload requires both jellyfin_base_url and jellyfin_api_token.
- hdhriptv Jellyfin requests use header auth only: X-Emby-Token: <token>.
- api_key query-token auth is not used by the provider implementation.
- GET /api/admin/dvr and PUT /api/admin/dvr responses redact jellyfin_api_token (write-only semantics) and expose jellyfin_api_token_configured=true|false.
- Optional jellyfin_tuner_host_id can pin refresh targeting when multiple Jellyfin HDHomeRun tuner hosts exist.
- Jellyfin lineup refresh flow after HDHR changes: GET /System/Configuration/livetv -> POST /LiveTv/TunerHosts -> best-effort GET /ScheduledTasks?isHidden=false observability probe.
- Playlist-sync-triggered DVR reload uses provider-aware gating for each active provider (active_providers):
  - Channels reload runs for active channels.
  - Jellyfin reload runs for active jellyfin only when jellyfin_base_url and jellyfin_api_token are configured.
  - Provider-local build/reload failures are aggregated while healthy providers continue in the same fan-out pass.
  - Mixed outcomes are reported as dvr_lineup_reload_status=partial.
  - All-provider failures are reported as dvr_lineup_reload_status=failed.
  - Provider reason details (skips and failures) are encoded in dvr_lineup_reload_skip_reason (for example jellyfin:missing_jellyfin_api_token or channels:reload_lineup_failed:...).
  - Job summaries include dvr_lineup_reload_status=<disabled|reloaded|partial|failed|skipped> and dvr_lineup_reload_skip_reason=<reason>.
- Query GET /api/admin/jobs/{runID} or GET /api/admin/jobs?name=....
- Supported name filters are playlist_sync, auto_prioritize, and dvr_lineup_sync.
- running means job execution has started and may update progress_cur and progress_max.
- success means terminal completion without errors.
- error means terminal completion with failure details in error.
- canceled means terminal cancellation, usually due to shutdown or context cancellation.
- A second trigger while the same job is active returns HTTP 409.
- If a run remains running without progress changes for an extended period, correlate with logs (job started, job finished, and subsystem-specific warnings/errors).
- Streaming is shared per published channel: multiple viewers of the same guide number reuse one upstream producer session.
- Tuner usage is per active channel session, not per viewer.
- Shared sessions use a size-or-time chunk pump and flush each chunk to subscribers to reduce no-data gaps.
- Stall recovery is policy-driven (STALL_POLICY); default behavior fails over to alternate sources (failover_source), with optional same-source retry (restart_same) or immediate session close (close_session) when configured.
- In failover_source, when no startup-eligible alternates are available in the current recovery pass, recovery automatically downgrades to restart-like same-source retry behavior (restart_same parity) while recovery filler keepalive remains active until startup succeeds or recovery deadline is reached.
- Repeated same-source source_eof recoveries are paced with bounded inter-cycle backoff (250ms -> 2s) so retry loops do not burn recovery-cycle budget in millisecond bursts. Recovery burst accounting is time-aware (recovery_burst_budget_count / recovery_burst_pace_window), and repetitive shared session recovery triggered warnings are coalesced (recovery_trigger_logs_coalesced).
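The 250ms -> 2s pacing for repeated source_eof recoveries can be illustrated with a bounded backoff function. This Python sketch assumes the delay doubles on each consecutive cycle; the text only gives the two endpoints, so the growth factor and the name recovery_backoff_ms are hypothetical.

```python
def recovery_backoff_ms(cycle: int, base_ms: int = 250, cap_ms: int = 2000) -> int:
    """Delay before the given consecutive same-source recovery cycle.

    cycle is 1-based. The delay starts at base_ms and doubles per cycle
    (an assumed growth factor) until it reaches cap_ms, where it stays.
    """
    if cycle < 1:
        raise ValueError("cycle must be >= 1")
    return min(cap_ms, base_ms * 2 ** (cycle - 1))
```

With these assumed parameters the delay reaches the 2s ceiling on the fourth consecutive cycle, so a flapping source costs at most one retry every two seconds rather than a burst of retries within milliseconds.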
ffmpeg-copy and ffmpeg-transcode startup paths pass producer pacing flags
before -i:
- PRODUCER_READRATE -> -readrate
- PRODUCER_READRATE_CATCHUP -> -readrate_catchup
- PRODUCER_INITIAL_BURST -> -readrate_initial_burst
Practical guidance:
- Start with PRODUCER_READRATE=1 and PRODUCER_READRATE_CATCHUP=1 for stable realtime pacing.
- Increase PRODUCER_READRATE_CATCHUP first (for example 1.1 to 1.25) when recovering from transient upstream stalls causes prolonged lag.
- Keep PRODUCER_READRATE near 1 unless you intentionally want sustained faster-than-realtime source ingestion.
- PRODUCER_INITIAL_BURST controls startup fill aggressiveness only; prefer small values (1 to 2) to reduce startup underruns without large burst overshoot.
- Config validation rejects PRODUCER_READRATE_CATCHUP values below PRODUCER_READRATE to avoid contradictory pacing inputs.
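The flag mapping and the catch-up validation rule can be sketched together. The Python below is illustrative only: pacing_args and build_command are hypothetical helpers, the input URL is a placeholder, and the output half of the ffmpeg command is deliberately omitted.

```python
from typing import List


def pacing_args(readrate: float, catchup: float, initial_burst: float) -> List[str]:
    """Translate the three PRODUCER_* settings into ffmpeg input options."""
    if catchup < readrate:
        # Mirrors the documented rule: catch-up below readrate is rejected.
        raise ValueError("PRODUCER_READRATE_CATCHUP must be >= PRODUCER_READRATE")
    return [
        "-readrate", f"{readrate:g}",
        "-readrate_catchup", f"{catchup:g}",
        "-readrate_initial_burst", f"{initial_burst:g}",
    ]


def build_command(ffmpeg_path: str, input_url: str, readrate: float,
                  catchup: float, initial_burst: float) -> List[str]:
    """Place pacing flags before -i so ffmpeg applies them to the input."""
    return [ffmpeg_path, *pacing_args(readrate, catchup, initial_burst), "-i", input_url]
```

The ordering matters because ffmpeg applies options to the next input or output that follows them; flags placed after -i would attach to the output instead.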
Additional ffmpeg startup toggles:
- FFMPEG_INPUT_BUFFER_SIZE maps to ffmpeg -buffer_size in ffmpeg stream modes.
  - 0 disables explicit buffer sizing.
  - values are bounded to 64 MiB (67108864) and validated at startup.
  - the limit is per active ffmpeg stream; estimate aggregate memory as active_streams * buffer_size (for example, 50 * 64 MiB ~= 3.1 GiB, exact 3.125 GiB).
  - this setting is primarily effective on rist://, rtp://, rtsp://, and udp:// inputs (often less common in many M3U playlists).
  - many playlist sources are http:// / https:// MPEG-TS or HLS-style streams and may reject -buffer_size.
  - when enabled for mixed-source playlists, unsupported inputs may fail the first startup attempt before fallback, adding startup delay on those attempts.
  - when ffmpeg reports Option buffer_size not found, startup automatically retries once without -buffer_size and emits ffmpeg startup option unsupported; continuing without ffmpeg input buffer size.
- FFMPEG_DISCARD_CORRUPT=true appends -fflags +discardcorrupt so ffmpeg drops packets flagged as corrupt instead of attempting to decode them.
  - this can reduce corruption cascades on noisy inputs, but may introduce visible or audible gaps where packets are discarded.
  - leave disabled unless upstream source-side packet corruption is a known issue.
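The aggregate memory estimate for FFMPEG_INPUT_BUFFER_SIZE is plain multiplication, but a small helper makes capacity planning repeatable. This is a Python sketch with a hypothetical function name; only the 64 MiB bound and the formula come from the text.

```python
MIB = 1024 * 1024
GIB = 1024 * MIB
MAX_BUFFER_BYTES = 64 * MIB  # 67108864, the documented upper bound


def aggregate_buffer_bytes(active_streams: int, buffer_size: int) -> int:
    """Estimate total input-buffer memory across active ffmpeg streams."""
    if active_streams < 0:
        raise ValueError("active_streams must be non-negative")
    if not 0 <= buffer_size <= MAX_BUFFER_BYTES:
        raise ValueError("buffer_size must be between 0 and 64 MiB")
    return active_streams * buffer_size


# Worked example from the text: 50 streams at the 64 MiB maximum.
worst_case_gib = aggregate_buffer_bytes(50, MAX_BUFFER_BYTES) / GIB
```

Here worst_case_gib evaluates to 3.125, matching the exact figure quoted above; a value of 0 contributes nothing because explicit buffer sizing is disabled.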
- GET /healthz returns {"status":"ok"} with HTTP 200.
- GET /metrics available when ENABLE_METRICS=true.
- Per-client IP rate limiting via token bucket:
  - configured by RATE_LIMIT_RPS, RATE_LIMIT_BURST, and RATE_LIMIT_MAX_CLIENTS
  - for reverse-proxy deployments, set RATE_LIMIT_TRUSTED_PROXIES so client identity is derived from trusted forwarded headers instead of collapsing all traffic to the proxy IP
  - Forwarded / X-Forwarded-For chains are interpreted from right to left; trusted proxy hops are stripped from the right and the first non-trusted hop is used as the limiter key
  - returns HTTP 429 when exceeded
  - stale client entries are incrementally evicted and limiter-map cardinality can be capped to avoid unbounded growth
  - /healthz is exempt
- Request timeout middleware:
  - controlled by REQUEST_TIMEOUT
  - applies to non-stream endpoints
  - /auto/... streaming is exempt
- Admin mutation JSON size hardening:
  - controlled by ADMIN_JSON_BODY_LIMIT_BYTES
  - applies to admin mutation endpoints that decode JSON bodies
  - oversized bodies are rejected with HTTP 413 Request Entity Too Large
- Channel tune startup-failure backoff:
  - controlled by TUNE_BACKOFF_MAX_TUNES, TUNE_BACKOFF_INTERVAL, and TUNE_BACKOFF_COOLDOWN
  - applies only to requests that would create a new shared session/source startup
  - counts startup-cycle outcomes only (startup leader path), so concurrent joiners sharing the same startup do not overcount failures or overclear successes
  - counts startup failures only; successful startups clear outstanding failure budget
  - scope is per channel, so one failing channel does not throttle unrelated channel startups
  - existing active shared sessions can continue accepting additional subscribers during cooldown
  - returns HTTP 503 with Retry-After while backoff is active
- Server read hardening:
  - ReadTimeout=30s
  - ReadHeaderTimeout=5s
  - IdleTimeout=120s
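The right-to-left X-Forwarded-For handling for the rate limiter key can be sketched as follows. This is illustrative Python, not the project's Go middleware; limiter_key is a hypothetical name, and two behaviors are assumptions: forwarded headers are honored only when the direct peer is itself trusted, and a chain made entirely of trusted hops falls back to the peer address.

```python
import ipaddress
from typing import Iterable


def limiter_key(remote_addr: str, forwarded_for: str, trusted: Iterable[str]) -> str:
    """Pick the client identity used to key the per-client token bucket."""
    networks = [ipaddress.ip_network(cidr, strict=False) for cidr in trusted]

    def is_trusted(addr: str) -> bool:
        try:
            ip = ipaddress.ip_address(addr)
        except ValueError:
            return False  # unparsable hops are never treated as trusted
        return any(ip in net for net in networks)

    # Assumption: ignore forwarded headers from untrusted direct peers.
    if not is_trusted(remote_addr):
        return remote_addr

    hops = [hop.strip() for hop in forwarded_for.split(",") if hop.strip()]
    # Walk from the right, skipping trusted proxies, and stop at the first
    # hop that is not trusted: that is the closest verifiable client.
    for hop in reversed(hops):
        if not is_trusted(hop):
            return hop
    return remote_addr
```

Reading from the right matters because entries to the left of the first untrusted hop are client-supplied and can be forged to evade or poison the limiter.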
Key info-level events emitted by the service:
- Stream/session lifecycle: shared session created, shared session reused, shared session ready, shared session subscriber connected, shared session subscriber disconnected, shared session canceled while idle, stream tune rejected, stream tune failed, stream subscriber started, stream subscriber ended, stream subscriber canceled, stream subscriber disconnected due to lag, stream subscriber ended with error.
- Recovery diagnostics include burst and pacing fields on shared session recovery triggered / shared session recovery cycle budget exhausted (recovery_burst_count, recovery_burst_budget_count, recovery_burst_pace_window, and recovery_trigger_logs_coalesced when repeated warnings are rate-limited).
- Tuner lifecycle: tuner lease acquired, tuner lease reused, tuner lease released, tuner probe preempted, tuner idle-client preempted.
- Admin mutation lifecycle: admin channel created, admin channel updated, admin channels reordered, admin channel deleted, admin source added, admin source updated, admin sources reordered, admin source deleted, admin source health cleared, admin all source health cleared, admin automation updated, admin automation timezone updated, admin automation schedule updated, admin automation settings updated, admin manual job run started, admin auto-prioritize cache cleared, admin dvr config updated, admin dvr schedule updated, admin dvr sync requested, admin dvr sync completed, admin dvr reverse-sync requested, admin dvr reverse-sync completed, admin channel dvr reverse-sync requested, admin channel dvr reverse-sync completed, admin channel dvr mapping updated.
- Playlist source lifecycle: admin playlist source created, admin playlist source updated, admin playlist source deleted.
- Multi-source sync lifecycle: playlist sync source refresh started, playlist sync source refresh succeeded, playlist sync source refresh failed (with playlist_source_id, playlist_source_name, playlist_source_key, and item_count fields).
- Dynamic channel immediate-sync lifecycle: admin dynamic channel immediate sync queued, admin dynamic channel immediate sync started, admin dynamic channel immediate sync completed, admin dynamic channel immediate sync canceled, admin dynamic channel immediate sync skipped stale run, admin dynamic channel immediate sync failed.
- Dynamic block materialization lifecycle: admin dynamic block sync queued, admin dynamic block immediate sync started, admin dynamic block immediate sync completed, admin dynamic block immediate sync canceled, admin dynamic block immediate sync failed, admin dynamic generated channels reordered.
- DVR lineup reload queue lifecycle: admin dvr lineup reload queued, admin dvr lineup reload started, admin dvr lineup reload completed, admin dvr lineup reload canceled, admin dvr lineup reload failed, admin dvr lineup reload follow-up queued.
- Jobs/scheduler/playlist lifecycle: job started, job finished, job panic recovered, job run persistence failed, scheduler loaded schedules, scheduler schedule updated, scheduler timezone updated, scheduled job started, scheduled job overlap deferred for catch-up, scheduled job overlap coalesced into existing deferred catch-up, scheduled deferred catch-up waiting, scheduled deferred catch-up started, playlist refresh started, playlist refresh finished.
- SQLite IOERR diagnostics lifecycle: sqlite_ioerr_diag_bundle (one-shot pragma/db-file snapshot) and sqlite_ioerr_trace_dump (rate-limited in-memory DB operation timeline).
- Discovery lifecycle (trace-level): discovery response sent, ssdp response sent.
- Recovery filler normalization (debug-level): shared session slate AV recovery filler profile normalized with original_resolution, normalized_resolution, and normalization_reason.
Key warn-level close-path events:
- closeWithTimeout worker slot release underflow: internal close worker-slot accounting invariant warning. Correlate close_release_underflow with close_timeouts, close_late_completions, and close_late_abandoned.
- closeWithTimeout suppression observed: close retry suppression under worker-budget pressure. Correlate suppression reason/counters with retry queue depth and close timeout churn.
- shared session slate AV close error: bounded close failure while shutting down recovery filler readers; inspect close_error_type and accompanying close_* counters. Cancellation-adjacent benign FontConfig shutdown signatures are intentionally downgraded to debug (close_error_type=non_timeout_benign_fontconfig_canceled) to reduce non-actionable WARN noise.
See docs/STREAMING.md bounded-close telemetry guidance for detailed triage and remediation.
Common correlation fields on these events include:
- channel_id, guide_number, guide_name
- tuner_id, source_id, source_item_key, playlist_source, playlist_source_id
- subscriber_id, client_addr, remote_addr
- run_id, job_name, triggered_by
- result, reason, duration
- Catalog, channels, source mappings, and persisted device identity are stored in SQLite at DB_PATH.
- Default path is ./hdhr-iptv.db.
- In Docker, default DB path is /data/hdhr-iptv.db; mount /data to persist.
- Runtime build metadata is persisted at startup in settings keys:
  - app.version
  - app.commit
  - app.build_time
Example query:

    sqlite3 /data/hdhr-iptv.db "SELECT key, value FROM settings WHERE key LIKE 'app.%' ORDER BY key;"

Use one of the two supported approaches:
- Cold backup (simplest and safest):
  - stop the service cleanly
  - copy the SQLite DB file to backup storage
  - start the service
- Online backup (service remains running):
  - use SQLite's online backup API (for example via sqlite3 CLI .backup)
  - example: sqlite3 /data/hdhr-iptv.db ".backup '/backups/hdhr-iptv-$(date +%Y%m%d-%H%M%S).db'"

Prefer the online backup API for live systems. Avoid plain file copies of a busy database unless you are using a filesystem snapshot mechanism that can capture all related SQLite files atomically.
- hdhriptv runs SQLite with WAL mode enabled in normal operation.
- With WAL mode, recently committed data may reside in hdhr-iptv.db-wal until checkpointed; copying only hdhr-iptv.db while writes continue can produce incomplete backups.
- The SQLite backup API reads a transaction-consistent view and safely captures WAL-backed state.
- If you must restore from filesystem-level copies, keep .db, .db-wal, and .db-shm files from the same snapshot together.
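Where the sqlite3 CLI is not installed, the same online backup API is reachable from Python's standard library through Connection.backup. The sketch below is one possible wrapper rather than a project-provided tool; online_backup is a hypothetical name and the paths passed to it are up to the operator.

```python
import sqlite3


def online_backup(src_path: str, dest_path: str) -> None:
    """Copy a live SQLite database using the online backup API.

    The backup reads a transaction-consistent view of the source, so
    content that currently lives only in the WAL file is included.
    """
    src = sqlite3.connect(src_path)
    dest = sqlite3.connect(dest_path)
    try:
        src.backup(dest)  # copies all pages in one pass by default
    finally:
        dest.close()
        src.close()
```

A call such as online_backup("/data/hdhr-iptv.db", "/backups/hdhr-iptv.db") produces a single self-contained file, so no -wal or -shm sidecars need to be carried along with the copy.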
- Stop the service.
- Restore the backup DB file to DB_PATH.
- Remove stale sidecars from previous runs if present (*.db-wal, *.db-shm) unless you intentionally restored matching sidecar files from the same backup set.
- Start the service.
- Run integrity checks:

    sqlite3 /data/hdhr-iptv.db "PRAGMA integrity_check;"
    sqlite3 /data/hdhr-iptv.db "PRAGMA foreign_key_check;"

integrity_check should return ok; foreign_key_check should return no rows.
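The two post-restore checks can be combined into one pass/fail helper for scripted restores. This Python sketch uses only the standard sqlite3 module; check_database is a hypothetical name, and treating any foreign_key_check row as a failure follows the expectation stated above.

```python
import sqlite3


def check_database(db_path: str) -> bool:
    """Return True when integrity_check reports ok and no FK rows exist."""
    conn = sqlite3.connect(db_path)
    try:
        integrity = [row[0] for row in conn.execute("PRAGMA integrity_check;")]
        fk_violations = conn.execute("PRAGMA foreign_key_check;").fetchall()
    finally:
        conn.close()
    # integrity_check yields the single row "ok" on a healthy database;
    # foreign_key_check yields one row per violating child row.
    return integrity == ["ok"] and not fk_violations
```

Running it against the restored file before starting the service gives an early signal that the backup set was incomplete or mismatched.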
- Choose recovery point objective (RPO) based on tolerance for channel/source metadata loss between backups.
- As a practical baseline, run backups at least twice as often as the playlist refresh schedule (for example, every 15m if playlist refresh runs every 30m).
- Increase frequency when performing bulk channel/source edits or automation rollouts that mutate mapping state rapidly.
Operational recommendation:
- If DB_PATH is persistent, identity remains stable across restarts automatically.
- Set explicit DEVICE_ID and DEVICE_AUTH when you need fixed identity across fresh databases, DB replacements/restores, or multiple deployments.
| Trigger | Scope | Endpoint |
|---|---|---|
| Scheduled cron | All enabled sources | Automation schedule (jobs.playlist_sync.cron) |
| Manual all-source | All enabled sources | POST /api/admin/jobs/playlist-sync/run |
| Manual per-source | Single source | POST /api/admin/jobs/playlist-sync/run?source_id=N |
| UI per-source button | Single source | Per-row "Sync Now" on /ui/automation |
| UI global button | All enabled sources | "Sync All" on /ui/automation |
| Startup one-shot | All enabled sources | Automatic on process start |
Per-source sync refreshes only the specified source's catalog items, runs source-scoped deactivation, then triggers reconciliation and conditional DVR lineup reload. Other sources' catalog state is unaffected.
When syncing multiple sources, each source is fetched independently:
| Outcome | Job Status | DVR Reload | Summary |
|---|---|---|---|
| All sources succeed | success | Triggers | playlist_sources attempted=N succeeded=N failed=0 |
| Some sources succeed | success (with warnings) | Triggers | playlist_sources attempted=N succeeded=M failed=K + failed source list |
| All sources fail | error | Skipped | playlist_sources attempted=N succeeded=0 failed=N |
A failing source does not deactivate catalog items from other sources. Source-scoped deactivation ensures that only items belonging to the currently-refreshing source are marked inactive when not seen in the latest fetch.
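The outcome table reduces to a small decision function over per-source results. The Python below is a sketch of that mapping only; SyncOutcome and summarize are hypothetical names, and the exact formatting of the failed source list in real job summaries is an assumption.

```python
from typing import List, NamedTuple


class SyncOutcome(NamedTuple):
    job_status: str
    dvr_reload: str
    summary: str


def summarize(attempted: int, failed_sources: List[str]) -> SyncOutcome:
    """Map per-source results onto job status and DVR reload behavior."""
    failed = len(failed_sources)
    if attempted < 1 or failed > attempted:
        raise ValueError("attempted must be >= 1 and >= number of failures")
    succeeded = attempted - failed
    summary = (
        f"playlist_sources attempted={attempted} "
        f"succeeded={succeeded} failed={failed}"
    )
    if succeeded == 0:
        return SyncOutcome("error", "skipped", summary)
    if failed == 0:
        return SyncOutcome("success", "triggers", summary)
    # Partial failure: the job still succeeds and the reload still runs.
    listed = ",".join(failed_sources)  # assumed list format
    return SyncOutcome("success (with warnings)", "triggers", f"{summary} failed_sources={listed}")
```

The key property is that one healthy source is enough to trigger the DVR reload, while a total failure leaves the lineup untouched.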
Multi-source support adds a playlist_source label dimension to
playlist sync metrics. Existing dashboard queries and alert rules that
reference these metrics must be updated to account for the new label:
| Metric | Type | Labels | Description |
|---|---|---|---|
| playlist_sync_source_duration_seconds | histogram | playlist_source | Per-source fetch+upsert duration |
| playlist_sync_source_errors_total | counter | playlist_source | Per-source fetch/parse/upsert errors |
| playlist_sync_source_items | gauge | playlist_source | Items processed per source in last sync |
| stream_virtual_tuner_utilization_ratio | gauge | playlist_source | Per-source tuner pool utilization (in-use / configured) |
Migration steps for Prometheus consumers:
- Update dashboard queries to aggregate across the new label where total-system metrics are needed: sum(rate(playlist_sync_source_errors_total[5m]))
- Add per-source breakdown panels for granular monitoring: sum by (playlist_source) (rate(playlist_sync_source_duration_seconds_sum[5m]))
- Update alert rules to either aggregate or filter by source as appropriate for the alert's intent.
Scheduled automation overlaps now use coalesced deferred catch-up with bounded backoff. The following metrics provide contention/freshness visibility:
| Metric | Type | Labels | Description |
|---|---|---|---|
| job_scheduler_events_total | counter | job_name, event | Scheduler start/skip/deferred lifecycle events (started, skipped_already_running, deferred_enqueued, deferred_started, etc.). |
| job_scheduler_deferred_pending | gauge | job_name | 1 when a deferred catch-up run is pending for the job. |
| job_scheduler_deferred_backoff_seconds | gauge | job_name | Current deferred retry backoff duration. |
| job_scheduler_deferred_age_seconds | gauge | job_name | Age of the pending deferred catch-up run. |
| job_scheduler_last_success_timestamp_seconds | gauge | job_name | Unix timestamp of the last successful schedule-triggered run. |
| job_scheduler_freshness_seconds | gauge | job_name | Current age since the last successful schedule-triggered run. |
Both client-facing discovery surfaces cap the advertised tuner count at
255:
- UDP discovery: TunerCount response tag is encoded as a single uint8 byte, so values above 255 are capped.
- /discover.json: TunerCount JSON field is capped at 255 for DVR client compatibility (Channels DVR, Plex, Jellyfin).
The real per-source and aggregate tuner totals are visible only in
/api/admin/tuners (the virtual_tuners array and summary fields).
When the internal sum of enabled source tuner counts exceeds 255, a startup log event is emitted indicating that discovery tuner-count capping is active. This does not affect stream capacity — virtual tuner pools enforce the real per-source limits internally.
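The discovery cap amounts to clamping the summed per-source tuner counts to what a single uint8 can carry. The Python sketch below illustrates that arithmetic; advertised_tuner_count is a hypothetical name and the input is assumed to contain only enabled sources.

```python
from typing import Iterable, Tuple

DISCOVERY_TUNER_CAP = 255  # largest value representable in one uint8 byte


def advertised_tuner_count(enabled_source_tuners: Iterable[int]) -> Tuple[int, bool]:
    """Return (advertised count, whether capping is active).

    The real total still governs stream capacity; only the number shown
    to discovery clients is clamped.
    """
    total = sum(enabled_source_tuners)
    return min(total, DISCOVERY_TUNER_CAP), total > DISCOVERY_TUNER_CAP
```

For two sources configured with 200 and 100 tuners, discovery would advertise 255 with capping active, while the aggregate of 300 remains visible in /api/admin/tuners.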
| Limitation | Details | Future Work |
|---|---|---|
| Shared cron schedule | All sources share one playlist_sync.cron schedule. Per-source scheduling is deferred. | Per-source cron expressions |
| Shared refresh worker pool | Source refresh parallelism is bounded globally per run (playlist_sync_source_concurrency, max 16), not per-source class/provider. | Adaptive concurrency by provider/source class |
| No hard source count limit | There is no enforced maximum number of playlist sources. Operators with many sources accept longer sync windows and should monitor per-source sync duration metrics. | Operational guidance only |
On first startup with multi-source code against a legacy database:
- Migration 007_playlist_sources.sql creates the playlist_sources table.
- ensurePlaylistSourcesSchema seeds the primary source (source_id=1) from legacy settings:
  - URL from persisted playlist.url setting (or PLAYLIST_URL env).
  - Tuner count from effective startup config (cfg.TunerCount).
  - source_key is set to the well-known constant "primary".
  - name defaults to "Primary".
- Existing playlist_items rows have playlist_source_id defaulting to 1, pointing to the auto-seeded primary source.
No manual migration is required — existing one-source databases start and stream without changes.
Legacy arguments continue to work and map to the primary source:
| Legacy Argument | Maps To |
|---|---|
| --playlist-url / PLAYLIST_URL | Primary source URL |
| --tuner-count / TUNER_COUNT | Primary source tuner count |
Additional sources are added via:
- --playlist-source "url=<url>,tuners=<count>[,name=<label>][,enabled=<bool>]" (repeatable)
- PLAYLIST_SOURCES environment variable (semicolon-separated entries)
If only legacy arguments are supplied, behavior remains single-source.
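The --playlist-source value format can be made concrete with a small parser. This Python sketch is a naive illustration, not the project's parser: the function names are hypothetical, the default for enabled and the accepted boolean spellings are assumptions, and URLs that themselves contain commas or semicolons would need handling this sketch does not provide.

```python
from typing import Dict, List


def parse_playlist_source(spec: str) -> Dict[str, object]:
    """Parse one url=...,tuners=...[,name=...][,enabled=...] entry."""
    fields: Dict[str, str] = {}
    for part in spec.split(","):
        key, sep, value = part.partition("=")  # split on the first '=' only
        if not sep:
            raise ValueError(f"expected key=value, got {part!r}")
        fields[key.strip()] = value.strip()
    if "url" not in fields or "tuners" not in fields:
        raise ValueError("url and tuners are required")
    return {
        "url": fields["url"],
        "tuners": int(fields["tuners"]),
        "name": fields.get("name", ""),
        # Assumed default: a source is enabled unless stated otherwise.
        "enabled": fields.get("enabled", "true").lower() in ("1", "true", "yes"),
    }


def parse_playlist_sources_env(value: str) -> List[Dict[str, object]]:
    """Parse a semicolon-separated PLAYLIST_SOURCES value."""
    return [parse_playlist_source(entry) for entry in value.split(";") if entry.strip()]
```

Splitting each pair on the first equals sign keeps query strings such as ?token=x intact inside the url field.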
GET /api/admin/automation returns both:
- playlist_url: alias to primary source URL (backward compatible)
- playlist_sources: full array of all sources (new canonical shape)
PUT /api/admin/automation accepts both:
- Payloads with playlist_url update the primary source URL.
- Payloads with playlist_sources update the full source list.
- Both can coexist in the same payload; playlist_sources takes precedence when present.
To revert from multi-source to single-source operation:
- Disable extra sources: Set enabled=0 on all non-primary sources via the API (PUT /api/admin/playlist-sources/{sourceID}) or automation UI. The primary source (source_id=1) cannot be deleted.
- Remove CLI arguments: Stop passing --playlist-source or PLAYLIST_SOURCES. Legacy --playlist-url and --tuner-count continue to control the primary source.
- Run a sync: Trigger POST /api/admin/jobs/playlist-sync/run to reconcile the catalog with only the primary source active.
The playlist_sources table and non-primary source catalog items remain
in the database but are inert when disabled. No destructive schema
rollback is needed.
If a full database rollback is required (downgrading to pre-multi-source code), restore a pre-migration database backup. The migration is additive — it does not modify existing tables — so the original schema is preserved in backups taken before the migration ran.
- Confirm UDP 65001 is open between client and server.
- Verify service is running and listening.
- Validate client and server are on reachable subnets/VLANs.
- Try manual endpoint: http://<server-ip>:5004/discover.json.
- lineup.json is published-channel only.
- Publish at least one channel in /ui/catalog or via /api/channels.
- Confirm playlist refresh succeeded and catalog has items.
- HTTP 503: all tuners are in use. In multi-source deployments, check per-source pool utilization in /api/admin/tuners (virtual_tuners array); a single source pool may be exhausted while others have capacity. Increase the bottleneck source's tuner_count or reduce concurrent playback on that source.
- HTTP 502: upstream URL unavailable or ffmpeg process failed.
- In ffmpeg modes, validate FFMPEG_PATH and local ffmpeg installation.
- For analyzer/profile-probe failures, validate FFPROBE_PATH (or --ffprobe-path) and confirm the selected value in startup logs (ffprobe_path).
- For ffmpeg-copy startup-timeout errors, increase STARTUP_TIMEOUT and/or tune startup detection (FFMPEG_STARTUP_PROBESIZE_BYTES, FFMPEG_STARTUP_ANALYZEDURATION) so ffmpeg emits initial bytes before failover deadline.
- If initial startup frequently times out on random-access gating but recovery continuity still needs strict cutover behavior, keep STARTUP_RANDOM_ACCESS_RECOVERY_ONLY=true (the default) so random-access enforcement applies only during recovery cycles.
- Startup expects both video and audio components. Startup is accepted only when inventory reaches video_audio; video_only, audio_only, undetected, and unknown startup component states are all treated as startup failures. Inspect diagnostics in logs and /api/admin/tuners (source_startup_component_state, source_startup_video_streams, source_startup_audio_streams). For random-access startup, compare source_startup_probe_raw_bytes vs source_startup_probe_trimmed_bytes and monitor source_startup_probe_cutover_offset / source_startup_probe_dropped_bytes to see how much pre-IDR data is discarded before stream handoff.
- High random-access cutover (>=75% dropped from startup probe) emits shared session startup probe cutover warning log events with bounded coalescing metadata (source_startup_probe_cutover_warn_logs_coalesced) to reduce repetitive log spam under rapid recovery churn.
- The runtime automatically retries once with relaxed startup probe/analyze settings (source_startup_retry_relaxed_probe=true) when startup detection initially appears component-incomplete. Startup is accepted only when inventory reaches video_audio; if the relaxed retry fails or still reports incomplete inventory, startup fails and failover continues.
- If DVR logs show disconnects after no data for several seconds, reduce BUFFER_PUBLISH_FLUSH_INTERVAL, confirm producer pacing (PRODUCER_READRATE=1), and confirm recovery keepalive remains enabled (RECOVERY_FILLER_ENABLED=true, default). For picky clients, try RECOVERY_FILLER_MODE=slate_av (decodable filler) or RECOVERY_FILLER_MODE=psi before null.
- For ffmpeg source pacing, start with PRODUCER_READRATE=1 and PRODUCER_READRATE_CATCHUP=1.15 to 1.5. Increase catch-up gradually when post-stall lag persists. Keep PRODUCER_READRATE_CATCHUP >= PRODUCER_READRATE; avoid very high catch-up rates unless needed because aggressive catch-up can amplify downstream burst pressure.
- For RECOVERY_FILLER_MODE=slate_av, odd source resolutions (for example 853x480) are normalized to even dimensions for libx264 / yuv420p encoder safety. Use debug logs to confirm normalization and watch /api/admin/tuners keepalive fallback counters if the session still degrades to psi / null.
- If recovery resumes cleanly but playback is far behind live edge, inspect /api/admin/tuners keepalive telemetry:
  - sustained high recovery_keepalive_rate_bytes_per_second relative to recovery_keepalive_expected_rate_bytes_per_second,
  - recovery_keepalive_realtime_multiplier significantly above 1.0 when profile bitrate is known,
  - non-zero recovery_keepalive_guardrail_count indicating safety fallback was required.
- Install Windows ffmpeg/ffprobe builds from either:
- Point `FFMPEG_PATH`/`FFPROBE_PATH` (or `--ffmpeg-path`/`--ffprobe-path`) to the executable files, not the `bin` directory.
  - Correct: `C:\Users\<you>\ffmpeg\bin\ffmpeg.exe`
  - Incorrect: `C:\Users\<you>\ffmpeg\bin`
- If logs include `exec: "...\\ffmpeg\\bin": executable file not found in %PATH%`, the configured path is a directory and must be changed to `ffmpeg.exe`.
- Common startup pattern:

      .\hdhriptv.exe `
        --playlist-url https://example.com/playlist `
        --ffmpeg-path C:\users\person\ffmpeg\bin\ffmpeg.exe `
        --ffprobe-path C:\users\person\ffmpeg\bin\ffprobe.exe `
        --http-addr-legacy :80 `
        --friendly-name "HDHRIPTV Windows"

When playlist sync fails with SQLite IOERR codes (for example `SQLITE_IOERR_WRITE` / extended code 778), capture diagnostics before restarting the process.
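SQLite extended result codes keep the primary code in the low 8 bits and the extension in the higher bits, which is how 778 maps to an IOERR write failure. The helper below is a small illustration; `splitExtendedCode` is a hypothetical name and is not part of hdhriptv.

```go
package main

import "fmt"

// splitExtendedCode splits a SQLite extended result code into its primary
// code (low 8 bits) and the extension stored in the higher bits.
func splitExtendedCode(extended int) (primary, ext int) {
	return extended & 0xff, extended >> 8
}

func main() {
	primary, ext := splitExtendedCode(778)
	// 778 = 10 | (3 << 8): primary 10 is SQLITE_IOERR, extension 3 is the write variant.
	fmt.Printf("primary=%d ext=%d\n", primary, ext)
}
```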
- Keep the process running and preserve current logs.
- Collect the first `sqlite_ioerr_diag_bundle` event for the failing run:
  - includes `phase`, `item_index`/`item_total`, `run_id`, sqlite code/name, `db.Stats()`, runtime pragmas, and DB/WAL/SHM file metadata.
- If trace ring diagnostics are enabled, collect `sqlite_ioerr_trace_dump`:
  - provides the bounded pre-failure DB operation timeline (`op`, `phase`, duration, error classification/code).
- Correlate timestamps with host and storage telemetry (kernel, filesystem, volume/backend metrics) before restart.
- Restart only after capture is complete.
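One way to pull the first diagnostic bundle out of a saved log file is a plain substring scan, sketched below. It assumes the event name appears verbatim somewhere in the key=value log line; the helper names and the sample log lines are illustrative and not taken from real hdhriptv output.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// firstEvent returns the first line in r that contains event, together with
// its 1-based line number. ok is false when nothing matches. Scanner errors
// are not surfaced in this sketch.
func firstEvent(r io.Reader, event string) (line string, lineNo int, ok bool) {
	sc := bufio.NewScanner(r)
	// Diagnostic bundles can be long, so raise the per-line limit to 1 MiB.
	sc.Buffer(make([]byte, 0, 64*1024), 1024*1024)
	n := 0
	for sc.Scan() {
		n++
		if strings.Contains(sc.Text(), event) {
			return sc.Text(), n, true
		}
	}
	return "", 0, false
}

// firstEventInText is a convenience wrapper for in-memory log text.
func firstEventInText(text, event string) (string, int, bool) {
	return firstEvent(strings.NewReader(text), event)
}

func main() {
	// Made-up log lines; real attribute names and layout may differ.
	logs := "level=INFO msg=\"playlist refresh started\"\n" +
		"level=ERROR msg=sqlite_ioerr_diag_bundle phase=upsert run_id=42\n" +
		"level=ERROR msg=sqlite_ioerr_diag_bundle phase=commit run_id=42\n"
	if line, n, ok := firstEventInText(logs, "sqlite_ioerr_diag_bundle"); ok {
		fmt.Printf("line %d: %s\n", n, line)
	}
}
```

To scan a file on disk, pass the result of `os.Open` to `firstEvent` instead of using the text wrapper.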
Rapid rollback switches (if diagnostic verbosity needs to be reduced immediately):
- Disable trace ring: `HDHRIPTV_SQLITE_IOERR_TRACE_ENABLED=false`.
- Reduce trace dump volume: lower `HDHRIPTV_SQLITE_IOERR_TRACE_DUMP_LIMIT` and/or increase `HDHRIPTV_SQLITE_IOERR_TRACE_DUMP_INTERVAL`.
- Disable optional checkpoint probing: `HDHRIPTV_SQLITE_IOERR_CHECKPOINT_PROBE=false`.
- Verify `dynamic_rule.enabled=true` and a non-empty `dynamic_rule.search_query`.
- Confirm target catalog rows are active and match both optional `group_name` and `search_query` (same behavior as `GET /api/items` filtering, including `-term`/`!term` exclusion support).
- Check logs for dynamic sync lifecycle events (`queued`, `started`, `completed`, `failed`, `canceled`, `skipped stale run`) and correlate by `channel_id`.
- `canceled` events include `cancel_reason` (`superseded`, `disabled_or_deleted`, `state_removed`, or `canceled`) to distinguish expected preemption from unexpected failures.
- If you disable a rule or delete a channel while sync is running, cancellation is expected and stale results are intentionally discarded.
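When a rule matches fewer rows than expected, it can help to reason about the `-term`/`!term` exclusion syntax with a toy matcher. The sketch below is not the `GET /api/items` implementation: it assumes whitespace-separated terms, case-insensitive substring matching, and that every non-excluded term must be present. `matchesQuery` is a hypothetical name.

```go
package main

import (
	"fmt"
	"strings"
)

// matchesQuery reports whether name satisfies query. Terms prefixed with "-"
// or "!" must be absent; every other term must be present (assumed semantics).
func matchesQuery(name, query string) bool {
	lower := strings.ToLower(name)
	for _, term := range strings.Fields(strings.ToLower(query)) {
		exclude := strings.HasPrefix(term, "-") || strings.HasPrefix(term, "!")
		term = strings.TrimLeft(term, "-!")
		if term == "" {
			continue // a bare "-" or "!" carries no term
		}
		found := strings.Contains(lower, term)
		if exclude && found {
			return false
		}
		if !exclude && !found {
			return false
		}
	}
	return true
}

func main() {
	for _, name := range []string{"ESPN HD", "ESPN SD", "Fox News"} {
		fmt.Printf("%-8s matches %q: %t\n", name, "espn -sd", matchesQuery(name, "espn -sd"))
	}
}
```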
- Invalid cron updates return HTTP `400` from automation or DVR config endpoints when a schedule is enabled.
- Disabling a schedule does not require cron validation; you can disable first and fix cron later.
- Scheduler timezone updates require a valid IANA timezone string; blank values return HTTP `400`.
- Windows builds embed IANA tzdata (`time/tzdata`) so valid zones like `America/Chicago` resolve without external zoneinfo installation.
- If persisted timezone is invalid at load time, scheduler falls back to `UTC` and logs a warning (`invalid scheduler timezone; falling back to UTC`).
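The timezone rules above (strict on update, lenient on load) can be sketched with the standard library. This is an illustration under assumptions, not the scheduler's code: `validateTimezone` and `loadPersistedTimezone` are hypothetical names. One detail worth knowing is that `time.LoadLocation("")` returns UTC without an error, so a blank value has to be rejected explicitly.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"time"
	_ "time/tzdata" // embeds the IANA database so lookups work without system zoneinfo
)

var errBlankTimezone = errors.New("timezone must not be blank")

// validateTimezone is the strict path used for updates. Blank input is
// rejected up front because time.LoadLocation("") silently yields UTC.
func validateTimezone(name string) (*time.Location, error) {
	if strings.TrimSpace(name) == "" {
		return nil, errBlankTimezone
	}
	return time.LoadLocation(name)
}

// loadPersistedTimezone is the lenient load-time path: an invalid stored
// value falls back to UTC, and fellBack tells the caller to log a warning.
func loadPersistedTimezone(name string) (*time.Location, bool) {
	loc, err := validateTimezone(name)
	if err != nil {
		return time.UTC, true
	}
	return loc, false
}

func main() {
	for _, name := range []string{"America/Chicago", "Not/AZone", ""} {
		loc, fellBack := loadPersistedTimezone(name)
		fmt.Printf("%q -> %s (fallback=%t)\n", name, loc, fellBack)
	}
}
```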
- `PUT /api/admin/automation` snapshots prior scheduler-related settings before applying updates and performs apply/rollback work under a detached `30s` timeout budget. If `Scheduler.LoadFromSettings(...)` fails, the handler restores the prior values and returns HTTP `500`.
- `PUT /api/admin/dvr` updates persisted DVR config first, then applies scheduler changes for `dvr_lineup_sync`. If schedule apply fails, the previous DVR config is restored before the error response is sent.
- If rollback itself fails, the error response includes both the apply failure and rollback failure details; immediately re-read config state from `GET /api/admin/automation` and `GET /api/admin/dvr` before retrying writes.
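The apply/rollback pattern described above can be sketched as follows. This is a minimal illustration, not the handler's code: `applyWithRollback` and `demo` are hypothetical names, and the sketch assumes Go 1.21+ for `context.WithoutCancel` and a single budget shared by apply and rollback.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// applyWithRollback runs apply under a timeout detached from the request
// context and calls restore if apply fails. When restore also fails, the
// returned error carries both failures.
func applyWithRollback(reqCtx context.Context, budget time.Duration,
	apply, restore func(context.Context) error) error {
	// WithoutCancel keeps request-scoped values but ignores client disconnects.
	ctx, cancel := context.WithTimeout(context.WithoutCancel(reqCtx), budget)
	defer cancel()

	applyErr := apply(ctx)
	if applyErr == nil {
		return nil
	}
	if rollbackErr := restore(ctx); rollbackErr != nil {
		return errors.Join(
			fmt.Errorf("apply failed: %w", applyErr),
			fmt.Errorf("rollback failed: %w", rollbackErr),
		)
	}
	return fmt.Errorf("apply failed (previous settings restored): %w", applyErr)
}

// demo simulates one update against an in-memory setting and returns the
// final state plus the error a handler would report.
func demo(applyFails, restoreFails bool) (string, error) {
	state := "old"
	apply := func(context.Context) error {
		state = "new"
		if applyFails {
			return errors.New("scheduler load failed")
		}
		return nil
	}
	restore := func(context.Context) error {
		if restoreFails {
			return errors.New("settings write failed")
		}
		state = "old"
		return nil
	}
	err := applyWithRollback(context.Background(), 30*time.Second, apply, restore)
	return state, err
}

func main() {
	for _, c := range [][2]bool{{false, false}, {true, false}, {true, true}} {
		state, err := demo(c[0], c[1])
		fmt.Printf("state=%s err=%v\n", state, err)
	}
}
```

The third case leaves `state` at the new value, which is why re-reading config state before retrying writes matters when rollback fails.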
- HTTP `409` when triggering a run means that job (or another job under global job lock) is already running.
- Inspect run state with `GET /api/admin/jobs/{runID}` and check `status`, `progress_cur`, `progress_max`, `summary`, and `error`.
- `status=error` with empty progress usually indicates early validation/config errors (for example missing playlist URL or provider access failure).
- `status=canceled` is expected on shutdown or canceled request contexts.
- If a run appears stalled in `running`, correlate with log events: `job started`, `job finished`, `playlist refresh*`, `admin dvr sync*`, and upstream/provider error messages.
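The triage rules above can be applied mechanically to a saved `GET /api/admin/jobs/{runID}` response. The sketch below assumes the response is a flat JSON object with the listed field names, numeric progress fields, and string `summary`/`error` fields; the struct, the `triage` function, and its category labels are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// jobRun holds the fields named above. The JSON types are assumptions.
type jobRun struct {
	Status      string `json:"status"`
	ProgressCur int    `json:"progress_cur"`
	ProgressMax int    `json:"progress_max"`
	Summary     string `json:"summary"`
	Error       string `json:"error"`
}

// triage maps a raw job-run payload to a coarse category label.
func triage(raw string) (string, error) {
	var run jobRun
	if err := json.Unmarshal([]byte(raw), &run); err != nil {
		return "", fmt.Errorf("decode job run: %w", err)
	}
	switch {
	case run.Status == "error" && run.ProgressCur == 0 && run.ProgressMax == 0:
		return "early-config-error", nil // check playlist URL and provider access
	case run.Status == "error":
		return "failed-mid-run", nil
	case run.Status == "canceled":
		return "expected-cancel", nil // shutdown or canceled request context
	case run.Status == "running":
		return "in-progress", nil // correlate with job started/finished log events
	default:
		return "other", nil
	}
}

func main() {
	payload := `{"status":"error","progress_cur":0,"progress_max":0,"error":"playlist URL is required"}`
	category, err := triage(payload)
	fmt.Println(category, err)
}
```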
- If lineup entries expose `station_ref` without `lineup_channel`, sync can still succeed using station-ref-only mapping.
- Station-ref-only behavior is surfaced as warnings in sync responses and should not be treated as automatic failure.
- Forward sync unresolved counts (`unresolved_count`) usually indicate missing tuner-number matches or lineup station lookup mismatches.
- Reverse sync missing counts (`missing_tuner_count`, `missing_mapping_count`, `missing_station_ref_count`) identify which side lacks matching metadata.
- `/ui/channels` uses paged bulk mapping fetch (`GET /api/channels/dvr` with `limit`/`offset`) to reduce per-row request fan-out and avoid one unbounded mapping payload.
- If DVR mappings render empty after refresh, check for HTTP `429` responses and adjust `RATE_LIMIT_RPS`/`RATE_LIMIT_BURST` for your admin workload.
- Confirm the DVR backend is reachable with `POST /api/admin/dvr/test` before treating blank mappings as UI-only issues.
- Use browser devtools network logs to confirm mapping payload content and response codes when diagnosing render gaps.
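For scripted checks outside the browser, the same `limit`/`offset` paging idea can be sketched as a loop. The response shape of `GET /api/channels/dvr` is not described here, so the sketch abstracts each request behind a hypothetical `fetchPage` callback and assumes that a page shorter than `limit` marks the end of the data.

```go
package main

import "fmt"

// fetchPage is a hypothetical stand-in for one request with limit and offset.
type fetchPage func(limit, offset int) ([]string, error)

// fetchAll walks pages until a short page is returned (assumed end marker).
func fetchAll(fetch fetchPage, limit int) ([]string, error) {
	if limit <= 0 {
		return nil, fmt.Errorf("limit must be positive, got %d", limit)
	}
	var all []string
	for offset := 0; ; offset += limit {
		page, err := fetch(limit, offset)
		if err != nil {
			// A 429 would surface here; back off before retrying.
			return all, fmt.Errorf("page at offset %d: %w", offset, err)
		}
		all = append(all, page...)
		if len(page) < limit {
			return all, nil
		}
	}
}

// fakeBackend serves total synthetic rows so the loop can run offline.
func fakeBackend(total int) fetchPage {
	return func(limit, offset int) ([]string, error) {
		var page []string
		for i := offset; i < offset+limit && i < total; i++ {
			page = append(page, fmt.Sprintf("channel-%d", i))
		}
		return page, nil
	}
}

func main() {
	rows, err := fetchAll(fakeBackend(5), 2)
	fmt.Println(len(rows), err)
}
```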
- If all users behind a reverse proxy appear to share one limiter bucket (frequent cross-user `429`), set `RATE_LIMIT_TRUSTED_PROXIES` (or `--rate-limit-trusted-proxies`) to the proxy CIDR/IP.
- Include only proxy hops you operate and trust to sanitize forwarded headers.
- Header precedence for trusted peers is `Forwarded`, then `X-Forwarded-For`, then `X-Real-IP`.
- `Forwarded` and `X-Forwarded-For` values are resolved right-to-left by peeling trusted hops from the right; malformed chains or all-trusted chains fall back to `RemoteAddr`.
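The right-to-left peeling rule can be illustrated for `X-Forwarded-For` alone. The sketch below is not hdhriptv's limiter code: `clientIP` and `resolve` are hypothetical names, `Forwarded` and `X-Real-IP` are not handled, and it assumes that any unparsable hop makes the whole chain malformed.

```go
package main

import (
	"fmt"
	"net/netip"
	"strings"
)

// isTrusted reports whether addr falls inside any trusted prefix.
func isTrusted(addr netip.Addr, trusted []netip.Prefix) bool {
	for _, p := range trusted {
		if p.Contains(addr.Unmap()) {
			return true
		}
	}
	return false
}

// clientIP resolves the limiter key from the peer address and an
// X-Forwarded-For value. Untrusted peers are never allowed to override it.
func clientIP(peer netip.Addr, xff string, trusted []netip.Prefix) netip.Addr {
	if !isTrusted(peer, trusted) || strings.TrimSpace(xff) == "" {
		return peer
	}
	parts := strings.Split(xff, ",")
	hops := make([]netip.Addr, 0, len(parts))
	for _, part := range parts {
		hop, err := netip.ParseAddr(strings.TrimSpace(part))
		if err != nil {
			return peer // malformed chain: fall back to the peer address
		}
		hops = append(hops, hop)
	}
	// Peel trusted hops from the right; the first untrusted hop is the client.
	for i := len(hops) - 1; i >= 0; i-- {
		if !isTrusted(hops[i], trusted) {
			return hops[i]
		}
	}
	return peer // all-trusted chain: fall back to the peer address
}

// resolve is a string-based wrapper for quick experiments.
func resolve(peer, xff string, trustedCIDRs ...string) string {
	trusted := make([]netip.Prefix, 0, len(trustedCIDRs))
	for _, c := range trustedCIDRs {
		trusted = append(trusted, netip.MustParsePrefix(c))
	}
	return clientIP(netip.MustParseAddr(peer), xff, trusted).String()
}

func main() {
	fmt.Println(resolve("10.0.0.5", "203.0.113.7, 10.0.0.9", "10.0.0.0/8"))
	fmt.Println(resolve("198.51.100.1", "203.0.113.7", "10.0.0.0/8"))
}
```

In the first call the peer and the rightmost hop are both inside the trusted range, so the client resolves to `203.0.113.7`; in the second the peer itself is untrusted, so the header is ignored.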
- HTTP `401`: credentials are missing or wrong.
- HTTP `500` on all admin routes: `ADMIN_AUTH` format is invalid; use `user:pass`.
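The difference between the two status codes can be sketched as a parse step followed by a comparison step. This is an illustration only: `parseAdminAuth`, `matches`, and `statusFor` are hypothetical names, and rejecting an empty user or password is an assumption rather than documented behavior.

```go
package main

import (
	"crypto/subtle"
	"errors"
	"fmt"
	"net/http"
	"strings"
)

type credentials struct{ user, pass string }

// parseAdminAuth splits a user:pass value at the first colon.
func parseAdminAuth(raw string) (credentials, error) {
	user, pass, ok := strings.Cut(raw, ":")
	if !ok || user == "" || pass == "" {
		return credentials{}, errors.New("ADMIN_AUTH must be in user:pass format")
	}
	return credentials{user: user, pass: pass}, nil
}

// matches compares supplied credentials without short-circuiting on content.
func (c credentials) matches(user, pass string) bool {
	u := subtle.ConstantTimeCompare([]byte(user), []byte(c.user))
	p := subtle.ConstantTimeCompare([]byte(pass), []byte(c.pass))
	return u&p == 1
}

// statusFor maps a config value plus supplied credentials to the status
// codes described above: a bad config yields 500, bad credentials yield 401.
func statusFor(adminAuth, user, pass string) int {
	creds, err := parseAdminAuth(adminAuth)
	if err != nil {
		return http.StatusInternalServerError
	}
	if !creds.matches(user, pass) {
		return http.StatusUnauthorized
	}
	return http.StatusOK
}

func main() {
	fmt.Println(statusFor("admin:secret", "admin", "secret"))
	fmt.Println(statusFor("admin:secret", "admin", "wrong"))
	fmt.Println(statusFor("nocolon", "admin", "secret"))
}
```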
- Enable legacy listener for compatibility:
  - set `HTTP_ADDR_LEGACY=:80`
  - expose/open TCP `80`