Skip to content

Investigate and fix data wipeout in some Engagement tables and usage of FINAL keyword for replicated ClickHouse server #35272

@jcastro-dotcms

Description

@jcastro-dotcms

Problem Statement

Two related defects affect the Content Analytics pipeline in environments where ClickHouse replication is enabled:

1 — session_facts data is wiped on every refresh

The session_facts_rmv Refreshable Materialized View is configured with REWRITE mode, which truncates and rewrites the session_facts target table on every scheduled refresh cycle. This destroys all historical session data, making it impossible to retain engagement history beyond the most recent refresh window. The expected behavior is that session records accumulate over time without being purged.

2 — ReplacingMergeTree tables may return duplicated/stale rows

Several aggregate tables (session_facts, engagement_daily, sessions_by_device_daily, sessions_by_browser_daily, sessions_by_language_daily) use the ReplacingMergeTree engine. In ClickHouse deployments with replication enabled, reads against these tables are not guaranteed to be deduplicated unless the FINAL keyword is present in the query. The current CubeJS schema files query these tables without FINAL, which can lead to metrics surfacing duplicate or stale rows in replicated environments.

Impact: Loss of session history data and potentially incorrect engagement metrics in any ClickHouse-replicated deployment.


Steps to Reproduce

Issue 1 — Data wipe:

  1. Start the analytics stack with ClickHouse replication enabled.
  2. Ingest events so that session_facts accumulates rows.
  3. Wait for session_facts_rmv to refresh (every 30 s in the dev config).
  4. Query session_facts — only rows produced in the most recent refresh window are present; all prior history is gone.

Issue 2 — Duplicate rows:

  1. Start the analytics stack with ClickHouse replication enabled.
  2. Ingest events and allow engagement_daily (or any *_daily table) to populate.
  3. Query the table directly without FINAL — observe that duplicate versions of the same row may be returned depending on merge state.
  4. Add FINAL to the same query — observe that only the latest version is returned.

Acceptance Criteria

  • session_facts_rmv Refreshable Materialized View is updated from REWRITE to APPEND mode
  • Existing session data in session_facts is preserved across all subsequent refresh cycles — no rows are removed on refresh
  • New session records are correctly appended to session_facts without side-effects on existing rows
  • An investigation is performed to determine whether FINAL is required for correct deduplication when reading from the following ReplacingMergeTree tables in replicated ClickHouse environments: session_facts, engagement_daily, sessions_by_device_daily, sessions_by_browser_daily, sessions_by_language_daily
  • Either FINAL is added to all affected SQL queries, or the table engine / insert strategy is updated so FINAL is not required — whichever is the more appropriate fix
  • Content Analytics dashboard metrics return correct, deduplicated data in environments with ClickHouse replication enabled
  • The chosen deduplication approach is documented inline in the affected schema and/or DDL files

dotCMS Version

Any environment running the analytics stack with ClickHouse replication enabled.

Severity

High - Major functionality broken


Links

NA

Metadata

Metadata

Assignees

No one assigned

    Type

    Projects

    Status

    Done

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions