Feat/restore observability panel #591

Open
Niks988 wants to merge 9 commits into main from feat/restore-observability-panel

Niks988 commented May 20, 2026

Summary

Restores the Observability page that was lost during the v10 merge, adds comprehensive operation tracking for all DKG event types, and includes a critical database migration fix for pre-existing nodes.

What's included

Observability UI (restored & enhanced)

  • Full Observability page with dedicated "All Operations" and "Hardware" tabs
  • Expandable bottom panel with live logs (ANSI-stripped), level filtering, and gossip noise filtering
  • Operation list with inline phase visualization (MiniGantt with labeled pills)
  • Hardware live stat cards (CPU, RAM, Heap, Disk, Peers, RPC Latency) + time-series charts
  • Operation success rate charts and per-type time series
  • Click-to-expand detail panel with Phase Timeline waterfall and correlated logs

Operation Tracking (daemon)

  • Tracks all DKG event types: sync (PROJECT_SYNCED), gossip (GOSSIP_MESSAGE), verify (KC_CONFIRMED), ka-update (KA_UPDATED)
  • Existing tracking preserved: query (with parse/execute phases), publish, connect, share (with validate/store phases)
  • All operation types now appear in the type filter dropdown

Database Migration (critical fix)

  • Schema version 14: auto-renames legacy paranet_count → contextGraph_count and paranet_id → contextGraph_id
  • Fixes silent INSERT failures on nodes that were created before the v10 terminology rename
  • Column existence checked via pragma table_info before rename (safe for fresh DBs)
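
A minimal sketch of the existence check described above, not the PR's actual migration code. It assumes the caller has already read the column names for a table via `PRAGMA table_info(<table>)`; the helper name `planColumnRenames` and the table name `operations` in the usage note are hypothetical.

```typescript
// Legacy -> current column names, as listed in the PR description.
const LEGACY_RENAMES: ReadonlyArray<readonly [string, string]> = [
  ["paranet_count", "contextGraph_count"],
  ["paranet_id", "contextGraph_id"],
];

// Hypothetical helper: given the column names reported by
// `PRAGMA table_info(<table>)`, return the ALTER statements to run.
function planColumnRenames(table: string, existingColumns: readonly string[]): string[] {
  const present = new Set(existingColumns);
  const statements: string[] = [];
  for (const [oldName, newName] of LEGACY_RENAMES) {
    // Rename only when the legacy column exists and the new one does not,
    // so fresh and already-migrated databases produce no statements.
    if (present.has(oldName) && !present.has(newName)) {
      statements.push(`ALTER TABLE ${table} RENAME COLUMN ${oldName} TO ${newName}`);
    }
  }
  return statements;
}
```

For a pre-rename table, `planColumnRenames("operations", ["id", "paranet_id"])` yields a single `RENAME COLUMN` statement; for a fresh table it yields none, which is what makes the migration safe to run unconditionally.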

Quality-of-life fixes

  • Metrics collection interval reduced from 120s to 30s for more responsive hardware display
  • Auto page reload on 401 (stale auth token after node restart)
  • Header shows the node name instead of ** when the agent identity has a placeholder name
  • Log level select properly styled (no native browser chrome)

Test plan

  • Fresh node: verify schema creates with contextGraph_* columns, Observability page loads correctly
  • Existing node (pre-rename DB): verify migration renames columns, metrics and operations appear in UI
  • Run queries via CLI → confirm parse + execute phases visible in Operations tab
  • Join a context graph → confirm sync operations appear
  • Verify hardware stats populate within 30s of node start
  • Restart node → confirm UI auto-reloads and continues working (no permanent 401 loop)
  • Check header displays configured node name, not **

Niks988 and others added 9 commits May 19, 2026 14:46
- Header: add Observability button (pulse icon) that opens the existing
  OperationsPage (All Operations / Performance / Logs / Errors tabs).
  Entry point was orphaned after the v10 UI rewrite; page itself and its
  PanelCenter wiring were already intact.

- PanelBottom: replace fixed 200px bottom panel with a fully resizable,
  draggable panel (vertical drag handle on top edge, persisted height via
  layout store). Adds a maximise/restore toggle (80vh overlay). Height
  is stored alongside leftWidth/rightWidth and survives reloads.

- Transactions tab: wired from existing /api/operations (publish/update
  ops that reached the chain phase) — on-chain activity without any new
  backend routes. Expandable rows show tx hash, peer, phase waterfall.

- Gossip tab: live-filtered view of the node log showing only libp2p /
  gossipsub / peer / SWM lines. Keyword list is broad enough to catch
  relay, DHT and protocol events.

- Node Log tab: adds level filter (error/warn/info/debug), pause button,
  auto-scroll that respects manual scroll position.

Note: all three tabs are currently backed by the local SQLite-backed
/api/* endpoints. Once the OTEL telemetry stack is live these tabs will
be replaced by Tempo trace / Loki log streams at the fleet level.

Co-authored-by: Cursor <cursoragent@cursor.com>
OperationName in packages/core/src/logger.ts now covers:
  publish, publishFromSWM, update, ka-update, query, resolve, connect,
  sync, share, gossip, reconstruct, verify, init, system

Operations.tsx OP_TYPE_COLORS and OP_TYPE_DESCRIPTIONS were missing
the seven new ones (share, publishFromSWM, ka-update, reconstruct,
verify, init, resolve). All added with distinct colours and descriptions.

PanelBottom Transactions tab TX_OP_TYPES expanded from {publish, update}
to {publish, publishFromSWM, update, ka-update, reconstruct} — all op
types that can reach the chain phase and submit a tx.
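
The filter this implies can be sketched as follows. The set contents come from the commit message; the row shape (`operation_name`, `phases[].phase`) and the phase name `"chain"` are assumptions made for illustration, not the actual /api/operations schema.

```typescript
// Op types listed in the commit message as able to submit a tx.
const TX_OP_TYPES: ReadonlySet<string> = new Set([
  "publish", "publishFromSWM", "update", "ka-update", "reconstruct",
]);

// Assumed row shape; real field names may differ.
interface OpWithPhases {
  operation_name: string;
  phases: { phase: string }[];
}

// An operation counts as a transaction when its type can submit a tx
// and it actually recorded a chain phase.
function isTransaction(op: OpWithPhases): boolean {
  return TX_OP_TYPES.has(op.operation_name) && op.phases.some((p) => p.phase === "chain");
}
```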

Co-authored-by: Cursor <cursoragent@cursor.com>
- Strip ANSI escape sequences from all log lines (Node Log + Gossip)
  so raw color codes no longer appear as literal text
- Remove pause/play button from Node Log toolbar (simplified to filter
  + level select only)
- Extract useAutoScroll() hook shared by Node Log and Gossip
- Gossip tab now shows only real libp2p events (Connection opened/closed,
  ProtocolRouter timings, Circuit relay, GossipSub, FinalizationHandler)
  instead of leaking general DKGAgent structured log lines
- CSS: remove browser focus outlines on tab/toggle buttons, fix toolbar
  flex layout, add v10-log-level-select class, use pre-wrap + word-break
  on log lines so long lines wrap instead of overflowing, active tab now
  uses accent-blue underline

Co-authored-by: Cursor <cursoragent@cursor.com>
The v10 UI rewrite left Operations.tsx without corresponding stylesheet
definitions for its legacy v9 class names. Added:

  tab-group / tab-item     — horizontal tab bar with accent-blue underline
  input / select.input     — themed form controls with custom select arrow
  data-table               — striped/hoverable table with uppercase headers
  badge / badge-{success,error,warn,info} — coloured type/status pills
  empty-state / --compact / --rich — centred placeholder layouts
  page-section / page-title — page wrapper and heading
  card-title               — section heading inside cards
  filters                  — flex filter bar
  phase-bar-wrap/seg       — inline phase progress bars
  tx-link-icon             — subtle tx hash link styling

Co-authored-by: Cursor <cursoragent@cursor.com>
…ph_*

Databases created before the v10 terminology rename still have
`paranet_count` and `paranet_id` columns. INSERT statements targeting
the new `contextGraph_*` names fail silently, preventing metrics and
operations from being stored.

Adds schema version 14 which detects the old column names via
pragma table_info and renames them in-place.

Co-authored-by: Cursor <cursoragent@cursor.com>
Adds OperationTracker instrumentation for DKG events that were
previously untracked: PROJECT_SYNCED (sync), GOSSIP_MESSAGE (gossip),
KC_CONFIRMED (verify), and KA_UPDATED (ka-update).

These are event-based records (work completed in the core before
the event fires), so they capture occurrence + metadata rather than
multi-phase timing.

Co-authored-by: Cursor <cursoragent@cursor.com>
- Reduce SNAPSHOT_INTERVAL_MS from 120s to 30s for more responsive
  hardware metrics in the Observability panel.
- Handle 401 responses in useFetch by triggering a page reload so
  the server re-injects a fresh auth token after node restarts.
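
A sketch of the 401 handling, with one addition that is not stated in the commit: a minimum interval between reloads, so a node that keeps answering 401 cannot trap the page in a reload loop. All names here (`createUnauthorizedHandler`, `ReloadStamp`) are hypothetical. In the browser, `reload` would be `() => window.location.reload()` and the stamp would need to live somewhere that survives a reload, such as `sessionStorage`.

```typescript
// Minimal response shape so the sketch does not depend on DOM typings.
interface StatusResponse { status: number }

// Where the time of the last reload is kept. Must outlive a page reload
// in real use; an in-memory object is enough for testing.
interface ReloadStamp {
  read(): number | null;
  write(timestampMs: number): void;
}

// Returns a handler that reports whether it triggered a reload.
function createUnauthorizedHandler(
  reload: () => void,
  stamp: ReloadStamp,
  minIntervalMs: number = 5_000,
  now: () => number = Date.now,
): (res: StatusResponse) => boolean {
  return (res) => {
    if (res.status !== 401) return false;
    const last = stamp.read();
    const t = now();
    // Skip the reload if one happened very recently (loop guard).
    if (last !== null && t - last < minIntervalMs) return false;
    stamp.write(t);
    reload();
    return true;
  };
}
```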

Co-authored-by: Cursor <cursoragent@cursor.com>
- Split the old "Performance" tab into dedicated "All Operations" and
  "Hardware" tabs. Operations tab shows operation stats charts + the
  full operations list; Hardware tab shows live stat cards and
  time-series graphs.
- Redesign MiniGantt component to display phase name pills with
  colored dots and durations inline (no hover required).
- Show "event-based" label for operations without phases instead of
  a bare dash.
- Fix header showing "**" instead of node name when agent identity
  has a placeholder name.

Co-authored-by: Cursor <cursoragent@cursor.com>
// Track sync completions
agent.eventBus.on(DKGEvent.PROJECT_SYNCED, (data: any) => {
  try {
    const ctx = createOperationContext("sync");

🔴 Bug: PROJECT_SYNCED is emitted only after catch-up has already finished, so creating a fresh sync context here and completing it immediately records every sync as ~0 ms. That will skew the new latency/success views instead of reflecting the real sync cost. Start/finish the tracked operation from the actual sync entrypoint, or store this as a separate event type rather than an Operation.
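
The first option (timing from the sync entrypoint) could look roughly like the sketch below. `MiniTracker` and `trackedSync` are stand-ins invented for illustration; they do not mirror the real OperationTracker API, and `doSync` represents whatever function actually performs the catch-up.

```typescript
interface TrackedOp {
  type: string;
  startedAt: number;
  durationMs?: number;
  success?: boolean;
}

// Stand-in tracker with an injectable clock (hypothetical, for illustration).
class MiniTracker {
  readonly ops: TrackedOp[] = [];
  private readonly now: () => number;

  constructor(now: () => number = Date.now) {
    this.now = now;
  }

  start(type: string): TrackedOp {
    const op: TrackedOp = { type, startedAt: this.now() };
    this.ops.push(op);
    return op;
  }

  finish(op: TrackedOp, success: boolean): void {
    op.durationMs = this.now() - op.startedAt;
    op.success = success;
  }
}

// Start the operation before the sync work begins and finish it afterwards,
// so the recorded duration covers the real catch-up time.
async function trackedSync<T>(tracker: MiniTracker, doSync: () => Promise<T>): Promise<T> {
  const op = tracker.start("sync");
  try {
    const result = await doSync();
    tracker.finish(op, true);
    return result;
  } catch (err) {
    tracker.finish(op, false); // failed syncs are recorded too
    throw err;
  }
}
```

With this shape the PROJECT_SYNCED listener no longer needs to create an operation at all; it could attach metadata to the one already in flight.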

});

// Track gossip messages
agent.eventBus.on(DKGEvent.GOSSIP_MESSAGE, (data: any) => {

🟡 Issue: GOSSIP_MESSAGE fires for every inbound GossipSub payload. Writing each one as a full operation row will grow operations at network-traffic rate and swamp the observability views on busy nodes. Consider aggregating/sampling gossip activity into metrics instead of tracker.start/complete per message.
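
A sketch of the aggregation idea: count messages in memory and write one summary per interval instead of one operation row per message. `GossipCounter` is a hypothetical name, and keying by a `topic` string is an assumption about the event payload.

```typescript
// Accumulates per-topic gossip counts between flushes.
class GossipCounter {
  private counts = new Map<string, number>();

  // Call once per GOSSIP_MESSAGE event; cheap, no database write.
  record(topic: string): void {
    this.counts.set(topic, (this.counts.get(topic) ?? 0) + 1);
  }

  // Returns the totals accumulated since the last flush and resets them.
  flush(): Record<string, number> {
    const snapshot: Record<string, number> = {};
    this.counts.forEach((count, topic) => {
      snapshot[topic] = count;
    });
    this.counts.clear();
    return snapshot;
  }
}
```

Calling `flush()` on the same timer as the metrics snapshot (30s in this PR) would bound storage growth by the number of topics rather than by network traffic.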

const ctx = createOperationContext("ka-update");
tracker.start(ctx, {
  contextGraphId: data.contextGraphId,
  details: { kaUri: data.kaUri },

🔴 Bug: the current KA_UPDATED emitters publish fields like ual, batchId, and rootEntities rather than kaUri, so these new ka-update rows will always lose the asset identifier you're trying to surface. Record data.ual here or normalize the event payload before sending it to the tracker.
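
A small sketch of the normalization option. The payload type is an assumption based only on the field names mentioned in this comment (`ual`, `kaUri`); the helper name `assetIdOf` is hypothetical.

```typescript
// Assumed subset of the KA_UPDATED payload; other fields are omitted.
interface KaUpdatedPayload {
  contextGraphId?: string;
  ual?: string;
  kaUri?: string;
}

// Prefer `ual` (what current emitters send), fall back to `kaUri`,
// and return null when neither is present.
function assetIdOf(data: KaUpdatedPayload): string | null {
  return data.ual ?? data.kaUri ?? null;
}
```

The listener could then pass `details: { kaUri: assetIdOf(data) }` so the stored key stays stable while the value comes from whichever field the emitter provides.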

const [expanded, setExpanded] = useState<string | null>(null);

const load = useCallback(() => {
  fetchOperationsWithPhases({ limit: '100', periodMs: String(6 * 60 * 60_000) })

🟡 Issue: this calls fetchOperationsWithPhases from ui/api.ts directly instead of going through api-wrapper like the rest of the shell. In mock/offline mode the Transactions tab will fail while the other bottom-panel tabs still fall back cleanly. Route this through api.fetchOperationsWithPhases(...) for consistent behavior.

<span><b>ID:</b> <span style={{ fontFamily: 'var(--font-mono)' }}>{op.operation_id}</span></span>
{txHash && <span><b>Tx:</b> <span style={{ fontFamily: 'var(--font-mono)' }}>{txHash}</span></span>}
{op.peer_id && <span><b>Peer:</b> <span style={{ fontFamily: 'var(--font-mono)' }}>{shortId(op.peer_id)}</span></span>}
{op.error && <span style={{ color: '#ef4444' }}><b>Error:</b> {op.error}</span>}

🔴 Bug: /api/operations exposes failures as error_message, not error, so failed transactions in this new panel will render without any reason. Read op.error_message here (or normalize the API response shape before rendering).
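
If normalizing is preferred over changing the JSX, a sketch might look like this. The row type lists only the fields visible in the snippet above plus `error_message`; `withDisplayError` is a hypothetical helper.

```typescript
// Assumed subset of an /api/operations row.
interface OperationRow {
  operation_id: string;
  peer_id?: string | null;
  error_message?: string | null;
}

// Adds an `error` field for rendering, copied from `error_message`.
function withDisplayError<T extends OperationRow>(op: T): T & { error: string | null } {
  return { ...op, error: op.error_message ?? null };
}
```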
