…_median_witness_props
…chedule when emergency mode is active
…psis matching diagnostics
…indows
- Track logical block and index file sizes separately from mapped_file.size()
- Add methods to sync and verify logical sizes against actual mapped sizes
- Heal stale mappings by closing and reopening files when size mismatches are detected
- Increment resize count on each resize operation for diagnostics
- Return dlt_block_log head by value to ensure thread safety
- Expose verify_mapping() and resize_count() for external monitoring
- Integrate periodic mapping verification and healing into the P2P plugin statistics task
- Improve block read and append logic to use logical sizes and assert correctness
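The logical-size bookkeeping described above can be sketched in isolation. This is a minimal model, not the real dlt_block_log: `mapped_log`, `append()`, and the heal callback are illustrative stand-ins, and the real code closes and reopens the memory-mapped file where this sketch just invokes `heal()`.

```cpp
#include <cstdint>

// Pure-logic sketch of separate logical-size tracking with a
// verify_mapping() heal step. Names are illustrative assumptions.
struct mapped_log {
    uint64_t logical_size = 0;  // bytes we believe we have written
    uint64_t mapped_size  = 0;  // what mapped_file.size() reports
    uint32_t resize_count = 0;  // incremented per resize, for diagnostics

    void append(uint64_t n) {
        logical_size += n;
        if (logical_size > mapped_size) {   // grow the mapping
            mapped_size = logical_size;
            ++resize_count;
        }
    }

    // Compare the actual mapped size against the logical size.
    // Returns true if the mapping was stale and had to be healed
    // (the real implementation closes and reopens the file).
    template <typename HealFn>
    bool verify_mapping(uint64_t actual_mapped, HealFn heal) {
        mapped_size = actual_mapped;
        if (mapped_size < logical_size) {
            heal();                          // close/reopen in real code
            mapped_size = logical_size;
            return true;
        }
        return false;
    }
};
```

A periodic task (every 5 minutes in the commit above) would call `verify_mapping()` with the live `mapped_file.size()` value.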
…ling
- Add detailed per-peer and failed-peer stats logging when p2p-stats-enabled is true
- Include block storage diagnostics in P2P stats output with ranges and resize counts
- Implement verify_mapping() to detect and self-heal stale memory-mapped file states
- Track logical file sizes independently of mapped_file.size() to avoid stale size issues
- Run verify_mapping() automatically every 5 minutes from the P2P stats task to maintain integrity
- Document minority fork auto-recovery process triggered by witness plugin resync_from_lib()
…ontinuity
- Detect gap between dlt_block_log end and fork_db start on update_lib
- Reset dlt_block_log to earliest available fork_db block after gap detection
- Append blocks from fork_db to dlt_block_log after reset for recovery
- Add dlt_block_log::verify_continuity() to identify missing or unreadable blocks
- Log warnings for detected gaps and coverage issues during P2P stats reporting
- Improve dlt_block_log integrity checks by combining mapping verification and full scan
- Introduce ANSI color codes for enhanced P2P log message clarity
…recovery
- Add verify_continuity() method for gap detection and integrity verification
- Implement sophisticated gap detection and automatic DLT block log reset
- Integrate signal-based snapshot plugin triggering after DLT reset
- Enhance diagnostic system with verify_mapping() and resize_count() for monitoring
- Fix Windows memory-mapped file size drift with separate logical size tracking
- Improve read/append logic using logical sizes for correctness and reliability
- Add automatic healing and periodic mapping verification mechanisms
- Strengthen gap warning suppression to prevent redundant logging
- Update reset() method to ensure safe log clearing with comprehensive cleanup
- Integrate gap monitoring and recovery in P2P synchronization and stats tasks
- Implement diagnostics to log block storage state at startup
- Report head block, last irreversible block, and earliest block available
- Log dlt_block_log range and block_log end block number
- Display fork_db head, linked and unlinked block ranges and counts
- Detect and log gaps between dlt_block_log end and fork_db start
- Perform full integrity scan of dlt_block_log continuity
- Log any detected gaps, or confirm integrity if none found
- Add check for peers being ahead in DLT mode by verifying all synopsis entries are above the local head block number
- Return an empty result if the peer is ahead, indicating no fork but no blocks to send
- Retain existing logging and exception throw for unreachable fork cases
- Improve log messages with detailed peer synopsis and head block info
- Add tracking of connected peer IPs for cross-referencing with peer_db entries
- Log additional peer metrics including bytes sent, connection direction, user agent, fork revision, head block info, firewall status, and timestamps for connection events
- Improve p2p peer log output with expanded details and clearer formatting
- Cross-reference failed peer_db entries against currently connected IPs
- Annotate peer_db logs when a "failed" peer is actually connected at the moment
…d Windows support
- Add verify_continuity(), verify_mapping(), and resize_count() methods for diagnostics
- Implement intelligent gap detection and automatic recovery with block log reset
- Integrate periodic integrity scanning into the P2P stats task for DLT mode nodes
- Add signal-based snapshot plugin integration for fresh snapshot creation
- Enhance gap logging with automatic warning suppression via _dlt_gap_logged state
- Improve Windows compatibility by tracking logical file size and healing stale mappings
- Introduce advanced peer ahead-of-us detection to prevent unnecessary sync attempts
- Provide comprehensive startup diagnostics before synchronization begins
- Use ANSI color codes for enhanced console readability and comprehensive logging
- Strengthen DLT integrity verification and coverage gap monitoring in the P2P plugin
- Introduce a disable threshold setting to auto-disable a witness after it produces N consecutive blocks from the same node (0 disables the feature)
- Track per-witness consecutive block counts and auto-disable witnesses exceeding the threshold by setting their signing key to null on-chain
- Prevent auto-restoration of witnesses that have been auto-disabled, requiring manual intervention or a restart to re-enable
- Implement send_witness_disable function to create and broadcast witness_update transactions that disable witnesses safely
- Add detailed logging for auto-disable and disable-transaction broadcast events
- Reset consecutive block counters when a different witness produces a block
- Add configuration options and documentation entries for the new auto-disable feature
- Update witness_guard plugin lifecycle to initialize and apply the new threshold and auto-disable logic during runtime block handling
- Change peer connection time to be returned as seconds since epoch
- Update conntime field assignment to use the sec_since_epoch() method
- Ensure consistency with other timestamp fields in the peer details output
…itnesses
- Add `witness-guard-disable` config option to set the threshold for auto-disabling a monitored witness after it produces consecutive blocks
- Implement per-witness consecutive block counters, reset when a different witness produces a block
- Broadcast `witness_update_operation` with a null signing key to disable a witness on reaching the threshold, marking the witness as auto-disabled
- Suppress auto-restore for auto-disabled witnesses until the operator manually restores the signing key
- Clear the auto-disabled flag when a non-null signing key is detected on-chain
- Update block handler to track and act on consecutive block counts
- Add detailed documentation for the new feature, safety guards, and operator guidance
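The consecutive-block counter and auto-disable latch described in these two commits can be modeled compactly. The names here (`WitnessGuard`, `on_block`) are assumptions for illustration; the real plugin also broadcasts the `witness_update_operation` transaction, which is out of scope for this sketch.

```cpp
#include <map>
#include <set>
#include <string>

// Illustrative model of the per-witness consecutive-block counter
// with threshold-based auto-disable. Not the actual plugin API.
class WitnessGuard {
public:
    explicit WitnessGuard(int threshold) : _threshold(threshold) {}

    // Called once per produced block. Returns true exactly when the
    // witness crosses the threshold and should be auto-disabled
    // (signing key nulled on-chain).
    bool on_block(const std::string& producer) {
        if (producer != _last_producer) {
            _consecutive = 0;                 // different witness: reset
            _last_producer = producer;
        }
        ++_consecutive;
        if (_threshold > 0 && _consecutive >= _threshold
            && !_auto_disabled.count(producer)) {
            _auto_disabled.insert(producer);  // latch: suppresses auto-restore
            return true;
        }
        return false;
    }

    bool is_auto_disabled(const std::string& w) const {
        return _auto_disabled.count(w) != 0;
    }

private:
    int _threshold;                  // 0 disables the feature
    int _consecutive = 0;
    std::string _last_producer;
    std::set<std::string> _auto_disabled;
};
```

Clearing the latch when a non-null signing key reappears on-chain (the manual-restore path above) would be a small extension of `is_auto_disabled`.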
…tion system
- Add detailed peer metrics logging including connection, bytes sent, user agents, and firewall status
- Implement peer database cross-referencing to track failed/rejected peers and connection states
- Introduce automatic peer soft-banning for sync spam and gap-related errors
- Enhance console logging with ANSI color codes for better readability
- Add multi-layer gap detection monitoring with automatic recovery mechanisms
- Improve DLT mode support with integrity verification and compatibility enhancements
- Update and reorganize documentation to reflect new diagnostics and recovery functionality
- Skip vote penalties and total_missed count during emergency offline witness scenario
- Blank signing key if a witness's missed blocks exceed the emergency threshold
- Emit log message when a key is blanked due to consecutive missed blocks
- Push shutdown virtual operation to remove forked witnesses from the schedule
- Define emergency max missed blocks constant for quicker network recovery
- Add detection for stale snapshots older than the DLT block log start block on plugin startup
- Log a warning and schedule creation of a fresh snapshot on the first synced block if a stale snapshot is found
- Implement urgent fresh snapshot creation on the first synced block to prevent serving broken snapshots
- Defer urgent snapshot creation if a witness is scheduled soon to avoid conflicts
- Integrate stale snapshot handling into applied_block signal logic for timely snapshot updates
- Implement sync_spam_strikes counter for repeated sync requests on competing forks
- Configure 50-strike threshold triggering automatic 5-minute soft-ban enforcement
- Integrate sync spam prevention with existing soft-ban infrastructure and peer management
- Add automatic reset of the strike counter upon soft-ban expiration
- Prevent resource exhaustion by mitigating the sync ping-pong loop
- Enhance peer connection logging with sync-spam status reporting and closing-reason tracking
…T mode
- Detect at startup if the latest snapshot block is older than the dlt_block_log start block
- Log a warning and set a flag to create an urgent fresh snapshot on the first fully-synced block
- Schedule async snapshot creation with witness-aware deferral to avoid unsyncable gaps
- Integrate stale snapshot detection into the applied_block lifecycle for seamless recovery
- Update documentation, plugin descriptions, and the repowiki with detailed stale snapshot info
- Ensure serving nodes maintain syncable snapshots to prevent P2P sync failures in DLT mode
…anism
- Introduce sync_spam_strikes counter for tracking repeated sync requests of competing forks
- Implement fork_rejected_until timestamp for precise soft-ban timing control
- Add configurable 50-strike threshold triggering a 5-minute soft-ban duration on sync spam
- Integrate sync spam prevention into peer_connection, node, and peer_database components
- Enhance existing intelligent enforcement and soft-ban infrastructure with spam prevention logic
- Improve network stability against resource exhaustion attacks from malicious peers
- Enable automatic reset of strike counters after soft-ban enforcement to avoid permanent bans
- Extend detailed peer disconnect logging and diagnostics to include sync spam events
- Update documentation and diagrams to reflect the new sync spam prevention system
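A minimal model of the strike counter and soft-ban timing described above, using the constants from the commit (50 strikes, 5-minute ban). The class shape and the epoch-seconds clock parameter are illustrative assumptions, not the actual peer_connection fields.

```cpp
#include <cstdint>

// Sketch of sync_spam_strikes + fork_rejected_until interplay.
// The clock is injected as epoch seconds for testability.
struct SyncSpamGuard {
    static constexpr int     STRIKE_THRESHOLD = 50;
    static constexpr int64_t BAN_SECONDS      = 300;  // 5-minute soft ban

    int     sync_spam_strikes   = 0;
    int64_t fork_rejected_until = 0;                  // epoch seconds, 0 = none

    bool is_banned(int64_t now) const { return now < fork_rejected_until; }

    // Called on each repeated sync request for a competing fork.
    // Returns true exactly when this strike triggers a new soft ban.
    bool record_strike(int64_t now) {
        if (is_banned(now))
            return false;                             // already soft-banned
        if (fork_rejected_until != 0 && now >= fork_rejected_until) {
            sync_spam_strikes = 0;                    // ban expired: reset,
            fork_rejected_until = 0;                  // avoiding permanent bans
        }
        if (++sync_spam_strikes >= STRIKE_THRESHOLD) {
            fork_rejected_until = now + BAN_SECONDS;
            return true;
        }
        return false;
    }
};
```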
…snapshot plugin replaces the entire database state and resets forkdb, but never notifies the P2P layer. The P2P peers keep their old stale sync state and never send a new synopsis to request blocks. This causes:
- Seed node stops receiving blocks
- 2-minute stall timeout fires
- Snapshot re-download cycle repeats
When the snapshot plugin creates a snapshot, P2P block processing is paused and incoming blocks from peers are silently dropped. After the pause ends, resume_block_processing() detects the gap and requests missing blocks asynchronously — but the witness production loop (250ms tick) could fire before those blocks arrive. For the emergency master, which bypasses all sync checks, this meant producing a block on a stale head that conflicted with the blocks about to arrive from peers. Other nodes saw this as a fork.

Add a _catchup_after_pause flag that is set unconditionally when resume_block_processing() runs (peer_head_num may be stale after the pause), and cleared either by transition_to_forward() when the gap is filled, or by periodic_task() when no gap actually existed. Also send a proactive hello to all peers in resume_block_processing() to refresh their head info as quickly as possible.
When the snapshot plugin creates a snapshot, P2P block processing is
paused. Previously, incoming blocks from peers were silently dropped,
requiring a gap fill after resume — and the emergency master could
produce a block on a stale head before the gap was detected.
Now block-carrying messages (block_reply, block_range_reply,
gap_fill_reply) are deserialized and pushed into _paused_block_queue
during the pause. Hello and fork_status messages are still processed
normally to keep peer_head_num up to date.
When resume_block_processing() is called, it posts
drain_paused_block_queue() to the P2P thread, which sorts queued
blocks by block_num and applies each via accept_block(). The
_catchup_after_pause flag blocks witness production until the drain
completes and no peer is ahead.
Files changed:
- dlt_p2p_node.hpp: add _paused_block_queue, drain_paused_block_queue()
- dlt_p2p_node.cpp: queue blocks in on_message(), drain in resume,
periodic fallback drain
- p2p_plugin.hpp/cpp: expose is_catching_up_after_pause()
- witness.cpp: defer production while catchup flag is set
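The drain step described above (sort the queued blocks by block_num, then apply each in order) can be sketched as follows; `block_t` and the `accept_block` callback are placeholders for the real block type and the node's accept path:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Placeholder for the real block type; only block_num matters here.
struct block_t { uint32_t block_num; };

// Sort queued blocks by height and apply each via the supplied
// accept_block callback. Returns the number of blocks accepted
// (accept_block may reject duplicates or unlinkable blocks).
template <typename ApplyFn>
size_t drain_paused_block_queue(std::vector<block_t>& queue,
                                ApplyFn accept_block)
{
    std::sort(queue.begin(), queue.end(),
              [](const block_t& a, const block_t& b) {
                  return a.block_num < b.block_num;
              });
    size_t applied = 0;
    for (const auto& b : queue)
        if (accept_block(b))
            ++applied;
    queue.clear();          // queue is emptied whether or not all applied
    return applied;
}
```

In the commit above this runs on the P2P thread, and the _catchup_after_pause gate holds witness production until the drain completes and no peer is ahead.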
When the snapshot plugin creates a snapshot, P2P block processing is
paused and the snapshot thread holds a strong DB read lock for 30-120s.
Two bugs:
1. Write lock deadlock: the emergency master's production loop (250ms
tick) bypasses all sync checks and calls generate_block() →
push_block() → write lock, which deadlocks behind the read lock,
producing 11+ second write lock timeouts (readers=0, waiter
spinning).
2. Fork on stale head: blocks arriving during the pause were silently
dropped. After resume, the emergency master produced a block on a
stale head before gap-fill could deliver the real blocks.
Fix:
- Block queue: block-carrying messages (block_reply, block_range_reply,
gap_fill_reply) are deserialized and pushed into _paused_block_queue
during the pause. Hello/fork_status are still processed to keep
peer_head_num up to date.
- Queue drain: resume_block_processing() posts drain_paused_block_queue()
to the P2P thread, which sorts by block_num and applies each block.
- Production gate: is_catching_up_after_pause() returns true when EITHER
_block_processing_paused OR _catchup_after_pause is set. This prevents
generate_block() during the pause (no write lock deadlock) AND during
post-pause catchup (no stale-head fork).
Files changed:
- dlt_p2p_node.hpp: add _paused_block_queue, drain_paused_block_queue(),
update is_catching_up_after_pause() to check _block_processing_paused
- dlt_p2p_node.cpp: queue blocks in on_message(), drain in resume,
periodic fallback drain
- p2p_plugin.hpp/cpp: expose is_catching_up_after_pause()
- witness.cpp: defer production while gate is active
…erleaving

send_message() had two bugs causing remote peers to see corrupted dlt_block_reply_message payloads:
1. writesome() return value was ignored — partial TCP writes silently dropped the remaining bytes, sending truncated messages (e.g. 123 bytes where 500KB were intended).
2. Two separate writesome() calls (header, data) each yield the fiber via async .wait(). During the yield, another fiber could write to the same socket, interleaving bytes and producing garbage that fails deserialization at the receiver (e.g. a huge varint in the witness field).

Fix: coalesce header + payload into a single buffer, loop writesome() until all bytes are written, and add a per-peer send guard that drops concurrent messages rather than corrupting the stream.
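The fix follows a standard pattern: coalesce the header and payload into one buffer, then loop the partial-write call until every byte is on the wire. A sketch under those assumptions — `writesome` is modeled as a callback returning the number of bytes actually written, standing in for the real fc socket API:

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Coalesce header + payload into one buffer (so a fiber yield between
// two writes can never interleave another message mid-stream), then
// retry partial writes until the whole buffer is sent.
template <typename WriteSomeFn>
void send_all(const std::vector<char>& header,
              const std::vector<char>& payload,
              WriteSomeFn writesome)
{
    std::vector<char> buf;
    buf.reserve(header.size() + payload.size());
    buf.insert(buf.end(), header.begin(), header.end());
    buf.insert(buf.end(), payload.begin(), payload.end());

    size_t written = 0;
    while (written < buf.size()) {              // partial-write retry loop
        size_t n = writesome(buf.data() + written, buf.size() - written);
        if (n == 0)                             // stalled connection:
            throw std::runtime_error("socket write stalled");  // disconnect
        written += n;
    }
}
```

The zero-byte check mirrors the later drain_send_queue fix: a stalled socket raises instead of spinning forever.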
send_message() previously dropped messages when a send was already in progress for the same peer. If blocks 10, 11, 12 were dispatched in rapid succession, 11 and 12 were silently discarded, breaking sync.

Replace the drop-on-contention guard with a per-peer send queue (dlt_peer_state.send_queue). When a fiber is already writing to a peer's socket, new messages are enqueued. The active writer drains the queue after each successful write, preserving message order.

Key details:
- Single-buffer serialization (header + payload coalesced) prevents fiber interleaving mid-message during partial writes
- Partial-write retry loop ensures every byte reaches the wire
- Queue capped at 100 messages per peer (configurable constant)
- Dropped-message counter tracked in stats and logged periodically
- Queue cleared on disconnect; send guard cleaned on close
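The queue semantics above can be sketched as a small per-peer state machine. This is an illustrative model, not the actual dlt_peer_state definition; the 100-message cap matches the commit, everything else is an assumption.

```cpp
#include <deque>
#include <string>

// Per-peer send state: a busy flag plus a capped FIFO queue.
// Messages submitted while a write is in flight are queued in order;
// overflow is counted (for periodic stats logs) rather than corrupting
// or silently losing track of the stream.
struct peer_send_state {
    static constexpr size_t SEND_QUEUE_MAX = 100;

    bool sending = false;               // a fiber is writing to the socket
    std::deque<std::string> send_queue;
    size_t dropped = 0;

    // Returns true if the caller should start writing msg immediately;
    // false means the message was queued (or dropped on overflow).
    bool submit(const std::string& msg) {
        if (!sending) { sending = true; return true; }
        if (send_queue.size() >= SEND_QUEUE_MAX) { ++dropped; return false; }
        send_queue.push_back(msg);
        return false;
    }

    // Called by the active writer after each successful write.
    // Returns true and fills `out` with the next message, or releases
    // the send guard when the queue is empty.
    bool next(std::string& out) {
        if (send_queue.empty()) { sending = false; return false; }
        out = send_queue.front();
        send_queue.pop_front();
        return true;
    }
};
```

On disconnect the real code clears `send_queue` and resets the guard, matching the last bullet above.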
…sconnected

When all peers are disconnected/banned (e.g. after snapshot pause), check_sync_catchup() falsely reported "caught up" with zero active peers, causing SYNC→FORWARD→SYNC oscillation every 30s. Forward stagnation also oscillated uselessly because no peers existed to request blocks from.

Add isolation detection with a 60s grace period, after which all peer backoffs are reset to initial values and soft bans are cleared, forcing immediate reconnection attempts. This replaces the oscillating state machine with a single recovery action.
The snapshot TCP server's accept_loop had no sleep in its catch handlers for fc::exception, std::exception, and catch-all. When the process hit the file descriptor limit (EMFILE / "Too many open files"), the loop would spin at full speed, spamming error logs every millisecond and burning CPU for hours.

Add fc::usleep(fc::seconds(1)) in all three error paths, matching the pattern already used in the DLT P2P accept_loop.
… gaps

Previously request_gap_fill() was gated to FORWARD mode only, causing nodes stuck in SYNC with a gap to never request missing blocks. The gap silently grew as broadcast blocks arrived and were stored in fork_db as unlinkable. Additionally, gaps larger than GAP_FILL_MAX_BLOCKS (100) were silently skipped instead of being served in chunks. With a gap of 800+ blocks, the node had no recovery path at all.

Changes:
- Remove the FORWARD-only guard so gap fill works in both SYNC and FORWARD
- Include SYNCING lifecycle peers in the candidate search (the best peer in SYNC mode is typically in SYNCING state)
- For large gaps, request the first GAP_FILL_MAX_BLOCKS blocks instead of returning; subsequent chunks are requested after the current one completes or times out
- Also trigger gap fill from out-of-order blocks in SYNC mode, not just FORWARD
…mpeting forks

When a witness node produces a block that the network rejects (competing fork at the same height), gap fill requests starting from our_head+1 cannot link any blocks because their parent (the network's version of our_head) is unknown to our chain.

By including our_head in the gap fill request (the same P49 logic already used in request_blocks_from_peer), the peer returns its version of our head block. If it is the same as ours, accept_block returns ALREADY_KNOWN with no side effects. If different, fork_db stores it and can link the subsequent blocks, resolving the competing fork.

For the reported case: head=79740489, gap fill now requests 79740489-79740500 instead of 79740490-79740500. The network's #79740489 links to #79740490+, allowing the chain to advance.
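The range arithmetic is small enough to show directly. `gap_fill_range` is a hypothetical helper combining this change (start at our_head, not our_head+1) with the chunking behavior from the previous commit (GAP_FILL_MAX_BLOCKS = 100):

```cpp
#include <cstdint>
#include <utility>

// Compute the [start, end] block range for a gap-fill request.
// start is our_head (not our_head + 1) so the peer returns its
// version of our head block, anchoring a competing fork; large
// gaps are clamped to the first chunk of max_blocks.
std::pair<uint32_t, uint32_t>
gap_fill_range(uint32_t our_head, uint32_t target,
               uint32_t max_blocks = 100)
{
    uint32_t start = our_head;                 // was our_head + 1
    uint32_t end   = target;
    if (end - start + 1 > max_blocks)          // large gap: request the
        end = start + max_blocks - 1;          // first chunk only
    return {start, end};
}
```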
Two root causes for the immediate SYNC→FORWARD round-trip:
1. transition_to_sync() did not reset _last_block_received_time, so sync_stagnation_check() inherited a stale timestamp and fired on the very next tick (~5s later).
2. check_forward_stagnation() transitioned to SYNC even when no peer had a higher block number. SYNC has nothing to offer in that case — check_sync_catchup() immediately returned to FORWARD, completing the oscillation loop.

Fix (1): reset _last_block_received_time to now() on SYNC entry.
Fix (2): check for peers ahead before transitioning; if none, reset the stagnation timer and stay in FORWARD.
… fork_db

When syncing from LIB, the sync starting block's parent is on the main chain but absent from fork_db (which only tracks blocks near head). The dead-fork detection in _push_block() checked only fork_db for the parent, causing it to incorrectly reject legitimate fork blocks and soft-ban the sync peer.

Add a main-chain parent check via fetch_block_by_id() before declaring a dead fork. If the parent exists on the main chain, seed fork_db with it and let the block proceed through normal push logic. Apply the same check in p2p_plugin's unlinkable_block_exception handler as a safety net.
During P2P sync mode, the accept_block handler was skipping both witness signature and transaction signature verification. Skipping witness signatures allows a malicious peer with a valid fork_id to inject forged blocks that would be accepted without verification.

Keep skip_transaction_signatures during sync (transactions inside a block are already committed by the producing witness, so individual signature checks are redundant). But always verify the witness block signature — it is the sole defense against block forgery.
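The resulting verification policy can be stated as a tiny decision function. The flag names mirror common Graphene-style skip flags but are assumptions for illustration here:

```cpp
// Illustrative model of the sync-mode verification policy described
// above; field names are assumptions, not the chain's actual flags.
struct skip_flags {
    bool skip_witness_signature;
    bool skip_transaction_signatures;
};

skip_flags flags_for(bool syncing) {
    return skip_flags{
        /*skip_witness_signature=*/ false,       // ALWAYS verify: the sole
                                                 // defense against forgery
        /*skip_transaction_signatures=*/ syncing // redundant inside a block
                                                 // committed by its witness
    };
}
```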
Replace catch-all exception handlers with targeted catches:
- database.cpp: fork_db parent seeding now catches only
fc::assert_exception (duplicate/already-present). Memory errors
and corruption propagate instead of being silently swallowed.
- dlt_p2p_node: add PAUSED_QUEUE_MAX (1000) limit to prevent
unbounded memory growth during long snapshot pauses. Replace
silent catch(...){} with fc::exception catch + wlog so
deserialization failures are visible in diagnostics.
Issue #4: Add comprehensive safety analysis for _pending_tx access during snapshot serialization. Document that strong_read_lock is compatible with weak_write_lock (used by push_transaction), making a theoretical race possible via API accept_transaction. P2P pause during snapshot eliminates the primary concurrent writer. A full fix requires pausing API transactions or copying _pending_tx outside the read lock scope.

Issue #5: Prevent infinite loop in drain_send_queue when writesome() returns 0 bytes (stalled connection). Throw an fc::exception, which is already caught by the outer handler that logs the error and disconnects the peer.
A slave node falling far behind the network would soft-ban the very peer that could supply the blocks it needed, locking itself out of catch-up. Three related fixes in dlt_p2p_node.cpp:

1. add_to_mempool: do not strike the sender for TaPoS-invalid transactions. TaPoS validity is a function of OUR chain state (the block_summary_object circular buffer), not the sender's. When we are behind, recent txs reference blocks past our head whose ref_block_num slot still holds an older block, so the prefix check fails through no fault of the sender. After 10 strikes the slave soft-banned its only viable bridge peer for 3600s. Strikes for genuinely bad packets (expired, oversized, far-future expiration) are unchanged.

2. send_to_all_our_fork_peers: skip transaction broadcast to peers whose peer_node_status == DLT_NODE_STATUS_SYNC. They cannot validate the txs anyway, so it is wasted bandwidth and used to be the source of the strikes from (1). Blocks and other message types still propagate.

3. on_dlt_gap_fill_request: always send the dlt_gap_fill_reply, even when empty. Previously, when the serving peer had no blocks in the requested range (e.g. all below its dlt_earliest), no reply was sent, so the requester waited GAP_FILL_TIMEOUT_SEC (15s) and retried the same peer indefinitely. With an empty reply the requester clears _gap_fill_in_progress immediately and can switch peer/strategy on the next periodic tick.

Reproduces the scenario captured in .qoder/logs/p57.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
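Point (1) rests on how TaPoS works in Graphene-style chains: ref_block_num indexes a 65536-slot circular buffer of recent block-id prefixes on the local chain, so validity is a property of our state, not the sender's. A simplified model (field names are illustrative):

```cpp
#include <array>
#include <cstdint>

// Simplified TaPoS model: a 0x10000-slot circular buffer mapping
// (block_num & 0xFFFF) to that block's id prefix, as maintained by
// block_summary_object in Graphene-style chains.
struct tapos_checker {
    std::array<uint32_t, 0x10000> summary{};  // zero-initialized prefixes

    void record_block(uint32_t block_num, uint32_t id_prefix) {
        summary[block_num & 0xFFFF] = id_prefix;
    }

    // A tx is TaPoS-valid only if the referenced slot still holds the
    // prefix of the block the tx was built against. On a lagging node
    // the slot holds an OLDER block, so the check fails through no
    // fault of the sender.
    bool tapos_valid(uint16_t ref_block_num, uint32_t ref_prefix) const {
        return summary[ref_block_num] == ref_prefix;
    }
};
```

This is why fix (1) stops striking the sender: the mismatch indicates our lag, not their malice.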
…e handling
- Remove detailed implementation of peer management functions, including connection setup, disconnect handling, and periodic lifecycle checks
- Omit message send and receive logic, including send queue management and message dispatch
- Discard fork alignment checks and handshake message building for peer compatibility verification
- Remove feature gating for block processing pause and buffered message queues
- Eliminate debug logging and error handling for message processing failures
- Simplify peer state transitions and reconnection backoff strategies
- Strip out IP deduplication in connection establishment and relay protocols
- Remove code related to seed node initialization and listen socket management
- Omit send-to-all broadcast and echo suppression mechanisms for peers
- Remove detailed error categorization for connection failures and socket operations
…up logic
- Replace unguarded transition_to_forward() in the range_fallback_mode path with check_sync_catchup()
- Add known_head_peers counter to check_sync_catchup() to skip empty peers in the catchup check
- Prevent the node from claiming it is caught up when connected only to empty peers by requiring known_head_peers > 0
- Keep empty peers in active_peer_count for isolation detection while excluding them from known_head_peers
- Fix disconnect handling and send logic to avoid wasting bandwidth on empty peers
- Identify and document logic errors: node entering FORWARD state prematurely and partial fork alignment failures
- Update code comments and add missing post-P31 fixes from code to the documentation audit file
- Update record_packet_result to consider fork_db-only blocks as valid
- Prevent marking peers as spam when they send valid but not currently applicable blocks
- Improve handling of competing forks and large-gap syncs from LIB
…e stall
Two bugs prevented a DLT slave node from filling its gap after connecting
to a master peer on the majority fork:
BUG-E — false spam strike on fork_db-only range responses
During a large-gap LIB-based sync, every block range the master sends
lands in fork_db (the slave's head has diverged; blocks cannot be applied
yet). record_packet_result(peer, any_block_applied) incremented spam_strikes
on each such batch. After 10 consecutive fork_db-only responses the master
was soft-banned for 3600s — exactly the scenario the LIB fallback was
designed to handle. The fix: count fork_db-only batches as useful (not spam)
by changing the argument to any_block_applied || any_fork_db_only.
BUG-F — DLT snapshot node permanently rejects competing fork starting at LIB
After importing a snapshot at block N the DLT log starts at N+1. Block N
is absent from every store fetch_block_by_id searches. Fork_db was seeded
with only the head block (start_block, prev=null). When the master sends its
competing fork (N+1_master, parent=N.id): is_known_block(N.id)=false,
fetch_block_by_id(N.id)=null → DEAD_FORK. The fork switch could never fire.
Three coordinated changes fix this:
1. database.cpp DLT seeding: replaced single-block start_block(head) with
bottom-up seeding from the oldest available DLT log block. The slave's
recent chain (N+1..head) is now in fork_db with correct prev pointers,
making fetch_branch_from walkable on the slave branch during fork switch.
2. database.cpp ALREADY_KNOWN path: when block N arrives from the peer and
is confirmed ALREADY_KNOWN, call _fork_db.insert_as_base(new_block) while
the full block data is in hand — the only opportunity to populate fork_db
with the snapshot LIB block whose data is absent from all log files.
3. fork_database: new insert_as_base() inserts an anchor block without
requiring its parent in fork_db, then calls _push_next() and the new
_repair_child_prev_links() to reconnect child blocks already in _index
(inserted via start_block with null prev) so fetch_branch_from can walk
both branches to their common ancestor at block N.