SyslogAppender may exceed MaxMessageLength for multibyte UTF-8 messages because splitting is based on LogString character count instead of encoded byte length #680

@jmestwa-coder

Description

SyslogAppender currently decides whether to split outgoing syslog packets using LogString::size(), while the actual UDP payload is produced only after transcoding the message into encoded bytes.

This creates a transport boundary mismatch for multibyte encodings such as UTF-8:

  • split decision → based on character count
  • emitted datagram size → based on encoded byte length

As a result, messages containing multibyte characters can remain below the configured MaxMessageLength threshold while still producing UDP datagrams larger than the configured maximum.

Runtime Reproduction

Configuration:

  • MaxMessageLength = 100
  • PatternLayout("%m")
  • Syslog output to localhost UDP receiver

Test message:

  • 40 Euro symbols (€)
  • UTF-8 encoding (€ = 3 bytes)

Observed behavior with current implementation:

  • msg.size() = 40
  • encoded payload length = 120 bytes
  • no splitting occurred
  • emitted UDP datagram size = 124 bytes including syslog prefix

This exceeds the configured maximum despite the existing split logic.
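The numbers above can be reproduced outside log4cxx. The following is a minimal sketch, not the appender's code: `std::u32string` stands in for a wide `LogString`, the helpers `appendUtf8`, `toUtf8` and `measure` are hypothetical names, and the 4-byte prefix `<14>` is an assumed value chosen only to match the observed 124-byte datagram.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: append the UTF-8 encoding of one code point.
// Assumes cp is a valid Unicode scalar value.
void appendUtf8(std::string& out, char32_t cp)
{
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical stand-in for the transcoding step that runs before sending.
std::string toUtf8(const std::u32string& text)
{
    std::string out;
    for (char32_t cp : text)
        appendUtf8(out, cp);
    return out;
}

struct Sizes {
    std::size_t characters;    // what the split check looks at
    std::size_t payloadBytes;  // encoded message only
    std::size_t datagramBytes; // prefix + encoded message
};

// Measure the three quantities listed in the reproduction.
Sizes measure(const std::u32string& msg, const std::string& prefix)
{
    const std::string encoded = toUtf8(msg);
    return { msg.size(), encoded.size(), prefix.size() + encoded.size() };
}
```

Calling `measure(std::u32string(40, U'\u20AC'), "<14>")` yields 40 characters, 120 payload bytes and 124 datagram bytes, which matches the observed values.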

Root Cause

Current split logic uses:

if (msg.size() > _priv->maxMessageLength)

However:

  • LogString::size() reflects internal character/code-unit count
  • UDP transport size depends on encoded byte length after transcoding

For multibyte UTF-8 content, these values diverge.
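To make the divergence concrete, here is a hedged sketch that puts the two checks side by side. It is not taken from log4cxx: `std::u32string` stands in for `LogString`, and `utf8Length`, `needsSplitByCharacters` and `needsSplitByBytes` are hypothetical names. The byte-based check counts encoded bytes without building the encoded string, so it does not require a transcoding pass.

```cpp
#include <cstddef>
#include <string>

// Number of bytes the text would occupy in UTF-8.
// Assumes every element is a valid Unicode scalar value.
std::size_t utf8Length(const std::u32string& text)
{
    std::size_t bytes = 0;
    for (char32_t cp : text)
        bytes += cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
    return bytes;
}

// Mirrors the shape of the current check: compares a character count.
bool needsSplitByCharacters(const std::u32string& msg,
                            std::size_t maxMessageLength)
{
    return msg.size() > maxMessageLength;
}

// Byte-aware variant: compares what would actually go on the wire,
// including the bytes of the syslog prefix supplied by the caller.
bool needsSplitByBytes(const std::u32string& msg,
                       std::size_t prefixBytes,
                       std::size_t maxMessageLength)
{
    return prefixBytes + utf8Length(msg) > maxMessageLength;
}
```

For 40 Euro symbols and a limit of 100, `needsSplitByCharacters` returns false while `needsSplitByBytes` with a 4-byte prefix returns true.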

Why This Matters

This is a transport reliability problem rather than a configuration validation problem: the configured limit is accepted as given but is not enforced on the bytes actually sent.

Oversized syslog datagrams may:

  • exceed expected relay or collector limits
  • increase truncation/drop risk
  • produce inconsistent packet chunking behavior across encodings

The issue is especially visible with:

  • UTF-8 multibyte characters
  • emoji
  • CJK characters
  • mixed-width log content

Additional Investigation Notes

I prototyped a byte-aware splitting implementation to validate the issue.

That investigation confirmed:

  • byte-aware splitting resolves the demonstrated overflow case

  • however, a naive implementation introduces additional considerations:

    • prefix/suffix accounting must be dynamic rather than heuristic
    • repeated transcoding inside the split loop can introduce avoidable hot-path overhead

For example, enabling facilityPrinting while using a fixed suffix reserve still allowed a packet to exceed MaxMessageLength by 1 byte in testing.

Because of this, I am opening this issue first rather than immediately proposing the prototype patch.

Suggested Direction

A more robust solution may involve:

  • enforcing limits using encoded byte length
  • dynamically accounting for syslog prefix/suffix overhead
  • preserving valid UTF-8/codepoint boundaries during splitting
  • avoiding repeated transcoding work in the logging hot path
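One possible shape for the splitting step, as a sketch under stated assumptions rather than a proposed patch: the message is transcoded to UTF-8 once, and the hypothetical function `splitUtf8` then cuts the encoded bytes into chunks without breaking a code point. The caller is assumed to compute `maxChunkBytes` as MaxMessageLength minus the measured byte length of the actual prefix and suffix for that packet.

```cpp
#include <algorithm>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Split already-encoded UTF-8 into chunks of at most maxChunkBytes,
// cutting only at code point boundaries. Hypothetical helper.
std::vector<std::string> splitUtf8(const std::string& payload,
                                   std::size_t maxChunkBytes)
{
    // A chunk must be able to hold the longest UTF-8 sequence (4 bytes),
    // otherwise some code points could never be emitted whole.
    if (maxChunkBytes < 4)
        throw std::invalid_argument("maxChunkBytes must be at least 4");

    std::vector<std::string> chunks;
    std::size_t pos = 0;
    while (pos < payload.size()) {
        std::size_t end = std::min(pos + maxChunkBytes, payload.size());

        // If the cut lands on a continuation byte (10xxxxxx),
        // back up to the lead byte of that code point.
        while (end < payload.size() && end > pos &&
               (static_cast<unsigned char>(payload[end]) & 0xC0) == 0x80)
            --end;

        // Malformed input (a long run of continuation bytes):
        // cut at the byte limit so the loop always makes progress.
        if (end == pos)
            end = std::min(pos + maxChunkBytes, payload.size());

        chunks.push_back(payload.substr(pos, end - pos));
        pos = end;
    }
    return chunks;
}
```

With the 120-byte payload from the reproduction and `maxChunkBytes` of 96 (a limit of 100 minus a 4-byte prefix), this produces two chunks of 96 and 24 bytes. Because the split operates on bytes that are encoded once up front, no transcoding happens inside the loop.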

I can provide the reproduction test and prototype implementation details if helpful.
