SyslogAppender may exceed MaxMessageLength for multibyte UTF-8 messages because splitting is based on LogString character count instead of encoded byte length #680

`SyslogAppender` currently decides whether to split outgoing syslog packets using `LogString::size()`, while the actual UDP payload is produced only after transcoding the message into encoded bytes.

This creates a transport boundary mismatch for multibyte encodings such as UTF-8:

* split decision → based on character count
* emitted datagram size → based on encoded byte length

As a result, messages containing multibyte characters can remain below the configured `MaxMessageLength` threshold while still producing UDP datagrams larger than the configured maximum.

### Runtime Reproduction

Configuration:

* `MaxMessageLength = 100`
* `PatternLayout("%m")`
* Syslog output to localhost UDP receiver

Test message:

* 40 Euro symbols (`€`)
* UTF-8 encoding (`€` = 3 bytes)

Observed behavior with current implementation:

* `msg.size()` = 40
* encoded payload length = 120 bytes
* no splitting occurred
* emitted UDP datagram size = 124 bytes including syslog prefix

This exceeds the configured maximum despite the existing split logic.
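The arithmetic behind these numbers can be checked in isolation. The following is a minimal standalone sketch, not log4cxx code: it models the message as a `std::u32string` of code points (an assumption that matches a wide-character `LogString` build with 32-bit `wchar_t`), and the helper names `utf8Length`, `splitsByCharCount`, and `splitsByByteLength` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Number of bytes UTF-8 needs for one Unicode code point.
// Illustrative helper, not part of log4cxx.
inline std::size_t utf8Length(char32_t cp)
{
    assert(cp <= 0x10FFFF); // outside the Unicode range is a caller error
    if (cp < 0x80)    return 1;
    if (cp < 0x800)   return 2;
    if (cp < 0x10000) return 3;
    return 4;
}

// Total UTF-8 byte length of a string of code points.
inline std::size_t utf8Length(const std::u32string& text)
{
    std::size_t bytes = 0;
    for (char32_t cp : text)
        bytes += utf8Length(cp);
    return bytes;
}

// Models the current check: compares the character count with the limit.
inline bool splitsByCharCount(const std::u32string& msg, std::size_t maxLen)
{
    return msg.size() > maxLen;
}

// Models a byte-aware check: compares the encoded length with the limit.
inline bool splitsByByteLength(const std::u32string& msg, std::size_t maxLen)
{
    return utf8Length(msg) > maxLen;
}
```

For a message of 40 `U'\u20AC'` code points and a limit of 100, `splitsByCharCount` reports that no split is needed, while `utf8Length` yields 120 and `splitsByByteLength` reports that a split is required.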

### Root Cause

Current split logic uses:

```cpp
if (msg.size() > _priv->maxMessageLength)
```

However:

* `LogString::size()` reflects internal character/code-unit count
* UDP transport size depends on encoded byte length after transcoding

For multibyte UTF-8 content, these values diverge.

### Why This Matters

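This is a transport reliability problem rather than a configuration validation problem: the configured value is accepted as-is, but the limit is not enforced on the bytes that are actually sent.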

Oversized syslog datagrams may:

* exceed expected relay or collector limits
* increase truncation/drop risk
* produce inconsistent packet chunking behavior across encodings

The issue is especially visible with:

* UTF-8 multibyte characters
* emoji
* CJK characters
* mixed-width log content

### Additional Investigation Notes

I prototyped a byte-aware splitting implementation to validate the issue.

That investigation confirmed:

* byte-aware splitting resolves the demonstrated overflow case
* however, a naive implementation introduces additional considerations:

  * prefix/suffix accounting must be dynamic rather than heuristic
  * repeated transcoding inside the split loop can introduce avoidable hot-path overhead

For example, enabling `facilityPrinting` while using a fixed suffix reserve still allowed a packet to exceed `MaxMessageLength` by 1 byte in testing.
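One way to avoid a fixed reserve is to measure the overhead that is actually going to be sent. The sketch below is an illustration under stated assumptions, not the prototype and not log4cxx internals: `PacketOverhead`, `payloadBudget`, and `chunkSuffix` are hypothetical names, and the prefix and `(i/n)` suffix formats are placeholders for whatever the appender really emits.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical description of the per-packet overhead. Both fields hold
// bytes that are already encoded for the wire.
struct PacketOverhead
{
    std::string prefix; // illustrative, e.g. "<14>" or "<14>user:"
    std::string suffix; // illustrative, e.g. "(2/10)"
};

// Bytes left for message payload once the real prefix and suffix are counted.
// Returns 0 when the overhead alone meets or exceeds the limit.
inline std::size_t payloadBudget(std::size_t maxMessageLength,
                                 const PacketOverhead& overhead)
{
    const std::size_t used = overhead.prefix.size() + overhead.suffix.size();
    return used >= maxMessageLength ? 0 : maxMessageLength - used;
}

// Suffix for chunk `index` of `total` (1-based), in an assumed "(i/n)" format.
inline std::string chunkSuffix(std::size_t index, std::size_t total)
{
    assert(index >= 1 && index <= total);
    return "(" + std::to_string(index) + "/" + std::to_string(total) + ")";
}
```

Because the suffix width changes with the number of digits in the chunk index and total, and the prefix grows when a facility string is added, computing the budget per packet from the measured sizes avoids guessing a reserve.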

Because of this, I am opening this issue first rather than immediately proposing the prototype patch.

### Suggested Direction

A more robust solution may involve:

* enforcing limits using encoded byte length
* dynamically accounting for syslog prefix/suffix overhead
* preserving valid UTF-8/codepoint boundaries during splitting
* avoiding repeated transcoding work in the logging hot path
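As a rough illustration of how these points could fit together, the sketch below splits a message that has already been transcoded to UTF-8 exactly once, so the loop works on bytes and never transcodes again. It is a hypothetical helper (`splitUtf8` and `isContinuation` are not log4cxx functions), it assumes the per-packet `budget` has already had prefix and suffix overhead subtracted, and it only preserves code point boundaries, not grapheme clusters such as a base character plus combining marks.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// True when `byte` is a UTF-8 continuation byte (bit pattern 10xxxxxx).
inline bool isContinuation(unsigned char byte)
{
    return (byte & 0xC0) == 0x80;
}

// Splits already-encoded UTF-8 into chunks of at most `budget` bytes
// without cutting through a multibyte sequence.
inline std::vector<std::string> splitUtf8(const std::string& encoded,
                                          std::size_t budget)
{
    assert(budget >= 4); // the longest UTF-8 sequence must fit in one chunk
    std::vector<std::string> chunks;
    std::size_t start = 0;
    while (start < encoded.size())
    {
        std::size_t end = start + budget;
        if (end >= encoded.size())
        {
            end = encoded.size(); // final chunk takes whatever is left
        }
        else
        {
            // Back up until `end` points at the first byte of a code point.
            while (end > start &&
                   isContinuation(static_cast<unsigned char>(encoded[end])))
                --end;
            // Malformed input (only continuation bytes): fall back to a hard cut.
            if (end == start)
                end = start + budget;
        }
        chunks.push_back(encoded.substr(start, end - start));
        start = end;
    }
    return chunks;
}
```

Applied to the 120-byte encoding of 40 Euro symbols with a budget of 100, this yields one 99-byte chunk (33 symbols) and one 21-byte chunk (7 symbols), so neither chunk exceeds the budget and no symbol is cut in half.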

I can provide the reproduction test and prototype implementation details if helpful.

