fix: UTF-8 decoding corrupts multi-byte characters in streaming #10

Open
TheBlueHouse75 wants to merge 1 commit into walter-grace:main from TheBlueHouse75:fix/utf8-streaming-decode

@TheBlueHouse75

Summary

Fixes garbled output for any non-ASCII text streamed from the LLM (French accents, Chinese, Japanese, emoji, etc.).

The bug

In stream_llm(), the SSE reader was doing:

ch = resp.read(1)          # read 1 byte
buf += ch.decode("utf-8", errors="replace")

Multi-byte UTF-8 characters span 2–4 bytes (é = 0xC3 0xA9, 中 = 3 bytes, emoji = 4 bytes). Decoding one byte at a time fails on every byte of such a sequence: a lead byte on its own is an incomplete sequence, and a continuation byte on its own is an invalid start byte, so with errors="replace" each byte becomes U+FFFD. The result is unreadable output for any response containing non-ASCII text.
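
The failure can be reproduced without a server. The snippet below is a minimal illustration, not code from agent.py: it decodes the two bytes of "é" separately, the way the old loop did, and compares that with decoding them together.

```python
# Illustration only: mimic the old loop by decoding one byte at a time.
data = "é".encode("utf-8")  # two bytes: 0xC3 0xA9

per_byte = "".join(
    bytes([b]).decode("utf-8", errors="replace") for b in data
)
whole = data.decode("utf-8")

print([hex(ord(c)) for c in per_byte])  # ['0xfffd', '0xfffd']
print(whole == "é")                     # True
```

Each input byte yields exactly one U+FFFD, which is why a 2-byte character shows up as two replacement characters in the French example and a 4-byte emoji shows up as four.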

Example before fix (French)

Je peux t'aider avec une grande vari��t�� de t��ches ! Voici quelques-unes
de mes principales comp��tences :
  **R��ponses et informations** : Je peux r��pondre �� tes questions
  **R��daction et cr��ation** : Je peux r��diger des textes, emails, po��mes

The fix

Use codecs.getincrementaldecoder("utf-8") and read in larger chunks. The incremental decoder keeps state across decode() calls, so multi-byte sequences that straddle chunk boundaries are handled correctly.

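import codecs  # required at the top of the module

# One decoder for the whole response, so its state carries across reads
decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
while True:
    ch = resp.read(4096)  # up to 4 KiB per read
    if not ch:
        # End of stream: flush any incomplete trailing sequence
        buf += decoder.decode(b"", final=True)
        break
    buf += decoder.decode(ch)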

Switching from 1-byte to 4 KiB reads also cuts the number of read() and decode() calls by up to a factor of 4096, which is noticeably cheaper on CPU.
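
To show why the incremental decoder matters, and not just the larger read size, here is a small sketch that splits a sample string into deliberately tiny chunks so that characters straddle chunk boundaries. The sample text and the chunk size of 3 are arbitrary choices for illustration; only the codecs calls come from the fix above.

```python
import codecs

text = "variété 中 🍎"
raw = text.encode("utf-8")
chunk_size = 3  # tiny on purpose, so some characters are split across chunks

chunks = [raw[i:i + chunk_size] for i in range(0, len(raw), chunk_size)]

# Stateless: every chunk is decoded on its own, so split characters are lost.
naive = "".join(c.decode("utf-8", errors="replace") for c in chunks)

# Stateful: the incremental decoder holds back a trailing partial sequence
# and completes it when the next chunk arrives.
decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
incremental = "".join(decoder.decode(c) for c in chunks)
incremental += decoder.decode(b"", final=True)

print(naive == text)        # False
print(incremental == text)  # True
```

With 4096-byte reads a split is rarer than with 3-byte chunks, but it can still occur at any read boundary, so decoding each chunk independently would only make the corruption intermittent.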

Test plan

  • Run python3 agent.py against a French prompt — accents render correctly
  • Verified streaming still works token-by-token in the UI
  • No behavior change for pure ASCII output
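
The manual checks above could be complemented by an automated one. The following is a sketch under stated assumptions: read_text() is a hypothetical helper that mirrors the patched loop (it is not a function in agent.py), and io.BytesIO stands in for the HTTP response object.

```python
import codecs
import io


def read_text(resp, chunk_size=4096):
    """Hypothetical helper mirroring the patched loop: bytes in, str out."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    parts = []
    while True:
        chunk = resp.read(chunk_size)
        if not chunk:
            # Flush whatever the decoder is still holding at end of stream.
            parts.append(decoder.decode(b"", final=True))
            break
        parts.append(decoder.decode(chunk))
    return "".join(parts)


def check_round_trip():
    """Decode each sample at several chunk sizes and compare with the input."""
    samples = ["plain ASCII", "variété, tâches", "中文テキスト", "🍎 emoji"]
    for sample in samples:
        for size in range(1, 9):  # small sizes force mid-character splits
            fake_resp = io.BytesIO(sample.encode("utf-8"))
            assert read_text(fake_resp, size) == sample, (sample, size)
    return True


if __name__ == "__main__":
    print(check_round_trip())
```

Sweeping the chunk size from 1 to 8 bytes covers every possible split position inside 2-, 3-, and 4-byte sequences, including the 1-byte reads the old code used.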

The SSE reader was calling resp.read(1) and decoding each single
byte with utf-8. Multi-byte characters (é, à, 中, emoji, etc.) span
2–4 bytes, so each byte was individually replaced by U+FFFD, producing
garbled output for any non-ASCII language.

Fixed by reading 4 KiB chunks and feeding them through an incremental
UTF-8 decoder, which correctly handles multi-byte sequences that span
chunk boundaries.
mitre88 pushed a commit to mitre88/mac-code that referenced this pull request Apr 18, 2026
stream_llm() reads SSE response byte-by-byte (resp.read(1)) and decodes
each byte individually as UTF-8. This corrupts all multi-byte characters:
emojis (🍎→????), accented chars (ñ→??), CJK text, etc.

Fix: read 4096-byte chunks and decode the full chunk. Multi-byte
characters are now correctly assembled before decoding.

This is the same issue reported in PR walter-grace#10.