fix: UTF-8 decoding corrupts multi-byte characters in streaming by TheBlueHouse75 · Pull Request #10 · walter-grace/mac-code

TheBlueHouse75 · 2026-04-13T16:32:39Z

Summary

Fixes garbled output for any non-ASCII text streamed from the LLM (French accents, Chinese, Japanese, emoji, etc.).

The bug

In stream_llm(), the SSE reader was doing:

ch = resp.read(1)          # read 1 byte
buf += ch.decode("utf-8", errors="replace")

Multi-byte UTF-8 characters span 2–4 bytes (é = 0xC3 0xA9, 中 = 3 bytes, most emoji = 4 bytes). Decoded on its own, a lead byte is an incomplete sequence and a continuation byte is an invalid start byte, so with errors="replace" every byte of a multi-byte character becomes U+FFFD. The result is unreadable output for any response containing non-ASCII text.
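A minimal sketch reproducing the failure outside of stream_llm(). The helper name decode_per_byte is hypothetical and only mimics the read(1)-then-decode pattern quoted above; it is not code from the repository.

```python
def decode_per_byte(data: bytes) -> str:
    """Decode each byte separately, as the old read(1) loop effectively did."""
    return "".join(
        bytes([b]).decode("utf-8", errors="replace") for b in data
    )

sample = "é中".encode("utf-8")      # 2 bytes + 3 bytes
garbled = decode_per_byte(sample)

# Every one of the 5 bytes turns into U+FFFD.
print(len(garbled), garbled == "\ufffd" * 5)   # 5 True

# Pure ASCII is unaffected, which is why the bug is invisible in English output.
print(decode_per_byte(b"hello"))               # hello
```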

Example before fix (French)

Je peux t'aider avec une grande vari��t�� de t��ches ! Voici quelques-unes
de mes principales comp��tences :
  **R��ponses et informations** : Je peux r��pondre �� tes questions
  **R��daction et cr��ation** : Je peux r��diger des textes, emails, po��mes

The fix

Use codecs.getincrementaldecoder("utf-8") and read in larger chunks. The incremental decoder keeps state across decode() calls, so multi-byte sequences that straddle chunk boundaries are handled correctly.

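import codecs

# One decoder for the whole stream: it buffers incomplete trailing bytes between calls.
decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
while True:
    ch = resp.read(4096)  # up to 4 KiB per read
    if not ch:
        # End of stream: flush any incomplete trailing sequence (emitted as U+FFFD).
        buf += decoder.decode(b"", final=True)
        break
    buf += decoder.decode(ch)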

Switching from 1-byte to 4 KiB reads is also noticeably cheaper on CPU.
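To see the state-keeping in isolation, here is a small sketch that feeds an incremental decoder one byte at a time, the worst case for boundary splits. The sample strings are arbitrary; only the standard-library codecs module is used.

```python
import codecs

decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")

# Feed "é!" (bytes C3 A9 21) one byte per call.
pieces = [decoder.decode(bytes([b])) for b in "é!".encode("utf-8")]
pieces.append(decoder.decode(b"", final=True))
# The lone lead byte is held back, then emitted once its continuation arrives:
# pieces is ['', 'é', '!', '']
print(pieces == ["", "é", "!", ""])   # True

# A stream that ends mid-character is only replaced at the final flush.
truncated = codecs.getincrementaldecoder("utf-8")(errors="replace")
held = truncated.decode(b"\xc3")             # '' (still waiting for more bytes)
tail = truncated.decode(b"", final=True)     # '\ufffd'
print(held == "", tail == "\ufffd")          # True True
```

Because the decoder holds back incomplete sequences, the chunk size passed to read() affects only throughput, not the correctness of the decoded text.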

Test plan

Run python3 agent.py against a French prompt — accents render correctly
Verified streaming still works token-by-token in the UI
No behavior change for pure ASCII output
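The manual checks above could be complemented by an automated one. The following is only a sketch under assumptions: read_text is a hypothetical helper that mirrors the fixed loop, and io.BytesIO stands in for the HTTP response object (anything with a read(n) method). It exercises decoding only, not SSE parsing or the UI.

```python
import codecs
import io

def read_text(resp, chunk_size=4096):
    """Hypothetical helper mirroring the fixed loop; resp needs only .read(n)."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    buf = ""
    while True:
        chunk = resp.read(chunk_size)
        if not chunk:
            # Flush whatever the decoder is still holding at end of stream.
            buf += decoder.decode(b"", final=True)
            break
        buf += decoder.decode(chunk)
    return buf

samples = [
    "Je peux rédiger des poèmes 🍎",   # accents plus a 4-byte emoji
    "中文",                             # 3-byte characters
    "plain ASCII output",               # must be unchanged
]

# Small chunk sizes force characters to straddle read boundaries.
for text in samples:
    for size in (1, 2, 3, 4096):
        assert read_text(io.BytesIO(text.encode("utf-8")), size) == text

print("all round-trips OK")
```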

The SSE reader was calling resp.read(1) and decoding each single byte with utf-8. Multi-byte characters (é, à, 中, emoji, etc.) span 2–4 bytes, so each byte was individually replaced by U+FFFD, producing garbled output for any non-ASCII language. Fixed by reading 4 KiB chunks and feeding them through an incremental UTF-8 decoder, which correctly handles multi-byte sequences that span chunk boundaries.

stream_llm() reads SSE response byte-by-byte (resp.read(1)) and decodes each byte individually as UTF-8. This corrupts all multi-byte characters: emojis (🍎→????), accented chars (ñ→??), CJK text, etc. Fix: read 4096-byte chunks and decode the full chunk. Multi-byte characters are now correctly assembled before decoding. This is the same issue reported in PR walter-grace#10.
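If "decode the full chunk" means calling bytes.decode on each chunk independently (an assumption; the referenced change is not shown here), a character that straddles a chunk boundary would still be corrupted, just far less often than with 1-byte reads. A sketch with a deliberately tiny chunk size makes this visible; decode_chunks_stateless is a hypothetical name.

```python
def decode_chunks_stateless(data: bytes, chunk_size: int) -> str:
    """Decode each chunk on its own, with no state carried between chunks."""
    out = []
    for i in range(0, len(data), chunk_size):
        out.append(data[i:i + chunk_size].decode("utf-8", errors="replace"))
    return "".join(out)

payload = "añb".encode("utf-8")     # b'a\xc3\xb1b'

# chunk_size=2 splits ñ across the boundary: b'a\xc3' | b'\xb1b'
split = decode_chunks_stateless(payload, 2)
print(split == "a\ufffd\ufffdb")    # True

# With a chunk larger than the payload there is no boundary to hit.
whole = decode_chunks_stateless(payload, 4096)
print(whole == "añb")               # True
```

This is the case the incremental decoder is meant to cover: it carries the dangling lead byte over to the next call instead of replacing it.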

mitre88 mentioned this pull request Apr 18, 2026

fix: UTF-8 corruption in streaming — read chunks instead of byte-by-byte #11

Closed

TheBlueHouse75 wants to merge 1 commit into walter-grace:main from TheBlueHouse75:fix/utf8-streaming-decode
