fix: ConvertFromUnicodeJava - Korean unicode escape not decoded, hardcoded range limit #73

@sihyunjojo

Problem

ConvertFromUnicodeJava does not convert Korean Unicode escape sequences.
Input like \uC548\uB155 is passed through as-is instead of being decoded to 안녕.

This happens because the function only handles the hardcoded range \u00A1 through \u00FF.
Korean (Hangul) syllables fall in the range \uAC00 through \uD7A3, so they are never matched.

On top of that, even characters within the supported range were rendered as ? because of encoding corruption in the source file.

A secondary issue exists in the RegEx-based variant: it calls Replace() once per match, and each call rescans the entire string. When the same escape appears multiple times, the first call already replaces every occurrence, so the calls made for the remaining matches search for text that no longer exists.
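To make the redundancy concrete, here is a minimal sketch of what a per-match Replace() loop could look like. It is a hypothetical reconstruction for illustration only, not the project's actual code, and the name ConvertWithReplace is invented here.

```vba
' Hypothetical illustration of the per-match Replace() approach.
' ConvertWithReplace is an invented name, not a function from the project.
Function ConvertWithReplace(ByVal TextStr As String) As String

    Dim regEx   As Object
    Dim matches As Object
    Dim match   As Object

    Set regEx = CreateObject("VBScript.RegExp")
    regEx.Global = True
    regEx.Pattern = "\\u([0-9A-Fa-f]{4})"

    ' The match list is computed once, against the original text.
    Set matches = regEx.Execute(TextStr)

    For Each match In matches
        ' Replace() scans all of TextStr and swaps every occurrence of this
        ' escape, so later iterations for the same escape find nothing to do.
        TextStr = Replace(TextStr, match.Value, _
                          ChrW(CLng("&H" & match.SubMatches(0))))
    Next match

    ConvertWithReplace = TextStr

End Function
```

For an input such as \u00A9\u00A9\u00A9, the loop runs three times, but only the first Replace() call changes anything. The result is still correct; the cost is two wasted full-string scans.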

Fix

Replaced the hardcoded lookup table with the RegEx pattern \\u([0-9A-Fa-f]{4}), which matches a literal backslash, the letter u, and any four hex digits.
Instead of calling Replace() per match, the function now walks through the matches in order, using FirstIndex to locate each one. It appends the unmatched text between matches as-is and converts each matched escape with ChrW().

This covers every four-digit escape, \u0000 through \uFFFF, in a single pass.

Function ConvertFromUnicodeJava(ByVal TextStr As String) As String

    Dim regEx      As Object
    Dim matches    As Object
    Dim match      As Object
    Dim result     As String
    Dim cursor     As Long   ' 1-based position of the next unprocessed character
    Dim matchStart As Long   ' 1-based position where the current match begins

    ' Match a literal backslash, "u", and exactly four hex digits.
    Set regEx = CreateObject("VBScript.RegExp")
    With regEx
        .Global  = True
        .Pattern = "\\u([0-9A-Fa-f]{4})"
    End With

    Set matches = regEx.Execute(TextStr)

    ' No escapes found: return the input unchanged.
    If matches.Count = 0 Then
        ConvertFromUnicodeJava = TextStr
        Exit Function
    End If

    result = ""
    cursor = 1

    For Each match In matches
        ' FirstIndex is 0-based; Mid() is 1-based.
        matchStart = match.FirstIndex + 1
        ' Copy the literal text between the previous match and this one.
        If matchStart > cursor Then
            result = result & Mid(TextStr, cursor, matchStart - cursor)
        End If
        ' Decode the four hex digits into a single character.
        result = result & ChrW(CLng("&H" & match.SubMatches(0)))
        cursor = matchStart + match.Length
    Next match

    ' Copy any text that follows the last match.
    If cursor <= Len(TextStr) Then
        result = result & Mid(TextStr, cursor)
    End If

    ConvertFromUnicodeJava = result

    Set matches = Nothing
    Set regEx   = Nothing

End Function

Verified behavior

| Input | Before | After |
| --- | --- | --- |
| \uC548\uB155 | \uC548\uB155 (not converted) | 안녕 |
| Fran\u00E7ois | Fran?ois | François |
| Copyright \u00A9 | Copyright ? | Copyright © |
| \u00A9\u00A9\u00A9 | correct output, 3× Replace calls | correct output, single pass |
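The cases above could be checked with a small test routine like the sketch below. It assumes ConvertFromUnicodeJava from this issue is in the same module. The name TestConvertFromUnicodeJava is made up for this example, and the last two assertions (plain text and a malformed escape) are extra cases that follow from the pattern rather than results reported in this issue. Expected values are built with ChrW() rather than typed as literals, so the test itself is not affected by source-file encoding.

```vba
' Sketch of a regression check; run it from the Immediate window.
' Assumes ConvertFromUnicodeJava (above) lives in the same module.
' TestConvertFromUnicodeJava is a hypothetical name for this example.
Sub TestConvertFromUnicodeJava()

    ' Korean escapes outside the old \u00A1..\u00FF range.
    Debug.Assert ConvertFromUnicodeJava("\uC548\uB155") = _
                 ChrW(&HC548) & ChrW(&HB155)

    ' An escape embedded in the middle of plain text.
    Debug.Assert ConvertFromUnicodeJava("Fran\u00E7ois") = _
                 "Fran" & ChrW(&HE7) & "ois"

    ' An escape at the end of the string.
    Debug.Assert ConvertFromUnicodeJava("Copyright \u00A9") = _
                 "Copyright " & ChrW(&HA9)

    ' The same escape repeated back to back.
    Debug.Assert ConvertFromUnicodeJava("\u00A9\u00A9\u00A9") = _
                 ChrW(&HA9) & ChrW(&HA9) & ChrW(&HA9)

    ' Extra case: text without escapes is returned unchanged.
    Debug.Assert ConvertFromUnicodeJava("plain text") = "plain text"

    ' Extra case: a malformed escape does not match the pattern.
    Debug.Assert ConvertFromUnicodeJava("\u00zz") = "\u00zz"

    Debug.Print "All ConvertFromUnicodeJava checks passed."

End Sub
```

Debug.Assert pauses execution on the first failing line when run in the VBA editor, which points directly at the case that regressed.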
