Escaping (escape, escapeURIComponent) and unescaping (unescape, unescapeURIComponent) work by treating the source String as binary and then forcing the encoding onto the output string. This breaks completely when the encoding does not represent ASCII characters as single bytes, as is the case for UTF-16. As far as I'm aware, all of this applies to every variant of UTF-16 and UTF-32.
For example, escaping does percent-encode each octet, which is technically correct, but the resulting characters are not encoded in UTF-16. Instead, the encoding is forced onto the raw ASCII byte values:
"\uFEFF".encode(Encoding::UTF_16LE).bytes
=> [255, 254]
CGI.escape_uri_component("\uFEFF".encode(Encoding::UTF_16LE)).bytes
=> [37, 70, 70, 37, 70, 69] # This is %FF%FE in ASCII
CGI.escape_uri_component("\uFEFF".encode(Encoding::UTF_16LE))
=> "\u4625\u2546\u4546" # But this is just 3 unrelated characters!
CGI.escape_uri_component("\uFEFF".encode(Encoding::UTF_16LE)).encoding
=> #<Encoding:UTF-16LE>
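For comparison, here is a minimal sketch of the output I would expect instead. The helper name escape_octets_as_chars is hypothetical and not part of CGI. It assumes that the octets to escape are the string's own bytes (as CGI does today) and only changes the last step, so that the ASCII result is transcoded into the source encoding rather than having that encoding forced onto it:

```ruby
# Hypothetical helper, not part of CGI: percent-encode the string's octets,
# then emit the ASCII result as real characters of the source encoding.
def escape_octets_as_chars(str)
  # Work on a binary copy so the regexp matches one octet at a time.
  escaped = str.b.gsub(/[^A-Za-z0-9\-._~]/n) { |byte| format("%%%02X", byte.ord) }
  # The escaped text is pure ASCII, so transcoding it is always possible.
  escaped.force_encoding(Encoding::US_ASCII).encode(str.encoding)
end

bom = "\uFEFF".encode(Encoding::UTF_16LE)
result = escape_octets_as_chars(bom)
p result.encoding                 # the source encoding, UTF-16LE
p result.encode(Encoding::UTF_8)  # the six characters %FF%FE
p result.bytes                    # two bytes per character
```

Whether the escaped octets should be the UTF-16 bytes or, say, a UTF-8 transcoding is a separate decision that this sketch does not try to settle.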
Unescaping, on the other hand, tries to interpret each run of three consecutive bytes as a percent-encoded octet. Since every character in the string is multibyte, there are always extra bytes in between, so unescaping cannot work as intended:
"%FE%FF".encode(Encoding::UTF_16LE).bytes
=> [37, 0, 70, 0, 69, 0, 37, 0, 70, 0, 70, 0]
CGI.unescape_uri_component("%FE%FF".encode(Encoding::UTF_16LE))
=> "%\u0000F\u0000E\u0000%\u0000F\u0000F\u0000" # This is in UTF-8, hence the extra NULs
CGI.unescape_uri_component("%FE%FF".encode(Encoding::UTF_16LE), Encoding::UTF_16LE)
=> "%FE%FF" # This is UTF-16, but not very unescaped
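A minimal sketch of character-level unescaping, for illustration only. The helper name unescape_chars_to_octets is hypothetical and not part of CGI, and it assumes that the escaped input consists of ASCII characters only, as a properly escaped component would. It reads the %XX triplets as characters, collects the decoded octets, and only then labels them with the target encoding:

```ruby
# Hypothetical helper, not part of CGI: decode %XX triplets by character
# rather than by raw byte, then interpret the octets in the target encoding.
def unescape_chars_to_octets(str, encoding = str.encoding)
  # Transcode first so that "%", "F" and "E" are single bytes again.
  text = str.encode(Encoding::UTF_8)
  # Replace each triplet with the octet it stands for.
  octets = text.b.gsub(/%(\h\h)/n) { $1.hex.chr }
  octets.force_encoding(encoding)
end

escaped = "%FF%FE".encode(Encoding::UTF_16LE)
decoded = unescape_chars_to_octets(escaped)
p decoded.bytes     # the two decoded octets
p decoded.encoding  # same encoding as the input
```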
But it may work by accident instead:
what = "┵〥㔷┴䔥㐵┴㐡".encode(Encoding::UTF_16BE)
=> "\u2535\u3025\u3537\u2534\u4525\u3435\u2534\u3421"
CGI.unescapeURIComponent(what)
=> "PWNED!"
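To see why this happens, it helps to look at the raw octets of that string. The short check below uses only core String methods; the variable names are just for illustration:

```ruby
# The same eight characters as above, written with \u escapes.
what = "\u2535\u3025\u3537\u2534\u4525\u3435\u2534\u3421".encode(Encoding::UTF_16BE)

# Viewed as bytes, the UTF-16BE data happens to spell ASCII percent-escapes.
raw = what.b
p raw

# Decoding those escapes byte by byte gives the surprising result.
decoded = raw.gsub(/%(\h\h)/n) { $1.hex.chr }
p decoded
```

The first line printed is "%50%57%4E%45%44!" and the second is "PWNED!", so the byte-oriented unescaping sees valid escapes that the UTF-16 characters never contained.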
Round-tripping appears to work, but only because the two methods are inverses of each other. In reality, the "escaped" string is definitely not valid:
CGI.unescape_uri_component(CGI.escapeURIComponent("A %$\u1234".encode(Encoding::UTF_16LE)), Encoding::UTF_16LE)
=> "A %$\u1234"
CGI.escapeURIComponent(" %$\u1234".encode(Encoding::UTF_16LE))
=> "\u3225\u2530\u3030\u3225\u2535\u3030\u3225\u2534\u3030\u2534\u3231"
CGI.escapeURIComponent(" %$\u1234".encode(Encoding::UTF_16LE)).encode(Encoding::UTF_8)
=> "㈥┰〰㈥┵〰㈥┴〰┴㈱"
RFC 3986 specifies that the data must be encoded as a sequence of characters, and those characters can in turn be represented in whatever encoding is suitable. So this byte-oriented method of escaping and unescaping is definitely wrong.
I believe that this issue completely breaks interoperability with wchar_t APIs.
[un]escapeHTML seem to use a very different approach and do not appear to be susceptible to this:
CGI.escapeHTML("<html>".encode(Encoding::UTF_16LE))
=> "&lt;html&gt;"
CGI.unescapeHTML(CGI.escapeHTML("<html>".encode(Encoding::UTF_16LE)))
=> "<html>"
CGI.unescapeHTML(CGI.escapeHTML("<html>".encode(Encoding::UTF_16LE))).encoding
=> #<Encoding:UTF-16LE>
[un]escapeElement, on the other hand, use UTF-8 strings internally and do not work with wide encodings at all:
CGI.escapeElement("<html><body>".encode(Encoding::UTF_16LE), "body".encode(Encoding::UTF_16LE))
# <...>/lib/ruby/3.3.0/cgi/util.rb:187:in `escapeElement': incompatible character encodings: UTF-8 and UTF-16LE (Encoding::CompatibilityError)
CGI.unescapeElement("<html><body>".encode(Encoding::UTF_16LE), "body".encode(Encoding::UTF_16LE))
# <...>/lib/ruby/3.3.0/cgi/util.rb:207:in `unescapeElement': incompatible character encodings: UTF-8 and UTF-16LE (Encoding::CompatibilityError)
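Until this is addressed, one possible stopgap is to transcode around the call. The sketch below is an assumption on my part, not something CGI provides: via_utf8 is a hypothetical helper that hands a UTF-8 copy to a block and transcodes the block's result back to the original encoding.

```ruby
# Hypothetical helper, not part of CGI: run a block on a UTF-8 copy of the
# string and transcode whatever the block returns back to the source encoding.
def via_utf8(str)
  yield(str.encode(Encoding::UTF_8)).encode(str.encoding)
end

wide = "<html><body>".encode(Encoding::UTF_16LE)
# Any UTF-8-only operation can go in the block; upcase is a stand-in here.
shouted = via_utf8(wide) { |s| s.upcase }
p shouted.encoding                 # still the wide encoding
p shouted.encode(Encoding::UTF_8)  # the upcased text
```

In the same way, via_utf8(wide) { |s| CGI.escapeElement(s, "body") } should avoid the CompatibilityError, since the method then only ever sees UTF-8 arguments. I have not verified this against every method.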
(This session was on Ruby 3.3, but the same happens in 3.4 and I believe that there is no difference in the extracted gem either.)