Escaping/unescaping URIs and elements doesn't work for UTF-16/32

Escaping (`escape`, `escapeURIComponent`) and unescaping (`unescape`, `unescapeURIComponent`) work by [interpreting the source String as binary and forcing encoding on the output string](https://github.com/ruby/cgi/blob/master/lib/cgi/escape.rb#L22). However, this breaks completely for encodings that do not store ASCII-range characters as single bytes, such as UTF-16. As far as I'm aware, all of this applies to every variant of UTF-16 and UTF-32.

For example, escaping does percent-encode each octet correctly, but the resulting characters are not transcoded to UTF-16; instead, the encoding is forced onto the raw ASCII bytes:
```ruby
"\uFEFF".encode(Encoding::UTF_16LE).bytes
=> [255, 254]
CGI.escape_uri_component("\uFEFF".encode(Encoding::UTF_16LE)).bytes
=> [37, 70, 70, 37, 70, 69] # This is %FF%FE in ASCII
CGI.escape_uri_component("\uFEFF".encode(Encoding::UTF_16LE))
=> "\u4625\u2546\u4546" # But this is just 3 unrelated characters!
CGI.escape_uri_component("\uFEFF".encode(Encoding::UTF_16LE)).encoding
=> #<Encoding:UTF-16LE>
```
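The difference between relabelling bytes and transcoding characters can be reproduced without `CGI` at all. This is only a sketch of the mechanism, not the gem's actual code, and the helper names `relabel` and `transcode` are made up for illustration:

```ruby
# Hypothetical helpers, not part of the cgi gem.

# Keeps the bytes as they are and only changes the encoding label.
def relabel(ascii, encoding)
  ascii.dup.force_encoding(encoding)
end

# Rewrites the bytes so that the same characters survive in the new encoding.
def transcode(ascii, encoding)
  ascii.encode(encoding)
end

relabelled = relabel("%FF%FE", Encoding::UTF_16LE)
transcoded = transcode("%FF%FE", Encoding::UTF_16LE)

p relabelled.bytes   # the six ASCII bytes, unchanged
p relabelled.length  # 3, each pair of ASCII bytes is read as one code unit
p transcoded.bytes   # every ASCII byte is now followed by a zero byte
p transcoded.length  # 6, the characters % F F % F E
```

Only the transcoded string still reads as `%FF%FE` to a consumer that expects UTF-16LE.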

On the other hand, unescaping tries to interpret sequential triplets of bytes as a percent-encoded octet, but with the characters in the string *all* being multibyte, there will always be extra bytes in between, making unescaping completely impossible:
```ruby
"%FE%FF".encode(Encoding::UTF_16LE).bytes
=> [37, 0, 70, 0, 69, 0, 37, 0, 70, 0, 70, 0]
CGI.unescape_uri_component("%FE%FF".encode(Encoding::UTF_16LE))
=> "%\u0000F\u0000E\u0000%\u0000F\u0000F\u0000" # This is in UTF-8, hence the extra NULs
CGI.unescape_uri_component("%FE%FF".encode(Encoding::UTF_16LE), Encoding::UTF_16LE)
=> "%FE%FF" # This is UTF-16, but not very unescaped
```
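For comparison, here is a minimal sketch of what unescaping could look like if the escaped text were transcoded to ASCII before looking for `%XX` triplets. The name `unescape_octets` is hypothetical, and the sketch assumes the escaped input contains only ASCII characters and that the decoded octets are meant to be read in the input's own encoding:

```ruby
# Hypothetical helper, not part of the cgi gem.
def unescape_octets(str)
  # Transcode (not relabel) to ASCII so that "%", "F", "E" are single bytes.
  # This raises if the input contains non-ASCII characters.
  ascii = str.encode(Encoding::US_ASCII)

  # Replace every %XX triplet with the octet it stands for.
  octets = ascii.b.gsub(/%(\h\h)/) { Regexp.last_match(1).hex.chr }

  # The decoded octets are the payload in the original encoding.
  octets.force_encoding(str.encoding)
end

bom = unescape_octets("%FF%FE".encode(Encoding::UTF_16LE))
p bom.encoding
p bom.bytes
```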

But it may also work by accident:
```ruby
what = "┵〥㔷┴䔥㐵┴㐡".encode(Encoding::UTF_16BE)
=> "\u2535\u3025\u3537\u2534\u4525\u3435\u2534\u3421"
CGI.unescapeURIComponent(what)
=> "PWNED!"
```
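Such an input does not have to be found by chance. As a sketch (the helper name `disguise` is hypothetical, and UTF-16 is assumed so the payload needs an even number of bytes), relabelling an ASCII percent-encoded payload produces exactly this kind of string:

```ruby
# Hypothetical helper, not part of the cgi gem.
def disguise(ascii_payload, encoding)
  # An odd byte count cannot form whole UTF-16 code units.
  raise ArgumentError, "payload needs an even byte count" if ascii_payload.bytesize.odd?

  # Same bytes, different label: the percent signs disappear from view.
  ascii_payload.dup.force_encoding(encoding)
end

crafted = disguise("%50%57%4E%45%44!", Encoding::UTF_16BE)
p crafted.length           # 8 code units for 16 payload bytes
p crafted.valid_encoding?
```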

Round-tripping *looks* like it works, but only because the methods are inverses of each other. In reality, the "escaped" string is definitely not valid:
```ruby
CGI.unescape_uri_component(CGI.escapeURIComponent("A %$\u1234".encode(Encoding::UTF_16LE)), Encoding::UTF_16LE)
=> "A %$\u1234"
CGI.escapeURIComponent(" %$\u1234".encode(Encoding::UTF_16LE))
=> "\u3225\u2530\u3030\u3225\u2535\u3030\u3225\u2534\u3030\u2534\u3231"
CGI.escapeURIComponent(" %$\u1234".encode(Encoding::UTF_16LE)).encode(Encoding::UTF_8)
=> "㈥┰〰㈥┵〰㈥┴〰┴㈱"
```

[RFC 3986 specifies](https://www.rfc-editor.org/rfc/rfc3986#section-2) that data needs to be encoded as a sequence of characters, which in turn can use whatever encoding is suitable. So this method of escaping/unescaping is definitely wrong.
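Under that reading, a conforming escape would percent-encode the octets and then emit the resulting ASCII characters in the target encoding. A minimal sketch, assuming the octets to encode are the string's bytes in its own encoding and that the output should use that same encoding (`escape_octets` is a hypothetical name, not an existing API):

```ruby
# Hypothetical helper, not part of the cgi gem.
def escape_octets(str)
  ascii = str.each_byte.map do |byte|
    char = byte.chr
    # RFC 3986 unreserved characters pass through; everything else becomes %XX.
    char.match?(/[A-Za-z0-9._~-]/) ? char : format("%%%02X", byte)
  end.join

  # Transcode the ASCII result instead of relabelling its bytes.
  ascii.encode(str.encoding)
end

escaped = escape_octets("\uFEFF".encode(Encoding::UTF_16LE))
p escaped.encoding
p escaped.encode(Encoding::UTF_8)
```

Note that this escapes per octet, so an ASCII letter stored as UTF-16LE comes out as the letter followed by `%00`; escaping per character would be a different design.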

I believe that this issue completely breaks interoperability with `wchar_t` APIs.

---

`[un]escapeHTML` seem to use a very different approach and do not appear to be susceptible to this:
```ruby
CGI.escapeHTML("<html>".encode(Encoding::UTF_16LE))
=> "&lt;html&gt;"
CGI.unescapeHTML(CGI.escapeHTML("<html>".encode(Encoding::UTF_16LE)))
=> "<html>"
CGI.unescapeHTML(CGI.escapeHTML("<html>".encode(Encoding::UTF_16LE))).encoding
=> #<Encoding:UTF-16LE>
```

`[un]escapeElement`, on the other hand, use UTF-8 strings internally and do not work with wide encodings at all:
```ruby
CGI.escapeElement("<html><body>".encode(Encoding::UTF_16LE), "body".encode(Encoding::UTF_16LE))
# <...>/lib/ruby/3.3.0/cgi/util.rb:187:in `escapeElement': incompatible character encodings: UTF-8 and UTF-16LE (Encoding::CompatibilityError)
CGI.unescapeElement("<html>&lt;body&gt;".encode(Encoding::UTF_16LE), "body".encode(Encoding::UTF_16LE))
# <...>/lib/ruby/3.3.0/cgi/util.rb:207:in `unescapeElement': incompatible character encodings: UTF-8 and UTF-16LE (Encoding::CompatibilityError)
```

(This session was on Ruby 3.3, but the same happens in 3.4 and I believe that there is no difference in the extracted gem either.)
