ID:2883253
 
Resolved
url_encode() and list2params() didn't correctly skip BYOND format characters.
BYOND Version:514
Operating System:n/a
Web Browser:
Applies to:Dream Daemon
Status: Resolved (515.1611)

This issue has been resolved.
Descriptive Problem Summary:
A lone surrogate is effectively a malformed UTF-16 pair. This makes topic responses using params that do not sanitize on the byond side fail or throw errors for recipients that do not themselves then strip bad uri encodings once received.

I haven't checked but I assume DD->DD topics understand these just fine, so this is probably only a trip hazard for communication with things like server panels and bots. That second part is only available because of protocol reverse engineering, but it's a major existing use case.

This is an issue for plain and json responses as well, but easiest to illustrate with params because raw UTF-8 handles these as just mangled garbage instead.


Numbered Steps to Reproduce Problem:
1. Create text with a macro with a subject that cannot or will not be immediately resolved to plain text.
2. Emit it via topic or to a function in a client browser pane.


Code Snippet (if applicable) to Reproduce Problem:
/world/Topic(query)
if (query != "uritest")
return
var/mob/mob = new
return list2params(list(
"\the [mob]",
"\the cat",
"\a [mob]",
"\a cat",
"[mob] \he",
"[mob] \He",
"[mob] \she",
"[mob] \She",
"[mob] \his",
"[mob] \His",
"[mob] \him",
"[mob] \himself",
"[mob] \herself",
"[mob] \hers",
"\improper [mob]",
"\proper [mob]",
"[7]\th",
"[5] [mob]\s",
"\roman [53]",
"\Roman [53]",
"\...",
"\<",
"\>"
))



Results:
Here, I've replaced & with \n. I haven't bothered with all possible combinations; `\the [mob]` and `\the cat` provide an example of a resolved vs unresolved macro with a subject, and the behavior applies to the others as well.
the+mob
%ff%08cat
a+mob
%ff%06cat
The+mob+it
The+mob+It
The+mob+it
The+mob+It
The+mob+its
The+mob+Its
The+mob+it
The+mob+itself
The+mob+itself
The+mob+its
%ff%16the+mob
%ff%15the+mob
7th
5+the+mobs
+liii
+LIII
%ff%12
%26lt%3b
%26gt%3b


The %ff%xx pairs are the problems. By testing they're produced when the macro cannot be resolved ("\the cat" is bad, "\the [mob]" is good), and always in the case of \improper, \proper, and \... because these only seem relevant to the output control.


Workarounds:
JS:
Use decode(topicResponse) (deprecated but probably still in common use)
Or use decodeURI(topicResponse.toWellFormed()) and then strip instances of a (common-use) symbol the db backing this forum can't store.
( https://developer.mozilla.org/en-US/docs/Web/JavaScript/ Reference/Global_Objects/String/toWellFormed )

Python:
Nothing built in, but pre-filtering for bad pairs is possible
( https://github.com/python/cpython/issues/ 72158#issuecomment-1093724712 )

Byond:
Pre-strip all instances of text macros that still exist at send time. I don't believe you can do this with regex since they're all invalid symbols, and so this actually becomes something like:
/proc/strip_macros(text)
var/static/list/macros = list("\a", "\improper", <...all the subject macros>)
for (var/macro_text in macros)
text = replacetext(text, macro_text, "")
return text

Which of course is not exactly ideal.

I've posted this FR in the past https://www.byond.com/forum/post/2777898 which would resolve the issue by making these into valid magic symbols (which comes with its own set of handling concerns, but at least doesn't crash uri decoders), but is work on lummox's end.
Is this only an issue for list2params(), and/or possibly url_encode()?

BYOND strings with format codes shouldn't get through to most outputs generally, except stuff like \... as you mentioned, but it seems the main issue is just the URI encoding.
Having done some quick checking:

Raw text is of course not transformed and is just emitted as bad bytes that produce garbage symbols for those bytes. Since DD probably doesn't know if a query has come from another DD or not, filtering for unresolved macros is probably not realistic when transforming a string to net bytes.

json_encode actually protects against this, contrary to what I thought initially, by stripping the symbols before emitting; this is good behavior since json isn't allowed to not be UTF-8.

url_encode has the same behavior as list2params.
/world/Topic(query)
if (query == "plain")
return "pat \the cat"
if (query == "enc")
return url_encode("pat \the cat")
if (query == "params")
return list2params(list("pat \the cat"))
if (query == "json")
return json_encode(list("str" = "pat \the cat"))

plain Invalid string payload.
enc pat+%ff%08cat
params pat+%ff%08cat
json {"str":"pat cat"}
Lummox JR resolved issue with message:
url_encode() and list2params() didn't correctly skip BYOND format characters.