A lone surrogate is effectively a malformed UTF-16 pair. This makes topic responses using params that do not sanitize on the byond side fail or throw errors for recipients that do not themselves then strip bad uri encodings once received.
I haven't checked but I assume DD->DD topics understand these just fine, so this is probably only a trip hazard for communication with things like server panels and bots. That second part is only available because of protocol reverse engineering, but it's a major existing use case.
This is an issue for plain and json responses as well, but easiest to illustrate with params because raw UTF-8 handles these as just mangled garbage instead.
Numbered Steps to Reproduce Problem:
1. Create text with a macro with a subject that cannot or will not be immediately resolved to plain text.
2. Emit it via topic or to a function in a client browser pane.
Code Snippet (if applicable) to Reproduce Problem:
/world/Topic(query)
if (query != "uritest")
return
var/mob/mob = new
return list2params(list(
"\the [mob]",
"\the cat",
"\a [mob]",
"\a cat",
"[mob] \he",
"[mob] \He",
"[mob] \she",
"[mob] \She",
"[mob] \his",
"[mob] \His",
"[mob] \him",
"[mob] \himself",
"[mob] \herself",
"[mob] \hers",
"\improper [mob]",
"\proper [mob]",
"[7]\th",
"[5] [mob]\s",
"\roman [53]",
"\Roman [53]",
"\...",
"\<",
"\>"
))
Results:
Here, I've replaced & with \n. I haven't bothered with all possible combinations; `\the [mob]` and `\the cat` provide an example of a resolved vs unresolved macro with a subject, and the behavior applies to the others as well.
the+mob
%ff%08cat
a+mob
%ff%06cat
The+mob+it
The+mob+It
The+mob+it
The+mob+It
The+mob+its
The+mob+Its
The+mob+it
The+mob+itself
The+mob+itself
The+mob+its
%ff%16the+mob
%ff%15the+mob
7th
5+the+mobs
+liii
+LIII
%ff%12
%26lt%3b
%26gt%3b
The %ff%xx pairs are the problems. By testing they're produced when the macro cannot be resolved ("\the cat" is bad, "\the [mob]" is good), and always in the case of \improper, \proper, and \... because these only seem relevant to the output control.
Workarounds:
JS:
Use decode(topicResponse) (deprecated but probably still in common use)
Or use decodeURI(topicResponse.toWellFormed()) and then strip instances of a (common-use) symbol the db backing this forum can't store.
( https://developer.mozilla.org/en-US/docs/Web/JavaScript/ Reference/Global_Objects/String/toWellFormed )
Python:
Nothing built in, but pre-filtering for bad pairs is possible
( https://github.com/python/cpython/issues/ 72158#issuecomment-1093724712 )
Byond:
Pre-strip all instances of text macros that still exist at send time. I don't believe you can do this with regex since they're all invalid symbols, and so this actually becomes something like:
/proc/strip_macros(text)
var/static/list/macros = list("\a", "\improper", <...all the subject macros>)
for (var/macro_text in macros)
text = replacetext(text, macro_text, "")
return text
Which of course is not exactly ideal.
I've posted this FR in the past https://www.byond.com/forum/post/2777898 which would resolve the issue by making these into valid magic symbols (which comes with its own set of handling concerns, but at least doesn't crash uri decoders), but is work on lummox's end.
BYOND strings with format codes shouldn't get through to most outputs generally, except stuff like \... as you mentioned, but it seems the main issue is just the URI encoding.