How BYOND used to handle characters
In the old days, all characters in BYOND were 8-bit and were defined by whatever locale your system used. This is often called "ANSI" encoding. If your characters used only code values from 1 to 127, then it could be called ASCII, which is only 7-bit. ASCII is pretty much universal, whereas ANSI used your locale to determine the value of characters 128 through 255.
One happy consequence of this way of handling characters was that one character was basically always one byte, which made text very easy to work with in BYOND's built-in routines like copytext(), findtext(), etc.
UTF-8 encoding
Unicode characters can have code values anywhere from 1 to 0x10FFFF, which is a lot. Obviously you can't fit all that in one byte, so BYOND uses a form of encoding internally called UTF-8. Any characters beyond ASCII (1 to 127) are turned into multi-byte sequences like so:
Characters 1-0x7F: 1 byte
0x80-0x7FF: 2 bytes
0x800-0xFFFF: 3 bytes
0x10000-0x10FFFF: 4 bytes
All of the standard text procs in BYOND still use byte positions when talking about the position of a character, rather than counting characters. Since the first character can take anywhere from 1 to 4 bytes, the 2nd character may not be at position 2 like it was in older BYOND; it may be anywhere from 2 to 5. The text routines still use bytes because a byte position is much faster to work with; having to count characters would slow everything down considerably.
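To make the byte-versus-character distinction concrete, here is a minimal sketch. The proc name ByteVsCharDemo is made up for this example, the string uses the \u escape described under "Other notes", and the numbers in the comments are derived from the encoding table above rather than from a test run.

```dm
// Hypothetical demo proc; the name is not part of BYOND.
proc/ByteVsCharDemo()
    // "caf" + U+00E9 (e with acute accent), which falls in the 2-byte range
    var/t = "caf\u00E9"
    // length() counts bytes: 3 one-byte characters + 1 two-byte character = 5
    world.log << "bytes: [length(t)]"
    // length_char() counts characters: 4
    world.log << "chars: [length_char(t)]"
    // findtext() reports a byte position; the accented letter starts at byte 4
    world.log << "found at byte: [findtext(t, "\u00E9")]"
```

For pure ASCII text the byte and character counts are identical, which is why most existing code keeps working unchanged.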
What this means for your code
In places where you use findtext(), length(), and copytext() together, it's very unlikely there will be an issue at all. For instance, this code will be the same as it ever was:
proc/SnipWordOnce(text, word)
    . = text
    var/i = findtext(text, word)
    if(i)
        . = copytext(text, 1, i) + copytext(text, i + length(word))
Because findtext() produces a byte index and length() returns a length in bytes, you'll have no trouble using a byte index in copytext() no matter what the actual text is.
There are however some situations that may not be as easy, like if you're trying to count one character at a time:
// This works fine in BYOND 512 but breaks for non-ASCII in 513+
proc/AnimateMaptext(atom/object, text)
    var/i, len=length(text)
    // copytext() stops just before its end position, so run to len+1 to show the whole text
    for(i=2, i<=len+1, ++i)
        object.maptext = copytext(text, 1, i)
        sleep(1)
To handle these special cases, BYOND has introduced _char versions of most of the text procs, like findtext_char() and copytext_char(). Here is one way you could, but shouldn't, update the previous code:
// This works in 513+ but performance sucks for long strings
proc/AnimateMaptext(atom/object, text)
    var/i, len=length_char(text)
    // copytext_char() also stops just before its end position, so run to len+1
    for(i=2, i<=len+1, ++i)
        object.maptext = copytext_char(text, 1, i)
        sleep(1)
Counting characters is not necessarily a fast process, and here it has to happen again on every pass through the loop, so for long text strings this would be slow. Here's a better way:
// This is better (and will also work in 512!)
proc/AnimateMaptext(atom/object, text)
    var/i, len=length(text)
    for(i=1, i<=len,)
        // the read-only [] operator will grab the whole character at byte i
        i += length(text[i])
        object.maptext = copytext(text, 1, i)
        sleep(1)
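The same byte-stepping idea carries over to any loop that needs to visit one character at a time. As a minimal sketch, the helper below collects each whole character into a list. SplitChars is a made-up name, not a BYOND built-in, and the code relies on the [] behavior described in the comment above.

```dm
// Hypothetical helper: split text into a list of whole characters
// while only ever working with byte positions.
proc/SplitChars(text)
    . = list()
    var/i = 1
    var/len = length(text)          // length in bytes
    while(i <= len)
        var/ch = text[i]            // whole character starting at byte i
        . += ch
        i += length(ch)             // advance by that character's byte length
```

Because i always advances by the byte length of the character just read, it only ever lands on the first byte of a character, and no character counting is needed.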
Quick tips
- Never assume one byte is one character anymore.
- Prefer the non-char versions of any text handling procs over the _char versions whenever possible.
- If you have to use something like length_char(text)-offset as a position, use a negative position instead, since most text procs allow negative indexes that count back from the end of the text; this will be much faster.
- In your HTML interfaces for the browser element, add <meta charset="UTF-8"> so you get the right encoding.
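As an illustration of the negative-index tip, here is a minimal sketch that returns the last few characters of a string. LastChars is a hypothetical helper name, and the sketch assumes that a position of -1 refers to the last character, so that -n lines up with length_char(text) - n + 1.

```dm
// Hypothetical helper: return the last n characters of text.
proc/LastChars(text, n)
    if(n <= 0)
        return ""
    // Slower form, shown for comparison only; it counts every character first:
    //   return copytext_char(text, length_char(text) - n + 1)
    // Faster form: a negative start position counts back from the end.
    return copytext_char(text, -n)
```

The negative form skips the separate length_char() call entirely, which is where the saving comes from.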
Locale issues
513 and older versions will talk to each other as best they're able, but ultimately the goal is for everyone to upgrade. Therefore mixing 512 servers with 513 clients or vice-versa could introduce some problems. Chances are you'll want to upgrade both at the same time, so if your server now uses 513 you should require users to update as well.
Because older versions are limited to 8-bit characters, they're bound to whatever locale they're using. However, older versions do not send their locale info to 513 because, well, they weren't really prepared to do so. As a result 513 will up-convert based on the locale it's running on, not the sender's locale. In practice that shouldn't be a huge problem; most Russian-dedicated SS13 servers, for instance, had all their players using a Russian locale.
Similarly, Dream Maker will up-convert old code files from ANSI to UTF-8 using its current locale, and will compile any converted strings as UTF-8.
Other notes
Strings in the DM language can now use three new escape sequences if you want to use a special character:
\xNN
\uNNNN
\UNNNNNN
The Ns represent hexadecimal digits, so \x can create characters from 1 to 255; \u can do up to 0xFFFF; and \U can do any character supported by Unicode. You must use the exact right number of digits with the escape sequence, so you may need to pad with zeroes. For instance the bacon emoji 0x1F953 should be written as \U01F953.
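Here is a short sketch of the three escape forms side by side. EscapeDemo is a made-up proc name, and the byte counts in the comments are worked out from the UTF-8 ranges listed earlier rather than taken from a test run.

```dm
// Hypothetical demo proc; the name is not part of BYOND.
proc/EscapeDemo()
    var/e_acute = "\xE9"      // U+00E9 via the 2-digit form
    var/euro = "\u20AC"       // U+20AC via the 4-digit form
    var/bacon = "\U01F953"    // U+1F953, zero-padded to 6 digits
    // Byte lengths follow the UTF-8 ranges: 2, 3, and 4 bytes respectively
    world.log << "[length(e_acute)] [length(euro)] [length(bacon)]"
    // Each string is still a single character
    world.log << "[length_char(e_acute)] [length_char(euro)] [length_char(bacon)]"
```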