Updating for Unicode in BYOND 513


ID:2520672

Nov 4 2019, 11:09 am

Lummox JR

If your game uses accented or non-English characters at all, you definitely need to update it for Unicode support in BYOND 513. Here are the basics:

How BYOND used to handle characters

In the old days, all characters in BYOND were 8-bit and were defined by whatever locale your system used. This is often called "ANSI" encoding. If your characters used only code values from 1 to 127, then it could be called ASCII, which is only 7-bit. ASCII is pretty much universal, whereas ANSI used your locale to determine the value of characters 128 through 255.

One happy consequence of this way of handling characters was that one character was basically always one byte, which made text very easy to work with in BYOND's built-in routines like copytext(), findtext(), etc.

UTF-8 encoding

Unicode characters can have code values anywhere from 1 to 0x10FFFF, which is a lot. Obviously you can't fit all that in one byte, so BYOND uses a form of encoding internally called UTF-8. Any characters beyond ASCII (1 to 127) are turned into multi-byte sequences like so:

Characters 0x01-0x7F: 1 byte
Characters 0x80-0x7FF: 2 bytes
Characters 0x800-0xFFFF: 3 bytes
Characters 0x10000-0x10FFFF: 4 bytes
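
As a rough illustration of the table above, here is a minimal sketch of a helper that returns how many bytes UTF-8 needs for a given code value. Utf8ByteCount() is a hypothetical name, not a built-in proc; BYOND does this encoding for you internally.

```dm
// Hypothetical helper (not part of BYOND): bytes needed to store
// one character with the given code value in UTF-8.
proc/Utf8ByteCount(code)
    if(code < 0x80)
        return 1    // ASCII
    if(code < 0x800)
        return 2
    if(code < 0x10000)
        return 3
    return 4        // up to 0x10FFFF
```

For example, Utf8ByteCount(0xE9) (the letter é) gives 2, and Utf8ByteCount(0x1F953) gives 4.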

All of the text procs in BYOND still use byte positions, rather than character counts, when talking about the position of a character. That means the 2nd character may not be at position 2 like it was in older BYOND; it may be anywhere from 2 to 5. The text routines still use bytes because working with a byte position is much faster; having to count characters would slow everything down considerably.
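
To see byte positions and character positions diverge, a small sketch like this can help. ByteVsCharDemo() is a made-up name, and the values in the comments are what the byte rules above predict, assuming BYOND 513 or later.

```dm
// Hypothetical demo proc; assumes BYOND 513+.
proc/ByteVsCharDemo()
    var/t = "\u00E9a"                   // "é" followed by "a"
    world.log << length(t)              // 3 bytes: 2 for é, 1 for a
    world.log << length_char(t)         // 2 characters
    world.log << findtext(t, "a")       // 3, the byte position of "a"
    world.log << findtext_char(t, "a")  // 2, the character position of "a"
```

The byte position from findtext() is the one to pass to copytext(); the character position only belongs with the _char procs.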

What this means for your code

In places where you use findtext(), length(), and copytext() together, it's very unlikely there will be an issue at all. For instance, this code will be the same as it ever was:

proc/SnipWordOnce(text, word)
    . = text
    var/i = findtext(text, word)
    if(i)
        . = copytext(text, 1, i) + copytext(text, i + length(word))

Because findtext() produces a byte index and length() returns a length in bytes, you'll have no trouble using a byte index in copytext() no matter what the actual text is.

There are however some situations that may not be as easy, like if you're trying to count one character at a time:

// This works fine in BYOND 512 but breaks for non-ASCII in 513+
proc/AnimateMaptext(atom/object, text)
    var/i, len=length(text)
    // copytext() stops just before the end position, so go to len+1 to show the full text
    for(i=2, i<=len+1, ++i)
        object.maptext = copytext(text, 1, i)
        sleep(1)

To handle these special cases, BYOND has introduced _char versions of most of the text procs, like findtext_char() and copytext_char(). Here is one way you could, but shouldn't, update the previous code:

// This works in 513+ but performance sucks for long strings
proc/AnimateMaptext(atom/object, text)
    var/i, len=length_char(text)
    // copytext_char() stops just before the end position, so go to len+1 to show the full text
    for(i=2, i<=len+1, ++i)
        object.maptext = copytext_char(text, 1, i)
        sleep(1)

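Counting characters is not necessarily a fast process: every copytext_char() call has to walk the string to turn its character position into a byte position, so for long text strings this loop would be cumbersome. Here's a better way: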

// This is better (and will also work in 512!)
proc/AnimateMaptext(atom/object, text)
    var/i = 1, len = length(text)
    while(i <= len)
        // the read-only [] operator will grab the whole character at byte i
        i += length(text[i])
        // i now points just past that character, so this shows everything up to and including it
        object.maptext = copytext(text, 1, i)
        sleep(1)

Quick tips

Never assume one byte is one character anymore.
Prefer the non-char versions of any text handling procs over the _char versions whenever possible.
If you have to use something like length_char(text)-offset, use -offset alone instead, since most text procs allow negative indexes; this will be much faster.
In your HTML interfaces for the browser element, add <meta charset="UTF-8"> so you get the right encoding.
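
To illustrate the negative index tip, here is a minimal sketch that grabs the last few characters of a string. LastChars() is a hypothetical helper name, and it assumes n is at least 1.

```dm
// Hypothetical helper: returns the last n characters of text (n >= 1).
proc/LastChars(text, n)
    // Slower form: length_char() has to count every character first.
    //   return copytext_char(text, length_char(text) - n + 1)
    // Faster form: a negative start position counts back from the end.
    return copytext_char(text, -n)
```

Both forms should pick the same starting character; the negative index simply skips the full count.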

Locale issues

513 and older versions will talk to each other as best they're able, but ultimately the goal is for everyone to upgrade. Therefore mixing 512 servers with 513 clients or vice-versa could introduce some problems. Chances are you'll want to upgrade both at the same time, so if your server now uses 513 you should require users to update as well.

Older versions are limited to 8-bit characters, so they're bound to whatever locale they're using. However, older versions do not send their locale info to 513 because, well, they weren't really prepared to do so. As a result 513 will up-convert based on the locale it's running on, not the sender's locale. In practice that shouldn't be a huge problem; most Russian-dedicated SS13 servers for instance had all their players using a Russian locale.

Similarly, Dream Maker will up-convert old code files from ANSI to UTF-8 using its current locale, and will compile any converted strings as UTF-8.

Other notes

Strings in the DM language can now use three new escape sequences if you want to use a special character:

\xNN
\uNNNN
\UNNNNNN

The Ns represent hexadecimal digits, so \x can create characters from 1 to 255; \u can do up to 0xFFFF; and \U can do any character supported by Unicode. You must use the exact right number of digits with the escape sequence, so you may need to pad with zeroes. For instance the bacon emoji 0x1F953 should be written as \U01F953.
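
A short sketch of the three escapes side by side, assuming BYOND 513 or later. EscapeDemo() is just an illustrative name.

```dm
// Hypothetical demo proc; assumes BYOND 513+.
proc/EscapeDemo()
    var/a = "caf\xE9"       // \x takes exactly 2 hex digits
    var/b = "caf\u00E9"     // \u takes exactly 4, so pad with zeroes
    var/c = "\U01F953"      // \U takes exactly 6: the bacon emoji
    world.log << (a == b)   // both should spell the same text, "café"
    world.log << length(c)  // 4 bytes, per the UTF-8 table for 0x10000 and up
```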

Nov 6 2019, 8:06 am
Somepotato

How are you handling UTF-8 if the procs are supposedly really that slow? E.g. what's the implementation of the UTF-8 length proc?

Nov 6 2019, 9:35 am (Edited on Nov 6 2019, 10:13 am)

Lummox JR

The UTF-8 length proc literally just counts bytes that aren't in the range 0x80 to 0xBF. It has one form that can count to the trailing null, and another that can count up to an end pointer.

There are also "count to" procs that can count forwards or backwards by a certain number of characters (to translate from a character index to a byte index), which work the same way.

For small positive or negative index values, the _char procs won't be very different from their byte-based versions. However if you're doing intensive processing and end up in the middle of a long string, that's when performance will degrade considerably.

[edit]
It's also important to note that for procs like findtext_char(), there are potentially three conversions: the start and end positions get converted from characters to bytes, and the result gets converted from bytes to characters. The logic is optimized to avoid whatever conversions it can, like for instance if the end position is 0.
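
The counting rule described above can be modeled in DM like this. It is only a sketch: BYOND's real routine is native code working on raw string memory, which isn't shown in this thread, and CountUtf8Chars() with its list of byte values is a hypothetical stand-in.

```dm
// Hypothetical model of the rule: count every byte that is NOT in
// the range 0x80 to 0xBF (those are UTF-8 continuation bytes).
proc/CountUtf8Chars(list/bytes)
    . = 0
    for(var/b in bytes)
        if(b < 0x80 || b > 0xBF)
            . += 1
```

For example, CountUtf8Chars(list(0x63, 0xC3, 0xA9)) takes the three bytes of "cé" and returns 2, since 0xA9 is a continuation byte.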

Nov 9 2019, 1:15 pm In response to Lummox JR
WolfETD

1)What was if i had use _char to list var?
2)How much _char slow byte proc?
3)Will be good if DM had proc for fast determining simple or list var.

Nov 9 2019, 3:46 pm (Edited on Nov 13 2019, 12:29 pm) In response to WolfETD
Lummox JR

WolfETD wrote:

1)What was if i had use _char to list var?

I don't understand what you mean.

2)How much _char slow byte proc?

How much slower the _char procs are is hard to quantify.

3)Will be good if DM had proc for fast determining simple or list var.

islist() tells you if something is a list or not.

Nov 12 2019, 3:51 pm

Pokemonred200

If I may ask, why was length_char() introduced rather than repurposing lentext()? (Which now gives deprecation warnings, as per my recent compiles of the rc5 library)

I still need to work on reworking that library to be UTF-8 compatible if possible, but lentext's limited use could have rendered it a great candidate, especially since pre-513 uses of it would likely have been for character counts anyway.

Nov 12 2019, 4:14 pm
Nadrew

length_char() is more consistent with the other procs; that alone is more than enough reason for me.

Nov 13 2019, 11:07 am In response to Lummox JR
WolfETD

1)
    var/list/temp = list(1,2,3)
    length_char(temp)
Why proc reacting?

4)
    var/temp = list()
    temp["1"] = "it first"
    temp["2"] = "it second"
Why length() and length_char() was reacting on it variable list?

Nov 13 2019, 12:30 pm
Lummox JR

length_char() just shortcuts to length() when not dealing with a string.

Nov 20 2019, 11:48 am

In response to Lummox JR

Somepotato

Lummox JR wrote:

The UTF-8 length proc literally just counts bytes that aren't in the range 0x80 to 0xBF. It has one form that can count to the trailing null, and another that can count up to an end pointer.


That's not valid UTF-8; and UTF-8 doesn't use NULLs.

Nov 20 2019, 1:58 pm

In response to Somepotato

Lummox JR

You're confusing the character encoding method with the structure of the string.

Whether a string stores its length as an integer, or whether it's null-terminated, is a completely different concern from how the characters are encoded—with the obvious exception that a 0 byte, or 0 integer in the case of wide chars, is not allowed in null-terminated strings since it's used as the terminator.

The character encoding however is not about the string; it's about how each valid character is represented by a sequence of bytes.

The method used internally to count the number of characters in a UTF-8 string is of course based on both the structure of the string and the encoding of the characters in it.

Nov 20 2019, 2:05 pm

In response to Lummox JR

Somepotato


I misunderstood your use of NULL as the end-of-string marker.

Consider using something like !(ch & 0x80 && !(ch & 0x40)) instead of a range check.

Nov 20 2019, 2:59 pm In response to Somepotato
Lummox JR

It already is a bitwise check.
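
For what it's worth, the range test and the bitwise test pick out the same bytes. Here is a quick sketch that checks this over every byte value; ChecksAgree() is a hypothetical name, and neither form is claimed to be BYOND's actual internal code.

```dm
// Hypothetical check: do the range test and the bitwise test agree
// on which bytes are NOT continuation bytes?
proc/ChecksAgree()
    for(var/b in 0 to 255)
        var/by_range = !(b >= 0x80 && b <= 0xBF)
        var/by_bits = !((b & 0x80) && !(b & 0x40))
        if(by_range != by_bits)
            return FALSE
    return TRUE
```

A continuation byte has its top bit set and the next bit clear, which is exactly the 0x80 to 0xBF range, so this should return TRUE.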

Apr 27 2020, 2:14 pm
Bumblemore

Wow, all this jargon doesn't make sense to me, but I've downloaded the 513 Beta and I'm eager to see what it looks like. I was really impressed when I downloaded Dream Maker again and found out how far the pixel editor had come.