How Unicode Works: What every developer needs to know about strings and 🦄


Slack screenshot - Maybe U need code but I’ve got plenty of code #dadjokes

Waaay back in 2003 Joel Spolsky wrote about Unicode and why every developer should understand what it is and why it’s important. I remember reading that article (and have since forgotten most of it) but it really struck me how important character sets and Unicode are. I figured it was about time I revisited our old friend Unicode and why it’s important in today’s emoji filled world 🦄💩. You might not realize it, but you’re already working with Unicode if you’re working with WordPress! So let’s see what it is and why it matters to developers.

ASCII encoding

Before we get into Unicode we need to do a little bit of history (my 4 year history degree finally getting use 🎉). Back in the day when Unix was getting invented, characters were represented with 8 bits (1 byte) of memory. In those days memory usage was a big deal since, you know, computers had so little. David C. Zentgraf has a great example about how this works on his blog:

01100010 01101001 01110100 01110011
b        i        t        s

All those 1s and 0s are binary, and they represent each character beneath. But writing in binary is hard work, and uh, would suck if you had to do it all the time. ASCII was created to help with this and is essentially a lookup table of bytes to characters.
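As a rough sketch, the byte-to-character lookup above can be reproduced in JavaScript. The helper name `toBinary` is made up for this illustration, and it assumes the input only contains ASCII characters.

```javascript
// Convert each character of an ASCII string to its 8-bit binary form.
// toBinary is a hypothetical helper name used only for this illustration.
function toBinary(text) {
  return Array.from(text, (ch) =>
    // charCodeAt gives the numeric code; toString(2) writes it in binary.
    ch.charCodeAt(0).toString(2).padStart(8, '0')
  ).join(' ');
}

console.log(toBinary('bits'));
// 01100010 01101001 01110100 01110011
```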

ASCII Table

The ASCII table has 128 standard characters, including upper and lower case a-z and 0-9. Only 95 of those are printable characters (the rest are control codes like newline and tab), which sounds fine if you speak English. Each character only requires 7 bits, so there’s a whole bit left over! This led to the creation of extended ASCII tables, which add 128 more fancy things like Ç and Æ as well as other characters. Unfortunately that’s not enough to cover the wide variety of characters used in languages throughout the world, so people created their own encodings. Awesome.
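The 128 and 95 figures are easy to check. This is a minimal sketch that assumes the usual ASCII layout, where codes 32 (space) through 126 (~) are the printable ones.

```javascript
// ASCII is a 7-bit code, so it has 2^7 = 128 slots (codes 0 to 127).
const totalSlots = 2 ** 7;

// Codes 32 (space) through 126 (~) are the printable characters;
// everything else is a control code such as newline or tab.
let printable = '';
for (let code = 32; code <= 126; code++) {
  printable += String.fromCharCode(code);
}

console.log(totalSlots);       // 128
console.log(printable.length); // 95
console.log(printable);
```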

Character encodings broke the internet

Alright, so now we kind of know what’s up with all those bajillion character encodings you may have encountered like Microsoft’s Windows-1252 and Big5 – people needed to represent their own language and unique set of characters. And this mostly worked OK when documents weren’t shared with other computers. You know, the time before the internet.

young Bill Gates

The internet broke all of this because people started sending documents encoded in their native encoding to other people. Sometimes people weren’t using the same encoding and they’d see something like this as an email subject line:

�����[ Éf����Õì ÔǵÇ���¢!!

To further complicate things, some encodings would use 16 bits rather than 8. This would make for massive lookup tables. Far larger than for ASCII!

Someone finally got fed up with seeing gobbledygook in their documents and decided to create Unicode to unify all these encodings.
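The gobbledygook described above is easy to reproduce. This is a Node.js-only sketch (it relies on the `Buffer` global, which browsers don’t have) where text is written as UTF-8 and then wrongly read back as Latin-1. The sample word is just an illustration.

```javascript
// Node.js-only sketch: Buffer is a Node.js global, not a browser API.
const original = 'caf\u00E9'; // "café"

// The sender writes the text as UTF-8 bytes...
const bytes = Buffer.from(original, 'utf8');

// ...and the receiver wrongly reads the same bytes as Latin-1.
const garbled = bytes.toString('latin1');
console.log(garbled); // cafÃ©

// Reading the bytes with the encoding they were written in restores the text.
const restored = bytes.toString('utf8');
console.log(restored); // café
```

The bytes never changed. Only the lookup table used to read them did.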

Enter Unicode

Unicode is really just another lookup table, but instead of mapping bits straight to characters it gives every character a number, and encoding schemes like UTF-8 decide how those numbers are stored as bits. UTF-8 is efficient in how it uses its bits. With UTF-8, if a character can be represented with 1 byte that’s all it will use. If a character needs 4 bytes it’ll get 4 bytes. This is called a variable length encoding and it’s more efficient memory wise. Unicode encodings are simply how a piece of software stores the characters defined by the Unicode standard.

As Adam Hooper puts it:

UTF-8 saves space. In UTF-8, common characters like “C” take 8 bits, while rare characters like “💩” take 32 bits. Other characters take 16 or 24 bits. A blog post like this one takes about four times less space in UTF-8 than it would in UTF-32. So it loads four times faster.
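To see the variable length in action, here is a small sketch using the standard `TextEncoder` API (available in browsers and Node.js), which always encodes to UTF-8. The helper name `utf8ByteLength` and the sample strings are made up for this example.

```javascript
// TextEncoder always produces UTF-8 bytes.
const encoder = new TextEncoder();

// utf8ByteLength is a hypothetical helper name for this illustration.
function utf8ByteLength(text) {
  return encoder.encode(text).length;
}

console.log(utf8ByteLength('C'));      // 1
console.log(utf8ByteLength('\u00E9')); // 2 (é)
console.log(utf8ByteLength('€'));      // 3
console.log(utf8ByteLength('💩'));     // 4

// For plain English text, UTF-32 (4 bytes per code point) is 4x bigger.
const sample = 'A blog post like this one';
const utf8Bytes = utf8ByteLength(sample);
const utf32Bytes = [...sample].length * 4;
console.log(utf32Bytes / utf8Bytes);   // 4
```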

UTF-8 is by far the most common encoding you’ll come across on the web. The great thing about UTF-8 is that the first 128 code points are exactly the same as ASCII. So UTF-8, if you’re an English speaker, is exactly the same as ASCII.
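A quick way to convince yourself of the ASCII overlap is to compare the UTF-8 bytes of an English string with its character codes. This is only a sketch, and the sample string is arbitrary.

```javascript
const text = 'Hello, WordPress!';

// UTF-8 bytes for the string.
const utf8Bytes = Array.from(new TextEncoder().encode(text));

// Character codes for the same string; for ASCII text these are the ASCII codes.
const asciiCodes = Array.from(text, (ch) => ch.charCodeAt(0));

const identical =
  utf8Bytes.length === asciiCodes.length &&
  utf8Bytes.every((byte, i) => byte === asciiCodes[i]);

console.log(identical);             // true
console.log(utf8Bytes.slice(0, 5)); // [ 72, 101, 108, 108, 111 ]
```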

This is all important in our day and age because of the emoji 🚀. Emoji, after all, are just characters – like the letter ‘a’ or ‘Z’. Because Unicode encodings are flexible enough to use however many bits a character needs, emoji can be added to the Unicode character set quite easily.

The Unicode standard now encompasses 137,439 characters as of version 11.0. It includes all of your favorite emoji, as well as characters used in almost every language on the planet.

Code Points

Unicode characters can be referenced by their code point. This Stack Overflow article does a good job of explaining what a code point is:

A code point is the atomic unit (irreducible unit) of information. Text is a sequence of code points. Each code point is a number which is given meaning by the Unicode standard.

The current Unicode standard defines 1,114,112 code points – that’s a lot of 🍝. Unicode further divides up all those code points into 17 planes or groupings. We don’t need to know all about the internal workings of Unicode but it’s helpful to understand where it’s coming from.

To access code points we use the following syntax:

U+ (hexadecimal number of code point)

The hexadecimal numbering system is used as it’s a shorter way to reference large numbers. That’s why you’ll see things like U+1F4A9 or \u{1F4A9} in emoji tables.

For example:

Character Hex Binary
💩 U+1F4A9 0001 1111 0100 1010 1001
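The values in the table can be pulled out of a string with JavaScript’s ES2015 code point helpers. Treat this as a sketch. The variable names are only for illustration.

```javascript
const poo = '💩';

// codePointAt returns the full code point, even above U+FFFF.
const codePoint = poo.codePointAt(0);
console.log(codePoint); // 128169

// The same number in the U+ hexadecimal notation.
const hex = 'U+' + codePoint.toString(16).toUpperCase();
console.log(hex); // U+1F4A9

// And in binary, padded to 20 bits to match the table above.
const binary = codePoint.toString(2).padStart(20, '0');
console.log(binary); // 00011111010010101001

// Going the other way: from a code point back to a character.
console.log(String.fromCodePoint(0x1F4A9)); // 💩
console.log('\u{1F4A9}' === poo);           // true

// 17 planes of 65,536 code points each gives the total quoted earlier.
console.log(17 * 0x10000);                    // 1114112
console.log(Math.floor(codePoint / 0x10000)); // 1 (💩 lives in plane 1)
```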

To make things more complex, some characters can be expressed as a combination of code points.

é can be represented in Unicode as U+0065 (LATIN SMALL LETTER E) followed by U+0301 (COMBINING ACUTE ACCENT), but it can also be represented as the precomposed character U+00E9 (LATIN SMALL LETTER E WITH ACUTE)

We’ll learn more about this when we look at JavaScript’s implementation of Unicode, but complex or not, Unicode is the international standard for character encodings and it’s not all 🌹☀️.
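As a first taste, the two ways of writing é described above really are different strings as far as JavaScript is concerned. The helper name `hexCodePoints` is made up for this sketch.

```javascript
const combined = 'e\u0301';   // U+0065 followed by U+0301
const precomposed = '\u00E9'; // U+00E9

// hexCodePoints is a hypothetical helper that lists a string's code points.
function hexCodePoints(text) {
  return Array.from(text, (ch) =>
    'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0')
  );
}

console.log(hexCodePoints(combined));    // [ 'U+0065', 'U+0301' ]
console.log(hexCodePoints(precomposed)); // [ 'U+00E9' ]

// They look the same on screen but are not the same string.
console.log(combined === precomposed);            // false
console.log(combined.length, precomposed.length); // 2 1
```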

Problems with Unicode

Different programming languages, operating systems, even iOS Apps handle Unicode differently, and there’s still a lot of confusion out there about what Unicode actually is. Let’s look at some examples that are close to home.

PHP

We’ll start with the ElePHPant in the room, PHP. PHP claims on its strings documentation page that it only supports a 256-character set. What this really means is that PHP assumes that 1 byte = 1 character for strings. This is actually something I came across working on the batching feature for the Theme & Plugin Files Addon in WP Migrate DB Pro.

If you want to get the size, in bytes, of a string, just count the characters! strlen() for a string in PHP is essentially how many bytes it takes up. Cool.

Buuuut, what about a string that contains this bad boy – 🔥. How many bytes would that be? One?

echo strlen( '🔥' );
// Outputs: 4

Go home PHP you’re drunk.

This is where PHP’s multibyte string functions come in. To get the legit string length of 🔥, in characters, you’d need to use mb_strlen().

echo mb_strlen( '🔥' );
// Outputs: 1

Cool! So that works. But what was the length of 4 about with the standard strlen()? As I mentioned earlier, PHP thinks 1 character = 1 byte, so internally it checks the memory size of a string. The 🔥 emoji actually takes up 4 bytes of memory!

4kb memory

What a memory hog 🐷.

In reality though, PHP only messes up Unicode if you’re manipulating strings. If you’re simply getting or outputting strings, PHP doesn’t care and will work just fine. But if you’re trying to get substrings or lengths of strings, stick with the multibyte functions.
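To show why byte-based substrings are risky, here is a sketch that mimics one in JavaScript using the standard `TextEncoder` and `TextDecoder` APIs. It is an analogy for what happens when a UTF-8 string is cut at a byte boundary, not a reproduction of PHP’s internals.

```javascript
const encoder = new TextEncoder();
const decoder = new TextDecoder(); // defaults to UTF-8

// The fire emoji is 4 bytes in UTF-8, the same number strlen() reported.
const bytes = encoder.encode('🔥');
console.log(bytes.length); // 4

// Keep only the first 2 of the 4 bytes, as a byte-based substring would.
// The decoder can't make a character out of half a sequence, so it
// substitutes the U+FFFD replacement character.
const truncated = decoder.decode(bytes.slice(0, 2));
console.log(truncated.includes('\uFFFD')); // true

// Cutting on whole characters instead keeps the emoji intact.
const wholeChars = Array.from('a🔥b').slice(0, 2).join('');
console.log(wholeChars); // a🔥
```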

JavaScript

JavaScript engines use UTF-16 internally, another variable length encoding. UTF-16 is a lot like UTF-8 except that the smallest number of bits used is 16. Simple characters like ‘C’ use 16 bits, while fancy characters use 32 bits.

In JavaScript, strings are treated as sequences of UTF-16 code units. All that really means is that you might need two code units (a surrogate pair) to represent a single character.

let poop = '💩';
console.log( poop.length );
// Outputs 2

Similar to PHP’s strlen(), JavaScript’s length property returns the number of code units in a string, not the number of characters. Because JavaScript uses the UTF-16 encoding type, characters like 💩 that don’t fit in 16 bits will have a length of 2.

let poop = '\uD83D\uDCA9'
console.log( poop ) // 💩
console.log( poop.length ) // 2
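Where do \uD83D and \uDCA9 come from? They are the two halves of a UTF-16 surrogate pair, computed from the code point U+1F4A9. Here is a minimal sketch of that arithmetic. The helper name `toSurrogatePair` is made up for this example.

```javascript
// toSurrogatePair is a hypothetical helper name used for illustration.
// It splits a code point above U+FFFF into its two UTF-16 code units.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError('Only code points above U+FFFF need a surrogate pair');
  }
  const offset = codePoint - 0x10000;     // a 20-bit number
  const high = 0xD800 + (offset >> 10);   // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1F4A9);
console.log(high.toString(16), low.toString(16)); // d83d dca9
console.log(String.fromCharCode(high, low));      // 💩
```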

You can use this handy tool to convert emoji or other characters to their hex escaped values.

When using functions like String.prototype.slice() or String.prototype.substring() it’s important to keep this in mind, since both count code units and can cut a character in half. Basically, in JavaScript think of strings as code units and you’ll be ok.

As of ES2015 String.prototype.normalize() is available. It converts a string to a standard Unicode normalization form, so that characters which can be written more than one way (like the two forms of é above) end up as the same sequence of code points. This is helpful if you are comparing strings, or their lengths, that may have been composed differently.
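Here is a short sketch of both points: slicing by code units versus code points, and normalizing before comparing. The sample strings are only illustrative, and `'NFC'` and `'NFD'` are two of the forms normalize() accepts.

```javascript
const text = 'I💩JS';

console.log(text.length);      // 5 code units
console.log([...text].length); // 4 code points (spread iterates by code point)

// slice() counts code units, so this cuts the emoji in half
// and leaves a lone surrogate at the end.
const broken = text.slice(0, 2);
console.log(broken.length);    // 2

// Slicing an array of code points keeps the emoji whole.
const safe = [...text].slice(0, 2).join('');
console.log(safe);             // I💩

// Two ways of writing é only compare equal after normalizing.
const combined = 'e\u0301';
const precomposed = '\u00E9';
console.log(combined.normalize('NFC') === precomposed); // true
console.log(combined.normalize('NFC').length);          // 1
console.log(precomposed.normalize('NFD').length);       // 2
```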

The topic of JavaScript, Unicode and code units is a large one, but I recommend reading through Dimitri’s post if you’d like to learn more. It’s an eye opener.

MySQL

MySQL’s issues with Unicode are where I first encountered character encoding compatibility issues. It’s also when I first started losing my hair 😢.

Like PHP, MySQL has its own Unicode gotchas. MySQL’s utf8 encoding isn’t really UTF-8 at all. The utf8 encoding that we were all using back in the day only uses up to 3 bytes per character, while real UTF-8 needs up to 4. Why? Well who on earth would need more than 3 bytes, 24 WHOLE BITS, to represent a single character! The why is a long story (I suggest you read Adam’s article if you’d like to hear it) but a fix was rolled out in 2010 that brought us the utf8mb4 encoding.

The utf8mb4 character set has been added. This is similar to utf8, but its encoding allows up to four bytes per character to enable support for supplementary characters.

Nice. So if you’re using the utf8 character set you won’t see a fancy 😬.
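If you want to know whether a string will survive a trip through a 3-byte utf8 column, one rough check is to look for code points above U+FFFF, since those are the ones that need 4 bytes in UTF-8. The helper `needsUtf8mb4` below is hypothetical. It isn’t part of MySQL, WordPress or any client library.

```javascript
// needsUtf8mb4 is a hypothetical helper, not part of MySQL or any library.
// Code points up to U+FFFF fit in 3 UTF-8 bytes; anything above needs 4.
function needsUtf8mb4(text) {
  for (const ch of text) { // for...of iterates by code point
    if (ch.codePointAt(0) > 0xFFFF) {
      return true;
    }
  }
  return false;
}

console.log(needsUtf8mb4('Hello'));     // false
console.log(needsUtf8mb4('Price: €5')); // false (€ is 3 bytes)
console.log(needsUtf8mb4('Hello 😬'));  // true
```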

The WordPress core peeps realized this in 2015 and made utf8mb4 the default for new installs, as well as upgraded tables to use the new encoding if possible.

As someone who works on a database migration plugin, this one has bitten me more than once and we often have customers email us with issues migrating from a utf8mb4 encoded database to a utf8 encoded database.

Thanks MySQL!

We have a workaround, but your best bet is to make sure both sides involved in a migration use the utf8mb4 character set.

TL;DR

Unicode is a common, massive character set for all the world’s languages, glyphs and emoji. The UTF encoding family is how computers know which sequence of bits should be represented as which character. However, every programming language, app and OS implements and supports Unicode differently (if at all). This is where the developer’s job gets fun 😬.

Protip: Know what encoding your strings are using, and you know, use the same encoding everywhere!

Have you had issues with Unicode in your work? Anything I’ve missed in the above? Let us know in the comments.

About the Author

Peter Tasker

Peter is a PHP and JavaScript developer from Ottawa, Ontario, Canada. In a previous life he worked for marketing and public relations agencies. Loves WordPress, dislikes FTP.