Some scripts contain thousands of different characters, so an 8-bit character code for any of these would be futile to begin with. While different vendors in different countries each tried to fit their ideal character set into an 8-bit code, China and Japan already knew this would not work for them.
Japan developed a multi-byte code, Shift-JIS, in which ASCII characters would take one byte and Kanji (or Katakana or Hiragana) would take two bytes each. Chinese-speaking countries had the Big5 character code, using a similar concept.
Unicode was already envisioned in 1987, the year the first official 8-bit character code for Western European languages (ISO-8859-1) became an ISO standard.
The 16-bit Code
Unicode was to become a 16-bit code, rather than an 8-bit code. It should include all characters from all scripts used in modern languages. No Egyptian hieroglyphics, but Chinese, Japanese and Korean had to be included. With a 16-bit code you have up to 65536 different characters.
How many Chinese characters are there? A person is considered literate if he or she knows around 3,000 characters, but a good dictionary contains 20,000 different characters. If you count them all, including characters only appearing in ancient texts and some proper names, you get close to 100,000.
Chinese, Japanese Kanji and the corresponding Korean characters are basically the same character set, much like our Dutch letter A is the same letter as the French letter A.
The “Han” characters from Chinese, Japanese and Korean should be unified. A set of around 20,000 was considered sufficient.
Hangul is the other script used in Korea. In Korean there are about 11,000 possible Hangul syllables, each consisting of two or three “letters”, stacked on top of each other. Korean Hangul is typically encoded using one code point per syllable. You can compare that to the accented letters in Latin. The é is composed of the letter e with an acute accent ´ stacked on top of it. Yet the letter é is normally encoded using a single code point instead of a separate letter e and a combining acute accent.
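To see the difference between a single code point and its separate parts, here is a small sketch in Python using the standard unicodedata module. The helper name decompose is only illustrative. It splits a precomposed é and a precomposed Hangul syllable into their components (canonical decomposition, NFD).

```python
import unicodedata

def decompose(text):
    """Return the code points of text after canonical decomposition (NFD)."""
    return [f"U+{ord(ch):04X}" for ch in unicodedata.normalize("NFD", text)]

e_acute = "\u00e9"    # é as a single precomposed code point
syllable = "\ud55c"   # the Hangul syllable 한 (han) as a single code point

print(decompose(e_acute))    # the letter e followed by a combining acute accent
print(decompose(syllable))   # three Hangul "letters": initial, vowel, final

# Recomposing (NFC) gives back the single precomposed code point.
assert unicodedata.normalize("NFC", "e\u0301") == e_acute
```

Both forms display the same way, but most text you encounter uses the precomposed single code point.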
Combined CJK characters plus Hangul syllables would already consume half of the available code points. All other scripts, however, would need only tens of characters each, or a few hundred at worst for scripts with many different accented characters. A quick back-of-the-envelope calculation showed that it would work out. Latin, Greek, Cyrillic, Arabic, Hebrew, Thai, Devanagari, Georgian, Armenian, all of them combined would occupy a relatively small fraction of the code space.
The first version of Unicode was released in 1991 as a 16-bit code. Each character now occupied two bytes instead of one, but once you got over that, it would be simple. It took a few years for CJK characters and Korean Hangul syllables to be included.
Unicode always had to have some unused characters available, so new symbols (for example new currency symbols like the Euro sign) could always be added.
16-bit Unicode can be stored either in little-endian or in big-endian format. A Unicode file typically starts with the value 0xFEFF, which is the byte order mark. The swapped value 0xFFFE is not a valid character, so the system opening the file can tell which byte order the file is in.
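The byte order check can be sketched in a few lines of Python. The function name detect_byte_order is made up for this illustration, and a real reader would also have to decide what to do when no byte order mark is present.

```python
def detect_byte_order(data):
    """Guess the byte order of 16-bit Unicode data from its first two bytes."""
    if data[:2] == b"\xfe\xff":
        return "big-endian"      # 0xFEFF stored with the high byte first
    if data[:2] == b"\xff\xfe":
        return "little-endian"   # the bytes arrive swapped: low byte first
    return "unknown"             # no byte order mark at the start

text = "A\u20ac"  # the letter A followed by the Euro sign
big = b"\xfe\xff" + text.encode("utf-16-be")
little = b"\xff\xfe" + text.encode("utf-16-le")

print(detect_byte_order(big), big.hex())
print(detect_byte_order(little), little.hex())
```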
It took nearly two decades for Unicode to be nearly universally adopted. In Western European countries, 8-bit codes were sufficient and in the USA, ASCII was sufficient. Why use twice the memory and disk space if ASCII works just fine?
UTF-8
One clever trick, however, made the transition to Unicode in the USA a no-brainer. It was invented in 1992 by Ken Thompson and Rob Pike (of Unix and Plan 9 fame). This is called UTF-8 (Unicode Transformation Format). Character codes between 0x00 and 0x7F are transferred as-is. So ASCII remains ASCII. Character codes in the range 0x80..0x7FF would be encoded by two bytes: one in the range 0xC2..0xDF, followed by a byte in the range 0x80..0xBF. Character codes in the range 0x800..0xFFFF would be encoded by three bytes: one in the range 0xE0..0xEF, followed by two bytes in the range 0x80..0xBF. This could be extended to longer encodings for higher character codes.
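As an illustration, the three ranges translate into bit shifts and masks roughly as follows. This is a minimal Python sketch, not a full encoder: the name utf8_encode is made up, it stops at 0xFFFF, and it does not reject code points that are not valid characters. The loop at the end cross-checks the result against Python's built-in UTF-8 codec.

```python
def utf8_encode(code_point):
    """Encode one code point in the range 0x0000..0xFFFF as UTF-8 bytes."""
    if code_point < 0x80:
        # Up to 7 bits: the byte is the ASCII value itself.
        return bytes([code_point])
    if code_point < 0x800:
        # Up to 11 bits: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:
        # Up to 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    raise ValueError("this sketch only covers code points up to 0xFFFF")

for ch in "A\u00e9\u20ac":   # A, é and the Euro sign
    encoded = utf8_encode(ord(ch))
    assert encoded == ch.encode("utf-8")   # compare with the built-in codec
    print(f"U+{ord(ch):04X} -> {encoded.hex()}")
```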
UTF-8 has the following desirable properties:
- ASCII files are valid UTF-8.
- If a program is 8-bit clean (it ignores and/or transfers unchanged any bytes in the range 0x80..0xFF), it can cope with UTF-8 files. For example, a compiler that knows only ASCII can pass bytes in string literals unchanged and ignore comments, so it can handle UTF-8 source files to some extent.
- UTF-8 is self-synchronising. Any byte value in the range 0x80..0xBF cannot be the start of a character. Any byte value in the range 0xC0..0xFF is the start of a multi-byte character and it tells you how many trailing bytes are going to follow.
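The self-synchronising property means a program that lands in the middle of a UTF-8 file can find the nearest character boundary by itself. Here is a small Python sketch. The names char_start and sequence_length are made up, and the code assumes the input is valid UTF-8.

```python
def char_start(data, index):
    """Return the index of the first byte of the character containing data[index]."""
    # Trailing bytes look like 10xxxxxx, so they lie in the range 0x80..0xBF.
    while index > 0 and (data[index] & 0xC0) == 0x80:
        index -= 1
    return index

def sequence_length(lead):
    """Return how many bytes a character occupies, judging by its first byte."""
    if lead < 0x80:
        return 1          # plain ASCII
    if 0xC0 <= lead < 0xE0:
        return 2          # 110xxxxx
    if 0xE0 <= lead < 0xF0:
        return 3          # 1110xxxx
    if 0xF0 <= lead < 0xF8:
        return 4          # 11110xxx
    raise ValueError("not the first byte of a character")

data = "a\u20acb".encode("utf-8")   # a, Euro sign, b: five bytes in total
start = char_start(data, 3)         # index 3 is the last byte of the Euro sign
length = sequence_length(data[start])
print(start, length, data[start:start + length].decode("utf-8"))
```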
Efficiency depends on the script and language used:
- For ASCII files it is one byte per character, just as efficient as ASCII, twice as efficient as 16-bit Unicode.
- For Latin scripts it is almost as efficient as 8-bit character codes. The occasional accented letter takes two bytes instead of one, but in most languages these form only a small fraction of the characters.
- For Greek, Cyrillic, Arabic and Hebrew, all letters will take 2 bytes each, but digits, spaces, commas etc. will only take one byte each, so we are still better off than with 16-bit Unicode (but worse than with an 8-bit code).
- Even for languages like Chinese, we are not that much worse off compared to 16-bit Unicode. Each Chinese character takes three bytes instead of two, but text files typically contain a large fraction of spaces, digits and punctuation, each of which takes only one byte.
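You can check these claims on your own texts with a few lines of Python. The sample strings below are arbitrary short examples chosen for this sketch, so the exact ratios will differ for real documents.

```python
def encoded_sizes(text):
    """Return (characters, bytes in UTF-8, bytes in 16-bit Unicode) for text."""
    return len(text), len(text.encode("utf-8")), len(text.encode("utf-16-be"))

# Arbitrary sample texts, one per script category discussed above.
samples = {
    "English": "Hello, world 123",
    "Dutch": "Eén café, twee cafés",
    "Greek": "Καλημέρα κόσμε 123",
    "Chinese": "你好，世界 123",
}

for language, text in samples.items():
    chars, utf8_bytes, utf16_bytes = encoded_sizes(text)
    print(f"{language}: {chars} characters, "
          f"{utf8_bytes} bytes in UTF-8, {utf16_bytes} bytes in UTF-16")
```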
Increased Processing Requirements
Even though UTF-8 made the transition from ASCII to Unicode easy, Unicode still requires considerable processing power compared to 8-bit character sets. This used to be a problem in the 1990s.
While an ASCII-only font takes a few kilobytes, a full Unicode font easily takes megabytes. Chinese characters require more pixels to be recognisable, compared to ASCII characters. Plus we have way more characters.
Arabic requires complex rendering algorithms to display correctly, because it’s a cursive script with no printed form with separate letters. Both Arabic and Hebrew require right-to-left printing. Bidirectional text support is one of the trickiest and most counterintuitive aspects of text processing. A text in Hebrew is to be presented from right to left, but a multi-digit number inside such a text is to be presented from left to right again. But when there is a list of such numbers separated by commas, the numbers have to be presented with the first number rightmost, but the digits inside each number are from left to right. We can have English quoted words inside a Hebrew quote, inside an overall English text.
Converting a string to uppercase is trivial in ASCII. In Unicode we have quite a bit more characters to put in a lookup table and the code points of a lowercase letter and corresponding uppercase letter have no fixed relationship.
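A few examples show how irregular the relationship is. The sketch below uses Python's built-in str.upper, and the helper name case_offset is made up for this illustration. It reports the distance between a lowercase letter and its uppercase form, or None when the uppercase form is not a single character.

```python
def case_offset(lower):
    """Return ord(lower) - ord(uppercase form), or None if that form is not one character."""
    upper = lower.upper()
    if len(upper) != 1:
        return None
    return ord(lower) - ord(upper)

# In ASCII the distance is always 0x20, so clearing one bit is enough.
assert case_offset("a") == 0x20

# a, a with macron, y with diaeresis, German sharp s
for ch in ["a", "\u0101", "\u00ff", "\u00df"]:
    print(ch, ch.upper(), case_offset(ch))
```

The German sharp s is the odd one out here: its uppercase form in Python is the two-letter string SS, so converting to uppercase can even change the length of a string.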
Sorting requires quite a bit more processing. You have to run a language-dependent collation algorithm on each string. Depending on the language, the letter Ö might sort after Z (Swedish), be equivalent to OE (some German conventions) or be equivalent to O, where the presence of an accent is only a second-level collation criterion (Dutch).
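To make the three conventions concrete, here is a toy sketch in Python. These key functions are made up for this illustration and handle only the letter Ö; they are in no way real collation algorithms, which need language-specific tables for every letter.

```python
import unicodedata

O_UMLAUT = "\u00d6"   # the letter Ö

def strip_accents(word):
    """Remove combining accents after canonical decomposition."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def swedish_key(word):
    # Toy rule: Ö is a letter of its own that sorts after Z.
    return [ord("Z") + 1 if ch == O_UMLAUT else ord(ch) for ch in word.upper()]

def german_key(word):
    # Toy rule: Ö sorts as if it were written OE.
    return word.upper().replace(O_UMLAUT, "OE")

def dutch_key(word):
    # Toy rule: compare without accents first, the accented form only breaks ties.
    return (strip_accents(word.upper()), word.upper())

words = ["Oz", "\u00d6l", "Of", "Zoo"]
for name, key in [("Swedish", swedish_key), ("German", german_key), ("Dutch", dutch_key)]:
    print(name, sorted(words, key=key))
```

The same four words come out in three different orders, which is why a program cannot sort text correctly without knowing the language.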
Some systems only handle a subset of Unicode well, for example only Latin, Greek and Cyrillic.
The 20.09-bit Code
In 1996 it turned out that 16 bits was not really enough for Unicode. For one thing, all Chinese characters (around 100,000) had to go in. Further we wanted to include historic scripts like Egyptian hieroglyphics and cuneiform after all. Emoji weren’t even a thing at the time.
After playing with the idea of extending Unicode to 32, 31 or 30 bits, they finally made it a 20.09-bit code. To be exact, Unicode supports up to 17×65536 code points. This is the number supported by UTF-16. UTF-8 now requires up to 4 bytes to represent each Unicode character.
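The odd-looking figure of 20.09 bits is simply the base-2 logarithm of the number of code points. A quick check in Python, with variable names chosen just for this sketch:

```python
import math

PLANES = 17                  # the original 16-bit range plus 16 extra ranges
CODE_POINTS_PER_PLANE = 65536

total = PLANES * CODE_POINTS_PER_PLANE
bits = math.log2(total)

print(total)                 # number of code points
print(hex(total - 1))        # highest code point
print(round(bits, 2))        # number of bits needed, as a fraction

# The 16 extra ranges are exactly what pairs of 1024 x 1024 16-bit values can address.
assert total == 65536 + 1024 * 1024
```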
UTF-16 works as follows:
- Code points in the range 0xD800..0xDFFF are never assigned to characters; they are reserved for use as surrogates.
- Each character in the range 0x0000..0xFFFF, excluding 0xD800..0xDFFF, represents itself as a single 16-bit value. This range is called the Basic Multilingual Plane.
- Each character in the range 0x10000..0x10FFFF is represented by two 16-bit values: one in the range 0xD800..0xDBFF followed by one in the range 0xDC00..0xDFFF. These two 16-bit values are called surrogate values and they can never appear on their own as characters, as these code points are unassigned.
This way, the vast majority of characters in daily usage are in the Basic Multilingual Plane (0x0000..0xFFFF) and can be represented by a single 16-bit value. Programs that were designed to handle 16-bit Unicode would continue to do so for these characters.
The more exotic characters are now represented by two 16-bit values. They take 4 bytes each, both in UTF-8 and in UTF-16.
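The UTF-16 rules above can be sketched in Python as follows. The function name utf16_encode is made up for this illustration, and the result is cross-checked against Python's built-in codec for one emoji.

```python
def utf16_encode(code_point):
    """Return the list of 16-bit values that represent one code point in UTF-16."""
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate values are not characters")
    if code_point < 0x10000:
        return [code_point]               # Basic Multilingual Plane: stored as-is
    if code_point > 0x10FFFF:
        raise ValueError("beyond the last Unicode code point")
    offset = code_point - 0x10000         # 20 bits remain
    high = 0xD800 | (offset >> 10)        # upper 10 bits, in 0xD800..0xDBFF
    low = 0xDC00 | (offset & 0x3FF)       # lower 10 bits, in 0xDC00..0xDFFF
    return [high, low]

emoji = "\U0001F600"   # a smiley outside the Basic Multilingual Plane
units = utf16_encode(ord(emoji))
print([hex(u) for u in units])

# Cross-check against the built-in codec, and compare the sizes in bytes.
assert b"".join(u.to_bytes(2, "big") for u in units) == emoji.encode("utf-16-be")
print(len(emoji.encode("utf-16-be")), len(emoji.encode("utf-8")))
```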