Computer character sets: the 8-bit mess

ASCII is a 7-bit code that defines characters 0..127. Computers have 8-bit bytes and each byte can store a value in 0..255. When these bytes store ASCII characters, the values 128..255 are never used. Would it be a good idea to extend ASCII with 128 more values to hold the characters we need that are not in ASCII? Maybe. The problem was that too many different people had the same idea, and each of them defined different additional characters and put them in a different order. The really great idea only came in 1992, in the form of UTF-8.

There were separate encodings from different vendors and for each vendor there would be many different code pages to cover different sets of languages. The only good thing: the ASCII part of the set would be left alone, so at least all character sets supported full ASCII.
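To make the problem concrete, here is a small Python sketch that decodes the same bytes under four of the 8-bit sets mentioned in this article. The codec names are the ones Python's standard library uses; the byte value 0xE9 is just an arbitrary pick from the upper half, chosen for illustration.

```python
# Illustrative sketch: one byte sequence, four different 8-bit interpretations.
# Codec names are Python's standard names for these character sets.
CODECS = ["cp437", "mac_roman", "latin-1", "koi8_r"]

def decode_everywhere(raw: bytes) -> dict[str, str]:
    """Decode the same bytes under each 8-bit character set."""
    return {name: raw.decode(name) for name in CODECS}

# Bytes below 128 are plain ASCII and mean the same thing everywhere.
ascii_part = decode_everywhere(b"Hello")

# A byte from the upper half (0xE9, picked arbitrarily) does not.
upper_half = decode_everywhere(bytes([0xE9]))

for name in CODECS:
    print(f"{name:10} {ascii_part[name]!r:10} {upper_half[name]!r}")
```

The ASCII column comes out identical for every codec, while the last column shows a different character per codec. That is exactly why a file written on one system turned into gibberish on another.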

Different Vendors

Let’s describe some vendors who extended ASCII to an 8-bit code:

  • In 1981, IBM introduced the IBM-PC with a new 8-bit character set. That character set was known as Code Page 437 (CP437). It had a limited selection of international Latin characters (mostly lowercase), and both the selection of characters and the order in which they appeared seem to be rather random. No plan, but a crazy brainstorm session. MSX (1983) and Atari ST (1985) based their character sets on this set, but both made extensive changes, adding more international characters and leaving out box drawing characters.
  • Apple introduced the Macintosh in 1984, complete with its own extended 8-bit character set. The selection of characters seems far more logical than IBM’s.
  • Digital added a Multinational Character Set (MCS) to their VT220 terminal in 1983. This would later be developed into ISO-8859-1.
  • There were others: computers like the Acorn BBC Master, Grundy NewBrain and Sinclair QL each had their own proprietary 8-bit character set, which never saw any use on other platforms. Rightfully forgotten.

All except Digital filled in all code points from 128 to 255. The DEC VT220 terminal used the range 128..159 for additional control characters and only added printable characters in 160..255.
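The difference in how the upper half is used can be checked with a short Python sketch. Python ships no codec for the DEC Multinational Character Set, so this sketch uses latin-1 (ISO-8859-1, which kept the same control range) as a stand-in; that substitution is an assumption made here for illustration. The helper name `control_bytes` is made up for this example.

```python
import unicodedata

def control_bytes(codec: str) -> list[int]:
    """Return the byte values in 128..255 that decode to control characters."""
    found = []
    for value in range(128, 256):
        try:
            char = bytes([value]).decode(codec)
        except UnicodeDecodeError:
            continue  # this byte is unassigned in the codec
        if unicodedata.category(char) == "Cc":  # Unicode "control" category
            found.append(value)
    return found

# latin-1 stands in for the DEC layout: 128..159 are control characters.
print("latin-1  :", len(control_bytes("latin-1")), "control bytes")
# IBM and Apple used the whole upper half for printable characters.
print("cp437    :", len(control_bytes("cp437")), "control bytes")
print("mac_roman:", len(control_bytes("mac_roman")), "control bytes")
```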

Code Pages

A single 8-bit character set usually could not cover all languages for all markets where a computer was sold. Computers sold in Turkey needed Turkish letters such as the dotless i and s-cedilla. Computers sold in Greece needed support for the Greek alphabet. Therefore all important vendors had not one, but many different 8-bit character sets, called code pages. Different code pages were for different countries.

Some countries had their own 8-bit extended character sets. Russia, for example, had the KOI8 family of character sets.

ISO-8859

In 1987, ISO introduced a family of 8-bit character codes, called ISO-8859. All of these reserve the range 128..159 for C1 control characters. https://en.wikipedia.org/wiki/ISO/IEC_8859

Eventually, there would be 15 character sets, numbered from 1 to 16 (12 was omitted), 10 of which would be for the Latin script. Five others would be for Cyrillic, Arabic, Greek, Hebrew and Thai. ISO-8859-16 would be introduced in 2001, when Unicode was already well established.

The first set in the family, ISO-8859-1, would be based on the VT220 Multinational Character Set. It would include support for Finnish, Swedish, Norwegian, Danish, Icelandic, Dutch, English, French, German, Spanish, Portuguese and Italian, all relevant Western European languages. Even though the old VT220 set included the French letters œ, Œ and Ÿ, they were excluded from ISO-8859-1, being replaced by ×, ÷ and Icelandic letters. Later, ISO-8859-15 would add these letters back in, along with š and ž (for Finnish) and the Euro sign, all at the cost of some rarely used symbols.

ISO-8859-1 would be included whole in Unicode, as its first 256 code points. It was also the base of Windows Code Page 1252, which is a superset of ISO-8859-1 except that it replaces some of the C1 control characters with additional symbols. Most other Windows code pages are not compatible supersets of their corresponding 8859 sets. So apart from the ISO-8859 character sets, we also have to deal with a bunch of slightly incompatible Windows code pages.
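The relationship between the two sets can be sketched in Python by comparing how each byte decodes. The function name `cp1252_differences` and the `<undefined>` placeholder are made up for this example; the codec names are Python's own. Note that Python's cp1252 codec leaves a handful of bytes unassigned, which other implementations may treat differently.

```python
def cp1252_differences() -> dict[int, str]:
    """Map each byte that cp1252 decodes differently from latin-1
    to its cp1252 meaning ('<undefined>' if cp1252 leaves it unassigned)."""
    diffs = {}
    for value in range(256):
        raw = bytes([value])
        latin1 = raw.decode("latin-1")
        try:
            windows = raw.decode("cp1252")
        except UnicodeDecodeError:
            windows = "<undefined>"
        if windows != latin1:
            diffs[value] = windows
    return diffs

diffs = cp1252_differences()
print(f"{len(diffs)} bytes differ, from 0x{min(diffs):02X} to 0x{max(diffs):02X}")
print("0x80 in cp1252:", diffs[0x80])

# latin-1 itself maps byte N straight to Unicode code point N.
assert bytes(range(256)).decode("latin-1") == "".join(map(chr, range(256)))
```

Every difference falls in the C1 range 128..159; outside that range the two sets agree byte for byte.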

All Latin sets in ISO-8859 support German (including the ß) but only half of them would support French. Polish is in three of these sets, but with some characters in different positions in each of them. Czech is only in one of these sets. There’s no single set with both French and Czech. Ten different character sets for Latin was just too many.
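Claims like these are easy to probe with a few lines of Python. The sketch below tries to encode a sample string in each of the ten Latin parts and reports which ones succeed. The sample words are arbitrary picks for illustration and do not cover the full alphabet of any language, so the results are only indicative; the helper names are made up for this example.

```python
# The ten Latin parts of ISO-8859 (Latin-1 .. Latin-10).
LATIN_PARTS = [1, 2, 3, 4, 9, 10, 13, 14, 15, 16]

def parts_supporting(text: str) -> list[int]:
    """Return the ISO-8859 Latin parts that can encode every character of text."""
    supported = []
    for part in LATIN_PARTS:
        try:
            text.encode(f"iso8859_{part}")
        except UnicodeEncodeError:
            continue
        supported.append(part)
    return supported

def positions(char: str) -> dict[int, int]:
    """Return the byte value of a single character in each part that has it."""
    return {part: char.encode(f"iso8859_{part}")[0]
            for part in parts_supporting(char)}

# Sample words are assumptions for illustration, not complete alphabets.
print("French 'élève':", parts_supporting("élève"))
print("Czech  'řeč'  :", parts_supporting("řeč"))
print("Both together :", parts_supporting("élève řeč"))
print("Polish 'ą' at :", {p: hex(b) for p, b in positions("ą").items()})
```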
