Ç, ü, é: the result of one crazy brainstorm session became a world standard

Ever wonder where the original IBM PC character set came from and how it was designed? This character set is known as CP-437, and there is a very informative Wikipedia article about it (https://en.wikipedia.org/wiki/Code_page_437). At one time this character set was synonymous with Extended ASCII, and plain-text files were encoded in it. Even on modern Windows, which uses Unicode, you can still type ALT codes on the numeric keypad and get the characters that originally had those codes in CP-437; for example, ALT-1-2-8 gives you capital C with cedilla. Millions of people still have muscle memory for at least some of these codes.
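The ALT-code mapping is easy to check with the `cp437` codec that ships with Python. This is only a minimal sketch: the helper name `alt_code` is made up for illustration, and it models nothing more than "ALT plus a number in 128..255 gives the CP-437 character with that code", not what Windows does internally.

```python
import unicodedata


def alt_code(number: int) -> str:
    """Return the CP-437 character for a classic ALT+number code.

    Hypothetical helper: only the extended range 128..255 is modelled.
    """
    if not 128 <= number <= 255:
        raise ValueError("this sketch only covers codes 128..255")
    # A single byte decoded as CP-437 yields the matching Unicode character.
    return bytes([number]).decode("cp437")


for n in (128, 129, 130):
    ch = alt_code(n)
    # Print the Unicode name so the output is plain ASCII on any terminal.
    print(n, f"U+{ord(ch):04X}", unicodedata.name(ch))
# 128 U+00C7 LATIN CAPITAL LETTER C WITH CEDILLA
# 129 U+00FC LATIN SMALL LETTER U WITH DIAERESIS
# 130 U+00E9 LATIN SMALL LETTER E WITH ACUTE
```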

A few things about it are sane:

  • Codes 0x00..0x1F are fun symbols, such as smiley faces, card symbols, musical notes etc.
  • Codes 0x20..0x7E are ASCII. This may be a small miracle in its own right, given that it came from IBM, the inventors of EBCDIC.
  • Code 0x7F is the house symbol or capital delta; nobody knows exactly what it’s meant to be.
  • Codes 0x80..0xAF are accented letters for western European languages, plus other typographic symbols.
  • Codes 0xB0..0xDF are graphics characters, mostly for box drawing.
  • Codes 0xE0..0xFF are maths symbols and a selection of Greek letters.
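The layout in the list above can be written down as a small lookup. This is just a sketch of the ranges as I listed them; the function name `cp437_region` and the region labels are my own, not anything official.

```python
def cp437_region(code: int) -> str:
    """Name the block of CP-437 that a code belongs to (labels are informal)."""
    if not 0x00 <= code <= 0xFF:
        raise ValueError("CP-437 codes are single bytes (0x00..0xFF)")
    if code <= 0x1F:
        return "fun symbols"
    if code <= 0x7E:
        return "ASCII"
    if code == 0x7F:
        return "house symbol"
    if code <= 0xAF:
        return "accented letters and typographic symbols"
    if code <= 0xDF:
        return "graphics and box drawing"
    return "maths symbols and Greek letters"


for code in (0x01, 0x41, 0x7F, 0x80, 0xC4, 0xE1):
    print(f"0x{code:02X}: {cp437_region(code)}")
```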

This looks logical enough. The Monochrome Display Adapter (one of the video cards you could have in the original IBM PC) supported no graphics, just those characters. Box drawing characters let you design fancy text-based user interfaces and the fun symbols came in handy for games. And a serious computer could use some serious maths symbols.

There was support in the MDA hardware to extend the characters in the range 0xC0..0xDF into the ninth pixel column. All other characters had this column blank (the character ROM was only 8 bits wide), but some of the characters needed to form continuous lines, hence the rightmost bit from the ROM was repeated in the next pixel. This was a very clever hack indeed.
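To make the ninth-column trick concrete, here is a rough software model of it. The real thing is done in hardware, so this is only a sketch of the rule as described above; `expand_row` is a hypothetical name and the row value used below is invented for the demonstration, not read from an actual character ROM.

```python
def expand_row(code: int, row_bits: int) -> list:
    """Expand one 8-bit glyph row to the 9 pixels shown on screen.

    For codes 0xC0..0xDF the ninth pixel repeats the rightmost ROM bit;
    for every other code the ninth pixel stays blank.
    """
    # Most significant bit is the leftmost pixel.
    pixels = [(row_bits >> (7 - i)) & 1 for i in range(8)]
    ninth = pixels[7] if 0xC0 <= code <= 0xDF else 0
    return pixels + [ninth]


def render(pixels: list) -> str:
    """Draw lit pixels as '#' and dark pixels as '.'."""
    return "".join("#" if p else "." for p in pixels)


# A full-width horizontal stroke (made-up row data) in a box drawing cell
# reaches the ninth column, so adjacent cells join up without a gap.
print(render(expand_row(0xC4, 0b11111111)))  # #########
# The same row data outside 0xC0..0xDF leaves the ninth column empty.
print(render(expand_row(0x41, 0b11111111)))  # ########.
```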

Apparently the character set was based on the character set of Wang word processors, so some of the quirks may have been inherited from that.

I have always wondered why the maths characters include the less-than-or-equal and greater-than-or-equal symbols, but not the not-equal symbol. Why is the section sign § (fairly common in Germany) in the range of fun symbols? This makes no sense.

The international letters were poorly chosen. Why do we have Å and Æ as used in Danish and Norwegian, but not Ø? Indeed the Greek lowercase letter phi (code 0xED) was sometimes used as small letter ø, and in many fonts it looked sufficiently like it. The German sharp S (ß) looks a lot like a Greek lowercase beta, and indeed it shares the character code 0xE1 with it. Portuguese A and O with tilde are not included, while some fairly unusual currency symbols are there. Many accented letters exist only in lowercase, but that’s to be expected when you also need code space for box drawing characters and maths symbols. But all in all, the Apple Macintosh character set looked a lot more reasonable in terms of character selection.
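These gaps can be checked against Python's built-in `cp437` codec. A small sketch, with some caveats: `in_cp437` is a made-up helper, and the codec follows one particular mapping table, so it counts code 0xE1 as ß only (a Greek β is reported as missing) and it treats 0x00..0x1F as control codes, which means fun symbols such as § cannot be tested this way.

```python
import unicodedata


def in_cp437(ch: str) -> bool:
    """True if Python's cp437 codec can encode this character."""
    try:
        ch.encode("cp437")
        return True
    except UnicodeEncodeError:
        return False


# Characters mentioned in the text: Scandinavian letters, Portuguese tildes,
# sharp S and phi, the comparison operators, and an unusual currency symbol.
for ch in "ÅÆØøãõßφ≤≥≠₧":
    status = "present" if in_cp437(ch) else "missing"
    # Print names instead of the characters to keep the output ASCII-only.
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {status}")
```

With this codec, Ø, ø, ã, õ and ≠ come out as missing, while Å, Æ, ß, φ, ≤, ≥ and the peseta sign ₧ are present.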

Which brings us to the order of the international characters in the code page. Why on earth does this range start with Ç, ü, é? This makes no sense at all! It must be the result of a crazy brainstorm session in which characters were added haphazardly. Later in the session, somebody discovered that some characters had been forgotten, and some that were already included had to be kicked out. Nobody had a list of which languages had to be covered and which characters were essential for each. Even worse, they were too lazy to reorder the set after the session so that the characters eventually selected would be in a somewhat logical order.

At least this character set is more reasonable than the one selected in 1985 for the Acorn BBC Master. They did include the full Greek alphabet, both uppercase and lowercase, but they left out some accented letters that were important for French and Dutch, like the i with diaeresis (ï). Fortunately this never caught on, and the Acorn Archimedes switched to ISO Latin-1 (with some extensions).
