Computer character sets: ASCII

In the early 1960s, the world (in fact the USA) needed a standardised character code for both telegraphy and computers. The five-bit ITA2 code fell short, mainly because it required frequent use of the letters-shift and figures-shift control characters. Work on ASCII started in 1961 and a first version was released in 1963.

ASCII was to become a 7-bit code and initially half of the code points were reserved for control characters.

The 1963 version included most familiar characters with codes 0x20 to 0x5F. Lowercase letters were not defined yet and there was an up-arrow instead of the caret and a left-arrow instead of the underscore.

In 1965, lowercase was added and some character codes were swapped (most of them were swapped back in the final 1968 version). In 1968 we got the final 7-bit ASCII code. Many control codes were defined with their names, but little was specified about their meanings. Many use cases were envisioned, such as mail systems where the start of a header could be marked with a dedicated control character and database formats where records and groups could be separated with specific control characters.
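To make the database use case concrete, here is a minimal Python sketch that uses three of the ASCII separator control characters (GS, RS and US) to pack groups of records of fields into one string. The function names and the sample data are made up for illustration. The nesting order is just one plausible reading of the control code names, since the standard says little about their meaning.

```python
# ASCII information separators. The names come from the standard;
# how they are nested here is an assumption made for illustration.
GS = "\x1d"  # Group Separator
RS = "\x1e"  # Record Separator
US = "\x1f"  # Unit Separator


def encode_groups(groups):
    """Join groups of records of fields into a single string."""
    return GS.join(
        RS.join(US.join(fields) for fields in records)
        for records in groups
    )


def decode_groups(text):
    """Split the string back into groups of records of fields."""
    return [
        [record.split(US) for record in group.split(RS)]
        for group in text.split(GS)
    ]


# Hypothetical sample data: two groups, three records in total.
data = [
    [["Ada", "London"], ["Grace", "New York"]],
    [["Linus", "Helsinki"]],
]
encoded = encode_groups(data)
assert decode_groups(encoded) == data
```

Because the separators are control characters, they cannot clash with commas, tabs or quotes in the field text, so no escaping is needed as long as the fields are plain printable text.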

We got the accent characters ‘`’, ‘^’ and ‘~’, and the apostrophe was supposed to do double duty as the acute accent while the double quote character could do double duty as umlaut/diaeresis. This way, you could type text in most Western European languages by overprinting letters with accent characters. The underscore character was likewise intended to add underlining to text. This overprinting never became popular and on early CRT terminals, it did not work at all. Therefore the accent characters were later mainly used as separate characters, not as combining accents.
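The overprinting idea can be sketched in Python: a letter, a backspace and an accent character together stand for one accented letter. The mapping to Unicode combining marks and the function name are my own choices for illustration, not part of any ASCII standard. The sketch also does not check that the character before the backspace is actually a letter.

```python
import unicodedata

# Assumed mapping from ASCII accent characters to Unicode combining marks.
COMBINING = {
    "`": "\u0300",  # grave
    "'": "\u0301",  # acute (the apostrophe doing double duty)
    "^": "\u0302",  # circumflex
    "~": "\u0303",  # tilde
    '"': "\u0308",  # diaeresis (the double quote doing double duty)
}


def resolve_overstrikes(text):
    """Turn 'letter, backspace, accent' sequences into accented letters."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        followed_by_accent = i + 1 < len(text) and text[i + 1] in COMBINING
        if ch == "\b" and out and followed_by_accent:
            # Replace the backspace + accent pair by a combining mark.
            out.append(COMBINING[text[i + 1]])
            i += 2
        else:
            out.append(ch)
            i += 1
    # NFC composes letter + combining mark into one code point where possible.
    return unicodedata.normalize("NFC", "".join(out))


print(resolve_overstrikes("cafe\b' and nai\b\"ve"))  # prints: café and naïve
```

A printing terminal did this "composition" on paper for free. A CRT terminal simply replaced the letter with the accent, which is why the scheme failed there.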

EBCDIC

Of course, IBM did things their own way and in 1963, they extended the 6-bit BCDIC character set into an 8-bit EBCDIC character set. It included some characters missing from ASCII, such as ‘¬’ and ‘¢’, and it included both uppercase and lowercase letters, something the ASCII team was still heavily debating. Most of the 256 code points were unassigned.

When ASCII was finalised, EBCDIC gained all the characters that ended up in ASCII. There were code pages galore, with country-specific extensions, and later there was even a UTF-style encoding (UTF-EBCDIC) to encode all Unicode characters with it (using sequences of those byte values that are unassigned in basic EBCDIC). EBCDIC is still heavily used in the mainframe world. Everything outside the mainframe world uses ASCII or Unicode.

ISO-646

ASCII was to become part of a large family of country-specific character sets. Most characters were standardised, but some characters like ‘@’, ‘[’, ‘\’ and ‘]’ could be replaced by language-specific letters, for example Æ, Ø and Å in Danish. This worked well for the Scandinavian languages (which each need three additional letters), as well as for German with its Ä, Ö, Ü and ß. French, however, required more accented letters than there were code points available for country-specific usage. Combining accents did not work well (or even at all) with the hardware and software of the day. The same was true for Dutch.
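A small Python sketch shows what this meant in practice: the same bytes look different depending on the national variant of the terminal or printer. The tables below are partial and illustrative. The uppercase Danish letters follow the text above, the lowercase positions are my assumption, and the German row is how I remember DIN 66003. Treat both as assumptions rather than an authoritative table.

```python
# Partial, illustrative replacement tables: ASCII character -> national letter.
# Assumptions: lowercase letters take the code points 0x20 above the
# uppercase ones; the German row follows DIN 66003 as I recall it.
VARIANTS = {
    "danish": {"[": "Æ", "\\": "Ø", "]": "Å",
               "{": "æ", "|": "ø", "}": "å"},
    "german": {"@": "§", "[": "Ä", "\\": "Ö", "]": "Ü",
               "{": "ä", "|": "ö", "}": "ü", "~": "ß"},
}


def render(ascii_text, variant):
    """Show ASCII text the way a terminal with a national variant would."""
    table = str.maketrans(VARIANTS[variant])
    return ascii_text.translate(table)


c_line = "if (a[i] > 0) { a[i] = 0; }"
print(render(c_line, "danish"))  # prints: if (aÆiÅ > 0) æ aÆiÅ = 0; å
print(render(c_line, "german"))  # prints: if (aÄiÜ > 0) ä aÄiÜ = 0; ü
```

The bytes in the file never change, only their appearance does, which is exactly what made exchanging data between countries so confusing.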

Country-specific ASCII variants were a pain in the ass, especially when you had to exchange data among different European countries. With the rise of the C programming language in the 1980s, characters like ‘{’, ‘}’ and ‘~’ had become essential. The ISO C standard describes trigraphs as substitutes, but really nobody liked to use these.
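To show what trigraphs look like, here is a rough Python sketch of the replacement step that a C compiler performs before anything else. The nine sequences are the ones from the C standards up to C17 (C23 dropped them). The function name is hypothetical and this is only a sketch of the idea, not how any particular compiler implements it.

```python
# The nine trigraph sequences of ISO C and the characters they stand for.
TRIGRAPHS = {
    "??=": "#", "??(": "[", "??/": "\\", "??)": "]", "??'": "^",
    "??<": "{", "??!": "|", "??>": "}", "??-": "~",
}


def replace_trigraphs(source):
    """Replace trigraphs in one left-to-right pass over the source text."""
    out = []
    i = 0
    while i < len(source):
        candidate = source[i:i + 3]
        if candidate in TRIGRAPHS:
            out.append(TRIGRAPHS[candidate])
            i += 3
        else:
            out.append(source[i])
            i += 1
    return "".join(out)


print(replace_trigraphs("??=define FLIP(x) (??-(x))"))
# prints: #define FLIP(x) (~(x))
print(replace_trigraphs("int a??(3??) = ??< 1, 2, 3 ??>;"))
# prints: int a[3] = { 1, 2, 3 };
```

One look at the input lines explains why nobody liked them.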

8-bit character codes, and later Unicode, would make the country-specific ISO-646 variants obsolete.

A Lasting Legacy

Nearly all programming languages use exclusively ASCII symbols for their source code and nearly all ASCII symbols are used in some programming languages. Take one printable ASCII character away and some operators in C or some constructs in Unix shells can no longer be typed. Nobody would think much of the ‘~’ character or curly braces, but without them, C would not be C. Now that these characters are part of the syntax of popular programming languages, nobody can take them away.

Source code is ASCII. Most programming languages allow identifiers to contain letters from any script defined by Unicode, but this is seldom used. Source code is international and international means English, especially for open source programs. The US keyboard layout allows all ASCII characters to be typed with a single key or a single shifted key (no awkward AltGr combinations). That is why it is so popular among programmers.

Most programming language standards specify that source code is Unicode, but Unicode still has ASCII at its core. The most popular encoding for Unicode is UTF-8 and guess what: if you use only the lowest 128 code points of Unicode (which are identical to ASCII), your source file is just ASCII. All other Unicode code points are encoded as sequences of bytes that have the high bit set. Non-ASCII characters can appear in strings and comments, and that works without problems most of the time.
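Both properties are easy to check with a few lines of Python. This is a minimal sketch; the helper name and the two sample source lines are made up for the example.

```python
def is_pure_ascii(data: bytes) -> bool:
    """True if no byte has the high bit (0x80) set."""
    return all(byte < 0x80 for byte in data)


ascii_src = 'print("hello")'
mixed_src = 'print("héllo")  # naïve'

# ASCII-only text gives the same bytes in ASCII and in UTF-8.
assert ascii_src.encode("utf-8") == ascii_src.encode("ascii")
assert is_pure_ascii(ascii_src.encode("utf-8"))

# Every byte of a non-ASCII character has the high bit set.
for ch in "éï":
    encoded_char = ch.encode("utf-8")
    assert len(encoded_char) > 1
    assert all(byte & 0x80 for byte in encoded_char)

# So a file with such characters is no longer pure ASCII, but all of its
# ASCII characters still keep their one-byte ASCII codes.
assert not is_pure_ascii(mixed_src.encode("utf-8"))
print("é".encode("utf-8"))  # prints: b'\xc3\xa9'
```

This is why an old ASCII-only tool usually copes with UTF-8 source: it never mistakes part of a multi-byte sequence for a quote, a brace or a newline.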

Of course there is also ASCII art. It is called ASCII art for a reason and true ASCII art uses ASCII characters exclusively.

File names should also be in ASCII (though most file systems allow arbitrary Unicode file names nowadays).
