Lennart’s Blog

  • Floating point: the early years

    When Konrad Zuse designed his first mechanical computer in 1936, it supported binary floating point operations, much like modern IEEE-754, but with less precision (only 22 bits). It even had special values for infinity. Its successor, the Z3, was based on relays and was completed in 1941. It used the same type of floating point numbers.

    Electronic computers started to get floating point in the late 1950s. Every computer architecture had its own floating point format and its own quirks. In those years, floating point got a bad reputation of being intrinsically inexact. When I studied at the university in the late 1980s, that was still the common wisdom: never rely on any exact value, not even on 2.0 × 2.0 being exactly equal to 4.0.

    Some early machines, like the IBM 7094, had a 36-bit word size and proper binary floating point. Single precision was 36 bits and double precision was 72 bits. The IBM System 360, introduced in 1965, had a word size of 32 bits instead of 36 bits. Worse still, the floating point format was base-16 instead of base-2. For normal numbers, the mantissa could have a value between 2^20 and 2^24 - 1, or 0x100000..0xFFFFFF. Incrementing the exponent by 1 would increase the value of the number by a factor of 16. This way we had only 21 significant bits for some values and 24 for other values. Even worse still, the computations were not as accurate as they could have been within the constraints of this format, because no guard bits were used. Needless to say, some users of the IBM System 360 were not happy with its much inferior floating point accuracy.

    Numerical programming was emerging as a significant discipline in computer science. Using clever tricks, numerical programmers could get much more accurate results from floating point operations than they were entitled to. The problem was: different computer architectures required different clever tricks. Floating point hardware was primarily designed to be fast, even if that meant loss of precision in some corner cases.

    IEEE-754

    Something had to be done to clear up this mess. Some good ideas:

    • Use a pure binary floating point format: 32 bits for single precision and 64 bits for double precision. Pure binary formats allow for an implicit leading one, so even though only 23 mantissa bits are stored, a 24-bit number in the range 2^23 … 2^24 - 1 can be represented. This format was used by Digital Equipment Corporation (DEC) in the PDP-11 and VAX machines. This turned out to be much better than IBM’s base-16 format.
    • Specify the exact value that must be returned for each operation. Do this with four different rounding modes that each implementation must support. No more stupid shortcuts to compromise accuracy.
    • Specify distinct values for + and – Infinity and a set of NaN (Not a Number) values for completely invalid results.
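
    The implicit leading one can be made visible by taking a single-precision number apart. The following is a minimal Python sketch; the helper name decode_float32 is made up for this illustration and it only handles normal numbers (no zero, subnormals, infinities or NaNs).

```python
import struct

def decode_float32(x):
    """Split a normal single-precision value into sign, exponent and 24-bit significand."""
    # Round-trip through the 32-bit format to get at the raw bits.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31
    exponent = ((bits >> 23) & 0xFF) - 127   # remove the exponent bias
    stored = bits & 0x7FFFFF                 # the 23 bits that are actually stored
    significand = stored | (1 << 23)         # put the implicit leading one back
    return sign, exponent, significand

sign, exponent, significand = decode_float32(1.0)
print(sign, exponent, hex(significand))      # 0 0 0x800000
```

    For 1.0 all 23 stored mantissa bits are zero, yet the significand comes out as 2^23, the lower end of the 24-bit range.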

    One thing that was annoying on every floating point system before IEEE-754: there was a large gap between zero and the smallest positive floating point number. This gap was much larger than the gaps between numbers with the lowest exponent. For example: there were millions of numbers between 2^-126 and 2^-125, but there were no numbers between 0 and 2^-126. Therefore, A == B was not equivalent to A - B == 0. For example, 1.5×2^-126 - 2^-126 would be computed as zero, even though the two numbers are different.

    A solution to this problem would be to reserve the lowest possible exponent value for the value 0 and any numbers between 0 and 2^-126. These numbers were to be called subnormal numbers. This feature was costly to implement in hardware, so many hardware vendors (most notably DEC) were opposed to it, but the software guys won this one.
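
    The effect of subnormal numbers can be tried out in Python. This is only a sketch under some assumptions: Python floats are double precision, so the smallest normal number is 2^-1022 rather than the single-precision 2^-126 used above, and the platform is assumed not to flush subnormal results to zero.

```python
import sys

smallest_normal = sys.float_info.min      # 2**-1022 for IEEE-754 doubles
a = 1.5 * smallest_normal
b = smallest_normal
difference = a - b                        # 0.5 * 2**-1022, below the normal range

print(a != b)                             # True
print(difference != 0.0)                  # True, thanks to subnormal numbers
print(difference < smallest_normal)       # True, so the result is subnormal
```

    Without subnormals the subtraction would have had to return zero, and a != b would no longer imply a - b != 0.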

    Work on IEEE-754 started in the 1970s and while the standard was only released in 1985, Intel announced the 8087 floating point unit already in 1980. This processor implemented the standard before it was a published standard.

  • Computer character sets: Unicode

    Some scripts contain thousands of different characters, so an 8-bit character code for any of these would be futile to begin with. While different vendors in different countries each tried to fit their ideal character set into an 8-bit code, China and Japan already knew this would not work for them.

    Japan developed a multi-byte code, Shift-JIS, in which ASCII characters would take 1 byte and Kanji (or Katakana or Hiragana) would take two bytes each. Chinese speaking countries had the Big5 character code, using a similar concept.

    Unicode was already envisioned in 1987, the year the first official 8-bit character code for Western European languages (ISO-8859-1) became an ISO standard.

    The 16-bit Code

    Unicode was to become a 16-bit code, rather than an 8-bit code. It should include all characters from all scripts used in modern languages. No Egyptian hieroglyphics, but Chinese, Japanese and Korean had to be included. With a 16-bit code you have up to 65536 different characters.

    How many Chinese characters are there? A person is considered literate if he or she knows around 3,000 characters. But a good dictionary contains 20,000 different characters. But if you count them all, including characters only appearing in ancient texts and some proper names, you get close to 100,000.

    Chinese, Japanese Kanji and the corresponding Korean characters are basically the same character set, much like our Dutch letter A is the same letter as the French letter A.

    The “Han” characters from Chinese, Japanese and Korean should be unified. A set of around 20,000 was considered sufficient.

    Hangul is the other script used in Korea. In Korean there are about 11,000 possible Hangul syllables, each consisting of two or three “letters”, stacked on top of each other. Korean Hangul is typically encoded using one code point per syllable. You can compare that to the accented letters in Latin. The é is composed of the letter e with an acute accent ´ stacked on top of it. Yet the letter é is normally encoded using a single code point instead of a separate letter e and a combining acute accent.

    Combined CJK characters plus Hangul syllables would already consume half of the available code points. But all other scripts would have only tens of characters each, or a few hundred at worst for scripts with many different accented characters. A quick back-of-the-envelope calculation showed that it would work out. Latin, Greek, Cyrillic, Arabic, Hebrew, Thai, Devanagari, Georgian, Armenian, all of them combined would occupy a relatively small fraction of the code space.

    The first version of Unicode was released in 1991 as a 16-bit code. Each character now occupied two bytes instead of one, but once you got over that, it would be simple. It took a few years for CJK characters and Korean Hangul syllables to be included.

    Unicode always had to have some unused characters available, so new symbols (for example new currency symbols like the Euro sign) could always be added.

    16-bit Unicode can be stored either in little-endian or in big-endian format. A Unicode file typically starts with the value 0xFEFF, which is the byte order mark. The swapped value 0xFFFE is not a valid character, so the system opening the file can see which byte order the file is in.
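
    Checking for the byte order mark takes only a few lines. The sketch below is in Python; the function name detect_byte_order is hypothetical and the sketch only looks at the first two bytes of the data.

```python
def detect_byte_order(data):
    """Guess the byte order of 16-bit Unicode data from its first two bytes."""
    if data[:2] == b"\xfe\xff":
        return "big-endian"       # 0xFEFF stored high byte first
    if data[:2] == b"\xff\xfe":
        return "little-endian"    # 0xFEFF stored low byte first
    return "unknown"              # no byte order mark present

little = "\ufeffhi".encode("utf-16-le")
big = "\ufeffhi".encode("utf-16-be")
print(detect_byte_order(little))   # little-endian
print(detect_byte_order(big))      # big-endian
```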

    It took nearly two decades for Unicode to be nearly universally adopted. In Western European countries, 8-bit codes were sufficient and in the USA, ASCII was sufficient. Why use twice the memory and disk space if ASCII works just fine?

    UTF-8

    One clever trick, however, made the transition to Unicode in the USA a no-brainer. It was invented in 1992 by Ken Thompson and Rob Pike (of Unix and Plan9 fame). This is called UTF-8 (Unicode Transformation Format). Character codes between 0x00 and 0x7F are transferred as-is. So ASCII remains ASCII. Character codes in the range 0x80..0x7FF would be encoded by two bytes: one in the range 0xC2..0xDF, followed by a byte in the range 0x80..0xBF. Character codes in the range 0x800..0xFFFF would be encoded by 3 bytes: one in the range 0xE0..0xEF, followed by two bytes in the range 0x80..0xBF. This could be extended to longer encodings for higher character codes.
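
    The byte ranges above follow from a simple bit-shuffling scheme. Here is a minimal Python sketch for 16-bit character codes only; the function name utf8_encode_bmp is made up for this illustration, and unlike a real encoder it does not reject the surrogate range.

```python
def utf8_encode_bmp(code_point):
    """Encode one code point in the range 0x0000..0xFFFF as UTF-8 bytes."""
    if code_point < 0x80:                      # ASCII is passed through as-is
        return bytes([code_point])
    if code_point < 0x800:                     # two bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:                   # three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    raise ValueError("only 16-bit code points are handled in this sketch")

for ch in "Aé€":
    # Compare against Python's built-in encoder.
    assert utf8_encode_bmp(ord(ch)) == ch.encode("utf-8")
    print(ch, utf8_encode_bmp(ord(ch)).hex())
```

    The loop encodes one character from each of the three length classes and checks the result against Python's own UTF-8 encoder.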

    UTF-8 has the following desirable properties:

    • ASCII files are valid UTF-8.
    • If a program is 8-bit clean (it ignores and/or transfers unchanged any bytes in the range 0x80..0xFF), it can cope with UTF-8 files. For example a compiler that knows only ASCII, can pass bytes in string literals unchanged and ignore comments, so it can handle UTF-8 source files to some extent.
    • UTF-8 is self-synchronising. Any byte value in the range 0x80..0xBF cannot be the start of a character. Any byte value in the range 0xC0..0xFF is the start of a multi-byte character and it tells you how many trailing bytes are going to follow.
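
    Self-synchronisation means that a program dropped at an arbitrary byte offset can find the nearest character boundary by itself. A small Python sketch, with a hypothetical helper name start_of_character:

```python
def start_of_character(data, index):
    """Step back from an arbitrary byte offset to the first byte of its character."""
    while index > 0 and 0x80 <= data[index] <= 0xBF:   # trailing byte, keep going left
        index -= 1
    return index

text = "naïve €5".encode("utf-8")
# Offset 3 lands in the middle of the two-byte "ï" (bytes 2 and 3).
print(start_of_character(text, 3))    # 2
```

    The loop never needs more than a few steps, because a character has at most a handful of trailing bytes.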

    Efficiency depends on the script and language used:

    • For ASCII files it is one byte per character, just as efficient as ASCII, twice as efficient as 16-bit Unicode.
    • For Latin scripts it is almost as efficient as 8-bit character codes. The occasional accented letter takes two bytes instead of one, but in most languages these form only a small fraction of the characters.
    • For Greek, Cyrillic, Arabic and Hebrew, all letters will take 2 bytes each, but digits, spaces, commas etc. will only take one byte each, so we are still better off than with 16-bit Unicode (but worse than with an 8-bit code).
    • Even for languages like Chinese, we are not that much worse off compared to 16-bit Unicode. Text files typically contain a large fraction of spaces, digits and punctuation, each of which takes only one byte.
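
    These size differences are easy to measure. The Python sketch below compares both encodings for a few sample phrases; the phrases are my own arbitrary picks, and the Chinese one deliberately contains no ASCII at all, so it shows the worst case.

```python
samples = {
    "English": "The quick brown fox.",
    "Dutch": "Eén coördinaat, twee coördinaten.",
    "Greek": "Καλημέρα κόσμε, 2024.",
    "Chinese": "你好，世界。",
}

sizes = {}
for language, text in samples.items():
    utf8_size = len(text.encode("utf-8"))
    utf16_size = len(text.encode("utf-16-le"))   # no byte order mark
    sizes[language] = (len(text), utf8_size, utf16_size)
    print(f"{language}: {len(text)} characters, "
          f"{utf8_size} bytes in UTF-8, {utf16_size} bytes in 16-bit Unicode")
```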

    Increased Processing Requirements

    Even though UTF-8 made the transition from ASCII to Unicode easy, Unicode still requires considerable processing power compared to 8-bit character sets. This used to be a problem in the 1990s.

    While an ASCII-only font takes a few kilobytes, a full Unicode font easily takes megabytes. Chinese characters require more pixels to be recognisable, compared to ASCII characters. Plus we have way more characters.

    Arabic requires complex rendering algorithms to display correctly, because it’s a cursive script with no printed form with separate letters. Both Arabic and Hebrew require right-to-left printing. Bidirectional text support is one of the trickiest and most counterintuitive aspects of text processing. A text in Hebrew is to be presented from right to left, but a multi-digit number inside such a text is to be presented from left to right again. But when there is a list of such numbers separated by commas, the numbers have to be presented with the first number rightmost, but the digits inside each number are from left to right. We can have English quoted words inside a Hebrew quote, inside an overall English text.

    Converting a string to uppercase is trivial in ASCII. In Unicode we have quite a bit more characters to put in a lookup table and the code points of a lowercase letter and corresponding uppercase letter have no fixed relationship.
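
    A few lines of Python illustrate the difference. The example letters are my own picks; the case mappings come from Python's built-in Unicode tables.

```python
# In ASCII, uppercase and lowercase letters differ only in one bit (0x20).
print(chr(ord("q") ^ 0x20))          # Q

# In Unicode the distance between the two cases varies per letter.
for lower in "éÿ":
    upper = lower.upper()
    print(lower, upper, ord(lower) - ord(upper))

# Some conversions even change the length of the string.
print("straße".upper())              # STRASSE
```

    The é and É are still 32 code points apart, but the ÿ (0xFF) and Ÿ (0x178) are not, so a lookup table is the only general solution.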

    Sorting requires quite a bit more processing. You have to run a language-dependent collation algorithm on each string. Depending on the language, the letter Ö might sort after Z (Swedish), equivalent to OE (some German conventions) or equivalent to O where the presence of an accent is only a second-level collation criterion (Dutch).

    Some systems only handle a subset of Unicode well, for example only Latin, Greek and Cyrillic.

    The 20.09-bit Code

    In 1996 it turned out that 16 bits was not really enough for Unicode. For one thing, all Chinese characters (around 100,000) had to go in. Further we wanted to include historic scripts like Egyptian hieroglyphics and cuneiform after all. Emoji weren’t even a thing at the time.

    After playing with the idea to extend Unicode to 32, 31 or 30 bits, they finally made it a 20.09-bit code. To be exact, Unicode supports up to 17×65536 code points. This is the number supported by UTF-16. UTF-8 now requires up to 4 bytes to represent each Unicode character.
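
    The odd number 20.09 is simply the base-2 logarithm of the number of code points, as this short Python calculation shows:

```python
import math

code_points = 17 * 65536
print(code_points)                        # 1114112
print(hex(code_points - 1))               # 0x10ffff
print(round(math.log2(code_points), 2))   # 20.09
```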

    UTF-16 works as follows:

    • Code points in the range 0xD800..0xDFFF are unassigned.
    • Each character in the range 0x0000..0xFFFF represents itself as a single 16-bit value. This range is called the Basic Multilingual Plane. It excludes the range 0xD800..0xDFFF as these code points are unassigned.
    • Each character in the range 0x10000..0x10FFFF is represented by two 16-bit values: one in the range 0xD800..0xDBFF and one in the range 0xDC00..0xDFFF. These two 16-bit values are called surrogate values and they can never appear on their own, as these code points are unassigned.
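
    The rules above can be sketched in a few lines of Python. The function name utf16_encode is made up for this illustration and it does no validation of its input.

```python
def utf16_encode(code_point):
    """Return the 16-bit values that represent one code point in UTF-16."""
    if code_point < 0x10000:
        return [code_point]                   # Basic Multilingual Plane
    offset = code_point - 0x10000             # 20 bits remain
    high = 0xD800 | (offset >> 10)            # upper 10 bits
    low = 0xDC00 | (offset & 0x3FF)           # lower 10 bits
    return [high, low]

print([hex(unit) for unit in utf16_encode(0x1F600)])   # ['0xd83d', '0xde00']
```

    Two surrogates carry 10 bits each, which gives 16×65536 code points on top of the Basic Multilingual Plane: the 17×65536 mentioned earlier.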

    This way, the vast majority of characters in daily usage are in the Basic Multilingual Plane (0x0000..0xFFFF) and can be represented by a single 16-bit value. Programs that were designed to handle 16-bit Unicode, would continue to do so for these characters.

    The more exotic characters are now represented by two 16-bit values. They take 4 bytes each, both in UTF-8 and in UTF-16.

  • Computer character sets: the 8-bit mess

    ASCII is a 7-bit code and defines characters 0..127. Computers have 8-bit bytes and each byte can store values 0..255. When using these bytes to store ASCII characters, the values 128..255 are never used. Would it be a good idea to extend ASCII with 128 more values that could store the characters we need that are not in ASCII? Maybe that could be a good idea. The problem was: too many different people got the same idea and each of them defined different additional characters and added them in a different order. The really great idea only came in 1992 in the form of UTF-8.

    There were separate encodings from different vendors and for each vendor there would be many different code pages to cover different sets of languages. The only good thing: the ASCII part of the set would be left alone, so at least all character sets supported full ASCII.

    Different Vendors

    Let’s describe some vendors who extended ASCII to an 8-bit code:

    • In 1981, IBM introduced the IBM-PC with a new 8-bit character set. That character set was known as Code Page 437 (CP437). It had a limited selection of international Latin characters (mostly lowercase) and both the selection of which characters to include and the order in which they appeared, seem to be rather random. No plan, but a crazy brainstorm session. MSX (1983) and Atari-ST (1985) based their character sets on this set, but they both made extensive changes, adding more international characters and leaving out box drawing characters.
    • Apple introduced the Macintosh in 1984, complete with its own extended 8-bit character set. The selection of characters seems far more logical than IBM’s.
    • Digital added a Multinational Character Set (MCS) to their vt220 terminal in 1983. This would later be developed into ISO-8859-1.
    • There were others: computers like the Acorn BBC Master, Grundy NewBrain and Sinclair QL each had their own proprietary 8-bit character set, never seeing any usage on other platforms. Rightfully forgotten.

    All except Digital filled in all code points from 128 to 255. The DEC vt220 terminal used the range 128..159 for additional control characters and only added printable characters for 160..255.

    Code Pages

    A single 8-bit character set usually did not cover all languages for all markets where the computer was sold. Computers sold in Turkey needed Turkish letters such as the dotless i and s-cedilla. Computers sold in Greece needed support for the Greek alphabet. Therefore all important vendors had not one, but many different 8-bit character sets, called code pages. Different code pages were for different countries.

    Some countries, like Russia, had their own 8-bit extended character sets. Russia had the KOI8 family of character sets.
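
    To see what this means in practice, decode one and the same byte with a few of these character sets. The sketch uses codec names from Python's standard library; the choice of byte 0xE9 is arbitrary.

```python
byte = b"\xe9"

decoded = {}
# Codec names as used by Python's standard library.
for codec in ("latin_1", "cp437", "mac_roman", "koi8_r"):
    decoded[codec] = byte.decode(codec)
    print(codec, decoded[codec])
```

    ISO-8859-1 turns this byte into é and CP437 into the Greek capital Θ, while the Macintosh and KOI8-R sets give yet other characters. Without knowing the code page, the byte is meaningless.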

    ISO-8859

    In 1987, ISO introduced a family of 8-bit character codes, called ISO-8859. All of these reserve the range 128..159 for C1 control characters. https://en.wikipedia.org/wiki/ISO/IEC_8859

    Eventually, there would be 15 character sets, numbered from 1 to 16 (12 was omitted), 10 of which would be for the Latin script. Five others would be for Cyrillic, Arabic, Greek, Hebrew and Thai. ISO-8859-16 would be introduced in 2001, when Unicode was already well established.

    The first set in the family, ISO-8859-1, would be based on the vt220 Multinational Character Set. It would include support for Finnish, Swedish, Norwegian, Danish, Icelandic, Dutch, English, French, German, Spanish, Portuguese and Italian, all relevant Western European languages. Even though the old vt220 set included the French letters œ, Œ and Ÿ, they were excluded from ISO-8859-1, being replaced by ×, ÷ and Icelandic letters. Later, ISO-8859-15 would add these letters back in, along with š and ž (for Finnish) and the Euro sign, all at the cost of some rarely used symbols.

    ISO-8859-1 would be included whole in Unicode (as its first 256 code points) and it was also the base of Windows Code Page 1252, which is a superset of ISO-8859-1, but replaces some of the C1 control characters with additional symbols. Most other Windows code pages are not compatible supersets of their corresponding 8859 sets. So apart from the ISO-8859 character sets, we also have to deal with a bunch of slightly incompatible Windows code pages.
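
    Both relationships can be checked with Python's standard codecs. This is only a quick sketch; the codec names latin_1 and cp1252 are Python's names for ISO-8859-1 and Windows Code Page 1252.

```python
# ISO-8859-1 maps every byte to the Unicode code point with the same value.
all_bytes = bytes(range(256))
assert [ord(ch) for ch in all_bytes.decode("latin_1")] == list(range(256))

# Windows code page 1252 reuses part of the C1 range for printable symbols.
print(repr(b"\x80".decode("latin_1")))   # '\x80', a C1 control character
print(repr(b"\x80".decode("cp1252")))    # '€'

# Outside the range 0x80..0x9F the two agree.
same = all(bytes([b]).decode("latin_1") == bytes([b]).decode("cp1252")
           for b in range(256) if not 0x80 <= b <= 0x9F)
print(same)                              # True
```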

    All Latin sets in ISO-8859 support German (including the ß) but only half of them would support French. Polish is in three of these sets, but with some characters in different positions in each of them. Czech is only in one of these sets. There’s no single set with both French and Czech. Ten different character sets for Latin was just too many.

  • Computer character sets: control characters

    Enter the fun world of control characters. The five-bit ITA2 code contained a bunch of them, all of which got equivalents in ASCII:

    • Carriage return, to return the carriage (that contained the paper) of the typewriter to the rightmost position. In old typewriters, the carriage moved to the left as you typed, so the next character position on the paper would become the target of the typing hammers. In later printers, the print head would move to the right as you typed, while the paper stayed in the same position. Carriage returns would return the print head to the leftmost position.
    • Line feed, which advanced the paper vertically by one line.
    • Letters and figures shift. ASCII has no letters and figures shift as such, but it has the SO (shift out) and SI (shift in) control characters that can shift to and from an alternate character set. Teletypes could for instance switch between Greek and Latin this way.
    • The null character, all zero bits, which did nothing.
    • The Bell character, that sounded a bell.
    • The WRU (Who are you) character. This caused the receiving teletype to send a response. This way you could be sure a teletype was running at the other end of the line when you typed a message. You could not know if the teletype had paper and ink, but at least you could know it was connected and switched on. ASCII has the ENQ control character for this purpose.

    The ability to print a carriage return and a linefeed independently, allowed you to return to the start of the line without advancing the paper to the next line. This way you could overprint a line with different characters, possibly adding accents to letters or underlining some words. The Backspace character in ASCII would be simpler to use for this purpose. This overprinting worked on printing teletypes, but not on CRT terminals.

    ASCII added many more control characters, some of which are still widely used, some of which are all but forgotten. See https://en.wikipedia.org/wiki/C0_and_C1_control_codes. The widely used ones are:

    • BEL. Sounds a bell or a beep, or emits a visual attention signal.
    • Backspace. Moves the print head one position to the left.
    • Horizontal Tab. Moves the print head to the next tab position.
    • Vertical Tab. Advances the paper to the next vertical tab position. This is not actually widely used but its purpose is still recognised. It was used with pre-printed forms that required some fields to be filled in.
    • Form feed. Advances the paper to the next page. On CRT terminals this was sometimes used to clear the screen.
    • DC1 and DC3 (Control-Q and Control-S). These were originally used to control a paper tape reader. A program receiving data from a paper tape would send those control characters to pause and resume the paper tape reader, as a means of flow control. This type of flow control was widely used with CRT terminals to pause and resume terminal output and it’s still implemented today.
    • Escape. This is mostly used today as a prefix for more advanced terminal commands, for example to position the cursor or to insert and delete lines on the screen. The ANSI escape sequences are universally used for terminal-based programs.
    • DEL (0x7F). This control character, lonely at the top end of the 7-bit ASCII range, was originally intended to be used on paper tape. One could rewind the tape a few character positions and then punch DEL over those characters to erase them. When the paper tape was later read back, the DEL characters would be ignored, just like Null characters. As DEL has all bits set, punching it over another character would punch all holes, turning anything else into DEL.
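
    The DEL trick is a bitwise OR in disguise, as this little Python sketch shows (the three sample characters are arbitrary):

```python
DEL = 0x7F

# Punching more holes into a tape position can only set bits, never clear them.
for ch in "A", "z", "~":
    overpunched = ord(ch) | DEL
    print(ch, format(ord(ch), "07b"), "->", format(overpunched, "07b"))

print(all(code | DEL == DEL for code in range(128)))   # True
```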

    Later, another range of control characters was introduced: the C1 range 0x80..0x9F. It contains an unambiguous newline character (0x85), but like the rest of the range, it is not widely used. Unicode added line and paragraph separators, but these are rarely seen in plain text files either.

    Modern Usage

    Nearly all modern terminals (typically implemented in software on a computer), allow you to type the control characters 0x01 to 0x1A (Control-A to Control-Z). The Tab key will output Control-I (0x09), the Return key will output Control-M (0x0D) and the backspace key will output either Control-H (0x08) or DEL (0x7F), depending on which side of the holy war you are on. There is a dedicated ESC key (0x1B) and some less obvious key combinations will get you NUL and the characters in the range 0x1C..0x1F. Control keys are often used in a way that is totally unrelated to their original meaning in ASCII.
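
    The relation between a letter and its control character is a simple bit mask. Below is a Python sketch of the conventional mapping; the function name control is hypothetical and real terminals may treat some key combinations differently.

```python
def control(letter):
    """Return the code a terminal sends for Control plus the given letter."""
    return ord(letter.upper()) & 0x1F     # keep only the lowest five bits

for name, letter in [("Tab", "I"), ("Return", "M"), ("Backspace", "H")]:
    print(f"{name}: Control-{letter} = 0x{control(letter):02X}")

print(control("I") == ord("\t"), control("M") == ord("\r"))   # True True
```

    The same mask applied to ‘[’ (0x5B) gives 0x1B, which is why Control-[ works as an alternative to the ESC key on many terminals.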

    • Unix uses Ctrl-C (ETX) to terminate a running program, Ctrl-D (EOT) to indicate the end of input and Ctrl-Z (SUB) to suspend the currently running program.
    • CP/M files had a length specified in 128-byte blocks. When a text file was not a multiple of 128 bytes in size, the file was padded with SUB (Control-Z) characters. Even if the file was a multiple of 128 bytes, they would still add a block of Ctrl-Z, so that would be a reliable end-of-file indicator. This habit was carried over to MS-DOS (that did store exact file sizes). A single Ctrl-Z was typically appended to each text file. Some programs choked on Ctrl-Z, some would choke when the Ctrl-Z character was not present. It was a mess.
    • WordStar was an early word processor under CP/M. It got ported to MS-DOS and many other editors copied its control key layout. It used Ctrl-S for cursor-left, Ctrl-D for cursor-right, Ctrl-E for cursor-up and Ctrl-X for cursor-down. The choice of these control codes has nothing to do with their meanings in ASCII, but everything with the layout of these keys on the keyboard. WordStar was developed at a time that many computer terminals did not have cursor keys.
    • Many Unix editors, such as nano and emacs, make extensive use of control codes in a way totally unrelated to their meaning in ASCII.
    • GUI applications use Ctrl-Z for undo, Ctrl-X for “cut”, Ctrl-C for Copy and Ctrl-V for “paste”.

    Holy Wars

    How should text files be separated into lines? This has never been settled for real.

    There are three aspects of this:

    • Should a line terminator be at the end of each line you see, or should there only be one at the end of each paragraph? For source code, every line should have a line terminator, but for running text, this is not so obvious. Some authors prefer putting an entire paragraph in a long line. They expect text editor programs to wrap these lines to the width of the screen they are using. Others want to put a line separator at the end of each visual line and put a blank line (two line separators) between paragraphs.
    • Should the last line of a text file always end in a line terminator/separator?
    • What should the line terminator be?
      • CP/M and MS-DOS settled on the sequence CR-LF. This is what printers require when you print the file. This convention was carried over to Windows.
      • Apple and a bunch of 8-bit systems settled on just CR at the end of each line. Apple later followed the Unix convention.
      • Unix settled on just LF at the end of each line.

    Today it’s common wisdom that programs should at least accept text files with LF-only and with CR-LF on reading. Both conventions are here to stay. Unicode line separators or the new NL character 0x85 never caught on.
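
    Python's str.splitlines is an example of a reader that accepts all of these conventions at once. A short sketch with made-up sample strings:

```python
mixed = "unix\ndos\r\nold mac\rlast"
print(mixed.splitlines())        # ['unix', 'dos', 'old mac', 'last']

# A trailing terminator does not produce an extra empty line.
print("a\nb\n".splitlines())     # ['a', 'b']
```

    The same method also treats NL (0x85) and the Unicode line and paragraph separators as line boundaries, so it copes with files that do use them.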

    Another source of heated debate is the use of tab characters in source files.

    • Some authors prefer their source files to be free of any tab characters. Any indenting is done with spaces.
    • Other authors prefer indenting with Tab characters instead. The configuration of the tab stops becomes another point of discussion. Some programmers insist on tab stops every four spaces, others want tab stops every eight spaces.

    Finally there is discussion on the character code that should be emitted by the backspace key (the big left arrow right of the ‘=’ key), in particular on Unix systems.

    • Unix purists insist on Backspace = Backspace (0x08 = Ctrl-H).
    • Others insist on Backspace = DEL (0x7F).

    Terminal programs can be configured to emit either character as backspace and the “erase” character on the Unix line input function can be configured to any character. Many programs accept either convention. But it does get super annoying when not everything on the same system is configured the same way, for example when some terminal programs on your desktop emit BS, others emit DEL and your shell isn’t configured correctly for some of them.

  • Computer character sets: ASCII

    In the early 1960s, the world (in fact the USA) needed a standardised character code for both telegraphy and computers. The five-bit ITA2 code was lacking, mainly because it required frequent use of letters and figures shift control characters. Work on ASCII started in 1961 and a first version was released in 1963.

    ASCII was to become a 7-bit code and initially half of the code points were reserved for control characters.

    The 1963 version included most familiar characters with codes 0x20 to 0x5F. Lowercase letters were not defined yet and there was an up-arrow instead of the caret and a left-arrow instead of the underscore.

    In 1965, lowercase was added and some character codes were swapped (most of them were swapped back in the final 1968 version). In 1968 we got the final 7-bit ASCII code. Many control codes were defined with their names, but little was specified about their meanings. Many use cases were envisioned, such as mail systems where the start of a header could be marked with a dedicated control character and database formats where records and groups could be separated with specific control characters.

    We got accent characters ‘`’ , ‘^’ and ‘~’ and the apostrophe was supposed to do double duty as the acute accent while the double quote character could do double duty as umlaut/diaeresis. This way, you could type text in most Western European languages by overprinting letters with accent characters. The underscore character was likewise intended to add underlining to text. This overprinting never got popular and on early CRT terminals, it did not work at all. Therefore the accent characters were later mainly used as separate characters, not as combining accents.

    EBCDIC

    Of course, IBM did things their own way and in 1963, they extended the 6-bit BCDIC character set into an 8-bit EBCDIC character set. It included some characters missing from ASCII, such as ‘¬’ and ‘¢’ and it included both uppercase and lowercase letters, something the ASCII team was still heavily debating. Most of the 256 code points were unassigned.

    When ASCII got finalised, EBCDIC got all characters that ended up in ASCII. There were code pages galore, with country specific extensions and later there was even a UTF-style encoding to encode all Unicode characters with it (using sequences of those byte values that are unassigned in basic EBCDIC). EBCDIC is still heavily used in the mainframe world. Everything outside the mainframe world uses ASCII or Unicode.
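
    Python ships with a few EBCDIC code pages, which makes it easy to see how different the layout is from ASCII. The sketch below uses cp037, one of those code pages; the sample characters are arbitrary.

```python
# cp037 is one of the EBCDIC code pages shipped with Python.
for ch in "IJ09a":
    print(ch, hex(ch.encode("cp037")[0]), hex(ord(ch)))

# Unlike ASCII, the letters are not contiguous: there is a gap between I and J.
gap = "J".encode("cp037")[0] - "I".encode("cp037")[0]
print(gap)                            # 8
```

    Each line shows the EBCDIC byte next to the ASCII code of the same character.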

    ISO-646

    ASCII was to become part of a large family of country-specific character sets. Most characters were standardised, but some characters like ‘@’, ‘[’, ‘\’ and ‘]’ could be replaced by language-specific letters, for example Æ, Ø and Å in Danish. This worked well for the Scandinavian languages (that each need three additional letters), as well as for German with its Ä, Ö, Ü and ß. French however required more accented letters than there were code points available for country-specific usage. Combining accents did not work well (or even at all) with the hardware and software of the day. The same was true for Dutch.

    Country-specific ASCII variants were a pain in the ass, especially when you had to exchange data among different European countries. With the rise of the C programming language in the 1980s, characters like ‘{’, ‘}’ and ‘~’ had become essential. The ISO C standard describes trigraphs as substitutes, but really nobody liked to use these.

    8-bit character codes and later Unicode, would make country-specific ISO-646 variants obsolete.

    A Lasting Legacy

    Nearly all programming languages use exclusively ASCII symbols for their source code and nearly all ASCII symbols are used in some programming languages. Take one printable ASCII character away and some operators in C or some constructs in Unix shells can no longer be typed. Nobody would think much of the ‘~’ character or curly braces, but without them, C would not be C. Now that these characters are part of the syntax of popular programming languages, nobody can take them away.

    Source code is ASCII. Most programming languages allow identifiers to contain letters in any script defined by Unicode, but this is seldom used. Source code is international and international means English, especially for open source programs. The US keyboard layout allows all ASCII characters to be typed with a single key or a single shifted key (no awkward AltGr combinations). That is why it is so popular among programmers.

    Most programming language standards specify that source code is Unicode, but Unicode still has ASCII at its core. The most popular encoding for Unicode is UTF-8 and guess what: if you use only the lowest 128 code points of Unicode (which are identical to ASCII), your source file would just be ASCII. All other Unicode code points are encoded with sequences of bytes that have the high bit set. Unicode could be in strings and comments and that would work without problems most of the time.

    Of course there is also ASCII art. It is called ASCII art for a reason and true ASCII art uses ASCII characters exclusively.

    File names should also be in ASCII (though most file systems allow arbitrary Unicode file names nowadays).