The 8-bit byte

Today a byte is always a unit of exactly eight bits. Eight-bit units (bytes, octets) have long been a key part of the TCP/IP protocols, file format specifications and other standards. Disk files have their sizes specified in bytes, not in bits or any other multiple of bits. Pretty much every general-purpose computer architecture invented since the 1970s addresses memory in 8-bit bytes. If you add one to an address, you get to the next byte in memory, not to the next full word. For the next full word, you have to add 4 or 8 to the address, depending on whether you are on a 32-bit or 64-bit system. In the past, we also had 16-bit systems, where you add 2 to an address to get to the next word. Because 8-bit bytes are assumed in so many standards, file formats and protocols, they are deeply ingrained in our culture and they are here to stay for the foreseeable future.
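
A quick C sketch of this address arithmetic (assuming a 64-bit system, where a word is 8 bytes):

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        uint64_t words[2];
        char    *bytes = (char *)words;

        /* Byte addresses go up by 1 per byte... */
        printf("byte step: %td\n", (bytes + 1) - bytes);        /* prints 1 */

        /* ...but the next full 64-bit word is 8 bytes further on. */
        printf("word step: %td\n", (char *)&words[1] - bytes);  /* prints 8 */
        return 0;
    }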

In French, an 8-bit unit is called an “octet”, and this term is also used in some official standards documents. For all practical purposes, byte and octet are synonyms.

Bytes have not always been 8 bits though. Most mainframes of the 1950s and 1960s had word sizes of 36 bits, but there were also machines with different word sizes, such as 40, 48 or 60 bits. Memory addresses selected full words, so to get to the next full word in memory, you always added exactly 1 to the address. Each word contained one number. Text data was stored as a fixed number of characters in each word. The size of a character was often 6 bits, which allowed 64 different characters: 26 (uppercase) letters, 10 decimal digits and a bunch of other symbols. A single 36-bit word could hold six characters. Instruction sets often contained instructions that helped you compose 36-bit words from single characters and extract single characters from 36-bit words.
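
A minimal C sketch of that kind of packing, using a 64-bit integer to hold the 36-bit word; the layout (character 0 in the top six bits) is an assumption for illustration, not any particular machine's convention:

    #include <stdint.h>
    #include <stdio.h>

    /* Clear the 6-bit field at position pos (0..5) and store ch there. */
    static uint64_t pack_char(uint64_t word, int pos, unsigned ch) {
        int shift = (5 - pos) * 6;
        word &= ~((uint64_t)077 << shift);
        return word | ((uint64_t)(ch & 077) << shift);
    }

    /* Extract the 6-bit character at position pos. */
    static unsigned extract_char(uint64_t word, int pos) {
        return (word >> ((5 - pos) * 6)) & 077;
    }

    int main(void) {
        uint64_t word = 0;
        for (int i = 0; i < 6; i++)
            word = pack_char(word, i, i + 1);   /* pack codes 1..6 */
        for (int i = 0; i < 6; i++)
            printf("char %d = %o\n", i, extract_char(word, i));
        return 0;
    }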

Not all computers were word-addressable. The IBM 1401, for example, addressed memory as single characters. Numbers consisted of a variable number of characters, each of which represented one decimal digit. Because these machines accessed memory one character at a time, they were slow, just like the early 8-bit microcomputers. But they were well suited to business applications.

In 1964, IBM decided to define one instruction set architecture for all its computer systems. This became System/360. The word size was 32 bits, the character size was 8 bits, and memory was addressed in units of 8-bit bytes. This is exactly the same way as modern 32-bit machines address memory. For the characters, IBM defined an 8-bit character set called EBCDIC. To make the hardware efficient, it is important that the number of bytes in a single machine word is a power of two. If it were not, you would have to perform a division to obtain the word address from a byte address. With a power of two, you can just ignore the last few bits of the address (and use these to select one byte in a memory word when doing byte access).
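
To make the division point concrete, here is a small C sketch contrasting a 4-byte word (shift and mask) with a hypothetical 6-byte word (a real division):

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        uint32_t byte_addr = 1037;

        /* 4 bytes per word (a power of two): shift and mask suffice. */
        uint32_t word_addr    = byte_addr >> 2;  /* drop the low two bits */
        uint32_t byte_in_word = byte_addr & 3;   /* keep the low two bits */

        /* 6 bytes per word (not a power of two): a real division is needed. */
        uint32_t word_addr_6    = byte_addr / 6;
        uint32_t byte_in_word_6 = byte_addr % 6;

        printf("power of two: word %u, byte %u\n", word_addr, byte_in_word);
        printf("six bytes:    word %u, byte %u\n", word_addr_6, byte_in_word_6);
        return 0;
    }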

As the number of bits in a single byte is also a power of two, the number of bits in a whole machine word is a power of two as well. This makes it very efficient to implement bitmaps. Bitmaps can implement sets (as in the Pascal programming language), represent monochrome graphics, and keep track of free blocks in memory or on disk. Now that we have these benefits, nobody will ever move away from power-of-two word sizes.
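
A minimal sketch of the usual bitmap operations in C, here imagined as tracking free disk blocks; because 64 is a power of two, the word index and the bit index fall out of a shift and a mask:

    #include <stdint.h>
    #include <stdio.h>

    #define NBLOCKS 4096
    static uint64_t bitmap[NBLOCKS / 64];    /* one bit per block */

    static void set_bit(unsigned n)   { bitmap[n >> 6] |=  (uint64_t)1 << (n & 63); }
    static void clear_bit(unsigned n) { bitmap[n >> 6] &= ~((uint64_t)1 << (n & 63)); }
    static int  test_bit(unsigned n)  { return (bitmap[n >> 6] >> (n & 63)) & 1; }

    int main(void) {
        set_bit(100);                        /* mark block 100 as used */
        printf("block 100 used: %d\n", test_bit(100));
        clear_bit(100);                      /* free it again */
        printf("block 100 used: %d\n", test_bit(100));
        return 0;
    }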

In 1970, DEC released the PDP-11, a very influential machine with 8-bit bytes and 16-bit words. As it was one of the main machines that Unix was developed on, it helped Unix standardise on 8-bit bytes. The first microprocessors were 4-bit, but 8-bit microprocessors followed soon after. This also helped popularise the 8-bit byte.

As octal numbers represent 3 bits per digit and 8 is not a multiple of 3, a byte is not a whole number of octal digits. If you write a 16-bit number in octal, the two constituent bytes have different octal digits from the 16-bit word as a whole. For example, the number 27125 is 064765 in octal, but if you break the number into two bytes, they become 0151 and 0365. This is a major pain in the butt. In hexadecimal the number is 0x69f5 and the separate bytes are 0x69 and 0xf5. This is why hexadecimal is vastly more popular than octal today: each hexadecimal digit represents 4 bits, and 8 is a multiple of 4. IBM knew this from the start and went all out on hexadecimal with System/360, but at DEC they were not so smart and specified everything in octal. Granted, the PDP-11 instruction set contained many 3-bit fields, and these came out nicely when a 16-bit instruction word was written in octal. The Unix and C legacy still contains octal numbers in many places (see the sketch after this list):

  • A leading zero on an integer literal in C denotes an octal number: 030 means 24, not 30.
  • Octal escapes in string literals in C.
  • The mode parameter of the chmod command.
  • The od (octal dump) command displays octal by default.
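
A short C program demonstrates the first three of these (a sketch; the file name "example.txt" is made up, and chmod(2) is the system call behind the chmod command):

    #include <stdio.h>
    #include <sys/stat.h>

    int main(void) {
        /* A leading zero makes an integer literal octal. */
        printf("030 == %d\n", 030);            /* prints "030 == 24" */

        /* Octal escapes in string literals: \101 is 'A'. */
        printf("\101\n");

        /* chmod takes its mode in octal by convention: 0644 is rw-r--r--.
         * (Assumes a file named "example.txt" exists.) */
        if (chmod("example.txt", 0644) != 0)
            perror("chmod");
        return 0;
    }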
