A Short History Of Character Encoding

When I started work, in 1973, I was programming an ICL 1900 computer. This had a 24-bit word made up of 4 × 6-bit characters.

A 6-bit character could have 64 possible values (2⁶). These values were organised to represent uppercase letters, decimal digits, some punctuation symbols and some control characters. Our printer printed uppercase letters only: no lowercase, no accented characters, no foreign characters and certainly no emojis.

It didn't take long to work out that we needed more. IBM was already using 8-bit bytes, but with a very idiosyncratic character set called EBCDIC. The main problem with EBCDIC was that the characters did not sort into a useful sequence according to their binary values. We have overcome that problem now, but we certainly won't be going back to EBCDIC.

In the 1970s and 1980s, "minicomputers" from DEC (and others) used 8-bit bytes with a character encoding called ASCII. This used 7 bits of each byte, giving 128 usable values. That was enough for uppercase and lowercase Latin letters, numeric digits, punctuation, some common mathematical symbols, a couple of currency symbols and the control characters. Sorting by the numeric values of the characters gave sensible results, a significant win over EBCDIC. ASCII became very popular and was dominant for a long time.

Gradually people wanted to use more characters, and a couple of schemes came along to use the other 128 values made available by using all 8 bits of a byte. So the values from 128 to 255 were assigned to other useful characters. Just what was useful depended on where you lived: your language and your currency, in particular.

Two families of 8-bit codes became dominant: the ISO/IEC 8859 family and the Windows-125x family of code pages. Both are still widely used. In both families, the byte values from zero to 127 are compatible with ASCII, while the byte values from 128 to 255 vary from one family to the other and from one member of each family to another. There is scope for great confusion here, and it happens.

The main problem was that 256 separate codes were nowhere near enough. The Unicode project started in 1987 with the intention of supporting all living languages; it now also supports dead languages and even emojis. Unicode allows for over one million different characters. Each character is uniquely identified by a "Code Point", which is usually written as a hexadecimal number preceded by "U+", so you will see U+xxxx. You may see other variants, such as U&"\xxxx" or simply \uxxxx. The hexadecimal number may have 4, 5 or 6 digits; the basic characters of most languages are covered by 4 digits.
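
For illustration, here is how those notations look in Python (Python is used here purely as a convenient way to display Code Points; the examples are a sketch, not tied to any particular system):

    # Code Points written as escape sequences in Python string literals.
    # \uXXXX takes exactly 4 hex digits; \UXXXXXXXX takes 8 and is used
    # for Code Points that need 5 or 6 significant digits.
    e_acute = "\u00e9"      # U+00E9  LATIN SMALL LETTER E WITH ACUTE
    euro    = "\u20ac"      # U+20AC  EURO SIGN
    smiley  = "\U0001f600"  # U+1F600 GRINNING FACE (5 significant digits)

    for ch in (e_acute, euro, smiley):
        # ord() recovers the numeric Code Point; print it in U+ notation
        print(ch, "U+%04X" % ord(ch))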

The really important thing to know about Unicode Code Points is that they are not, in general, the same as the byte values actually stored. For example, the Euro currency symbol (€) has the Unicode Code Point U+20ac, and in UTF-8 it is stored as three bytes: hex e2 82 ac.
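
A quick way to see the difference, sketched in Python (any language with Unicode support would show the same thing):

    # The Code Point is the character's identifying number; the bytes
    # stored depend on the Unicode Transformation Format chosen.
    euro = "€"
    print(hex(ord(euro)))                 # 0x20ac   -- the Code Point U+20AC
    print(euro.encode("utf-8").hex(" "))  # e2 82 ac -- the three UTF-8 bytes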

Unicode Code Points are mapped into actual storage by a "Unicode Transformation Format" (UTF). On the Web, the most popular Unicode Transformation Format is UTF-8, and its popularity keeps growing because it has a number of advantages. The other common Unicode Transformation Formats are UTF-16 and UTF-32, both of which have two versions depending on the byte order (little-endian and big-endian).
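
The sketch below (Python again, purely as an illustration) encodes the same character in each format; the byte-order variants of UTF-16 and UTF-32 contain the same bytes in opposite order:

    # One character, five Unicode Transformation Formats.
    euro = "\u20ac"  # U+20AC EURO SIGN
    for encoding in ("utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be"):
        print(encoding.ljust(9), euro.encode(encoding).hex(" "))
    # utf-8     e2 82 ac
    # utf-16-le ac 20
    # utf-16-be 20 ac
    # utf-32-le ac 20 00 00
    # utf-32-be 00 00 20 ac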

Although UTF-8 now dominates the Web, the older 8-bit codes, especially Windows-125x and ISO 8859, continue to be widespread in the files and databases of large organisations. These are gradually changing to UTF-8, but it will take a long time. Those of us working in large organisations (in particular) will continue to come across the 8-bit encoding systems, and we must understand enough about them to be able to convert between them accurately and to recognise encoding errors when we see them.
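
A minimal sketch in Python of both tasks, assuming the legacy bytes really are Windows-1252 (the sample byte string is made up for illustration):

    # Convert Windows-1252 bytes to UTF-8, then show a typical encoding error.
    raw = b"price: 100 \x80"            # 0x80 is the Euro sign in Windows-1252

    text = raw.decode("windows-1252")   # -> "price: 100 €"
    utf8_bytes = text.encode("utf-8")   # the Euro becomes e2 82 ac

    # Classic mojibake: UTF-8 bytes wrongly decoded as an 8-bit code
    print(utf8_bytes.decode("windows-1252"))  # price: 100 â‚¬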

UTF-8 is taking over, and the main reasons are: it is backward compatible with ASCII, so the byte values from zero to 127 mean exactly the same thing; it can represent every Unicode Code Point; it has no byte-order ambiguity; and text that is mostly ASCII stays compact, at one byte per ASCII character.
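
The first of those reasons is easy to demonstrate (Python once more, as a neutral illustration):

    # Pure ASCII text is byte-for-byte identical in ASCII and in UTF-8.
    text = "plain ASCII text"
    assert text.encode("utf-8") == text.encode("ascii")
    print(len(text.encode("utf-8")))  # 16 -- one byte per ASCII character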

There is a more comprehensive history of Unicode on Wikipedia.