The Data Studio

 

Where Did These Funny Characters Come From?

 

Are you seeing something like either of these?

The question marks here are a problem because they represent values in our data that are not valid characters: we don't know what those bytes mean. This article explains what is happening and how to solve the problem.

Background

Systems used for Analytics are generally referred to as Data Lakes. In the past, systems performing the same function were called Data Warehouses. Either way, data is loaded from many "source" systems into the Data Lake (or Data Warehouse), which is the "target" system.

There are usually several source systems because there are usually many operational systems that run the organisation. There may be one for keeping customer records, one for handling the accounts, one for managing the products or services that the organisation sells, and so on. Very often there are several systems handling the same function in different parts of the organisation. The source systems may have been developed in-house, bought as standard packages, or put together from various components, maybe by a consulting firm. Some systems may be very old. Big established companies often have mainframe systems and minicomputer systems such as the AS/400 (really!), and any organisation of any size will probably have UNIX systems and Microsoft Windows systems.

This diversity means that we must look at the issue of character sets or, more precisely, "character encoding".

Firstly this page gives some background to the subject of character encoding. Then, as an example using the Hadoop-related tool Hive, we cover some features that help us in managing multiple character sets and moving towards a consistent standard approach for data in our Data Lake.

Conflicts between different character encodings used in different source systems generally cause problems in the following areas:

When parsing delimited files, the presence of a newline character inside a delimited string will usually cause an error. With weaker parsers, or badly formed delimited files, the presence of a delimiter or a quote character inside a string may cause errors too.

When parsing fixed length files, the parser may ignore certain characters with the result that the rest of the line will be broken into fields at the wrong places, corrupting all the fields following the ignored character(s) in the line.

When accessing data that has been loaded into Hive we see two problems: some values silently fail to load, and others load but are misinterpreted (often showing up as the replacement character � or as question marks).

References on Relevant Character Sets

On this subject we have found Wikipedia to be a reliable reference.

Useful references for the character sets we most frequently encounter are:

https://en.wikipedia.org/wiki/UTF-8
https://en.wikipedia.org/wiki/ASCII
https://en.wikipedia.org/wiki/Windows-1252
https://en.wikipedia.org/wiki/ISO/IEC_8859-1
https://en.wikipedia.org/wiki/ISO/IEC_8859-15

Useful references for the character sets we are likely to encounter in older source systems are:

https://en.wikipedia.org/wiki/EBCDIC
https://en.wikipedia.org/wiki/ISO/IEC_8859

There are many variations on ISO/IEC 8859. 8859-1 (Latin-1) and 8859-15 (Latin-9) are used widely.

UTF-8 is a good thing. Why?

UTF-8 has grown in popularity to the point where over 88% of the web pages on the internet are encoded using UTF-8. (Reference: Wikipedia->UTF-8)

UTF-8 is backward compatible with ASCII, which has been the leading character set for UNIX and PCs for many years.

UTF-8 uses only one byte for common English characters and has the capability to use multi-byte characters to encode almost every character known to humanity.

UTF-8 is the way to go.

The Basics

Apologies if you know this already; a surprising number of people in the industry really do not understand it. As an illustration, the chatter on stackoverflow.com (which is usually a good source) is particularly muddled in this area.

We process character data in bytes. When we receive numeric data from source files, it is almost always represented as characters too, because binary number formats are incompatible across systems. (If we ever do get binary numbers we will need to process them very carefully.)

A byte is 8 bits and can conveniently be represented in hexadecimal (usually abbreviated to "hex") or in decimal or, less conveniently, in octal or binary.

For example the character "A" in UTF-8 is represented like this:

A

+---------+----------+
| Binary  | 01000001 |
| Hex     | 41       |
| Decimal | 65       |
| Octal   | 101      |
+---------+----------+

This is the same in UTF-8, ASCII, ISO/IEC 8859 and Windows 1252, all our usual sources.

If we look at the Euro symbol (€) it's a completely different story:

€ - Euro currency symbol

+----------+----------------------------+-----------------+--------------+
| Encoding | UTF-8 (3-byte sequence)    | ISO/IEC 8859-15 | Windows-1252 |
+----------+----------------------------+-----------------+--------------+
| Binary   | 11100010 10000010 10101100 | 10100100        | 10000000     |
| Hex      | e2 82 ac                   | a4              | 80           |
| Decimal  | 226 130 172                | 164             | 128          |
| Octal    | 342 202 254                | 244             | 200          |
+----------+----------------------------+-----------------+--------------+

ASCII and EBCDIC do not have a way to represent the Euro symbol.
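You can check these byte values for yourself on a Linux or Mac machine whose terminal and locale are set to UTF-8; the od tool prints the raw bytes of whatever you give it. (The commands below are just an illustration of that idea.)

printf 'A' | od -An -tx1      # prints: 41
printf '€' | od -An -tx1      # prints: e2 82 ac

If the same Euro symbol came from a Windows-1252 file, the dump would show the single byte 80 instead.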

When you read bytes from a file, you are getting a series of numbers each of which has a value between zero and 255. If you do not know the encoding of the file, you do not know what these bytes mean!

The meaning of each byte is consistent across UTF-8, ASCII, the ISO/IEC 8859 family and Windows-1252 only for the ASCII characters, that is, byte values 0 to 127 (hex 00 to 7f).

For byte values 128 to 255 (hex 80 to ff) you cannot make assumptions: the meaning depends on the encoding.

Note that UTF-8 (and all of the Unicode encodings) uses more than one byte for some characters. This is necessary because there are far, far more than 256 characters that we need to record. Even for English, 256 characters is barely enough; as soon as you add other writing systems (Russian, Arabic, Japanese, etc.) you need many thousands of characters. UTF-8 can encode over a million different characters.

Notice that the Euro currency symbol uses one byte in ISO/IEC 8859-15, one byte in Windows-1252, but three bytes in UTF-8. Many characters behave like this. In fact, when a file is converted to UTF-8, any byte that cannot be recognised as a complete character or as part of a valid multi-byte sequence will be replaced by the three-byte sequence (in hex) ef bf bd, which is the "replacement character" �. So if you take a fixed width file without delimiters, and that file contains some bytes that are invalid in UTF-8, and you convert it to UTF-8 before splitting each record into fields, the split will fail: the conversion will have added extra bytes and pushed your fields to the right. If you have a file with fixed width columns, break it into its fields before changing the character encoding.
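As a rough sketch of that order of operations, assuming a bash shell with the usual cut, paste and iconv tools (the file names, encoding and byte positions below are made up for illustration): split on byte positions first, then convert.

# Hypothetical fixed-width file in Windows-1252:
# bytes 1-10 = account id, 11-40 = name, 41-48 = balance.
# cut -b works on bytes, so these positions are still correct
# in the original single-byte encoding.
paste -d '|' \
    <(cut -b 1-10  accounts_fixed.dat) \
    <(cut -b 11-40 accounts_fixed.dat) \
    <(cut -b 41-48 accounts_fixed.dat) > accounts_delimited.dat

# Only now convert to UTF-8; characters that become multi-byte
# sequences can no longer shift the field boundaries.
iconv -f WINDOWS-1252 -t UTF-8 accounts_delimited.dat > accounts_utf8.dat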

Avoid EBCDIC!

For EBCDIC almost all of the byte values stand for different characters from those in our commonly used character sets. EBCDIC was used on ancient IBM mainframes, and you should never encounter it. Some companies do have such machines, but the files we receive from them have (usually) been converted to a pretty good approximation of ASCII before we get them. We do not want to get involved with EBCDIC data because it will take a lot of work to unscramble it, especially where there are mixtures of character data, binary numbers and packed-decimal numbers. Such files do exist. You really should avoid having to deal with EBCDIC. If you are forced to do so, you are welcome to contact me (ron@thedatastudio.net) and I will explain more.

Pollution of one character set with another

You are extremely unlikely to get EBCDIC characters in Windows, Mac or UNIX files, because EBCDIC is completely incompatible with the other character sets and it is unlikely that anyone will get the opportunity to cut and paste between EBCDIC and anything else. If you see dodgy characters in your files you can be pretty sure that they are not EBCDIC (and if they were, you would have no way of knowing just by looking at them).

You will see files encoded in UTF-8, ASCII, the ISO/IEC 8859 family and Windows-1252 polluted with characters from other encodings. People commonly copy and paste from Word documents and web pages into other data entry forms and documents. The application receiving the pasted data cannot always know what it is getting, and usually does not bother trying. As a result we get curly quotes, em-dashes, no-break spaces and many other characters in data entry forms. These end up in our databases, and then we cannot find the data because it does not match what we think we are seeing.

Because these characters are from a different character set they are often misrepresented in the data stored.

Applications with data entry forms should validate the data that is entered. This is not difficult. The online banking applications that most of us use restrict the characters we can enter for a payment reference, as you have probably noticed. Every commercial application should restrict the data entered to what is actually needed; very few do.
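A quick way to see whether a file has picked up characters from outside plain ASCII is to count the bytes above hex 7f. This sketch assumes a Linux machine with GNU tools, and the file name is hypothetical:

# Delete every ASCII byte (octal 000-177) and count what is left.
# A non-zero count means the file is not pure ASCII, so its encoding matters.
tr -d '\0-\177' < web_form_extract.dat | wc -c

# With GNU grep, show the lines (and line numbers) that contain such bytes.
LC_ALL=C grep -nP '[\x80-\xFF]' web_form_extract.dat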

Control Characters

The good news about control characters is that they are the same in UTF-8, ASCII, the ISO/IEC 8859 family and Windows-1252. The bad news is that they are invisible and cause SQL queries to fail silently. This means failing to return all the data you expect, with no warning. When our customers point out gaps in our reports it is embarrassing at best. We therefore must deal with control characters.
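A trivial illustration of this silent behaviour, using bash: a value with a trailing carriage return looks identical on the screen but does not match the value you think you are comparing against. Exactly the same thing happens in a SQL where clause; the rows are there, they just never match.

# The string on the left ends with an invisible carriage return (hex 0d).
[[ $'LONDON\r' == 'LONDON' ]] && echo 'match' || echo 'no match'
# prints: no match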

Most of the control characters are obsolete device control codes. Bell (which sounds the bell on a teletype), carriage return, vertical tab, etc. (which refer back to typewriters and early printers) and several others are no longer needed. Sometimes we see some of these re-used as field delimiters. It's a big mistake, but it happens.

Newline (also known as "linefeed") is an important control character because, by convention, it marks the end of a record in a UNIX file.

Unfortunately, Windows marks the end of a record with a pair of control characters: carriage return followed by newline.

We use newline, or carriage return and newline, to delimit the records in a file. If a control character has been used as a field delimiter, we split each record into fields using that delimiter (a quick sanity check for such files is shown below). After splitting the data into records and fields, we have no use for any control characters in any database system. We should therefore leave the delimiters behind and not carry them forward to our database. You may have multi-line text fields in your database; in general, I try to avoid that. If you have quite long text fields I would put them in separate files and include a link to each one in the database. For very short notes fields it sometimes makes sense to keep them in the database. They will be a pain to import from files, but it can be done.
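If you do receive a file that uses a control character as its field delimiter (hex 01, which happens to be Hive's default delimiter for text files, is a common choice), a quick sanity check before loading is to confirm that every record splits into the same number of fields. A sketch, with a made-up file name:

# Split on the ^A (hex 01) control character and count fields per record.
# A well formed file produces a single field count.
awk 'BEGIN { FS = "\001" } { print NF }' ctrl_a_delimited.dat | sort | uniq -c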

All other control characters must be removed and we should never see them in any database.

The control characters are decimal 0 to 31 and 127. That is hex 00 to 1F and hex 7F. Newline is decimal 10, hex 0a. Carriage return is decimal 13, hex 0d.

There is a simple bash script that will read a file and give you a copy of it with all the control characters except newline converted to spaces. Each control character is converted to a single space so the file will still work as a fixed format file. csv files will also convert correctly. Delimited files that (foolishly) use a control character as a delimiter will not survive this conversion, but you could modify the script to preserve only the control character that is used as a delimiter.

You can download the script here. The interface is simply:

clean_control_characters.bash INPUT_FILE OUTPUT_FILE

The script, as written, will not allow you to overwrite a file; you have to specify a new file as the output, so it's hard to do any harm with it. On a moderately sized Linux box it processes over 100MB/sec; that's about 250,000 records/sec (depending on the record width) so it won't break the bank either.
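The downloadable script is the reference version, but the core of the idea can be sketched with the standard tr tool. This sketch makes the same assumptions as the description above: every control character except newline becomes exactly one space, and an existing output file is never overwritten.

#!/bin/bash
# Sketch only - clean_control_characters.bash (above) is the real script.
# Usage: clean_control_characters_sketch.bash INPUT_FILE OUTPUT_FILE
INPUT_FILE="$1"
OUTPUT_FILE="$2"

if [[ -z "$INPUT_FILE" || -z "$OUTPUT_FILE" ]]; then
    echo "usage: $0 INPUT_FILE OUTPUT_FILE" >&2
    exit 1
fi
if [[ -e "$OUTPUT_FILE" ]]; then
    echo "refusing to overwrite existing file: $OUTPUT_FILE" >&2
    exit 1
fi

# Octal 000-011 and 013-037 are the control characters either side of
# newline (octal 012), and 177 is delete. Each one becomes a single space,
# so byte positions in fixed-width records are preserved.
tr '\000-\011\013-\037\177' '[ *]' < "$INPUT_FILE" > "$OUTPUT_FILE"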

Clashes Between Character Sets

In UTF-8, bytes in the range hex 80 to bf (decimal 128 to 191) are "continuation bytes", only used as the second, third or fourth byte of a multi-byte sequence.

Also, in UTF-8, bytes in the range hex c2 to f4 (decimal 194 to 244) are "leading bytes", introducing a multi-byte sequence.

And finally, in UTF-8, the bytes hex c0, c1 and hex f5 to ff (decimal 192, 193 and 245 to 255) are invalid; these values never appear in valid UTF-8 data.

Bytes from the above ranges may appear if we are loading data that is stored in Windows-1252 or one of the ISO/IEC 8859 character sets. Such bytes will not load successfully into Hadoop, and most noticeably Hive: they either silently fail to load, or they load and are misinterpreted. (Some more testing would be useful here to document exactly what happens in each case.)
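One cheap way to find out whether a file you are about to load contains any of these invalid byte values is to ask iconv to "convert" it from UTF-8 to UTF-8: it stops and reports the position of the first invalid sequence. The file name here is hypothetical.

if iconv -f UTF-8 -t UTF-8 landing_file.dat > /dev/null; then
    echo "landing_file.dat is valid UTF-8"
else
    echo "landing_file.dat contains bytes that are not valid UTF-8"
fi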

The Good News: Hive Support for Multiple Character Encodings

We have not been able to find much information in the Hive documentation, but the Hortonworks Community website came to the rescue. This is the relevant page:

https://community.hortonworks.com/articles/58548/processing-files-in-hive-using-native-non-utf8-cha.html . Be sure to look at the comments too, because these address some particular issues that we were having in a recent project.

The way to use this feature is to specify the encoding of the underlying file when you create a Hive external table over that file. Here is an example:

create external table jufgmrs_fixed_length
    (line string)
row format
    serde 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
    with serdeproperties ('serialization.encoding' = 'WINDOWS-1252')
location
    'hdfs://cluster/landing/jufgmrs_fixed_length'
tblproperties ('skip.header.line.count' = '1');

(Mainframe files often have stupid names like "jufgmrs" - some people seem to think that data names should be like this.)

You can now select data from this table and see all the characters as they would have been seen in the source system. More usefully, you can use this table to populate another Hive table (in ORC or some other format), either with create table ... as select ... or by creating the target table and using insert into table ... select .... In either case the data will be converted to UTF-8 for you, and the target table can be used by itself or joined with other Hive tables.

In this example we used the WINDOWS-1252 encoding because that is what the source system used for this file. There are many, many encodings supported; you can find out which ones by typing on an edge node:

iconv -l

You'll probably be surprised by the number of encodings that are supported, even though there are some synonyms, such as WINDOWS-1252, CP-1252 and MS-ANSI.
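The list is long, so it helps to filter it. With GNU iconv the names are printed several to a line, so splitting on spaces first makes the filter cleaner; for example, to see just the Windows code pages:

iconv -l | tr ' ' '\n' | grep -i '^WINDOWS'

Searching for '1252' in the same way shows the aliases that contain that number in their names.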

What If I Don't Know The Character Set?

Although the character encoding is essential for us to know what the bytes in a file mean, it is amazingly often the case that nobody seems to know what the character set is for a particular file or a particular system. You would hope that there would be some clever application that would work it out for you. Again, Wikipedia has some good sense on character-set detection. The bottom line is that character-set detection is not an exact science: there are some tools available, but none of them can tell you with complete certainty which character set was actually used.

We do have our own tool (again, this will be made available on this website soon). It has the same limitations that all the tools have, but it can help to point you in the right direction. It is a simple Java program that reads a byte-stream and provides some information about what the bytes could be. For unlikely bytes it gives the record number and the position in that record. It also gives some overall statistics about the file. Analysts can use this output to decide what the likely encoding is.

The program is a single Java class and it can be run like this:

java CharSetProfile /data/tmp/doewnbfi.DAT >doewnbfi.scan

The one parameter is the path of the file to be profiled. The output goes to stdout, and in this case it is redirected to a file called doewnbfi.scan. This can turn out to be quite a big file, so it can be hard to read interactively. Here are some highlights from a real example:

Line: 938543, column: 585  Found UTF-8 continuation byte 162, Hex: a2, Octal: 242, UTF-8 Continuation byte, Windows-1252 Cent (currency symbol) 
Line: 938548, column: 585  Found UTF-8 continuation byte 162, Hex: a2, Octal: 242, UTF-8 Continuation byte, Windows-1252 Cent (currency symbol) 
Line: 938553, column: 1  Carriage return (hex 0d) without newline (hex 0a) 

This extract shows a couple of lines that have a byte value of 162 (decimal), which is hex a2 and octal 242. If the file were in UTF-8 this would be a continuation byte, so there should be a leading byte before it (and there is not in this file, because that would show on the report line above). If the file were in Windows-1252 this would be the cent currency symbol (¢). Since this byte is a plausible symbol in Windows-1252 and part of an invalid sequence in UTF-8, it suggests that the file is more likely to be Windows-1252 than UTF-8. So that's one data point.
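You can reproduce this reasoning for a single byte with iconv, assuming a terminal set to UTF-8:

# Interpret the byte hex a2 as Windows-1252: it converts cleanly to the cent sign.
printf '\xa2' | iconv -f WINDOWS-1252 -t UTF-8     # prints: ¢

# Interpret the same byte as UTF-8: a lone continuation byte is rejected
# and iconv reports an illegal input sequence.
printf '\xa2' | iconv -f UTF-8 -t UTF-8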

The third line in this extract shows that there is a carriage return without a newline. This is unusual because we usually see carriage return/newline as the end of line in Windows files, and since we gave up teletypes there has not been much use for the carriage return character in any other case.

After the reports on individual lines containing interesting characters, we get some summary statistics on the file, as shown in the next extract:

Lines: 3811262
Bytes: 3898861310
Windows line ends: 3811261
UNIX line ends: 0
Carriage returns without newline: 2

A line is defined here as any sequence of characters ending in newline, or carriage return/newline. "Windows line ends" counts the number of carriage return/newline pairs, and in this case we have 3,811,261 of these. UNIX line ends counts the number of newlines that are not preceded by a carriage return and there are none of these. This tells us that the file we are looking at was produced on a Windows system (not UNIX and not Mac).

This section also tells us that there are 2 carriage returns without newlines. We saw one of them before and there is another one.

The next extract is the start of a table that lists every byte value (from 0 to 255) and the count of bytes with that value. The extract shown below is just the first 32 values. This covers all but one of the control characters:

+-----+-----+-----+-----------------+-----------+--------+----------+------------------------------+-------------------------------+-----------------------------------------------+
|     |     |     |                 |   ASCII   |   C    | teletype |                              |                               |                                               |
| Dec | Hex | Oct | Number of Bytes | Printable | Escape | notation | Name                         |   Specific to Windows-1252    |                  UTF-8 Group                  |
+-----+-----+-----+-----------------+-----------+--------+----------+------------------------------+-------------------------------+-----------------------------------------------+
| 000 |  00 | 000 |              28 |           |   \0   |    ^@    | null (NOT SQL or Java null!) |                               | Control character, ASCII compatible           |
| 001 |  01 | 001 |               0 |           |        |    ^A    | start of heading             |                               | Control character, ASCII compatible           |
| 002 |  02 | 002 |               0 |           |        |    ^B    | start of text                |                               | Control character, ASCII compatible           |
| 003 |  03 | 003 |               0 |           |        |    ^C    | end of text                  |                               | Control character, ASCII compatible           |
| 004 |  04 | 004 |               0 |           |        |    ^D    | end of transmission          |                               | Control character, ASCII compatible           |
| 005 |  05 | 005 |               0 |           |        |    ^E    | enquiry                      |                               | Control character, ASCII compatible           |
| 006 |  06 | 006 |               0 |           |        |    ^F    | acknowledgement              |                               | Control character, ASCII compatible           |
| 007 |  07 | 007 |               0 |           |   \a   |    ^G    | bell                         |                               | Control character, ASCII compatible           |
| 008 |  08 | 010 |               0 |           |   \b   |    ^H    | backspace                    |                               | Control character, ASCII compatible           |
| 009 |  09 | 011 |           1,428 |           |   \t   |    ^I    | horizontal tab               |                               | Control character, ASCII compatible           |
| 010 |  0a | 012 |       3,811,261 |           |   \n   |    ^J    | newline                      |                               | Control character, ASCII compatible           |
| 011 |  0b | 013 |               0 |           |   \v   |    ^K    | vertical tab                 |                               | Control character, ASCII compatible           |
| 012 |  0c | 014 |               0 |           |   \f   |    ^L    | form feed                    |                               | Control character, ASCII compatible           |
| 013 |  0d | 015 |       3,811,263 |           |   \r   |    ^M    | carriage return              |                               | Control character, ASCII compatible           |
| 014 |  0e | 016 |               0 |           |        |    ^N    | shift out                    |                               | Control character, ASCII compatible           |
| 015 |  0f | 017 |               0 |           |        |    ^O    | shift in                     |                               | Control character, ASCII compatible           |
| 016 |  10 | 020 |               0 |           |        |    ^P    | data link escape             |                               | Control character, ASCII compatible           |
| 017 |  11 | 021 |               0 |           |        |    ^Q    | device control 1             |                               | Control character, ASCII compatible           |
| 018 |  12 | 022 |               0 |           |        |    ^R    | device control 2             |                               | Control character, ASCII compatible           |
| 019 |  13 | 023 |               0 |           |        |    ^S    | device control 3             |                               | Control character, ASCII compatible           |
| 020 |  14 | 024 |               0 |           |        |    ^T    | device control 4             |                               | Control character, ASCII compatible           |
| 021 |  15 | 025 |               0 |           |        |    ^U    | negative acknowledgement     |                               | Control character, ASCII compatible           |
| 022 |  16 | 026 |               0 |           |        |    ^V    | synchronous idle             |                               | Control character, ASCII compatible           |
| 023 |  17 | 027 |               0 |           |        |    ^W    | end of transmission block    |                               | Control character, ASCII compatible           |
| 024 |  18 | 030 |               0 |           |        |    ^X    | cancel                       |                               | Control character, ASCII compatible           |
| 025 |  19 | 031 |               0 |           |        |    ^Y    | end of medium                |                               | Control character, ASCII compatible           |
| 026 |  1a | 032 |               0 |           |        |    ^Z    | substitute                   |                               | Control character, ASCII compatible           |
| 027 |  1b | 033 |               0 |           |   \e   |    ^[    | escape                       |                               | Control character, ASCII compatible           |
| 028 |  1c | 034 |               2 |           |        |    ^\    | file separator               |                               | Control character, ASCII compatible           |
| 029 |  1d | 035 |               0 |           |        |    ^]    | group separator              |                               | Control character, ASCII compatible           |
| 030 |  1e | 036 |               0 |           |        |    ^^    | record separator             |                               | Control character, ASCII compatible           |
| 031 |  1f | 037 |               0 |           |        |    ^_    | unit separator               |                               | Control character, ASCII compatible           |
  

We can see several interesting things from this table: