The Data Studio


Hive: How To Load A File With The Correct Character Encoding

The Good News: Hive Support for Multiple Character Encodings

We have not been able to find much information in the Hive documentation, but the Hortonworks Community website came to the rescue. This is the relevant page:

https://community.hortonworks.com/articles/58548/processing-files-in-hive-using-native-non-utf8-cha.html. Be sure to read the comments too, because they address some particular issues we hit in a recent project.

The way to use this feature is to specify the encoding of the underlying file when you create a Hive external table over that file. Here is an example:

create external table jufgmrs_fixed_length (
    line string
)
row format serde 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
with serdeproperties ('serialization.encoding'='WINDOWS-1252')
location 'hdfs://cluster/landing/jufgmrs_fixed_length'
tblproperties ('skip.header.line.count'='1');

(Mainframe files often have stupid names like "jufgmrs" - some people seem to think that data names should be like this.)

You can now select data from this table and see all the characters as they would have appeared in the source system. More usefully, you can use this table to populate another Hive table (in ORC or some other format), either with create table as select ... or by creating the target table and running insert into table ... select .... In either case the data is converted to UTF-8 for you, and the target table can be used on its own or joined with other Hive tables.
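As a sketch of that conversion step (the target table name jufgmrs_orc is ours, invented for this example; the original project's target table is not named here):

create table jufgmrs_orc
stored as orc
as select line from jufgmrs_fixed_length;

or, equivalently, create the target table first and then load it:

create table jufgmrs_orc (line string)
stored as orc;

insert into table jufgmrs_orc
select line from jufgmrs_fixed_length;

Either way, the serde decodes WINDOWS-1252 on read and Hive writes UTF-8 into the ORC table.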

In this example we used the WINDOWS-1252 encoding because that is what the source system used for this file. Many, many encodings are supported; you can list them by running the following on an edge node:

iconv -l

You'll probably be surprised by how many encodings are supported, although some of the entries are synonyms for each other, such as WINDOWS-1252, CP-1252 and MS-ANSI.
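If you just want to check whether a particular encoding is available, it is quicker to pipe the list through grep than to read it; for example:

iconv -l | grep -i 1252

(Depending on the iconv implementation, synonyms may be grouped several to a line, so the matching line may show more than one name.)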