The Data Studio

Hive: Muddled Implementation Of Dates And Timestamps

Tested Using Hortonworks Data Platform (HDP) Sandbox, Release 2.5 with Hive 1.2.1 and Hive 2.1.0

In the early part of my career I spent a lot of time writing code to unpick date formats, to compare dates and to do basic date arithmetic. We all wasted a lot of time doing this and we made mistakes. The "millennium bug" was the most notorious, but there were many others. A common feature of the best modern software is that date-handling is provided in comprehensive, well-tested libraries so that we can all avoid the tedious and error-prone bit-twiddling that we did before. I think particularly of relational databases and Java as tools that saved us a huge amount of work in processing dates and timestamps. These all support:

Some databases do this better than others. Oracle defines a "date" as a timestamp with precision down to the second; an Oracle "timestamp" is a timestamp with precision down to one billionth of a second. But apart from that idiosyncracy, Oracle date and timestamp processing is good. Other databases have comprehensive and reliable date and timestamp processing and PostgreSQL, as usual, has a particularly well thought out and comprehensive implementation. Unfortunately, there are several issues with Hive dates and timestamps.

Hive returns dates and timestamps as strings.

The Hive documentation suggests that dates and timestamps will be treated as their proper data-types from Hive release 2.1.0 and that built-in functions will return the correct data-type rather than strings. I have yet to test thsi and to find out if it applies in JDBC too. What follows is true of Hive version 1.2.1, which is still the current version in Hortonworks HDP 2.5 (although Hive 2.1.0 is provided in this HDP release as a "Technical Preview").

This is particularly important when using JDBC because we have to get the date/timestamp as a string and then convert it to the type it should have been. This means we have to use:

      ResultSet.getString("<column name>")

rather than:

      ResultSet.getDate("<column name>")         // This does not work
      ResultSet.getTimestamp("<column name>")    // This does not work

Hive comparisons between a date and a timestamp are confusing.

I accept that a date is not the same data-type as a timestamp, so we should not compare them, maybe. The trouble is that users do compare such fields and they do not get the results they expect. Most databases make an implicit conversion when doing such comparisons, treating a date as a timestamp with zero hours, zero minutes and zero seconds. Most databases say that, for example:

        2017-03-24 = 2017-03-24 00:00:00

Hive does not do this, possibly because it treats both as strings. Using the same example, Hive says:

        2017-03-24 < 2017-03-24 00:00:00

The workaround is to explicitly convert dates to timestamps when doing comparisons, like this

        cast(account_start_date as timestamp) = paid_timestamp;

The result is that users must be trained to do this, and must be very disciplined in always casting dates to timestamps in the appropriate cases. In reality it means that users will make mistakes and will get misleading results with no warning about what has happened. If Hive does not do the implicit conversion it should raise an error. Silently returning misleading results is not acceptable.