The Data Studio

Hive Count Gives Wrong Answer

Tested Using Hortonworks Data Platform (HDP), Release 2.4, Hive 1.2.1

This is possibly the most common SQL statement:

      select count(*) from <table>;

Amazingly, this will not always give you an accurate result in Hive, which makes this most basic of statements useless. The reason is that Hive can answer count(*) from the table statistics held in the metadata rather than by scanning the table. If the statistics have not been updated, the result can be completely wrong. Why anyone would think that a wrong answer quickly is better than the right answer, however long it takes, is completely beyond my understanding.
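If I understand the mechanism correctly, this behaviour is controlled by the configuration property hive.compute.query.using.stats: when it is true, Hive answers count(*) (and min/max) from the stored statistics. A sketch of two possible remedies, assuming that property is indeed the culprit on your system:

      -- Disable stats-based answers for this session, forcing a real scan:
      set hive.compute.query.using.stats=false;
      select count(*) from <table>;

      -- Or refresh the statistics first, so the stored row count is current:
      analyze table <table> compute statistics;
      select count(*) from <table>;

Neither of these helps, of course, if you don't already know that the default answer can be wrong.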

If you use count(<column_name>), this will give you a count of the non-null values in <column_name>. That may be more helpful, but it is not the same as count(*).
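As a sketch of the difference (the table name t and column x here are invented for illustration):

      -- Suppose t holds three rows whose x values are 1, NULL and 2.
      select count(*) from t;   -- counts every row, null or not: 3
      select count(x) from t;   -- counts only rows where x is not null: 2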

A universal (but ugly) workaround is to write:

      select count(*) as count_all, count(<column_name>) as dummy from <table>;

"count(<column_name>)" forces a table scan and "count(*)" then gives the right answer. Here's a real example:

      select count(*) as wrong_count from real_live_table;

      | real_live_table.wrong_count | 
      | 352716444                   |

      select count(*) as count_all, count(real_column) as count_dummy from real_live_table;

      | real_live_table.count_all | real_live_table.count_dummy | 
      | 35616181885               | 32577764395                 |

These are real results from a live system. Note the number of digits here. The correct result is 35 billion; the wrong result is 352 million. It is wrong by a factor of 100!

I have changed the table and column names because the company using this system may be embarrassed by their disastrous choice of software. The numbers are absolutely real.

When it's wrong it can be really wrong; two orders of magnitude in this case.

Update: Using Hortonworks Data Platform (HDP), Release 2.5, Hive

I no longer have access to the system that produced the results above, but I ran across a similar problem with the latest release on my own system. I was testing some aspects of character encoding and translation and I made a small table with 593 rows. So I checked that I had loaded it correctly, like this:

      select count(*) from unicode_latin_as_text;

      |  c0  |
      | 589  |

I have cut out the many lines of trace information, but what is left is genuine and unedited. It wasn't what I was expecting, so I tried the workaround shown above.

      select count(*) as count_all, count(code_point) as count_dummy from unicode_latin_as_text;

      | count_all  | count_dummy  |
      | 593        | 593          |

That's the same table with the queries run one after the other, and several times just to be sure.

Whether we're looking at 35 billion rows or 593 rows, a database that cannot give an accurate count of the rows in a table is just not worth using. I do not intend to use Hive ever again.