The Data Studio

Big Data - Big Mistake

Protest Warning

Every organisation I have worked for in recent years has been getting very excited about "Big Data". Most have tried using the products that are sold under this banner.

I have been working on large databases for some decades now, and many of these databases certainly qualified as "big data". I am an enthusiast for large databases; they can be used for systems that have great commercial and/or social value. (They can also be used for bad things - but that's true of any technology.) So why am I saying that Big Data is a Big Mistake? Note the capital letters. Massive marketing programmes have tried, with quite a lot of success, to equate "Big Data" with Hadoop-based systems. The main players here have been Cloudera, Hortonworks and MapR, and they have succeeded both in making a lot of money for themselves, and in giving CIOs and Enterprise Architects the idea that relational databases are not "big data". This is nonsense, but it shows the power of well-funded marketing. This view has even been taken up by respectable database vendors such as Teradata, IBM and Oracle, who have built up Hadoop-based businesses even though their own real relational databases are far more appropriate tools for most of the applications that have been implemented (or where implementation has been attempted) using Hadoop.

My beef is not so much with "the Hadoop ecosystem", although it is, for the most part, a dire collection of badly-engineered tools; rather my complaint is about the over-hyped marketing of these tools. I'll say it again: most applications for which Hadoop has been attempted in big corporates would have been achieved with more success and at lower cost if they had used proper relational databases.

Big Data Tools Are Open Source But They Are Not Free

The software may be free, but the costs of using it are very significant.

The hardware will not be cheap. I worked on a project where the hardware cost over £1 million (which was a bit more than $1 million and a bit more than €1 million then). For the same money that company could have bought a Netezza appliance (from IBM) and that would have been faster, much more reliable and much easier to use. That company already had a Microsoft SQL Server system that cost about £250,000 and was faster.

The support from Hortonworks had a very significant cost and was patchy. I have no reason to believe that it would have been much different from any of the other suppliers.

The real cost was in development. The Big Data system is inconsistent, idiosyncratic and complex. We spent vast amounts of time investigating bugs, trying to negotiate fixes, searching for workarounds. We also had a significant team working on administration. Despite their dedication and skill (both of which were outstanding in my experience) they had to battle with some very long-winded processes to make the whole thing work and to protect the security of the data.

Hortonworks shouts: "ZERO LOCK-IN. 100% OPEN SOURCE FOR MAXIMUM FLEXIBILITY." But Open Source does not necessarily mean flexibility. If you have implemented the Hortonworks Data Platform, with all its idiosyncrasies, you will be hard-pressed to move your code to any other platform; it will be a huge job.

Big Data Tools Are Not Necessarily Faster

I would love to run some benchmarks between a large Hadoop cluster, Netezza, and some conventional databases. My experience of writing thousands of queries on several large systems tells me that other systems perform better.

Some work has been done on this. See Hive vs. PDW Benchmark Results.

For smaller systems, with tables up to, say, 100 million rows, any of the real relational databases would beat Hadoop/Hive every time. And, actually, most of the Big Data systems are not so big. The companies using them are really wasting their money.

Big Data Tools Easy To Use? You're 'avin a laugh! (as we say in London)

I had to help a bunch of analysts who were moving from Microsoft SQL Server to Hive. It was painful and embarrassing. Someone described the typical output from Hive as four pages of map-reduce trace, three pages of Java stack trace and a misleading error message. It was not easy to disagree with him; I had experienced the same frustration many times.

The system administrators experienced similar frustrations. One example was Ranger. This is one of the products in the "ecosystem" and it performs the function of the "grant" statements in a real relational database system. Managing the permissions was hugely time-consuming using the Ranger GUI. A script-based solution using grant statements would have been easier to implement and could have been managed through the source control system. When the administrators sought support they were told that most Ranger users have only about 30 security rules. That won't cover the needs of any "big" system.
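To make the comparison concrete, here is a minimal sketch of what a script-based alternative might look like: the permissions live in a version-controlled data structure, and a small script generates standard GRANT statements from it. The role, table and privilege names here are invented for illustration.

```python
# A sketch of script-based permission management: rules are data, held in
# source control, and the script turns them into standard GRANT statements.
# All role, table and privilege names below are hypothetical examples.

RULES = [
    # (role, privileges, table)
    ("analyst", ["SELECT"], "sales.orders"),
    ("analyst", ["SELECT"], "sales.customers"),
    ("etl_job", ["SELECT", "INSERT", "UPDATE"], "sales.orders"),
]

def generate_grants(rules):
    """Turn the rule list into GRANT statements. The syntax shown is the
    common core that the mainstream relational databases share."""
    return [
        f"GRANT {', '.join(privs)} ON {table} TO {role};"
        for role, privs, table in rules
    ]

if __name__ == "__main__":
    for stmt in generate_grants(RULES):
        print(stmt)
```

Because the rules are plain data and the output is plain SQL, every change to the security model is reviewable, diffable and repeatable - which is exactly what a GUI-driven tool makes difficult.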

Hive SQL Does Not Work

Have a look at Hive For SQL Developers.

Are You Google, Facebook or Twitter? No? So Why Are You Wasting Money On Big Data Tools?

The Big Data tools grew out of products that were designed to search through vast quantities of text. Even Google, Facebook and Twitter don't use Big Data tools to manage their accounts; they use the Oracle E-Business Suite. That may not be cool, but it is rational. If they used the Big Data tools to manage their accounts, these companies would have failed because their data would have been damaged and their systems would have been slow and unreliable.

If you are wanting to analyse vast quantities of text, plus images, sound clips and video, then I am prepared to believe that the Big Data tools may do this well. I have not experienced this myself, but it seems feasible. By "vast" I mean petabytes. If your data is a few terabytes don't put yourself through the pain of using Big Data tools.

I do have experience of using the Big Data tools on a large database of structured data. It was not a pleasant experience. If your data has any significant proportion of fields that are numbers, dates or timestamps, don't use the Big Data tools; use a proper relational database.

If you are collecting data from the Internet of Things, that data probably consists almost entirely of numbers, dates and timestamps. We saw a system using Big Data tools for such a job. It was prohibitively difficult for them to analyse their data, so their testing was weak and when they lost great swathes of customers' data they were completely unable to recover it. At the same time we were using a Netezza system to hold all of their data and a similar amount from another supplier. We never lost any, and we were the team that discovered the holes in their data in the first place.

Most business data is numbers, dates and timestamps. If that is what you are dealing with, use a proper relational database.

Of course, many businesses want to cash in on the data they have about their customers and suppliers. They think that they will be able to make billions by spying on their customers and using that data to sell them more stuff. In many countries such activity is illegal. It is also mostly futile. You will have seen that yourself when you bought, perhaps a pair of shoes, and you have been pestered about buying more shoes ever since. I don't know about you, but when I have just bought a pair of shoes, I am very unlikely to buy another pair for a while. As a business you would generally do better to make sure that your core functions are efficient and that you meet your customers' needs as you promised to. As an individual, you should install an ad-blocker or two and reduce your irritation levels at a stroke.

Big Data Tools Do Not Replace Relational Databases

The leading Relational Databases are rock-solid reliable and fast. They can handle hundreds of millions of records with total reliability. They are easy to use in applications. There are many reporting and "visualisation" tools you can run on them. They have reliable and simple security features. They handle normal business data with ease.

The Big Data tools don't meet any of these criteria. You need to have a very special use-case to take on the cost, complexity and general flakiness of the Big Data tools.

Big Data Tools Are Not Operational Databases; Gartner Is Wrong.

I sent the following message to Gartner on 17-Feb-2017. I received an acknowledgement very quickly, but, as yet, no response to the content of my message.

The Gartner Magic Quadrant for Operational Database Management Systems (Published: 05 October 2016 ID: G00293203) states that:

"OPDBMSs must include functionality to support backup and recovery, and have some form of transaction durability — although the atomicity, consistency, isolation and durability (ACID) model is not a requirement."

This suggests that we don't care about our transactions being atomic, consistent, isolated and durable. If the money you just paid into your bank account suddenly disappears, that's OK now is it? Of course it isn't.

I am very distressed by the promotion of the "noSQL" and "Big Data" databases and their adoption in completely inappropriate situations. I was recently involved in a significant project trying to make one of these things work. It was painful in the extreme. I have been working with databases for nearly 40 years and I have delivered many successful projects on several different database products. The Big Data product I recently worked on was not cheaper, not faster, was less reliable and less secure than any of the genuine relational databases. The touted cost savings are just not there. The software may be free, but cost of hardware is significant and the cost of support is astronomical if you measure it in terms of value for money. The really big cost though is in application development. The mess of tools that you get with these systems means that it takes developers much more time to deliver any working applications than it would with one of the established relational database systems. Add to that the number of bugs in basic functionality, and the development costs go through the roof.

Many businesses look to Gartner to give them advice they can trust. The suggestion that ACID compliance doesn't matter for "Operational Database Management Systems" is gravely mistaken and lends credibility to the many new so-called database management tools that should never have seen the light of day in the commercial world. In Gartner's position of trust you should be seen to call out that the Big Data emperor has no clothes.

The Big Data Tools Suffer From Very Poor Architecture

The Big Data Tools describe themselves as "an ecosystem". That is a realistic metaphor. A rotting log is an ecosystem. It is something that was once a tree, probably beautiful, possibly grand, but now is returning to the earth by a process of decay. Various things live in this log: fungi, moss, ferns, centipedes, wood lice, ants, spiders, snails, slugs, worms, birds, possibly mice, possibly a snake. Some of these are eating the log and some are eating one another. The living things interact with one another, competing with and destroying one another. The end result is dirt. I would not want to call my software system an ecosystem.

What do I mean by poor architecture? There are several elements to this.

Relational databases follow some principles. One important one is the separation of the physical storage from the functionality of the interface. The physical storage can take many forms. It can be indexed in different ways, it can be partitioned across the persistent storage (disks, solid-state drives, etc.), it can be a "row-store" or a "column-store", it could have a hash key, may or may not be compressed, and so on. Whatever method is used to store the data, the functionality of the SQL statements you can run to retrieve or manipulate the data is the same. One physical storage scheme will perform better than another in any given situation, but the choice of physical storage does not interfere with the logic of the application. This is important because it enables developers to concentrate on getting the functionality right and it encourages tuning when that becomes necessary.
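The principle can be demonstrated in miniature with SQLite (chosen here only because it is small and self-contained): adding an index changes the physical storage and the access path, but the same query returns the same answer.

```python
# A small illustration of the separation between physical storage and the
# SQL interface: adding an index may change how the data is accessed, but
# the same query returns the same logical result. SQLite is used here
# purely as a convenient, self-contained example.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, "alice", 10.0), (2, "bob", 20.0), (3, "alice", 5.0)])

query = ("SELECT customer, SUM(amount) FROM orders "
         "GROUP BY customer ORDER BY customer")
before = conn.execute(query).fetchall()

# Change the physical storage: the optimiser may now use the index,
# but the application's SQL is untouched.
conn.execute("CREATE INDEX idx_customer ON orders (customer)")
after = conn.execute(query).fetchall()

assert before == after  # same logical answer, different physical plan
```

This is precisely the guarantee that, in my experience, the Hadoop storage schemes fail to give: there, changing the storage format can change the behaviour of the queries themselves.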

In Hadoop there are many different storage schemes and the functionality changes from one to another. In one case the choice of storage, coupled with a choice of how to access that storage, silently changes the data-types in the table definition.

As well as having different storage options, the Big Data platforms have several different database engines with different characteristics and different interfaces. So I may have to transfer data from one engine to another just to be able to use the tools and applications I need.

Real relational databases include functionality to manage security. This is the grant system. It is very consistent between databases so the learning curve when moving from one product to another is shallow and short. This scheme is mature and comprehensive. Ranger cannot compete.

Real relational databases manage their metadata. Each one has a different way of representing the metadata internally, but they present it in a consistent way, that is accessible from the standard SQL interface. The Big Data tools use a real relational database to store their metadata (Microsoft SQL Server, MySQL, Oracle or PostgreSQL). In isolation, that would be a good choice, but having to access the metadata through a different interface is just unnecessary complexity.
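As a tiny illustration of "metadata through the same interface", here is SQLite's catalogue table queried with ordinary SQL. Other databases use information_schema or system views instead, but the principle is the same: no separate metadata API is needed.

```python
# Relational databases expose their metadata through the same SQL
# interface as the data itself. SQLite's catalogue table, sqlite_master,
# is a simple example of the principle; other products use
# information_schema or system views to the same effect.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER)")

# Ordinary SQL against the catalogue - no separate metadata service.
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
print(tables)
```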

A well-architected software system ensures that the data it manages is consistent and accurate. The "silent failure" is the enemy of this principle, since it allows errors to go unreported and therefore allows incorrect data to be stored or simply ignores any data that doesn't make sense, resulting in missing information. When data-loads fail silently, large numbers of transactions, customer details, alerts, etc. can be missed altogether. Hive seems to operate at two extremes: either you get to scroll through pages of Java stack trace, looking for the gem of a useful message (which can be found sometimes), or you get a success message and later often find that what you expected did not happen. There are several examples in the topics listed on Hive For SQL Developers.

Any software that is to be used for real work needs to have complete error-handling built in from the start. Hive has clearly been built with error handling that is either the default Java stack trace or a wilfully ignored exception. I have seen very little willingness to improve the error handling, and it would be a hopeless task now that so much software has been built without addressing this area; trying to retrofit such robustness after the code has been written is an almost impossible task.
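The "fail loudly" principle is easy to state in code. Here is a sketch of a data load that rejects a bad batch with a clear, located error rather than silently dropping the rows that don't parse; the row format is invented for illustration.

```python
# A sketch of "fail loudly" data loading: a row that does not parse stops
# the load with an error that says which line and which value, instead of
# being silently discarded. The comma-separated row format is hypothetical.
from datetime import datetime

def load_rows(raw_rows):
    loaded = []
    for line_no, raw in enumerate(raw_rows, start=1):
        fields = raw.split(",")
        if len(fields) != 3:
            raise ValueError(
                f"line {line_no}: expected 3 fields, got {len(fields)}: {raw!r}")
        cust_id, when, amount = fields
        try:
            loaded.append((int(cust_id),
                           datetime.strptime(when, "%Y-%m-%d"),
                           float(amount)))
        except ValueError as exc:
            # Re-raise with context instead of swallowing the error.
            raise ValueError(f"line {line_no}: bad value in {raw!r}") from exc
    return loaded

good = ["1,2017-02-17,10.50", "2,2017-02-18,3.99"]
bad = ["1,2017-02-17,10.50", "2,not-a-date,3.99"]
print(load_rows(good))
```

Calling load_rows(bad) raises an error naming line 2 and the offending row. The point is not the ten lines of Python; it is that this discipline has to be present from the first line of a system, because it cannot be bolted on afterwards.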

Every little bit of inconsistency costs money, and they don't just add up, they multiply.

But Big Data is the Way of the Future, Surely?

For analysing huge quantities of gossip, maybe. But for real hard data, with numbers and dates and times, no. The problem is not so much with the products (although they are dire). Rather, the problem is with the commercial greed which sees the opportunity to sell something that is already free (the Big Data products are almost entirely Open Source) and make a fortune.

People like me are, the Big Data vendors will say, naively standing in the way of an enormous business opportunity. And people who sell Big Data tools are naively wrecking other businesses by promoting tools that increase costs and deliver misleading results. Just as accurate insights can increase a business's effectiveness, so inaccurate results can destroy businesses along with the prosperity they create. Be careful what you believe in.