The Data Studio

Monitoring Data Quality

A few years ago I was working for a mobile phone company that had a large data warehouse. There were teams of people working on ETL (Extract, Transform and Load) processes to populate the Data Warehouse, and other teams that carried out reporting and analysis on it. There were a number of recurring data quality issues, and always a few new ones every month. Reports monitoring the data loads produced pages and pages of reconciliation detail. That was good, but it was time-consuming to check through all the reports every day, and easy to miss something important.

The manager of the Data Quality team was one of the most enlightened people that I ever worked for. He described what he would like to see to manage the data quality reports that we had, and the kinds of extra checks he would like to add. We talked about this scheme and I implemented it. That was 10 years ago. His design has stood the test of time. We have used it in several large organisations and I have gradually developed the original code to add support for any database that we can access with JDBC. That means all serious databases (as well as a few silly ones).

Click on the image to see how the monitor presents the state of the system to a data quality analyst.

Data Quality Monitor

Trend Charts in particular are very useful to data quality analysts, but the way they work is not immediately obvious, so we recommend that you read on; all is explained below.

We use this Data Quality Monitor mostly with Data Warehouses. We have used it with Data Lakes, but since Data Lakes dispense with cleaning, the Monitor shows an overwhelming sea of red traffic lights. The purpose of the Data Quality Monitor is to highlight only those measures that are unusual today, so that the Data Quality analysts can focus on those. In a well-managed Data Warehouse, data quality issues will still happen, because the warehouse is populated from many other systems and those systems will change, or have their own data quality issues, from time to time. But data quality issues should be exceptions. If the organisation cares about data quality then it will fix issues as they arise and gradually improve the state of its systems, enabling its technical teams to get on with new work and providing its users with high-quality, reliable data so that they can get on with theirs.

The first page in the Data Quality Monitor is a summary by subject areas. In the example system we use here, we see the various departments of an imaginary motor insurance company. (The Driving Metrics subject area gives the best examples.)

The system runs all the data quality checking queries, usually overnight, and then presents the results in a way that makes it easy to find any anomalies. The presentation is designed to be useful, not to impress Dumas (who would probably prefer a "dashboard" that looks like a convenience store window).

The factors measured are presented in groups, giving each group of users quick access to the factors that interest them. A factor can be included in more than one group where that is convenient for the users.

Within each group, individual factors are shown with a period of history. The measures are presented as a chart. From the chart you can drill down to the actual measure values over the period of history. We have found that the last 6 months is a good period to consider for most data sets, but you can set it to any number of months you like.

Each factor and each group is rated using a traffic-light system (a sketch of the group roll-up rule follows the table):

Traffic-light colour | Meaning for a factor | Meaning for a group
red | The factor being measured is outside its acceptable range and the source of this data needs to be fixed. | At least one factor in the group is red.
amber | The factor being measured is not in its normal range and the source of this data should be investigated. | The worst factor in the group is amber.
green | The factor being measured is within its acceptable range so no action is required. | Everything in the group is green.
grey | There is no data available to measure this factor. This may suggest missing data in the source or there may be something wrong with the measurement of this factor. | Nothing in the group can be measured.
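
To make the roll-up rule concrete, here is a minimal sketch in Python of how a group's colour could be derived from the colours of its factors. The function name and the handling of a group that mixes green and grey factors are our assumptions, not the product's actual code.

    # A minimal sketch of the traffic-light roll-up described in the table above.
    # The handling of a group that mixes green and grey factors is an assumption.
    def group_colour(factor_colours):
        """Derive a group's colour from the colours of its factors."""
        if not factor_colours or all(c == "grey" for c in factor_colours):
            return "grey"    # nothing in the group can be measured
        if "red" in factor_colours:
            return "red"     # at least one factor is outside its acceptable range
        if "amber" in factor_colours:
            return "amber"   # the worst measurable factor is amber
        return "green"       # everything that can be measured is green

    print(group_colour(["green", "amber", "green"]))   # -> amber
    print(group_colour(["green", "grey", "green"]))    # -> green (our assumption)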

There are two types of charts: absolute values and trends.

An absolute values chart shows the value of some measure every day. Examples are "Count of Data points from all insured cars" and "Total kilometres driven per day". In these charts, the chart title starts with "VALUES for ..." and there is a blue border around the chart.

In Trend charts, the chart title starts with "CHANGES for ..." and there is an orange border around the chart. Trend charts take a bit more explanation, but they are very useful for identifying unusual events that may suggest a failure in a particular part of the many systems we run.

We started with absolute values charts and then found a problem which we solved with trend charts. The problem is to look at a changing value and to automatically detect whether it is changing in an expected way or not. The Total kilometres driven per day chart illustrates this.

Total kilometres driven per day

We have set the amber limit to 4.5 million because, given the steady increase that we see in the chart, a sudden jump above 4.5 million would be a surprise and we would want to investigate it. It might be that some data had been loaded twice. But this limit will not last for long. If we keep selling more policies then the total number of kilometres driven each day by our policy holders will increase. If we put the projection on our chart, it shows that we will break our limit in July:

Total kilometres driven per day
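
If you want to reproduce that kind of projection yourself, a simple straight-line fit is enough. The sketch below uses invented sample values purely for illustration; only the 4.5 million limit comes from the chart above.

    # A rough sketch of projecting when a steadily growing daily total will
    # cross a fixed limit. The sample values are invented for illustration;
    # only the 4.5 million limit comes from the chart above.
    import numpy as np

    daily_km = np.array([4.10e6, 4.13e6, 4.15e6, 4.18e6, 4.20e6, 4.23e6])
    days = np.arange(len(daily_km))

    slope, intercept = np.polyfit(days, daily_km, 1)    # least-squares straight line
    limit = 4.5e6
    crossing_day = (limit - intercept) / slope          # day index where the fitted line crosses the limit
    print(f"Projected to cross the amber limit about {crossing_day - days[-1]:.0f} days from now")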

When we first came across examples like this, we did not want to be resetting limits all the time, so we considered making limits like the red lines shown in the next chart:

Total kilometres driven per day

This would be good for a measure which was expected to grow linearly at a predictable rate, but the real world is rarely as simple as this. In fact, in the very first batch we used this technique on, this is what happened:

Total kilometres driven per day

The company increased its sales! How inconsiderate! But this is what we want to happen, so we had to find a better way to detect an unexpected change. The answer we came up with was the Trend chart, and we have now used it successfully in several organisations. Here it is:

Total kilometres driven per day

and this is how it works:

Instead of plotting the actual number of kilometres driven each day, we plot the change from one day to the next. We are not concerned about whether the change is an increase or a decrease, so we show an increase of 100 as +100 and we also show a decrease of 100 as +100, because we are only interested in the size of the change, not its direction. (Stay with it, there is method in this madness.)

The limits are set dynamically, based on the values over the period covered by the chart, the last 6 months in this case. When this technique was used at a UK mobile phone company to measure the number of calls each day, Christmas Day always showed a red value, because in the UK far more people make calls on Christmas Day than on any other day of the year. Sometimes a spike can be explained by a normal business event (Christmas Day for the phone company), and sometimes it indicates a failure that we need to do something about.

We are assuming that the daily changes over the period covered form a Normal Distribution, or something reasonably close to it. We calculate the mean and the standard deviation of the changes over the period covered. A point within 2 standard deviations of the mean is green; a point between 2 and 3 standard deviations from the mean is amber; and a point more than 3 standard deviations from the mean is red. As you can see in the picture below, we do not expect to get many points in the red sections; you have to look closely to see them:

Normal Distribution
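
To make the mechanics concrete, here is a minimal sketch in Python of the calculation just described: take the size of each day-to-day change, measure how far it sits from the mean of all the changes in the window, and colour it accordingly. The names are ours, and details such as the handling of a zero standard deviation are assumptions.

    # A minimal sketch of the Trend chart calculation described above: the size
    # of each day-to-day change is coloured by how many standard deviations it
    # sits from the mean of all the changes in the window. Names are ours.
    from statistics import mean, stdev

    def trend_colours(daily_values):
        """Return (change, colour) pairs for each day-to-day change."""
        changes = [abs(b - a) for a, b in zip(daily_values, daily_values[1:])]
        mu, sigma = mean(changes), stdev(changes)
        results = []
        for change in changes:
            distance = abs(change - mu) / sigma if sigma else 0.0
            if distance > 3:
                colour = "red"      # very unusual change: worth investigating
            elif distance > 2:
                colour = "amber"    # outside the normal range: keep an eye on it
            else:
                colour = "green"    # within the expected day-to-day variation
            results.append((change, colour))
        return results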

If our data does form a Normal Distribution then we can expect about 95.4% of values to be in the green area, 4.3% to be in the amber area and 0.3% to be in the red area. 0.3% is about one day a year for a daily measure. We have found that the variations in most real data do follow a pattern pretty close to a Normal Distribution, so the exceptional values do show up in the red zone. All this tells us is that something unusual happened and it is worth checking; it may or may not be an error.
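
Those percentages follow directly from the standard Normal distribution; if you want to check them, a few lines with the standard error function will do it:

    # Quick check of the percentages quoted above, using the Normal CDF.
    from math import erf, sqrt

    def within(k):
        """Probability of falling within k standard deviations of the mean."""
        return erf(k / sqrt(2))

    green = within(2)                # about 95.4%
    amber = within(3) - within(2)    # about 4.3%
    red = 1 - within(3)              # about 0.3%
    print(f"green {green:.1%}, amber {amber:.1%}, red {red:.1%}")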

This Trend chart is reporting on exactly the same data that we looked at in the absolute values chart for Total kilometres driven per day. By looking at the changes rather than the absolute values, we overcome the need to reset the limits, because the chart sets them for us automatically. We also catch unusually low values as well as unusually high values. And, on top of that, we reveal several interesting things about the cycles of change in the data. In this case we can see that there is a weekly cycle that is quite consistent but that does get disrupted from time to time; one example happens in mid-February. We can find out more by clicking on the Data Table link on the chart.

These charts are very useful to Data Quality analysts, but they will drive Dumas mad. They measure changes, and Dumas is "not a detail man", so he won't be bothered to read these few paragraphs and the charts won't make sense to him. Just show him the issues you find, with your suggested solutions, if you want any hope of being appreciated for the good work you do.

Dumas also asks me why the charts are in pastel colours: "they aren't red, yellow and green like real traffic lights," he says. The answer is that using primary colours on charts makes them hard to read and is just ugly. Dumas only reads the pictures, so he doesn't see the problem. Also, he thinks primary colours are cool, so that's another thing we don't see eye to eye about.