In 2016, IBM attempted to estimate the cost of poor data quality to the United States economy. The number they landed on was $3.1 trillion.
3.1 trillion dollars. With a ‘t’.
There is absolutely no doubt that the implications of poor data quality are far-reaching, and extend beyond what you might consider the “data portion” of an organisation. Data quality issues do not exist in a silo – they cascade, like a waterfall. And every step along the way, somebody is paying for it.
What is data quality?
Data quality is a measurement of how suitable your data is to meet the decision-making requirements of your organisation. More specifically, the broad category of data quality can include metrics of completeness, timeliness, accuracy, consistency, and uniqueness.
- Completeness: Is your data complete? Is there any missing information?
- Timeliness: Is your data up-to-date? How often does it refresh from its source?
- Accuracy: Does your data align to reality? Is your data a genuine reflection of business processes and events? Is data being recorded correctly at the source?
- Consistency: Does your data align between your data warehouse and your data source? What about between different data warehouses?
- Uniqueness: Is there one, and only one, source of truth for every record? Is this guaranteed at the source system level, or is it being enforced through governance alone?
These five factors all contribute to the reliability of the data being used to make decisions, and thus the reliability of those decisions.
So why does this matter?
Let’s consider a single hypothetical example within a hospital:
- A healthcare analytics reporting team receives a request to produce a report on the uptake of telehealth for outpatient appointments, in order to measure the impact of better access to care for regional communities.
- A savvy data analyst conducts a thorough investigation of the available data, and finds that the delivery mode (telehealth vs. in-person) is either encoded incorrectly or missing entirely in a surprising number of cases. There are also multiple sources of truth for each patient’s clinic.
- The analytics team decide to investigate together. They find that a number of different dashboards produced by various teams are all using different mapping spreadsheets to determine the clinic for a particular service event (e.g. one appointment), using a standard list of appointment types.
- The analytics team raises the issue with their manager, who notes that program managers have been using these other production dashboards for years now in order to make decisions. She decides that these reports will have to be temporarily taken offline, in order to minimise the risk of any further decisions being made with potentially incorrect data.
- The data analytics team scrambles to align the organisation’s reporting practices. They find that other teams have, over time, independently come up with their own sources of truth and methods for assigning clinics and programs. The scope of this issue has expanded far beyond just one dashboard, and now permeates a number of reports across the whole organisation.
Recall that the above chain of events all started with a manageable amount of missing or incomplete data. This is the risk of an organisation becoming “data-driven” without also putting in place the structures needed to support being data-driven.
The truth of the matter is that data management is like a chain – you’re only as strong as your weakest link. If that link is broken repeatedly, your users and stakeholders are going to lose trust in the data that you have worked so hard to collate and analyse. The five factors of data quality (completeness, timeliness, accuracy, consistency, and uniqueness) determine the reliability of your data, which ultimately serves the end goal of the reliability of data-driven decision-making. Reliable decisions need reliable data, and stakeholders won’t want to stake their projects and reputations on shaky foundations.
We’ve explored the risks of poor data quality in an increasingly data-driven world. So what can be done about it?
Solving the data quality problem
At the risk of falling back on a slightly overused cliché, you have to walk before you can run. The first step of remediating poor data quality is actually understanding the scope of the problem, which means building the capability to monitor data quality.
Monitoring data quality starts with listing all of your assumptions about the data, and challenging whether those assumptions are true. From working in consulting, I can’t count how many times I have been assured by someone that it’s not possible for a field to have missing or duplicate values, only to investigate the database and very quickly find that assumption to be false.
It can be comforting to believe that errors can’t occur because of governance controls in your source system. However, until this assumption is tested through data quality monitoring practices, that is all it is – an assumption.
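Testing that assumption can be as simple as querying the data directly. Here is a minimal sketch in pandas, assuming a hypothetical patients table whose `patient_id` field is believed to be complete and unique:

```python
import pandas as pd

# Hypothetical extract of a source table whose "patient_id" column
# is assumed to be complete and unique.
patients = pd.DataFrame({
    "patient_id": ["P001", "P002", "P002", None],
    "name": ["Alice", "Bob", "Bob", "Carol"],
})

# Challenge the assumption instead of trusting it.
missing = patients["patient_id"].isna().sum()
duplicates = patients["patient_id"].dropna().duplicated().sum()

print(f"Missing patient IDs: {missing}")       # 1
print(f"Duplicate patient IDs: {duplicates}")  # 1
```

Two lines of code, and the “impossible” missing and duplicate values turn out to exist after all.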
Here is a handy list of some starting principles to keep in mind when building out a system for measuring data quality. Each principle includes a few examples from the healthcare industry, to demonstrate how it can be applied at a more granular and practical level, but the principles themselves can be applied to any data model in any organisation:
- Do my primary keys have duplicate values in them?
  - Do I have multiple patients with the same Patient ID?
  - Do I have multiple wards with the same Ward ID?
  - Do I know for a fact that no duplicates exist, or is this an assumption based on how I believe current systems and business processes work?
- Do my primary keys (or other important fields) have missing values in them?
  - Does every patient in the system have a Patient ID?
  - Does every patient have a name, sex, and date of birth?
  - Do I know for a fact that the eMR (electronic medical record) will force a response in certain fields, or is this an assumption based on governance?
  - What else should patients always have?
- If a field should only be populated with certain values, is this actually the case?
  - The country of birth of a patient can only be one of a certain list of distinct values – is this the case in reality?
  - Is an Australian patient’s residential postcode always a 4-digit number?
  - Is the allowed list of clinics, programs, units, etc. defined by the system, or by the user (through governance)?
  - Is a patient’s Medicare number always a 10-digit number, with a 1-digit Individual Reference Number?
- Are there any fields where values are technically possible from a data storage perspective, but would be impossible or extremely unlikely in reality?
  - Do any patients have a date of birth before the year 1900?
  - Do any patients have a date of birth in the future?
  - Do any patients have a name that is more than a certain number of characters long?
- Do the links between tables function correctly?
  - Are there records that should join but don’t?
  - If I link the patient table to the appointment table, are there any patients whose details we hold, but who have no appointments?
  - Are there any appointments that don’t link back to a valid Patient ID?
As an exercise for the reader – after reading this article, I would encourage you to spend five minutes brainstorming some examples where these principles can be applied in your own organisation:
- What inherent assumptions about the data are being made by the people using it?
- How can I either ensure that these assumptions are met, or ensure that someone is aware if these assumptions are violated?
A final note on prioritisation and remediation
I’m sure at this stage your brain is buzzing with ideas specific to data quality in your organisation – but keep in mind that none of the above matters without actually making data quality metrics visible to the people who matter, with a specific aim towards action and remediation. That call to action is important, because it spells the difference between a functional remediation solution that can reduce long-term organisational risks and costs, and a shiny toy with meaningless metrics that will inevitably be lost to the sands of time.
Organisations will typically go about solving these challenges in one of a few different ways:
- Make use of a Master Data Management tool, many of which come (at a cost) with automatic built-in data quality monitoring functionalities.
- Leverage an existing visualisation layer by building a bespoke data quality dashboard, designed with the goal of providing the user with specific records that fail data tests.
- Connect a purpose-built tool like Great Expectations, or integrate with a data transformation tool like dbt (data build tool) (or combine the two with dbt-expectations), to leverage a huge suite of pre-built, standardised data tests.
As always, it is important to start with a small end-to-end solution that generates immediate value, before fleshing it out with more complex rules and functionality. Checking whether the primary key of a table has duplicate values is fairly straightforward, and it is a scalable test that can be generalised across multiple tables. As your tests grow more complex, weigh the effort and value of implementing them at this early stage against the effort and value of getting a simple solution in front of stakeholders and seeing data remediation begin sooner.
We understand that managing data quality can be difficult to do at scale. If the advice given here isn’t enough to solve your data-related issues, rest assured that we can help. Aginic specialises in delivering high-quality data analytics solutions within an agile framework, which prioritises early results and getting a solution in front of stakeholders as quickly as possible. We don’t do sales pitches – we create solutions.