How to rescue healthcare’s data quality from Big Data’s mosh pit

Shelly Lucas

Creative Director, Redox

In conversations about healthcare data, we’ve heard it so many times before: More, more, more. If only we had more complete medical records, more datasets to train AI, more data to feed our predictive analytics…we’d achieve better outcomes.

Clinging to this “more is better” mindset, providers are diligently stockpiling information, convinced the sheer volume will someday translate to value-based gold.

But here’s the blunt truth: If the data they’re collecting is faulty, inaccessible, or otherwise unusable, it’s worthless, no matter how much of it providers have.

Healthcare executives may dream of having steady streams of consistent, accurate information flowing throughout their organization. But the view from the trenches is quite different. Disparate systems, inconsistent formats, and questionable accuracy (and let’s not forget healthcare’s data explosion). These are the makings of a Big Data mosh pit.

The culprits behind this data quality breakdown are a familiar bunch:

  • People – inconsistent behavior when creating, using, and/or sharing data
  • Processes – outdated methods for data collection, storage, management, and governance
  • Technologies – a cacophony of unconnected systems with clashing data standards and table/field definitions

The real kicker? These elements aren’t isolated problems. They’re interconnected in a tightly woven web. And when they intersect in specific ways, they can seriously undermine data quality.

As ominous as this sounds, there is hope. Healthcare data quality doesn’t have to be a chaotic mess. With a better understanding of the interdependent elements that shape data quality, you can identify the weak points where data integrity crumbles. These areas, in turn, can become the foundation for smarter data governance policies, targeted data literacy training, and strategic technology upgrades. 

This article is dedicated to illuminating the dark corners of healthcare data quality. We’ll dissect the different facets of this complex beast, assess its current state, and explore the ways—even with the best intentions—data quality can go surprisingly wrong. Finally, we’ll identify a few steps you can incorporate in your pursuit of better data.

The costs of poor data quality in healthcare

Poor data quality isn’t just a headache for data analysts. It translates into real financial and clinical consequences for providers and patients alike. Let’s take a closer look at bad data’s impact on healthcare’s bottom line.

The U.S. Attorney Office estimates bad data costs the U.S. healthcare industry more than $300B per year. According to various studies, it can absorb anywhere from 15 to 20% of a business’s revenue.

Data quality woes are silently draining provider wallets. Here are a few ways bad data derails healthcare finances:

  • Administrative overhead – software and labor costs for data cleansing and correction plus increased operational costs for staff inefficiencies
  • Clinical costs – treatment delays, repeated procedures, diagnostic errors, and poor patient outcomes
  • Legal and compliance fees – regulatory fines and legal costs from noncompliance and medical errors based on poor data quality
  • Opportunity costs – lost revenue from incorrect billing, insurance claim denials, and long-term reputational damage

This is just a snapshot of the damage poor data can do to a provider’s bankroll.

Providers are relying on digital technologies—e.g., data analytics, AI/machine learning, and automation—to help them recover from skyrocketing labor costs, rising inflation, and payer underpayments.1 However, if these digital solutions are using or creating bad data, they may do more than fail to relieve providers’ financial pressure; they may actually increase it.

While the financial impact of poor data quality is undeniable, the good news is there are ways to improve it. But before providers can boost their data quality, they need to unpack the core dimensions of their data. This way, they can ensure they’re not chasing shiny metrics that may mask underlying quality issues.

The dimensions of healthcare data quality

  • Accuracy – Is the information correct?
  • Accessibility – Is the data available when needed?
  • Completeness – How comprehensive is the data?
  • Consistency – Are all instances of the data uniform and accurate across various systems?
  • Timeliness – How up-to-date is the information?
  • Relevance – How useful is the data for its intended purpose?
  • Integrity – How accurate are the data relationships (e.g., parent and child linkage)?
  • Unbiasedness – Does the data accurately represent the patient population?
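A few of these dimensions can be spot-checked with very little code. The sketch below is a minimal illustration, not a standard method: the record fields (`patient_id`, `dob`, `updated`), the sample values, and the 30-day freshness window are all hypothetical assumptions chosen for the example.

```python
from datetime import date, timedelta

# Hypothetical patient records; field names and values are illustrative only.
records = [
    {"patient_id": "A1", "dob": "1980-02-14", "updated": date(2024, 5, 1)},
    {"patient_id": "A2", "dob": None, "updated": date(2024, 3, 2)},
    {"patient_id": "A3", "dob": "1975-11-02", "updated": date(2023, 1, 15)},
    {"patient_id": "A4", "dob": "1990-07-30", "updated": date(2024, 5, 10)},
]


def completeness(rows, field):
    """Share of rows in which the given field is populated."""
    if not rows:
        return 0.0
    filled = sum(1 for r in rows if r.get(field) not in (None, ""))
    return filled / len(rows)


def timeliness(rows, as_of, max_age_days=30):
    """Share of rows updated within max_age_days of the as_of date."""
    if not rows:
        return 0.0
    cutoff = as_of - timedelta(days=max_age_days)
    fresh = sum(1 for r in rows if r["updated"] >= cutoff)
    return fresh / len(rows)


dob_completeness = completeness(records, "dob")
freshness = timeliness(records, as_of=date(2024, 5, 17))

print(f"dob completeness: {dob_completeness:.0%}")
print(f"updated in the last 30 days: {freshness:.0%}")
```

Scores like these only cover the dimensions that can be counted. Accuracy, relevance, and unbiasedness usually need a trusted reference source or human review.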

Healthcare data quality dimensions


Data quality through the lens of provider organizations

Hospital and health system executives express confidence in the initial quality of their data. According to HIMSS research, they believe 69% of their data is accurate and reliable.4 But as data moves through various systems, its quality declines. As indicated by the drop in the amount of data hospitals believe is complete and timely, quality seems to erode downstream.

Graph showing estimate of percent of organizations data is...

HIMSS also finds that the larger the organization, the less complete its data is. For providers with more than 7,500 employees, the amount of complete data drops to only 21%.5 Getting timely health data is also a struggle, with only 28% of providers saying their integrated clinical data is refreshed in real time.6

Providers also admit that much of their data is not fully used. In fact, they say 47% of healthcare data is underutilized for clinical and business decision-making.7 This may be due in part to unstructured data, which makes up about 80% of clinical data.8 Because it is difficult to process and standardize, unstructured data remains largely untapped by providers. Another reason for underused clinical data: The information is siloed, difficult to access, or just too difficult to find.

Health data that’s buried in the EHR may not only go unused; its invisibility may even lead users to create bad data by entering incorrect (but easily findable) information. During a webinar discussion, a clinician described this very scenario, drawing on her own experience.

One of my least favorite things is to find the right ICD-10 [diagnostic] code. [The EHR] isn’t helping me by suggesting the code. Instead, I’m putting things into a search bar and the first hypertension options that come up are pregnancy-related, [but] I don’t see pregnant patients. There are only so many seconds I can look and then I’m going to pick the closest answer and go on.9

Although clinicians desperately need good data, the constant scramble for the quickest answer can inadvertently perpetuate the problem of poor data quality.

There’s another factor at play here. Clinicians are not incentivized to improve data quality. Instead, under the fee-for-service model, they are paid for the number of patients they see or the number of services they perform. Based on this incentive structure, it’s unrealistic to expect them to spend much time correcting or structuring data. Unfortunately, this leads to a perception among many providers that data quality is a cost center.10

Why data quality progress is slow

Unless dirty data creates a safety issue or increases mandatory reporting costs, providers are less likely to invest in data quality than in areas that bring in income.

Interoperability as data quality’s champion

While improving data quality is a significant hurdle for healthcare providers, interoperability offers a potential lifeline.

Interoperability involves the integration of both systems and data. While system interoperability provides the framework and infrastructure for diverse systems to communicate, data interoperability establishes data standards and formats to make the shared data meaningful and useful. Working together, these two flavors of interoperability standardize data so it can be exchanged and used across different healthcare settings. 

Interoperability paves the way to better data quality in several ways.

  • Data completeness: Interoperability equips providers to access a more comprehensive view of a patient’s health information, including data from different sources and care settings.
  • Data consistency: By facilitating data mapping, transformation (using HL7® and HL7 FHIR®), and exchange, interoperability can help reconcile data discrepancies between different systems or sources.
  • Data timeliness: When backed by the appropriate processing power and managed scaling, interoperability can facilitate real-time data exchange.
  • Data integrity: Interoperability frameworks often include error-checking and validation mechanisms to help ensure data accuracy and reliability.
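As a rough illustration of the mapping and validation steps described above, the sketch below renames fields from two systems into one shared shape and flags records that fail a basic check. Everything in it is a hypothetical assumption: the source field names, the shared field names, the gender codes, and the validation rules. Real exchanges would rely on HL7 or FHIR tooling rather than hand-written dictionaries.

```python
# Hypothetical field mappings from two source systems to one shared schema.
FIELD_MAPS = {
    "system_a": {"pt_id": "patient_id", "sex": "gender", "birth_dt": "birth_date"},
    "system_b": {"PatientID": "patient_id", "Gender": "gender", "DOB": "birth_date"},
}

# Hypothetical normalization table for gender values.
GENDER_CODES = {"m": "male", "male": "male", "f": "female", "female": "female"}


def normalize(record, source):
    """Rename source fields to the shared schema and standardize gender values."""
    mapping = FIELD_MAPS[source]
    out = {mapping[key]: value for key, value in record.items() if key in mapping}
    raw_gender = str(out.get("gender", "")).strip().lower()
    out["gender"] = GENDER_CODES.get(raw_gender, "unknown")
    return out


def validate(record):
    """Return a list of problems; an empty list means the record passed."""
    problems = []
    for field in ("patient_id", "gender", "birth_date"):
        if not record.get(field):
            problems.append(f"missing {field}")
    if record.get("gender") == "unknown":
        problems.append("unrecognized gender code")
    return problems


a = normalize({"pt_id": "123", "sex": "F", "birth_dt": "1980-02-14"}, "system_a")
b = normalize({"PatientID": "123", "Gender": "female", "DOB": ""}, "system_b")

print(a, validate(a))
print(b, validate(b))
```

Once both records share one shape, the empty birth date in the second system shows up as a validation problem instead of passing silently downstream.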

For all its benefits, achieving healthcare interoperability at scale is a Herculean task for in-house IT teams, especially when 55% of them lack the necessary resources and 33% lack the expertise.13 As a healthcare Chief Informatics Officer once explained to me, “Healthcare interoperability is like trying to herd cats with telekinesis—possible, but not exactly user-friendly.”

Yet, as providers move into healthcare’s interconnected, digital future, interoperability is a must-have. Multiple trends are pressing providers to master interoperability.

  • A regulatory push: The 21st Century Cures Act mandates that providers and vendors improve data-sharing capabilities, and ONC Interoperability Rules require providers to use certified EHR technology that supports interoperability standards. (Many providers switched EHR vendors in 2023 because their existing EHR could not exchange information with other EHRs.14)
  • Market demand: Patients expect seamless, anywhere access to their health information across different providers and platforms, and value-based care models require data sharing and analysis across the care continuum.
  • Technological advancements: The adoption of APIs and the HL7® FHIR® standard is simplifying data exchange between diverse sources, and cloud-based solutions are making secure data-sharing easier.

With these trends pushing interoperability to the forefront, providers are facing an even bigger data explosion. Traditional storage and processing solutions are starting to look like filing cabinets in a hurricane. The cloud, on the other hand, offers a more resilient option.

The cloud: Bigger data may not be better

The cloud is probably providers’ best option for storing and processing clinical data, real-time and otherwise. But the cloud’s larger capacity—and the shift to an extract, load, and transform (ELT) data integration process—may not be doing data quality any favors, say some engineers.

Storage has become almost free, but designing schemas and cleaning up data hasn’t.

Fortunately, healthcare is recognizing a powerful ally in this quest: AI.

AI to the rescue?

Healthcare is cautiously bullish on AI’s ability to enhance data quality. Currently, 43% of providers use GenAI for data cleaning.19 Providers that are using AI further upstream are seeing positive results: 45% say that integrating AI with their clinical workflows has improved data quality and accessibility.20

Clinicians, however, are quick to point out that data quality-enhancing technology still has a long way to go before it earns their trust. As a clinician emphasized during a webinar on healthcare data quality:

We can do so much better with technology, but for some reason, we’re not applying it in a way that makes sense…. How can I trust in a piece of data? There’s no information on the provenance of it. Why doesn’t the computer validate whether or not a piece of information is accurate or current? Why doesn’t the system organize the information in a group of related things so I can look at and evaluate [them]?21

To be sure, legacy systems throw a ferocious curveball at AI. They store health data in wildly different formats, with no standard labels for tables and fields, even for data coming from the same EHR. To make matters worse, there’s practically zero documentation to explain what anything means.22 The result? Ambiguous data that AI struggles to interpret and transform correctly.

Legacy data is a wild pitch that can throw interoperability and reusability completely off course. Most likely, providers won’t be able to tackle legacy data on their own. When we look at providers’ current funding for overall data modernization, the numbers are alarming. Only 1% of providers have the internal resources needed; the vast majority (79%) say they will need a “moderate” or “large” amount of external help.23

Powering data quality with people

Securing quality with data governance

Getting staff to become more data savvy won’t always ensure they do the right thing with data. This is where data governance comes in. Data governance programs contain principles, practices, and tools to ensure providers get the most out of their data. These internal standards specify the actions people must take, the processes they must follow, and the technology that supports them throughout the data life cycle.27

Ultimately, an effective data governance strategy assures data is reliable and consistent, empowering providers to make data-driven decisions with confidence. It also streamlines data access while fortifying your security posture and establishing regulatory compliance.28

“Good [AI] models don’t overcome bad data and fragmented governance.” – Mike Sanky, Global Industry Lead, Healthcare & Life Sciences at Databricks29

Providers’ use of AI data has added complexity to traditional data governance strategies30 on multiple fronts.

  • Data quantity: To train effectively, AI requires large volumes of data—all of which must be stored, processed, and governed.
  • Data diversity: AI models may use data from a variety of sources and formats (e.g., wearables, genomics, images, and other unstructured data sources).
  • Bias: Healthcare data bias can be amplified by AI, leading to discriminatory results.
  • Explainability: Some AI models, particularly those based on deep learning, can be difficult to understand and therefore to govern.
  • Evolving regulation: New guidelines are constantly being rolled out to address AI-specific data use.
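One simple way to start probing the bias concern is to compare how groups are represented in a training set against the population a provider actually serves. The sketch below is only an illustration under stated assumptions: the age bands, the population shares, and the ten-record sample are hypothetical, and a gap in representation is a prompt for review, not proof of a biased model.

```python
from collections import Counter


def representation_gaps(sample_groups, population_shares):
    """Return sample share minus population share for each group."""
    if not sample_groups:
        raise ValueError("sample_groups must not be empty")
    counts = Counter(sample_groups)
    total = len(sample_groups)
    return {
        group: counts.get(group, 0) / total - share
        for group, share in population_shares.items()
    }


# Hypothetical age bands for ten training records.
training_groups = ["18-39"] * 6 + ["40-64"] * 3 + ["65+"] * 1

# Hypothetical shares of each age band in the served population.
population = {"18-39": 0.30, "40-64": 0.40, "65+": 0.30}

gaps = representation_gaps(training_groups, population)
for group, gap in gaps.items():
    # Positive values mean the group is over-represented in the sample.
    print(f"{group}: {gap:+.0%}")
```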

AI models are only as good as the data that feeds them. If providers are to realize even a fraction of AI’s promise, they will need to invest heavily in data quality and develop clear AI governance policies.


Healthcare data holds immense power. But to fully tap that power, you don’t just need more data. You need better data.

Achieving better data quality is a delicate dance. It requires navigating an organic web of people, processes, and technologies and pinpointing the junctions where data quality is most likely to break down. We’ve already explored a few examples:

  • An EHR with interoperable capabilities may have a clunky UX that doesn’t mesh with clinical workflows, causing even data-quality-conscientious clinicians to pick incorrect diagnosis codes because they can’t find the correct ones quickly enough.
  • AI technologies can be used to cleanse and standardize data, but legacy healthcare data doesn’t have enough context for AI to learn from.
  • Data governance programs can define policies and practices to promote data quality, but staff may fail to follow these due to low data literacy.
  • You can hire data engineers, but the sheer volume and complexity of healthcare data—along with the stubborn proliferation of data silos—can overwhelm their efforts.

To succeed, you’ll need more than data expertise, data quality tools, and a good governance policy. You’ll need a multi-pronged, flexible approach based on a solid understanding of the intricate interplay between people, processes, and technologies that impact healthcare data quality.

This is a tricky tango, but it’s far better than getting stuck in a Big Data mosh pit.

While all provider organizations are different, you can move closer to better data quality by following four essential guidelines:

  • Empower your people – Invest in data literacy and data engineering expertise. Equip and encourage your staff to become data champions, minimizing errors and maximizing data use.
  • Establish strong processes – Implement data governance frameworks that prioritize consistency, security, and compliance across the entire data lifecycle.
  • Embrace the right technology – Leverage modern data analytics platforms and explore AI tools, but remember the old adage “garbage in, garbage out.” These solutions fall flat without high-quality data to power them.
  • Demolish barriers to data excellence – Avoid integration shortcuts (e.g., insufficient testing and flimsy process documentation). Use consistent data standards, close data governance gaps, and ensure robust security to prevent data tampering.

These are just a few areas for further exploration as you chart your own data quality progress. Hopefully, along the way, you’ll be able to leave the Big Data mosh pit behind and push forward on a new path—one that leads to better data, used well.


1. “Hospitals Face Financial Pressures as Costs of Caring Continue to Surge.” American Hospital Association, 10 May 2024.

2. “Data Quality Assessment.” The Office of the National Coordinator for Health Information Technology, Accessed 17 May 2024.

3. Larsen, Taylor. “How to Run Analytics for More Actionable, Timely Insights: A Healthcare Data Quality Framework.” Health Catalyst, 5 Nov. 2020.

4. “The Current State of Healthcare Analytics Platforms.” 2024 HIMSS Market Insights Survey sponsored by Arcadia Solutions, LLC.

5. Ibid.

6. Ibid.

7. Ibid.

8. Eastwood, Brian. “How to Navigate Structured and Unstructured Data as a Healthcare Organization.” HealthTech, 8 May 2023.

9. “How to Improve Poor Data Quality Across the Healthcare Ecosystem and Make Workflows More Manageable.” IMO, 12 Sept. 2023. Webinar.

10. Meraj, Sam. Comment on “I came across an area where there’s whitespace for AI in healthcare.” Bobby Guerlich, 23 April 2023.

11. “Healthcare State of Data Report 2024.” Hakkoda, 16 Feb. 2024.

12. Ibid.

13. Ojo, Elizabeth. “DIY or Outsource: A Cost Comparison for Providers Looking to Scale Digital Health Integrations.” Redox, June 2024.

14. “Medscape Physicians and Electronic Health Records Report 2023.” Medscape, 29 Nov. 2023.

15. @NicholasDorier. “Data integrity and data quality has gotten way worse over the past 10 years.” 22 Mar. 2024.

16. “The Current State of Healthcare Analytics Platforms,” p. 25.

17. “The New Healthcare C-Suite Agenda: 2024-2025.” Sage Growth Partners, 23 Jan. 2024.

18. Kennedy, Shania. “Healthcare Orgs Value Data Analytics for Improved Care Quality.” Health IT Analytics, 07 May 2024.

19. “Healthcare State of Data Report 2024,” Section 04.

20. “The New Healthcare C-Suite Agenda: 2024-2025,” p. 13.

21. “How to Improve Poor Data Quality Across the Healthcare Ecosystem and Make Workflows More Manageable.” IMO, 12 Sept. 2023. Webinar.

22. Currie, Michelle. Comment on “I came across an area where there’s whitespace for AI in healthcare.” Bobby Guerlich, 23 April 2023.

23. “Healthcare State of Data Report 2024,” Section 06.

24. Ibid., Section 04.

25. “The Current State of Healthcare Analytics Platforms,” p. 4.

26. “What Data Engineer Roles are Most in Demand and Where are They Located?” FinTech Technology, 12 Nov. 2023.

27. “What is Data Governance?” Google Cloud. Accessed 24 May 2024.

28. “Data Governance.” Databricks. Accessed 24 May 2024.

29. Wolf, Nichole. “AI Governance: A Provider’s Guide.” Redox, 2024.

30. Ibid.