When people talk about oceans of data, more often than not they could well be describing a typical healthcare organization. This will make more sense once I dive into the kind of data a healthcare organization generates and operates on daily.
In Healthcare IT, I started out as an Automation QA expert on a team building a Patient Administration System (PAS) for a hospital. While the actual flow of the system seemed relatively straightforward, making sure the data architecture was robust & secure enough to handle the end-to-end flow was quite a challenge.
The demands of a system used in a Healthcare Organization
A Patient Administration System is designed to manage & streamline patient care & data to improve patient safety and efficiency. It typically comprises -
- Patient Registration
- Appointment Scheduling
- Medical Records
- Billing & Insurance
In a patient care facility, each of these modules is handled and managed by a different application, possibly running on an entirely different backend system. In a traditional sense, these systems might store data that does not correlate with each other even though the information is related.
The challenge here is to validate not only the data exchange itself but also the accuracy and quality of the data shared between the systems.
As a member of the team testing the system, I learned that knowing how to test with relevant data is critical to the accuracy of the system, especially in healthcare.
Data Exchange within the healthcare IT ecosystem needs to be studied and managed delicately.
The Open Standards of Data Exchange in Healthcare IT
In the software development world, two systems talk to each other with the help of APIs. Each system stores its own data in its own database, and when one system needs information from another, they communicate based on pre-agreed contracts/policies.
The exchange of data in healthcare IT works similarly. The systems interacting with each other in the same facility or different facilities agree upon a pre-defined format to exchange data.
The onus is on the respective systems to figure out a way to read the data and ingest it into their respective databases.
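To make that concrete, here is a minimal sketch of what contract-based ingestion might look like on the receiving side. Everything in it, the "patient admit" contract, the field names, and the payload, is a hypothetical example rather than the actual PAS contract.

```python
# Minimal sketch of contract-based ingestion between two systems.
# The "patient admit" contract and all field names are hypothetical examples.

from datetime import datetime

# Pre-agreed contract: field name -> expected type.
PATIENT_ADMIT_CONTRACT = {
    "patient_id": str,
    "admit_date": str,   # ISO-8601 date string agreed upon by both systems
    "ward_code": str,
}

def validate_payload(payload: dict) -> list[str]:
    """Return a list of contract violations; an empty list means safe to ingest."""
    errors = []
    for field, expected_type in PATIENT_ADMIT_CONTRACT.items():
        if field not in payload:
            errors.append(f"missing field: {field}")
        elif not isinstance(payload[field], expected_type):
            errors.append(f"wrong type for field: {field}")
    # Make sure the date actually parses before it ever reaches the database.
    try:
        datetime.fromisoformat(str(payload.get("admit_date", "")))
    except ValueError:
        errors.append("admit_date is not a valid ISO-8601 date")
    return errors

if __name__ == "__main__":
    incoming = {"patient_id": "P-1001", "admit_date": "2023-04-12", "ward_code": "ICU"}
    print(validate_payload(incoming))   # [] -> safe to ingest
```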
To be accurate and robust, the PAS needs to be able to read, parse, and ingest large amounts of data from multiple systems and readily display it as needed in near real-time.
We were on a quest not only for a better user experience but also to create a continuous flow of accurate data to test the system thoroughly. It meant working with data across multiple systems.
In healthcare, this exchange of data between systems is called interoperability. Organizations make use of Health Information Exchanges that adhere to strict interoperability standards for data sharing.
One such standard is ANSI X12, used for Electronic Data Interchange (EDI). Another widely adopted standard is HL7's Fast Healthcare Interoperability Resources (FHIR), an open specification for exchanging healthcare data.
Here's how a sample X12 file looks -
CLM*26463774*100***11:B:1*Y*A*Y*I~
REF*D9*17312345600006351~
HI*ABK:0340*ABF:V7389~
LX*1~
SV1*HC:99213*40*UN*1***1~
DTP*472*D8*20061003~
Without going into the specifics of the message, it represents a healthcare claim submitted by a provider for a particular procedure. There are more detailed examples at the link below -
https://datainsight.health/edi/claims/professional-837p/anesthesia/
The complexity here lies not in the format itself but in parsing the transaction segment by segment while validating its accuracy against the respective source systems.
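As a rough illustration, here is a minimal Python sketch that splits the sample 837P snippet above into segments and elements, then cross-checks a few values against a source-system record. The parsing rules are deliberately simplified and the source record is hypothetical; a real X12 parser also has to handle delimiters declared in the ISA envelope, loops, and many more segment types.

```python
# Minimal sketch: parse the sample 837P snippet into segments and cross-check
# a couple of values against a (hypothetical) source-system record.

SAMPLE = (
    "CLM*26463774*100***11:B:1*Y*A*Y*I~"
    "REF*D9*17312345600006351~"
    "HI*ABK:0340*ABF:V7389~"
    "LX*1~"
    "SV1*HC:99213*40*UN*1***1~"
    "DTP*472*D8*20061003~"
)

def parse_x12(raw: str) -> list[list[str]]:
    """Split an X12 transaction into segments ('~') and elements ('*')."""
    return [seg.split("*") for seg in raw.strip("~").split("~")]

def claim_summary(segments: list[list[str]]) -> dict:
    """Pick out the claim id, charge amount, and service date from the parsed segments."""
    summary = {}
    for seg in segments:
        if seg[0] == "CLM":
            summary["claim_id"] = seg[1]
            summary["total_charge"] = float(seg[2])
        elif seg[0] == "DTP" and seg[1] == "472":
            summary["service_date"] = seg[3]
    return summary

if __name__ == "__main__":
    summary = claim_summary(parse_x12(SAMPLE))
    # Hypothetical record pulled from the source (billing) system for the same claim.
    source_record = {"claim_id": "26463774", "total_charge": 100.0, "service_date": "20061003"}
    mismatches = {k: (summary.get(k), v) for k, v in source_record.items() if summary.get(k) != v}
    print(summary)
    print("mismatches:", mismatches)   # expected: {}
```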
Managing the Critical Issue of Quality of Data & Systems in Healthcare
The quality of a system under development can be maintained by efficient automated tests executed in a CI/CD pipeline to continuously check for the system’s health.
How do we manage the quality of data coming from multiple systems?
The problem of data silos in a healthcare organization is as old as the profession itself. The data should not only be captured accurately by the source systems, but its validity & integrity should also be maintained across the multiple systems that store and reference it. In general terms, this is a Data Integration problem.
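A hedged sketch of what one such cross-system check might look like as an automated test is below. The system names, extract stubs, and fields are illustrative stand-ins for real database or API calls, not our actual implementation:

```python
# Illustrative pytest-style reconciliation check between two systems that both
# hold patient data. The extract functions are stubs standing in for real
# database or API calls; all names and fields are hypothetical.

def extract_from_registration() -> dict[str, dict]:
    """Stub for pulling patient demographics from the registration system."""
    return {"P-1001": {"dob": "1984-02-29", "insurance_id": "INS-778"}}

def extract_from_billing() -> dict[str, dict]:
    """Stub for pulling the same patients as referenced by the billing system."""
    return {"P-1001": {"dob": "1984-02-29", "insurance_id": "INS-778"}}

def test_patient_records_match_across_systems():
    registration = extract_from_registration()
    billing = extract_from_billing()

    # Every patient billed must exist in registration (referential integrity).
    missing = set(billing) - set(registration)
    assert not missing, f"billed patients missing from registration: {missing}"

    # Shared fields must agree between the two systems (data integrity).
    for patient_id, billed in billing.items():
        assert billed == registration[patient_id], f"mismatch for {patient_id}"
```

Wired into a CI/CD pipeline, a suite of checks like this runs on every build instead of as a one-off audit, which is how the data quality question starts to look a lot like the system quality question.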
In our early days, we used Pentaho Data Integration for seamless data extraction, transformation, and loading into the destination systems for analysis & reporting. It provided a simple drag-and-drop codeless interface to build ETL workflows and allowed the use of complex PL/SQL for data cleaning. Integrated with a cron job, it made it easy to create workflows that could be scheduled to run periodically.
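Pentaho jobs are built visually, but the underlying pattern is plain extract, transform, load. Here is a rough Python equivalent of that kind of periodic job; SQLite and the admissions table stand in for the real source and destination databases, so treat it as a sketch of the pattern rather than the actual workflow:

```python
# Rough Python equivalent of a scheduled ETL job: extract from a source table,
# clean/transform in memory, load into a reporting table. SQLite and the
# "admissions" table are stand-ins used only for illustration.

import sqlite3

def run_etl(source_db: str, dest_db: str) -> int:
    # Extract
    src = sqlite3.connect(source_db)
    rows = src.execute("SELECT patient_id, admit_date, ward_code FROM admissions").fetchall()
    src.close()

    # Transform: drop rows with missing keys, normalise ward codes.
    cleaned = [
        (pid, admit_date, (ward or "").strip().upper())
        for pid, admit_date, ward in rows
        if pid and admit_date
    ]

    # Load
    dst = sqlite3.connect(dest_db)
    dst.execute(
        "CREATE TABLE IF NOT EXISTS admissions_report "
        "(patient_id TEXT, admit_date TEXT, ward_code TEXT)"
    )
    dst.executemany("INSERT INTO admissions_report VALUES (?, ?, ?)", cleaned)
    dst.commit()
    dst.close()
    return len(cleaned)

if __name__ == "__main__":
    # Set up a throwaway source database so the sketch runs end to end.
    with sqlite3.connect("source.db") as src:
        src.execute(
            "CREATE TABLE IF NOT EXISTS admissions "
            "(patient_id TEXT, admit_date TEXT, ward_code TEXT)"
        )
        src.execute("INSERT INTO admissions VALUES ('P-1001', '2023-04-12', ' icu ')")
    print(run_etl("source.db", "reporting.db"), "rows loaded")
    # Scheduled from cron in the same spirit as the Pentaho job, e.g.
    # 0 2 * * * python etl_job.py
```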
Over time, as the oceans of data kept growing, ETL gave way to ELT, where data is Extracted and Loaded first and then Transformed per use case. The raw data lands in data lakes as-is, making it easier to transform and query on demand.
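For contrast, here is the same hypothetical example reworked in the ELT style: the raw extract lands untouched, and each use case is carved out with SQL inside the warehouse afterwards (again, SQLite and the table/view names are assumptions for illustration):

```python
# ELT counterpart of the job above: land the raw extract untouched first,
# then transform with SQL inside the warehouse, one view per use case.
# SQLite stands in for the warehouse; table/view names are hypothetical.

import sqlite3

def run_elt(warehouse_db: str, raw_rows: list[tuple]) -> list[tuple]:
    wh = sqlite3.connect(warehouse_db)
    # Load: raw data lands as-is, with no cleaning on the way in.
    wh.execute(
        "CREATE TABLE IF NOT EXISTS raw_admissions "
        "(patient_id TEXT, admit_date TEXT, ward_code TEXT)"
    )
    wh.executemany("INSERT INTO raw_admissions VALUES (?, ?, ?)", raw_rows)

    # Transform: each use case gets its own view over the raw table.
    wh.execute(
        "CREATE VIEW IF NOT EXISTS icu_admissions AS "
        "SELECT patient_id, admit_date FROM raw_admissions "
        "WHERE UPPER(TRIM(ward_code)) = 'ICU'"
    )
    result = wh.execute("SELECT * FROM icu_admissions").fetchall()
    wh.close()
    return result

if __name__ == "__main__":
    print(run_elt(":memory:", [("P-1001", "2023-04-12", " icu ")]))
```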
As our needs evolved, we decided to move away from the Pentaho-based solutions and began to explore the more popular cloud-based offerings.
To be cost-effective, we kept two other options in mind as well.
To be honest, dbt is a brilliant solution for managing all the data pipelines programmatically.
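As a small illustration of what "programmatically" means here, dbt-core (1.5+) can be invoked straight from Python rather than through the CLI. The model selector below is a hypothetical name, and the snippet assumes it runs inside an already configured dbt project:

```python
# Minimal sketch of driving dbt from Python using dbt-core's (1.5+)
# programmatic invocation API. "stg_claims" is a hypothetical model name;
# a configured dbt project (profiles, models) is assumed to exist.

from dbt.cli.main import dbtRunner, dbtRunnerResult

def run_models(selector: str) -> bool:
    runner = dbtRunner()
    result: dbtRunnerResult = runner.invoke(["run", "--select", selector])
    return result.success

if __name__ == "__main__":
    # In practice this would be called from an orchestrator (Airflow, cron, CI).
    print(run_models("stg_claims"))
```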
No matter the tool, as soon as the data pipeline kicks off and data becomes readily available with the required accuracy from source to destination, the quality of the system improves drastically and can be guaranteed with every release.
Making my Own Data Pie & Eating it too
The satisfaction in resolving the data quality conundrum is not in the result, but in the insight that suddenly becomes so obvious that it is hard to ignore.
Imagine working with healthcare data related to the recent coronavirus pandemic. As tragic as the event was for humankind, it handed the data engineering world mountains of data and a set of unique problems for healthcare IT systems to deal with.
Data-related problems like -
- Standardizing the code for COVID-related cases so that all the healthcare organizations in the world can recognize COVID cases and distinguish them from other forms of illnesses.
- Application-related problems that pressured the entire healthcare IT industry to figure out application workflows & identifiers to quickly deal with the rapid patient admissions related to COVID-19, not to mention the performance issues.
The rate at which the data was evolving and the pace at which the insights were needed for the world to make sense of the mass hysteria around COVID-19 proved to be the biggest challenge.
An efficient data engineering solution to such problems is the ability to update existing data pipelines with the standardized code once, in the source system, and have the rest of the systems in the pipeline pick up the update as the data moves downstream.
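One way to picture this, as a simplified sketch rather than any specific organization's pipeline, is a single code-mapping table maintained at the source that every downstream transformation looks up, so adding the standardized COVID-19 code (ICD-10 U07.1) once is enough for the whole pipeline:

```python
# Simplified sketch: the standardized code mapping lives in one place at the
# source; downstream transforms look codes up instead of hard-coding them,
# so a new code (e.g. ICD-10 U07.1 for COVID-19) propagates automatically.
# All patient records here are fabricated for illustration.

# Maintained once, in the source system.
CODE_MAP = {
    "U07.1": "COVID-19, virus identified",
    "J12.9": "Viral pneumonia, unspecified",
}

def label_diagnoses(records: list[dict]) -> list[dict]:
    """Downstream transform: attach the standardized label to each encounter."""
    return [
        {**rec, "diagnosis_label": CODE_MAP.get(rec["icd10"], "UNMAPPED")}
        for rec in records
    ]

if __name__ == "__main__":
    encounters = [
        {"patient_id": "P-1001", "icd10": "U07.1"},
        {"patient_id": "P-1002", "icd10": "J12.9"},
    ]
    print(label_diagnoses(encounters))
```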
These kinds of solutions allow healthcare organizations to extract quick insights from their version of the data and give the world a better perspective. A quick example is the COVID death statistics for every country and the impact on populations with pre-existing conditions, which allowed governments to take appropriate measures and provide better-quality information about patient care.
Every organization needs such elegant data engineering solutions to tackle the problems of data silos, fragmented & inaccurate data, duplicated effort in repetitive patchwork to fix similar issues, and the biggest of them all, data unavailability.
Conclusion
Any organization that deals with large amounts of data needs to tackle the problems related to data to maintain its pace in the market.
Data Engineering teams set out to tackle such problems and preemptively resolve any issues arising from new data that is being created at an astounding rate.
Healthcare organizations rely on the high accuracy and high availability of their systems to provide critical care because it is literally a matter of life & death. The challenges and the nature of problems grow as the data grows in the real world. These new problems require innovative and faster solutions without compromising quality.
In a world consumed with AI and all the jobs it might replace, one key takeaway here is that any form of Artificial Intelligence is only as smart and accurate as the data it is trained on: skewed data will yield skewed results.