Data lakes are centralized repositories that can store all structured and unstructured data at any desired scale. Data can be stored as-is, without first structuring it, and different types of analytics can be run on it, from dashboards and visualizations to big data processing, real-time analytics, and machine learning, to improve decision making. Much of the power of the data lake lies in the fact that it is often a cost-effective way to store data.
Deploying Data Lakes in the cloud
Data Lakes are an ideal workload to be deployed in the cloud, because the cloud provides performance, scalability, reliability, availability, a diverse set of analytic engines, and massive economies of scale. The main reasons why customers perceive the cloud as an advantage for Data Lakes are better security, faster deployment time, better availability, frequent feature and functionality updates, more elasticity, better geographic coverage, and costs linked to actual utilization.
Moving a data lake to the cloud has a number of significant benefits, including cost-effectiveness and agility. However, to see these benefits, it’s important to understand how to structure the data lake architecture in the cloud, which differs slightly from a traditional on-premises architecture. Also, a move to a cloud-based data lake or a multi-cloud environment cannot happen in one go; it is a journey made in stages over time.
Best practices to build a Data Lake
- To make the data lake effective, start with a specific business problem in mind, stay focused on it, and deliver results that satisfy senior leadership.
- Have a plan ready for either hiring the people needed or giving existing staff comprehensive training.
- Avoid the misconception that a data lake is just a cheaper way of running a database.
- When running Hadoop in the cloud, design around object storage. Object storage in the cloud adds complexity, but it is more flexible and cost-effective and gives better performance.
- A data lake is not just about data storage but also about data management. Data should be actively and securely managed.
- Think carefully about what the team will build and what it will buy, because a vendor solution that meets every need is very rarely available. Everything bought has a cost, and everything built has a time cost and an efficiency cost associated with it.
- Load data into staging, perform data quality checks, clean and enrich it, steward it, and run reports on it, completing the full management cycle. Numbers are only good if the data quality is good.
For in-depth coverage of the practices mentioned above, please refer to the blog on Oracle’s webpage.
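The staging cycle in the last practice above (load, check quality, clean, report) can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the field names (`patient_id`, `amount`, `care_setting`), the rules in `check_row`, and the sample rows are all invented for the example and do not come from any particular product.

```python
from dataclasses import dataclass, field


@dataclass
class QualityReport:
    """Counts of accepted rows and the reasons other rows were rejected."""
    accepted: int = 0
    rejected: dict = field(default_factory=dict)


def check_row(row, required):
    """Return a rejection reason, or None when the row passes every check."""
    for name in required:
        value = row.get(name)
        if value is None or str(value).strip() == "":
            return f"missing {name}"
    try:
        if float(row.get("amount")) < 0:
            return "negative amount"
    except (TypeError, ValueError):
        return "amount is not numeric"
    return None


def clean(row):
    """Normalize a row that has already passed the quality checks."""
    return {
        "patient_id": str(row["patient_id"]).strip(),
        "amount": round(float(row["amount"]), 2),
        # Enrichment step (hypothetical rule): default a missing care setting.
        "care_setting": str(row.get("care_setting") or "unknown").strip().lower(),
    }


def process_staging(staged_rows):
    """Run quality checks over staged rows, then clean the ones that pass."""
    report = QualityReport()
    curated = []
    for row in staged_rows:
        reason = check_row(row, required=("patient_id", "amount"))
        if reason is not None:
            report.rejected[reason] = report.rejected.get(reason, 0) + 1
            continue
        curated.append(clean(row))
        report.accepted += 1
    return curated, report


# Invented sample rows standing in for a staging area.
staged = [
    {"patient_id": " P001 ", "amount": "120.50", "care_setting": "Inpatient"},
    {"patient_id": "", "amount": "75.00", "care_setting": "Outpatient"},
    {"patient_id": "P003", "amount": "-10", "care_setting": "Outpatient"},
    {"patient_id": "P004", "amount": "abc"},
]
curated, report = process_staging(staged)
total = sum(r["amount"] for r in curated)  # the "report" is built only from rows that passed
```

Rejected rows are counted by reason rather than silently dropped, so whoever stewards the data can see why the reported total covers fewer rows than were staged.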
Data Lake is turning the tables in Healthcare
Data lakes bring value to healthcare because they store all the data in a central repository and map it only as the need arises. At the time data is stored in the data lake, it is impossible to know how best to structure it, since not all the use cases for that data are known yet. Bringing data in first and adding structure as use cases arise suits healthcare well, because it avoids the multi-year projects that usually fail.
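The idea of storing data as-is and mapping it only when a use case appears is often called schema-on-read. A minimal sketch, assuming the raw records are JSON lines and that the field names (`mrn`, `dx`, `member_id`, and so on) and the helper `read_with_schema` are hypothetical, might look like this:

```python
import json

# Raw records are kept exactly as they arrived; sources need not agree on fields.
# All values below are invented sample data.
raw_store = [
    '{"mrn": "A17", "dx": "E11.9", "visit": "2023-04-02"}',
    '{"member_id": "A17", "paid": 310.0, "service_date": "2023-04-02"}',
    '{"mrn": "B42", "dx": "I10", "note": "follow-up in 6 weeks"}',
]


def read_with_schema(raw_lines, mapping):
    """Project raw JSON lines onto a use-case-specific schema at read time.

    mapping maps each target field to a source field. Records that contain
    none of the mapped source fields are skipped.
    """
    for line in raw_lines:
        record = json.loads(line)
        if not any(source in record for source in mapping.values()):
            continue
        yield {target: record.get(source) for target, source in mapping.items()}


# A structure defined only once a diagnosis use case comes up.
diagnosis_view = list(
    read_with_schema(raw_store, {"patient": "mrn", "diagnosis": "dx"})
)
```

Nothing in `raw_store` changes when a new use case arrives; a second mapping, for example over `member_id` and `paid`, can be applied to the same raw lines later.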
Data in the healthcare industry can be broadly classified into two sources: claims data and clinical data. Claims data comes from the payers and contains highly uniform, structured data about patients receiving care, their demographics and the care setting they are in. Because the data is meant for reimbursement, it is complete for that purpose. However, because it is first abstracted and then summarized to surface only what is meaningful for provider reimbursement, it does not list all information and is very general in nature. The second source is clinical data: patients’ important and critical information about diagnoses, claims and medical history, stored in EHRs and used to analyze a patient’s health across every time frame at once.
The data starts out in raw form and has to be summarized and analyzed to derive meaningful insights. Data is pulled into the data lake, where each data element is assigned a unique identifier with a set of metadata tags. Frequently, the data lake is built on the Hadoop Distributed File System (HDFS), a cost-effective repository that can accommodate data from disparate sources, structured or unstructured. The data is then put through extract, load and transform (ELT) steps for collection and integration, after which it can be processed by Apache Spark, a distributed analytics engine.
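Assigning each data element a unique identifier and a set of metadata tags can be pictured as a small catalog. The sketch below is an assumption-laden illustration, not how any specific data lake product does it: the `Catalog` and `CatalogEntry` classes, the storage paths, and the tag names are all hypothetical.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """One data element in the lake: an identifier, a location, and its tags."""
    object_id: str
    location: str
    tags: dict = field(default_factory=dict)


class Catalog:
    """In-memory stand-in for a data lake's metadata catalog."""

    def __init__(self):
        self._entries = {}

    def register(self, location, **tags):
        """Assign a fresh unique identifier to a data element and record its tags."""
        entry = CatalogEntry(object_id=str(uuid.uuid4()), location=location, tags=dict(tags))
        self._entries[entry.object_id] = entry
        return entry

    def find(self, **tags):
        """Return every entry whose tags match all of the given key/value pairs."""
        return [
            entry for entry in self._entries.values()
            if all(entry.tags.get(key) == value for key, value in tags.items())
        ]


catalog = Catalog()
claims = catalog.register("raw/claims/2023-04.json", source="payer", kind="claims")
notes = catalog.register("raw/ehr/notes-2023-04.txt", source="ehr", kind="clinical")
clinical = catalog.find(kind="clinical")  # look data up by tag, not by file layout
```

Because lookups go through tags rather than paths, a user can find the clinical data without knowing how or where it was ingested.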
Possibilities with Data Lakes
A data lake offers many possibilities that can be put to use in healthcare, from keeping pace with the transition to value-based care and providing transparency, to supporting rapid growth and delivering a holistic view of care services. A few of the possibilities are:
- It empowers a network between PCPs, patients and specialists by integrating and sharing data, combined with analysis of patients’ clinical and claims data, so that patients get the right care at the right time.
- The raw data stored in a data lake is never lost; it is kept in its original format for further analytics and processing. Since data governance comes into effect on the way out, the user does not need prior knowledge of how the data was ingested. This enhances efficiency, along with high concurrency and improved query processing.
- It stores large amounts of siloed data, with the flexibility to grow and shrink when required. Implementing it on Hadoop, an open source platform, makes it cost-effective and efficient.
The Road Ahead
A data lake provides a massive environment that holds bulk data in its raw form; when equipped with strategic and analytic tools, it makes this data machine readable and easy to use for providers and payers. Using a scalable data lake as the repository allows large volumes of data to be stored and processed in aggregate, which facilitates analysis and drawing insights.
The amount of unstructured data in the healthcare industry is immense, and with this data growing at a rate of 48% per year, healthcare needs to become a data-driven industry with increased scalability, performance and analytic capability. We have only scratched the surface of data lake applications. In the future, when medical imaging becomes an essential means of diagnosis, the difficulty of handling unstructured data will be manageable through comprehensive use of data lakes. The data lake will be a prominent component in the future of healthcare, growing across the enterprise.
Author: Rakesh Rajalwal Chief Architect – BizAcuity