Healthcare Data Lake on AWS Cloud

About Client

The client is a global leader in advancing personalized oncology treatment and supporting cancer drug development research, headquartered in the Greater Boston Area, USA.

Background

The importance of data analytics cannot be overstated for the healthcare and life sciences industry. Data analytics is used to improve procedures and to raise the quality of research, development, and management at every level.

Healthcare businesses collect huge amounts of data today, and the biggest challenges they face in doing so are storage, data management, and big data analytics. A great deal of research has gone into making healthcare analytics easy to manage and beneficial to staff, doctors, and patients alike.

Challenge

The client was handling large volumes of structured and unstructured data in different formats coming from various clinical devices.

They were having trouble cataloging all of this raw data and the lab reports, and making the data available to business applications as well as to scientists for future research.

Our Solution

 

  • Business scoping and source data exploration were undertaken to define the solution and the support needed to meet the objectives of the enterprise.
  • The client required multi-geography data access for their work, so a cloud-based data lake was found to be the ideal solution.
  • A solution architecture was defined that included the data lake architecture as well as an analysis layer.
  • A detailed scoring mechanism was applied to find the cloud platform that best fit the client’s requirements. Microsoft Azure, AWS, Hortonworks on AWS, and Google were evaluated.
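
The scoring mechanism itself is not described in this case study. As a rough illustration only, a weighted scoring matrix is one common way to compare platforms. The sketch below assumes hypothetical criteria, weights, and ratings (none of them come from the engagement) and uses placeholder platform names rather than the vendors listed above.

```python
from typing import Dict, List, Tuple


def weighted_score(ratings: Dict[str, float], weights: Dict[str, float]) -> float:
    """Return the weighted average of per-criterion ratings."""
    missing = set(weights) - set(ratings)
    if missing:
        raise ValueError(f"missing ratings for: {sorted(missing)}")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(ratings[c] * w for c, w in weights.items()) / total_weight


def rank_platforms(
    candidates: Dict[str, Dict[str, float]], weights: Dict[str, float]
) -> List[Tuple[str, float]]:
    """Score every candidate and return (name, score) pairs, best first."""
    scores = {name: weighted_score(r, weights) for name, r in candidates.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical criteria and weights (assumptions, not from the case study).
WEIGHTS = {"multi_region_access": 3, "managed_etl": 2, "cost": 1}

# Placeholder candidates with made-up ratings on a 1-5 scale.
CANDIDATES = {
    "Platform A": {"multi_region_access": 4, "managed_etl": 5, "cost": 3},
    "Platform B": {"multi_region_access": 5, "managed_etl": 3, "cost": 2},
}

if __name__ == "__main__":
    for name, score in rank_platforms(CANDIDATES, WEIGHTS):
        print(f"{name}: {score:.2f}")
```

Dividing by the total weight keeps the result on the same scale as the ratings, so scores stay comparable if criteria are added or reweighted later.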

Outcome

  • Amazon S3 was used as the base for the data lake.
  • The ETL framework was based on AWS Glue, which discovers data and stores the associated metadata in the AWS Glue Data Catalog. Cataloged data is immediately searchable using Elasticsearch and available to business applications through SQL queries.
  • Semantic Layer (Data Warehouse): Amazon Redshift was chosen for the data warehouse services. Data stored in the data warehouse was easy to search and retrieve for business intelligence reports and analytics.
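
To make the cataloging step more concrete, the following is a minimal sketch of how a Glue crawler could be pointed at an S3 prefix so that the tables it discovers land in the Glue Data Catalog. It is not the client's actual configuration: the bucket, IAM role, database, and crawler names are hypothetical, and `register_crawler` needs boto3 and valid AWS credentials to run.

```python
from typing import Any, Dict


def build_crawler_request(
    name: str,
    role_arn: str,
    database: str,
    s3_path: str,
    table_prefix: str = "raw_",
) -> Dict[str, Any]:
    """Build keyword arguments for the Glue `create_crawler` API call."""
    if not s3_path.startswith("s3://"):
        raise ValueError("s3_path must start with 's3://'")
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,        # Data Catalog database for discovered tables
        "TablePrefix": table_prefix,     # prefix added to each discovered table name
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def register_crawler(request: Dict[str, Any]) -> None:
    """Create the crawler and start it (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the request builder works without boto3

    glue = boto3.client("glue")
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])


# Hypothetical names for illustration only (assumptions, not from the case study).
EXAMPLE_REQUEST = build_crawler_request(
    name="clinical-raw-crawler",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueCrawlerRole",
    database="clinical_raw",
    s3_path="s3://example-data-lake/raw/lab-reports/",
)
```

Calling `register_crawler(EXAMPLE_REQUEST)` in an account where the role and bucket exist would create and start the crawler. Keeping the request builder separate from the AWS call lets the configuration be reviewed and tested offline.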