Laying a Data Infrastructure for Inference and Training

Esteban Rubens

Healthcare is sitting on a treasure-trove of data. Whether it's in electronic health records, imaging systems, bespoke applications, or insurance claims, many organizations have years' or decades' worth of digital health data. This data contains insights that can improve patient care, reduce clinician burnout, and increase efficiency. Regrettably, only a small fraction of this knowledge is ever uncovered or used. 

As has been abundantly discussed and documented in the last few years, the best tool we currently have for unlocking actionable insights from healthcare data is machine learning. No matter which approach one uses (supervised learning, unsupervised learning, or reinforcement learning), the common requirement is ready access to large amounts of data. 

Especially in healthcare, data silos are everywhere, spanning on-premises equipment as well as private and public clouds. Data engineers and data scientists spend an inordinate amount of time wrangling data so they can assemble training datasets. This wrangling involves finding, moving, cleansing, and normalizing data from a variety of sources. Therefore, having a modern and flexible data infrastructure in place is a must for anyone interested in using data for machine learning. 
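
In practice, that wrangling is mostly unglamorous code. Here is a minimal sketch of what the cleansing and normalizing step can look like in Python with pandas; the file names, column names, and schema are purely illustrative, not a reference to any particular system:

```python
import pandas as pd

# Hypothetical example: combine lab results exported from two different source
# systems into one training-ready table. Paths and columns are illustrative only.
ehr = pd.read_csv("exports/ehr_labs.csv", parse_dates=["collected_at"])
claims = pd.read_parquet("exports/claims_labs.parquet")

# Normalize column names and units so the two sources line up.
claims = claims.rename(columns={"svc_date": "collected_at", "val": "glucose_mg_dl"})
ehr["glucose_mg_dl"] = ehr["glucose_mmol_l"] * 18.018  # convert mmol/L to mg/dL

# Cleanse: drop rows with missing values and deduplicate on patient + timestamp.
combined = (
    pd.concat([ehr[["patient_id", "collected_at", "glucose_mg_dl"]],
               claims[["patient_id", "collected_at", "glucose_mg_dl"]]])
      .dropna(subset=["patient_id", "glucose_mg_dl"])
      .drop_duplicates(subset=["patient_id", "collected_at"])
)

combined.to_parquet("training/glucose_labs.parquet", index=False)
```

Multiply this by dozens of sources and formats, and it becomes clear why infrastructure that reduces the finding and moving steps pays off quickly.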

What does a modern and flexible data infrastructure look like to data scientists? 

  • Removes transactional friction from the data wrangling process 
  • Provides access to the right data, at the right time, with the right protocol, and the right performance characteristics in order to minimize the time data scientists need to devote to non-data-science tasks 
  • Allows them to enjoy a full view of all available data from the edge to the core and the cloud 
  • Uses a single set of management tools, no matter where the data resides 
  • Gives the ability to move any data seamlessly, regardless of where it originated or where it currently resides 
  • Abides by the FAIR principles, which state that data should be findable, accessible, interoperable, and reusable (a brief illustration follows this list) 
  • Does all the above in compliance with all applicable regulations such as HIPAA and GDPR 
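
One lightweight way to move toward the FAIR principles is to publish a small manifest alongside each dataset. The sketch below is illustrative only; the identifier scheme, paths, and fields are assumptions rather than a prescribed schema:

```python
import json
from datetime import date

# A minimal, hypothetical dataset manifest mapped to the FAIR principles:
# a stable identifier and description (findable), a resolvable location and
# access protocol (accessible), an open format and explicit schema
# (interoperable), and provenance plus usage terms (reusable).
manifest = {
    "id": "dataset:glucose-labs-2024-q1",                                   # findable
    "title": "De-identified glucose lab results, 2024 Q1",
    "location": "nfs://datahub.example.org/training/glucose_labs.parquet",  # accessible
    "protocol": "NFSv4.1",
    "format": "parquet",                                                    # interoperable
    "schema": ["patient_id", "collected_at", "glucose_mg_dl"],
    "provenance": ["exports/ehr_labs.csv", "exports/claims_labs.parquet"],  # reusable
    "license": "internal-research-use",
    "created": date.today().isoformat(),
}

with open("training/glucose_labs.manifest.json", "w") as f:
    json.dump(manifest, f, indent=2)
```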

How can you get there?

We at NetApp understand what data scientists need to be successful and productive, and the particular challenges that data scientists working in healthcare face. Our mission is to eliminate as much inefficiency and annoyance from the data-wrangling process as possible, keeping in mind that the goal is to help patients and caregivers achieve better outcomes. 

Bringing our Data Fabric together with our Data Science as a Service stack is the simplest way to deliver the modern and flexible data infrastructure that will unleash data science teams. Go beyond pairing scale-out flash storage with GPU compute and 100/200 Gbps Ethernet or InfiniBand, and empower data scientists with data-management tools and frameworks they can use from within the environments they work in, such as Jupyter Notebooks, Kubeflow, Python/PyTorch, and many others. 
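
For instance, once a training table sits on a shared, flash-backed volume mounted into the notebook environment, a data scientist can consume it directly from PyTorch without first copying it to local disk. The mount path and dataset below are hypothetical, a sketch of the workflow rather than a prescribed setup:

```python
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader

# Hypothetical sketch: read a training table from a shared volume mounted
# into a Jupyter environment and stream it into a PyTorch DataLoader.
class GlucoseDataset(Dataset):
    def __init__(self, path):
        df = pd.read_parquet(path)
        self.x = torch.tensor(df[["glucose_mg_dl"]].values, dtype=torch.float32)

    def __len__(self):
        return len(self.x)

    def __getitem__(self, idx):
        return self.x[idx]

loader = DataLoader(GlucoseDataset("/mnt/datahub/training/glucose_labs.parquet"),
                    batch_size=256, shuffle=True, num_workers=4)

for batch in loader:
    pass  # training loop goes here
```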

Free your data, making it available for machine learning projects regardless of whether it resides in your datacenter or the cloud. Move it where it needs to be quickly and seamlessly while maximizing security and minimizing cost. Let data scientists do their jobs instead of spending countless hours chasing and moving data. Simplify collaboration and get faster time to results with NetApp's AI Control Plane, Data Science Toolkit, and Machine Learning Version Control framework. 

To summarize, starting with the right hardware is necessary but not sufficient to give data science teams the best shot at success. Working with a partner that sees the big picture and can add the missing pieces is arguably more important. NetApp has the specialists to make this happen and has done it for many healthcare organizations globally. Let's talk.

 
