Don't you love data?

Data Science Blog

AWS re:Invent 2020 Week 2 Overview

December 11th, 2020

[Image: re:Invent keynote screenshot]

Welcome to my recap of week 2 of AWS re:Invent. A lot of exciting things happened this week, including the first-ever Machine Learning Keynote. I posted a lot on LinkedIn and Twitter this week, so here is a quick recap in the form of my top 5 list.

5. Data Wrangler

I think this was a logical product for AWS to build. They figured out model training and deployment first, so it only made sense to develop a tool for data preparation next.

AWS claims that SageMaker Data Wrangler is the fastest way to prepare data for ML. Data Wrangler can apply hundreds of data transformations and help you explore, diagnose, and fix potential data issues.

Based on my limited research, it doesn't look like it can help you with data extraction, aggregation, joins, and filtering, so you still need to apply your domain and data knowledge to get your data semi-prepped and hand it off to Data Wrangler for further diagnostics/feature engineering.

4. Aurora, Redshift and Neptune ML

AWS integrated some ML capabilities into their databases. Aurora was announced earlier this year, followed by Redshift and Neptune (their graph database). This feature eliminates the coding data scientists typically do to train models and lets you use a SQL query instead.

I'm curious about graph databases in general and how you can use graph data for ML since a lot of models expect tabular data, so I think Neptune ML is the most exciting announcement in this category. I can't say that I'll be jumping into Aurora or Redshift ML any time soon, but it does look like AWS is trying to make ML building more accessible to people with limited to no data science skills.

3. NFL + AWS =

There were several examples of companies (ranging from food delivery to medical research) mentioned during the ML Keynote and how they use AWS to drive business impact. The NFL was one of their examples.

The NFL uses AWS SageMaker to analyze on-field injuries, feeding hundreds of variables (including speed, direction, contact, play type, and nature of impact) into machine learning and deep learning models. They aim to improve protective equipment with the goal of reducing concussions and other types of injuries (foot, ankle, knee, etc.). My knowledge of American football is very limited, but it's great to see the NFL using analytics to make the game safer.

2. SageMaker Clarify

There are so many great things about this tool! First, it detects bias in your data and trained models. Building Clarify into an ML platform used by millions of users is a great step toward more ethical AI.
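To make "bias in your data" a bit more concrete, here is a minimal sketch of two simple pre-training checks: class imbalance between two groups and the difference in positive label rates. As far as I can tell these resemble metrics described in the Clarify documentation, but this is plain Python on made-up data, not a call to SageMaker, and all of the names below are my own.

```python
# Hand-rolled bias checks on a toy dataset. Everything here is illustrative:
# the group names, labels, and function names are assumptions, not Clarify's API.

def class_imbalance(groups, advantaged):
    """(n_a - n_d) / (n_a + n_d): 0 means both groups are equally represented."""
    n_a = sum(1 for g in groups if g == advantaged)
    n_d = len(groups) - n_a
    return (n_a - n_d) / (n_a + n_d)


def diff_positive_proportions(groups, labels, advantaged):
    """Positive-label rate of the advantaged group minus that of the other group."""
    labels_a = [y for g, y in zip(groups, labels) if g == advantaged]
    labels_d = [y for g, y in zip(groups, labels) if g != advantaged]
    return sum(labels_a) / len(labels_a) - sum(labels_d) / len(labels_d)


# Toy data: 6 rows from group "a", 4 rows from group "b"; label 1 = approved.
groups = ["a"] * 6 + ["b"] * 4
labels = [1, 1, 1, 1, 0, 0, 1, 0, 0, 0]

ci = class_imbalance(groups, advantaged="a")
dpl = diff_positive_proportions(groups, labels, advantaged="a")
print(f"class imbalance: {ci:.2f}")
print(f"difference in positive proportions: {dpl:.2f}")
```

Values near zero suggest the two groups are balanced on that measure; the further from zero, the more you'd want to look at the data before training on it.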

In addition to bias detection, Clarify also supports Machine Learning Interpretability (MLI) features. The accuracy vs. explainability trade-off is one of the factors that drives model selection, especially in highly regulated areas, so I think it's helpful that AWS finally added it to SageMaker.

Quite frankly, I'm not sure why it hasn't been added sooner. H2O has had explainability in their Driverless AI for a couple of years and has shared tons of open source resources on MLI. I guess it's better late than never.

Clarify seems to support Shapley values, which are a hot topic in the data science community and an easy-to-explain way to quantify each variable's impact on a prediction. I couldn't find other local or global MLI methodologies, but I think Shapley values are a great start!
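If Shapley values are new to you, here is a minimal sketch of the idea: average each feature's marginal contribution over every order in which features could be "switched on" from a baseline. The toy model, feature names, and baseline are made up for illustration, and this brute-force version is only practical for a handful of features; real tools approximate it, and I'm not claiming this is how Clarify computes it.

```python
from itertools import permutations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one prediction by enumerating all feature orderings."""
    n = len(x)
    phi = [0.0] * n
    for order in permutations(range(n)):
        current = list(baseline)      # start with every feature at its baseline value
        prev = predict(current)
        for i in order:
            current[i] = x[i]         # switch feature i to its actual value
            new = predict(current)
            phi[i] += new - prev      # marginal contribution of feature i in this ordering
            prev = new
    return [p / factorial(n) for p in phi]


def toy_model(features):
    """Hypothetical scoring model with an income/age interaction term."""
    income, debt, age = features
    return 2.0 * income - 1.0 * debt + 0.5 * income * age


x = [3.0, 2.0, 1.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)

for name, value in zip(["income", "debt", "age"], phi):
    print(f"{name}: {value:+.2f}")
print("sum of contributions:", sum(phi))
print("prediction - baseline:", toy_model(x) - toy_model(baseline))
```

The contributions add up to the gap between the prediction and the baseline prediction, which is what makes them easy to explain: every point of the score is attributed to some feature, and the interaction between income and age is split evenly between the two.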

1. Using AI to automate clinical workflows

Wow! This was impressive! The amount of progress AWS has made in OCR, voice recognition and NLP is incredible.

The first part of this talk covered Comprehend Medical and Transcribe Medical, which were trained specifically to understand medical vocabulary. They're a nice addition to the core Comprehend and Transcribe products, and I think it's only a matter of time before AWS expands these products to other industries.

The second part of the talk focused on the "Document Understanding Solution", which extracts text from documents, identifies various named entities, and makes the data easily searchable using Elasticsearch and Kendra. An added bonus is that you can redact any sensitive data from search. This solution uses AWS Textract (their OCR solution) and Comprehend Medical. Here is a detailed explanation of the solution.
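The redaction step is easy to picture once you have entity offsets. Here is a minimal sketch, assuming each detected entity comes back as a dictionary with BeginOffset, EndOffset, and Type keys (the shape I'd expect from an entity-detection response). The entities below are hand-written stand-ins rather than real service output, and the redact function is my own illustration, not part of the solution.

```python
def redact(text, entities):
    """Replace each entity span with a [TYPE] tag, working right to left
    so earlier offsets stay valid while the string changes length."""
    out = text
    for ent in sorted(entities, key=lambda e: e["BeginOffset"], reverse=True):
        out = out[:ent["BeginOffset"]] + f"[{ent['Type']}]" + out[ent["EndOffset"]:]
    return out


def make_entity(text, phrase, entity_type):
    """Build a stand-in entity for a phrase that appears in the text (hypothetical helper)."""
    begin = text.index(phrase)
    return {"BeginOffset": begin, "EndOffset": begin + len(phrase), "Type": entity_type}


text = "Patient John Doe, age 45, reports chest pain."
entities = [
    make_entity(text, "John Doe", "NAME"),
    make_entity(text, "45", "AGE"),
]

redacted = redact(text, entities)
print(redacted)
```

Only the redacted string would then be indexed for search, so the sensitive values never reach the search index in the first place.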

What did you like the most about week 2?