Guest Column | March 2, 2021

AWS ML Services To Grow Your Business

By Vyacheslav Gorlov, ClearScale


For 10 years, renowned industry analyst firms such as Gartner have named AWS a cloud leader in modern IT technologies, including containers and natural language understanding. To help businesses adapt to continuous market changes, AWS frequently releases new machine learning and artificial intelligence services that are designed to make users’ lives easier by reducing the amount of time spent on routine, manual, and error-prone operations. In this article, we’ll highlight the most notable AWS ML releases from the past few months.

Back in 2017, AWS changed the status quo by releasing Glue, the first-ever serverless Spark offering. Glue makes it easy to discover data stored in different places and formats, preserve that knowledge in a central repository, and then seamlessly expose it for analytical and machine learning operations. Data engineers no longer had to worry about capacity planning, scaling, or provisioning - they only had to write ETL job logic in familiar languages, like Python or Scala, leaving Glue to take care of the real heavy lifting.

Recently, AWS announced several new services that make life even easier for engineers. The first is AWS Glue Studio, a convenient tool for building and managing ETL workflows visually. It’s worth mentioning that iPaaS competitors often restrict the platforms they support (e.g., a Windows-only app written in C#). Glue Studio itself is also free of charge - you pay only for the underlying resources you use. Out of the box, Glue Studio provides more than 250 transformations, validations, and other operations, all of which can be edited either visually or in code.

In addition to using Glue Workflows, jobs can also be orchestrated by Step Functions, Airflow, and custom code. Furthermore, Glue provides interoperability with another service, Elastic Views, which allows organizations to gather data from different sources (e.g., S3, DynamoDB, or RDS) in a single table with a unified schema without having to copy it. The tool is ideal for merging data from diverse streams (e.g., real-time and nightly).
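As a rough illustration of the Step Functions option, the sketch below builds an Amazon States Language definition that runs Glue jobs one after another through the `glue:startJobRun.sync` service integration. The job names and the `glue_job_state_machine` helper are hypothetical, and the sketch assumes job names are unique; it only produces the JSON definition and does not create anything in AWS.

```python
import json


def glue_job_state_machine(job_names):
    """Build a state machine definition that runs Glue jobs in sequence.

    Assumes job names are unique, since each one is used as a state name.
    """
    if not job_names:
        raise ValueError("at least one Glue job name is required")

    states = {}
    for index, job_name in enumerate(job_names):
        state = {
            "Type": "Task",
            # The ".sync" suffix makes Step Functions wait for the job run to finish.
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": job_name},
        }
        if index + 1 < len(job_names):
            state["Next"] = f"Run_{job_names[index + 1]}"
        else:
            state["End"] = True
        states[f"Run_{job_name}"] = state

    return {
        "Comment": "Hypothetical chain of Glue ETL jobs",
        "StartAt": f"Run_{job_names[0]}",
        "States": states,
    }


# Hypothetical job names, used purely for illustration.
definition = glue_job_state_machine(["ingest_orders", "clean_orders"])
print(json.dumps(definition, indent=2))
```

The printed JSON could then be supplied as the definition when creating a state machine, for example through the console or an infrastructure-as-code template.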

Our second shoutout goes to AWS Glue DataBrew. When data arrives in the cloud, the work is not done. Data must go through additional steps before it is ready for processing by machine learning algorithms. For instance, some operations require all historical data to be in one place and cleaned of missing, duplicated, and invalid records. As with Studio, Glue has another winner in DataBrew. The service charges for active sessions rather than a flat fee, and makes all tools available via browser or Jupyter. DataBrew can automatically profile data and suggest how to improve quality. In addition, it’s the only tool to interactively test and try data transformations. Any changes are shown in real time as soon as operations are applied, and users can backtrack to any previous point in time.
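To make the cleanup step concrete, here is a minimal plain-Python sketch of removing missing, duplicated, and invalid records. It is not DataBrew's API; the field names (`order_id`, `amount`, `date`) and the validity rules are assumptions chosen for illustration.

```python
import math
from datetime import datetime


def clean_records(records, required=("order_id", "amount", "date")):
    """Drop records that are missing fields, invalid, or duplicated by order_id."""
    seen_ids = set()
    cleaned = []
    for record in records:
        # Missing: a required field is absent or empty.
        if any(record.get(field) in (None, "") for field in required):
            continue
        # Invalid: amount is not a finite, non-negative number,
        # or date is not in YYYY-MM-DD form.
        try:
            amount = float(record["amount"])
            datetime.strptime(record["date"], "%Y-%m-%d")
        except (TypeError, ValueError):
            continue
        if not math.isfinite(amount) or amount < 0:
            continue
        # Duplicated: keep only the first record seen for each order_id.
        if record["order_id"] in seen_ids:
            continue
        seen_ids.add(record["order_id"])
        cleaned.append({**record, "amount": amount})
    return cleaned


# Hypothetical sample data.
raw = [
    {"order_id": "A1", "amount": "19.99", "date": "2021-02-01"},
    {"order_id": "A1", "amount": "19.99", "date": "2021-02-01"},  # duplicate
    {"order_id": "A2", "amount": "", "date": "2021-02-02"},       # missing amount
    {"order_id": "A3", "amount": "-5", "date": "2021-02-03"},     # negative amount
    {"order_id": "A4", "amount": "7.5", "date": "2021-13-40"},    # invalid date
    {"order_id": "A5", "amount": "42", "date": "2021-02-05"},
]
cleaned = clean_records(raw)
print([record["order_id"] for record in cleaned])
```

A managed tool performs this kind of work interactively and at scale; the sketch only shows what "cleaned" means for a handful of rows.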

AWS Data Lake and Database services also got major enhancements this year. For example, AWS Lake Formation received ACID transaction support, row-level security, and automated table optimization (i.e., data compression, formatting, and partitioning). Amazon HealthLake is a new purpose-built storage solution that offers the cost efficiency and performance of data lakes and relational databases in one place. Data is stored in the FHIR format and is accessible via ELK and REST interfaces. Amazon Neptune also now provides machine learning capabilities on top of the built-in Gremlin low-latency interface for data preprocessing, model training, and prediction generation.

AWS’s pearl in the machine learning world - Amazon SageMaker - has also been updated with new features to address peripheral steps in the already all-encompassing ML Lifecycle Management pipeline:

  • JumpStart is an up-to-date collection of state-of-the-art models that engineers can build, train, and deploy in only a few clicks.
  • Pipelines manages model life cycles in a similar way to AWS Step Functions - the only serverless stateflow orchestrator on the market - at no cost, and includes extra governance and lineage features.
  • Feature Store is a fast memory- and SSD-based cache for model parts that change frequently and, hence, can't be baked directly into models (e.g., XGBoost weights).
  • Clarify interprets how and why models generate certain predictions.
  • Data Wrangler is an extension of Glue DataBrew with machine learning-dedicated constructs (e.g., training time estimation based on data shape and size, as well as model complexity).
  • Edge Manager allows users to deploy, manage, and monitor machine learning models locally, regardless of the underlying platform and hardware, for a flat fee per device registration.

AWS has made great efforts to expand its managed services offerings to cover more business verticals and better support those without in-depth data science expertise or the capacity to develop, train, and provision their own machine learning models. For instance, the pandemic created a huge demand for efficient cargo and logistics services, which Amazon supports with a suite of predictive maintenance services.

Amazon Monitron includes wireless sensors and device gateways that attach to industrial equipment and continuously send telemetry to AWS IoT. The service also comes with basic analytics and anomaly detection capabilities. Amazon Lookout for Equipment is a comprehensive ML model that detects anomalies in telemetry from industrial assets and discerns their root causes. Additionally, it handles data from Monitron, IoT Core, and any other data stream. Amazon Lookout for Metrics does the same but for domain-agnostic cases. The service can identify reasons behind sudden sales revenue drops, recognize malfunctioning EC2 instances, and even diagnose melanoma on human skin. Amazon Lookout for Vision supports factories well due to its ability to detect damaged parts, identify missing components, and uncover process issues.
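For readers new to the idea, anomaly detection on telemetry can be sketched with a simple rolling z-score: each reading is compared with the mean and standard deviation of the readings just before it. This is a deliberately basic stand-in, not the model any of the Lookout services uses; the window size, threshold, and sample vibration values are assumptions.

```python
from statistics import mean, stdev


def find_anomalies(readings, window=5, threshold=3.0):
    """Return indexes of readings that deviate strongly from the preceding window."""
    anomalies = []
    for i in range(window, len(readings)):
        baseline = readings[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        if sigma == 0:
            # A perfectly flat baseline: any change at all counts as a deviation.
            if readings[i] != mu:
                anomalies.append(i)
            continue
        if abs(readings[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical vibration readings with a single spike at index 7.
vibration = [0.51, 0.49, 0.50, 0.52, 0.48, 0.50, 0.51, 2.40, 0.49, 0.50]
print(find_anomalies(vibration))  # [7]
```

Note that the spike inflates the baseline for the readings that follow it, which is one reason production services rely on more robust models than this.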

Computer Vision services have been enhanced with Amazon Panorama - a cost-efficient camera gateway that can run CV model inference locally, encode video streams in real time, and ingest all information into cloud-native data lakes. Amazon Panorama uses IoT Greengrass under the hood and does not require additional system maintenance. Compared to AWS DeepLens, Panorama is a production-grade technology, not an educational toolkit. Moreover, it doesn't have embedded cameras. DeepLens, by contrast, is a monolithic tool that includes both sensors and compute resources in the same box.

Other services also got their fair share of updates this year. Amazon Transcribe Medical can now support various medical specialties, improving the accuracy of domain-specific speech recognition. Amazon Transcribe now automatically detects the dominant language of any recording without any prompt from users. Amazon Kendra, which debuted last year, gained custom data source support and a built-in Model Monitor. In addition, it’s now HIPAA-compliant. Amazon Managed Workflows for Apache Airflow is almost like Step Functions for customer service. It supports those who are in the process of moving data ops into the cloud, enabling them to reuse existing stateflow code with little or no changes before final cutovers to cloud-native stacks.
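As an example of the Transcribe language detection mentioned above, the sketch below assembles the parameters for a `StartTranscriptionJob` request with `IdentifyLanguage` enabled instead of a fixed `LanguageCode`. The helper name, job name, and S3 URI are hypothetical, and the actual API call is left as a comment because it needs AWS credentials and the boto3 package.

```python
def build_transcription_request(job_name, media_uri, candidate_languages=None):
    """Assemble StartTranscriptionJob parameters that ask for language detection."""
    request = {
        "TranscriptionJobName": job_name,
        "Media": {"MediaFileUri": media_uri},
        # Ask the service to detect the dominant language
        # instead of passing a fixed LanguageCode.
        "IdentifyLanguage": True,
    }
    if candidate_languages:
        # Optionally narrow detection to a list of likely languages.
        request["LanguageOptions"] = list(candidate_languages)
    return request


# Hypothetical job name and S3 location.
request = build_transcription_request(
    "intro-call-001",
    "s3://example-bucket/calls/intro.mp3",
    candidate_languages=["en-US", "es-US"],
)
print(request)

# With boto3 installed and credentials configured, the request could be sent as:
#   import boto3
#   boto3.client("transcribe").start_transcription_job(**request)
```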

Conclusion

AWS is committed to deploying next-gen cloud solutions that enable organizations of all types to leverage the power of machine learning. Over the last several years, the cloud provider has made significant improvements to legacy services and added new features so that engineering teams can gather, store, and process data at scale without overwhelming human resources. Those interested in implementing machine learning for their enterprises need look no further than AWS.

About the author

Vyacheslav Gorlov is a Solution Architect at ClearScale.