Since its launch in August 2017, AWS Glue has come a long way as an ETL-as-a-Service (ETLaaS) offering. When initially released, AWS Glue offered releases versioned 0._. Interestingly, Glue 0.9 and Glue 1.0 are still available for new and old jobs as I write this post in January 2023. The previous version, AWS Glue 3.0, was announced in August 2021 with improvements such as faster data integration and processing, an upgraded Spark runtime, updated JDBC drivers, and more.
AWS Glue 4.0, the latest release of AWS Glue, was announced on November 28, 2022. According to AWS, “AWS Glue 4.0 upgrades the Spark engines to Apache Spark 3.3.0 and Python 3.10. Glue 4.0 gives customers the latest Spark and Python releases, so they can develop, run, and scale their data integration workloads and get insights faster.” With support for AWS Glue 2.0 ending on March 31, 2023, now is the right time to think about upgrading your old Glue jobs to Glue 3.0 or Glue 4.0. Beyond staying on a supported version, the upgrade brings benefits of its own.
In this post, I’m going to discuss the benefits of upgrading Glue 2.0 or Glue 3.0 jobs to Glue 4.0, which should help you decide whether to move your jobs to the latest version of AWS Glue. If you’re looking for the benefits of upgrading Glue 2.0 jobs to Glue 3.0, check this post.
AWS Glue 4.0 offers Spark 3.3 for batch and streaming data integration workloads. As in AWS Glue 3.0, the Apache Spark engine in AWS Glue 4.0 includes optimizations from the AWS Glue and Amazon EMR teams. These optimizations speed up data processing and include enhanced shuffles and partition coalescing, vectorized readers, and adaptive query execution in Apache Spark. AWS Glue 4.0 also ships upgraded JDBC drivers for AWS Glue sources such as MySQL, PostgreSQL, SQL Server, Oracle, and MongoDB.
AWS Glue 4.0 adds support for built-in Pandas APIs as well as support for Apache Hudi, Apache Iceberg, and Delta Lake formats, giving you more options for analyzing and storing your data. It upgrades connectors for native AWS Glue database sources such as RDS, MySQL, and SQLServer, which simplifies connections to common database sources. AWS Glue 4.0 also adds native support for the new Cloud Shuffle Storage Plugin for Apache Spark, which helps customers scale their disk usage during runtime. – AWS
AWS Glue 4.0 offers many features and improvements over Glue 3.0. A few of the best ones in Glue 4.0 are:
Let’s compare the architecture, features, and job options of Glue 2.0, Glue 3.0, and Glue 4.0, which will help you decide whether you can and want to upgrade your jobs to Glue 4.0. At a minimum, your jobs will benefit from the better performance of Glue 4.0.
|Feature|Glue 2.0|Glue 3.0|Glue 4.0|
|---|---|---|---|
|Apache Spark|Spark 2.4.3 (open-source version)|Spark 3.1.1 (optimized version from AWS Glue and EMR teams)|Spark 3.3.0 (optimized version from AWS Glue and EMR teams)|
|Auto Scaling|Not supported|Supported|Supported|
|AWS Glue Spark shuffle manager with Amazon S3|Supported|Not supported|Supported|
|Python 2.x|Python 2.7|Not supported|Not supported|
|Python 3.x|Up to Python 3.7|Python 3.7 or above|Python 3.10 or above|
|Scala|Scala 2.11|Scala 2.12|Scala 2.12|
|Startup time|Faster than Glue 1.0|Faster than Glue 2.0|Faster than Glue 3.0|
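If you maintain many jobs, it can help to keep the table above as a small lookup and ask it what changes between two versions. The following is only a sketch: the dictionary transcribes the base version numbers from the table (ignoring the "or above" wording), and the names `GLUE_RUNTIMES` and `runtime_changes` are my own, not anything provided by AWS.

```python
# Runtime details transcribed from the comparison table above.
# Python entries use the base version listed in the table.
GLUE_RUNTIMES = {
    "2.0": {"spark": "2.4.3", "python": "3.7", "scala": "2.11", "auto_scaling": False},
    "3.0": {"spark": "3.1.1", "python": "3.7", "scala": "2.12", "auto_scaling": True},
    "4.0": {"spark": "3.3.0", "python": "3.10", "scala": "2.12", "auto_scaling": True},
}


def runtime_changes(source, target):
    """Return {component: (old, new)} for every component that differs."""
    old, new = GLUE_RUNTIMES[source], GLUE_RUNTIMES[target]
    return {key: (old[key], new[key]) for key in old if old[key] != new[key]}


# List what a Glue 3.0 job would see change when moved to Glue 4.0.
for component, (before, after) in runtime_changes("3.0", "4.0").items():
    print(f"{component}: {before} -> {after}")
```

Going from Glue 3.0 to Glue 4.0, only the Spark and Python entries differ in this lookup, which matches the two upgrades that deserve the most attention.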
First and foremost, there are two important and possibly breaking upgrades in AWS Glue 4.0: Spark 3.3 (from Spark 3.1 in Glue 3.0) and Python 3.10 (from Python 3.7 in Glue 3.0). Spark 3.3 has some breaking changes (for example, the new Spark introduced cloudpickle to replace the built-in pickle and bumped the minimum supported Pandas to version 1.0.5), and Python 3.10 will require you to upgrade the Python dependencies (aka third-party modules) used in your AWS Glue jobs.
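Before migrating, you may want a quick check of whether the Pandas version a job pins meets the 1.0.5 minimum mentioned above, and whether your local interpreter matches the Python 3.10 runtime. Here is a minimal sketch using only the standard library. The helper names are hypothetical, and the version parsing is deliberately simple (leading digits of each dotted part), so treat it as a starting point rather than a full version parser.

```python
import sys

MIN_PANDAS_FOR_SPARK_33 = (1, 0, 5)  # minimum Pandas version cited above
GLUE_4_PYTHON = (3, 10)              # Python runtime in Glue 4.0


def parse_version(text):
    """Turn '1.0.5' into (1, 0, 5), keeping only the leading digits of each part."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def pandas_ok(installed):
    """True if the given Pandas version meets the Spark 3.3 minimum."""
    return parse_version(installed) >= MIN_PANDAS_FOR_SPARK_33


def python_matches_glue4(version_info=sys.version_info):
    """True if the interpreter's major.minor equals the Glue 4.0 Python version."""
    return tuple(version_info[:2]) == GLUE_4_PYTHON


print("pandas 0.25.3 ok:", pandas_ok("0.25.3"))
print("pandas 1.5.1 ok:", pandas_ok("1.5.1"))
print("local Python matches Glue 4.0:", python_matches_glue4())
```

Running the same dependency set locally on Python 3.10 before touching the job is usually the cheapest way to find modules that need an upgrade.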
There are more dependency upgrades including JDBC drivers and Python modules. Check the complete list on AWS’s pages:
If you’re interested in trying out Glue 4.0, check out Migrating AWS Glue jobs to AWS Glue version 4.0 (AWS Glue Developer Guide). Though Glue 2.0 is reaching end of support on March 31, 2023, there is no word yet on when Glue 3.0 will reach end of support. Judging by the timeline for Glue 2.0, I think the first or second quarter of 2024 may see the end of support for AWS Glue 3.0.
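If you script the version bump instead of using the console, one thing to keep in mind is that an `update_job` call takes a full `JobUpdate` definition, so it is common to start from the existing job definition and change only `GlueVersion`. The sketch below prepares that payload as a plain dictionary. The helper `build_job_update`, the job name, role ARN, and bucket are all hypothetical, and the list of keys stripped from the `get_job` response is my assumption of what needs removing; verify it against the API reference before relying on it.

```python
import copy

# Keys returned by get_job that are assumed not to belong in JobUpdate.
READ_ONLY_KEYS = ("Name", "CreatedOn", "LastModifiedOn")


def build_job_update(job, glue_version="4.0"):
    """Return a JobUpdate-style dict copied from `job` with a new GlueVersion."""
    update = copy.deepcopy(job)
    for key in READ_ONLY_KEYS:
        update.pop(key, None)
    # AllocatedCapacity is deprecated; drop it if present.
    update.pop("AllocatedCapacity", None)
    # Assumption: MaxCapacity should not be sent alongside WorkerType/NumberOfWorkers.
    if "WorkerType" in update or "NumberOfWorkers" in update:
        update.pop("MaxCapacity", None)
    update["GlueVersion"] = glue_version
    return update


# Hypothetical job definition, shaped like the "Job" entry of a get_job response.
sample_job = {
    "Name": "example-etl-job",
    "Role": "arn:aws:iam::123456789012:role/ExampleGlueRole",
    "Command": {
        "Name": "glueetl",
        "ScriptLocation": "s3://example-bucket/scripts/job.py",
        "PythonVersion": "3",
    },
    "GlueVersion": "3.0",
    "WorkerType": "G.1X",
    "NumberOfWorkers": 10,
    "MaxCapacity": 10.0,
    "AllocatedCapacity": 10,
    "CreatedOn": "2022-01-01T00:00:00",
}

job_update = build_job_update(sample_job)
print(job_update["GlueVersion"])

# With boto3 installed and credentials configured, the payload could be applied like this:
#   import boto3
#   glue = boto3.client("glue")
#   job = glue.get_job(JobName="example-etl-job")["Job"]
#   glue.update_job(JobName="example-etl-job", JobUpdate=build_job_update(job))
```

I would try this on a copy of the job first, since the Spark 3.3 and Python 3.10 changes can break a script that ran fine on Glue 3.0.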
Nevertheless, it’s worth trying out Glue 4.0 and upgrading your jobs to it, since it offers plenty of benefits over Glue 3.0. My favorite features are the upgraded Spark runtime (Spark 3.3) and support for the Amazon S3-based Cloud Shuffle Storage Plugin.