Since its launch in August 2017, AWS Glue has come a long way as an ETL-as-a-Service (ETLaaS) offering. When initially released, AWS Glue offered releases versioned 0.x. Interestingly, Glue 0.9 and Glue 1.0 are still available for new and old jobs as I write this post in January 2023. The previous version, AWS Glue 3.0, was announced in August 2021 with improvements such as faster data integration and processing, an upgraded Spark runtime, updated JDBC drivers, and more.

AWS Glue 4.0, the latest release of AWS Glue, was announced on November 28, 2022. According to AWS, “AWS Glue 4.0 upgrades the Spark engines to Apache Spark 3.3.0 and Python 3.10. Glue 4.0 gives customers the latest Spark and Python releases, so they can develop, run, and scale their data integration workloads and get insights faster.” With support for AWS Glue 2.0 ending on March 31, 2023, it’s the right time to think about upgrading your old Glue jobs to Glue 3.0 or Glue 4.0. Upgrading also brings some benefits.

In this post, I’m going to discuss the benefits of upgrading Glue 2.0 or Glue 3.0 jobs to Glue 4.0. It will help you decide whether to upgrade your jobs to the latest version of AWS Glue. If you’re looking for the benefits of upgrading Glue 2.0 jobs to Glue 3.0, check this post.

AWS Glue 4.0

AWS Glue 4.0 offers Spark 3.3 for batch and streaming data integration workloads. As in AWS Glue 3.0, the Apache Spark 3.3 engine in AWS Glue 4.0 includes optimizations from the AWS Glue and Amazon EMR teams. These optimizations speed up data processing and include enhanced shuffles and partition coalescing, vectorized readers, and adaptive query execution in Apache Spark. AWS Glue 4.0 also provides upgraded JDBC drivers for all AWS Glue sources, such as MySQL, PostgreSQL, SQL Server, Oracle, and MongoDB.

AWS Glue 4.0 adds support for built-in Pandas APIs as well as support for Apache Hudi, Apache Iceberg, and Delta Lake formats, giving you more options for analyzing and storing your data. It upgrades connectors for native AWS Glue database sources such as RDS, MySQL, and SQLServer, which simplifies connections to common database sources. AWS Glue 4.0 also adds native support for the new Cloud Shuffle Storage Plugin for Apache Spark, which helps customers scale their disk usage during runtime. – AWS

AWS Glue 4.0 offers many features and improvements over Glue 3.0. These are a few of the best:

  • An upgraded, performance-optimized Spark runtime: Glue 4.0 provides Apache Spark 3.3, which brings many improvements such as row-level runtime filtering, better error messages, and support for complex types in the Parquet vectorized reader.
  • Support for open incremental data lake frameworks: Glue 4.0 brings native support for incremental data lake frameworks such as Apache Hudi, Apache Iceberg, and Delta Lake, making it easier to implement an incremental data lake for your workloads.
  • Support for the Amazon S3-based Cloud Shuffle Storage: Glue 4.0 supports the S3-based Cloud Shuffle Storage Plugin (an Apache Spark plugin), which allows Glue jobs to use Amazon S3 for storing shuffle data. This feature is missing in AWS Glue 3.0.
  • Optimized Data Catalog and upgraded Hive metastore: Glue 4.0 also brings an improved Data Catalog with partition indexes, partition listing, pushdown predicates, and more, along with an upgraded Hive metastore client.
  • Improved error messages with additional context: Because it runs Spark 3.3, Glue 4.0 inherits the error message improvements introduced in that release, especially for ANSI SQL errors. Spark 3.3 also offers a profiler for Python and Pandas UDFs (User Defined Functions), whereas the old profiler only worked for RDD (Resilient Distributed Dataset) operations.
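To make the data lake and S3 shuffle features above more concrete, here is a minimal Python sketch of the job parameters that switch them on. The parameter names (--datalake-formats, --write-shuffle-files-to-s3, and the spark.shuffle.glue.s3ShuffleBucket setting) are what I understand from the AWS Glue Developer Guide, so verify them there before relying on this. The helper name build_default_arguments and the bucket name are hypothetical, and some frameworks (Iceberg, for instance) may need further Spark settings that are not shown.

```python
from typing import Dict, Iterable, Optional

# Values I understand the --datalake-formats job parameter to accept in Glue 4.0.
SUPPORTED_FORMATS = ("hudi", "delta", "iceberg")


def build_default_arguments(
    datalake_formats: Iterable[str] = (),
    shuffle_bucket: Optional[str] = None,
) -> Dict[str, str]:
    """Build a DefaultArguments mapping for a Glue 4.0 job (hypothetical helper)."""
    args: Dict[str, str] = {}

    formats = [name.lower() for name in datalake_formats]
    unknown = sorted(set(formats) - set(SUPPORTED_FORMATS))
    if unknown:
        raise ValueError(f"Unsupported data lake format(s): {unknown}")
    if formats:
        # Several frameworks can be enabled at once as a comma-separated list.
        args["--datalake-formats"] = ",".join(formats)

    if shuffle_bucket:
        # Ask Glue to write shuffle files to Amazon S3 instead of local disk.
        args["--write-shuffle-files-to-s3"] = "true"
        # Assumption: a "shuffle/" prefix inside the given bucket.
        args["--conf"] = (
            f"spark.shuffle.glue.s3ShuffleBucket=s3://{shuffle_bucket}/shuffle/"
        )
    return args
```

For example, build_default_arguments(["hudi", "delta"], shuffle_bucket="my-shuffle-bucket") returns a mapping you could paste into the job's parameters in the console or pass as DefaultArguments when creating the job.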

Glue 4.0 vs. Glue 3.0

Let’s compare the architecture, features, and job options of Glue 3.0 and Glue 4.0 (with Glue 2.0 included for reference), which will help you decide whether you can and want to upgrade your Glue 3.0 jobs to Glue 4.0. At a minimum, your jobs will benefit from the better performance of Glue 4.0.

| Feature | Glue 2.0 | Glue 3.0 | Glue 4.0 |
| --- | --- | --- | --- |
| Apache Spark | Spark 2.4.3 (open-source version) | Spark 3.1.1 (optimized version from AWS Glue and EMR teams) | Spark 3.3.0 (optimized version from AWS Glue and EMR teams) |
| Auto Scaling | Not supported | Supported | Supported |
| AWS Glue Spark shuffle manager with Amazon S3 | Supported | Not supported | Supported |
| Python 2.x | Python 2.7 | Not supported | Not supported |
| Python 3.x | Up to Python 3.7 | Python 3.7 or above | Python 3.10 or above |
| Scala | Scala 2.11 | Scala 2.12 | Scala 2.12 |
| Startup time | Faster than Glue 1.0 | Faster than Glue 2.0 | Faster than Glue 3.0 |

Upgrade to Glue 4.0

First and foremost, there are two important and possibly breaking upgrades in AWS Glue 4.0: Spark 3.3 (from Spark 3.1 in Glue 3.0) and Python 3.10 (from Python 3.7 in Glue 3.0). Spark 3.3 has some breaking changes of its own (for example, the new Spark introduced cloudpickle to replace the built-in pickle and raised the minimum supported Pandas version to 1.0.5), and Python 3.10 will require you to upgrade the Python dependencies (third-party modules) used in your AWS Glue jobs.
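Before upgrading, it can help to check your pinned dependencies against these minimums. The sketch below is one possible pre-upgrade check in plain Python. It only encodes the two facts mentioned above (Python 3.10 and Pandas 1.0.5 or newer); the function names and the shape of the pinned mapping are my own assumptions, and a real project would probably lean on a proper version parser instead of this simplified one.

```python
import re
import sys
from typing import Dict, List, Tuple

# Minimums taken from the paragraph above: Glue 4.0 runs Python 3.10,
# and Spark 3.3 needs Pandas 1.0.5 or newer.
GLUE4_PYTHON = (3, 10)
MIN_VERSIONS = {"pandas": (1, 0, 5)}


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '1.0.5' into (1, 0, 5); stops at the first non-numeric part."""
    parts = []
    for piece in text.strip().split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def find_blockers(pinned: Dict[str, str]) -> List[str]:
    """Return a message for each pinned module that is below a known minimum."""
    problems = []
    for name, minimum in MIN_VERSIONS.items():
        version = pinned.get(name)
        if version is not None and parse_version(version) < minimum:
            wanted = ".".join(str(n) for n in minimum)
            problems.append(f"{name} {version} is older than {wanted}")
    return problems


def runs_on_glue4_python() -> bool:
    """True if the local interpreter is at least as new as Glue 4.0's Python."""
    return sys.version_info[:2] >= GLUE4_PYTHON
```

You could feed find_blockers the versions from your requirements file, for example find_blockers({"pandas": "0.25.3"}), and run your unit tests under a local Python 3.10 interpreter before switching the job over.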

There are more dependency upgrades, including JDBC drivers and Python modules. Check the complete lists in the AWS Glue Developer Guide (see references 5 to 8 below).

Conclusion

If you’re interested in trying out Glue 4.0, check out Migrating AWS Glue jobs to AWS Glue version 4.0 (AWS Glue Developer Guide). Though Glue 2.0 reaches end of support on March 31, 2023, I don’t know when Glue 3.0 will. Judging by the timeline for Glue 2.0, I think support for AWS Glue 3.0 may end in the first or second quarter of 2024.

Nevertheless, you can try out or upgrade your jobs to Glue 4.0, since it offers a ton of benefits over Glue 3.0. My favorite features are the upgraded Spark runtime (Spark 3.3) and support for the Amazon S3-based Cloud Shuffle Storage Plugin.
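If you want to try the switch programmatically, here is a rough sketch of how an existing Spark job definition could be moved to Glue 4.0. As far as I know, the update_job API replaces the whole job definition, so the helper copies the current definition and changes only GlueVersion. The helper name, the list of fields it drops, and the job name in the comment are my assumptions, so compare them with the boto3 Glue documentation and try this on a copy of a job first.

```python
from typing import Any, Dict

# Fields returned by get_job that, to my knowledge, JobUpdate does not accept.
READ_ONLY_FIELDS = ("Name", "CreatedOn", "LastModifiedOn")


def build_job_update(job: Dict[str, Any], glue_version: str = "4.0") -> Dict[str, Any]:
    """Copy an existing job definition and change only its Glue version."""
    update = {key: value for key, value in job.items() if key not in READ_ONLY_FIELDS}
    # AllocatedCapacity is deprecated in favor of MaxCapacity.
    update.pop("AllocatedCapacity", None)
    if "WorkerType" in update:
        # MaxCapacity should not be set together with WorkerType/NumberOfWorkers.
        update.pop("MaxCapacity", None)
    update["GlueVersion"] = glue_version
    return update


# Example wiring with boto3 (not executed here; needs AWS credentials and
# a real job name in place of the hypothetical "my-etl-job"):
#   import boto3
#   glue = boto3.client("glue")
#   job = glue.get_job(JobName="my-etl-job")["Job"]
#   glue.update_job(JobName="my-etl-job", JobUpdate=build_job_update(job))
```

Keeping the pure function separate from the AWS calls lets you inspect or diff the resulting definition before anything is changed in your account.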

References

  1. Documentation history for AWS Glue [AWS Glue Developer Guide]
  2. Introducing AWS Glue 4.0 [AWS]
  3. Migrating AWS Glue jobs to AWS Glue version 4.0 [AWS Glue Developer Guide]
  4. AWS Glue version support policy [AWS Glue Developer Guide]
  5. Python modules already provided in AWS Glue [AWS Glue Developer Guide]
  6. Notable dependency upgrades [AWS Glue Developer Guide]
  7. JDBC driver upgrades [AWS Glue Developer Guide]
  8. Connector upgrades [AWS Glue Developer Guide]
