AWS Glue is a fully managed, serverless ETL service, used for ETL workloads and, perhaps most importantly, in data lake ecosystems. Its high-level capabilities are covered in my previous post here. In this post, I want to talk in detail about the Glue Catalog and Glue Jobs, and walk through a simple example job.
Glue Catalog
The Glue Catalog is a metadata repository built automatically by crawling datasets with Glue Crawlers. It holds the tables, organized into databases, that crawlers create, and these tables can be queried via Amazon Athena. Crawlers can crawl S3, RDS, DynamoDB, Redshift, and any on-premises database reachable via JDBC. The crawled datasets can then be used as source or target connections when developing Glue jobs.
Crawlers work by running built-in classifiers against the dataset in a defined order; if a classifier cannot recognize the data, the crawler invokes the next one. We can also build and use custom classifiers. Crawlers can be scheduled to scan at regular intervals, and there is an option to either update the schema of the Data Catalog tables or ignore schema changes.
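For illustration, here is a minimal sketch of creating and scheduling a crawler with boto3; the crawler name, IAM role, database, and S3 path are all hypothetical.

import boto3
glue = boto3.client("glue")
glue.create_crawler(
    Name="sales-data-crawler",                      # hypothetical crawler name
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="sales_db",
    Targets={"S3Targets": [{"Path": "s3://my-bucket/sales/"}]},
    Schedule="cron(0 2 * * ? *)",                   # run every day at 2 AM UTC
    SchemaChangePolicy={
        "UpdateBehavior": "UPDATE_IN_DATABASE",     # update table schemas on change ("LOG" to ignore)
        "DeleteBehavior": "LOG",
    },
)
glue.start_crawler(Name="sales-data-crawler")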
Here are a couple of issues I have run into with crawlers that you should be careful about. First, crawling a dataset in S3 that has millions of small files can take time and be costly. Second, ensure the data is organized in partitions and that the dataset being crawled has similarly structured files in each folder. If that is not the case (say a folder has 100 files with different schemas), the crawler can end up creating 100 tables.
Glue Jobs
Glue provides two shells to execute code: the Python shell and the Spark shell. The Python shell executes plain Python code in a non-distributed environment. The Spark shell is a distributed environment for executing Spark code written in either PySpark or Scala.
When using the Spark shell, in addition to the data frames and other constructs that Spark provides, Glue introduces a construct called dynamic frames, which, unlike data frames, do not require a schema up front. Dynamic frames come with their own set of transformation functions, and we can always convert dynamic frames to data frames and vice versa to use each other's transformations.
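As a quick illustration (assuming a GlueContext named glue_ctx already exists, as in the example at the end of this post, and hypothetical catalog names), converting between the two looks like this:

from awsglue.dynamicframe import DynamicFrame
# Read a Glue Catalog table as a dynamic frame (database and table names are hypothetical)
dyf = glue_ctx.create_dynamic_frame.from_catalog(database="sales_db", table_name="orders")
# Convert to a Spark data frame to use Spark transformations
df = dyf.toDF().filter("order_total > 100")
# Convert back to a dynamic frame to use Glue transforms and sinks
dyf_filtered = DynamicFrame.fromDF(df, glue_ctx, "dyf_filtered")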
A Glue Dev Endpoint provides an environment to author and test jobs; it is an EMR cluster with all the Glue utilities pre-installed.
Here are some of the important configuration parameters for Glue jobs:
- Bookmarks track the data already processed by the job in previous runs by persisting state information. For example, if a Glue job reads partitioned data from S3 and bookmarks are enabled (they are disabled by default), only new partition data is read on subsequent runs. Bookmarks can also be used for relational sources, where new data is tracked by the keys defined for the table. A job configuration sketch covering this and the other options follows this list.
- DPU (Data Processing Unit) denotes the processing power allocated to the Glue job. A single DPU is 4 vCPUs and 16 GB of memory, and a Spark job requires a minimum of 2 DPUs. The worker type specifies what type of nodes are used: Standard, G.1X (for memory-intensive jobs), or G.2X (for ML Transforms).
- A delay notification threshold can be set to send a notification (via CloudWatch) when a job runs longer than the threshold.
- Runtime parameters can also be passed to the job.
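To tie these options together, here is a sketch of creating a Spark job with boto3, with bookmarks enabled, G.1X workers, a delay notification threshold, and a default runtime parameter; the job name, role, and script location are hypothetical.

import boto3
glue = boto3.client("glue")
glue.create_job(
    Name="s3-to-dynamodb-job",                        # hypothetical job name
    Role="arn:aws:iam::123456789012:role/GlueJobRole",
    Command={
        "Name": "glueetl",                            # Spark job ("pythonshell" for the Python shell)
        "ScriptLocation": "s3://my-bucket/scripts/job.py",
        "PythonVersion": "3",
    },
    GlueVersion="2.0",
    WorkerType="G.1X",                                # each G.1X worker maps to 1 DPU
    NumberOfWorkers=10,
    DefaultArguments={
        "--job-bookmark-option": "job-bookmark-enable",  # bookmarks are disabled by default
        "--s3_file_path": "s3://my-bucket/input/",       # runtime parameter with a default value
    },
    NotificationProperty={"NotifyDelayAfter": 10},    # notify if the job runs longer than 10 minutes
)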
Other Features:
- Metrics: Glue publishes job metrics to CloudWatch and provides a Spark web UI to monitor and debug jobs.
- It also supports streaming ETL, where we can set up continuous ingestion pipelines that consume from sources like Kinesis and Kafka and write to S3 or other data stores.
- It has a feature called Workflows to orchestrate crawlers and jobs with dependencies and triggers (a sketch follows this list).
- It also provides an interface to create and work with SageMaker and Zeppelin notebooks.
- There is a warm-up time involved when you run a Glue job, which used to be around 10 minutes. With Glue 2.0, it has been reduced to 1 minute.
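As a sketch of the Workflow orchestration mentioned above (all names are hypothetical and reuse the crawler and job from the earlier sketches), a trigger can start the ETL job once the crawler finishes successfully:

import boto3
glue = boto3.client("glue")
glue.create_workflow(Name="sales-pipeline")
# Trigger that starts the crawler when the workflow is run
glue.create_trigger(
    Name="start-crawler",
    WorkflowName="sales-pipeline",
    Type="ON_DEMAND",
    Actions=[{"CrawlerName": "sales-data-crawler"}],
)
# Trigger that starts the ETL job only after the crawler succeeds
glue.create_trigger(
    Name="start-job-after-crawl",
    WorkflowName="sales-pipeline",
    Type="CONDITIONAL",
    Predicate={"Conditions": [{
        "LogicalOperator": "EQUALS",
        "CrawlerName": "sales-data-crawler",
        "CrawlState": "SUCCEEDED",
    }]},
    Actions=[{"JobName": "s3-to-dynamodb-job"}],
    StartOnCreation=True,
)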
Pricing:
You are billed only for the time the ETL job takes to run, with no upfront cost and no charge for startup or shutdown time. Jobs are charged based on the number of DPUs used, at $0.44 per DPU-hour. Glue 2.0 has a 1-minute minimum billing duration, while older versions have a 10-minute minimum. The Data Catalog and crawler runs incur additional charges.
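For a rough sense of the cost (using the $0.44 per DPU-hour rate quoted above; check the current pricing page), the arithmetic looks like this:

DPU_HOUR_RATE = 0.44                        # USD per DPU-hour (rate quoted above)
workers = 10                                # e.g. G.1X workers, roughly 1 DPU each
runtime_minutes = 15
cost = workers * (runtime_minutes / 60) * DPU_HOUR_RATE
print(f"Estimated job cost: ${cost:.2f}")   # Estimated job cost: $1.10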
Example
Here is an example of a Glue job that reads from S3, filters data, and writes to DynamoDB.
# Glue script to read from S3, filter data, and write to DynamoDB.
# The S3 data is read with the Spark session; the Glue context could be used instead. Spark is used here
# just to illustrate that a data frame can be converted to a dynamic frame.
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from pyspark import SparkContext
# Resolve the runtime parameters passed to the job
args = getResolvedOptions(sys.argv, ['JOB_NAME', 's3_file_path', 'dynamodb_table'])
# Initialize the Spark and Glue contexts and the Glue job
sc = SparkContext()
glue_ctx = GlueContext(sc)
spark = glue_ctx.spark_session
job = Job(glue_ctx)
job.init(args['JOB_NAME'], args)
s3_file_path = args['s3_file_path']
dynamodb_table = args['dynamodb_table']
# Read the CSV data from S3 into a Spark data frame
df_src = spark.read.format("csv").option("header", "true").load(s3_file_path)
# Filter to a single department and convert the result to a Glue dynamic frame
df_fil_dept = df_src.filter(df_src.dept_no == 101)
dyf_result = DynamicFrame.fromDF(df_fil_dept, glue_ctx, "dyf_result")
# Write the dynamic frame to the target DynamoDB table
glue_ctx.write_dynamic_frame_from_options(
    frame=dyf_result,
    connection_type="dynamodb",
    connection_options={
        "dynamodb.output.tableName": dynamodb_table,
        "dynamodb.throughput.write.percent": "1.0",
    },
)
job.commit()
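To run this job with its two runtime parameters, one option is boto3's start_job_run; the job name and argument values below are hypothetical.

import boto3
glue = boto3.client("glue")
response = glue.start_job_run(
    JobName="s3-to-dynamodb-job",
    Arguments={
        "--s3_file_path": "s3://my-bucket/input/employees.csv",
        "--dynamodb_table": "employees",
    },
)
print(response["JobRunId"])    # the run can then be tracked with get_job_run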