Intro
Enterprises and organizations all want to extract value from data. To get there, data usually has to be transformed, loaded into data warehouses or other modern data platforms, and then analyzed for insights. In this article, I discuss three core AWS services for ETL (extract, transform, load): EMR, Glue, and Lambda. Other services such as EC2 and Kinesis can also handle ETL tasks, but I do not cover them here.
AWS EMR
- EMR is a managed service for processing large amounts of data using big data frameworks like Hadoop, Pig, Hive, Spark, and Flink, as well as machine learning frameworks like TensorFlow and MXNet.
- EMR supports two patterns: persistent clusters that run continuously and transient clusters that are created, run the load, and shut down afterward.
- Transient clusters: Ideal for nightly loads, data warehouse (DWH) loads, and daily batch processes or machine learning jobs.
- Persistent clusters: Suitable for streaming jobs, machine learning notebooks, or platforms running multiple jobs throughout the day that load data lakes and DWHs.
- To orchestrate job dependencies, you can use AWS Step Functions and Livy. Alternatively, you can run Airflow, Luigi, or another orchestration tool on EC2.
- EMR supports both batch and stream processing, as well as running machine learning models.
- It can read from and write to DynamoDB, RDS, Elasticsearch, Redshift, Kinesis, and S3.
- Jupyter Notebooks can be created for data cleaning, transformations, machine learning modeling, and sharing.
- EMR can utilize spot instances for core and task nodes and can auto-scale nodes to handle load spikes.
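The transient-cluster pattern above can be sketched with boto3's `run_job_flow` call. This is a minimal sketch under stated assumptions, not a production setup: the release label, instance types, bucket paths, and step name are placeholders chosen for illustration, and the role names assume the EMR default roles already exist in the account. The key setting is `KeepJobFlowAliveWhenNoSteps=False`, which makes the cluster shut down once its steps finish.

```python
def transient_cluster_request(name, log_uri, script_uri):
    """Build run_job_flow arguments for a cluster that terminates after its step."""
    return {
        "Name": name,
        "ReleaseLabel": "emr-6.15.0",  # assumption: use any release available to you
        "LogUri": log_uri,
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [
                {
                    "Name": "master",
                    "InstanceRole": "MASTER",
                    "Market": "ON_DEMAND",
                    "InstanceType": "m5.xlarge",  # placeholder size
                    "InstanceCount": 1,
                },
                {
                    "Name": "core",
                    "InstanceRole": "CORE",
                    "Market": "SPOT",  # spot capacity for core nodes
                    "InstanceType": "m5.xlarge",  # placeholder size
                    "InstanceCount": 2,
                },
            ],
            # False = transient: the cluster shuts down when no steps remain.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [
            {
                "Name": "nightly-load",  # hypothetical step name
                "ActionOnFailure": "TERMINATE_CLUSTER",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": ["spark-submit", "--deploy-mode", "cluster", script_uri],
                },
            }
        ],
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumes the default roles exist
        "ServiceRole": "EMR_DefaultRole",
    }


def launch_cluster(emr_client, request):
    """Submit the request and return the new cluster (job flow) id."""
    return emr_client.run_job_flow(**request)["JobFlowId"]


# Usage (needs AWS credentials; bucket names are placeholders):
#   import boto3
#   request = transient_cluster_request(
#       "nightly-dwh-load", "s3://example-logs/emr/", "s3://example-code/job.py")
#   cluster_id = launch_cluster(boto3.client("emr"), request)
```

For a persistent cluster, the same request would set `KeepJobFlowAliveWhenNoSteps` to `True` and submit steps later as jobs arrive.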
AWS Glue
- Glue is a fully managed, serverless ETL service.
- It has two main components: the Data Catalog and ETL.
- Glue crawlers scan data in S3 and populate the Data Catalog with table definitions, which Athena can then query as tables on top of the S3 data.
- Glue ETL jobs can run Spark code written in Python or Scala, as well as Python shell scripts.
- Glue ETL can read from and write to S3, Aurora, RDS, and Redshift, as well as on-premises databases via JDBC (requires a VPC).
- Glue Workflows can orchestrate and schedule ETL jobs and crawlers.
- Development endpoints create an EMR cluster under the hood, allowing you to run Glue scripts during active development.
- Glue provides SageMaker and Zeppelin notebooks.
- Glue supports both batch and streaming processing, but it does not yet support auto-scaling.
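As a rough illustration of running a Glue ETL job from code, the sketch below starts a job run with boto3's `start_job_run` and polls `get_job_run` until the run reaches a terminal state. The job name and argument are hypothetical, and the polling interval and poll limit are arbitrary choices. The client is passed in as a parameter so the function can be exercised without AWS access.

```python
import time

# Job run states treated as final by this sketch.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR"}


def run_glue_job(glue_client, job_name, arguments=None,
                 poll_seconds=30, max_polls=120, sleep=time.sleep):
    """Start a Glue job run and wait for it to finish.

    Returns (run_id, final_state). Raises TimeoutError if the run has not
    reached a terminal state after max_polls checks.
    """
    response = glue_client.start_job_run(JobName=job_name, Arguments=arguments or {})
    run_id = response["JobRunId"]
    for _ in range(max_polls):
        run = glue_client.get_job_run(JobName=job_name, RunId=run_id)
        state = run["JobRun"]["JobRunState"]
        if state in TERMINAL_STATES:
            return run_id, state
        sleep(poll_seconds)  # wait before checking again
    raise TimeoutError(f"Glue job {job_name} run {run_id} did not finish in time")


# Usage (needs AWS credentials; the job name and argument are hypothetical):
#   import boto3
#   run_id, state = run_glue_job(
#       boto3.client("glue"), "orders-to-redshift", {"--load_date": "2024-01-01"})
```

For scheduled pipelines with several jobs and crawlers, Glue Workflows or Step Functions are likely a better fit than a polling loop like this one.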
AWS Lambda
- Lambda is a serverless, stateless service that runs code in response to various event sources like APIs, S3 events, SNS, SQS, and more.
- It has many use cases, including microservices for web apps and backends, chatbots, Alexa skills, stream processing, and data processing.
- Lambda supports code in Python, Java, Node.js, C#, and Go, and other languages through custom runtimes.
- Lambda requires a memory configuration (128 MB to 10 GB) and has a maximum runtime of 15 minutes per invocation. It can handle concurrent executions and is highly scalable.
- Lambda is suitable for lightweight ETL tasks that fit within those limits (less than 10 GB of memory and less than 15 minutes of runtime). Example ETL use cases include moving files between S3 buckets or reading files and storing them in DynamoDB or RDS with minimal transformation.
- Lambda costs depend on the number of executions, memory allocated, and runtime. You are charged only for the actual runtime, not the maximum runtime configured.
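The "moving files between S3 buckets" use case above might look like the following handler, triggered by S3 object-created events. It is a minimal sketch: the `DEST_BUCKET` environment variable and the helper name `copy_new_objects` are assumptions made for illustration. Object keys in S3 event notifications arrive URL-encoded, so the sketch decodes them before copying.

```python
import os
from urllib.parse import unquote_plus


def copy_new_objects(s3_client, event, dest_bucket):
    """Copy every object named in an S3 event to dest_bucket under the same key."""
    copied = []
    for record in event.get("Records", []):
        src_bucket = record["s3"]["bucket"]["name"]
        # Keys in S3 event notifications are URL-encoded (spaces arrive as '+').
        key = unquote_plus(record["s3"]["object"]["key"])
        s3_client.copy_object(
            Bucket=dest_bucket,
            Key=key,
            CopySource={"Bucket": src_bucket, "Key": key},
        )
        copied.append(key)
    return copied


def lambda_handler(event, context):
    """Lambda entry point; DEST_BUCKET is a hypothetical environment variable."""
    import boto3  # imported here so the helper above can be tested without boto3

    dest_bucket = os.environ["DEST_BUCKET"]
    return {"copied": copy_new_objects(boto3.client("s3"), event, dest_bucket)}
```

Keeping the copy logic in a function that takes the client as an argument makes it easy to test locally with a stub before deploying.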
Final Thoughts
Use the right tool for the job. Lambda is a cost-effective choice for lightweight ETL tasks and integrates well with services like SQS, SNS, and Step Functions. If you already have a persistent EMR cluster in place, leverage it for medium to heavy ETL workloads. For new environments without EMR, Glue is the best option. The only current drawback of Glue is the lack of auto-scaling, but AWS is likely working on this. Recent Glue updates have introduced new features such as streaming and machine learning capabilities. It’s perfectly fine to have both EMR and Glue in your ecosystem.