Intro

  • Data lakes are one of the most significant developments in the data landscape, alongside data science and cloud data warehouses. There are many definitions, metaphors, and views on what constitutes a data lake. At its core, a data lake is a central repository that stores all data, regardless of its source or format. It can also hold refined data derived from raw data, and the entire system can serve as a single source of truth for organizational needs.

  • Data is growing exponentially, and there is a constant effort to unlock its value, build data products, and democratize data. A data lake is the perfect place to store data for these purposes.

We can build a data lake on-premises, in the cloud, or as a hybrid solution. Here, I want to discuss what it takes to build one on AWS, focusing more on the components and AWS services required to build a data lake ecosystem.

Storage

S3: Storage is the heart of the data lake, and S3 is ideally suited for it: a cost-effective and reliable service. Depending on how you want to organize the data, you can have a set of buckets; organizations generally take a zone/layer-based approach with a landing zone, raw zone, and refined zone. You can set up S3 lifecycle policies to transition data to S3 Standard-IA and S3 Glacier based on retention needs. There are a few best practices for storing data in S3: always partition the data, compress the files, and use columnar formats like ORC or Parquet for storing refined data.
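To make the partitioning advice concrete, here is a minimal sketch of a helper that builds Hive-style partitioned S3 keys (the zone, table, and file names are hypothetical). Laying keys out this way lets Glue and Athena prune partitions instead of scanning the whole dataset:

```python
from datetime import date

def partitioned_key(zone: str, table: str, d: date, filename: str) -> str:
    """Build a Hive-style partitioned S3 key, e.g.
    refined/orders/year=2024/month=01/day=05/part-000.parquet"""
    return (
        f"{zone}/{table}/"
        f"year={d.year}/month={d.month:02d}/day={d.day:02d}/{filename}"
    )

# Example: a refined-zone object for January 5, 2024
key = partitioned_key("refined", "orders", date(2024, 1, 5), "part-000.parquet")
print(key)
```

A query filtered on `year` and `month` then only touches the matching prefixes.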

Compute

Compute is required to refine data, store it in the refined zone, deliver data to external parties, or load data from the data lake into a cloud DWH.

Lambda: Lambda is a serverless compute service that runs your code without provisioning servers. It is excellent for copying S3 files and other lightweight compute tasks, but the limitation is that an invocation must complete within 15 minutes.
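As a sketch of the "copying S3 files" use case, the handler below reacts to an S3 event notification and copies each new object into a raw-zone bucket. The destination bucket name is hypothetical, and the S3 client is injectable so the logic can be exercised without AWS credentials; in Lambda it falls back to a real boto3 client:

```python
def handler(event, context, s3=None):
    """Copy each object in an S3 event notification into the raw zone.

    The s3 parameter allows injecting a stub client for local testing;
    inside Lambda it defaults to a real boto3 client.
    """
    if s3 is None:
        import boto3  # available in the Lambda runtime
        s3 = boto3.client("s3")

    copied = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        s3.copy_object(
            Bucket="raw-zone-bucket",  # hypothetical destination bucket
            Key=key,
            CopySource={"Bucket": bucket, "Key": key},
        )
        copied.append((bucket, key))
    return copied
```

The same pattern works for light transformations, as long as the work fits inside the 15-minute limit.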

Glue: Glue is a serverless ETL service where you can run Python and Spark code. It is a batch service built on an enhanced Spark framework that adds dynamic frames, and it can also set up workflows to chain jobs and crawlers. It can connect to on-premises data sources via JDBC and be used to ingest data into the data lake.
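A typical Glue job reads a cataloged table as a DynamicFrame and writes refined, partitioned Parquet back to S3. The sketch below only runs on the Glue runtime (the `awsglue` libraries are not installable locally), and the database, table, and bucket names are hypothetical:

```python
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read a table previously cataloged by the Glue crawler (hypothetical names)
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="datalake_raw", table_name="orders")

# Write refined output as partitioned Parquet to the refined zone
glue_context.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://refined-zone-bucket/orders/",
                        "partitionKeys": ["year", "month"]},
    format="parquet")

job.commit()
```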

EMR: EMR is a managed Hadoop cluster where you can run big data workloads using Spark and other frameworks. When to use Glue versus EMR is another topic for discussion, but at a high level, if you need auto-scaling or have to install specific libraries, use EMR.

Kinesis: If you need to ingest streaming data into the data lake or process data in real time, use Kinesis. It has components like Kinesis Data Streams to collect data in real time, Kinesis Data Firehose to deliver data to destinations such as S3, and Kinesis Data Analytics to process data in real time.
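One detail worth understanding is how Kinesis Data Streams routes records: the partition key is MD5-hashed to a 128-bit integer, and each shard owns a slice of that hash-key space. The sketch below mimics that assignment (assuming evenly split shards) to show why records with the same partition key keep their ordering:

```python
import hashlib

def shard_for_key(partition_key: str, num_shards: int) -> int:
    """Approximate Kinesis shard assignment: MD5-hash the partition key
    to a 128-bit integer and map it into one of num_shards equal ranges
    (a simplification assuming evenly split shards)."""
    h = int.from_bytes(hashlib.md5(partition_key.encode()).digest(), "big")
    shard_size = 2 ** 128 // num_shards
    return min(h // shard_size, num_shards - 1)

# Records sharing a partition key always land on the same shard,
# which is what preserves per-key ordering.
assert shard_for_key("device-42", 4) == shard_for_key("device-42", 4)
```

In practice this means choosing a partition key with enough cardinality to spread load across shards.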

Catalog

A catalog is another important component of a data lake: it tells you what data you have and where it is located.

Glue Crawler/Catalog: Glue includes a crawler that scans files in S3, infers a schema, and builds tables on top of them. It is not a full-fledged catalog service where you can search or view data lineage, but it does catalog the datasets, which can then be queried using the Athena service.
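Once the crawler has cataloged a dataset, Athena can query it with plain SQL. A sketch, with hypothetical database, table, and column names, showing how a filter on the partition columns lets Athena prune partitions and scan less data:

```sql
-- Query a table created by the Glue crawler; the partition filter
-- limits the scan to matching year=/month= prefixes in S3.
SELECT order_id, amount
FROM datalake_raw.orders
WHERE year = '2024' AND month = '01';
```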

Logging, Monitoring, and Alerting

CloudWatch: CloudWatch is the service for capturing process logs, sending alerts on failure events, and monitoring. The logs can be shipped to third-party tools for further analysis and monitoring.
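For the alerting piece, one common setup is an EventBridge rule that matches Glue job failures and forwards them to an SNS topic. A sketch of the event pattern (the routing target is up to you):

```json
{
  "source": ["aws.glue"],
  "detail-type": ["Glue Job State Change"],
  "detail": {
    "state": ["FAILED", "TIMEOUT"]
  }
}
```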

Orchestration

Step Functions: Step Functions is a serverless workflow service that orchestrates jobs. It can directly integrate with Lambda, Glue, and other services.
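A pipeline definition is written in Amazon States Language. The sketch below chains a Glue job and a Lambda function (the job and function names are hypothetical); the `.sync` integration makes the state machine wait for the Glue job to finish before moving on:

```json
{
  "Comment": "Hypothetical pipeline: refine data with Glue, then notify",
  "StartAt": "RefineData",
  "States": {
    "RefineData": {
      "Type": "Task",
      "Resource": "arn:aws:states:::glue:startJobRun.sync",
      "Parameters": { "JobName": "refine-orders" },
      "Next": "NotifyDone"
    },
    "NotifyDone": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": { "FunctionName": "notify-pipeline-done" },
      "End": true
    }
  }
}
```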

Security

IAM: IAM (Identity and Access Management) controls access to assets. Always follow the principle of least privilege when granting access.
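As an illustration of least privilege, a policy can grant read-only access to a single table's prefix in the refined zone rather than the whole bucket (bucket and prefix names here are hypothetical):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": "arn:aws:s3:::refined-zone-bucket/orders/*"
    }
  ]
}
```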

KMS: KMS (Key Management Service) manages the keys used to encrypt data at rest. S3, for example, supports server-side encryption with S3-managed keys (SSE-S3) or with KMS-managed keys (SSE-KMS), which add key rotation and access control via key policies.
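For the data lake buckets, a default encryption configuration can enforce SSE-KMS on every object written. A sketch of the bucket encryption configuration, using a hypothetical key alias:

```json
{
  "Rules": [
    {
      "ApplyServerSideEncryptionByDefault": {
        "SSEAlgorithm": "aws:kms",
        "KMSMasterKeyID": "alias/datalake-key"
      }
    }
  ]
}
```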

CICD

Infrastructure as code and continuous delivery pipelines are essential parts of modern software delivery.

CloudFormation: CloudFormation provisions all resources in AWS and enables infrastructure as code.
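As a sketch of infrastructure as code for the storage layer, the CloudFormation fragment below declares a raw-zone bucket with the lifecycle transitions mentioned earlier (the bucket name and retention periods are hypothetical):

```yaml
Resources:
  RawZoneBucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: raw-zone-bucket   # hypothetical name
      LifecycleConfiguration:
        Rules:
          - Id: archive-old-data
            Status: Enabled
            Transitions:
              - StorageClass: STANDARD_IA
                TransitionInDays: 90
              - StorageClass: GLACIER
                TransitionInDays: 365
```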

CodePipeline: CodePipeline, along with CodeBuild and CodeDeploy, enables CICD pipelines for fast and reliable deployments.

AWS Lake Formation

AWS Lake Formation is a fully managed service that makes it easier to build, secure, and manage data lakes. It acts as a template or blueprint that automates steps like ingestion, cleansing, moving, cataloging, and securing data.


Data lakes are here to stay, even as next-generation data architectures such as the data mesh evolve. However, it is crucial to pay attention to governance, data quality, and security when building a data lake, or it can turn into a data swamp. The best way to start is small: take one line of business or domain at a time and build upon it.