Summary
Built storage and pipeline infrastructure with the AWS CLI, then loaded structured and semi-structured YouTube data into an S3 data lake. Transformed .csv and .json data into Apache Parquet with Python scripts on AWS Lambda and in Glue ETL jobs. Built data catalogs with Glue Crawler to form the schema of the data lake, and built a Glue ETL job to produce an analytics table. Visualized the results in a dashboard using AWS QuickSight.
What this project entails
- Building a data lake in S3
- Consists of disparate semi-structured and structured data
- AWS Glue Data Catalog
- ETL in AWS Glue Spark jobs
- Amazon SageMaker Jupyter Notebooks
- Amazon SNS for alerting
- SQL using Amazon Athena and Spark SQL
- e.g. the impact of querying the optimized data layers
- Writing scripts to ingest changes incrementally
- Build proper schema (also schema evolution)
- BI dashboards in Amazon QuickSight
Decisions
- Why Data Lake and not Data Warehouse?
- Cost: Redshift costs a lot of money
- Glue and Athena are serverless, so you pay per query/operation
- Scalable
- Why Parquet?
- Column-based storage rather than row-based like .csv
- Analytic queries usually read only a subset of columns, which is faster with a columnar format (see the sketch after this list)
- Similar to the storage system Redshift uses (column-oriented DBMS)
- Trying something new
- I already have projects working with conventional, on-premise relational databases
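As a quick illustration of the columnar advantage, here is a minimal pandas sketch (assumes pandas with pyarrow installed; the file and column names are assumptions based on the trending-video CSVs used later):
import pandas as pd

# Hypothetical local copy of one region's file (see the S3 upload section below)
csv_path = "USvideos.csv"
parquet_path = "USvideos.parquet"

# One-time conversion; pandas writes Parquet via pyarrow
pd.read_csv(csv_path).to_parquet(parquet_path, index=False)

# CSV: even for two columns, every row must be read and parsed in full
subset_csv = pd.read_csv(csv_path, usecols=["video_id", "views"])

# Parquet: only the requested column chunks are read and decoded
subset_parquet = pd.read_parquet(parquet_path, columns=["video_id", "views"])

print(subset_csv.shape, subset_parquet.shape)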
The Dataset
Cloud Benefits
- 'A 10-node cluster running for 10 hours costs the same as a 100-node cluster running for one hour'
- On-premise means:
- Procurement, approval, licensing
- Shipping, installation, electricity, cooling systems,...
- Underutilized vs overutilized
- Pay even when nobody is using the hardware
- Sustainability issues
AWS
- Contains 200+ services which can be used for many purposes:
- Compute
- Storage and databases
- AI and ML
- Data lakes and Analytics
- Internet of Things
Creating an Account
- https://aws.amazon.com/premiumsupport/knowledge-center/create-and-activate-aws-account/
- https://www.youtube.com/watch?v=kbtDJHgjrPc&t=294s
Best Practices
- Always protect root account
- Don't use it for projects
- Enable MFA wherever you can
- e.g. Google Authenticator
- Rotate all keys and passwords periodically
- Avoid "credential" files on shared computers
- e.g. https://github.com/99designs/aws-vault
- Follow least privilege principle always
Data Lake Infrastructure
Infrastructure As Code
- AWS console
- Programmatically
- AWS CLI
- SDK
- Python, NodeJS, etc. (software development kits)
- This project uses AWS CLI to build and configure the data lake
- Some steps require manual interaction with the console (labelled as manual)
- All AWS CLI usage follows this command reference
S3 for Data Storage
Naming Conventions for S3
s3://company-raw-awsregion-awsaccountID-env/source/source_region/tablename/year=yyyy/month=mm/day=dd/table_<yearmonthday>.<file_format>
env = dev, test, prod
source = name or indicator of source
source_region = region of data source
krozic-youtube-raw-useast1-dev
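A small, hypothetical helper that builds object keys following this convention (the function name and defaults are illustrative, not part of the project):
from datetime import date

def raw_object_key(source: str, source_region: str, table: str, d: date, file_format: str = "csv") -> str:
    # Builds everything after the bucket name in the convention above
    return (
        f"{source}/{source_region}/{table}/"
        f"year={d:%Y}/month={d:%m}/day={d:%d}/"
        f"{table}_{d:%Y%m%d}.{file_format}"
    )

# e.g. raw_object_key("youtube", "ca", "raw_statistics", date(2018, 6, 13))
# -> 'youtube/ca/raw_statistics/year=2018/month=06/day=13/raw_statistics_20180613.csv'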
Creating Our S3 Bucket
Bucket creation:
aws s3api create-bucket `
--bucket krozic-youtube-raw-useast1-dev `
--region us-east-1 `
--object-ownership BucketOwnerEnforced `
--output text >> setup.log
Encryption:
aws s3api put-bucket-encryption `
--bucket krozic-youtube-raw-useast1-dev `
--server-side-encryption-configuration '{\"Rules\": [{\"ApplyServerSideEncryptionByDefault\": {\"SSEAlgorithm\": \"AES256\"}}]}'
Restrict Public Access:
aws s3api put-public-access-block `
--bucket krozic-youtube-raw-useast1-dev `
--public-access-block-configuration "BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true"
Populating the S3 Bucket
- json:
aws s3 cp . s3://krozic-youtube-raw-useast1-dev/youtube/raw_statistics_reference_data/ `
--recursive `
--exclude "*" `
--include "*.json"
- csv:
aws s3 cp CAvideos.csv s3://krozic-youtube-raw-useast1-dev/youtube/raw_statistics/region=ca/
aws s3 cp DEvideos.csv s3://krozic-youtube-raw-useast1-dev/youtube/raw_statistics/region=de/
aws s3 cp FRvideos.csv s3://krozic-youtube-raw-useast1-dev/youtube/raw_statistics/region=fr/
aws s3 cp GBvideos.csv s3://krozic-youtube-raw-useast1-dev/youtube/raw_statistics/region=gb/
aws s3 cp INvideos.csv s3://krozic-youtube-raw-useast1-dev/youtube/raw_statistics/region=in/
aws s3 cp JPvideos.csv s3://krozic-youtube-raw-useast1-dev/youtube/raw_statistics/region=jp/
aws s3 cp KRvideos.csv s3://krozic-youtube-raw-useast1-dev/youtube/raw_statistics/region=kr/
aws s3 cp MXvideos.csv s3://krozic-youtube-raw-useast1-dev/youtube/raw_statistics/region=mx/
aws s3 cp RUvideos.csv s3://krozic-youtube-raw-useast1-dev/youtube/raw_statistics/region=ru/
aws s3 cp USvideos.csv s3://krozic-youtube-raw-useast1-dev/youtube/raw_statistics/region=us/
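The same uploads can also be scripted; a sketch with boto3 that derives the region partition from each file name (assumes the CSV files sit in the working directory and AWS credentials are configured):
from pathlib import Path
import boto3

s3 = boto3.client("s3")
bucket = "krozic-youtube-raw-useast1-dev"

for csv_file in sorted(Path(".").glob("*videos.csv")):
    region = csv_file.name[:2].lower()  # 'CAvideos.csv' -> 'ca'
    key = f"youtube/raw_statistics/region={region}/{csv_file.name}"
    s3.upload_file(str(csv_file), bucket, key)
    print(f"uploaded {csv_file.name} -> s3://{bucket}/{key}")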
IAM Role
- Must first create a 'Role' so that AWS Glue can interact with the S3 bucket
- Manual
- 'use cases for other AWS Services': Glue
- Search for S3: AmazonS3FullAccess
- Role name: krozic-youtube-glue-s3-role
- Add Permission -> Attach Policies
- Search 'Glue': AWSGlueServiceRole
- For EC2, S3, and CloudWatch Logs
aws iam create-role `
--role-name krozic-youtube-glue-s3-role `
--assume-role-policy-document file://policies/trust-glue.json
aws iam attach-role-policy `
--role-name krozic-youtube-glue-s3-role `
--policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
aws iam attach-role-policy `
--role-name krozic-youtube-glue-s3-role `
--policy-arn arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole
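The policies/trust-glue.json and policies/trust-lambda.json files referenced by create-role are not shown in these notes; a sketch that generates them as standard service trust policies (assumed to match the files actually used):
import json
from pathlib import Path

def write_trust_policy(service: str, path: str) -> None:
    # Standard trust policy letting the given AWS service assume the role
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": service},
                "Action": "sts:AssumeRole",
            }
        ],
    }
    Path(path).parent.mkdir(parents=True, exist_ok=True)
    Path(path).write_text(json.dumps(policy, indent=4))

write_trust_policy("glue.amazonaws.com", "policies/trust-glue.json")
write_trust_policy("lambda.amazonaws.com", "policies/trust-lambda.json")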
AWS Glue Catalog
- Data Catalog: information about our data and what it contains
- Ex. labeling all columns and what they are
- Builds a metadata catalog that allows ETL to be performed
- Create Database:
aws glue create-database `
--database-input '{\"Name\": \"krozic_youtube_raw\", \"Description\": \"This database is created using AWS CLI\"}'
- Create Crawler:
aws glue create-crawler `
--name "krozic-youtube-raw-glue-catalog-1" `
--database-name "krozic_youtube_raw" `
--role "krozic-youtube-glue-s3-role" `
--targets '{\"S3Targets\": [{\"Path\": \"s3://krozic-youtube-raw-useast1-dev/youtube/raw_statistics_reference_data\"}]}'
Run Crawler:
aws glue start-crawler --name "krozic-youtube-raw-glue-catalog-1"
AWS Athena
- Used to query the catalog tables with SQL
- Cannot be used without defining an 'output location' for query results
- Create new S3 bucket for the query results:
aws s3api create-bucket `
--bucket krozic-youtube-raw-useast1-athena-job `
--region us-east-1 `
--object-ownership BucketOwnerEnforced `
--output text >> setup.log
aws s3api put-bucket-encryption `
--bucket krozic-youtube-raw-useast1-athena-job `
--server-side-encryption-configuration '{\"Rules\": [{\"ApplyServerSideEncryptionByDefault\": {\"SSEAlgorithm\": \"AES256\"}}]}'
aws s3api put-public-access-block `
--bucket krozic-youtube-raw-useast1-athena-job `
--public-access-block-configuration "BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true"
- Attach to Athena as output
- Manual
- Get an error because of the JSON formatting
- SerDe (Serializer/Deserializer)
- Requirement: Data Cleansing
- Create light ETL: JSON to Apache Parquet (columnar rather than row-based format)
![[Pasted image 20220613162115.png]]
AWS Lambda
- JSON data is stored on S3
- Inside Lambda you can write Python (or another supported language) to transform the data
- Then build the Glue Crawler on this clean data
First create a new role: Manual
- Add lambda role
- Use policy AmazonS3FullAccess
- Name: krozic-youtube-lambda-s3-role
- Programmatically:
aws iam create-role `
--role-name krozic-youtube-lambda-s3-role `
--assume-role-policy-document file://policies/trust-lambda.json
aws iam attach-role-policy `
--role-name krozic-youtube-lambda-s3-role `
--policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
aws iam attach-role-policy `
--role-name krozic-youtube-lambda-s3-role `
--policy-arn arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole
- Create the Lambda function: not fully done/working from the CLI
- Must first create a zip file with the Python code and its dependencies
aws lambda create-function `
--function-name krozic-youtube-raw-useast1-lambda-json-parquet2 `
--runtime python3.8 `
--environment file://lambda_environment.json `
--role arn:aws:iam::747879646971:role/krozic-youtube-lambda-s3-role
# --code was left out above; create-function also needs a zip deployment package
# (e.g. --zip-file fileb://lambda_function.zip) and a --handler before it will succeed
Environment Variables to add (the keys read by the Python script below; values are placeholders):
{
    "Variables": {
        "s3_clean_layer": "<clean S3 path>",
        "glue_catalog_db_name": "<clean Glue database>",
        "glue_catalog_table_name": "<clean Glue table>",
        "write_data_operation": "<awswrangler write mode, e.g. append>"
    }
}
Python Script
- Uses awswrangler to read the S3 files
- Unnests the JSON data to extract just the category info
- The script also creates a Glue Catalog table via awswrangler, acting like a Glue crawler but inside a Lambda function
- Create a new S3 bucket to store the clean data:
aws s3api create-bucket `
--bucket krozic-youtube-clean-useast1-dev `
--region us-east-1 `
--object-ownership BucketOwnerEnforced `
--output text >> setup.log
aws s3api put-bucket-encryption `
--bucket krozic-youtube-clean-useast1-dev `
--server-side-encryption-configuration '{\"Rules\": [{\"ApplyServerSideEncryptionByDefault\": {\"SSEAlgorithm\": \"AES256\"}}]}'
aws s3api put-public-access-block `
--bucket krozic-youtube-clean-useast1-dev `
--public-access-block-configuration "BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true"
- Create new DB for clean data:
aws glue create-database `
--database-input '{\"Name\": \"krozic_youtube_clean\", \"Description\": \"This database is created using AWS CLI\"}'
- Create script to clean data:
import awswrangler as wr
import pandas as pd
import urllib.parse
import os

# AWS settings read from environment variables set on the Lambda function
os_input_s3_clean_layer = os.environ['s3_clean_layer']
os_input_glue_catalog_db_name = os.environ['glue_catalog_db_name']
os_input_glue_catalog_table_name = os.environ['glue_catalog_table_name']
os_input_write_data_operation = os.environ['write_data_operation']

def lambda_handler(event, context):
    # Get the bucket and object key from the S3 event
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'], encoding='utf-8')

    try:
        # Creating DF from content
        df_raw = wr.s3.read_json('s3://{}/{}'.format(bucket, key))

        # Extract required columns
        df_step_1 = pd.json_normalize(df_raw['items'])

        # Change type to match raw_statistics
        df_step_1['id'] = df_step_1['id'].astype('int')

        # Write to S3 as Parquet and register the table in the Glue Catalog
        wr_response = wr.s3.to_parquet(
            df=df_step_1,
            path=os_input_s3_clean_layer,
            dataset=True,
            database=os_input_glue_catalog_db_name,
            table=os_input_glue_catalog_table_name,
            mode=os_input_write_data_operation
        )

        return wr_response
    except Exception as e:
        print(e)
        print('Error getting object {} from bucket {}. Make sure they exist and your bucket is in the same region as this function.'.format(key, bucket))
        raise e
Click 'Deploy' Manual
Configure test event: Manual
- Modify Template: 's3-put'
- Records: s3: bucket: name: 'krozic-youtube-raw-useast1-dev'
- Records: s3: bucket: arn: arn:aws:s3:::krozic-youtube-raw-useast1-dev
- Records: s3: object: key: youtube/raw_statistics_reference_data/US_category_id.json
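The handler can also be exercised locally before deploying; a rough sketch, assuming awswrangler and pandas are installed, AWS credentials are configured, the script above is saved as lambda_function.py, and the environment values below are placeholders (note this really writes to S3 and the Glue Catalog):
import os

# Hypothetical values for the environment variables the handler reads at import time
os.environ.setdefault("s3_clean_layer", "s3://krozic-youtube-clean-useast1-dev/youtube/")
os.environ.setdefault("glue_catalog_db_name", "krozic_youtube_clean")
os.environ.setdefault("glue_catalog_table_name", "clean_statistics_reference_data")
os.environ.setdefault("write_data_operation", "append")

from lambda_function import lambda_handler  # the cleansing script above

# Mirrors the 's3-put' test event configured in the console
test_event = {
    "Records": [
        {
            "s3": {
                "bucket": {
                    "name": "krozic-youtube-raw-useast1-dev",
                    "arn": "arn:aws:s3:::krozic-youtube-raw-useast1-dev",
                },
                "object": {
                    "key": "youtube/raw_statistics_reference_data/US_category_id.json"
                },
            }
        }
    ]
}

print(lambda_handler(test_event, None))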
After testing, an error appears showing there is no module 'awswrangler'...
Lambda Limitations
- Deployment package is limited to 50 MB (zipped)
- You can use /tmp or mount EFS volumes for more storage and to share it across executions
- Up to 10 GB of memory
- Up to 6 vCPUs
- 15 min timeout
Lambda Layers
- Convenient way to package libraries and other dependencies
- Reduces the size of uploaded deployment, so it's faster to deploy your code
- Promote code sharing and separation of responsibilities
- Simpler iteration, so you can focus on writing business logic
Add layer: Manual
- AWSDataWrangler-Python38
- https://docs.aws.amazon.com/lambda/latest/dg/configuration-layers.html
- https://awscli.amazonaws.com/v2/documentation/api/latest/reference/lambda/create-function.html
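The layer can also be attached programmatically; a boto3 sketch (the layer ARN is region- and version-specific, so the value below is a placeholder to be looked up in the links above):
import boto3

lambda_client = boto3.client("lambda", region_name="us-east-1")

# Placeholder ARN: substitute the published AWSDataWrangler-Python38 layer ARN for us-east-1
layer_arn = "arn:aws:lambda:us-east-1:<layer-account-id>:layer:AWSDataWrangler-Python38:<version>"

lambda_client.update_function_configuration(
    FunctionName="krozic-youtube-raw-useast1-lambda-json-parquet2",
    Layers=[layer_arn],
)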
New Error: GetTable AccessDenied
- Need to add 'AWSGlueServiceRole' to the lambda role
Crawling CSV Data
- Create Crawler:
aws glue create-crawler `
--name "krozic-youtube-raw-csv-crawler-1" `
--database-name "krozic_youtube_raw" `
--role "krozic-youtube-glue-s3-role" `
--targets '{\"S3Targets\": [{\"Path\": \"s3://krozic-youtube-raw-useast1-dev/youtube/raw_statistics\"}]}'
Run Crawler:
aws glue start-crawler --name "krozic-youtube-raw-csv-crawler-1"
- This crawler builds a 'partitioned' raw_statistics table schema based on the folder structure containing the many csv files
- All the .csv files can be queried as one big table
- This table can also be joined to the .json data that was wrangled and glued earlier
Joining Data in Athena
- raw_statistics 'category_id' is int, while clean_statistics_reference_data 'id' is string
- Either CAST() in Athena or change the type at the source (see the sketch after this list)
- Actually went back and changed the lambda script to modify the type
- As the data grows larger and more complex, these queries become slower and slower
- Time to convert the CSV data to Parquet in an ETL job to increase efficiency
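For reference, a sketch of running that join from Python with awswrangler, using the CAST() workaround mentioned above (table and column names as catalogued earlier; assumes awswrangler is installed and the Athena output location is set up):
import awswrangler as wr

df = wr.athena.read_sql_query(
    """
    SELECT ref.snippet_title, stats.title
    FROM raw_statistics stats
    INNER JOIN krozic_youtube_clean.clean_statistics_reference_data ref
        ON (CAST(stats.category_id AS VARCHAR) = ref.id)
    WHERE ref.id = '2'
    """,
    database="krozic_youtube_raw",
)
print(df.head())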
Glue ETL Manual
- Glue -> ETL -> Jobs (legacy) -> Add Job
- Name: krozic-youtube-clean-csv-to-parquet
- Role: krozic-youtube-glue-s3-role
- Job bookmark: Enable
- Job metrics: [x]
- Data source: raw_statistics
- Data target:
- Data store: S3
- Format: Parquet
- Target path: s3://krozic-youtube-clean-useast1-dev/youtube/raw_statistics/
- Change all 'long' to 'bigint' where necessary
Optimizing PySpark Script
- datasink4 writes a single file into the S3 destination bucket
- We want the data partitioned like the source
from awsglue.dynamicframe import DynamicFrame

# Convert dropnullfields3 into a DataFrame and coalesce to one output file per partition
datasink1 = dropnullfields3.toDF().coalesce(1)

# Build final output and create a dynamic frame from it
df_final_output = DynamicFrame.fromDF(datasink1, glueContext, "df_final_output")

# Write df_final_output, adding '"partitionKeys": ["region"]' to the data target options
datasink4 = glueContext.write_dynamic_frame.from_options(
    frame=df_final_output,
    connection_type="s3",
    connection_options={
        "path": "s3://krozic-youtube-clean-useast1-dev/youtube/raw_statistics/",
        "partitionKeys": ["region"],
    },
    format="parquet",
    transformation_ctx="datasink4",
)
- Save -> Run Job
- Error: An error occurred while calling o143.pyWriteDynamicFrame. Unable to parse file: MXvideos.csv
- The job cannot interpret files containing characters from other languages (encoding issue)
- Need to re-encode them in UTF-8 (to do)
- For now, just filter out the data with this problem
- predicate_pushdown = "region in ('ca', 'gb', 'us')"
- Add 'push_down_predicate = predicate_pushdown' to the data source call, create_dynamic_frame.from_catalog (see the sketch after this list)
- Successfully creates parquet files, partitioned by country
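A sketch of where the predicate goes in the generated job script (surrounding names such as datasource0 follow Glue's auto-generated code; illustrative only):
from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext()
glueContext = GlueContext(sc)

# Only the partitions that parse cleanly are read from S3
predicate_pushdown = "region in ('ca', 'gb', 'us')"

datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database="krozic_youtube_raw",
    table_name="raw_statistics",
    transformation_ctx="datasource0",
    push_down_predicate=predicate_pushdown,
)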
Glue Crawler for Clean Parquet Data
- Crawler name: krozic-youtube-clean-csv-to-parquet-etl
- Include path: s3://krozic-youtube-clean-useast1-dev/youtube/raw_statistics/
- Role: krozic-youtube-glue-s3-role
- Database: krozic_youtube_clean
- Create Crawler:
aws glue create-crawler `
--name "krozic-youtube-clean-csv-to-parquet-etl" `
--database-name "krozic_youtube_clean" `
--role "krozic-youtube-glue-s3-role" `
--targets '{\"S3Targets\": [{\"Path\": \"s3://krozic-youtube-clean-useast1-dev/youtube/raw_statistics\"}]}'
Run Crawler:
aws glue start-crawler --name "krozic-youtube-clean-csv-to-parquet-etl"
Lambda for ALL .json files
- Automatically run Lambda code when files are added
- Lambda -> Add Trigger -> S3
- Bucket: krozic-youtube-raw-useast1-dev
- Event type: All object create events
- Prefix: youtube/raw_statistics_reference_data/
- Suffix: .json
- Recursive Invocation: I Acknowledge
- Testing this method:
- Delete all .json in raw_statistics_reference_data
- Delete parquet files in clean folder
- Re-upload all .json files to check lambda is triggered
Creating ETL for the Reporting Layer
- As the data grows larger, it will be difficult to query every day for analysis
- Better to create another storage layer for data meant for analytics
- Glue -> Glue Studio -> Create
- Name: krozic-youtube-parquet-analytics
- Source: AWS Glue Data Catalog
- Database: krozic_youtube_clean
- Table: clean_statistics_reference_data
- Source: AWS Glue Data Catalog
- Database: krozic_youtube_clean
- Table: raw_statistics
- Transform: Join
- Join type: Inner join
- Node properties:
- Node parents: both
- Transform:
- Add condition: category_id = id
- Target: S3
- Format: Parquet
- Compression Type: Snappy
- S3 Target Location: s3://krozic-youtube-analytics-useast1-dev
- Database: krozic_youtube_analytics
- Table name: final_analytics
- Add a partition key: region
- Add a partition key: category_id
- Job Details:
- Role: krozic-youtube-glue-s3-role
- Had to create glue-assets bucket before saving (role issue? other glue ETL created folders...)
- Save -> Run
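For reference, a rough script-mode equivalent of the visual job above (a sketch assuming Glue's generated structure; the actual job was built in Glue Studio, and the catalog registration of final_analytics configured in the target node is not reproduced here):
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.transforms import Join

sc = SparkContext()
glueContext = GlueContext(sc)

# Two sources from the clean layer of the Glue Data Catalog
ref = glueContext.create_dynamic_frame.from_catalog(
    database="krozic_youtube_clean", table_name="clean_statistics_reference_data")
stats = glueContext.create_dynamic_frame.from_catalog(
    database="krozic_youtube_clean", table_name="raw_statistics")

# Inner join on category_id = id
joined = Join.apply(frame1=stats, frame2=ref, keys1=["category_id"], keys2=["id"])

# Write Parquet (Snappy is Glue's default Parquet compression) to the analytics bucket,
# partitioned by region and category_id
glueContext.write_dynamic_frame.from_options(
    frame=joined,
    connection_type="s3",
    connection_options={
        "path": "s3://krozic-youtube-analytics-useast1-dev",
        "partitionKeys": ["region", "category_id"],
    },
    format="parquet",
)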
To do before the above steps:
- Create new bucket for analytical data
aws s3api create-bucket `
--bucket krozic-youtube-analytics-useast1-dev `
--region us-east-1 `
--object-ownership BucketOwnerEnforced `
--output text >> setup.log
aws s3api put-bucket-encryption `
--bucket krozic-youtube-analytics-useast1-dev `
--server-side-encryption-configuration '{\"Rules\": [{\"ApplyServerSideEncryptionByDefault\": {\"SSEAlgorithm\": \"AES256\"}}]}'
aws s3api put-public-access-block `
--bucket krozic-youtube-analytics-useast1-dev `
--public-access-block-configuration "BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true"
- Create DB for analytics data:
aws glue create-database `
--database-input '{\"Name\": \"krozic_youtube_analytics\", \"Description\": \"This database is created using AWS CLI\"}'
Benefits of Analytic Dataset
- Using clean data:
SELECT ref.snippet_title, stats.title
FROM raw_statistics stats
INNER JOIN clean_statistics_reference_data ref on (stats.category_id = ref.id)
WHERE ref.id = 2
VS
- Using Analytic data:
SELECT snippet_title, title
FROM rpt_youtube_statistics_categories
WHERE id = 2
- No joins, just a single filter to get desired data
QuickSight Dashboard
- Create QuickSight User
- Give access to S3 buckets
- Datasets -> New Dataset (import data from Athena to QuickSight)
- Athena
- Data source name: youtube_analytics_dashboard
- Validate Connection
- Database: krozic_youtube_analytics
- Table: final_analytics
- Edit/preview -> Save & Publish
- Datasets -> final_analytics -> Create analysis
AWS Teardown
- Delete buckets (each bucket must be emptied before it can be deleted):
aws s3api delete-bucket `
--bucket krozic-youtube-raw-useast1-dev `
--region us-east-1
aws s3api delete-bucket `
--bucket krozic-youtube-raw-useast1-athena-job `
--region us-east-1
aws s3api delete-bucket `
--bucket krozic-youtube-clean-useast1-dev `
--region us-east-1
aws s3api delete-bucket `
--bucket krozic-youtube-analytics-useast1-dev `
--region us-east-1
- Delete Database:
aws glue delete-database --name "krozic_youtube_raw"
aws glue delete-database --name "krozic_youtube_clean"
aws glue delete-database --name "krozic_youtube_analytics"
- Delete crawlers (repeat for each crawler created above):
aws glue delete-crawler --name "krozic-youtube-raw-glue-catalog-1"
- Delete roles: need to detach-role-policy first... and, if applicable, remove-role-from-instance-profile and delete-role-policy
aws iam detach-role-policy `
--role-name krozic-youtube-glue-s3-role `
--policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
aws iam detach-role-policy `
--role-name krozic-youtube-glue-s3-role `
--policy-arn arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole
aws iam delete-role --role-name krozic-youtube-glue-s3-role
aws iam detach-role-policy `
--role-name krozic-youtube-lambda-s3-role `
--policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
aws iam detach-role-policy `
--role-name krozic-youtube-lambda-s3-role `
--policy-arn arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole
aws iam delete-role --role-name krozic-youtube-lambda-s3-role