
Summary

Built storage and pipeline infrastructure with the AWS CLI, then loaded structured and semi-structured YouTube data into an S3 data lake. Transformed .csv and .json data into Apache Parquet with Python scripts on AWS Lambda and Glue ETL. Built data catalogs with Glue Crawlers to form the schema of the data lake and built a Glue ETL job to produce an analytics table. Visualized the results in a dashboard using AWS QuickSight.

What does this project entail?

  • Building a data lake in S3
    • Consisting of disparate structured and semi-structured data
  • AWS Glue Data Catalog
  • ETL with AWS Glue Spark jobs
    • Amazon SageMaker Jupyter notebooks
  • Amazon SNS for alerting
  • SQL using Amazon Athena and Spark SQL
    • e.g. the impact of querying the optimized data layers
  • Writing scripts to ingest changes incrementally
  • Building a proper schema (and handling schema evolution)
  • BI dashboards in Amazon QuickSight

Decisions

  • Why Data Lake and not Data Warehouse?
    • Cost: a Redshift cluster is expensive to keep running
      • Glue and Athena are serverless, so you pay per query/operation
    • Scalable
  • Why Parquet?
    • Column-based storage rather than row-based like .csv
    • Analytical queries usually read only a subset of columns, which is much faster with a columnar format (see the sketch after this list)
    • Similar to the storage Redshift uses (a column-oriented DBMS)
  • Trying something new
    • I already have projects working with conventional, on-premises relational databases
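
A minimal sketch of the column-pruning benefit, using pandas (file names and column names are illustrative, taken from the Kaggle CSVs; the Parquet copy does not exist until later in the pipeline). Reading two columns from a Parquet file only touches those column chunks, while a CSV reader still has to parse every row in full:

import pandas as pd

# Row-based: the entire CSV is parsed even though only two columns are kept
csv_df = pd.read_csv('USvideos.csv', usecols=['video_id', 'views'])

# Column-based: only the 'video_id' and 'views' column chunks are read from disk
parquet_df = pd.read_parquet('USvideos.parquet', columns=['video_id', 'views'])

print(len(csv_df), len(parquet_df))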

The Dataset

  • Top trending videos
    • based on user interactions:
      • Number of views
      • shares
      • comments
      • likes
  • Sourced from Kaggle (Trending YouTube Video Statistics dataset)

Cloud Benefits

  • 'A 10-node cluster running for 10 hours costs the same as a 100-node cluster running for one hour'
  • On-premises means:
    • Procurement, approval, licensing
    • Shipping, installation, electricity, cooling systems, ...
    • Capacity that is either underutilized or overutilized
    • Paying even when nobody is using the hardware
    • Sustainability issues

AWS

  • Offers 200+ services covering many domains:
    • Compute
    • Storage and databases
    • AI and ML
    • Data lakes and Analytics
    • Internet of Things

Creating an Account

  • https://aws.amazon.com/premiumsupport/knowledge-center/create-and-activate-aws-account/
  • https://www.youtube.com/watch?v=kbtDJHgjrPc&t=294s

Best Practices

  • Always protect the root account
    • Don't use it for projects
  • Enable MFA wherever you can
    • e.g. Google Authenticator
  • Rotate all keys and passwords periodically
  • Avoid "credential" files on shared computers
    • e.g. https://github.com/99designs/aws-vault
  • Always follow the principle of least privilege

Data Lake Infrastructure

Infrastructure As Code

  • AWS console
  • Programmatically
    • AWS CLI
    • SDKs (Python, Node.js, etc.)
  • This project uses the AWS CLI to build and configure the data lake
    • Some steps require manual interaction with the console (labelled as Manual)
  • All AWS CLI commands follow the official AWS CLI command reference

S3 for Data Storage

Naming Conventions for S3

s3://company-raw-awsregion-awsaccountID-env/source/source_region/tablename/year=yyyy/month=mm/day=dd/table_<yearmonthday>.<file_format>

env = dev, test, prod
source = name or indicator of source
source_region = region of data source

Bucket name used in this project: krozic-youtube-raw-useast1-dev

Creating Our S3 Bucket

Bucket creation:

aws s3api create-bucket `
	--bucket krozic-youtube-raw-useast1-dev `
	--region us-east-1 `
	--object-ownership BucketOwnerEnforced `
	--output text >> setup.log

Encryption:

aws s3api put-bucket-encryption `
	--bucket krozic-youtube-raw-useast1-dev `
	--server-side-encryption-configuration '{\"Rules\": [{\"ApplyServerSideEncryptionByDefault\": {\"SSEAlgorithm\": \"AES256\"}}]}'

Restrict Public Access:

aws s3api put-public-access-block `
	--bucket krozic-youtube-raw-useast1-dev `
	--public-access-block-configuration "BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true"

Populating the S3 Bucket

  • json:
aws s3 cp . s3://krozic-youtube-raw-useast1-dev/youtube/raw_statistics_reference_data/ `
	--recursive `
	--exclude "*" `
	--include "*.json"
  • csv:
aws s3 cp CAvideos.csv s3://krozic-youtube-raw-useast1-dev/youtube/raw_statistics/region=ca/
aws s3 cp DEvideos.csv s3://krozic-youtube-raw-useast1-dev/youtube/raw_statistics/region=de/
aws s3 cp FRvideos.csv s3://krozic-youtube-raw-useast1-dev/youtube/raw_statistics/region=fr/
aws s3 cp GBvideos.csv s3://krozic-youtube-raw-useast1-dev/youtube/raw_statistics/region=gb/
aws s3 cp INvideos.csv s3://krozic-youtube-raw-useast1-dev/youtube/raw_statistics/region=in/
aws s3 cp JPvideos.csv s3://krozic-youtube-raw-useast1-dev/youtube/raw_statistics/region=jp/
aws s3 cp KRvideos.csv s3://krozic-youtube-raw-useast1-dev/youtube/raw_statistics/region=kr/
aws s3 cp MXvideos.csv s3://krozic-youtube-raw-useast1-dev/youtube/raw_statistics/region=mx/
aws s3 cp RUvideos.csv s3://krozic-youtube-raw-useast1-dev/youtube/raw_statistics/region=ru/
aws s3 cp USvideos.csv s3://krozic-youtube-raw-useast1-dev/youtube/raw_statistics/region=us/

IAM Role

  • Must first create a 'Role' so that AWS Glue can interact with the S3 bucket
    • Manual
  • 'use cases for other AWS Services': Glue
  • Search for S3: AmazonS3FullAccess
  • Role name: krozic-youtube-glue-s3-role
  • Add Permission -> Attach Policies
  • Search 'Glue': AWSGlueServiceRole
    • For EC2, S3, and Cloudwatch Logs
aws iam create-role `
	--role-name krozic-youtube-glue-s3-role `
	--assume-role-policy-document file://policies/trust-glue.json

aws iam attach-role-policy `
	--role-name krozic-youtube-glue-s3-role `
	--policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
	
aws iam attach-role-policy `
	--role-name krozic-youtube-glue-s3-role `
	--policy-arn arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole

AWS Glue Catalog

  • Data Catalog: Information about our data and what it contains

  • Ex. Labeling all columns and what they are

  • Builds a metadata catalog that allows ETL to be performed

  • Create Database:

aws glue create-database `
	--database-input '{\"Name\": \"krozic_youtube_raw\", \"Description\": \"This database is created using AWS CLI\"}'
  • Create Crawler:
aws glue create-crawler `
	--name "krozic-youtube-raw-glue-catalog-1" `
	--database-name "krozic_youtube_raw" `
	--role "krozic-youtube-glue-s3-role" `
	--targets '{\"S3Targets\": [{\"Path\": \"s3://krozic-youtube-raw-useast1-dev/youtube/raw_statistics_reference_data\"}]}'

Run Crawler:

aws glue start-crawler --name "krozic-youtube-raw-glue-catalog-1"

AWS Athena

  • Serverless service used to query the catalogued tables with standard SQL

  • Cannot be used without defining a query result 'output location' in S3

  • Create new S3 bucket:

aws s3api create-bucket `
	--bucket krozic-youtube-raw-useast1-athena-job `
	--region us-east-1 `
	--object-ownership BucketOwnerEnforced `
	--output text >> setup.log
	
aws s3api put-bucket-encryption `
	--bucket krozic-youtube-raw-useast1-athena-job `
	--server-side-encryption-configuration '{\"Rules\": [{\"ApplyServerSideEncryptionByDefault\": {\"SSEAlgorithm\": \"AES256\"}}]}'
	
aws s3api put-public-access-block `
	--bucket krozic-youtube-raw-useast1-athena-job `
	--public-access-block-configuration "BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true"
  • Attach the new bucket to Athena as the query result output location
    • Manual (a programmatic sketch follows this list)
  • Querying the raw JSON returns an error because of its formatting
    • The SerDe (Serializer/Deserializer) cannot parse the nested, multi-line JSON
  • Requirement: data cleansing
    • Create a light ETL: JSON to Apache Parquet
      • i.e. from a nested, row-oriented format to a flat, columnar one
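
A hedged sketch of the manual 'output location' step done programmatically with boto3, pointing the default Athena workgroup at the bucket created above:

import boto3

athena = boto3.client('athena', region_name='us-east-1')

# Send query results for the default workgroup to the athena-job bucket
athena.update_work_group(
    WorkGroup='primary',
    ConfigurationUpdates={
        'ResultConfigurationUpdates': {
            'OutputLocation': 's3://krozic-youtube-raw-useast1-athena-job/'
        }
    }
)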

![[Pasted image 20220613162115.png]]

AWS Lambda

  • JSON data is stored on S3
  • Inside Lambda you can write python or another language to transform the data
  • Then build the Glue Crawler on this clean data

First create a new role: Manual

  • Add lambda role
  • Use policy AmazonS3FullAccess
  • Name: krozic-youtube-lambda-s3-role
  • Programmatic equivalent:
aws iam create-role `
	--role-name krozic-youtube-lambda-s3-role `
	--assume-role-policy-document file://policies/trust-lambda.json

aws iam attach-role-policy `
	--role-name krozic-youtube-lambda-s3-role `
	--policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
	
aws iam attach-role-policy `
	--role-name krozic-youtube-lambda-s3-role `
	--policy-arn arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole
  • Create the Lambda function (note: creating it purely from the CLI was not completed in this project)
  • Must first build a .zip deployment package (e.g. lambda_function.zip) containing the Python script and its dependencies
aws lambda create-function `
	--function-name krozic-youtube-raw-useast1-lambda-json-parquet2 `
	--runtime python3.8 `
	--handler lambda_function.lambda_handler `
	--environment file://lambda_environment.json `
	--zip-file fileb://lambda_function.zip `
	--role arn:aws:iam::747879646971:role/krozic-youtube-lambda-s3-role

Environment Variables to add:

{
  "Variables": {"string": "string"
    ...}
}
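
For reference, a hedged boto3 example of setting the four variables that the Python script below reads (database and table names follow the rest of this project; the clean-layer prefix and 'append' write mode are assumptions):

import boto3

lambda_client = boto3.client('lambda')

# Variable names match the os.environ lookups in the Lambda script below
lambda_client.update_function_configuration(
    FunctionName='krozic-youtube-raw-useast1-lambda-json-parquet2',
    Environment={
        'Variables': {
            's3_clean_layer': 's3://krozic-youtube-clean-useast1-dev/youtube/clean_statistics_reference_data/',  # assumed prefix
            'glue_catalog_db_name': 'krozic_youtube_clean',
            'glue_catalog_table_name': 'clean_statistics_reference_data',
            'write_data_operation': 'append'  # awswrangler to_parquet mode; assumed
        }
    }
)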

Python Script

  • Uses awswrangler to read the S3 files

  • Unnests the JSON data to extract just the category info

  • The script also registers the output table in the Glue Data Catalog (via the database/table arguments of wr.s3.to_parquet), acting like a Glue crawler but inside a Lambda function

  • Create a new S3 bucket to store the clean data:

aws s3api create-bucket `
	--bucket krozic-youtube-clean-useast1-dev `
	--region us-east-1 `
	--object-ownership BucketOwnerEnforced `
	--output text >> setup.log
	
aws s3api put-bucket-encryption `
	--bucket krozic-youtube-clean-useast1-dev `
	--server-side-encryption-configuration '{\"Rules\": [{\"ApplyServerSideEncryptionByDefault\": {\"SSEAlgorithm\": \"AES256\"}}]}'
	
aws s3api put-public-access-block `
	--bucket krozic-youtube-clean-useast1-dev `
	--public-access-block-configuration "BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true"
  • Create new DB for clean data:
aws glue create-database `
	--database-input '{\"Name\": \"krozic_youtube_clean\", \"Description\": \"This database is created using AWS CLI\"}'
  • Create script to clean data:
import awswrangler as wr
import pandas as pd
import urllib.parse
import os

# Configuration is read from the Lambda environment variables set above
os_input_s3_clean_layer = os.environ['s3_clean_layer']
os_input_glue_catalog_db_name = os.environ['glue_catalog_db_name']
os_input_glue_catalog_table_name = os.environ['glue_catalog_table_name']
os_input_write_data_operation = os.environ['write_data_operation']


def lambda_handler(event, context):
    # Get the object from the event and show its content type
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'], encoding='utf-8')
    try:

        # Creating DF from content
        df_raw = wr.s3.read_json('s3://{}/{}'.format(bucket, key))

        # Extract required columns:
        df_step_1 = pd.json_normalize(df_raw['items'])
		
        # Change the type of 'id' to int to match category_id in raw_statistics
        df_step_1['id'] = df_step_1['id'].astype('int')

        # Write to S3
        wr_response = wr.s3.to_parquet(
            df=df_step_1,
            path=os_input_s3_clean_layer,
            dataset=True,
            database=os_input_glue_catalog_db_name,
            table=os_input_glue_catalog_table_name,
            mode=os_input_write_data_operation
        )

        return wr_response
    except Exception as e:
        print(e)
        print('Error getting object {} from bucket {}. Make sure they exist and your bucket is in the same region as this function.'.format(key, bucket))
        raise e

Click 'Deploy': Manual

Configure test event: Manual (an example event dict follows the list below)

  • Modify Template: 's3-put'
  • Records: s3: bucket: name: 'krozic-youtube-raw-useast1-dev'
  • Records: s3: bucket: arn: arn:aws:s3:::krozic-youtube-raw-useast1-dev
  • Records: s3: object: key: youtube/raw_statistics_reference_data/US_category_id.json
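
Filled in, the 's3-put' template boils down to the structure below; the same dict can be passed straight to lambda_handler when experimenting outside the console (assuming awswrangler and the environment variables are available):

# Matches the access pattern in lambda_handler above:
# event['Records'][0]['s3']['bucket']['name'] and ['object']['key']
test_event = {
    "Records": [
        {
            "s3": {
                "bucket": {
                    "name": "krozic-youtube-raw-useast1-dev",
                    "arn": "arn:aws:s3:::krozic-youtube-raw-useast1-dev"
                },
                "object": {
                    "key": "youtube/raw_statistics_reference_data/US_category_id.json"
                }
            }
        }
    ]
}

# lambda_handler(test_event, None)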

After testing, an error appears: no module named 'awswrangler' (addressed with a Lambda layer below)

Lambda Limitations

  • Deployment package is limited to 50 MB zipped (250 MB unzipped)
    • For more storage you can use /tmp or mount EFS volumes (EFS can be shared across executions)
  • Up to 10 GB of memory
  • Up to 6 vCPUs
  • 15-minute maximum timeout

Lambda Layers

  • Convenient way to package libraries and other dependencies
  • Reduces the size of the uploaded deployment package, so deploying code is faster
  • Promotes code sharing and separation of responsibilities
    • Simpler iterations, so more time goes into the business logic

Add layer: Manual (see the sketch below)

  • AWSDataWrangler-Python38
    • https://docs.aws.amazon.com/lambda/latest/dg/configuration-layers.html
    • https://awscli.amazonaws.com/v2/documentation/api/latest/reference/lambda/create-function.html
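
A hedged boto3 sketch of attaching the managed layer; the layer ARN is a placeholder, and the real region/account/version for AWSDataWrangler-Python38 comes from the documentation linked above:

import boto3

lambda_client = boto3.client('lambda')

# Placeholder ARN: substitute the managed AWSDataWrangler-Python38 layer ARN for your region
LAYER_ARN = 'arn:aws:lambda:us-east-1:<account>:layer:AWSDataWrangler-Python38:<version>'

lambda_client.update_function_configuration(
    FunctionName='krozic-youtube-raw-useast1-lambda-json-parquet2',
    Layers=[LAYER_ARN]
)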

New Error: GetTable AccessDenied

  • The Lambda role lacks Glue Data Catalog permissions; attach the 'AWSGlueServiceRole' policy to it

Crawling CSV Data

  • Create Crawler:
aws glue create-crawler `
	--name "krozic-youtube-raw-csv-crawler-1" `
	--database-name "krozic_youtube_raw" `
	--role "krozic-youtube-glue-s3-role" `
	--targets '{\"S3Targets\": [{\"Path\": \"s3://krozic-youtube-raw-useast1-dev/youtube/raw_statistics\"}]}'

Run Crawler:

aws glue start-crawler --name "krozic-youtube-raw-csv-crawler-1"
  • This crawler builds a partitioned raw_statistics table schema based on the region=xx folder structure containing the many csv files
  • All the .csv files can then be queried as one big table, with region as a partition column (see the query sketch below)
  • This table can also be joined to the .json reference data that was wrangled and catalogued earlier
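
A sketch of querying the partitioned table with awswrangler (already used in the Lambda script); filtering on the region partition column means Athena scans only that region's files. Column names are assumed from the Kaggle CSVs:

import awswrangler as wr

# Partition pruning: only the region=ca folder is scanned
df = wr.athena.read_sql_query(
    sql="SELECT video_id, title, views FROM raw_statistics WHERE region = 'ca' LIMIT 10",
    database="krozic_youtube_raw"
)
print(df.head())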

Joining Data in Athena

  • raw_statistics 'category_id' is int, while clean_statistics_reference_data 'id' is string
  • Either CAST() in Athena (see the sketch after this list) or change the type at the source
  • In the end the Lambda script was changed to cast 'id' to int (the astype('int') line above)
  • As the data grows, these queries over raw CSV get slower and slower
  • Time to convert the CSV layer to Parquet with an ETL job to increase efficiency
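
A sketch of the CAST() alternative mentioned above (useful if the Lambda script had not been changed), again through awswrangler:

import awswrangler as wr

# Cast the string 'id' to an integer so the join condition types match
sql = """
SELECT stats.title, ref.snippet_title
FROM krozic_youtube_raw.raw_statistics stats
    INNER JOIN krozic_youtube_clean.clean_statistics_reference_data ref
        ON stats.category_id = CAST(ref.id AS INTEGER)
WHERE stats.region = 'ca'
LIMIT 10
"""

df = wr.athena.read_sql_query(sql=sql, database="krozic_youtube_raw")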

Glue ETL: Manual

  • Glue -> ETL -> Jobs (legacy) -> Add Job (a programmatic sketch of an equivalent job definition follows this list)
    • Name: krozic-youtube-clean-csv-to-parquet
    • Role: krozic-youtube-glue-s3-role
    • Job bookmark: Enable
    • Job metrics: [x]
    • Data source: raw_statistics
    • Data target:
      • Data store: S3
      • Format: Parquet
      • Target path: s3://krozic-youtube-clean-useast1-dev/youtube/raw_statistics/
    • Change all 'long' to 'bigint' where necessary
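
For completeness, a hedged boto3 sketch of an equivalent job definition; the script location and Glue version are assumptions, since the console wizard generates and stores the PySpark script itself:

import boto3

glue = boto3.client('glue')

glue.create_job(
    Name='krozic-youtube-clean-csv-to-parquet',
    Role='krozic-youtube-glue-s3-role',
    Command={
        'Name': 'glueetl',  # Spark ETL job
        'ScriptLocation': 's3://<glue-assets-bucket>/scripts/krozic-youtube-clean-csv-to-parquet.py',  # assumed
        'PythonVersion': '3'
    },
    DefaultArguments={
        '--job-bookmark-option': 'job-bookmark-enable',  # Job bookmark: Enable
        '--enable-metrics': ''                           # Job metrics: [x]
    },
    GlueVersion='2.0'  # assumed
)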

Optimizing PySpark Script

  • datasink4 writes a single file into S3 bucket destination
    • We want the data partitioned like the source
from awsglue.dynamicframe import DynamicFrame

# Convert dropnullfields3 to a Spark DataFrame and coalesce so each partition gets one output file
datasink1 = dropnullfields3.toDF().coalesce(1)
# Convert back to a DynamicFrame for the Glue sink
df_final_output = DynamicFrame.fromDF(datasink1, glueContext, "df_final_output")

# Write df_final_output instead of dropnullfields3, adding '"partitionKeys": ["region"]' to the data target:
datasink4 = glueContext.write_dynamic_frame.from_options(frame = df_final_output, connection_type = "s3", connection_options = {"path": "s3://krozic-youtube-clean-useast1-dev/youtube/raw_statistics/", "partitionKeys": ["region"]}, format = "parquet", transformation_ctx = "datasink4")
  • Save -> Run Job
  • Error: An error occurred while calling o143.pyWriteDynamicFrame. Unable to parse file: MXvideos.csv
    • Some regional files (e.g. MXvideos.csv) are not UTF-8 encoded, so the reader cannot parse them
    • Need to re-encode them as UTF-8 (to do)
    • For now, just filter out the regions with this problem
      • predicate_pushdown = "region in ('ca', 'gb', 'us')"
      • Add 'push_down_predicate = predicate_pushdown' to the create_dynamic_frame.from_catalog call (datasource0); see the sketch after this list
  • Successfully creates parquet files, partitioned by country
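
The predicate belongs on the catalog read near the top of the generated script; a sketch mirroring the auto-generated datasource0 block:

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Read only the partitions that parse cleanly for now
predicate_pushdown = "region in ('ca', 'gb', 'us')"

datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database="krozic_youtube_raw",
    table_name="raw_statistics",
    transformation_ctx="datasource0",
    push_down_predicate=predicate_pushdown
)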

Glue Crawler for Clean Parquet Data

  • Crawler name: krozic-youtube-clean-csv-to-parquet-etl

  • Include path: s3://krozic-youtube-clean-useast1-dev/youtube/raw_statistics/

  • Role: krozic-youtube-glue-s3-role

  • Database: krozic_youtube_clean

  • Create Crawler:

aws glue create-crawler `
	--name "krozic-youtube-clean-csv-to-parquet-etl" `
	--database-name "krozic_youtube_clean" `
	--role "krozic-youtube-glue-s3-role" `
	--targets '{\"S3Targets\": [{\"Path\": \"s3://krozic-youtube-clean-useast1-dev/youtube/raw_statistics\"}]}'

Run Crawler:

aws glue start-crawler --name "krozic-youtube-clean-csv-to-parquet-etl"

Lambda for ALL .json files

  • Automatically run the Lambda function when new files are added (a programmatic sketch of this trigger follows the list below)

  • Lambda -> Add Trigger -> S3

    • Bucket: krozic-youtube-raw-useast1-dev
    • Event type: All object create events
    • Prefix: youtube/raw_statistics_reference_data/
    • Suffix: .json
    • Recursive Invocation: I Acknowledge
  • Testing this method:

    • Delete all .json in raw_statistics_reference_data
    • Delete parquet files in clean folder
    • Re-upload all .json files to check lambda is triggered
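
A hedged boto3 sketch of the same trigger (the statement id is arbitrary; the account id matches the role ARN used earlier):

import boto3

FUNCTION_ARN = 'arn:aws:lambda:us-east-1:747879646971:function:krozic-youtube-raw-useast1-lambda-json-parquet2'
BUCKET = 'krozic-youtube-raw-useast1-dev'

# Allow S3 to invoke the function...
boto3.client('lambda').add_permission(
    FunctionName=FUNCTION_ARN,
    StatementId='s3-invoke-json-to-parquet',
    Action='lambda:InvokeFunction',
    Principal='s3.amazonaws.com',
    SourceArn=f'arn:aws:s3:::{BUCKET}'
)

# ...then register the notification: object-create events under the
# raw_statistics_reference_data/ prefix with a .json suffix
boto3.client('s3').put_bucket_notification_configuration(
    Bucket=BUCKET,
    NotificationConfiguration={
        'LambdaFunctionConfigurations': [{
            'LambdaFunctionArn': FUNCTION_ARN,
            'Events': ['s3:ObjectCreated:*'],
            'Filter': {'Key': {'FilterRules': [
                {'Name': 'prefix', 'Value': 'youtube/raw_statistics_reference_data/'},
                {'Name': 'suffix', 'Value': '.json'}
            ]}}
        }]
    }
)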

Creating ETL for the Reporting Layer

  • As the data grows, joining the clean tables for every analytical query becomes slow and costly
  • Better to materialize a separate, pre-joined storage layer for data meant for analytics
  • Glue -> Glue Studio -> Create
    • Name: krozic-youtube-parquet-analytics
    • Source: AWS Glue Data Catalog
      • Database: krozic_youtube_clean
      • Table: clean_statistics_reference_data
    • Source: AWS Glue Data Catalog
      • Database: krozic_youtube_clean
      • Table: raw_statistics
    • Transform: Join
      • Join type: Inner join
      • Node properties:
        • Node parents: both
      • Transform:
        • Add condition: category_id = id
    • Target: S3
      • Format: Parquet
      • Compression Type: Snappy
      • S3 Target Location: s3://krozic-youtube-analytics-useast1-dev
      • Database: krozic_youtube_analytics
      • Table name: final_analytics
      • Add a partition key: region
      • Add a partition key: category_id
    • Job Details:
      • Role: krozic-youtube-glue-s3-role
  • Had to create the glue-assets bucket manually before saving (possibly a role issue; the other Glue ETL job created its folders automatically)
  • Save -> Run

Prerequisites for the steps above:

  • Create new bucket for analytical data
aws s3api create-bucket `
	--bucket krozic-youtube-analytics-useast1-dev `
	--region us-east-1 `
	--object-ownership BucketOwnerEnforced `
	--output text >> setup.log
	
aws s3api put-bucket-encryption `
	--bucket krozic-youtube-analytics-useast1-dev `
	--server-side-encryption-configuration '{\"Rules\": [{\"ApplyServerSideEncryptionByDefault\": {\"SSEAlgorithm\": \"AES256\"}}]}'
	
aws s3api put-public-access-block `
	--bucket krozic-youtube-analytics-useast1-dev `
	--public-access-block-configuration "BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true"
  • Create DB for analytics data:
aws glue create-database `
	--database-input '{\"Name\": \"krozic_youtube_analytics\", \"Description\": \"This database is created using AWS CLI\"}'

Benefits of Analytic Dataset

  • Using clean data:
SELECT ref.snippet_title, stats.title
FROM raw_statistics stats
	INNER JOIN clean_statistics_reference_data ref on (stats.category_id = ref.id)
WHERE ref.id = 2

VS

  • Using Analytic data:
SELECT snippet_title, title
FROM final_analytics
WHERE id = 2
  • No joins, just a single filter to get desired data

QuickSight Dashboard

  • Create QuickSight User
  • Give access to S3 buckets
  • Datasets -> New Dataset (import data from Athena to QuickSight)
    • Athena
      • Data source name: youtube_analytics_dashboard
      • Validate Connection
      • Database: krozic_youtube_analytics
      • Table: final_analytics
      • Edit/preview -> Save & Publish
  • Datasets -> final_analytics -> Create analysis

AWS Teardown

  • Delete Buckets (each must be emptied first, e.g. aws s3 rm s3://<bucket-name> --recursive):
aws s3api delete-bucket `
	--bucket krozic-youtube-raw-useast1-dev `
	--region us-east-1
	
aws s3api delete-bucket `
	--bucket krozic-youtube-raw-useast1-athena-job `
	--region us-east-1
	
aws s3api delete-bucket `
	--bucket krozic-youtube-clean-useast1-dev `
	--region us-east-1
	
aws s3api delete-bucket `
	--bucket krozic-youtube-analytics-useast1-dev `
	--region us-east-1
  • Delete Database:
aws glue delete-database --name "krozic_youtube_raw"
aws glue delete-database --name "krozic_youtube_clean"
aws glue delete-database --name "krozic_youtube_analytics"
  • Delete Crawlers:
aws glue delete-crawler --name "krozic-youtube-raw-glue-catalog-1"
aws glue delete-crawler --name "krozic-youtube-raw-csv-crawler-1"
aws glue delete-crawler --name "krozic-youtube-clean-csv-to-parquet-etl"
  • Delete Roles: must detach-role-policy first (and remove-role-from-instance-profile / delete-role-policy if instance profiles or inline policies exist)
aws iam detach-role-policy `
	--role-name krozic-youtube-glue-s3-role `
	--policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
aws iam detach-role-policy `
	--role-name krozic-youtube-glue-s3-role `
	--policy-arn arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole
aws iam delete-role --role-name krozic-youtube-glue-s3-role

aws iam detach-role-policy `
	--role-name krozic-youtube-lambda-s3-role `
	--policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
aws iam detach-role-policy `
	--role-name krozic-youtube-lambda-s3-role `
	--policy-arn arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole
aws iam delete-role --role-name krozic-youtube-lambda-s3-role