I came across #CloudGuruChallenge on LinkedIn and learned that A Cloud Guru has recently started posting monthly coding challenges that give us a chance to test our cloud concepts/skill set and improve them. This coding challenge is for the month of October and lets us use AWS + Python + IaC. I know I am quite late to the party, but I couldn’t resist trying this challenge. At each step of the challenge, you can take different paths and make decisions based on the factors that matter to you. What excited me about October’s challenge is that it gave me a chance to practice concepts like Lambda and the Serverless Framework.
The challenge is to create an ETL pipeline that ingests Covid-19 data from the Johns Hopkins and New York Times datasets daily and feeds a reporting dashboard.
You can find the complete details of the challenge here: https://acloudguru.com/blog/engineering/cloudguruchallenge-python-aws-etl
I will now go through each step of the challenge and explain what I did. Below is a diagram of the flow I used:
IaC (Terraform, Serverless Framework):
I love infrastructure as code and I am pretty comfortable with Terraform, so I decided to use it to provision my PostgreSQL database. And since I wanted to learn/practice the Serverless Framework, I used it to create the Lambda functions and layers, the SNS notifications, and the CloudWatch event rule that triggers the Lambda function. I found the Serverless Framework to be a powerful tool that makes creating Lambda functions and their components extremely easy. Also, to integrate the Serverless Framework with Terraform, I stored the Terraform DB variables/outputs in SSM Parameter Store and easily referenced those SSM parameters from the Serverless Framework config.
We first need a DB to load the transformed data into in later stages. For the database, I wanted something simple. After doing some research, it looked like DynamoDB had the following drawbacks:
- The code to load data into DynamoDB is more complex than the equivalent for PostgreSQL.
- AWS QuickSight doesn’t support DynamoDB as a data source. That means we would have to export the data to an S3 bucket and load that into QuickSight, or use a SaaS tool like Redash to create the dashboard.
To avoid these additional complexities, I decided to go with a PostgreSQL DB.
Extract, transform, load (Lambda, Python):
The ETL process is a Lambda function written in Python. We are provided with two CSV URLs that we need to extract, transform and load into the DB daily.
- Extract: I extracted the two Covid-19 datasets using the pandas Python library and stored them as dataframes.
- Transform: For the NYT dataset, I converted the date column to a date object and the cases and deaths columns to integers. For the Johns Hopkins dataset, I filtered the data to US rows only, converted the date column to a date object and the recovered column to an integer, and dropped the remaining columns. I then merged the two datasets on date, with the NYT dataset as the authoritative one.
- Load: The Lambda function then loads this data into the PostgreSQL DB. It uses boto3 to retrieve the DB parameters from SSM and the psycopg2 Python library to connect to and insert into PostgreSQL. The function creates the DB table if it doesn’t exist and checks the number of rows in it. Based on the result, the Lambda performs one of two operations:
* Initial load: If the row count is 0, we save the merged dataframe to disk as a CSV file, load the CSV file, and use copy_from() to copy the data into the DB. I chose this method of saving the dataframe to a CSV file based on this article: Reference
* Incremental load: If the row count is >0, we do a “diff” between the last date entry in the DB table and the last date entry in the merged dataframe. If the delta (diff.days) is 0, the PostgreSQL DB table is up to date. If the delta is >0, there are rows to insert: we create a temporary table, load the merged dataframe into it, compare it against the main table, and insert only the rows whose date entries are missing from the PostgreSQL DB table.
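The transform and merge steps above can be sketched with pandas as below. This is a minimal illustration, not the actual Lambda code: the column names (date/cases/deaths for NYT, Date/Country/Region/Recovered for Johns Hopkins) are assumptions based on the public datasets, and tiny inline CSV strings stand in for the real downloads (which would be pd.read_csv(url) in the Lambda).

```python
import io

import pandas as pd


def transform_nyt(df: pd.DataFrame) -> pd.DataFrame:
    """Convert the date column to datetime and cases/deaths to integers."""
    out = df.copy()
    out["date"] = pd.to_datetime(out["date"])
    out[["cases", "deaths"]] = out[["cases", "deaths"]].astype(int)
    return out


def transform_jhu(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only US rows, normalize date/recovered, drop the other columns."""
    out = df[df["Country/Region"] == "US"].copy()
    out["date"] = pd.to_datetime(out["Date"])
    out["recovered"] = out["Recovered"].astype(int)
    return out[["date", "recovered"]]


def merge_datasets(nyt: pd.DataFrame, jhu: pd.DataFrame) -> pd.DataFrame:
    """Left-merge on date so the NYT dataset stays authoritative."""
    return nyt.merge(jhu, on="date", how="left")


# Inline samples standing in for the two real CSV downloads.
nyt_csv = "date,cases,deaths\n2020-10-01,100,5\n2020-10-02,150,7\n"
jhu_csv = ("Date,Country/Region,Recovered\n"
           "2020-10-01,US,20\n2020-10-01,IN,5\n2020-10-02,US,30\n")

merged = merge_datasets(
    transform_nyt(pd.read_csv(io.StringIO(nyt_csv))),
    transform_jhu(pd.read_csv(io.StringIO(jhu_csv))),
)
print(merged)
```

A left merge (rather than inner) keeps every NYT row even on dates where the Johns Hopkins data has no matching entry.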
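The temp-table comparison in the incremental load can be sketched as below. For a self-contained illustration this uses the stdlib sqlite3 module in place of psycopg2/PostgreSQL, and the table name covid_stats is an assumption; the SQL shape (insert from staging where the date is missing from the main table) is the same idea.

```python
import sqlite3

# In-memory database standing in for the PostgreSQL instance.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE covid_stats "
            "(date TEXT PRIMARY KEY, cases INT, deaths INT, recovered INT)")
# One row already loaded on a previous run.
cur.execute("INSERT INTO covid_stats VALUES ('2020-10-01', 100, 5, 20)")

# The freshly merged dataframe is loaded into a temporary staging table...
cur.execute("CREATE TEMP TABLE staging "
            "(date TEXT, cases INT, deaths INT, recovered INT)")
cur.executemany(
    "INSERT INTO staging VALUES (?, ?, ?, ?)",
    [("2020-10-01", 100, 5, 20), ("2020-10-02", 150, 7, 30)],
)

# ...and only rows whose dates are missing from the main table are inserted.
cur.execute("""
    INSERT INTO covid_stats
    SELECT s.date, s.cases, s.deaths, s.recovered
    FROM staging s
    WHERE s.date NOT IN (SELECT date FROM covid_stats)
""")
conn.commit()

rows = cur.execute("SELECT date FROM covid_stats ORDER BY date").fetchall()
print(rows)
```

Re-running the same staging insert is a no-op, so the load stays idempotent even if the Lambda fires twice for the same day.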
The Lambda function uses boto3 to publish to an SNS topic, sending the topic’s email subscribers alerts about any errors/exceptions in the Lambda function as well as notifications about initial/incremental loads.
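The notification piece can be sketched as a small helper like the one below. The function name, subject text, and topic ARN are illustrative assumptions; in the Lambda the client would be boto3.client("sns"), but a stub client is injected here so the sketch runs without AWS credentials.

```python
def notify(sns_client, topic_arn: str, subject: str, message: str) -> None:
    """Publish a status/error message to the SNS topic's subscribers.

    sns_client is expected to expose publish(TopicArn=..., Subject=...,
    Message=...), matching the boto3 SNS client's interface.
    """
    sns_client.publish(TopicArn=topic_arn, Subject=subject, Message=message)


class StubSNS:
    """Stands in for boto3.client('sns') in this self-contained example."""

    def __init__(self):
        self.published = []

    def publish(self, **kwargs):
        self.published.append(kwargs)


sns = StubSNS()
notify(sns, "arn:aws:sns:us-east-1:123456789012:etl-alerts",  # fake ARN
       "ETL: incremental load", "Inserted 1 new row into the DB table.")
print(sns.published[0]["Subject"])
```

Wrapping the ETL steps in try/except and calling a helper like this from the except block is what turns exceptions into subscriber emails.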
CI/CD (GitHub Actions):
I have used Jenkins and CircleCI for CI/CD in the past, and the chance to learn a new CI/CD service excited me, so I decided to try GitHub Actions and created GitHub workflows. Every time code is pushed to GitHub, the workflow performs the end-to-end operations, whether that’s creating/updating/destroying Terraform resources or Serverless Framework resources.
The workflow first checks for file changes and decides which jobs to run based on them. I set up the workflow jobs/steps to run in sequence under certain conditions. In the serverless job, the workflow reads the requirements.txt file listing the required Python modules, installs them into a folder in the runner environment, and zips and uploads them as a Lambda layer when the serverless config is deployed.
The workflow also updates GitHub pull requests with the Terraform plan. This is a feature usually associated with Terraform Enterprise, and I was pretty stoked to find it is possible with GitHub Actions.
Reporting dashboard (AWS QuickSight):
And finally to our dashboard! We have transformed data sitting in our database, but how do we view it? This is where we need a BI tool like QuickSight to do the job. I imported the DB dataset and created the dashboard manually. Though limited in function, it was super easy to set up for a home project.
I am really thankful to Forrest Brazeal for creating this challenge. I spent a lot of time on it, and at times it was frustrating, but it was also extremely rewarding. It gave me an opportunity to explore and understand the issues and best practices around Lambda + layers, the Serverless Framework, and GitHub Actions, and also to test my current AWS and Terraform skill set. Looking forward to more such challenges in the future.
Feel free to connect with me on LinkedIn. :)