Integrate with cloud storage and data warehouses
Build an ETL pipeline that stores data in S3 and MotherDuck.
In this tutorial, you’ll create an ETL pipeline which processes and stores Major League Baseball (MLB) game data in S3 and MotherDuck.
Prerequisites
- A free Prefect Cloud account
- A free MotherDuck account
- An AWS account
Set up the environment
The code example for the tutorial can be found in the dev-day-zoom-out repository.
To kick things off, clone the dev-day-zoom-out repository by running the following command:
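The command below assumes the repository lives under the PrefectHQ organization on GitHub; adjust the URL if your copy lives elsewhere:

```shell
git clone https://github.com/PrefectHQ/dev-day-zoom-out.git
cd dev-day-zoom-out
```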
Follow along with the README.md file at the root of the repository to set up your environment.
After setting up your environment, navigate to the directory where all the code for this tutorial is located.
You’re more than welcome to work off the existing example, but if you’d like to start fresh, follow the tutorial end-to-end to build your very own pipeline!
Set up S3 and MotherDuck resources
Before building your pipeline, you’ll need to set up the necessary resources in AWS, MotherDuck, and Prefect Cloud. If you’re a seasoned AWS user and already have an IAM User and S3 bucket ready to go, you can skip to the Create a Prefect AWS Credentials block section.
Create AWS resources and Prefect blocks
The following steps will help you set up the AWS resources and Prefect blocks for your workflow. Sign in to the AWS Console to create an AWS IAM User and an AWS S3 bucket. Sign in to Prefect Cloud to create a Prefect AWS credentials block and an S3 bucket block.
Configure an AWS IAM User
Follow the AWS documentation to create an IAM User with AmazonS3FullAccess permissions.
Save the Access Key ID and Secret Access Key - you’ll need these to create a Prefect AWS credentials block.
Create an AWS S3 bucket
Follow the AWS documentation to create an AWS S3 bucket.
Save the bucket name - you’ll need this to create a Prefect S3 bucket block.
Create a Prefect AWS Credentials block
The Access Key ID and Secret Access Key from the IAM User can be stored in a Prefect AWS Credentials block. This Prefect block allows your flows and tasks to connect to your AWS resources. You can create this block by using the Python SDK, or through the Prefect Cloud UI.
Create a new Python file in your project directory and add the following code:
Make sure you have prefect-aws installed in your environment before running the script.
Create a Prefect S3 bucket block
S3 bucket blocks are used to store your S3 bucket configuration. Your flows and tasks use this block to connect to your S3 bucket. Similarly to the AWS credentials block, you can create this block by using the Python SDK or Prefect Cloud UI.
Create a new Python file in your project directory and add the following code:
Important: Make sure you have prefect-aws installed in your environment before running the script.
Create a MotherDuck access token and secret block
This section will guide you through the process of creating a MotherDuck access token and a Prefect secret block to store the token.
Create a MotherDuck Access Token
Follow the MotherDuck documentation to create an access token.
Save the token—you’ll need this to create a Prefect Secret block.
Create a Prefect Secret block
Prefect secret blocks are used to store sensitive information, such as your MotherDuck access token. This secret block will allow your flows and tasks to connect to your MotherDuck database. You can create a secret block by using the Python SDK or Prefect Cloud UI.
Create a new Python file in your project directory and add the following code:
Set up your project
Now that you have your Prefect blocks ready to go, it’s time to create a new project directory and build your pipeline.
Create a new project directory
Create a new directory on your computer to store your project files. Within the root of the directory, create three subfolders for storing the pre- and post-processed game data, and a Python file for your Prefect flow:
Run the following commands to create the project directory and files:
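For example (the `raw_data` folder and `mlb_flow.py` file are referenced later in this tutorial; the project name and the other two folder names are placeholders you can rename):

```shell
mkdir mlb-pipeline && cd mlb-pipeline
# raw_data holds the pre-processed game data; the other two folders
# are placeholder names for the downloaded and post-processed output.
mkdir raw_data downloaded_data processed_data
touch mlb_flow.py
```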
Build the MLB pipeline
Once the project directory is set up, you can get to the fun part: building the MLB ETL pipeline!
Fetch the game data
First, import the necessary packages, and then create three tasks that fetch game data by using the statsapi package. You’ll also create a flow that calls these tasks and defines the dependencies between them.
If you run this flow, you’ll see the raw data saved to the raw_data folder in your project directory.
Load data to S3
For the next two tasks, you’ll upload the raw data file to S3 and pull it back down for further processing.
You’ll need to add these new tasks to your flow function as well.
Copy and paste the following code into your mlb_flow.py file.
Update the mlb_flow function to include the new tasks.
Once you have your flow function updated, you can run the script to see the data file arrive in your S3 bucket.
Define data processing and MotherDuck tasks
Create a set of functions that clean up the raw data, run some basic statistics, and save the results to a MotherDuck database.
Copy and paste the following code into your mlb_flow.py file.
Update the mlb_flow function to include the new tasks.
Once you have your flow function updated, you can run the script to see the processed data arrive in your MotherDuck database.
Add a markdown artifact
As a last step, add some visibility into the results of your flow.
You can add a task that creates a markdown artifact to view the results in the Prefect Cloud UI.
Copy and paste the following code into your mlb_flow.py file.
Update the mlb_flow function to include the new task.
After adding the new code, run the updated flow to see the markdown artifact!
Next steps
Now that you have a complete ETL pipeline, you can outfit it with some failure handling and data quality checks to make it more resilient. You might even consider deploying it to serverless infrastructure, so that you don’t have to keep your laptop running 24/7! Jump to the next tutorial to learn how you can use Prefect to do just that: Deploy resilient pipelines to serverless infrastructure