In this tutorial, you’ll create an ETL pipeline that processes and stores Major League Baseball (MLB) game data in S3 and MotherDuck.

Prerequisites

Set up the environment

The code example for the tutorial can be found in the dev-day-zoom-out repository.

To kick things off, clone the dev-day-zoom-out repository by running the following command:

git clone https://github.com/PrefectHQ/dev-day-zoom-out.git

Follow along with the README.md file at the root of the repository to set up your environment. After setting up your environment, navigate to the directory where all the code for this tutorial is located.

cd dev-day-zoom-out/track_1_build_workflows/session_2_resilent_workflows/1_starting_flow

You’re more than welcome to work off the existing example, but if you’d like to start fresh, follow the tutorial end-to-end to build your very own pipeline!

Set up S3 and MotherDuck resources

Before building your pipeline, you’ll need to set up the necessary resources in AWS, MotherDuck, and Prefect Cloud. If you’re a seasoned AWS user and already have an IAM User and S3 bucket ready to go, you can skip to the Create a Prefect AWS Credentials block section.

Create AWS resources and Prefect blocks

The following steps will help you set up the AWS resources and Prefect blocks for your workflow. Sign in to the AWS Console to create an AWS IAM User and an AWS S3 bucket. Sign in to Prefect Cloud to create a Prefect AWS credentials block and an S3 bucket block.

Configure an AWS IAM User

Follow the AWS documentation to create an IAM User with AmazonS3FullAccess permissions.

Save the Access Key ID and Secret Access Key - you’ll need these to create a Prefect AWS credentials block.

Create an AWS S3 bucket

Follow the AWS documentation to create an AWS S3 bucket.

Save the bucket name - you’ll need this to create a Prefect S3 bucket block.

Create a Prefect AWS Credentials block

The Access Key ID and Secret Access Key from the IAM User can be stored in a Prefect AWS Credentials block. This Prefect block allows your flows and tasks to connect to your AWS resources. You can create this block using the Python SDK or through the Prefect Cloud UI.

Create a new Python file in your project directory and add the following code:

from prefect_aws import AwsCredentials


AwsCredentials(
    aws_access_key_id="PLACEHOLDER",  # Replace this with your access key id.
    aws_secret_access_key="PLACEHOLDER",  # Replace this with your secret access key.
    region_name="us-east-2"  # Replace this with your region.
).save("BLOCK-NAME-PLACEHOLDER")  # Replace this with a descriptive block name.

Important: Make sure you have prefect-aws installed in your environment before running the script.

Create a Prefect S3 bucket block

S3 bucket blocks store your S3 bucket configuration. Your flows and tasks use this block to connect to your S3 bucket. As with the AWS credentials block, you can create this block using the Python SDK or the Prefect Cloud UI.

Create a new Python file in your project directory and add the following code:

from prefect_aws import S3Bucket, AwsCredentials

aws_credentials = AwsCredentials.load("NAME-OF-YOUR-AWS-CREDENTIALS-BLOCK")  # Replace this with your AWS credentials block name.

S3Bucket(
    bucket_name="YOUR-S3-BUCKET-NAME",  # Replace this with your S3 bucket name.
    credentials=aws_credentials
).save("BLOCK-NAME-PLACEHOLDER")  # Replace this with a descriptive block name.

Important: Make sure you have prefect-aws installed in your environment before running the script.

Create a MotherDuck access token and secret block

This section guides you through creating a MotherDuck access token and a Prefect secret block to store it.

Create a MotherDuck Access Token

Follow the MotherDuck documentation to create an access token.

Save the token - you’ll need this to create a Prefect Secret block.

Create a Prefect Secret block

Prefect secret blocks store sensitive information, such as your MotherDuck access token. This secret block allows your flows and tasks to connect to your MotherDuck database. You can create a secret block using the Python SDK or the Prefect Cloud UI.

Create a new Python file in your project directory and add the following code:

from prefect.blocks.system import Secret

Secret(
    value="YOUR-MOTHERDUCK-TOKEN"  # Replace this with your MotherDuck access token.
).save("BLOCK-NAME-PLACEHOLDER")  # Replace this with a descriptive block name, such as motherduck-access-token.

Set up your project

Now that you have your Prefect blocks ready to go, it’s time to create a new project directory and build your pipeline.

Create a new project directory

Create a new directory on your computer to store your project files. Within the root of the directory, create three subfolders for storing the pre- and post-processed game data, and a Python file for your Prefect flow:

.
└── mlb-data-project/
    ├── mlb_flow.py
    ├── boxscore_parquet/
    ├── raw_data/
    └── boxscore_analysis/

Run the following commands to create the project directory and files:

mkdir mlb-data-project
cd mlb-data-project
mkdir boxscore_parquet
mkdir raw_data
mkdir boxscore_analysis
touch mlb_flow.py

Build the MLB pipeline

Once the project directory is set up, you can get to the fun part: building the MLB ETL pipeline!

Fetch the game data

First, import the necessary packages, then create three tasks that fetch game data using the statsapi package. You’ll also create a flow that calls these tasks and defines the dependencies between them.
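
The complete version lives in the example repository; a minimal sketch of the three tasks and the flow might look like this (the default team ID 143, which is the Phillies, and the date range are illustrative placeholders):

import json

import statsapi
from prefect import flow, task


@task
def get_recent_games(team_id: int, start_date: str, end_date: str) -> list:
    """Fetch the IDs of all games a team played in the given date range."""
    schedule = statsapi.schedule(team=team_id, start_date=start_date, end_date=end_date)
    return [game["game_id"] for game in schedule]


@task
def fetch_single_game_boxscore(game_id: int) -> dict:
    """Fetch the boxscore data for a single game."""
    return statsapi.boxscore_data(game_id)


@task
def save_raw_data_to_file(game_data: list, file_path: str) -> str:
    """Save the raw boxscore data to a local JSON file."""
    with open(file_path, "w") as f:
        json.dump(game_data, f)
    return file_path


@flow(log_prints=True)
def mlb_flow(team_id: int = 143, start_date: str = "06/01/2024", end_date: str = "06/30/2024"):
    game_ids = get_recent_games(team_id, start_date, end_date)
    game_data = [fetch_single_game_boxscore(game_id) for game_id in game_ids]
    raw_file_path = save_raw_data_to_file(game_data, "./raw_data/raw_data.json")


if __name__ == "__main__":
    mlb_flow()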

If you run this flow, you’ll see the raw data saved to the raw_data folder in your project directory.

Load data to S3

For the next two tasks, you’ll upload the raw data file to S3 and pull it back down for further processing. You’ll need to add these new tasks to your flow function as well. Copy and paste the following code into your mlb_flow.py file.
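
A minimal sketch of the two tasks, assuming the S3 bucket block you created earlier (the block name below is a placeholder):

from prefect_aws import S3Bucket


@task
def upload_raw_data_to_s3(file_path: str) -> str:
    """Upload the raw data file to your S3 bucket and return its key."""
    s3_bucket = S3Bucket.load("YOUR-S3-BUCKET-BLOCK-NAME")  # Replace with your S3 bucket block name.
    return s3_bucket.upload_from_path(file_path)


@task
def download_raw_data_from_s3(s3_path: str) -> str:
    """Pull the raw data file back down from S3 for further processing."""
    local_path = f"./boxscore_analysis/{s3_path}"
    s3_bucket = S3Bucket.load("YOUR-S3-BUCKET-BLOCK-NAME")  # Replace with your S3 bucket block name.
    s3_bucket.download_object_to_path(s3_path, local_path)
    return local_path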

Update the mlb_flow function to include the new tasks.
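
The updated flow might look like this:

@flow(log_prints=True)
def mlb_flow(team_id: int = 143, start_date: str = "06/01/2024", end_date: str = "06/30/2024"):
    game_ids = get_recent_games(team_id, start_date, end_date)
    game_data = [fetch_single_game_boxscore(game_id) for game_id in game_ids]
    raw_file_path = save_raw_data_to_file(game_data, "./raw_data/raw_data.json")

    # New tasks: round-trip the raw file through S3.
    s3_path = upload_raw_data_to_s3(raw_file_path)
    local_file_path = download_raw_data_from_s3(s3_path)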

Once you have your flow function updated, you can run the script to see the data file arrive in your S3 bucket.

Define data processing and MotherDuck tasks

Create a set of tasks that clean up the raw data, run some basic statistics, and save the results to a MotherDuck database. Copy and paste the following code into your mlb_flow.py file.
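
A sketch of what these tasks could look like; the cleaning step here is a simple flatten (the statistics you compute will depend on which boxscore fields you care about), and the secret block, database, and table names are placeholders. Writing parquet with pandas requires pyarrow.

import json

import duckdb
import pandas as pd
from prefect.blocks.system import Secret


@task
def process_raw_data(file_path: str) -> pd.DataFrame:
    """Clean up the raw boxscore JSON and flatten it into a DataFrame."""
    with open(file_path, "r") as f:
        games = json.load(f)
    # json_normalize flattens the nested records into columns; add your
    # own statistics on top of this DataFrame as needed.
    return pd.json_normalize(games)


@task
def save_processed_data(df: pd.DataFrame, parquet_path: str) -> str:
    """Write the processed data to a local parquet file (requires pyarrow)."""
    df.to_parquet(parquet_path)
    return parquet_path


@task
def load_to_motherduck(parquet_path: str) -> None:
    """Load the parquet file into a MotherDuck table."""
    token = Secret.load("YOUR-SECRET-BLOCK-NAME").get()  # Replace with your secret block name.
    conn = duckdb.connect(f"md:?motherduck_token={token}")
    conn.execute("CREATE DATABASE IF NOT EXISTS boxscore_analysis")
    conn.execute(
        "CREATE OR REPLACE TABLE boxscore_analysis.main.game_stats AS "
        f"SELECT * FROM read_parquet('{parquet_path}')"
    )
    conn.close()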

Update the mlb_flow function to include the new tasks.
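
The updated flow might look like this:

@flow(log_prints=True)
def mlb_flow(team_id: int = 143, start_date: str = "06/01/2024", end_date: str = "06/30/2024"):
    game_ids = get_recent_games(team_id, start_date, end_date)
    game_data = [fetch_single_game_boxscore(game_id) for game_id in game_ids]
    raw_file_path = save_raw_data_to_file(game_data, "./raw_data/raw_data.json")
    s3_path = upload_raw_data_to_s3(raw_file_path)
    local_file_path = download_raw_data_from_s3(s3_path)

    # New tasks: process the data and load it into MotherDuck.
    df = process_raw_data(local_file_path)
    parquet_path = save_processed_data(df, "./boxscore_parquet/game_stats.parquet")
    load_to_motherduck(parquet_path)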

Once you have your flow function updated, you can run the script to see the processed data arrive in your MotherDuck database.

Add a markdown artifact

As a last step, add some visibility into the results of your flow. You can add a task that creates a markdown artifact to view the results in the Prefect Cloud UI. Copy and paste the following code into your mlb_flow.py file.
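
A sketch of such a task; the artifact key and summary format are illustrative, and df.head().to_markdown() requires the tabulate package:

import pandas as pd
from prefect.artifacts import create_markdown_artifact


@task
def create_results_artifact(df: pd.DataFrame) -> None:
    """Publish a markdown summary of the processed data to the Prefect UI."""
    markdown = (
        f"# MLB Boxscore Summary\n\n"
        f"Processed {len(df)} games.\n\n"
        f"{df.head().to_markdown(index=False)}\n"  # Requires the tabulate package.
    )
    create_markdown_artifact(
        key="mlb-results",
        markdown=markdown,
        description="Summary of the processed MLB game data.",
    )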

Update the mlb_flow function to include the new task.
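
The only change is one new call at the end of the flow:

@flow(log_prints=True)
def mlb_flow(team_id: int = 143, start_date: str = "06/01/2024", end_date: str = "06/30/2024"):
    game_ids = get_recent_games(team_id, start_date, end_date)
    game_data = [fetch_single_game_boxscore(game_id) for game_id in game_ids]
    raw_file_path = save_raw_data_to_file(game_data, "./raw_data/raw_data.json")
    s3_path = upload_raw_data_to_s3(raw_file_path)
    local_file_path = download_raw_data_from_s3(s3_path)
    df = process_raw_data(local_file_path)
    parquet_path = save_processed_data(df, "./boxscore_parquet/game_stats.parquet")
    load_to_motherduck(parquet_path)

    # New task: publish a markdown artifact summarizing the results.
    create_results_artifact(df)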

After adding the new code, run the updated flow to see the markdown artifact!

Next steps

Now that you have a complete ETL pipeline, you can outfit it with some failure handling and data quality checks to make it more resilient. You might even consider deploying it to serverless infrastructure, so that you don’t have to keep your laptop running 24/7! Jump to the next tutorial to learn how you can use Prefect to do just that: Deploy resilient pipelines to serverless infrastructure