Medallion ETL Pipeline – Bronze, Silver, Gold Data Architecture
Feb 10, 2026
Overview
I built this project as part of my IS640 (Programming for Business Analytics) course, focusing on how to structure a data pipeline from raw ingestion through to analytics-ready outputs.
The starting point was messy transaction data coming in from CSV and JSON files with inconsistent formats and values. Instead of trying to clean everything at once, I broke the pipeline into stages using a Bronze, Silver, and Gold structure.
In the Bronze layer, I ingested the raw data into a unified dataset while preserving source information and keeping the schema flexible.
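A minimal pandas sketch of that Bronze step. The post doesn't name its tooling, so the library, column names, and sample rows here are all illustrative assumptions; the idea is just to land each source untouched, tag provenance, and let the schema stay loose:

```python
import io

import pandas as pd

# Hypothetical raw inputs standing in for the real CSV and JSON files.
csv_raw = io.StringIO("txn_id,amount,date\n1,10.5,2026-01-02\n2,,2026-01-03\n")
json_raw = [{"txn_id": 3, "amount": "7.25", "date": "01/04/2026"}]

# Bronze: ingest each source as-is (everything kept as strings, no cleaning),
# tagging each row with where it came from.
bronze_csv = pd.read_csv(csv_raw, dtype=str).assign(source="csv")
bronze_json = pd.DataFrame(json_raw).astype(str).assign(source="json")

# Schema stays flexible: concat keeps every column any source provides,
# filling gaps with nulls rather than failing on mismatches.
bronze = pd.concat([bronze_csv, bronze_json], ignore_index=True)
```

Deferring all cleaning to the next layer means a bad source file never blocks ingestion, and the `source` column makes every row traceable back to its origin.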
In the Silver layer, I cleaned and standardized the data by handling null values, fixing data types, and normalizing inconsistent text fields so everything followed a consistent structure.
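The Silver transformations might look like the following sketch, again assuming pandas and made-up column names (`amount`, `status`) rather than the project's actual schema:

```python
import pandas as pd

# Hypothetical bronze rows showing the kinds of inconsistencies described:
# stringly-typed numbers, a missing value, and mixed-case categories.
bronze = pd.DataFrame({
    "txn_id": ["1", "2", "3"],
    "amount": ["10.5", None, "7.25"],
    "status": ["Complete", "COMPLETE ", "pending"],
})

silver = bronze.copy()
# Type conversion: amounts become floats (unparseable values become null),
# ids become integers.
silver["amount"] = pd.to_numeric(silver["amount"], errors="coerce")
silver["txn_id"] = silver["txn_id"].astype(int)
# Null handling: drop rows missing a required measure.
silver = silver.dropna(subset=["amount"])
# Categorical normalization: one canonical form per status value.
silver["status"] = silver["status"].str.strip().str.lower()
```

Each rule is explicit and runs in one place, so every downstream consumer sees the same consistent structure instead of re-cleaning the data ad hoc.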
In the Gold layer, I created aggregated datasets for reporting, including team performance, player metrics, and overall difficulty—outputs that are ready to plug into dashboards or analysis workflows.
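A Gold-layer aggregation could be sketched like this, using pandas named aggregation; the `team`/`player`/`score` columns are placeholders for whatever metrics the cleaned data actually carries:

```python
import pandas as pd

# Hypothetical silver-layer records; column names are illustrative only.
silver = pd.DataFrame({
    "team": ["A", "A", "B"],
    "player": ["p1", "p2", "p3"],
    "score": [10.0, 20.0, 5.0],
})

# Gold: a pre-aggregated, reporting-ready table. Dashboards read this
# directly instead of re-computing rollups from row-level data.
team_performance = silver.groupby("team").agg(
    total_score=("score", "sum"),
    avg_score=("score", "mean"),
    players=("player", "nunique"),
).reset_index()
```

Because the expensive grouping happens once here, the analytics layer only ever touches small, already-shaped tables.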
What I Focused On
- Structuring a pipeline using Bronze, Silver, and Gold layers
- Working with messy, mixed-format data (CSV and JSON)
- Cleaning and standardizing data (null handling, type conversion, categorical normalization)
- Generating Parquet outputs for efficient storage and downstream use
- Producing datasets designed for reporting and analytics
Why This Matters
This project helped me think more in terms of systems—how data moves from raw inputs to something reliable and usable.
That same pattern shows up in AI and cloud workflows, where clean, structured data is what makes downstream modeling, automation, and decision-making possible.