My First Data Engineering Pipeline with Apache Airflow
I’ve spent most of my academic and professional journey working on data science and analytics. From exploring socio-economic datasets in R to deploying dashboards in ArcGIS and Looker Studio, I’ve always loved the process of turning raw data into actionable insights.
But over time, a recurring question kept nudging at me:
*How can I make these workflows more automated, scalable, and production-ready?*
That curiosity led me into the world of data engineering—and more specifically, into Apache Airflow.
🧠 Bridging Software Engineering with Data
While I’ve spent years working with data, my Python training has mostly been rooted in software engineering. I’ve learned how to write clean, modular code, follow best practices, use version control, and design systems that scale.
When I found data engineering, it felt like the perfect blend:
It lets me use all the software engineering skills I’ve built while working with data pipelines, automation, APIs, and cloud infrastructure. It was exactly the bridge between my two interests.
🛠️ Getting Started: Docker, VS Code & a Solid Course
To get hands-on experience, I set up my environment using:
- Docker: to create isolated containers for Airflow, Postgres, and supporting services. This setup made everything reproducible and easy to reset.
- Visual Studio Code: my primary IDE for writing DAGs and managing the project structure.
- A Udemy course to guide me through the process:
🎓 The Complete Hands-On Course to Master Apache Airflow by Marc Lamberti.
The course is incredibly thorough and practical—it helped me understand how real-world data pipelines are structured, and gave me confidence to build something of my own.
💡 What I Built: A Simple ETL DAG
To apply what I was learning, I created a DAG that:
- Creates a `users` table in a PostgreSQL database.
- Uses a sensor to check whether a public API is available.
- Extracts fake user data from that API.
- Saves the data as a CSV file in a temporary directory.
- Loads the data into the database.
It’s not flashy, but it’s my first real data pipeline.
🔍 Code Walkthrough
Below is the full DAG that I created using Airflow’s SDK:
```python
from airflow.sdk import dag, task
from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator
from airflow.sdk.bases.sensor import PokeReturnValue
from airflow.providers.postgres.hooks.postgres import PostgresHook


@dag
def user_processing():
    # Create the target table if it doesn't exist yet.
    create_table = SQLExecuteQueryOperator(
        task_id="create_table",
        conn_id="postgres",
        sql="""
            CREATE TABLE IF NOT EXISTS users (
                id INT PRIMARY KEY,
                firstName VARCHAR(255),
                lastName VARCHAR(255),
                email VARCHAR(255),
                created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
            )
        """,
    )

    # Poke the public API every 5 seconds, giving up after 5 minutes.
    @task.sensor(poke_interval=5, timeout=300)
    def is_api_available() -> PokeReturnValue:
        import requests

        response = requests.get(
            "https://raw.githubusercontent.com/marclamberti/datasets/refs/heads/main/fakeuser.json"
        )
        if response.status_code == 200:
            # Pass the JSON payload downstream via XCom.
            return PokeReturnValue(is_done=True, xcom_value=response.json())
        return PokeReturnValue(is_done=False)

    # Keep only the fields we actually want to store.
    @task
    def extract_user(fake_user):
        return {
            "id": fake_user["id"],
            "firstname": fake_user["personalInfo"]["firstName"],
            "lastname": fake_user["personalInfo"]["lastName"],
            "email": fake_user["personalInfo"]["email"],
        }

    # Stamp the record and write it to a temporary CSV file.
    @task
    def process_user(user_info):
        import csv
        from datetime import datetime

        user_info["created_at"] = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        with open("/tmp/user_info.csv", "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=user_info.keys())
            writer.writeheader()
            writer.writerow(user_info)

    # Bulk-load the CSV into Postgres using COPY.
    @task
    def store_user():
        hook = PostgresHook(postgres_conn_id="postgres")
        hook.copy_expert(
            sql="COPY users FROM STDIN WITH CSV HEADER",
            filename="/tmp/user_info.csv",
        )

    process_user(extract_user(create_table >> is_api_available())) >> store_user()


user_processing()
```
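The dependency line at the bottom is dense, so here is how I read it. As far as I understand the operator semantics, an equivalent, more explicit version with named intermediate values looks like this:

```python
api_data = is_api_available()     # the sensor's XCom output (the JSON payload)
create_table >> api_data          # the table must exist before we poll the API
user = extract_user(api_data)     # sensor output feeds the extract task
processed = process_user(user)    # extracted fields feed the CSV writer
processed >> store_user()         # COPY runs only after the CSV is written
```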
I kept my code in the `./dags` folder and launched the services with:

```bash
docker compose up
```
From there, I could view the Airflow UI at `localhost:8080` and watch my DAGs run!
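Airflow also lets you run a DAG end-to-end in a single process, which is handy for debugging before the UI is even up. A minimal sketch, assuming the `dag.test()` helper available in recent Airflow versions:

```python
# Hypothetical snippet at the bottom of the DAG file: run every task once,
# in-process, without the scheduler.
if __name__ == "__main__":
    user_processing().test()
```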
🔄 What I Learned
Here are a few takeaways from building my first Airflow pipeline:
- Writing DAGs is just writing Python — which made things easier coming from a CS background.
- Sensors are powerful — you can wait for conditions like an API being available (see the sketch after this list).
- Docker is your best friend — I could reset everything with a single command.
- Airflow encourages clean architecture — breaking tasks into small, testable functions.
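For example, here is a minimal sketch of the same sensor pattern waiting for a file instead of an API. The path and timings are hypothetical, but the decorator and return type are the ones used in the DAG above:

```python
from airflow.sdk import task
from airflow.sdk.bases.sensor import PokeReturnValue


@task.sensor(poke_interval=30, timeout=600)  # hypothetical timings
def wait_for_export() -> PokeReturnValue:
    import os

    # is_done=True ends the wait; is_done=False schedules another poke.
    return PokeReturnValue(is_done=os.path.exists("/tmp/export_ready.csv"))
```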
🚀 What’s Next
- ✅ Build a more advanced DAG with branching logic (sketched after this list).
- ☁️ Integrate with Google Cloud (BigQuery + Cloud Storage).
- 🔁 Explore `dbt` and `Kafka` for batch and stream processing.
- 🐳 Keep refining my Docker and deployment skills.
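To make the branching item concrete, here is a hedged sketch of the pattern I have in mind, assuming the `@task.branch` decorator that recent Airflow versions provide. The task names and threshold are made up:

```python
from airflow.sdk import dag, task


@dag
def branching_example():
    @task
    def count_users():
        return 5  # placeholder value for the sketch

    # A branch task returns the task_id(s) to run next; the others are skipped.
    @task.branch
    def choose_path(user_count: int):
        return "bulk_load" if user_count > 100 else "row_insert"

    @task
    def bulk_load():
        print("Loading users in bulk")

    @task
    def row_insert():
        print("Inserting users one row at a time")

    choose_path(count_users()) >> [bulk_load(), row_insert()]


branching_example()
```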