ETL pipeline basics for SMB engineers

Extract, transform, load — explained in plain language, with the architecture we use for SMB pipelines that need to be reliable, not webscale.

What ETL actually means

You have data in five places — Stripe, your CRM, your support tool, ad platforms, the production database. You want it all in one place so you can ask "how much did we make per channel last quarter."

ETL is the boring infrastructure that moves it there.

Extract: pull data from sources
Transform: clean it, normalize it, join it
Load: write it to a warehouse (Postgres, BigQuery, Snowflake)

That is the whole thing. Everything else is implementation detail.

ELT vs ETL

Modern warehouses are fast enough that you can load raw data first and transform inside the warehouse. This is called ELT. It is easier to debug because the raw data is there to inspect.

For SMBs, ELT is almost always the right call. Use the warehouse for transformation; do not build a transformation framework in Python.

The stack we ship

Source: APIs (Stripe, HubSpot), Postgres replicas, S3 dumps
Extract layer: small Python scripts on a schedule (Celery beat or cron)
Loading: COPY into Postgres, or the warehouse's native loader
Transform: SQL in the warehouse, version-controlled with dbt
Orchestration: Celery for simple pipelines, Prefect or Dagster when DAGs get complex

For most SMBs, Celery + cron + SQL is plenty. We add Prefect/Dagster only when there are 20+ tasks with real dependencies.

What goes wrong

Schema drift: a source adds a column, your loader breaks. Use schema-tolerant loaders that ignore unknown fields, log them, and let you decide.
Backfills: pretending you can always re-run from the source. APIs have retention limits. Snapshot to S3 first, then transform.
Timezones: the source sends created_at in their local time. Convert to UTC on ingest. Never store ambiguous timestamps.
Idempotency: a re-run should not double-count rows. Use upserts keyed on the source ID.

These four issues account for 80% of broken pipelines.

What we will not over-engineer

We do not ship Kafka for SMB workloads. We do not ship Airflow when Celery beat covers the schedule. We do not ship a streaming pipeline when nightly batch covers the use case.

Boring pipelines run themselves. Fancy pipelines wake you up.