ETL pipeline basics for SMB engineers
Extract, transform, load — explained in plain language, with the architecture we use for SMB pipelines that need to be reliable, not webscale.
What ETL actually means
You have data in five places — Stripe, your CRM, your support tool, ad platforms, the production database. You want it all in one place so you can ask "how much did we make per channel last quarter."
ETL is the boring infrastructure that moves it there.
- Extract: pull data from sources
- Transform: clean it, normalize it, join it
- Load: write it to a warehouse (Postgres, BigQuery, Snowflake)
That is the whole thing. Everything else is implementation detail.
ELT vs ETL
Modern warehouses are fast enough that you can load raw data first and transform inside the warehouse. This is called ELT. It is easier to debug because the raw data is there to inspect.
For SMBs, ELT is almost always the right call. Use the warehouse for transformation; do not build a transformation framework in Python.
The stack we ship
- Source: APIs (Stripe, HubSpot), Postgres replicas, S3 dumps
- Extract layer: small Python scripts on a schedule (Celery beat or cron)
- Loading:
COPYinto Postgres, or the warehouse's native loader - Transform: SQL in the warehouse, version-controlled with dbt
- Orchestration: Celery for simple pipelines, Prefect or Dagster when DAGs get complex
For most SMBs, Celery + cron + SQL is plenty. We add Prefect/Dagster only when there are 20+ tasks with real dependencies.
What goes wrong
- Schema drift: a source adds a column, your loader breaks. Use schema-tolerant loaders that ignore unknown fields, log them, and let you decide.
- Backfills: pretending you can always re-run from the source. APIs have retention limits. Snapshot to S3 first, then transform.
- Timezones: the source sends
created_atin their local time. Convert to UTC on ingest. Never store ambiguous timestamps. - Idempotency: a re-run should not double-count rows. Use upserts keyed on the source ID.
These four issues account for 80% of broken pipelines.
What we will not over-engineer
We do not ship Kafka for SMB workloads. We do not ship Airflow when Celery beat covers the schedule. We do not ship a streaming pipeline when nightly batch covers the use case.
Boring pipelines run themselves. Fancy pipelines wake you up.