All posts
·5 min read

ETL pipeline basics for SMB engineers

Extract, transform, load — explained in plain language, with the architecture we use for SMB pipelines that need to be reliable, not webscale.

ETLdatapipelines

What ETL actually means

You have data in five places — Stripe, your CRM, your support tool, ad platforms, the production database. You want it all in one place so you can ask "how much did we make per channel last quarter."

ETL is the boring infrastructure that moves it there.

  • Extract: pull data from sources
  • Transform: clean it, normalize it, join it
  • Load: write it to a warehouse (Postgres, BigQuery, Snowflake)

That is the whole thing. Everything else is implementation detail.

ELT vs ETL

Modern warehouses are fast enough that you can load raw data first and transform inside the warehouse. This is called ELT. It is easier to debug because the raw data is there to inspect.

For SMBs, ELT is almost always the right call. Use the warehouse for transformation; do not build a transformation framework in Python.

The stack we ship

  • Source: APIs (Stripe, HubSpot), Postgres replicas, S3 dumps
  • Extract layer: small Python scripts on a schedule (Celery beat or cron)
  • Loading: COPY into Postgres, or the warehouse's native loader
  • Transform: SQL in the warehouse, version-controlled with dbt
  • Orchestration: Celery for simple pipelines, Prefect or Dagster when DAGs get complex

For most SMBs, Celery + cron + SQL is plenty. We add Prefect/Dagster only when there are 20+ tasks with real dependencies.

What goes wrong

  • Schema drift: a source adds a column, your loader breaks. Use schema-tolerant loaders that ignore unknown fields, log them, and let you decide.
  • Backfills: pretending you can always re-run from the source. APIs have retention limits. Snapshot to S3 first, then transform.
  • Timezones: the source sends created_at in their local time. Convert to UTC on ingest. Never store ambiguous timestamps.
  • Idempotency: a re-run should not double-count rows. Use upserts keyed on the source ID.

These four issues account for 80% of broken pipelines.

What we will not over-engineer

We do not ship Kafka for SMB workloads. We do not ship Airflow when Celery beat covers the schedule. We do not ship a streaming pipeline when nightly batch covers the use case.

Boring pipelines run themselves. Fancy pipelines wake you up.

Got a workflow problem?

Let's talk about whether n8n, a custom backend, or a hybrid fits your case.

A 30-minute discovery call. Free, honest, you leave with a written direction either way.

Start QuizBook a Call