Dead letter queues: a practical introduction

What a DLQ is, why you need one for any production background job system, and how to set one up with Celery and Redis.

DLQ · Celery · reliability

The problem

A background job fails. You retry it three times. It still fails. Now what?

Without a dead letter queue, the job vanishes. You lose the data, the customer never gets their receipt, and you find out next quarter when finance asks where the missing orders went.

What a DLQ is

A DLQ is a second queue where failed jobs go to die, but not silently. Each failed job sits there with its original payload, error, and stack trace, waiting for a human (or a script) to look at it.

It is the difference between a job that "failed" and a job that "failed and is recoverable."
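As a rough illustration of what one dead letter record might hold, here is a minimal sketch using only the standard library. The `DeadLetter` class and the `capture` helper are hypothetical names chosen for this example, not part of Celery.

```python
import json
import traceback
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass
class DeadLetter:
    """Hypothetical record shape: everything needed to debug and replay a job."""
    task: str         # name of the task that failed
    payload: dict     # original arguments, kept so the job can be re-run
    error: str        # short error summary
    stack_trace: str  # full traceback for debugging
    created_at: str   # ISO 8601 timestamp in UTC


def capture(task: str, payload: dict, exc: BaseException) -> DeadLetter:
    """Build a dead letter record from a failed job and the exception it raised."""
    return DeadLetter(
        task=task,
        payload=payload,
        error=f"{type(exc).__name__}: {exc}",
        stack_trace="".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__)
        ),
        created_at=datetime.now(timezone.utc).isoformat(),
    )


# Simulate a failing job and capture it.
try:
    raise ValueError("card declined")
except ValueError as exc:
    record = capture("process_order", {"order_id": 42}, exc)

print(json.dumps(asdict(record), indent=2))
```

Keeping the payload as structured data rather than a formatted log line is what makes the job recoverable later.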

Why you need one

  • Auditability: every failed event is preserved
  • Replay: fix the bug, then re-run the failed jobs
  • Alerting: if the DLQ grows, something is wrong
  • No silent data loss

For any job that touches money, customers, or external systems, you need this.

Setting one up with Celery

Celery does not give you a DLQ out of the box. You build it.

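import json
import traceback

# `app` is your Celery application; `do_work` and `db` are your own code.
@app.task(
    bind=True,
    autoretry_for=(Exception,),  # retry on any exception
    retry_backoff=True,          # exponential backoff between retries
    max_retries=3,
)
def process_order(self, order_id):
    try:
        do_work(order_id)
    except Exception:
        # Final attempt: record the failure before Celery gives up on the job.
        if self.request.retries >= self.max_retries:
            dead_letter.delay(
                task="process_order",
                payload={"order_id": order_id},
                error=traceback.format_exc(),  # error message plus stack trace
            )
        raise  # re-raise so autoretry_for can schedule the next retry

@app.task
def dead_letter(task, payload, error):
    # Store the payload as JSON text so the row can be searched and replayed.
    db.execute(
        "insert into dead_letters (task, payload, error, created_at) values ($1, $2, $3, now())",
        task, json.dumps(payload), error,
    )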

The DLQ here is a Postgres table. You could use a real queue, but a table is searchable, easy to replay, and the data is durable.

Replay

Build a small admin endpoint that lists dead letters and lets you re-enqueue them. Anything more elaborate is overkill until you have hundreds of dead letters.
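The two operations behind such an endpoint could look roughly like the sketch below. It makes several assumptions: an in-memory SQLite database stands in for Postgres so the example runs on its own, the `replayed_at` column is a hypothetical addition to the table, and `enqueue` is an injected callable. In a Celery app, `enqueue` might wrap something like `app.send_task(name, kwargs=payload)`, provided the stored task name matches the registered one.

```python
import json
import sqlite3
from typing import Callable


def list_dead_letters(conn: sqlite3.Connection, limit: int = 50) -> list:
    """Return dead letters that have not been replayed yet, oldest first."""
    rows = conn.execute(
        "SELECT id, task, payload, error FROM dead_letters "
        "WHERE replayed_at IS NULL ORDER BY id LIMIT ?",
        (limit,),
    ).fetchall()
    return [
        {"id": r[0], "task": r[1], "payload": json.loads(r[2]), "error": r[3]}
        for r in rows
    ]


def replay(conn: sqlite3.Connection, dead_letter_id: int,
           enqueue: Callable[[str, dict], None]) -> bool:
    """Re-enqueue one dead letter; return False if it is missing or already replayed."""
    row = conn.execute(
        "SELECT task, payload FROM dead_letters "
        "WHERE id = ? AND replayed_at IS NULL",
        (dead_letter_id,),
    ).fetchone()
    if row is None:
        return False
    enqueue(row[0], json.loads(row[1]))
    # Mark the row instead of deleting it, so the audit trail survives.
    conn.execute(
        "UPDATE dead_letters SET replayed_at = CURRENT_TIMESTAMP WHERE id = ?",
        (dead_letter_id,),
    )
    conn.commit()
    return True


# Demo with an in-memory database and a fake enqueue function.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dead_letters (id INTEGER PRIMARY KEY, task TEXT, payload TEXT, "
    "error TEXT, created_at TEXT DEFAULT CURRENT_TIMESTAMP, replayed_at TEXT)"
)
conn.execute(
    "INSERT INTO dead_letters (task, payload, error) VALUES (?, ?, ?)",
    ("process_order", json.dumps({"order_id": 42}), "card declined"),
)

sent = []
def fake_enqueue(task: str, payload: dict) -> None:
    sent.append((task, payload))

pending_before = list_dead_letters(conn)
first = replay(conn, 1, fake_enqueue)
second = replay(conn, 1, fake_enqueue)  # already replayed, so nothing is sent
```

Marking rows as replayed rather than deleting them keeps the history intact and guards against replaying the same job twice by accident.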

Alerting

If the dead letter table grows by more than 5 entries per hour, page someone. That threshold will not fit every workload; tune it. The point is: the DLQ should be near-empty in steady state.
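The check itself can be very small. The sketch below assumes you have already fetched the `created_at` timestamps of recent dead letters. The `should_page` function and its parameters are hypothetical names, and the actual paging call is left out.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable


def should_page(created_at: Iterable[datetime], now: datetime,
                threshold: int = 5,
                window: timedelta = timedelta(hours=1)) -> bool:
    """Return True when more than `threshold` dead letters arrived within `window`."""
    cutoff = now - window
    recent = sum(1 for ts in created_at if cutoff < ts <= now)
    return recent > threshold


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)

# Five failures spread over the last 40 minutes: at the threshold, not above it.
quiet = [now - timedelta(minutes=10 * i) for i in range(5)]

# Eight failures in the last 35 minutes: above the threshold.
busy = [now - timedelta(minutes=5 * i) for i in range(8)]

# Ten failures from three hours ago: outside the window, so ignored.
stale = [now - timedelta(hours=3)] * 10

quiet_result = should_page(quiet, now)
busy_result = should_page(busy, now)
stale_result = should_page(stale, now)
```

Run it on a schedule, for example once a minute, and adjust `threshold` and `window` as you learn what normal looks like for your workload.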

What we tell clients

Every production Celery deployment we ship has a DLQ from day one. We have seen too many "where did this job go?" investigations to skip it.
