Retry and exponential backoff patterns that work

How to retry failed API calls and background jobs without making things worse: exponential backoff, jitter, and circuit breakers explained.

Retries are dangerous

A retry without backoff is a denial-of-service attack on the service you depend on. A retry without a cap is an infinite loop. A retry on a non-idempotent operation is a duplicate charge.

Get this right.

The pattern

Exponential backoff with jitter, capped retries, only on retriable errors.

That sentence contains four decisions. All four matter.

Exponential backoff

Wait 1s, then 2s, then 4s, then 8s. This gives the downstream service time to recover. Linear backoff (1s, 2s, 3s, 4s) does not give enough breathing room when something is genuinely down.

Jitter

Without jitter, every client that failed at the same moment retries at the same moment and hits the recovering service simultaneously. That is a thundering herd.

Add randomness: wait base * 2^attempt * random(0.5, 1.5), where attempt counts from 0. Now retries spread across time.
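As a rough sketch of that formula, here is one way to compute the delay. The function name backoff_delay, the rng parameter, and the 30-second cap are assumptions added for illustration, not part of the formula itself.

```python
import random
from typing import Optional


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0,
                  rng: Optional[random.Random] = None) -> float:
    """Seconds to wait before retry number `attempt` (counting from 0)."""
    source = rng or random                      # injectable so tests can seed it
    exponential = base * (2 ** attempt)         # 1, 2, 4, 8, ... for base=1
    jittered = exponential * source.uniform(0.5, 1.5)
    return min(jittered, cap)                   # cap is an assumption, see above
```

With base=1 the first wait lands somewhere between 0.5s and 1.5s, the second between 1s and 3s, and so on until the cap takes over.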

Capped retries

Three to five attempts is usually right. After that, accept the failure, log it, and move on. Infinite retries hide bugs and burn money.
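A hand-rolled version of the cap might look like the sketch below. TransientError and call_with_retries are hypothetical names, and the sleep parameter exists only so the loop can be tested without real waiting.

```python
import logging
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")
log = logging.getLogger(__name__)


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def call_with_retries(fn: Callable[[], T], max_attempts: int = 4,
                      base: float = 1.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts - 1:
                # Cap reached: log it and let the caller see the failure.
                log.error("giving up after %d attempts", max_attempts)
                raise
            sleep(base * (2 ** attempt) * random.uniform(0.5, 1.5))
    raise ValueError("max_attempts must be at least 1")
```

Any exception other than TransientError propagates on the first attempt, so bugs are not hidden behind retries.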

Only on retriable errors

Retry: network timeouts, 502/503/504, 429 (wait at least as long as the Retry-After header asks)
Do not retry: 400, 401, 403, 404, 422 (these are bugs in the request, not transient failures)

A 401 means your token is bad. Retrying will not fix it. Surface the error, fix the auth.
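The split above can be expressed as a small classifier plus a Retry-After parser. This is a sketch using only the standard library; should_retry and retry_after_seconds are illustrative names, and the now parameter is there for testing. The header may carry either a number of seconds or an HTTP date, so both are handled.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional

RETRIABLE_STATUS = {429, 502, 503, 504}


def should_retry(status_code: int) -> bool:
    return status_code in RETRIABLE_STATUS


def retry_after_seconds(value: Optional[str],
                        now: Optional[datetime] = None) -> Optional[float]:
    """Parse a Retry-After header value; None if absent or unparseable."""
    if not value:
        return None
    value = value.strip()
    if value.isdigit():                      # delay-seconds form, e.g. "120"
        return float(value)
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())
```

When the server supplies a value, a reasonable choice is to wait for whichever is longer, that value or your own backoff delay.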

In Python

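import httpx
from tenacity import retry, retry_if_exception, stop_after_attempt, wait_exponential_jitter

RETRIABLE_STATUS = {429, 502, 503, 504}

def is_retriable(exc: BaseException) -> bool:
    # Transient statuses plus timeouts and connection errors; a 401 or 404 fails fast.
    if isinstance(exc, httpx.HTTPStatusError):
        return exc.response.status_code in RETRIABLE_STATUS
    return isinstance(exc, httpx.TransportError)

@retry(
    stop=stop_after_attempt(5),
    wait=wait_exponential_jitter(initial=1, max=30),
    retry=retry_if_exception(is_retriable),
)
async def fetch_user(user_id: str):
    ...

tenacity handles the backoff, the jitter, and the cap; the predicate decides which errors qualify. Use it. Note that httpx only raises HTTPStatusError if fetch_user calls response.raise_for_status().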

In Celery

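import httpx
from celery import Celery

app = Celery("tasks")  # placeholder app; broker configuration omitted

class RetriableError(Exception):
    """Hypothetical: raise from the task body for 429/502/503/504 only."""

@app.task(
    autoretry_for=(httpx.TransportError, RetriableError),
    retry_backoff=True,      # 1s, 2s, 4s, ...
    retry_backoff_max=600,   # never wait longer than 10 minutes
    retry_jitter=True,       # randomize each delay
    max_retries=5,
)
def sync_to_crm(record_id):
    ...

Built in. Use it. max_retries counts retries after the first run, so this task executes at most six times.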

Circuit breakers

If the downstream is consistently failing, stop hammering it. Open a circuit breaker after N failures, wait, then probe with a single request. Libraries like circuitbreaker handle this. Add one when you have a flaky third-party. Skip it for first-party calls.
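To make the open, wait, probe cycle concrete, here is a minimal hand-written breaker. It is a sketch rather than a stand-in for a library: the class and parameter names are assumptions, it is not thread-safe, and it treats every exception as a failure.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when a call is rejected without touching the downstream."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, recovery_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self._clock = clock                      # injectable for tests
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, fn: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.recovery_timeout:
                raise CircuitOpenError("circuit open, call skipped")
            # Timeout elapsed: let this one request through as a probe.
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed probe
            raise
        self._failures = 0                       # success closes the circuit
        self._opened_at = None
        return result
```

Wrap each call to the flaky dependency in breaker.call(...) and treat CircuitOpenError as an immediate failure, not something to retry.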