Rate limiting strategies for SMB APIs

Token bucket, sliding window, and fixed window — when to use each, and how we implement rate limiting in FastAPI without Redis bloat.

Why you need it

Even friendly clients write bad code. A loop that "just hits the endpoint until it works" can take down a backend. Rate limiting is the polite "no" you owe yourself.

The three algorithms

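Fixed window: count requests in a 60-second bucket and reset the count when the next bucket starts. Simple, but it allows bursts at the boundary: with a limit of 100 per minute, a client can send 100 requests in the last second of one window and 100 in the first second of the next, which is 200 requests in about two seconds.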
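A minimal in-memory sketch of the fixed window idea, not a production implementation. The class name, the `allow` method, and the injectable `clock` parameter are all assumptions made for illustration; the clock is only there so the boundary behaviour can be shown without sleeping.

```python
import time
from typing import Callable


class FixedWindowLimiter:
    """Allow at most `limit` requests per key in each fixed window."""

    def __init__(self, limit: int, window_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        self.clock = clock  # hypothetical hook, swapped out in tests
        # key -> (window index, requests counted in that window)
        self.state: dict[str, tuple[int, int]] = {}

    def allow(self, key: str) -> bool:
        window = int(self.clock() // self.window_seconds)
        seen_window, count = self.state.get(key, (window, 0))
        if seen_window != window:
            count = 0  # a new window started, so the counter resets
        if count >= self.limit:
            self.state[key] = (window, count)
            return False
        self.state[key] = (window, count + 1)
        return True
```

With `limit=2`, two calls at second 59 and two more at second 60 all succeed, which is the boundary burst in miniature.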

Sliding window: track requests in the last 60 seconds at any moment. No boundary bursts, but more memory, because the simplest version stores a timestamp per request instead of a single counter.
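One way to sketch this is a sliding window log: keep a queue of recent timestamps per key and drop the ones that have aged out. The names here (`SlidingWindowLimiter`, `allow`, `clock`) are illustrative assumptions, and the storage is a plain in-process dict.

```python
import time
from collections import defaultdict, deque
from typing import Callable


class SlidingWindowLimiter:
    """Allow at most `limit` requests per key in any `window_seconds` span."""

    def __init__(self, limit: int, window_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        self.clock = clock  # hypothetical hook, swapped out in tests
        # key -> timestamps of accepted requests, oldest first
        self.hits: dict[str, deque[float]] = defaultdict(deque)

    def allow(self, key: str) -> bool:
        now = self.clock()
        hits = self.hits[key]
        # Drop timestamps that are no longer inside the window.
        while hits and hits[0] <= now - self.window_seconds:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True
```

Memory grows with the limit: each key can hold up to `limit` timestamps, which is the cost the prose above refers to.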

Token bucket: tokens refill at a steady rate, and each request consumes one. Allows bursts up to the bucket size, then smooths traffic down to the refill rate. This is what AWS API Gateway and many other public APIs use.
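A small sketch of a single token bucket, assuming lazy refill (tokens are topped up when a request arrives, not by a background timer). `TokenBucket`, `capacity`, `refill_per_second`, and `clock` are names chosen for this example; a real limiter would keep one bucket per key.

```python
import time
from typing import Callable


class TokenBucket:
    """Bucket that holds up to `capacity` tokens and refills continuously."""

    def __init__(self, capacity: float, refill_per_second: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.clock = clock  # hypothetical hook, swapped out in tests
        self.tokens = capacity  # start full so an initial burst is allowed
        self.updated_at = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        elapsed = now - self.updated_at
        # Add the tokens earned since the last call, never above capacity.
        self.tokens = min(self.capacity,
                          self.tokens + elapsed * self.refill_per_second)
        self.updated_at = now
        if self.tokens < cost:
            return False
        self.tokens -= cost
        return True
```

`capacity` sets the largest burst and `refill_per_second` sets the long-run rate, so the two can be tuned separately.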

For SMB backends, fixed window is usually fine. The boundary burst rarely matters.

Where to enforce it

Per IP for unauthenticated endpoints (login, password reset)
Per user for authenticated endpoints
Per API key for partner integrations
Per endpoint for expensive operations (search, export)

You will end up with several layers. That is correct.
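The layering can be sketched as a function that returns every counter a request should be charged against. Everything here is hypothetical: the `RequestInfo` shape, the key prefixes, the `EXPENSIVE_PATHS` set, and the precedence (API key, then user, then IP) are assumptions for illustration, not a fixed scheme.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical set of endpoints that get their own, tighter limit.
EXPENSIVE_PATHS = {"/search", "/export"}


@dataclass
class RequestInfo:
    """The few request facts the limiter needs (illustrative shape)."""
    client_ip: str
    path: str
    user_id: Optional[str] = None
    api_key: Optional[str] = None


def rate_limit_keys(req: RequestInfo) -> list[str]:
    """Return every key this request should be counted against."""
    if req.api_key is not None:
        caller = f"key:{req.api_key}"    # partner integration
    elif req.user_id is not None:
        caller = f"user:{req.user_id}"   # authenticated user
    else:
        caller = f"ip:{req.client_ip}"   # anonymous, e.g. login
    keys = [caller]
    if req.path in EXPENSIVE_PATHS:
        # A second layer: the same caller, scoped to the costly endpoint.
        keys.append(f"{caller}:path:{req.path}")
    return keys
```

A request is allowed only if every returned key is under its limit.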

Implementing it in FastAPI

The simplest production-grade approach: slowapi (a port of Flask-Limiter to Starlette).

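from fastapi import FastAPI, Request
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.util import get_remote_address

limiter = Limiter(key_func=get_remote_address)  # one counter per client IP
app = FastAPI()
app.state.limiter = limiter  # slowapi looks the limiter up on app.state
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)  # returns the 429

# The route decorator goes above the limit decorator, and slowapi needs
# the parameter to be named "request". LoginIn is your request model.
@app.post("/login")
@limiter.limit("5/minute")
async def login(request: Request, payload: LoginIn):
    ...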

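Back it with Redis for multi-instance correctness: in-memory counters live in each process, so every instance or worker allows the full limit on its own. slowapi takes a storage_uri argument on the Limiter for this. In-memory storage is fine for single-instance deployments and a great starting point.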

What to return

429 status, a Retry-After header in seconds, and a body that explains the limit:

{ "error": "rate_limited", "retry_after_seconds": 42, "limit": "5 per minute" }

Clients can build sane retry logic against that. Without Retry-After, they will hammer you.
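Both sides of that contract can be sketched with the standard library. This is an illustration under assumptions, not a fixed API: `rate_limited_response`, `retry_delay`, and the `base` and `cap` backoff parameters are made-up names, and the fallback when the header is missing (exponential backoff with jitter) is one common choice among several.

```python
import json
import math
import random
from typing import Mapping, Optional


def rate_limited_response(reset_at: float, now: float,
                          limit: str) -> tuple[int, dict[str, str], str]:
    """Server side: build status, headers, and JSON body for a 429."""
    # Round up so the client never retries a fraction of a second early.
    retry_after = max(1, math.ceil(reset_at - now))
    headers = {"Retry-After": str(retry_after),
               "Content-Type": "application/json"}
    body = json.dumps({
        "error": "rate_limited",
        "retry_after_seconds": retry_after,
        "limit": limit,
    })
    return 429, headers, body


def retry_delay(status: int, headers: Mapping[str, str], attempt: int,
                base: float = 0.5, cap: float = 30.0) -> Optional[float]:
    """Client side: seconds to wait before retrying, or None if not limited."""
    if status != 429:
        return None
    # Assumes the exact header casing; HTTP client libraries usually
    # expose a case-insensitive mapping instead of a plain dict.
    retry_after = headers.get("Retry-After", "")
    if retry_after.isdigit():
        return float(retry_after)  # the server said exactly how long to wait
    # No usable header: fall back to exponential backoff with jitter.
    return random.uniform(0, min(cap, base * 2 ** attempt))
```

A client that sleeps for `retry_delay(...)` before each retry waits exactly as long as the server asked, and only guesses when the header is absent.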

What we skip

We do not build per-endpoint dashboards for rate limits until a client asks. The application logs already tell us who is hitting the limit, and over-engineering this up front is wasted effort.