API Rate Limiting Explained
Strategies, algorithms, and best practices for keeping your API stable and secure.
Last updated: Mar 3, 2026
Part of the API Authentication Methods guide.
API rate limiting controls how many requests a client can make within a time window. It protects your infrastructure from overload, prevents abuse, and enforces fair usage across tenants. Limits are typically enforced at the API gateway or application layer using counters stored in a fast cache such as Redis.
TL;DR
- Rate limiting sets a ceiling on requests per client per time window.
- Fixed window is simple but allows burst spikes at window boundaries.
- Sliding window is smoother; token bucket is more burst-tolerant.
- Enforce limits at the gateway, not only in application code.
- Always return 429 Too Many Requests with a Retry-After header.
- Rate limiting ≠ throttling: throttling slows requests, limiting blocks them.
What Is API Rate Limiting?
Rate limiting is a policy that restricts how many API requests a client — identified by an API key, IP address, user ID, or tenant — can make within a defined period. When a client exceeds the limit, the server responds with HTTP 429 Too Many Requests and typically includes a Retry-After or X-RateLimit-Reset header telling the client when it can retry.
Limits are usually defined per resource (e.g., 1000 requests/minute for /v1/messages, 100 requests/minute for /v1/export) or at the account tier level (free: 60 req/min, paid: 600 req/min). Most production APIs use multiple overlapping limits — a per-second burst limit and a per-minute sustained limit.
Why Rate Limiting Matters
A single misbehaving client — a misconfigured retry loop, a scraped script, or a DDoS — can exhaust your database connections or backend threads. Rate limiting creates a hard ceiling that protects all other tenants.
Rate limits slow credential-stuffing attacks, enumeration attacks, and scraping attempts. Combined with authentication, they raise the cost for attackers and make automated abuse impractical.
On pay-per-use infrastructure (cloud functions, LLM APIs, third-party providers), unbounded usage directly translates to cost. Rate limits prevent a runaway client from generating unexpected bills.
Common Rate Limiting Strategies
Fixed Window
The simplest algorithm. Time is divided into fixed slots (e.g., every 60 seconds). A counter increments on each request and resets at the window boundary. If the counter exceeds the limit, requests are rejected until the next window.
Boundary burst problem
A client can send all 100 requests in the last second of window A, then another 100 in the first second of window B — 200 requests in 2 seconds, not the 100/minute you intended. Sliding window solves this.
Sliding Window
Instead of resetting at fixed intervals, the window slides with time. At any moment, only requests within the last N seconds count toward the limit. The count at time t is the sum of requests in the range [t − window, t].
A common approximation uses two fixed-window counters (current and previous) and weights the previous window proportionally to how much it still overlaps with the current sliding window. Cloudflare popularized this approach for its edge rate limiter; it is O(1) in both time and space.
Token Bucket
Each client has a bucket with a maximum capacity of N tokens. Tokens refill at a constant rate (e.g., 10 tokens/second up to a maximum of 100). Each request consumes one token. If the bucket is empty, the request is rejected.
Token bucket naturally handles bursts: a client that hasn't used the API recently has a full bucket and can fire off a burst of requests. This models real-world usage well and is used by AWS API Gateway and many CDNs.
Leaky Bucket
Requests enter a queue (the bucket) and are processed at a constant output rate, regardless of how fast they arrive. If the queue fills up, new requests overflow and are rejected.
The leaky bucket enforces a smooth, constant outflow rate — useful for protecting downstream services from traffic spikes. Unlike token bucket, it does not allow burst catch-up: the output rate is always constant.
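The admission side of a leaky bucket can be sketched as a counter that drains at the constant output rate (this "leaky bucket as meter" variant only decides accept/reject; a full implementation would also hold admitted requests in a queue and process them at the drain rate):

```python
class LeakyBucket:
    """Admit up to `capacity` queued requests; drain at `rate` per second."""

    def __init__(self, rate: float, capacity: int, now: float = 0.0):
        self.rate = rate
        self.capacity = capacity
        self.level = 0.0   # current queue depth
        self.last = now

    def allow(self, now: float) -> bool:
        # The bucket leaks at a constant rate regardless of arrivals.
        self.level = max(0.0, self.level - (now - self.last) * self.rate)
        self.last = now
        if self.level + 1 > self.capacity:
            return False   # bucket full: overflow, reject
        self.level += 1
        return True
```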
| Algorithm | Burst Tolerance | Smoothness | Implementation |
|---|---|---|---|
| Fixed Window | Low (boundary burst) | Low | Simplest |
| Sliding Window | Medium | High | Moderate |
| Token Bucket | High (up to capacity) | Medium | Moderate |
| Leaky Bucket | None (queued) | Very High | Moderate |
Rate Limiting vs Throttling
The terms are often used interchangeably, but they describe different behaviours:
Rate limiting
Hard rejection. Once the limit is reached, requests are refused with a 429 status code. The client must wait for the window to reset or the bucket to refill.
Suitable for enforcing quotas and preventing abuse.
Throttling
Soft slowdown. Excess requests are delayed — queued or responded to with artificial latency — rather than rejected. The client's requests eventually go through, just more slowly.
Suitable for smoothing traffic spikes without breaking clients.
In practice, API gateways combine both: they throttle moderate spikes (queue requests briefly) and hard-limit severe spikes (reject with 429). Cloudflare, AWS API Gateway, and nginx rate limit modules all support this hybrid model.
Where to Enforce Rate Limits
Enforce limits at the edge before requests reach your application servers. API gateways (AWS API Gateway, Kong, Nginx, Cloudflare Workers) have built-in rate limiting that can key on IP, API key, or JWT claims. Rejecting at the gateway means your application servers are never hit — they're protected regardless of what language or framework they're written in.
The application layer enforces limits that require business context — per-user quotas tied to subscription tiers, per-endpoint limits for expensive operations, or limits based on computed usage (tokens, credits, API calls this month). These typically use a counter stored in Redis with an atomic increment and TTL.
Middleware in Express, FastAPI, or Rails is the typical enforcement point. Use a shared Redis instance so limits apply consistently across all app replicas.
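The Redis pattern boils down to an atomic increment plus a TTL that expires the counter with the window. The sketch below simulates it with an in-memory stand-in (`FakeRedis` is a hypothetical class, not the redis-py client; in real Redis the increment and TTL should be set atomically, e.g. via a Lua script, so a crash between INCR and EXPIRE cannot leave an immortal counter):

```python
class FakeRedis:
    """In-memory stand-in for the Redis commands the limiter needs."""

    def __init__(self):
        self.store = {}  # key -> (value, expires_at)

    def incr_with_ttl(self, key: str, ttl: float, now: float) -> int:
        # Mirrors INCR plus EXPIRE-on-first-increment in real Redis.
        value, expires_at = self.store.get(key, (0, now + ttl))
        if now >= expires_at:                 # window expired: start fresh
            value, expires_at = 0, now + ttl
        value += 1
        self.store[key] = (value, expires_at)
        return value

def allow(cache: FakeRedis, user: str, limit: int, window: float, now: float) -> bool:
    """Middleware-style check: one shared counter per user per window."""
    count = cache.incr_with_ttl(f"rl:{user}", window, now)
    return count <= limit
```

Because every app replica talks to the same store, the limit holds across the whole fleet, which is exactly what an in-process counter cannot guarantee.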
Do not rely on in-process counters
An in-memory counter inside your application process is not shared across replicas. Each instance gets its own counter — a client can exceed the intended limit by a factor equal to the number of replicas. Always use an external store (Redis, Memcached) for counters.
Common Mistakes
Not returning Retry-After
A bare 429 with no timing information forces clients to guess when to retry, causing exponential backoff loops or random hammering. Always include Retry-After (seconds) or X-RateLimit-Reset (Unix timestamp) in your 429 response.
Keying limits on IP address only
Shared IPs (corporate NATs, mobile carriers, Cloudflare proxies) mean thousands of legitimate users can share a single IP. Key limits on authenticated identity (API key, user ID) for authenticated APIs. Reserve IP-based limits for unauthenticated endpoints.
Only enforcing limits in application code
Application-layer enforcement can be bypassed by overloading a single server before it can reject requests. Gateway-level enforcement drops traffic before it touches your app servers.
Setting limits too low and not communicating them
Limits that are too restrictive break legitimate use cases and generate support tickets. Publish your rate limits in your API documentation and expose them in response headers (X-RateLimit-Limit, X-RateLimit-Remaining) so clients can self-throttle.
Not logging or alerting on sustained 429s
If a client is consistently hitting your rate limits, it may indicate a bug in their implementation, a malicious actor, or a misconfigured limit on your side. Monitor 429 rates and alert on anomalies.
Treating all endpoints the same
A read endpoint that returns a cached result is far cheaper than a write endpoint that triggers a background job. Apply stricter limits to expensive operations and looser limits to cheap reads.