
API Rate Limiting Explained

Strategies, algorithms, and best practices for keeping your API stable and secure.

Last updated: Mar 3, 2026

Part of the API Authentication Methods guide.

API rate limiting controls how many requests a client can make within a time window. It protects your infrastructure from overload, prevents abuse, and enforces fair usage across tenants. Limits are typically enforced at the API gateway or application layer using counters stored in a fast cache such as Redis.

TL;DR

  • Rate limiting sets a ceiling on requests per client per time window.
  • Fixed window is simple but allows burst spikes at window boundaries.
  • Sliding window is smoother; token bucket is more burst-tolerant.
  • Enforce limits at the gateway — not only in application code.
  • Always return 429 Too Many Requests with a Retry-After header.
  • Rate limiting ≠ throttling: throttling slows requests, limiting blocks them.

What Is API Rate Limiting?

Rate limiting is a policy that restricts how many API requests a client — identified by an API key, IP address, user ID, or tenant — can make within a defined period. When a client exceeds the limit, the server responds with HTTP 429 Too Many Requests and typically includes a Retry-After or X-RateLimit-Reset header telling the client when it can retry.

Limits are usually defined per resource (e.g., 1000 requests/minute for /v1/messages, 100 requests/minute for /v1/export) or at the account tier level (free: 60 req/min, paid: 600 req/min). Most production APIs use multiple overlapping limits — a per-second burst limit and a per-minute sustained limit.

Why Rate Limiting Matters

Stability

A single misbehaving client — a misconfigured retry loop, a scraping script, or a DDoS — can exhaust your database connections or backend threads. Rate limiting creates a hard ceiling that protects all other tenants.

Abuse Prevention

Rate limits slow credential-stuffing attacks, enumeration attacks, and scraping attempts. Combined with authentication, they raise the cost for attackers and make automated abuse impractical.

Cost Control

On pay-per-use infrastructure (cloud functions, LLM APIs, third-party providers), unbounded usage directly translates to cost. Rate limits prevent a runaway client from generating unexpected bills.

Common Rate Limiting Strategies

Fixed Window

The simplest algorithm. Time is divided into fixed slots (e.g., every 60 seconds). A counter increments on each request and resets at the window boundary. If the counter exceeds the limit, requests are rejected until the next window.
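The fixed-window counter fits in a few lines. The sketch below is a single-client, in-memory illustration (the class and method names are ours, not from any library); production systems key the counter per client and store it externally:

```python
class FixedWindowLimiter:
    """Fixed-window counter: at most `limit` requests per `window` seconds."""

    def __init__(self, limit: int, window: float):
        self.limit = limit
        self.window = window
        self.counts = {}  # window slot -> request count (keyed per client in production)

    def allow(self, now: float) -> bool:
        """Callers pass the current time, e.g. time.time()."""
        slot = int(now // self.window)  # the counter implicitly resets at each boundary
        count = self.counts.get(slot, 0)
        if count >= self.limit:
            return False
        self.counts[slot] = count + 1
        return True
```

Note that nothing is explicitly "reset": a new window simply maps to a new slot key, which is why the algorithm is so cheap to implement.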

Boundary burst problem

A client can send all 100 requests in the last second of window A, then another 100 in the first second of window B — 200 requests in 2 seconds, not the 100/minute you intended. Sliding window solves this.

Sliding Window

Instead of resetting at fixed intervals, the window slides with time. At any moment, only requests within the last N seconds count toward the limit. The count at time t is the sum of requests in the range [t − window, t].

A common approximation uses two fixed-window counters (current and previous) and weights the previous window proportionally to how much it still overlaps with the current sliding window. This approximation — popularized by Cloudflare's rate limiter — is O(1) in both time and space.
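The two-counter approximation can be sketched as follows (an in-memory, single-client illustration with names of our choosing):

```python
class SlidingWindowLimiter:
    """Sliding-window approximation using current + previous fixed counters."""

    def __init__(self, limit: int, window: float):
        self.limit = limit
        self.window = window
        self.counts = {}  # window slot -> request count

    def allow(self, now: float) -> bool:
        slot = int(now // self.window)
        elapsed = (now % self.window) / self.window  # fraction into the current slot
        prev = self.counts.get(slot - 1, 0)
        curr = self.counts.get(slot, 0)
        # Weight the previous window by how much of it still overlaps
        # the sliding window ending at `now`.
        estimate = prev * (1 - elapsed) + curr
        if estimate >= self.limit:
            return False
        self.counts[slot] = curr + 1
        return True
```

Because the previous window's count decays linearly as time advances, the boundary burst of the fixed-window algorithm is smoothed out at the cost of a small approximation error.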

Token Bucket

Each client has a bucket with a maximum capacity of N tokens. Tokens refill at a constant rate (e.g., 10 tokens/second up to a maximum of 100). Each request consumes one token. If the bucket is empty, the request is rejected.

Token bucket naturally handles bursts: a client that hasn't used the API recently has a full bucket and can fire off a burst of requests. This models real-world usage well and is used by AWS API Gateway and many CDNs.
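A minimal token bucket can be implemented with lazy refill — no background timer needed, tokens are topped up whenever a request arrives (again an in-memory sketch; real deployments keep this state per client in a shared store):

```python
class TokenBucket:
    """Bucket holds up to `capacity` tokens, refilled at `rate` tokens/second."""

    def __init__(self, capacity: float, rate: float, now: float = 0.0):
        self.capacity = capacity
        self.rate = rate
        self.tokens = capacity  # full bucket: an idle client may burst immediately
        self.updated = now

    def allow(self, now: float) -> bool:
        # Lazily refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1  # each request costs one token
            return True
        return False
```

The lazy-refill trick is what makes token bucket cheap: only two floats per client, updated on access.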

Leaky Bucket

Requests enter a queue (the bucket) and are processed at a constant output rate, regardless of how fast they arrive. If the queue fills up, new requests overflow and are rejected.

The leaky bucket enforces a smooth, constant outflow rate — useful for protecting downstream services from traffic spikes. Unlike token bucket, it does not allow burst catch-up: the output rate is always constant.
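In its "meter" form, the leaky bucket is nearly the mirror image of the token bucket: the level rises with each request and drains at a constant rate, and a full bucket means overflow. A sketch under the same single-client, in-memory assumptions:

```python
class LeakyBucket:
    """Leaky bucket as a meter: the queue drains at `leak_rate` requests/second."""

    def __init__(self, capacity: int, leak_rate: float, now: float = 0.0):
        self.capacity = capacity
        self.leak_rate = leak_rate
        self.level = 0.0  # how "full" the bucket currently is
        self.updated = now

    def allow(self, now: float) -> bool:
        # Drain the bucket at the constant leak rate since the last check.
        self.level = max(0.0, self.level - (now - self.updated) * self.leak_rate)
        self.updated = now
        if self.level + 1 <= self.capacity:
            self.level += 1
            return True
        return False  # bucket full: the request overflows and is rejected
```

The queue variant described above additionally delays accepted requests so they leave at the constant rate; the meter variant only decides accept/reject.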

Algorithm        Burst Tolerance          Smoothness   Implementation
Fixed Window     Low (boundary burst)     Low          Simplest
Sliding Window   Medium                   High         Moderate
Token Bucket     High (up to capacity)    Medium       Moderate
Leaky Bucket     None (queued)            Very High    Moderate

Rate Limiting vs Throttling

The terms are often used interchangeably, but they describe different behaviours:

Rate Limiting

Hard rejection. Once the limit is reached, requests are refused with a 429 status code. The client must wait for the window to reset or the bucket to refill.

Suitable for enforcing quotas and preventing abuse.

Throttling

Soft slowdown. Excess requests are delayed — queued or responded to with artificial latency — rather than rejected. The client's requests eventually go through, just more slowly.

Suitable for smoothing traffic spikes without breaking clients.

In practice, API gateways combine both: they throttle moderate spikes (queue requests briefly) and hard-limit severe spikes (reject with 429). Cloudflare, AWS API Gateway, and nginx's limit_req module all support this hybrid model.

Where to Enforce Rate Limits

API Gateway (Recommended First Line)

Enforce limits at the edge before requests reach your application servers. API gateways (AWS API Gateway, Kong, Nginx, Cloudflare Workers) have built-in rate limiting that can key on IP, API key, or JWT claims. Rejecting at the gateway means your application servers are never hit — they're protected regardless of what language or framework they're written in.

Application Layer (Business Logic Limits)

The application layer enforces limits that require business context — per-user quotas tied to subscription tiers, per-endpoint limits for expensive operations, or limits based on computed usage (tokens, credits, API calls this month). These typically use a counter stored in Redis with an atomic increment and TTL.

Middleware in Express, FastAPI, or Rails is the typical enforcement point. Use a shared Redis instance so limits apply consistently across all app replicas.

Do not rely on in-process counters

An in-memory counter inside your application process is not shared across replicas. Each instance gets its own counter — a client can exceed the intended limit by a factor equal to the number of replicas. Always use an external store (Redis, Memcached) for counters.

Common Mistakes

Not returning Retry-After

A bare 429 with no timing information forces clients to guess when to retry, leading to retry storms or random hammering. Always include Retry-After (seconds) or X-RateLimit-Reset (Unix timestamp) in your 429 response.
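A small helper makes the header discipline easy to apply consistently (the function name is ours; the X-RateLimit-* names follow the common but non-standardized convention used in this article):

```python
def rate_limit_headers(limit: int, remaining: int, reset_at: int, now: int) -> dict:
    """Build rate-limit response headers for any framework's response object."""
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": str(reset_at),  # Unix timestamp when the window resets
    }
    if remaining <= 0:
        # Retry-After is relative seconds: clients can sleep exactly this long.
        headers["Retry-After"] = str(max(0, reset_at - now))
    return headers
```

Attach these headers to every response, not just 429s, so well-behaved clients can self-throttle before hitting the limit.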

Keying limits on IP address only

Shared IPs (corporate NATs, mobile carriers, Cloudflare proxies) mean thousands of legitimate users can share a single IP. Key limits on authenticated identity (API key, user ID) for authenticated APIs. Reserve IP-based limits for unauthenticated endpoints.

Only enforcing limits in application code

Application-layer enforcement can be bypassed by overloading a single server before it can reject requests. Gateway-level enforcement drops traffic before it touches your app servers.

Setting limits too low and not communicating them

Limits that are too restrictive break legitimate use cases and generate support tickets. Publish your rate limits in your API documentation and expose them in response headers (X-RateLimit-Limit, X-RateLimit-Remaining) so clients can self-throttle.

Not logging or alerting on sustained 429s

If a client is consistently hitting your rate limits, it may indicate a bug in their implementation, a malicious actor, or a misconfigured limit on your side. Monitor 429 rates and alert on anomalies.

Treating all endpoints the same

A read endpoint that returns a cached result is far cheaper than a write endpoint that triggers a background job. Apply stricter limits to expensive operations and looser limits to cheap reads.

Test Rate-Limited APIs

Use the API Tester to send requests to any endpoint and inspect the full response including rate-limit headers (X-RateLimit-Limit, X-RateLimit-Remaining, Retry-After). No setup required — runs in your browser.

