API rate limiting is a technique that controls how many requests a client can make to an API within a given time window. When a client exceeds that limit, the server rejects additional requests until the window resets. It is the primary mechanism services use for request throttling, quota management, and protection against abuse or accidental overload.
Why Rate Limiting Exists
An API without rate limiting is an open invitation to chaos. A single misbehaving client, a buggy retry loop, or a deliberate denial-of-service attack can consume all available server resources and bring down the service for everyone. Rate limiting solves several distinct problems at once:
- DoS and DDoS prevention: Caps the damage a single source can cause by flooding the server with traffic.
- Fair usage: Prevents one heavy user from starving other clients on shared infrastructure.
- Cost control: Cloud compute, database queries, and third-party calls all cost money. Limits keep bills predictable.
- API protection: Slows down credential-stuffing attacks and automated scraping that rely on high request volumes.
- SLA enforcement: Paid tiers can be enforced technically rather than just contractually.
How It Works: The Core Mechanics
At its simplest, a rate limiter sits in front of your API and tracks a counter for each client identity. That identity is usually one of:
- IP address (coarse, easy to spoof)
- API key or OAuth token (accurate, tied to an account)
- User ID (for authenticated endpoints)
- A combination, such as IP plus endpoint path
Every incoming request increments the counter. If the counter exceeds the configured limit before the time window expires, the server returns HTTP `429 Too Many Requests` and the request is dropped or queued. When the window resets, the counter clears and the client can send again.
Where that counter lives matters a lot. In-memory counters on a single server are fast but break the moment you scale horizontally. Production systems almost always store rate limit state in a shared cache, most commonly Redis, so every node in the cluster sees the same count.
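As a concrete (and deliberately simplified) illustration, here is a sketch of a shared fixed-window counter using the redis-py client; the key naming scheme, default limit, and window size are illustrative assumptions, not a production recipe:

```python
import time

import redis  # assumes the redis-py client is installed

r = redis.Redis()  # assumed local Redis instance; adjust for your deployment

def allow_request(client_id: str, limit: int = 100, window_s: int = 60) -> bool:
    """Shared fixed-window counter: every API node sees the same count."""
    # One key per client per window; the suffix changes every window_s seconds.
    key = f"ratelimit:{client_id}:{int(time.time() // window_s)}"
    count = r.incr(key)          # atomic increment; creates the key at 1
    if count == 1:
        r.expire(key, window_s)  # let Redis discard the key after the window ends
    return count <= limit
```

Because `INCR` is atomic, every node in the cluster can call this against the same Redis instance without a race on the count.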
Common Rate Limiting Algorithms
The algorithm you pick determines how "bursty" traffic is handled. Each has real trade-offs.
| Algorithm | How It Works | Best For | Weakness |
|---|---|---|---|
| Fixed Window | Counter resets at fixed clock intervals (e.g., every minute on the minute) | Simple quota enforcement | Allows 2x the limit at window boundaries |
| Sliding Window Log | Stores a timestamp for every request; counts only those within the last N seconds | Accurate per-client limits | High memory usage at scale |
| Sliding Window Counter | Approximates the sliding window using two fixed-window buckets and a weighted average | Balanced accuracy and memory efficiency | Slightly approximate, not exact |
| Token Bucket | A bucket fills with tokens at a steady rate; each request consumes one token | Allowing controlled bursts | Burst can still spike server load |
| Leaky Bucket | Requests enter a queue and are processed at a fixed outflow rate | Smoothing traffic to a downstream service | Adds latency; queue can overflow |
The token bucket is the most widely deployed algorithm in real APIs. GitHub, Stripe, and AWS all use variants of it because it naturally accommodates short bursts (a user clicking several things quickly) without allowing sustained floods.
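To make the mechanics concrete, here is a minimal single-process token bucket sketch; the rate and capacity values are arbitrary, and a real deployment would keep this state in a shared store as noted above:

```python
import time

class TokenBucket:
    """Minimal token bucket: refills at `rate` tokens/sec up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate              # steady refill rate (tokens per second)
        self.capacity = capacity      # maximum burst size
        self.tokens = capacity        # start full so an initial burst succeeds
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill based on elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1          # each request consumes one token
            return True
        return False                  # bucket empty: reject (or queue) the request

# A client idle for a while accumulates tokens and can burst up to `capacity`
# requests at once, then settles back to the steady `rate`.
bucket = TokenBucket(rate=10, capacity=50)
print(bucket.allow())  # True while tokens remain
```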
Rate Limit HTTP Headers You Will Actually See
When an API enforces rate limiting, it communicates the current state through response headers. There is no single universal standard, but two conventions dominate:
Legacy convention (used by Twitter/X, GitHub v3, and many others):
```
X-RateLimit-Limit: 60
X-RateLimit-Remaining: 43
X-RateLimit-Reset: 1719878400
```
IETF draft standard (RateLimit header fields, draft-ietf-httpapi-ratelimit-headers):
```
RateLimit-Limit: 60
RateLimit-Remaining: 43
RateLimit-Reset: 17
```
The key difference: the legacy `X-RateLimit-Reset` is a Unix timestamp, while the IETF version is seconds until reset. Always check the docs for whichever API you are integrating.
When the limit is hit, the server should also return a `Retry-After` header telling the client exactly how many seconds to wait before trying again. Not every API sends it, but well-behaved ones do.
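Putting the two conventions together, a client might inspect the headers along these lines (a rough sketch; the URL is a placeholder and the exact header names vary by API):

```python
import requests

response = requests.get("https://api.example.com/v1/items")  # placeholder URL

# Check the legacy convention first, then fall back to the IETF draft names.
remaining = response.headers.get("X-RateLimit-Remaining",
                                 response.headers.get("RateLimit-Remaining"))
if remaining is not None and int(remaining) == 0:
    # Legacy X-RateLimit-Reset is a Unix timestamp; Retry-After (when present)
    # is already a plain number of seconds to wait.
    wait = int(response.headers.get("Retry-After", 1))
    print(f"Rate limited; waiting {wait}s before the next request")
```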
Quota Management vs. Rate Limiting
These two terms are related but not identical, and confusing them causes real integration bugs.
- Rate limiting controls the speed of requests: 100 requests per minute, 10 requests per second.
- Quota management controls the volume over a longer period: 10,000 requests per day, 1 million API calls per month.
A single API can enforce both simultaneously. Google Maps Platform, for example, applies per-second rate limits AND monthly quota caps. You can be well within your monthly quota but still get throttled for sending 50 requests in one second. Hitting the monthly quota returns a different error code (often 403 with a quota-exceeded message) than hitting the per-second rate limit (429).
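A client that needs to tell the two failure modes apart might branch on the status code, roughly like this (illustrative only; the exact codes and message wording differ between providers):

```python
import requests

def classify_limit_error(response: requests.Response) -> str:
    """Illustrative sketch: always confirm the real codes in the API's docs."""
    if response.status_code == 429:
        return "rate-limited"        # slow down and retry with backoff
    if response.status_code == 403 and "quota" in response.text.lower():
        return "quota-exhausted"     # retrying won't help until the quota resets
    return "other"
```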
Handling 429 Too Many Requests in Your Code
Getting a 429 is not a fatal error. It is the API telling you to slow down. The correct response is exponential backoff with jitter.
Here is a minimal Python example:
```python
import time
import random
import requests

def call_with_backoff(url, headers, max_retries=5):
    for attempt in range(max_retries):
        response = requests.get(url, headers=headers)
        if response.status_code == 429:
            # Prefer the server's Retry-After; fall back to exponential backoff.
            retry_after = int(response.headers.get("Retry-After", 2 ** attempt))
            # Jitter spreads out retries from many clients.
            jitter = random.uniform(0, 1)
            time.sleep(retry_after + jitter)
            continue
        response.raise_for_status()
        return response
    raise Exception("Max retries exceeded")
```
Key points in this pattern:
- Read `Retry-After` first. If the API provides it, use it instead of guessing.
- Add random jitter so multiple clients don't all retry at the exact same moment (the "thundering herd" problem).
- Cap the number of retries. An infinite retry loop is itself a form of abuse.
- Log the 429 responses. Frequent rate limiting is a signal you need to cache results, batch requests, or upgrade your plan.
The MDN Web Docs entry for HTTP 429 has a concise reference for the status code semantics if you want the spec-level detail.
Real-World Rate Limit Examples
Looking at how major APIs implement this makes the concepts concrete:
| API | Rate Limit | Scope | Notes |
|---|---|---|---|
| GitHub REST API | 5,000 requests/hour (authenticated) | Per token | 60/hour unauthenticated |
| Stripe | 100 read / 100 write requests per second | Per account | Token bucket; returns 429 with Retry-After |
| Twitter/X v2 | Varies by endpoint (e.g., 15 requests/15 min for user lookup) | Per app or per user | 15-minute fixed windows |
| OpenAI API | Tier-based (e.g., 500 RPM on Tier 1) | Per organization | Separate token-per-minute limits also apply |
| Cloudflare Workers | 100,000 requests/day (free tier) | Per account | Quota-style daily cap, not per-second throttle |
Best Practices for API Consumers and Providers
Whether you are building an API or consuming one, the same principles apply on both sides of the fence.
If you are consuming an API:
- Cache responses aggressively. The cheapest request is one you never send.
- Batch requests wherever the API supports it (e.g., fetch 100 records in one call instead of 100 individual calls).
- Use webhooks or server-sent events instead of polling when the API offers them.
- Monitor your `X-RateLimit-Remaining` header proactively, not just when you hit 429.
- Implement circuit breakers in long-running services so a rate-limited API doesn't cascade into a full application failure (a minimal sketch follows this list).
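As an illustration of that last point, here is one minimal way a circuit breaker around a rate-limited API could look; the threshold and cooldown values are arbitrary assumptions:

```python
import time

class CircuitBreaker:
    """Open the circuit after `threshold` consecutive 429s; skip calls while open."""

    def __init__(self, threshold: int = 5, cooldown_s: float = 30.0):
        self.threshold = threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at = None

    def can_call(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.cooldown_s:
            self.opened_at = None     # cooldown over: close the circuit again
            self.failures = 0
            return True
        return False                  # still open: fail fast instead of hammering the API

    def record(self, rate_limited: bool) -> None:
        if not rate_limited:
            self.failures = 0
            return
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = time.monotonic()
```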
If you are building an API:
- Apply limits at the API gateway level (Kong, AWS API Gateway, Nginx), not inside each microservice.
- Differentiate limits by tier: free users get 100 RPM, paid users get 1,000 RPM.
- Always return `Retry-After` in 429 responses. Clients that respect it will back off cleanly (see the sketch after this list).
- Log rate limit events. A client hitting limits constantly either has a bug or is a bad actor.
- Document your limits clearly. Undocumented limits are a support nightmare.
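Tying the tier and `Retry-After` points together, a gateway-level check might look roughly like this sketch (the tier table, field names, and `ClientState` type are all illustrative; real tiers would come from your billing system):

```python
from dataclasses import dataclass

# Illustrative tier table, not a recommendation for specific numbers.
TIER_LIMITS_RPM = {"free": 100, "paid": 1000}

@dataclass
class ClientState:
    tier: str
    requests_this_minute: int
    seconds_until_reset: int

def check_limit(client: ClientState):
    """Gateway-style check: pick the limit by tier, answer 429 with Retry-After."""
    limit = TIER_LIMITS_RPM.get(client.tier, TIER_LIMITS_RPM["free"])
    if client.requests_this_minute >= limit:
        # Including Retry-After lets well-behaved clients back off cleanly.
        return 429, {"Retry-After": str(client.seconds_until_reset)}
    return 200, {}

print(check_limit(ClientState("free", 100, 12)))  # (429, {'Retry-After': '12'})
```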
Frequently Asked Questions
What is the difference between rate limiting and throttling?
Rate limiting hard-caps the number of requests in a window and rejects anything over the limit with a 429 error. Throttling is softer: it slows down excess requests by queuing them or adding deliberate delays rather than outright rejecting them. In practice, many developers use the terms interchangeably, and some APIs do both depending on how far over the limit you are.
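To make the "softer" behavior concrete, here is a minimal sketch of client-side throttling that spaces calls out to a target rate instead of rejecting them; the class and rate are illustrative:

```python
import time

class Throttle:
    """Soft throttling: delay calls so they never exceed `rate` per second."""

    def __init__(self, rate: float):
        self.min_interval = 1.0 / rate
        self.next_allowed = time.monotonic()

    def wait(self) -> None:
        now = time.monotonic()
        if now < self.next_allowed:
            time.sleep(self.next_allowed - now)  # delay instead of rejecting
        self.next_allowed = max(now, self.next_allowed) + self.min_interval

throttle = Throttle(rate=5)  # space calls to roughly 5 per second
for _ in range(3):
    throttle.wait()
    # ... make the API call here ...
```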
Why am I hitting rate limits when my request volume seems low?
The most common culprit is a burst of concurrent requests arriving at the same instant. Even if your total volume is fine, sending 50 requests simultaneously can trigger a per-second limit. Other causes include shared API keys across multiple services, a fixed-window boundary issue where two windows overlap, or undocumented secondary limits on specific endpoints. Add request logging to see the exact timing of your calls.
How does rate limiting protect against DoS and DDoS attacks?
By capping requests per IP or per API key, rate limiting ensures no single source can monopolize server resources. A 429 response is cheap to generate compared to processing a full request, so the server stays responsive even under heavy attack traffic. That said, rate limiting alone is not sufficient against large-scale DDoS. It works best as one layer in a broader defense that includes CDN-level filtering and infrastructure-level traffic scrubbing.
What should your code do when it hits a rate limit?
The standard response is HTTP 429 Too Many Requests. Your code should not treat it as a permanent failure. Read the Retry-After header if present and wait that many seconds before retrying. If the header is missing, use exponential backoff starting at 1-2 seconds and doubling each attempt. Always add random jitter to avoid synchronized retries from multiple clients hitting the server at the same moment.
Can different endpoints have different rate limits?
Yes, and this is actually the recommended approach for most APIs. Expensive endpoints like search or report generation often get tighter limits than simple read endpoints. Twitter's API is a good example: the user lookup endpoint has a different limit than the tweet search endpoint. Per-endpoint limits let you protect your most resource-intensive operations without unnecessarily restricting lightweight calls.
How does the token bucket algorithm work?
The token bucket algorithm imagines a bucket that fills with tokens at a fixed rate, say 10 tokens per second. Each API request consumes one token. If the bucket has tokens, the request goes through. If it is empty, the request is rejected. The key advantage is that it allows short bursts: if a client was idle for 5 seconds, it accumulates 50 tokens and can fire 50 requests at once. This matches real user behavior better than a rigid fixed-window counter.