API rate limiting is a technique that controls how many requests a client can make to an API within a given time window. When a client exceeds that limit, the server rejects additional requests until the window resets. It is the primary mechanism services use for request throttling, quota management, and protection against abuse or accidental overload.
Why Rate Limiting Exists
An API without rate limiting is an open invitation to chaos. A single misbehaving client, a buggy retry loop, or a deliberate denial-of-service attack can consume all available server resources and bring down the service for everyone. Rate limiting solves several distinct problems at once:
- DoS and DDoS prevention: Caps the damage a single source can cause by flooding the server with traffic.
- Fair usage: Prevents one heavy user from starving other clients on shared infrastructure.
- Cost control: Cloud compute, database queries, and third-party calls all cost money. Limits keep bills predictable.
- API protection: Slows down credential-stuffing attacks and automated scraping that rely on high request volumes.
- SLA enforcement: Paid tiers can be enforced technically rather than just contractually.
How It Works: The Core Mechanics
At its simplest, a rate limiter sits in front of your API and tracks a counter for each client identity. That identity is usually one of:
- IP address (coarse, easy to spoof)
- API key or OAuth token (accurate, tied to an account)
- User ID (for authenticated endpoints)
- A combination, such as IP plus endpoint path
Every incoming request increments the counter. If the counter exceeds the configured limit before the time window expires, the server returns HTTP `429 Too Many Requests` and the request is dropped or queued. When the window resets, the counter clears and the client can send again.
Where that counter lives matters a lot. In-memory counters on a single server are fast but break the moment you scale horizontally. Production systems almost always store rate limit state in a shared cache, most commonly Redis, so every node in the cluster sees the same count.
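As a concrete (and deliberately simplified) illustration, here is a sketch of a shared fixed-window counter using the redis-py client; the key naming scheme, default limit, and window size are illustrative assumptions, not a production recipe:

```python
import time

import redis  # assumes the redis-py client is installed

r = redis.Redis()  # assumed local Redis instance; adjust for your deployment

def allow_request(client_id: str, limit: int = 100, window_s: int = 60) -> bool:
    """Shared fixed-window counter: every API node sees the same count."""
    # One key per client per window; the suffix changes every window_s seconds.
    key = f"ratelimit:{client_id}:{int(time.time() // window_s)}"
    count = r.incr(key)          # atomic increment; creates the key at 1
    if count == 1:
        r.expire(key, window_s)  # let Redis discard the key after the window ends
    return count <= limit
```

Because `INCR` is atomic, every node in the cluster can call this against the same Redis instance without a race on the count.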
Common Rate Limiting Algorithms
The algorithm you pick determines how "bursty" traffic is handled. Each has real trade-offs.
| Algorithm | How It Works | Best For | Weakness |
|---|---|---|---|
| Fixed Window | Counter resets at fixed clock intervals (e.g., every minute on the minute) | Simple quota enforcement | Allows 2x the limit at window boundaries |
| Sliding Window Log | Stores a timestamp for every request; counts only those within the last N seconds | Accurate per-client limits | High memory usage at scale |
| Sliding Window Counter | Approximates the sliding window using two fixed-window buckets and a weighted average | Balanced accuracy and memory efficiency | Slightly approximate, not exact |
| Token Bucket | A bucket fills with tokens at a steady rate; each request consumes one token | Allowing controlled bursts | Burst can still spike server load |
| Leaky Bucket | Requests enter a queue and are processed at a fixed outflow rate | Smoothing traffic to a downstream service | Adds latency; queue can overflow |
The token bucket is the most widely deployed algorithm in real APIs. GitHub, Stripe, and AWS all use variants of it because it naturally accommodates short bursts (a user clicking several things quickly) without allowing sustained floods.
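To make the mechanics concrete, here is a minimal single-process token bucket sketch; the rate and capacity values are arbitrary, and a real deployment would keep this state in a shared store as noted above:

```python
import time

class TokenBucket:
    """Minimal token bucket: refills at `rate` tokens/sec up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate              # steady refill rate (tokens per second)
        self.capacity = capacity      # maximum burst size
        self.tokens = capacity        # start full so an initial burst succeeds
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill based on elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1          # each request consumes one token
            return True
        return False                  # bucket empty: reject (or queue) the request

# A client idle for a while accumulates tokens and can burst up to `capacity`
# requests at once, then settles back to the steady `rate`.
bucket = TokenBucket(rate=10, capacity=50)
print(bucket.allow())  # True while tokens remain
```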
Rate Limit HTTP Headers You Will Actually See
When an API enforces rate limiting, it communicates the current state through response headers. There is no single universal standard, but two conventions dominate:
Legacy convention (used by Twitter/X, GitHub v3, and many others):
```
X-RateLimit-Limit: 60
X-RateLimit-Remaining: 43
X-RateLimit-Reset: 1719878400
```
IETF draft standard (RateLimit header fields, draft-ietf-httpapi-ratelimit-headers):
```
RateLimit-Limit: 60
RateLimit-Remaining: 43
RateLimit-Reset: 17
```
The key difference: the legacy `X-RateLimit-Reset` is a Unix timestamp, while the IETF version is seconds until reset. Always check the docs for whichever API you are integrating.
When the limit is hit, the server should also return a `Retry-After` header telling the client exactly how many seconds to wait before trying again. Not every API sends it, but well-behaved ones do.
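Putting the two conventions together, a client might inspect the headers along these lines (a rough sketch; the URL is a placeholder and the exact header names vary by API):

```python
import requests

response = requests.get("https://api.example.com/v1/items")  # placeholder URL

# Check the legacy convention first, then fall back to the IETF draft names.
remaining = response.headers.get("X-RateLimit-Remaining",
                                 response.headers.get("RateLimit-Remaining"))
if remaining is not None and int(remaining) == 0:
    # Legacy X-RateLimit-Reset is a Unix timestamp; Retry-After (when present)
    # is already a plain number of seconds to wait.
    wait = int(response.headers.get("Retry-After", 1))
    print(f"Rate limited; waiting {wait}s before the next request")
```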
Quota Management vs. Rate Limiting
These two terms are related but not identical, and confusing them causes real integration bugs.
- Rate limiting controls the speed of requests: 100 requests per minute, 10 requests per second.
- Quota management controls the volume over a longer period: 10,000 requests per day, 1 million API calls per month.
A single API can enforce both simultaneously. Google Maps Platform, for example, applies per-second rate limits AND monthly quota caps. You can be well within your monthly quota but still get throttled for sending 50 requests in one second. Hitting the monthly quota returns a different error code (often 403 with a quota-exceeded message) than hitting the per-second rate limit (429).
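A client that needs to tell the two failure modes apart might branch on the status code, roughly like this (illustrative only; the exact codes and message wording differ between providers):

```python
import requests

def classify_limit_error(response: requests.Response) -> str:
    """Illustrative sketch: always confirm the real codes in the API's docs."""
    if response.status_code == 429:
        return "rate-limited"        # slow down and retry with backoff
    if response.status_code == 403 and "quota" in response.text.lower():
        return "quota-exhausted"     # retrying won't help until the quota resets
    return "other"
```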
Handling 429 Too Many Requests in Your Code
Getting a 429 is not a fatal error. It is the API telling you to slow down. The correct response is exponential backoff with jitter.
Here is a minimal Python example:
```python
import time
import random
import requests

def call_with_backoff(url, headers, max_retries=5):
    for attempt in range(max_retries):
        response = requests.get(url, headers=headers)
        if response.status_code == 429:
            # Prefer the server's Retry-After; fall back to exponential backoff.
            retry_after = int(response.headers.get("Retry-After", 2 ** attempt))
            # Jitter spreads out retries from many clients.
            jitter = random.uniform(0, 1)
            time.sleep(retry_after + jitter)
            continue
        response.raise_for_status()
        return response
    raise Exception("Max retries exceeded")
```
Key points in this pattern:
- Read `Retry-After` first. If the API provides it, use it instead of guessing.
- Add random jitter so multiple clients don't all retry at the exact same moment (the "thundering herd" problem).
- Cap the number of retries. An infinite retry loop is itself a form of abuse.
- Log the 429 responses. Frequent rate limiting is a signal you need to cache results, batch requests, or upgrade your plan.
The MDN Web Docs entry for HTTP 429 has a concise reference for the status code semantics if you want the spec-level detail.
Real-World Rate Limit Examples
Looking at how major APIs implement this makes the concepts concrete:
| API | Rate Limit | Scope | Notes |
|---|---|---|---|
| GitHub REST API | 5,000 requests/hour (authenticated) | Per token | 60/hour unauthenticated |
| Stripe | 100 read / 100 write requests per second | Per account | Token bucket; returns 429 with Retry-After |
| Twitter/X v2 | Varies by endpoint (e.g., 15 requests/15 min for user lookup) | Per app or per user | 15-minute fixed windows |
| OpenAI API | Tier-based (e.g., 500 RPM on Tier 1) | Per organization | Separate token-per-minute limits also apply |
| Cloudflare Workers | 100,000 requests/day (free tier) | Per account | Quota-style daily cap, not per-second throttle |
Best Practices for API Consumers and Providers
Whether you are building an API or consuming one, the same principles apply on both sides of the fence.
If you are consuming an API:
- Cache responses aggressively. The cheapest request is one you never send.
- Batch requests wherever the API supports it (e.g., fetch 100 records in one call instead of 100 individual calls).
- Use webhooks or server-sent events instead of polling when the API offers them.
- Monitor your `X-RateLimit-Remaining` header proactively, not just when you hit 429.
- Implement circuit breakers in long-running services so a rate-limited API doesn't cascade into a full application failure (a minimal sketch follows this list).
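As an illustration of that last point, here is one minimal way a circuit breaker around a rate-limited API could look; the threshold and cooldown values are arbitrary assumptions:

```python
import time

class CircuitBreaker:
    """Open the circuit after `threshold` consecutive 429s; skip calls while open."""

    def __init__(self, threshold: int = 5, cooldown_s: float = 30.0):
        self.threshold = threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at = None

    def can_call(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.cooldown_s:
            self.opened_at = None     # cooldown over: close the circuit again
            self.failures = 0
            return True
        return False                  # still open: fail fast instead of hammering the API

    def record(self, rate_limited: bool) -> None:
        if not rate_limited:
            self.failures = 0
            return
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = time.monotonic()
```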
If you are building an API:
- Apply limits at the API gateway level (Kong, AWS API Gateway, Nginx), not inside each microservice.
- Differentiate limits by tier: free users get 100 RPM, paid users get 1,000 RPM.
- Always return `Retry-After` in 429 responses. Clients that respect it will back off cleanly (see the sketch after this list).
- Log rate limit events. A client hitting limits constantly either has a bug or is a bad actor.
- Document your limits clearly. Undocumented limits are a support nightmare.
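Tying the tier and `Retry-After` points together, a gateway-level check might look roughly like this sketch (the tier table, field names, and `ClientState` type are all illustrative; real tiers would come from your billing system):

```python
from dataclasses import dataclass

# Illustrative tier table, not a recommendation for specific numbers.
TIER_LIMITS_RPM = {"free": 100, "paid": 1000}

@dataclass
class ClientState:
    tier: str
    requests_this_minute: int
    seconds_until_reset: int

def check_limit(client: ClientState):
    """Gateway-style check: pick the limit by tier, answer 429 with Retry-After."""
    limit = TIER_LIMITS_RPM.get(client.tier, TIER_LIMITS_RPM["free"])
    if client.requests_this_minute >= limit:
        # Including Retry-After lets well-behaved clients back off cleanly.
        return 429, {"Retry-After": str(client.seconds_until_reset)}
    return 200, {}

print(check_limit(ClientState("free", 100, 12)))  # (429, {'Retry-After': '12'})
```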
Frequently Asked Questions
What is the difference between rate limiting and throttling?
Rate limiting hard-caps the number of requests in a window and rejects anything over the limit with a 429 error. Throttling is softer: it slows down excess requests by queuing them or adding deliberate delays rather than outright rejecting them. In practice, many developers use the terms interchangeably, and some APIs do both depending on how far over the limit you are.
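To make the "softer" behavior concrete, here is a minimal sketch of client-side throttling that spaces calls out to a target rate instead of rejecting them; the class and rate are illustrative:

```python
import time

class Throttle:
    """Soft throttling: delay calls so they never exceed `rate` per second."""

    def __init__(self, rate: float):
        self.min_interval = 1.0 / rate
        self.next_allowed = time.monotonic()

    def wait(self) -> None:
        now = time.monotonic()
        if now < self.next_allowed:
            time.sleep(self.next_allowed - now)  # delay instead of rejecting
        self.next_allowed = max(now, self.next_allowed) + self.min_interval

throttle = Throttle(rate=5)  # space calls to roughly 5 per second
for _ in range(3):
    throttle.wait()
    # ... make the API call here ...
```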
Why am I hitting rate limits when my request volume seems low?
The most common culprit is a burst of concurrent requests arriving at the same instant. Even if your total volume is fine, sending 50 requests simultaneously can trigger a per-second limit. Other causes include shared API keys across multiple services, a fixed-window boundary issue where two windows overlap, or undocumented secondary limits on specific endpoints. Add request logging to see the exact timing of your calls.
How does rate limiting protect against DoS and DDoS attacks?
By capping requests per IP or per API key, rate limiting ensures no single source can monopolize server resources. A 429 response is cheap to generate compared to processing a full request, so the server stays responsive even under heavy attack traffic. That said, rate limiting alone is not sufficient against large-scale DDoS. It works best as one layer in a broader defense that includes CDN-level filtering and infrastructure-level traffic scrubbing.
What should your code do when it hits a rate limit?
The standard response is HTTP 429 Too Many Requests. Your code should not treat it as a permanent failure. Read the Retry-After header if present and wait that many seconds before retrying. If the header is missing, use exponential backoff starting at 1-2 seconds and doubling each attempt. Always add random jitter to avoid synchronized retries from multiple clients hitting the server at the same moment.
Can different endpoints have different rate limits?
Yes, and this is actually the recommended approach for most APIs. Expensive endpoints like search or report generation often get tighter limits than simple read endpoints. Twitter's API is a good example: the user lookup endpoint has a different limit than the tweet search endpoint. Per-endpoint limits let you protect your most resource-intensive operations without unnecessarily restricting lightweight calls.
How does the token bucket algorithm work?
The token bucket algorithm imagines a bucket that fills with tokens at a fixed rate, say 10 tokens per second. Each API request consumes one token. If the bucket has tokens, the request goes through. If it is empty, the request is rejected. The key advantage is that it allows short bursts: if a client was idle for 5 seconds, it accumulates 50 tokens and can fire 50 requests at once. This matches real user behavior better than a rigid fixed-window counter.