API rate limiting workflow: detect 429s, throttle, and retry safely

By Chris Moen • Published 2026-02-24

Looking for an API rate limiting workflow? Here’s a practical pattern: detect limits via 429 responses and headers, throttle with a token or leaky bucket, and retry with backoff and jitter. This guide also covers how to orchestrate the pattern reliably in Breyta.

Quick answer: An effective API rate limiting workflow does three things. It continuously reads provider headers (for example, remaining, reset, and Retry-After). It proactively throttles requests with a token or leaky bucket and concurrency caps. When it hits a limit, it retries with exponential backoff and jitter, always honoring Retry-After, so it recovers without 429 storms.

Definition: API rate limiting restricts how many requests a client can make in a set time window.

What is API rate limiting?

API rate limiting controls request volume in a time window. Providers use it to protect resources and ensure fair use. Your client should adapt to each API’s rules.

Limits may apply per token, user, IP, or endpoint.

Windows can be fixed or sliding.

Responses often include 429 Too Many Requests and headers with guidance.

Why you need an API rate limiting workflow

A workflow prevents outages and bad user experiences. It also helps you comply with provider rules.

Avoid cascading 429s and retry storms.

Keep queues short and work predictable under load.

Detecting when you are rate limited

Detect limits fast and decide whether to retry, delay, or drop.

Look for HTTP 429.

Read Retry-After if present.

Read remaining and reset headers (for example, X-RateLimit-Remaining and X-RateLimit-Reset).

Treat 5xx with care. Some APIs signal overload with a 503 and a Retry-After header; when the API signals to slow down, apply backoff instead of retrying immediately.
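As a rough illustration of the checks above, the sketch below classifies a response from its status code and Retry-After header. The function names and the category labels ("rate_limited", "slow_down", and so on) are hypothetical, chosen only for this example. Retry-After is parsed in both forms HTTP allows: a number of seconds or an HTTP-date.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Mapping, Optional, Tuple


def parse_retry_after(value: Optional[str], now: Optional[datetime] = None) -> Optional[float]:
    """Return the Retry-After delay in seconds, or None if absent or unparseable."""
    if value is None:
        return None
    try:
        # Form 1: a whole number of seconds, e.g. "120".
        return max(0.0, float(int(value)))
    except ValueError:
        pass
    try:
        # Form 2: an HTTP-date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


def classify(status: int, headers: Mapping[str, str]) -> Tuple[str, Optional[float]]:
    """Map a response to a (hypothetical) category plus a suggested wait in seconds."""
    lowered = {name.lower(): value for name, value in headers.items()}
    retry_after = parse_retry_after(lowered.get("retry-after"))
    if status == 429:
        return "rate_limited", retry_after
    if status == 503 and retry_after is not None:
        return "slow_down", retry_after
    if 500 <= status < 600:
        return "server_error", None
    return "pass_through", None
```

A caller could then decide to retry, delay, or drop based on the category, for example `classify(429, {"Retry-After": "5"})` yields `("rate_limited", 5.0)`.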

Headers to read and cache

Read limit, remaining, reset, and Retry-After. Cache them briefly to drive throttling.

Limit shows quotas per window.

Remaining shows proximity to the cap.

Reset tells you when the next window starts. Depending on the provider, it is a timestamp or a number of seconds until the reset.

Retry-After tells you how long to wait before trying again.
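One way to read and briefly cache these values is a small snapshot object, sketched below. It assumes the X-RateLimit-* header names and assumes Reset is an epoch timestamp in seconds; real providers differ on both, so check the API you call. The class and function names, and the 30-second freshness window, are illustrative choices.

```python
import time
from dataclasses import dataclass
from typing import Callable, Mapping, Optional


@dataclass(frozen=True)
class RateLimitSnapshot:
    limit: Optional[int]
    remaining: Optional[int]
    reset_epoch: Optional[float]  # assumption: epoch seconds, not a delta
    observed_at: float

    def is_fresh(self, now: float, max_age: float = 30.0) -> bool:
        """Treat the cached values as usable only for a short time."""
        return now - self.observed_at <= max_age

    def seconds_until_reset(self, now: float) -> Optional[float]:
        if self.reset_epoch is None:
            return None
        return max(0.0, self.reset_epoch - now)


def _parse(headers: Mapping[str, str], name: str, cast: Callable):
    """Return cast(header value), or None if the header is missing or malformed."""
    raw = headers.get(name)
    if raw is None:
        return None
    try:
        return cast(raw)
    except ValueError:
        return None


def snapshot_from_headers(headers: Mapping[str, str], now: Optional[float] = None) -> RateLimitSnapshot:
    now = time.time() if now is None else now
    # Header names are case-insensitive, so normalize before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    return RateLimitSnapshot(
        limit=_parse(lowered, "x-ratelimit-limit", int),
        remaining=_parse(lowered, "x-ratelimit-remaining", int),
        reset_epoch=_parse(lowered, "x-ratelimit-reset", float),
        observed_at=now,
    )
```

Keeping the snapshot immutable and timestamped makes it easy to discard stale data instead of throttling on values from a window that has already reset.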

Safe retry strategy (exponential backoff with jitter)

Retry safely: back off exponentially, add jitter, cap the maximum delay, and stop after a few tries. Fail fast if the call is not worth waiting on.

Retry only idempotent or otherwise safe calls.

Use exponential backoff with jitter.

Respect Retry-After and reset times.
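The rules above can be sketched as two small helpers. This is a minimal example under stated assumptions: the base delay, cap, and attempt limit are arbitrary defaults, the set of idempotent methods follows standard HTTP semantics, and jitter is drawn uniformly from zero up to the current backoff value. The function names are hypothetical.

```python
import random
from typing import Callable, Optional

# HTTP methods defined as idempotent by the HTTP specification.
IDEMPOTENT_METHODS = {"GET", "HEAD", "PUT", "DELETE", "OPTIONS"}


def backoff_delay(
    attempt: int,
    retry_after: Optional[float] = None,
    base: float = 0.5,
    cap: float = 30.0,
    rng: Callable[[], float] = random.random,
) -> float:
    """Seconds to wait after the zero-based `attempt` that just failed."""
    # Exponential growth, bounded by the cap: 0.5, 1, 2, 4, ... up to 30.
    backoff = min(cap, base * 2 ** attempt)
    # Jitter spreads clients out so they do not all retry at the same instant.
    jitter = rng() * backoff
    # Never wait less than the server asked for.
    return max(retry_after or 0.0, backoff) + jitter


def should_retry(method: str, attempt: int, max_attempts: int = 5) -> bool:
    """Retry only idempotent methods, and only while attempts remain."""
    return method.upper() in IDEMPOTENT_METHODS and attempt + 1 < max_attempts
```

Note that the cap bounds the exponential term, not the jitter, so the total wait can reach up to twice the cap. A POST that carries an idempotency key could be treated as safe too, but that depends on the provider.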

Proactive throttling before sending requests

Throttle proactively to avoid 429s. Use a client-side token bucket or leaky bucket.

Set per-token and per-endpoint rates.

Track concurrency and cap in-flight calls.

Adapt send rates using live header data.
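A client-side token bucket along these lines might look like the sketch below. It is not thread-safe and covers a single bucket; a per-token or per-endpoint setup would keep one instance per key. The `adapt` method and the `min_rate` floor are illustrative assumptions about how live header data could feed back into the send rate.

```python
import time
from typing import Callable


class TokenBucket:
    """Allows bursts up to `capacity` and a sustained `rate` of tokens per second."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic, min_rate: float = 0.01):
        self.max_rate = rate      # configured ceiling; adapt() never exceeds it
        self.min_rate = min_rate  # floor that keeps the wait computation finite
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = capacity
        self._updated = clock()

    def _refill(self) -> None:
        now = self._clock()
        elapsed = now - self._updated
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        self._updated = now

    def try_acquire(self, tokens: float = 1.0) -> float:
        """Take tokens and return 0.0, or return the seconds to wait before retrying."""
        self._refill()
        if self._tokens >= tokens:
            self._tokens -= tokens
            return 0.0
        return (tokens - self._tokens) / self.rate

    def adapt(self, remaining: int, seconds_until_reset: float) -> None:
        """Spread the provider's remaining quota over the rest of the window."""
        if seconds_until_reset > 0:
            target = remaining / seconds_until_reset
            self.rate = min(self.max_rate, max(self.min_rate, target))
```

A caller would sleep for the returned number of seconds and call `try_acquire` again. Injecting the clock keeps the bucket easy to test without real waiting.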

Queueing, batching, and collapsing requests

Queue excess requests instead of dropping them. Batch when supported. Collapse duplicate work.

Use small, bounded queues per priority.

Batch compatible reads within API limits.

Coalesce identical requests and fan out the result.
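Coalescing can be sketched with asyncio: the first caller for a given key starts the request, and later callers with the same key await the same in-flight task. The `SingleFlight` class name and the idea of keying on a string such as "GET /users/1" are assumptions made for this example, not part of any particular library.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, Hashable


class SingleFlight:
    """Runs one fetch per key at a time and fans the result out to all waiters."""

    def __init__(self) -> None:
        self._inflight: Dict[Hashable, "asyncio.Future[Any]"] = {}

    async def do(self, key: Hashable, fetch: Callable[[], Awaitable[Any]]) -> Any:
        task = self._inflight.get(key)
        if task is None:
            # First caller for this key: start the request and remember it.
            task = asyncio.ensure_future(fetch())
            self._inflight[key] = task
            # Forget the task once it finishes so later calls fetch fresh data.
            task.add_done_callback(lambda _done: self._inflight.pop(key, None))
        # shield() keeps one cancelled waiter from cancelling the shared request.
        return await asyncio.shield(task)


async def demo() -> None:
    calls = 0

    async def fetch_user() -> str:
        nonlocal calls
        calls += 1
        await asyncio.sleep(0)  # stand-in for a real HTTP call
        return "payload"

    flights = SingleFlight()
    results = await asyncio.gather(*(flights.do("GET /users/1", fetch_user) for _ in range(3)))
    print(results, "upstream calls:", calls)


if __name__ == "__main__":
    asyncio.run(demo())
```

Three concurrent callers share one upstream request here, which saves quota on hot keys. Errors propagate to every waiter, so each caller still needs its own error handling.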

Concurrency and prioritization

Cap concurrency and reserve capacity for critical paths.

Use a global cap plus per-endpoint caps.

Create priority lanes; promote aging tasks if needed.

Pause low-priority queues when limits are tight.
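The sketch below shows one possible shape for priority lanes with a global in-flight cap, aging-based promotion, and a pause for low-priority work when quota is tight. The three lane constants, the 30-second aging threshold, and the `quota_tight` flag are hypothetical; per-endpoint caps are left out to keep the example short.

```python
import heapq
import itertools
from typing import Any, List, Optional, Tuple

CRITICAL, NORMAL, LOW = 0, 1, 2  # lower number means higher priority


class PriorityLanes:
    def __init__(self, max_in_flight: int, aging_seconds: float = 30.0) -> None:
        self.max_in_flight = max_in_flight
        self.aging_seconds = aging_seconds
        self.in_flight = 0
        # Entries are (priority, sequence, enqueued_at, task); the unique
        # sequence number keeps ordering stable and avoids comparing tasks.
        self._heap: List[Tuple[int, int, float, Any]] = []
        self._seq = itertools.count()

    def submit(self, task: Any, priority: int, now: float) -> None:
        heapq.heappush(self._heap, (priority, next(self._seq), now, task))

    def _promote_aged(self, now: float) -> None:
        """Move tasks that waited too long up one lane and restart their timer."""
        promoted = []
        for priority, seq, enqueued_at, task in self._heap:
            if priority > CRITICAL and now - enqueued_at >= self.aging_seconds:
                promoted.append((priority - 1, seq, now, task))
            else:
                promoted.append((priority, seq, enqueued_at, task))
        heapq.heapify(promoted)
        self._heap = promoted

    def next_task(self, now: float, quota_tight: bool = False) -> Optional[Any]:
        """Return the next runnable task, or None if capped, paused, or empty."""
        if self.in_flight >= self.max_in_flight or not self._heap:
            return None
        self._promote_aged(now)
        priority = self._heap[0][0]
        if quota_tight and priority >= LOW:
            return None  # only low-priority work is left, and it is paused
        task = heapq.heappop(self._heap)[3]
        self.in_flight += 1
        return task

    def done(self) -> None:
        """Release a slot after a call finishes."""
        self.in_flight = max(0, self.in_flight - 1)
```

In use, a worker loop would call `next_task` with `quota_tight` derived from the latest remaining-quota reading, and call `done` when each request completes.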

UX and job runner behavior under limits

Inform users and degrade gracefully. Keep work durable.

Surface remaining quota and expected delays where appropriate.

Persist queued work so it survives restarts.

Prefer partial progress over hard failures during limit windows.

Putting it together: a simple API rate limiting workflow

Before each call: check a local token or leaky bucket and a concurrency guard. If the bucket is empty or the guard is full, delay until a token or slot frees up.

Send the request with context for timeout and cancellation.

After each response: update local counters from limit/remaining/reset headers.

On 429 or explicit slow-down: compute delay = max(Retry-After, backoff(current_attempt)) + jitter, enqueue the request, and wait.

On success: release slots and, optionally, raise the send rate slightly if the window has headroom.

On repeated failures or non-idempotent operations: stop retrying and surface a clear, actionable error.
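The steps above can be combined into a single call wrapper, sketched here under several assumptions. The `send`, `acquire`, `sleep`, and `on_headers` callables are hypothetical hooks supplied by the caller: `acquire` returns the seconds to wait (0 when a token is available), and `send` returns a status code with lowercase header names. Retry-After is assumed to be a number of seconds. The rate acceleration on success is omitted.

```python
import random
import threading
from typing import Callable, Mapping, Tuple

Response = Tuple[int, Mapping[str, str]]


class RateLimitedError(Exception):
    """Raised when retries are exhausted or the call is unsafe to retry."""


def _retry_after_seconds(headers: Mapping[str, str]) -> float:
    try:
        return max(0.0, float(headers.get("retry-after", 0)))
    except ValueError:
        return 0.0  # assumption: ignore HTTP-date values in this sketch


def call_with_rate_limit(
    send: Callable[[], Response],
    *,
    idempotent: bool,
    acquire: Callable[[], float],
    slots: threading.BoundedSemaphore,
    sleep: Callable[[float], None],
    on_headers: Callable[[Mapping[str, str]], None] = lambda headers: None,
    max_attempts: int = 4,
    base: float = 0.5,
    cap: float = 30.0,
    rng: Callable[[], float] = random.random,
) -> Response:
    for attempt in range(max_attempts):
        # Before the call: wait for a token from the local bucket.
        wait = acquire()
        while wait > 0:
            sleep(wait)
            wait = acquire()
        # Concurrency guard: the slot is released as soon as the call returns.
        with slots:
            status, headers = send()
        # After the response: let the caller update counters from the headers.
        on_headers(headers)
        slow_down = status == 429 or (status == 503 and "retry-after" in headers)
        if not slow_down:
            return status, headers
        # Stop on non-idempotent operations or when attempts run out.
        if not idempotent or attempt + 1 >= max_attempts:
            raise RateLimitedError(
                f"gave up after {attempt + 1} attempt(s), last status {status}"
            )
        # delay = max(Retry-After, backoff(attempt)) + jitter
        backoff = min(cap, base * 2 ** attempt)
        sleep(max(_retry_after_seconds(headers), backoff) + rng() * backoff)
    raise RateLimitedError("max_attempts must be at least 1")
```

In production, `sleep` would typically be `time.sleep` and `acquire` a method on a token bucket. Passing them in as parameters keeps the wrapper testable without real delays.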

Implementing this workflow with Breyta

Breyta is a workflow orchestration platform for coding agents. It is built for multi-step automations, long-running jobs, approval-heavy flows, and agent orchestration.

Deterministic runtime behavior helps you implement predictable throttling, waits, and backoff timers.

Explicit waits and approvals let you pause flows until a reset window passes or until a manual review approves a high-priority override.

Versioned flow definitions and reusable templates make your rate limiting pattern easy to roll out and iterate safely.

Clear run history helps you audit 429s, delays, and success rates over time.

Resource references help you manage provider credentials or per-token quotas cleanly.

Orchestrate local agents or VM-backed agents over SSH to run throttled workers close to the APIs you call.

Breyta is the workflow layer around the coding agent you already use. Use it to run this rate limiting workflow reliably across services and environments.