DLQ Replay Workflow: Safe Redrive Patterns for SQS and Kafka

By Chris Moen • Published 2026-02-24

A practical guide to building a safe DLQ replay workflow—triage, redrive queues, rate limits, idempotency, and audit—with concrete patterns for Amazon SQS and Apache Kafka.

A safe dead-letter queue (DLQ) replay workflow lets you triage failures, fix root causes, and redeliver messages without risking loops, broken ordering, or downstream overload. This guide focuses on implementation details for Amazon SQS and Apache Kafka.

Quick answer: How to build a safe DLQ replay workflow

  • Triage first: sample messages, classify errors (transient vs non-retryable), and decide replay, discard, or patch-and-replay.
  • Use a dedicated replay path: publish to a replay queue/topic to avoid interleaving with live traffic.
  • Throttle replays: set batch sizes, in-flight limits, and backoff; pause on error spikes.
  • Preserve order and idempotency: keep keys/message groups and make consumers idempotent.
  • Audit everything: approvals for bulk actions, run history, and who did what/when.

What is a DLQ replay workflow?

A DLQ replay workflow is a controlled process to inspect, decide, and redeliver failed messages. It prevents infinite retry loops, protects ordering requirements, and avoids duplicate side effects through explicit guardrails and documented procedures.

  • Do not replay blind; always triage and scope.
  • Preserve keys or message groups when order matters.
  • Throttle redelivery to match downstream capacity.
  • Design for idempotency to safely handle duplicates.
  • Keep clear run history and approvals for replays.

Why DLQ replays go wrong

  • Blindly dumping DLQ messages into the hot path at full speed.
  • Breaking FIFO or partition order by interleaving with live traffic.
  • Replaying fundamentally invalid messages.
  • Missing rate limits, backpressure, or kill switches.
  • Non-idempotent consumers that multiply side effects.

Triage: Decide replay, discard, or patch

  • Identify error type: transient, non-retryable, or unknown.
  • Inspect headers/attributes for attempt counts, codes, and timestamps.
  • Verify if a code hotfix or data correction exists; test in staging.
  • Find clusters by key, partition, or message group to scope replays.
  • Sample before bulk actions; document reasoning and outcome.

Note: In some Kafka setups, only non-retryable errors should reach the DLQ. In others, DLQ is used after bounded retries. Choose and document your model.
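The classification step above can be sketched as a small lookup. The header names (`x-error-code`) and the specific error codes are assumptions for illustration—use whatever metadata your producers actually attach:

```python
# Minimal triage sketch: classify a failed message by its error metadata.
# The code sets below are placeholders; populate them from your own
# error taxonomy.

TRANSIENT = {"timeout", "throttled", "connection_reset", "503"}
NON_RETRYABLE = {"schema_mismatch", "validation_failed", "400"}

def triage(headers: dict) -> str:
    """Return 'replay', 'discard', or 'review' for a DLQ message."""
    code = headers.get("x-error-code", "").lower()
    if code in TRANSIENT:
        return "replay"   # likely to succeed after a retry or fix
    if code in NON_RETRYABLE:
        return "discard"  # invalid by design; archive instead
    return "review"       # unknown -> sample and inspect manually
```

A classifier like this only scopes the decision; a human still approves the resulting batch before anything is redelivered.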

Architecture: Decouple the replay path

Keep replay traffic separate from the live path and add explicit controls.

  • Flow: DLQ → triage job/UI → optional patch/transform → dedicated replay queue/topic → controlled consumers → audit log.
  • Preserve keys/message groups and add a replay marker in headers/attributes.
  • Do not push directly back into the hot path without throttling and backoff.

Amazon SQS: Implementing a DLQ replay workflow

Use redrive policies and a dedicated replay queue to protect production.

  • Set a redrive policy on the source queue that routes failed messages to a DLQ with longer retention than the source queue.
  • Choose a safe maxReceiveCount for the source queue to avoid loops.
  • For FIFO, preserve MessageGroupId to maintain order during replay.
  • Redrive from the DLQ to a separate replay queue to avoid interleaving with live traffic.
  • Control rate via batch size, consumer concurrency, and visibility timeouts.

Reference: See the Amazon SQS dead-letter queue documentation for configuration details and redrive behavior in the console and APIs (AWS SQS DLQ docs).
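As a sketch of the redrive step, the SQS message-move API (via boto3) can drain a DLQ into a separate replay queue at a capped rate. The queue ARNs and the rate value below are placeholders:

```python
def start_redrive(dlq_arn: str, replay_arn: str, rate: int = 10) -> str:
    """Start a throttled DLQ -> replay-queue move; returns a task handle
    that can be polled or cancelled later."""
    import boto3  # imported lazily so the sketch loads without AWS deps
    sqs = boto3.client("sqs")
    resp = sqs.start_message_move_task(
        SourceArn=dlq_arn,                  # the DLQ to drain
        DestinationArn=replay_arn,          # dedicated replay queue
        MaxNumberOfMessagesPerSecond=rate,  # throttle the move
    )
    return resp["TaskHandle"]
```

The same API family includes a cancel call (`cancel_message_move_task`), which pairs naturally with the kill-switch requirement: you can stop a bulk move mid-flight if downstream health degrades.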

Apache Kafka: Implementing a DLQ replay workflow

Use bounded retry topics and a DLQ topic, then replayer jobs that write to a replay topic.

  • Bound retries with retry topics; forward exhausted failures to a DLQ topic.
  • Add error metadata in headers; keep the original record key to preserve partitioning.
  • Build a replayer that reads from DLQ, optionally patches, and writes to a replay topic.
  • Throttle replays with batch sizes, in-flight limits, and backoff/jitter.
  • Make consumers idempotent by key or event ID to handle duplicates safely.
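A minimal replayer along these lines, sketched here with kafka-python (an assumption—any client works); the topic names, consumer group, and the `x-replayed` marker are illustrative:

```python
def replay_batch(bootstrap: str, dlq_topic: str, replay_topic: str,
                 max_messages: int = 100) -> int:
    """Move up to max_messages from the DLQ topic to the replay topic,
    preserving keys (and therefore partitioning) and tagging each record."""
    from kafka import KafkaConsumer, KafkaProducer  # lazy import
    consumer = KafkaConsumer(dlq_topic, bootstrap_servers=bootstrap,
                             group_id="dlq-replayer",
                             enable_auto_commit=False,
                             consumer_timeout_ms=5000)
    producer = KafkaProducer(bootstrap_servers=bootstrap)
    moved = 0
    for record in consumer:
        # Keep the original key so the replay topic partitions the same way,
        # and add a marker so consumers can distinguish replayed traffic.
        headers = list(record.headers or []) + [("x-replayed", b"1")]
        producer.send(replay_topic, key=record.key, value=record.value,
                      headers=headers)
        moved += 1
        if moved >= max_messages:
            break
    producer.flush()
    consumer.commit()  # commit offsets only after the producer has flushed
    consumer.close()
    return moved
```

Committing DLQ offsets only after the producer flush keeps the move at-least-once rather than lossy; duplicates are then the idempotent consumer's problem, which is the safer side to err on.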

Prevent poison pills and loops

  • Track and cap replay attempts per message; stop and flag beyond the cap.
  • Add denylists for known bad schemas, keys, or versions.
  • Require approvals for bulk replays; start with small scoped batches.
  • Alert on repeat failures and auto-pause on error spikes.
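The per-message cap can ride in a header that the replayer bumps on every pass. The header name and the cap value here are assumptions:

```python
# Sketch of a per-message replay cap: beyond the cap, park the message
# for manual review instead of replaying it again.

MAX_REPLAY_ATTEMPTS = 3

def next_action(headers: dict) -> tuple[str, dict]:
    """Return ('park', headers) past the cap, else ('replay', headers)
    with the attempt counter incremented."""
    attempts = int(headers.get("x-replay-attempts", 0))
    if attempts >= MAX_REPLAY_ATTEMPTS:
        return "park", headers  # stop and flag for a human
    bumped = {**headers, "x-replay-attempts": attempts + 1}
    return "replay", bumped
```

"Park" can mean a quarantine queue or topic with long retention; the point is that a message which keeps failing stops consuming replay capacity automatically.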

Rate limiting and downstream protection

  • Limit in-flight messages per consumer and per key/group when needed.
  • Use small batches, backoff, and jitter; pause if p95/p99 latency spikes.
  • Expose a kill switch to stop replay jobs quickly.
  • Track backlog age and drain rate; ensure steady progress without saturating dependencies.
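The backoff-with-jitter piece can be as simple as full jitter over an exponentially growing cap. The base and ceiling values below are illustrative, not recommendations:

```python
import random

def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Return a delay in seconds for the given 0-based attempt number,
    drawn uniformly from [0, min(cap, base * 2^attempt)] (full jitter)."""
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0, ceiling)
```

Full jitter spreads retries out so a batch of replayed messages that fail together does not hammer the dependency again in lockstep.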

Idempotency and consistency

  • Prefer upserts or idempotent updates over side-effect-heavy operations.
  • For FIFO/partition order, drain per key/group and avoid interleaving with live traffic that enforces strict ordering.
  • Aim for at-least-once processing paired with idempotency; exactly-once effects are rare and complex.
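An idempotent consumer in this spirit dedupes on an event ID before applying an upsert. The in-memory set below stands in for a durable store (for example a unique index or a processed-IDs table):

```python
class IdempotentConsumer:
    """Sketch: apply each event at most once, keyed by event_id."""

    def __init__(self):
        self.seen: set[str] = set()        # stand-in for a durable dedupe store
        self.state: dict[str, dict] = {}   # stand-in for the target table

    def handle(self, event: dict) -> bool:
        """Apply the event once; return False for a duplicate delivery."""
        eid = event["event_id"]
        if eid in self.seen:
            return False                   # replayed duplicate -> no-op
        self.state[event["key"]] = event["payload"]  # upsert, not append
        self.seen.add(eid)
        return True
```

With this shape, at-least-once delivery during a replay converges to exactly-once effects, which is the practical target the section above describes.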

Monitoring and audit

  • DLQ backlog size and age distribution.
  • Replay throughput, success rate, and top error codes.
  • Time-to-clear for a DLQ batch and trend over time.
  • Audit trail of who initiated replays, scopes, approvals, and timings.

Operations runbook

  • Confirm incident scope; freeze producers if necessary.
  • Sample 10–50 DLQ messages; reproduce and validate fixes in staging.
  • Choose replay scope and rate; require approval for bulk runs.
  • Start with a small batch in production; watch errors and downstream health.
  • Increase rate gradually if stable; pause and escalate on repeat failures.
  • Close with a postmortem and permanent fixes.

Policies to set

  • Max retries before DLQ and max replay attempts per message.
  • DLQ retention longer than the source queue/topic’s effective processing window.
  • Approval thresholds for bulk replays and production data patches.
  • Criteria for discard vs patch vs replay; require documented reasoning and ticket links.
  • Postmortems for recurring DLQ causes with assigned owners and timelines.

Where Breyta fits

Breyta is a workflow and agent orchestration platform for coding agents—the workflow layer around the coding agent you already use. For DLQ replays, teams can use Breyta to orchestrate multi-step replay jobs with deterministic execution, explicit auditable approvals and waits, versioned flow definitions, clear run history, and reusable templates. Breyta can coordinate local agents and VM-backed agents over SSH to run your replayer scripts, patch routines, and validation steps with controlled rates and auditable approvals.

FAQ

How many times should I retry before sending to DLQ?

Set a small, defined limit that covers common transients without creating loops. Many teams use a handful of retries, then DLQ.

Should I replay all DLQ messages?

No. Replay only messages that can succeed after a fix or retry. Discard or archive messages that are invalid by design.

How do I keep order during replays?

Preserve the original key or message group, and isolate replays to a dedicated input. Drain per key/group when strict order matters; avoid interleaving with live traffic if ordering is strict.

What if replays cause duplicates?

Make consumers idempotent. Use idempotency keys, dedupe checks, and idempotent updates.

How long should I keep messages in the DLQ?

Longer than the source queue/topic retention and long enough to cover detection, debugging, and replay windows for your team.