Troubleshooting Long-Running Workflows: A Practical Guide

By Chris Moen • Published 2026-03-27

Learn how to troubleshoot long-running workflows by understanding common failure modes, implementing robust timeout and retry strategies, and leveraging callback patterns for reliable execution.


Quick answer

Troubleshoot long-running workflows by separating business timers from infrastructure timeouts, making retries idempotent, and using start-then-callback patterns. Track state and progress at each step, and surface clear run history so you can see where it stalled. Breyta supports this shape with waits, approvals, callbacks, versioned flows, and deterministic run history.

What “long-running” means in practice

Long-running workflows span time and systems. They wait on humans, VMs, third-party APIs, or scheduled windows. Examples:

  • Overnight jobs and backfills
  • Human-in-the-loop approvals
  • VM-backed agents that report back later
  • Multi-step API pipelines with external webhooks

They need structure and state you can inspect. They also need graceful pause and resume.

Why it matters for production

The failure modes are different from short tasks. You see stalls, partial side effects, duplicate attempts, and lost callbacks. You need:

  • Business timers that fire cleanup or fallback
  • Retries that do not duplicate side effects
  • Callback correlation that survives restarts
  • Run history with step outputs you can trust

Common failure signals to watch

Group and label failures the same way across systems.

  • Timeouts
      • Business SLA timers vs worker or activity timeouts
      • Heartbeat or state-machine timeouts, such as Step Functions’ States.Timeout or States.HeartbeatTimeout, which you can catch and branch on (example discussion)
      • Advice from workflow engines to use in-workflow timers for business rules, not global kill timers (Temporal guidance)
  • Retries
      • Replayed steps that repeat side effects
      • Missing idempotency keys or dedupe checks
      • No backoff or caps, no dead-letter path
      • Stalls that look like work in progress but are not; see checkpointing and stall-detection ideas from a BullMQ guide on long-running jobs (overview)
  • Callbacks
      • Lost or duplicated webhook deliveries
      • No correlation IDs or signed verification
      • Callbacks that arrive after the workflow has already moved on
  • Throughput and backlogs
      • Queue growth and starved workers
      • Runs pinned to old code or config
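A stall detector can be as simple as comparing each run's last heartbeat against a silence threshold. A minimal sketch, assuming an in-memory heartbeat store (the `heartbeats` dict and both function names are illustrative, not any engine's API; a real system would persist heartbeats or use its engine's heartbeat feature):

```python
import time

# Illustrative in-memory heartbeat store; a real system would persist
# this or use the workflow engine's native heartbeat API.
heartbeats: dict[str, float] = {}

def record_heartbeat(run_id: str) -> None:
    """Called periodically by a step while it is doing real work."""
    heartbeats[run_id] = time.monotonic()

def find_stalled(max_silence_s: float) -> list[str]:
    """Return run IDs that look 'in progress' but have gone quiet."""
    now = time.monotonic()
    return [
        run_id
        for run_id, last in heartbeats.items()
        if now - last > max_silence_s
    ]

record_heartbeat("run-a")
heartbeats["run-b"] = time.monotonic() - 600  # simulate 10 minutes of silence
print(find_stalled(max_silence_s=300))  # run-b has stopped heartbeating
```

The key design point: a stalled run and a slow run look identical from the outside unless the step itself emits heartbeats, which is why checkpointing guides pair the two.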

Timeouts that work

Time out in layers, with intent.

  • Put business timers inside the workflow. Use timers for SLAs and cleanup rather than relying on a hard workflow timeout that simply kills execution. That advice is standard in workflow engines like Temporal, which recommends in-workflow timers for business timeouts (guidance).
  • Distinguish infrastructure timeouts. Activity, heartbeat, and state-machine timeouts protect the platform. Catch them and branch to remediation when the engine supports it, such as handling States.Timeout in Step Functions (example).
  • Bound total execution. Set an explicit upper bound for durable execution to prevent surprise cost and drift. AWS guidance for durable functions stresses meaningful overall limits with clear cleanup on timeout (best-practices note).
  • Checkpoint progress. Persist checkpoints and emit progress so you can resume or compensate. Many job systems document progress-tracking and checkpointing patterns for recovering from stalls (overview).
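The business-timer idea can be sketched with plain asyncio: race the work against an SLA timer, and when the timer fires first, branch to cleanup or fallback instead of letting a global timeout kill the run. All names and durations here are illustrative, not a specific engine's API:

```python
import asyncio

async def do_work() -> str:
    await asyncio.sleep(0.05)  # stand-in for a long-running step
    return "done"

async def cleanup_and_fallback() -> str:
    # The business timer fired: run cleanup or a fallback path,
    # instead of relying on a hard kill with no exit logic.
    return "fallback"

async def run_with_business_timer(sla_s: float) -> str:
    work = asyncio.ensure_future(do_work())
    timer = asyncio.ensure_future(asyncio.sleep(sla_s))
    done, _ = await asyncio.wait({work, timer}, return_when=asyncio.FIRST_COMPLETED)
    if work in done:
        timer.cancel()
        return work.result()
    work.cancel()  # or let it finish and reconcile later
    return await cleanup_and_fallback()

print(asyncio.run(run_with_business_timer(sla_s=1.0)))   # work finishes in time
print(asyncio.run(run_with_business_timer(sla_s=0.0)))   # SLA timer fires first
```

Durable engines express the same race with their own timer primitives, but the shape is identical: the timeout is a branch in the workflow, not a kill switch outside it.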

In Breyta

  • Use wait and approval steps to pause safely. Pair with sleep where a simple timer is fine.
  • Keep state in the workflow while you wait. Resume deterministically when the wait ends.

Retries you can trust

Retries are safe only if side effects are safe.

  • Make steps idempotent for safe retries
      • Use idempotency keys on writes and creates.
      • Guard side effects behind dedupe checks.
      • Persist “already done” flags in a reliable store.
  • Control retry behavior
      • Use exponential backoff and jitter.
      • Cap attempts, and send to a dead-letter path when the max is reached.
      • Add guardrails so a single poison input does not block the lane.
  • Capture context
      • Record attempt count, last error, and timestamps.
      • Emit metrics or counters operators can scan for spikes. Some guides call out retry counters, dead letters, and dashboards as core to long-running reliability (overview).
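A minimal sketch of these three ideas together: an idempotency guard on the side effect, capped retries with exponential backoff and full jitter, and a dead-letter list that captures context. Every name here is illustrative; a real system would keep `applied` and `dead_letter` in a durable store rather than in memory:

```python
import random
import time

applied: set[str] = set()       # idempotency keys already applied
dead_letter: list[dict] = []    # poison inputs with their last error

def apply_side_effect(key: str, payload: dict) -> None:
    """Apply a write at most once, even if the step is retried."""
    if key in applied:
        return  # dedupe: a retry arrived after a prior success
    # ...perform the actual write here...
    applied.add(key)

def retry_with_backoff(fn, *, max_attempts=5, base_s=0.5, cap_s=30.0):
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as err:
            if attempt == max_attempts:
                # Capped out: record context and stop blocking the lane.
                dead_letter.append({"error": str(err), "attempts": attempt})
                raise
            # Exponential backoff with full jitter.
            delay = random.uniform(0, min(cap_s, base_s * 2 ** attempt))
            time.sleep(delay)

calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient")
    return "ok"

print(retry_with_backoff(flaky, base_s=0.001))  # succeeds on the third attempt
apply_side_effect("order-123", {"amount": 10})
apply_side_effect("order-123", {"amount": 10})  # second delivery is a no-op
```

Full jitter (a uniform draw up to the backoff ceiling) spreads retry storms better than fixed backoff when many runs fail at once.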

In Breyta

  • You get deterministic runtime behavior and clear run history with step outputs.
  • Separate connections and secrets from logic so retries do not require re-plumbing credentials.
  • Use versioned releases so a retry never flips to a new definition mid-run. Runs are pinned to the resolved release at start time.

Callbacks without surprises

Callbacks connect external work to your workflow. Treat them as first-class.

  • Start-then-callback pattern
      • Kick off remote work. Pause the workflow. Resume only when the worker posts back to a callback URL with a correlation token.
      • Many durable systems rehydrate runs on callback and replay logic deterministically. Completed steps return cached results so you do not re-run them (durable callback behavior example).
  • Make callbacks reliable
      • Use signed callbacks or secret tokens.
      • Deduplicate by correlation ID.
      • Handle late and duplicate deliveries.
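One way to get signed, deduplicated callbacks is an HMAC over the body plus a seen-before set keyed by correlation ID. A sketch under stated assumptions: the shared secret, handler name, and in-memory dedupe set are all illustrative, and a real receiver would persist seen IDs so dedupe survives restarts:

```python
import hashlib
import hmac

SECRET = b"shared-callback-secret"   # illustrative; distribute out of band
seen_correlation_ids: set[str] = set()

def sign(body: bytes) -> str:
    """HMAC-SHA256 signature the worker attaches to its callback."""
    return hmac.new(SECRET, body, hashlib.sha256).hexdigest()

def handle_callback(body: bytes, signature: str, correlation_id: str) -> str:
    if not hmac.compare_digest(sign(body), signature):
        return "rejected"   # forged or corrupted delivery
    if correlation_id in seen_correlation_ids:
        return "duplicate"  # redelivery: acknowledge, change nothing
    seen_correlation_ids.add(correlation_id)
    # ...resume the paused run identified by correlation_id...
    return "resumed"

body = b'{"status": "finished"}'
print(handle_callback(body, sign(body), "run-42"))       # resumed
print(handle_callback(body, sign(body), "run-42"))       # duplicate delivery
print(handle_callback(body, "bad-signature", "run-43"))  # rejected
```

Note the constant-time comparison via `hmac.compare_digest`; comparing signatures with `==` leaks timing information.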

In Breyta

  • The remote-agent pattern is documented. Start remote work over SSH. Pause with a wait step. Resume when the worker posts back to your callback URL.
  • This lets long-running agents work without keeping one fragile SSH session open.
  • Approvals and waits are first-class, so human checkpoints fit the same pattern.

Visibility and state you can operate

You cannot fix what you cannot see. Make the state obvious.

  • Run history. Show every step input and output. Track resumes and approvals. Expose resource references for large artifacts.
  • Artifacts as resources. Persist large outputs as resources and pass compact refs. This keeps state lean and inspectable later.
  • Clear boundaries. Keep secrets and connections outside step logic. Make it easy to swap accounts without code edits.
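The compact-ref idea can be sketched with a content-addressed store. The `res://` prefix mirrors Breyta's ref style, but the in-memory store and both helpers are hypothetical stand-ins for a real blob store:

```python
import hashlib

# Hypothetical in-memory store standing in for a blob or artifact store.
_store: dict[str, bytes] = {}

def put_resource(data: bytes) -> str:
    """Persist a large artifact; return a compact, content-addressed ref."""
    ref = "res://" + hashlib.sha256(data).hexdigest()[:16]
    _store[ref] = data
    return ref

def get_resource(ref: str) -> bytes:
    """Fetch the artifact on demand when you need to inspect it."""
    return _store[ref]

big_output = b"x" * 1_000_000          # imagine a large step output
ref = put_resource(big_output)
print(ref)                             # workflow state carries only this ref
```

Passing the ref instead of the payload keeps run history small enough to read, while the artifact stays fetchable whenever you investigate a run.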

Breyta focuses on deterministic run behavior, clear run history, and explicit waits and approvals. It treats large outputs as resources with res:// refs, so you can inspect artifacts without bloating state. The CLI is agent-first and returns stable JSON that agents or scripts can parse. Flows are versioned, with a draft vs live split and immutable releases, so you can test changes, approve them, and promote safely.

How Breyta fits this use case

Breyta is a workflow and agent orchestration platform for coding agents. Teams use it to build, run, and publish reliable workflows, agents, and autonomous jobs.

What maps directly to long-running troubleshooting:

  • Explicit structure. Flows have triggers, steps, waits and approvals, resource refs, run history, and versioned releases.
  • Remote and local agents. Orchestrate agents on VMs over SSH, pause with a wait, and resume on callback. Or hand work to a local runner, then continue when it completes.
  • Approvals and human checkpoints. Pause, notify, and resume with state intact.
  • Deterministic execution. Inspect step-by-step outputs and re-run with confidence.
  • Draft vs live. Iterate in draft. Promote to live only when you approve the behavior.
  • Resources for large outputs. Persist and pass compact refs, then fetch via signed URLs when needed.
  • CLI built for agents. Operate flows, runs, and resources with stable JSON for coding agents like Codex, Claude Code, Cursor, or Gemini CLI.

Troubleshooting in Breyta, practically:

  • Check run history and step outputs to find the stall point.
  • For waits, confirm the callback arrived or the approval is pending.
  • For long-running agents, verify the remote job got the callback URL and correlation token.
  • If outputs are large, fetch the resource refs and inspect the artifact.
  • If you changed logic, push to draft, re-run, and only release to live after review.

Practical patterns to keep

  • Put business timeouts inside the workflow. Use a timer to trigger cleanup or a fallback path (Temporal guidance).
  • Catch infrastructure timeouts and branch to remediation when the engine supports it, such as States.Timeout handlers in Step Functions (example).
  • Set a sensible overall execution boundary for durable runs, with a clear exit path (best-practices note).
  • Make retries idempotent and capped. Use dead letters for poison inputs and add stall detection or checkpoints where supported (overview).

FAQ

Should I rely on a global workflow timeout for cleanup?

No. Many engines treat global timeouts like a hard kill. Use in-workflow timers for business logic, and run cleanup on timer fire instead of on kill signals (guidance).

How do I avoid duplicate side effects on retry?

Use idempotency keys and dedupe checks at the boundary where the side effect is applied. Persist checkpoints and only proceed when the state is not already applied.

Do I keep SSH sessions open for long agents?

Avoid it. Start the agent, pause the workflow, and resume on callback. Breyta documents this remote-agent pattern with SSH, waits, and callbacks so you keep state without holding an open session.