Troubleshooting Long-Running Workflows: A Practical Guide

By Chris Moen • Published 2026-03-27

Learn how to troubleshoot long-running workflows by understanding common failure modes, implementing robust timeout and retry strategies, and leveraging callback patterns for reliable execution.


Quick answer

Troubleshoot long-running workflows by separating business timers from infrastructure timeouts, making retries idempotent, and using start-then-callback patterns. Track state and progress at each step, and surface clear run history so you can see where it stalled. Breyta supports this shape with waits, approvals, callbacks, versioned flows, and deterministic run history.

What “long-running” means in practice

Long-running workflows span time and systems. They wait on humans, VMs, third-party APIs, or scheduled windows. Examples:

  • Overnight jobs and backfills
  • Human-in-the-loop approvals
  • VM-backed agents that report back later
  • Multi-step API pipelines with external webhooks

They need structure and state you can inspect. They also need graceful pause and resume.

Why it matters for production

The failure modes are different from short tasks. You see stalls, partial side effects, duplicate attempts, and lost callbacks. You need:

  • Business timers that fire cleanup or fallback
  • Retries that do not duplicate side effects
  • Callback correlation that survives restarts
  • Run history with step outputs you can trust

Common failure signals to watch

Group and label failures the same way across systems.

  • Timeouts
      • Business SLA timers vs worker or activity timeouts
      • Heartbeat or state-machine timeouts, such as Step Functions’ States.Timeout or States.HeartbeatTimeout, which you can catch and branch on (example discussion)
      • Advice from workflow engines to use in-workflow timers for business rules, not global kill timers (Temporal guidance)
  • Retries
      • Replayed steps that repeat side effects
      • Missing idempotency keys or dedupe checks
      • No backoff or caps, no dead-letter path
      • Stalls that look like work in progress but are not; see checkpointing and stall-detection ideas from a BullMQ guide on long-running jobs (overview)
  • Callbacks
      • Lost or duplicated webhook deliveries
      • No correlation IDs or signed verification
      • Callbacks that arrive after the workflow has already moved on
  • Throughput and backlogs
      • Queue growth and starved workers
      • Runs pinned to old code or config
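A stall detector can be as simple as comparing each run's last heartbeat against a silence threshold. A minimal sketch, assuming an in-memory heartbeat store (the `heartbeats` dict and both function names are illustrative, not any engine's API; a real system would persist heartbeats or use its engine's heartbeat feature):

```python
import time

# Illustrative in-memory heartbeat store; a real system would persist
# this or use the workflow engine's native heartbeat API.
heartbeats: dict[str, float] = {}

def record_heartbeat(run_id: str) -> None:
    """Called periodically by a step while it is doing real work."""
    heartbeats[run_id] = time.monotonic()

def find_stalled(max_silence_s: float) -> list[str]:
    """Return run IDs that look 'in progress' but have gone quiet."""
    now = time.monotonic()
    return [
        run_id
        for run_id, last in heartbeats.items()
        if now - last > max_silence_s
    ]

record_heartbeat("run-a")
heartbeats["run-b"] = time.monotonic() - 600  # simulate 10 minutes of silence
print(find_stalled(max_silence_s=300))  # run-b has stopped heartbeating
```

The key design point: a stalled run and a slow run look identical from the outside unless the step itself emits heartbeats, which is why checkpointing guides pair the two.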

Timeouts that work

Time out in layers, with intent.

  • Put business timers inside the workflow. Use timers for SLAs and cleanup rather than relying on a hard workflow timeout that simply kills execution. That advice is standard in workflow engines like Temporal, which recommends in-workflow timers for business timeouts (guidance).
  • Distinguish infrastructure timeouts. Activity, heartbeat, and state-machine timeouts protect the platform. Catch them and branch to remediation when the engine supports it, such as handling States.Timeout in Step Functions (example).
  • Bound total execution. Set an explicit upper bound for durable execution to prevent surprise cost and drift. AWS guidance for durable functions stresses meaningful overall limits with clear cleanup on timeout (best-practices note).
  • Checkpoint progress. Persist checkpoints and emit progress so you can resume or compensate. Many job systems document progress-tracking and checkpointing patterns for recovering from stalls (overview).
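The business-timer idea can be sketched with plain asyncio: race the work against an SLA timer, and when the timer fires first, branch to cleanup or fallback instead of letting a global timeout kill the run. All names and durations here are illustrative, not a specific engine's API:

```python
import asyncio

async def do_work() -> str:
    await asyncio.sleep(0.05)  # stand-in for a long-running step
    return "done"

async def cleanup_and_fallback() -> str:
    # The business timer fired: run cleanup or a fallback path,
    # instead of relying on a hard kill with no exit logic.
    return "fallback"

async def run_with_business_timer(sla_s: float) -> str:
    work = asyncio.ensure_future(do_work())
    timer = asyncio.ensure_future(asyncio.sleep(sla_s))
    done, _ = await asyncio.wait({work, timer}, return_when=asyncio.FIRST_COMPLETED)
    if work in done:
        timer.cancel()
        return work.result()
    work.cancel()  # or let it finish and reconcile later
    return await cleanup_and_fallback()

print(asyncio.run(run_with_business_timer(sla_s=1.0)))   # work finishes in time
print(asyncio.run(run_with_business_timer(sla_s=0.0)))   # SLA timer fires first
```

Durable engines express the same race with their own timer primitives, but the shape is identical: the timeout is a branch in the workflow, not a kill switch outside it.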

In Breyta

  • Use wait and approval steps to pause safely. Pair with sleep where a simple timer is fine.
  • Keep state in the workflow while you wait. Resume deterministically when the wait ends.

Retries you can trust

Retries are safe only if side effects are safe.

  • Make steps idempotent for safe retries
      • Use idempotency keys on writes and creates.
      • Guard side effects behind dedupe checks.
      • Persist “already done” flags in a reliable store.
  • Control retry behavior
      • Use exponential backoff and jitter.
      • Cap attempts, and send to a dead-letter path when the max is reached.
      • Add guardrails so a single poison input does not block the lane.
  • Capture context
      • Record attempt count, last error, and timestamps.
      • Emit metrics or counters operators can scan for spikes. Some guides call out retry counters, dead letters, and dashboards as core to long-running reliability (overview).
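A minimal sketch of these three ideas together: an idempotency guard on the side effect, capped retries with exponential backoff and full jitter, and a dead-letter list that captures context. Every name here is illustrative; a real system would keep `applied` and `dead_letter` in a durable store rather than in memory:

```python
import random
import time

applied: set[str] = set()       # idempotency keys already applied
dead_letter: list[dict] = []    # poison inputs with their last error

def apply_side_effect(key: str, payload: dict) -> None:
    """Apply a write at most once, even if the step is retried."""
    if key in applied:
        return  # dedupe: a retry arrived after a prior success
    # ...perform the actual write here...
    applied.add(key)

def retry_with_backoff(fn, *, max_attempts=5, base_s=0.5, cap_s=30.0):
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as err:
            if attempt == max_attempts:
                # Capped out: record context and stop blocking the lane.
                dead_letter.append({"error": str(err), "attempts": attempt})
                raise
            # Exponential backoff with full jitter.
            delay = random.uniform(0, min(cap_s, base_s * 2 ** attempt))
            time.sleep(delay)

calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient")
    return "ok"

print(retry_with_backoff(flaky, base_s=0.001))  # succeeds on the third attempt
apply_side_effect("order-123", {"amount": 10})
apply_side_effect("order-123", {"amount": 10})  # second delivery is a no-op
```

Full jitter (a uniform draw up to the backoff ceiling) spreads retry storms better than fixed backoff when many runs fail at once.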

In Breyta

  • You get deterministic runtime behavior and clear run history with step outputs.
  • Separate connections and secrets from logic so retries do not require re-plumbing credentials.
  • Use versioned releases so a retry never flips to a new definition mid-run. Runs are pinned to the resolved release at start time.

Callbacks without surprises

Callbacks connect external work to your workflow. Treat them as first-class.

  • Start-then-callback pattern
      • Kick off remote work. Pause the workflow. Resume only when the worker posts back to a callback URL with a correlation token.
      • Many durable systems rehydrate runs on callback and replay logic deterministically. Completed steps return cached results so you do not re-run them (durable callback behavior example).
  • Make callbacks reliable
      • Use signed callbacks or secret tokens.
      • Deduplicate by correlation ID.
      • Handle late and duplicate deliveries.
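One way to get signed, deduplicated callbacks is an HMAC over the body plus a seen-before set keyed by correlation ID. A sketch under stated assumptions: the shared secret, handler name, and in-memory dedupe set are all illustrative, and a real receiver would persist seen IDs so dedupe survives restarts:

```python
import hashlib
import hmac

SECRET = b"shared-callback-secret"   # illustrative; distribute out of band
seen_correlation_ids: set[str] = set()

def sign(body: bytes) -> str:
    """HMAC-SHA256 signature the worker attaches to its callback."""
    return hmac.new(SECRET, body, hashlib.sha256).hexdigest()

def handle_callback(body: bytes, signature: str, correlation_id: str) -> str:
    if not hmac.compare_digest(sign(body), signature):
        return "rejected"   # forged or corrupted delivery
    if correlation_id in seen_correlation_ids:
        return "duplicate"  # redelivery: acknowledge, change nothing
    seen_correlation_ids.add(correlation_id)
    # ...resume the paused run identified by correlation_id...
    return "resumed"

body = b'{"status": "finished"}'
print(handle_callback(body, sign(body), "run-42"))       # resumed
print(handle_callback(body, sign(body), "run-42"))       # duplicate delivery
print(handle_callback(body, "bad-signature", "run-43"))  # rejected
```

Note the constant-time comparison via `hmac.compare_digest`; comparing signatures with `==` leaks timing information.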

In Breyta

  • The remote-agent pattern is documented. Start remote work over SSH. Pause with a wait step. Resume when the worker posts back to your callback URL.
  • This lets long-running agents work without keeping one fragile SSH session open.
  • Approvals and waits are first-class, so human checkpoints fit the same pattern.

Visibility and state you can operate

You cannot fix what you cannot see. Make the state obvious.

  • Run history. Show every step input and output. Track resumes and approvals. Expose resource references for large artifacts.
  • Artifacts as resources. Persist large outputs as resources and pass compact refs. This keeps state lean and inspectable later.
  • Clear boundaries. Keep secrets and connections outside step logic. Make it easy to swap accounts without code edits.
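The compact-ref idea can be sketched with a content-addressed store. The `res://` prefix mirrors Breyta's ref style, but the in-memory store and both helpers are hypothetical stand-ins for a real blob store:

```python
import hashlib

# Hypothetical in-memory store standing in for a blob or artifact store.
_store: dict[str, bytes] = {}

def put_resource(data: bytes) -> str:
    """Persist a large artifact; return a compact, content-addressed ref."""
    ref = "res://" + hashlib.sha256(data).hexdigest()[:16]
    _store[ref] = data
    return ref

def get_resource(ref: str) -> bytes:
    """Fetch the artifact on demand when you need to inspect it."""
    return _store[ref]

big_output = b"x" * 1_000_000          # imagine a large step output
ref = put_resource(big_output)
print(ref)                             # workflow state carries only this ref
```

Passing the ref instead of the payload keeps run history small enough to read, while the artifact stays fetchable whenever you investigate a run.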

Breyta focuses on deterministic run behavior, clear run history, and explicit waits and approvals. It treats large outputs as resources with res:// refs, so you can inspect artifacts without bloating state. The CLI is agent-first and returns stable JSON that agents or scripts can parse. Flows are versioned, with a draft vs live split and immutable releases, so you can test changes, approve them, and promote safely.

How Breyta fits this use case

Breyta is a workflow and agent orchestration platform for coding agents. Teams use it to build, run, and publish reliable workflows, agents, and autonomous jobs.

What maps directly to long-running troubleshooting:

  • Explicit structure. Flows have triggers, steps, waits and approvals, resource refs, run history, and versioned releases.
  • Remote and local agents. Orchestrate agents on VMs over SSH, pause with a wait, and resume on callback. Or hand work to a local runner, then continue when it completes.
  • Approvals and human checkpoints. Pause, notify, and resume with state intact.
  • Deterministic execution. Inspect step-by-step outputs and re-run with confidence.
  • Draft vs live. Iterate in draft. Promote to live only when you approve the behavior.
  • Resources for large outputs. Persist and pass compact refs, then fetch via signed URLs when needed.
  • CLI built for agents. Operate flows, runs, and resources with stable JSON for coding agents like Codex, Claude Code, Cursor, or Gemini CLI.

Troubleshooting in Breyta, practically:

  • Check run history and step outputs to find the stall point.
  • For waits, confirm the callback arrived or the approval is pending.
  • For long-running agents, verify the remote job got the callback URL and correlation token.
  • If outputs are large, fetch the resource refs and inspect the artifact.
  • If you changed logic, push to draft, re-run, and only release to live after review.

Practical patterns to keep

  • Put business timeouts inside the workflow. Use a timer to trigger cleanup or a fallback path (Temporal guidance).
  • Catch infrastructure timeouts and branch to remediation when the engine supports it, such as States.Timeout handlers in Step Functions (example).
  • Set a sensible overall execution boundary for durable runs, with a clear exit path (best-practices note).
  • Make retries idempotent and capped. Use dead letters for poison inputs and add stall detection or checkpoints where supported (overview).

FAQ

Should I rely on a global workflow timeout for cleanup?

No. Many engines treat global timeouts like a hard kill. Use in-workflow timers for business logic, and run cleanup on timer fire instead of on kill signals (guidance).

How do I avoid duplicate side effects on retry?

Use idempotency keys and dedupe checks at the boundary where the side effect is applied. Persist checkpoints and only proceed when the state is not already applied.

Do I keep SSH sessions open for long agents?

Avoid it. Start the agent, pause the workflow, and resume on callback. Breyta documents this remote-agent pattern with SSH, waits, and callbacks so you keep state without holding an open session.