Mastering Long-Running Workflows: Waits, Retries, and Timeouts
By Chris Moen • Published 2026-04-08
Discover essential strategies for managing long-running workflows with explicit waits, smart retries, and scoped timeouts to ensure reliability and efficiency.
The quick answer
Long-running workflows need explicit waits, smart retries, and scoped timeouts. Use waits to pause without compute, retries for transient errors, and timers for business time limits. Then track each run with clear history so you can see what happened.
What “long-running” means in practice
A long-running workflow is any flow that keeps state while work happens elsewhere or over time. You may wait on a human decision, a remote job, a rate limit reset, or a slow import. That is normal. As the Medusa docs put it, you start work and do not get output right away, often because steps wait for an external action to finish before resuming execution (Medusa definition of long‑running workflows).
Why timeouts, retries, and waits matter
- Timeouts prevent indefinite hangs and expose stuck steps. Retries help recover from transient issues like network flaps or 5xx errors. These are standard goals in workflow engines (timeouts and retries overview).
- Business timeouts should be explicit inside the workflow. Do not rely on a platform kill switch for business logic. Use timers inside the flow to handle cleanup or compensations when time is up (Temporal guidance on timers).
Where timeouts come from
Common causes include:
- Slow or flaky third‑party APIs, large payload processing, and rate limits that stall calls. These are frequent causes of step timeouts in serverless builders (Pipedream thread on timeouts).
- Platform execution limits on per‑step or per‑run time. Default limits differ by trigger type in many tools.
- Connector or action failures in vendor services, misconfigured retry policies, and bad credentials. These show up often in cloud workflow platforms (Azure Logic Apps troubleshooting).
- Worker restarts or crashes during long tasks. Lack of heartbeat or resume logic can make a step appear stuck until a timeout fires. Many engines advise adding heartbeat checks so cancellations and restarts are detected quickly (Temporal activity guidance).
How to troubleshoot fast
Start simple:
- Pinpoint the scope
  - Did a single step time out?
  - Did the whole workflow hit a max duration?
  - Did a wait expire with no callback?
- Check run history
  - Look for the last successful step. Note error codes, durations, and response sizes.
  - Confirm which retry attempt failed.
Then apply the fix that matches the cause:
- External dependency issues
  - Add retries with exponential backoff and jitter (see the backoff sketch after this list).
  - Respect 429 or Retry‑After. Switch to waits plus scheduled resume when you hit limits.
  - Reduce payload size. Chunk large work into smaller calls.
- Internal compute issues
  - Profile slow code paths. Stream or batch large transforms.
  - Move heavy work to a worker that can run out of band. Have the workflow wait for a callback.
- Business time windows
  - Use a workflow timer to end or compensate after N minutes. Do not rely on global run timeouts for business logic (use timers for business timeouts).
- Long tasks and cancellations
  - Add heartbeats so the platform can cancel quickly on shutdown or timeout (see the heartbeat sketch after this list).
  - Make the task idempotent so retries are safe.
- Webhooks and callbacks
  - Use idempotency keys. Dedupe repeats.
  - Validate signatures. Log each callback with the related run ID.
- Storage pressure and large artifacts
  - Persist big outputs to storage. Pass lightweight references between steps instead of raw blobs.
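To make the external-dependency advice concrete, here is a minimal Python sketch of capped exponential backoff with full jitter that honors Retry‑After and fails fast on permanent 4xx errors. The URL handling, status set, and limits are illustrative assumptions, not settings from any particular platform.

```python
# Minimal sketch: capped exponential backoff with full jitter.
# Status set, attempt cap, and delays are illustrative assumptions.
import random
import time

import requests

RETRYABLE = {429, 500, 502, 503, 504}  # transient statuses worth retrying
MAX_ATTEMPTS = 5
BASE_DELAY = 1.0   # seconds
MAX_DELAY = 30.0   # cap so backoff never grows unbounded


def call_with_backoff(url: str) -> requests.Response:
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            resp = None  # network flap or client-side timeout: retryable

        if resp is not None:
            if resp.status_code < 400:
                return resp
            if resp.status_code not in RETRYABLE:
                # Permanent 4xx (auth, bad request): fail fast, do not retry.
                resp.raise_for_status()

        if attempt == MAX_ATTEMPTS:
            raise RuntimeError(f"gave up after {MAX_ATTEMPTS} attempts")

        # Honor Retry-After when the server sends one in seconds form;
        # otherwise use capped exponential backoff with full jitter.
        retry_after = resp.headers.get("Retry-After") if resp is not None else None
        if retry_after and retry_after.isdigit():
            delay = float(retry_after)
        else:
            delay = random.uniform(0, min(MAX_DELAY, BASE_DELAY * 2 ** (attempt - 1)))
        time.sleep(delay)
```

Full jitter matters here: it spreads simultaneous retries out so a fleet of workers does not stampede the same endpoint the moment it recovers.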
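And here is a generic heartbeat sketch for long tasks. It is engine-agnostic: `send_heartbeat` is a hypothetical stand-in for whatever liveness call your platform exposes, and the interval is an assumption.

```python
# Generic heartbeat sketch, not tied to any specific engine: a background
# thread pings a heartbeat callback while the long task runs, so the
# platform can detect stalls and cancel promptly.
import threading


def run_with_heartbeat(task, send_heartbeat, interval: float = 10.0):
    stop = threading.Event()

    def beat():
        while not stop.is_set():
            send_heartbeat()   # tells the engine "still alive"
            stop.wait(interval)

    beater = threading.Thread(target=beat, daemon=True)
    beater.start()
    try:
        return task()          # the actual long-running work
    finally:
        stop.set()             # stop heartbeats once work ends, even on error
        beater.join()
```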
Durable patterns that work
- Webhook or callback pattern
  - Kick off work. Return a callback URL.
  - The worker posts results to the callback.
  - The workflow resumes and continues (a callback-handler sketch follows this list).
- Remote worker pattern
  - Start a job on a VM or remote host over SSH.
  - Use a wait in the workflow.
  - Resume when the remote job calls back. This avoids holding an SSH session open.
- Human‑in‑the‑loop
  - Pause for approval. Notify reviewers.
  - Resume only on explicit approve or reject.
- Backoff and retry
  - Use capped exponential backoff for 5xx, 429, and network errors.
  - Fail fast on 4xx that indicate permanent issues like auth errors.
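As a concrete sketch of the callback side, the snippet below verifies an HMAC signature, dedupes on an idempotency key, and logs the related run ID before resuming, following the webhook guidance above. The secret handling, in-memory dedupe set, and `resume_run` hook are assumptions for illustration; production code would use durable storage and your engine's resume API.

```python
# Framework-agnostic callback receiver: verify signature, dedupe, resume.
import hashlib
import hmac

SECRET = b"shared-webhook-secret"  # store this outside the flow logic
seen_keys: set[str] = set()        # use durable storage in production


def handle_callback(body: bytes, signature: str, idempotency_key: str, run_id: str):
    # Reject posts that were not signed with the shared secret.
    expected = hmac.new(SECRET, body, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, signature):
        raise PermissionError("bad signature")

    # Dedupe repeats so redelivered webhooks are safe no-ops.
    if idempotency_key in seen_keys:
        return "duplicate ignored"
    seen_keys.add(idempotency_key)

    print(f"callback received for run {run_id}")  # log with the related run ID
    resume_run(run_id, body)                      # hypothetical engine hook


def resume_run(run_id: str, payload: bytes):
    ...  # hand the payload back to the waiting workflow
```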
How Breyta fits this use case
Breyta is a workflow and agent orchestration platform for coding agents. It gives you a clear structure for long-running jobs.
- Deterministic flows with versioned releases
  - Flows are versioned definitions with a draft and live split. You can run, inspect, and promote when ready.
  - Runs are pinned to the resolved release at start time.
- First‑class waits and approvals
  - Use wait steps to pause for callbacks or time windows.
  - Add approvals to keep a human in the loop.
- Orchestrate local and remote agents
  - Start remote work on a VM over SSH with documented steps.
  - Pause with a wait and resume when the worker posts back. This is the remote‑agent pattern used for long-running jobs.
  - You can also hand off to a local agent and resume on completion.
- Resource model for large outputs
  - Persist large artifacts. Pass compact res:// references so state stays light and inspectable.
- Operational visibility and control
  - Inspect run history and step outputs.
  - Configure connections and secrets outside the flow logic.
  - Use an agent‑first CLI with stable JSON so your coding agent can author and operate flows.
- Billing details that matter for long runs
  - Triggers, waits, and approval steps do not count as billable step executions. This helps when a flow must pause for long periods.
The agent does the task. Breyta runs the workflow around it.
What to look for in any long‑running workflow engine
- Durable waits that do not burn compute.
- External callbacks and webhook triggers.
- Timers you can use for business deadlines.
- Clear run history with step outputs and artifacts.
- Versioned releases with draft and live targets.
- Human approvals and notifications.
- Ability to orchestrate agents on VMs over SSH.
- A resource model for large outputs.
- A CLI or API that is stable and scriptable by coding agents.
Quick fixes by symptom
- A step times out while calling a third‑party API
  - Increase the per‑step timeout within reason.
  - Add retries with jittered backoff. Respect 429 and Retry‑After headers.
  - Switch to a wait plus scheduled retry instead of tight polling (rate‑limit advice example).
- The whole workflow times out
  - Replace global timeouts with an internal timer that performs cleanup or rollback (use timers inside workflows); a timer sketch follows this list.
- A long task gets lost on worker restart
  - Add heartbeats and ensure the activity can resume or retry safely (Temporal heartbeat note).
- Large payloads slow everything down
  - Persist files and media. Pass references instead of inlining blobs (see the reference-passing sketch after this list).
- Human approval takes hours
  - Use a wait that pauses the run without compute. In Breyta, waits and approvals do not count as billable step executions.
- Connector errors in vendor services
  - Check connector status, auth, and retry policy. Review per‑action limits and payload caps (Azure Logic Apps troubleshooting).
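For the internal-timer fix, here is a minimal sketch using Python's asyncio to race the work against a business deadline and run compensation when it expires. The deadline value and the `compensate` step are illustrative; a durable engine would use its own timer primitive instead of an in-process one.

```python
# Business timeout inside the workflow: race the work against a timer
# and compensate explicitly instead of relying on a platform kill switch.
import asyncio

BUSINESS_DEADLINE = 15 * 60  # seconds: the business rule, not a platform cap


async def do_work():
    await asyncio.sleep(3)   # stand-in for the real long-running step
    return "done"


async def compensate():
    print("deadline hit: rolling back and notifying the owner")


async def run():
    try:
        return await asyncio.wait_for(do_work(), timeout=BUSINESS_DEADLINE)
    except asyncio.TimeoutError:
        await compensate()   # explicit cleanup instead of a silent kill
        raise


asyncio.run(run())
```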
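And a tiny sketch of passing references instead of blobs: persist the artifact once, then hand only a small key between steps. The in-memory `store` stands in for any object storage, and the `artifact://` key scheme is an illustrative assumption, not a platform convention.

```python
# Pass-by-reference sketch: big payloads live in storage, steps pass keys.
import uuid

store: dict[str, bytes] = {}  # stand-in for real object storage


def persist(blob: bytes) -> str:
    key = f"artifact://{uuid.uuid4()}"
    store[key] = blob          # big payload stays in storage
    return key                 # steps pass only this small reference


def load(key: str) -> bytes:
    return store[key]          # fetch the blob only where it is needed
```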
Bottom line
Make waits explicit. Use retries for transient issues. Scope timeouts to business rules with timers. Give your team run history they can trust. Breyta provides these primitives for agent and automation workflows, including human-in-the-loop approvals and long-running, VM‑backed jobs.