Fixing Long-Running Workflows: Timeouts, Hangs, and How to Keep Them Alive

By Chris Moen • Published 2026-05-14

Long-running workflows can time out or hang, causing disruptions and hidden risks. Learn to diagnose and fix these issues by identifying wait points, aligning time budgets, and using asynchronous patterns.

Quick answer

Long-running workflows time out or hang when a step waits too long, holds a fragile connection, or stops sending signals. Fix them by finding the exact wait point, aligning time budgets across components, and using asynchronous patterns that do not keep sockets open. Use heartbeats, callbacks, and chunked work to keep runs alive and visible.

What this means in practice

Timeout: A step or service exceeds its allowed wait time and aborts.
Hang: Work appears stuck. No failure is recorded. Often a blocked thread, stalled I/O, or a closed socket without error propagation.
Silent stall during result transfer: The job finishes but the response or artifact is too large or too slow to ship.

Real systems face all three. The fix starts with knowing which one you have.

Why it matters for production workflows

Stuck runs hide risk. They hold state and block follow-on work.
Timeouts can leave partial changes if the target is not atomic.
Re-running blindly can duplicate changes or spend more budget.
Teams lose trust when they cannot see what happened.

Where problems hide

Check each layer. Many “timeouts” originate in a different layer from the one that reports them.

Client and SDKs
Overly short connect or read timeouts.
No retry or no jitter.
Misuse of async APIs that still block.
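
The retry item above can be sketched as follows. This is a minimal sketch, not a specific SDK's behavior: call_with_retry and its defaults (5 attempts, 0.5 s base, 30 s cap) are hypothetical, and it assumes the wrapped call is safe to repeat. It uses capped exponential backoff with full jitter, so many clients retrying at once do not synchronize.

```python
import random
import time


def call_with_retry(fn, attempts=5, base=0.5, cap=30.0,
                    retry_on=(TimeoutError, ConnectionError), sleep=time.sleep):
    """Call fn(); on a retryable error, wait a jittered delay and try again."""
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise  # retry budget spent: surface the last error
            # Exponential growth, capped so the delay never exceeds `cap`.
            ceiling = min(cap, base * 2 ** attempt)
            # Full jitter: pick a random delay between 0 and the ceiling.
            sleep(random.uniform(0, ceiling))
```

Only retry steps that are idempotent. Otherwise a retry can duplicate a change that the first attempt already made.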

Network and intermediaries
Load balancer or proxy idle timeouts cut long-lived sockets.
High latency or packet loss during transfer.
VPN or firewall rules that drop long polls.

External services
Long tasks that never send an interim signal or heartbeat.
Result uploads that exceed default transfer windows. NVIDIA FLARE's Timeout Troubleshooting Guide covers heartbeat, task fetch, and result submission deadlines.

Worker or agent
Heavy startup before first init. Large models or imports can exceed “pre-init” budgets. NVFLARE's guide calls this the external pre-init timeout.
CPU starvation or thread pool exhaustion.
Blocking I/O without deadlines.
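
To illustrate the last item, here is a minimal sketch of a socket read with a bound on how long it may block. The helper name recv_with_timeout is hypothetical. Note that settimeout limits each blocking call, not the whole transfer, so a total deadline needs its own clock check in the calling loop.

```python
import socket


def recv_with_timeout(sock, nbytes, timeout_s):
    """Read up to nbytes, or return None if nothing arrives within timeout_s."""
    sock.settimeout(timeout_s)  # without a timeout, recv() can block indefinitely
    try:
        return sock.recv(nbytes)
    except socket.timeout:
        # Surface the stall to the caller instead of hanging the worker thread.
        return None
```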

Orchestrator
Synchronous waits for long jobs that should be async.
Missing task tokens or callbacks, so the state machine never resumes. AWS Step Functions documents stuck activities and the need for timeouts and task tokens in its troubleshooting page.
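
The token-and-callback idea can be sketched with a small in-memory registry. This is an illustration under stated assumptions, not the AWS Step Functions API: WaitRegistry and its methods are hypothetical names. The point it shows is that every paused step carries a token and a deadline, so a lost callback ends as a visible timeout instead of a silent hang.

```python
import secrets
import time


class WaitRegistry:
    """In-memory stand-in for an orchestrator's paused steps (hypothetical)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._waits = {}  # token -> {"deadline", "state", "result"}

    def pause(self, timeout_s):
        """Register a paused step and return the token the worker must send back."""
        token = secrets.token_urlsafe(16)
        self._waits[token] = {
            "deadline": self._clock() + timeout_s,
            "state": "waiting",
            "result": None,
        }
        return token

    def resume(self, token, result):
        """Handle a callback. Unknown, finished, or overdue tokens are rejected."""
        wait = self._waits.get(token)
        if wait is None or wait["state"] != "waiting":
            return False
        if self._clock() > wait["deadline"]:
            wait["state"] = "timed_out"
            return False
        wait["state"], wait["result"] = "done", result
        return True

    def expire(self):
        """Mark overdue waits as timed out and return their tokens."""
        now = self._clock()
        overdue = [t for t, w in self._waits.items()
                   if w["state"] == "waiting" and now > w["deadline"]]
        for token in overdue:
            self._waits[token]["state"] = "timed_out"
        return overdue

    def state(self, token):
        return self._waits[token]["state"]
```

A real orchestrator would persist this state and call something like expire on a schedule. The in-memory dict only keeps the sketch short.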

Fast triage

Use quick, scoped checks. Keep changes small and reversible.

1) Locate the stall

Open the execution history. Find the last step with output.
Confirm if it is a compute step, a transfer, or a wait.

2) Reproduce with a tighter scope

Re-run the failing step alone with trace logging.
Capture start, first-byte, and completion timestamps.
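
One way to capture those three timestamps is to wrap whatever yields the response in a timing helper. A minimal sketch, assuming the step's output can be read as an iterable of byte chunks; timed_stream is a hypothetical name.

```python
import time


def timed_stream(chunks, clock=time.monotonic):
    """Consume byte chunks and report time to first byte and total time."""
    start = clock()
    first_byte = None
    total = 0
    for chunk in chunks:
        if first_byte is None and chunk:
            first_byte = clock()  # first non-empty chunk has arrived
        total += len(chunk)
    done = clock()
    return {
        "bytes": total,
        "time_to_first_byte_s": None if first_byte is None else first_byte - start,
        "total_s": done - start,
    }
```

A long time to first byte points at compute or queueing. A short time to first byte followed by a long total points at the transfer.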

3) Separate execution from transport

Run the job but write output to object storage. Return only a small receipt.
If this works, the issue is transfer or response size.
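
The receipt idea can be sketched like this. A local directory stands in for object storage here, and store_and_receipt and the receipt fields are assumptions chosen for illustration. A real setup would upload to its own store and return that store's reference.

```python
import hashlib
from pathlib import Path


def store_and_receipt(payload: bytes, store_dir: Path, name: str) -> dict:
    """Write payload to store_dir and return a small receipt instead of the data."""
    store_dir.mkdir(parents=True, exist_ok=True)
    path = store_dir / name
    path.write_bytes(payload)
    return {
        "ref": str(path),                               # where to fetch the result
        "bytes": len(payload),                          # lets the caller plan the fetch
        "sha256": hashlib.sha256(payload).hexdigest(),  # lets the caller verify it
    }
```

The receipt stays a few hundred bytes no matter how large the output is, so the response itself can no longer be the slow part.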

4) Align time budgets

Make sure client, proxy, and backend timeouts form a chain. The outer layer must wait longer than the inner layer.
Add heartbeats or interim progress events if supported.
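
The chain rule is easy to check mechanically. A minimal sketch, assuming you can list each layer's timeout in seconds from outermost to innermost; check_timeout_chain and the margin parameter are hypothetical.

```python
def check_timeout_chain(layers, margin_s=0.0):
    """layers: (name, timeout_s) pairs, outermost first. Returns a list of problems."""
    problems = []
    # Compare each layer with the one directly inside it.
    for (outer, outer_t), (inner, inner_t) in zip(layers, layers[1:]):
        if outer_t <= inner_t + margin_s:
            problems.append(
                f"{outer} ({outer_t}s) must exceed {inner} ({inner_t}s) "
                f"by more than {margin_s}s"
            )
    return problems
```

For example, check_timeout_chain([("client", 120), ("proxy", 90), ("backend", 60)]) returns an empty list, while a 30-second client in front of a 60-second proxy is reported as a problem.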

5) Decide: increase timeout or refactor

If the task is near the limit but stable, raise the timeout with a cap.
If the task blocks sockets or lacks heartbeats, switch to async callbacks.

Common timeout and hang patterns

Slow start or heavy model load
Symptom: The job “never started.” Logs show no init message.
Fix: Increase pre-init time or emit an early init signal. NVFLARE's Timeout Troubleshooting Guide documents adjusting the external pre-init timeout for heavy imports.
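
The early init signal can be sketched as an ordering rule: report in first, load second. This is an illustration, not NVFLARE's API. start_worker, emit, and load_model are hypothetical, and whether your supervisor accepts such a signal depends on the system.

```python
import time


def start_worker(emit, load_model, clock=time.monotonic):
    """Emit 'init' before the heavy load, then 'ready' with the load duration."""
    emit({"event": "init"})  # sent first, so the supervisor sees the worker is alive
    started = clock()
    model = load_model()     # heavy imports or model load happen only after the signal
    emit({"event": "ready", "load_s": clock() - started})
    return model
```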

Long compute with no heartbeat
Symptom: System marks the worker dead during a long step.
Fix: Send periodic heartbeats. Keep the heartbeat interval shorter than the proxy idle timeout, and keep client timeout > proxy timeout > backend compute budget.
On retries, add exponential backoff and a max cap.
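
A heartbeat during a long step can be sketched with a background thread. This is a minimal sketch, assuming the system accepts some form of liveness ping. run_with_heartbeat and beat are hypothetical names, and beat stands in for whatever call your platform provides.

```python
import threading
import time


def run_with_heartbeat(work, beat, interval_s):
    """Run work() while a background thread calls beat() every interval_s seconds."""
    stop = threading.Event()

    def pulse():
        # Event.wait returns False on timeout, so this beats until stop is set.
        while not stop.wait(interval_s):
            beat()

    thread = threading.Thread(target=pulse, daemon=True)
    thread.start()
    try:
        return work()
    finally:
        stop.set()     # stop beating whether work() returned or raised
        thread.join()
```

A side-thread heartbeat shows that the process is alive, not that the work is advancing. Where it matters, emit progress from inside the work loop as well.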

How Breyta fits this use case

Breyta is a workflow and agent orchestration platform for coding agents. It helps teams build, run, and publish reliable workflows, agents, and autonomous jobs with deterministic execution, run history, versioned flow definitions, approvals, waits, and an agent-first CLI.

Breyta supports long-running work with clear patterns:

Remote-agent pattern
Kick off remote work over SSH with a step.
Pause the workflow with a wait step.
Resume when the remote worker posts to the callback URL.

Approvals and human checkpoints
Pause for review and resume with state intact.

Run history and step outputs
Inspect each step and artifact after the fact.

Resources and persistence
Use persist for large outputs. Pass compact res:// refs instead of huge payloads.

Versioned releases
Iterate in draft. Promote to live when approved. Runs are pinned to the resolved release at start.

This lets you avoid fragile long-lived connections. The agent does the task. Breyta runs the workflow around it.

Troubleshooting timeouts and hangs in Breyta

Find the blocking step
Open the run history. Review step outputs and timestamps.

Switch to the remote-agent pattern when needed
If an SSH compute step holds a socket open, kick off the job, then pause with a wait step. Resume on callback.

Avoid large inline payloads
Store big outputs with persist. Pass res:// refs.

Use approvals and waits instead of idle sockets
Pause the flow for manual gates. Let the system hold state while humans decide.

Keep changes safe
Test fixes in draft. Promote a versioned release when ready.

Check connections and secrets
Update tokens or API credentials in connections. Keep logic separate from secret material.

Right-size concurrency
Review the flow’s concurrency policy to prevent starvation or overload.

Use notifications
Notify on waits and approvals so runs do not stall unseen.

Lean on the CLI
Use the agent-first CLI for stable JSON, scripting, and resource inspection.

FAQ

Should I just raise timeouts?

Only if the work is near the limit and sends regular signals. If the step blocks a socket or never heartbeats, refactor to async with a callback and a wait.

How do I handle very large results?

Do not return them inline. Write to storage, return a small reference, and fetch on demand. In Breyta, use persist and res:// refs.

Can Breyta run long jobs that span hours or days?

Breyta workflows can pause with wait steps and resume later with state intact. Use callbacks, approvals, and notifications to keep runs reliable and inspectable.