Troubleshooting a failed or timed-out run

When a CloudRadial AutomationAI run fails or times out, the run detail in History tells you which step stopped it and why. This article covers reading step logs, how step timeouts work, common failure causes, splitting long work into multiple nodes, and rerunning. It is for Admin and Owner roles.

Reading Step Logs
Step Timeouts
Common Failure Causes
Splitting Long Work Into Multiple Nodes
Rerunning

Reading Step Logs

Open the run from History (/runs) to see its steps in order. The failing step carries a failed or timed-out status; open it and read its sections:

Output — what the step produced, if anything, before it stopped
Log — any log text the step captured during execution
Error — the error message, shown when the step failed

Work down the steps until you reach the first one that did not succeed. That is where the run stopped, and its Error and Log sections explain the cause.
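The scan described above can be sketched in a few lines of Python. This is only an illustration of the reading order, not an AutomationAI API: the step dictionaries, the field names (name, status, error), and the "succeeded" and "pending" status strings are assumptions made for the example. The article itself names only the failed and timed-out statuses.

```python
from typing import Optional

# Assumed status string for a step that completed normally (hypothetical).
SUCCEEDED = "succeeded"


def first_unsuccessful_step(steps: list[dict]) -> Optional[dict]:
    """Return the first step, in run order, whose status is not 'succeeded'.

    Returns None when every step succeeded or the list is empty.
    """
    for step in steps:
        if step.get("status") != SUCCEEDED:
            return step
    return None


# Made-up run detail, listed in execution order as on the run page.
run_steps = [
    {"name": "fetch-tickets", "status": "succeeded", "error": None},
    {"name": "update-records", "status": "timed-out", "error": None},
    {"name": "send-summary", "status": "pending", "error": None},
]

stopped_at = first_unsuccessful_step(run_steps)
if stopped_at is not None:
    print(f"Run stopped at step {stopped_at['name']} ({stopped_at['status']})")
```

Steps after the first unsuccessful one never ran, so their sections will not explain the failure.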

Step Timeouts

Each step has a time budget. A step runs for at most its configured timeoutSeconds, which defaults to five minutes (300 seconds). The hard cap is 30 minutes (1,800 seconds), and the Designer will not let a node's timeout exceed it. If a step runs past its budget, AutomationAI marks the run timed out and stops it. The budget is enforced as a lease: when a runner leases a step, the control plane sets the lease to expire at the lease time plus the step's timeout plus a short grace period, and a periodic sweep fails any run that is still running past that point.
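The lease arithmetic can be illustrated with a short Python sketch. It is a model of the rule described above, not the control plane's actual code. The default and the cap come from this article; the 30-second grace value and all function names are assumptions, since the article does not state how long the grace is.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

DEFAULT_TIMEOUT_SECONDS = 5 * 60   # default stated in this article
MAX_TIMEOUT_SECONDS = 30 * 60      # hard cap stated in this article
GRACE_SECONDS = 30                 # assumption: the real grace value is not documented here


def effective_timeout(timeout_seconds: Optional[int] = None) -> int:
    """Return the step's time budget, rejecting values outside the allowed range."""
    if timeout_seconds is None:
        return DEFAULT_TIMEOUT_SECONDS
    if timeout_seconds <= 0 or timeout_seconds > MAX_TIMEOUT_SECONDS:
        raise ValueError(f"timeoutSeconds must be between 1 and {MAX_TIMEOUT_SECONDS}")
    return timeout_seconds


def lease_expiry(leased_at: datetime, timeout_seconds: Optional[int] = None) -> datetime:
    """Lease time plus the step's timeout plus the grace."""
    return leased_at + timedelta(seconds=effective_timeout(timeout_seconds) + GRACE_SECONDS)


def is_past_lease(leased_at: datetime, now: datetime,
                  timeout_seconds: Optional[int] = None) -> bool:
    """What a periodic sweep would check for a step that is still running."""
    return now > lease_expiry(leased_at, timeout_seconds)


leased = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
print(lease_expiry(leased))        # default budget plus the assumed grace
print(lease_expiry(leased, 1800))  # a step configured at the hard cap
```

Because the sweep is periodic, a run may be marked timed out slightly after the lease expires rather than at that exact moment.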

Common Failure Causes

When a run fails or times out, check for:

A script error in the step — read the Error section for the message the runner reported
A step that needs longer than its timeout allows, which surfaces as a timed-out run rather than a script error
The deployment's runner being offline or revoked, so the work is never picked up — confirm the runner is healthy on the Runners page
Missing input or a binding that did not resolve, so the step received nothing to act on

Splitting Long Work Into Multiple Nodes

Because a single step cannot exceed the 30-minute hard cap, work that needs longer must be modeled as several nodes rather than one long-running job. The Azure Functions HTTP path the runner uses would not survive a single step that long anyway. Break the work into stages, let each node finish well inside its timeout, and pass results between nodes so the next stage continues where the last left off.
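The staging idea can be sketched in Python as follows. This shows only the general pattern of batching work and carrying state forward. It does not use any AutomationAI node or binding API, and the function names, batch size, and state fields are hypothetical.

```python
from typing import Iterator


def split_into_batches(items: list, batch_size: int) -> Iterator[list]:
    """Yield consecutive slices so each stage handles a bounded amount of work."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def run_stage(batch: list, carried: dict) -> dict:
    """Stand-in for one node: process a batch and return the state for the next node."""
    return {
        "processed": carried["processed"] + len(batch),
        "last_item": batch[-1],  # lets the next stage continue from here
    }


# Ten work items handled in stages of four, with state passed from stage to stage.
work_items = list(range(1, 11))
state = {"processed": 0, "last_item": None}
for batch in split_into_batches(work_items, 4):
    state = run_stage(batch, state)

print(state)
```

Size each batch so a stage finishes with comfortable headroom under its timeout, and hand the next node a small piece of state such as a count or a last-processed marker rather than the full data set.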

Rerunning

After you have addressed the cause, start the deployment again with the Run action. Each run is a fresh execution, so a new run picks up the current pinned version on the runner's next poll. If you changed the workflow itself, publish the new version and re-pin the deployment to it before rerunning, because a deployment always runs its pinned published version.
