Early stopping on diverging loss
If the loss starts climbing or NaNs out, the rest of the run is wasted compute. Arkor does not have built-in early stopping; it does have everything you need to bolt it on in a few lines of TypeScript. This recipe uses three primitives together:
- onLog to watch the loss as it streams from the backend.
- An AbortController whose signal is wired into the trainer.
- trainer.cancel() to stop the run on the backend after we abort locally.
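As a rough illustration of the first two primitives, here is a minimal sketch of a callback that watches the loss and aborts a controller. The LogFrame shape (apart from loss being number | null), the makeDivergenceGuard helper, and the threshold of 10 are all assumptions made for this example, not part of the Arkor SDK. Passing the callback and controller.signal to the trainer is left out because the exact option names are not shown here.

```typescript
// Hypothetical frame shape; only `loss: number | null` is taken from this page.
interface LogFrame {
  step: number;
  loss: number | null;
}

// Returns an onLog-style callback that aborts `controller` on a diverging loss.
function makeDivergenceGuard(
  controller: AbortController,
  maxLoss: number, // assumed threshold, tune per run
): (frame: LogFrame) => void {
  return ({ loss }) => {
    if (loss === null) return; // non-metric frame, nothing to judge
    // NaN and Infinity fail Number.isFinite; a finite loss above maxLoss also counts.
    if (!Number.isFinite(loss) || loss > maxLoss) {
      controller.abort(); // abort instead of throwing inside the callback
    }
  };
}

const controller = new AbortController();
const onLog = makeDivergenceGuard(controller, 10);

onLog({ step: 1, loss: 2.3 });  // healthy value, signal stays un-aborted
onLog({ step: 2, loss: null }); // skipped
onLog({ step: 3, loss: NaN });  // diverged, controller.signal.aborted is now true
```

Keeping the check in a small factory keeps the threshold and the controller together in one closure, which is also where a rolling window would live.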
The pattern
If you launch the run with arkor start (Studio’s “Run training” or the CLI), the export above is enough: the controller still aborts wait(), but you cannot wrap the CLI’s await in your own try / catch. The pattern above is the safe one when you want guaranteed backend cancellation.
Why both abortSignal and cancel?
abortSignal and cancel do different things, and the docs say so explicitly because mixing them up wastes spend.
- abortSignal stops your local wait() loop (SDK § Trainer control). It does not call cancel, does not message the backend, and the job keeps running on the managed GPU.
- trainer.cancel() asks the backend to stop the job. Best effort: the request may reject if the job is already in a terminal state (completed, failed, cancelled). Wrap in try / catch if you call it speculatively.
If all you need is to stop waiting locally, abortSignal alone is enough. For “I do not want to keep paying”, call cancel after the abort.
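The abort-then-cancel ordering can be sketched as below. The Cancellable interface and the stopRun helper are hypothetical; the sketch also assumes that cancel() returns a promise that rejects when the job is already in a terminal state, which is how this page describes it but is not verified against the SDK types.

```typescript
// Hypothetical minimal view of a trainer; only cancel() is named on this page.
interface Cancellable {
  cancel(): Promise<void>;
}

// Stops the local wait first, then asks the backend to stop the job.
// Resolves to true if the cancel request was accepted, false if it rejected.
async function stopRun(
  controller: AbortController,
  trainer: Cancellable,
): Promise<boolean> {
  controller.abort(); // ends the local wait() loop only
  try {
    await trainer.cancel(); // best effort, may reject on a terminal job
    return true;
  } catch {
    return false; // job was likely already completed, failed, or cancelled
  }
}

// Usage with a stand-in trainer whose job has already finished.
const finishedTrainer: Cancellable = {
  cancel: async () => {
    throw new Error("job already completed");
  },
};
void stopRun(new AbortController(), finishedTrainer).then((accepted) => {
  console.log("cancel accepted:", accepted);
});
```

Swallowing the rejection is a deliberate choice in this sketch: a failed cancel on a terminal job means there is nothing left to pay for, so it should not crash the caller.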
Variations
Smoothed threshold. A single bad step can be a noisy outlier. Track a rolling window inside the closure instead of reacting to a single value.
The controller also works from outside the trainer file. A Next.js API route, a SIGINT handler, or a parent process can call controller.abort() to stop the run on demand.
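For the smoothed-threshold variation, one possible shape is a rolling mean over the last few finite losses. The makeSmoothedCheck helper, the window size, and the mean threshold are assumptions chosen for illustration; a median or an exponential moving average would fit the same closure.

```typescript
// Returns a checker that reports divergence once the mean of the last
// `windowSize` finite losses exceeds `maxMean`. Both values are assumptions.
function makeSmoothedCheck(windowSize: number, maxMean: number) {
  const recent: number[] = [];
  return (loss: number | null): boolean => {
    if (loss === null) return false;         // non-metric frame
    if (!Number.isFinite(loss)) return true; // NaN or Infinity needs no smoothing
    recent.push(loss);
    if (recent.length > windowSize) recent.shift();
    if (recent.length < windowSize) return false; // not enough data yet
    const mean = recent.reduce((sum, x) => sum + x, 0) / recent.length;
    return mean > maxMean;
  };
}

const diverged = makeSmoothedCheck(3, 5);
diverged(2); // false, window not full
diverged(2); // false, window not full
diverged(9); // false, one outlier: mean of [2, 2, 9] is about 4.33
diverged(8); // true, sustained rise: mean of [2, 9, 8] is about 6.33
```

Inside the log callback this would be used as `if (diverged(loss)) controller.abort();`, so the abort path stays the same as in the unsmoothed version.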
What to keep in mind
- Do not throw inside the callback. A throw is caught by the SSE reconnect loop and the run keeps going (see SDK § Lifecycle callbacks). Use the controller; that is the deterministic path.
- abortSignal does not cancel the backend. This is the most common gotcha. Always pair controller.abort() with trainer.cancel() if cost matters.
- The loss field is number | null. Backends only fill in the loss on metric steps; non-metric frames carry null. The Number.isFinite check also rejects NaN, which is the more common divergence signal in practice.
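The last point can be made concrete with a small classifier. Number.isFinite returns false for null as well as for NaN and Infinity, so a sketch like the following checks null first to keep "no metric on this frame" separate from "diverged". The LossState type and classifyLoss name are hypothetical.

```typescript
type LossState = "no-metric" | "diverged" | "ok";

// Separates missing losses from non-finite ones before any threshold logic runs.
function classifyLoss(loss: number | null): LossState {
  if (loss === null) return "no-metric"; // non-metric frame
  return Number.isFinite(loss) ? "ok" : "diverged";
}

console.log(classifyLoss(null));     // "no-metric"
console.log(classifyLoss(NaN));      // "diverged"
console.log(classifyLoss(Infinity)); // "diverged"
console.log(classifyLoss(0.42));     // "ok"
```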