Lifecycle callbacks
Arkor fires five callbacks as a training run progresses. They are all optional, and each one is a plain TypeScript function that runs in your process. This is what makes the loop feel like the rest of your application: no notebook, no out-of-band dashboard, no separate config language.
When each callback fires
All callbacks are dispatched from wait(). If you call start() without later calling wait(), no callbacks fire, even though the run is still happening on the backend. arkor start calls wait() for you; if you wire training into your own code outside the CLI, make sure you do too.
onCompleted and onFailed are mutually exclusive: at most one of them fires per run. If wait() throws before a terminal event arrives (for example when abortSignal is aborted, or reconnect attempts are exhausted), it is possible that neither one fires.
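The rules above can be sketched as a small wrapper that always pairs start() with wait() and records which terminal callback fired, if any. This is a minimal sketch under assumptions: TrainerLike, createOutcomeTracker, and submitAndWait are hypothetical names for illustration, not part of Arkor's API, and the return types of start() and wait() are left as unknown because this page does not spell them out.

```typescript
// Hypothetical minimal shape of a trainer; the real object comes from Arkor.
interface TrainerLike {
  start(): Promise<unknown>; // submits the job
  wait(): Promise<unknown>; // dispatches callbacks until the run ends
}

type Outcome = "completed" | "failed" | "no-terminal-event";

interface OutcomeTracker {
  outcome: Outcome;
  onCompleted(): void;
  onFailed(): void;
}

// Records which terminal callback fired. At most one of them should.
function createOutcomeTracker(): OutcomeTracker {
  const tracker: OutcomeTracker = {
    outcome: "no-terminal-event",
    onCompleted() {
      tracker.outcome = "completed";
    },
    onFailed() {
      tracker.outcome = "failed";
    },
  };
  return tracker;
}

interface RunResult {
  outcome: Outcome;
  waitError?: unknown; // set when wait() rejected
}

// Always follows start() with wait(), and reports the case where wait()
// rejects before either terminal callback has fired.
async function submitAndWait(
  trainer: TrainerLike,
  tracker: OutcomeTracker,
): Promise<RunResult> {
  await trainer.start();
  try {
    await trainer.wait();
    return { outcome: tracker.outcome };
  } catch (waitError) {
    return { outcome: tracker.outcome, waitError };
  }
}
```

In this sketch you would register tracker.onCompleted and tracker.onFailed as the corresponding callbacks wherever you configure the run. An outcome of "no-terminal-event" together with a waitError corresponds to the abort and reconnect-exhaustion cases described above.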
Returning a promise from any callback is fine. Arkor awaits it before moving on, so you can do async work (writing to a database, posting to Slack, calling infer) without races.
onStarted({ job })
Fires once, when the SSE stream opened by wait() reports a training.started event. Note that this is not the same moment start() resolves: start() only submits the job and returns its jobId.
onLog({ step, loss, evalLoss, learningRate, epoch, samplesPerSecond, job })
Fires repeatedly as training progresses. Each numeric field can be null when the backend has not produced that metric yet (for example, evalLoss is only populated on eval steps). Use it to:
- Forward to your own metrics pipeline (e.g. PostHog, Datadog).
- Detect divergence early: if loss is climbing, abort the run.
- Implement custom early stopping logic.
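One way to detect a climbing loss is a patience counter: track the best loss seen so far and react after a fixed number of consecutive logs that fail to improve on it. The sketch below is only an illustration. LogLike covers just the two fields it needs, and createDivergenceDetector, patience, and onDiverged are hypothetical names, not Arkor APIs.

```typescript
// Subset of the onLog payload used here; both fields may be null.
interface LogLike {
  step: number | null;
  loss: number | null;
}

// Returns an onLog-style handler that calls onDiverged once, after `patience`
// consecutive logs whose loss did not improve on the best loss seen so far.
function createDivergenceDetector(
  patience: number,
  onDiverged: (step: number | null) => void,
): (event: LogLike) => void {
  let best = Number.POSITIVE_INFINITY;
  let logsWithoutImprovement = 0;

  return ({ step, loss }: LogLike): void => {
    if (loss === null) return; // metric not produced yet, nothing to compare

    if (loss < best) {
      best = loss;
      logsWithoutImprovement = 0;
      return;
    }

    logsWithoutImprovement += 1;
    if (logsWithoutImprovement === patience) {
      onDiverged(step);
    }
  };
}
```

What onDiverged does is up to you. If your run is configured with an abortSignal, calling abort() on the matching AbortController is one plausible way to stop it, but how that signal is supplied is not covered in this section.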
onCheckpoint({ step, adapter, job, infer, artifacts })
Fires when a checkpoint is saved on the backend, while the run is still going.
adapter is a small object identifying the checkpoint ({ kind: "checkpoint", jobId, step }). infer is a function: it takes a chat-style request and returns a raw Response. You call await res.text() (or res.json(), or stream the body) to read it.
This is the most useful callback in practice. It lets you sanity-check the model mid-run rather than waiting until the end. If the checkpoint is already worse than the base model, you know to stop.
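A mid-run sanity check can be sketched as a helper that sends one prompt through infer and returns the raw body for inspection. Several things here are assumptions: the { messages: [...] } request shape is a guess at what "chat-style" means, ResponseLike is a structural stand-in for the parts of Response the helper reads, and sampleCheckpoint is a hypothetical name.

```typescript
// The parts of a fetch Response this sketch reads; a real Response satisfies it.
interface ResponseLike {
  ok: boolean;
  status: number;
  text(): Promise<string>;
}

// Assumed chat-style request shape, for illustration only.
interface ChatRequest {
  messages: { role: "system" | "user" | "assistant"; content: string }[];
}

// Subset of the onCheckpoint payload used here.
interface CheckpointLike {
  step: number;
  infer(request: ChatRequest): Promise<ResponseLike>;
}

interface CheckpointSample {
  step: number;
  ok: boolean;
  body: string;
}

// Sends one prompt to the checkpoint and returns the response body as text.
// Errors are captured in the result rather than thrown out of the callback.
async function sampleCheckpoint(
  { step, infer }: CheckpointLike,
  prompt: string,
): Promise<CheckpointSample> {
  try {
    const res = await infer({ messages: [{ role: "user", content: prompt }] });
    const body = await res.text();
    return { step, ok: res.ok, body };
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    return { step, ok: false, body: message };
  }
}
```

Inside onCheckpoint you could call sampleCheckpoint(event, "some fixed prompt") and compare the body against whatever baseline you keep for the base model.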
onCompleted({ job, artifacts })
Fires once, on success. artifacts lists what the backend produced for this run. Use it to:
- Persist the final adapter ID where the rest of your app can find it.
- Run a final smoke test before promoting the model.
- Send a “training done” notification.
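Because Arkor awaits a returned promise, onCompleted can persist its results before the run is considered handled. The sketch below keeps storage abstract: SaveRecord stands in for whatever database or file write you use, artifacts is typed as unknown because its shape is not described here, and makeOnCompleted is a hypothetical helper name.

```typescript
// Subset of the onCompleted payload; shapes are left opaque on purpose.
interface CompletedLike {
  job: unknown;
  artifacts: unknown;
}

interface RunRecord {
  finishedAt: string; // ISO 8601 timestamp
  artifacts: unknown;
}

// Stand-in for your own persistence layer (database, file, key-value store).
type SaveRecord = (record: RunRecord) => Promise<void>;

// Builds an async onCompleted handler that stores the artifacts via `save`.
// `now` is injectable so the handler is easy to test.
function makeOnCompleted(
  save: SaveRecord,
  now: () => Date = () => new Date(),
): (event: CompletedLike) => Promise<void> {
  return async ({ artifacts }: CompletedLike): Promise<void> => {
    try {
      await save({ finishedAt: now().toISOString(), artifacts });
    } catch (err) {
      // Handle the failure here instead of letting it escape the callback.
      console.error("could not persist training artifacts:", err);
    }
  };
}
```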
onFailed({ job, error })
Fires once if the backend reports a failure. Note that error is a string (the message the backend sent), not an Error instance.
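Since error arrives as a string, reading error.message would give undefined. A minimal sketch of a handler that uses the string directly, and wraps it only when downstream code expects an Error, might look like this. FailedLike and toError are hypothetical names for illustration.

```typescript
// Subset of the onFailed payload; `error` is the backend's message string.
interface FailedLike {
  job: unknown;
  error: string;
}

// Wraps the backend message for code paths that expect an Error instance.
function toError(error: string): Error {
  return new Error(error);
}

function onFailed({ error }: FailedLike): void {
  // Use the string as-is; there is no .message or .stack on it.
  console.error(`training failed: ${error}`);
}
```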
onFailed is for backend-reported failures only. A thrown exception inside one of your own callbacks does not route through onFailed, and it is also not guaranteed to fail wait() cleanly: the current runtime catches errors thrown during event dispatch and treats them as SSE failures, which feed into the reconnect loop. The original exception can end up retried or skipped rather than surfacing where you would expect. If you need deterministic behavior, catch errors inside the callback and decide what to do (log, abort, persist) before they escape.
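One way to get that deterministic behavior is to wrap each callback so nothing it throws can escape into event dispatch. The following is a minimal sketch, not something Arkor provides: guarded and its onError parameter are hypothetical names.

```typescript
type Callback<E> = (event: E) => void | Promise<void>;

// Wraps a callback so that synchronous throws and rejected promises are
// routed to onError instead of escaping to the caller.
function guarded<E>(
  name: string,
  callback: Callback<E>,
  onError: (name: string, err: unknown) => void = (n, err) => {
    console.error(`callback ${n} threw:`, err);
  },
): (event: E) => Promise<void> {
  return async (event: E): Promise<void> => {
    try {
      await callback(event);
    } catch (err) {
      onError(name, err);
    }
  };
}
```

You would then register guarded("onLog", myOnLog) in place of myOnLog, and likewise for the other callbacks. Keep onError itself simple, since an exception thrown from it would still escape.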