Lifecycle callbacks

Arkor fires five callbacks as a training run progresses. They are all optional, and each one is a plain TypeScript function that runs in your process. This is what makes the loop feel like the rest of your application: no notebook, no out-of-band dashboard, no separate config language.

createTrainer({
  name: "support-bot-v1",
  model: "unsloth/gemma-4-E4B-it",
  dataset: { type: "huggingface", name: "arkorlab/triage-demo" },
  callbacks: {
    onStarted, onLog, onCheckpoint, onCompleted, onFailed,
  },
});

When each callback fires

.start()                 ── submits the job, returns { jobId }. No callbacks here.
   │
   ▼
.wait()                  ── opens the SSE event stream. Callbacks dispatch from here.
   │
   ▼
onStarted ─── once, when the stream reports `training.started`
   │
   ▼
onLog     ─── many times, once per training-step batch of metrics
   │
   ▼
onCheckpoint ── several times, when an adapter checkpoint is saved
   │
   ▼
onCompleted  ── once, on successful finish
or
onFailed     ── once, on backend-reported failure

All five callbacks are dispatched from inside wait(). If you call start() without later calling wait(), no callbacks fire, even though the run is still happening on the backend. The arkor start command calls wait() for you; if you wire training into your own code outside the CLI, make sure you call it too.

onCompleted and onFailed are mutually exclusive: at most one of them fires per run. If wait() throws before a terminal event arrives (for example when abortSignal is aborted, or reconnect attempts are exhausted), neither one may fire.

Returning a promise from any callback is fine. Arkor awaits it before moving on, so you can do async work (writing to a database, posting to Slack, calling infer) without races.
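As a rough sketch of the points above (callbacks only dispatch once wait() runs, and a returned promise is awaited before the next event is handled), the snippet below uses a stand-in trainer. TrainerLike, runToCompletion, and createFakeTrainer are hypothetical names for illustration, not part of the arkor package. The fake wait() only imitates the ordering described here; the real one reads events from the SSE stream.

```typescript
// Hypothetical minimal shape; the real trainer type from "arkor" may differ.
interface TrainerLike {
  start(): Promise<{ jobId: string }>;
  wait(): Promise<unknown>;
}

// Submit the job, then block on wait() so that callbacks actually dispatch.
async function runToCompletion(trainer: TrainerLike): Promise<string> {
  const { jobId } = await trainer.start();
  await trainer.wait(); // without this call, no callback ever fires
  return jobId;
}

// Stand-in trainer for illustration only: wait() replays a fixed list of
// handlers and awaits each one before moving on to the next.
function createFakeTrainer(
  handlers: Array<() => void | Promise<void>>,
): TrainerLike {
  return {
    start: async () => ({ jobId: "job_demo" }),
    wait: async () => {
      for (const handler of handlers) {
        await handler();
      }
    },
  };
}

const order: string[] = [];
const fake = createFakeTrainer([
  async () => {
    // Simulate slow async work, such as a database write.
    await new Promise((resolve) => setTimeout(resolve, 10));
    order.push("first");
  },
  () => {
    order.push("second");
  },
]);

const finished = runToCompletion(fake).then((jobId) => {
  console.log(jobId, order);
  return jobId;
});
```

Because each handler is awaited in turn, the slow first handler still finishes before the second one runs.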

`onStarted({ job })`

Fires once, when the SSE stream opened by wait() reports a training.started event. Note that this is not the same moment start() resolves: start() only submits the job and returns its jobId.

onStarted: ({ job }) => {
  console.log(`Run ${job.id} started`);
},

Use it for log lines, metric counters, or sending a “training started” notification.

`onLog({ step, loss, evalLoss, learningRate, epoch, samplesPerSecond, job })`

Fires repeatedly as training progresses. Each numeric field can be null when the backend has not produced that metric yet (for example, evalLoss is only populated on eval steps).

onLog: ({ step, loss, evalLoss }) => {
  if (loss !== null) {
    console.log(`step=${step} loss=${loss.toFixed(4)} evalLoss=${evalLoss ?? "-"}`);
  }
},

Common uses:

Forward to your own metrics pipeline (e.g. PostHog, Datadog).
Detect divergence early: if loss is climbing, abort the run.
Implement custom early stopping logic.
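The divergence check mentioned in the list above could look something like the following sketch. createDivergenceGuard and its patience parameter are hypothetical, and the rule (loss rising on several consecutive reports) is only one simple heuristic. Wiring controller.signal into the run is left out, since this page mentions abortSignal but does not show where it is configured.

```typescript
// Subset of the onLog payload described above; other fields are omitted.
interface LogEvent {
  loss: number | null;
}

// Returns an onLog-compatible handler that aborts `controller` once loss has
// increased for `patience` consecutive reported values.
function createDivergenceGuard(controller: AbortController, patience = 3) {
  let previous: number | null = null;
  let rising = 0;

  return ({ loss }: LogEvent): void => {
    if (loss === null) return; // metric not produced for this event
    if (previous !== null && loss > previous) {
      rising += 1;
    } else {
      rising = 0; // any improvement (or the first value) resets the streak
    }
    previous = loss;
    if (rising >= patience && !controller.signal.aborted) {
      controller.abort();
    }
  };
}

const controller = new AbortController();
const onLog = createDivergenceGuard(controller, 3);

// Replay a made-up loss curve: it improves, then climbs three times in a row.
const losses: Array<number | null> = [2.0, 1.5, null, 1.6, 1.8, 2.1];
losses.forEach((loss) => onLog({ loss }));
console.log(controller.signal.aborted);
```

Null values are skipped rather than counted, so a missing metric neither extends nor resets the streak.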

`onCheckpoint({ step, adapter, job, infer, artifacts })`

Fires when a checkpoint is saved on the backend, while the run is still going.

onCheckpoint: async ({ step, infer }) => {
  const res = await infer({
    messages: [{ role: "user", content: "Can't log in" }],
  });
  console.log(`step=${step} sample=`, await res.text());
},

adapter is a small object identifying the checkpoint ({ kind: "checkpoint", jobId, step }). infer is a function: it takes a chat-style request and returns a raw Response. You call await res.text() (or res.json(), or stream the body) to read it.

This is the most useful callback in practice. It lets you sanity-check the model mid-run rather than waiting until the end. If the checkpoint is already worse than the base model, you know to stop.
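A mid-run sanity check along these lines might be sketched as below. smokeTestCheckpoint, ResponseLike, and fakeInfer are hypothetical names, the pass rule (a successful status and non-empty text) is deliberately crude, and the request type is inferred from the examples on this page rather than taken from the SDK's type definitions.

```typescript
// Minimal structural view of the Response fields used here.
interface ResponseLike {
  ok: boolean;
  status: number;
  text(): Promise<string>;
}

// Assumed request shape, based on the examples on this page.
type Infer = (request: {
  messages: Array<{ role: string; content: string }>;
}) => Promise<ResponseLike>;

// Sends one fixed prompt and reports whether the reply looks usable.
async function smokeTestCheckpoint(infer: Infer, prompt: string) {
  const res = await infer({ messages: [{ role: "user", content: prompt }] });
  if (!res.ok) {
    return { passed: false, detail: `HTTP ${res.status}` };
  }
  const body = (await res.text()).trim();
  // Keep only a short excerpt so the result is cheap to log.
  return { passed: body.length > 0, detail: body.slice(0, 200) };
}

// Stand-in for the real infer function, for illustration only.
const fakeInfer: Infer = async () => ({
  ok: true,
  status: 200,
  text: async () => "Try resetting your password from the sign-in page.",
});

const smokeResult = smokeTestCheckpoint(fakeInfer, "Can't log in");
smokeResult.then((result) => console.log(result));
```

Inside onCheckpoint you would pass the infer function from the callback payload instead of fakeInfer, and compare the result against whatever baseline matters for your task.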

`onCompleted({ job, artifacts })`

Fires once, on success. artifacts lists what the backend produced for this run. Use it to:

Persist the final adapter ID where the rest of your app can find it.
Run a final smoke test before promoting the model.
Send a “training done” notification.

onCompleted: ({ job, artifacts }) => {
  console.log(`Run ${job.id} done, ${artifacts.length} artifacts`);
},

`onFailed({ job, error })`

Fires once if the backend reports a failure. Note that error is a string (the message the backend sent), not an Error instance:

onFailed: ({ job, error }) => {
  console.error(`Run ${job.id} failed: ${error}`);
},

onFailed is for backend-reported failures only. A thrown exception inside one of your own callbacks does not route through onFailed, and it is also not guaranteed to fail wait() cleanly: the current runtime catches errors thrown during event dispatch and treats them as SSE failures, which feed into the reconnect loop. The original exception can end up retried or skipped rather than surfacing where you would expect. If you need deterministic behavior, catch errors inside the callback and decide what to do (log, abort, persist) before they escape.
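One way to keep exceptions from escaping, sketched under the assumption that you want every callback failure handled in one place, is a small wrapper. guarded is a hypothetical helper, not something the arkor package provides.

```typescript
// Wraps a callback so that anything it throws (sync or async) is handed to
// `onError` instead of escaping into the event-dispatch loop.
function guarded<T>(
  callback: (payload: T) => void | Promise<void>,
  onError: (error: unknown) => void,
): (payload: T) => Promise<void> {
  return async (payload: T) => {
    try {
      await callback(payload);
    } catch (error) {
      onError(error);
    }
  };
}

const captured: unknown[] = [];

// Example handler that fails on one particular step.
const onCheckpoint = guarded<{ step: number }>(
  ({ step }) => {
    if (step === 20) throw new Error(`sample failed at step ${step}`);
  },
  (error) => captured.push(error),
);

const guardedRun = (async () => {
  await onCheckpoint({ step: 10 }); // completes normally
  await onCheckpoint({ step: 20 }); // throws inside, but the error is contained
  console.log(captured.length);
})();
```

The onError handler is where you decide what to do (log, abort, persist); the wrapper only guarantees that the decision is made before the error reaches the runtime.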

A complete example

import { createTrainer } from "arkor";

export const trainer = createTrainer({
  name: "support-bot-v1",
  model: "unsloth/gemma-4-E4B-it",
  dataset: { type: "huggingface", name: "arkorlab/triage-demo" },
  lora: { r: 16, alpha: 16 },
  maxSteps: 100,
  callbacks: {
    onStarted: ({ job }) => console.log(`started ${job.id}`),
    onLog: ({ step, loss }) => {
      if (loss !== null) console.log(`step=${step} loss=${loss.toFixed(4)}`);
    },
    onCheckpoint: async ({ step, infer }) => {
      const res = await infer({
        messages: [{ role: "user", content: "Hello!" }],
      });
      console.log(`ckpt @ ${step}:`, await res.text());
    },
    onCompleted: ({ job }) => console.log(`done ${job.id}`),
    onFailed: ({ error }) => console.error(`failed: ${error}`),
  },
});

Once you understand callbacks, almost everything else in Arkor is just configuration on top.
