Trainer

createTrainer is the heart of an Arkor project. Everything Arkor knows about a fine-tuning run, from the base model down to the LoRA rank, is on the object you pass in.

Minimal example

import { createTrainer } from "arkor";

export const trainer = createTrainer({
  name: "support-bot-v1",
  model: "unsloth/gemma-4-E4B-it",
  dataset: { type: "huggingface", name: "arkorlab/triage-demo" },
});

That alone is a valid trainer. Defaults are filled in for everything else.

Required fields

Field	Type	Notes
`name`	`string`	Job name shown in Studio and the managed backend.
`model`	`string`	A model identifier the backend recognizes (Gemma today).
`dataset`	`DatasetSource`	See Dataset sources.

Dataset sources

DatasetSource is a discriminated union on type:

type DatasetSource =
  | { type: "huggingface"; name: string; split?: string; subset?: string }
  | { type: "blob";        url: string;  token?: string };

Hugging Face. The most common shape. Arkor pulls the dataset by name. Use split to override the default split, and subset for datasets that publish multiple subsets.
Blob URL. Any HTTPS URL the backend can fetch. Pass token if the URL needs an Authorization: Bearer header.
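
Because the union is discriminated on type, TypeScript narrows each branch for you inside a switch. The sketch below redeclares DatasetSource locally so it runs on its own; describeSource is a hypothetical helper for logging, not part of the SDK, and the blob URL is a placeholder.

```typescript
// Local copy of the union from this page so the snippet is self-contained.
type DatasetSource =
  | { type: "huggingface"; name: string; split?: string; subset?: string }
  | { type: "blob"; url: string; token?: string };

// Hypothetical helper (not part of arkor): summarizes a source for logging.
function describeSource(source: DatasetSource): string {
  switch (source.type) {
    case "huggingface": {
      // Here `source` is narrowed to the Hugging Face shape.
      const subset = source.subset ? `/${source.subset}` : "";
      const split = source.split ? ` [${source.split}]` : "";
      return `huggingface:${source.name}${subset}${split}`;
    }
    case "blob":
      // Report only whether a token is set, never the token itself.
      return `blob:${source.url}${source.token ? " (with token)" : ""}`;
  }
}

const hf: DatasetSource = {
  type: "huggingface",
  name: "arkorlab/triage-demo",
  split: "train",
};
const blob: DatasetSource = {
  type: "blob",
  url: "https://example.com/data.jsonl", // placeholder URL
  token: "secret",
};

console.log(describeSource(hf)); // huggingface:arkorlab/triage-demo [train]
console.log(describeSource(blob)); // blob:https://example.com/data.jsonl (with token)
```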

LoRA configuration

Pass lora to control LoRA settings. All four fields are typed; r and alpha are required whenever you pass lora, while maxLength and loadIn4bit are optional:

lora?: {
  r: number;             // LoRA rank
  alpha: number;         // LoRA alpha
  maxLength?: number;    // Maximum sequence length
  loadIn4bit?: boolean;  // QLoRA quantization
}

If you omit lora, the backend applies sensible defaults. r: 16, alpha: 16 is a good starting point for the bundled templates.
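
One way to read r and alpha together: in the standard LoRA formulation the low-rank update is scaled by alpha / r, so r: 16, alpha: 16 gives a scale of 1. The sketch below assumes the backend follows that standard scaling, which this page does not state. loraScale is a hypothetical helper and its validation rules are illustrative, not the backend's.

```typescript
// Local copy of the lora shape from this page.
interface LoraConfig {
  r: number;
  alpha: number;
  maxLength?: number;
  loadIn4bit?: boolean;
}

// Hypothetical helper (not part of arkor): sanity-checks the config and
// returns alpha / r, the update scale in the standard LoRA formulation.
function loraScale(lora: LoraConfig): number {
  if (!Number.isInteger(lora.r) || lora.r <= 0) {
    throw new RangeError(`r must be a positive integer, got ${lora.r}`);
  }
  if (!(lora.alpha > 0)) {
    throw new RangeError(`alpha must be positive, got ${lora.alpha}`);
  }
  return lora.alpha / lora.r;
}

console.log(loraScale({ r: 16, alpha: 16 })); // 1
console.log(loraScale({ r: 8, alpha: 16, loadIn4bit: true })); // 2
```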

Common hyperparameters

Field	Type	What it does
`maxSteps`	`number`	Cap on training steps. Often the simplest knob to turn.
`numTrainEpochs`	`number`	Alternative to `maxSteps`: number of dataset passes.
`learningRate`	`number`	Step size for the optimizer.
`batchSize`	`number`	Per-device training batch size.
`optim`	`string`	Optimizer name (the backend list governs valid values).
`lrSchedulerType`	`string`	LR schedule (linear, cosine, etc.).
`weightDecay`	`number`	Regularization weight.

If you only set maxSteps, the rest stay at backend defaults. That is usually what you want for the first few runs.
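
maxSteps and numTrainEpochs are two ways of bounding the same training loop. As a rough sketch of how they relate, the estimate below assumes a single device, no gradient accumulation, and that the final partial batch of each epoch is kept; this page does not confirm any of those for the backend. estimateSteps is a hypothetical helper, not part of the SDK.

```typescript
// Hypothetical helper (not part of arkor): approximate optimizer steps for a
// given number of epochs. Assumes one device, no gradient accumulation, and
// that the last partial batch in each epoch still counts as a step.
function estimateSteps(
  datasetSize: number,
  batchSize: number,
  numTrainEpochs: number,
): number {
  if (datasetSize <= 0 || batchSize <= 0 || numTrainEpochs <= 0) {
    throw new RangeError("all arguments must be positive");
  }
  const stepsPerEpoch = Math.ceil(datasetSize / batchSize);
  return Math.ceil(stepsPerEpoch * numTrainEpochs);
}

console.log(estimateSteps(1000, 8, 3)); // 375
console.log(estimateSteps(1001, 8, 1)); // 126
```

An estimate like this can help when picking a maxSteps value that roughly matches the number of dataset passes you had in mind.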

Smoke testing with `dryRun`

createTrainer({
  name: "smoke",
  model: "unsloth/gemma-4-E4B-it",
  dataset: { type: "huggingface", name: "arkorlab/triage-demo" },
  dryRun: true,
});

dryRun: true tells the backend to run a minimal end-to-end smoke test of the trainer: the full pipeline (including the training loop) executes against a truncated dataset / capped step count so it finishes quickly. It still uses GPU time, just much less of it. Useful in CI or when wiring up callbacks for the first time.
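
For the CI case, one option is to derive the smoke-test config from your real one instead of maintaining two copies. This is a minimal sketch: TrainerConfigSketch is a partial local stand-in for the real config type, forEnvironment is a hypothetical helper, and both the CI variable check and the "-smoke" name suffix are illustrative choices, not SDK behavior.

```typescript
// Partial local sketch of the config shape used on this page.
interface TrainerConfigSketch {
  name: string;
  model: string;
  dataset: { type: "huggingface"; name: string };
  dryRun?: boolean;
}

// Hypothetical helper (not part of arkor): returns a dry-run variant of the
// config when the environment's CI variable is "true", otherwise the config
// unchanged. The environment is a parameter so the function is easy to test.
function forEnvironment(
  config: TrainerConfigSketch,
  env: Record<string, string | undefined>,
): TrainerConfigSketch {
  if (env["CI"] !== "true") return config;
  return { ...config, name: `${config.name}-smoke`, dryRun: true };
}

const base: TrainerConfigSketch = {
  name: "support-bot-v1",
  model: "unsloth/gemma-4-E4B-it",
  dataset: { type: "huggingface", name: "arkorlab/triage-demo" },
};

const ciConfig = forEnvironment(base, { CI: "true" });
console.log(ciConfig.name, ciConfig.dryRun); // support-bot-v1-smoke true

const localConfig = forEnvironment(base, {});
console.log(localConfig.name, localConfig.dryRun); // support-bot-v1 undefined
```

In a real project you would pass process.env as the second argument and hand the result to createTrainer.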

What `createTrainer` returns

interface Trainer {
  readonly name: string;
  start(): Promise<{ jobId: string }>;
  wait(): Promise<TrainingResult>;
  cancel(): Promise<void>;
}

start() submits the job to the managed backend and resolves with the assigned jobId. It does not wait for completion, and it does not dispatch any callbacks on its own.
wait() opens the SSE event stream for the run and returns once the run finishes (or fails). All registered callbacks fire from inside wait(); if you call start() without later calling wait(), no callbacks ever run.
cancel() asks the backend to stop the run. This is a best-effort request: the backend may return an error if the run is already in a terminal state (completed, failed, or already cancelled), so be prepared to catch.

arkor start calls start() and wait() for you (it is what Studio’s “Run training” button spawns under the hood). arkor dev does not run the trainer; it only boots the Studio UI. Call start() and wait() directly only if you wire training into your own code outside the CLI.
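
If you do wire training into your own code, the two calls pair up as in the sketch below. TrainerLike is a local stand-in for the interface shown above with the result type left generic (this page does not spell out TrainingResult), and runTraining is a hypothetical helper, not part of the SDK.

```typescript
// Local stand-in for the Trainer interface described on this page.
interface TrainerLike<R> {
  readonly name: string;
  start(): Promise<{ jobId: string }>;
  wait(): Promise<R>;
  cancel(): Promise<void>;
}

// Hypothetical helper (not part of arkor): submit the job, then follow it to
// the end. Callbacks only fire inside wait(), so skipping it would mean they
// never run.
async function runTraining<R>(
  trainer: TrainerLike<R>,
): Promise<{ jobId: string; result: R }> {
  const { jobId } = await trainer.start(); // resolves once the job is submitted
  console.log(`submitted ${trainer.name} as ${jobId}`);
  const result = await trainer.wait(); // resolves when the run finishes
  return { jobId, result };
}
```

With the real SDK you would pass the object returned by createTrainer, for example `await runTraining(trainer)`.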

Stopping `wait()` with `AbortSignal`

const controller = new AbortController();
const trainer = createTrainer({
  name: "cancellable",
  model: "unsloth/gemma-4-E4B-it",
  dataset: { type: "huggingface", name: "arkorlab/triage-demo" },
  abortSignal: controller.signal,
});

// Later, from anywhere:
controller.abort();

abortSignal is only about your local wait(): aborting it stops the SSE event-stream fetch and any retry / backoff delays inside wait(). The current implementation throws on abort (the backoff delay rejects with signal.reason, and the failure handler re-throws when the signal is aborted), so wait() rejects rather than resolving cleanly. Wrap it in try / catch if you abort:

try {
  await trainer.wait();
} catch (err) {
  if (controller.signal.aborted) {
    // expected: we asked wait() to stop
  } else {
    throw err;
  }
}

abortSignal does not call cancel() and does not ask the backend to stop the run; the job keeps using GPU time on the managed side. If you want to actually stop training (and the cost), call trainer.cancel() separately:

controller.abort();          // local wait() rejects
await trainer.cancel();      // asks the backend to stop the job

Use abortSignal for “I no longer care about waiting on this run” (a request timed out, a parent process is exiting). Use cancel() for “stop the run on the backend”.
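
The two mechanisms can be combined when you want a deadline that also stops the job. The following is a minimal sketch under the behavior described above (wait() rejects once the signal is aborted). TrainerLike is a local stand-in for the trainer interface, and waitOrCancel is a hypothetical helper, not part of the SDK. The controller you pass must be the one whose signal was given as abortSignal to createTrainer.

```typescript
// Minimal local shape of the methods this sketch needs.
interface TrainerLike<R> {
  wait(): Promise<R>;
  cancel(): Promise<void>;
}

// Hypothetical helper (not part of arkor): stop waiting after timeoutMs and
// then ask the backend to stop the run as well. Returns undefined on timeout.
async function waitOrCancel<R>(
  trainer: TrainerLike<R>,
  controller: AbortController,
  timeoutMs: number,
): Promise<R | undefined> {
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    return await trainer.wait();
  } catch (err) {
    // Anything other than our own abort is a real failure: re-throw it.
    if (!controller.signal.aborted) throw err;
    try {
      await trainer.cancel(); // best effort: the run may already be terminal
    } catch {
      // Nothing left to stop; ignoring this is a choice made for the sketch.
    }
    return undefined;
  } finally {
    clearTimeout(timer);
  }
}
```

Swallowing the cancel() error keeps the sketch short; in production code you may prefer to log it.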

Reacting to events

The whole point of doing this in TypeScript is that you can hook into the run with lifecycle callbacks: onStarted, onLog, onCheckpoint, onCompleted, onFailed. That is the next concept to read.
