Trainer

createTrainer is the heart of an Arkor project. Everything Arkor knows about a fine-tuning run, from the base model down to the LoRA rank, comes from the options object you pass to it.

Minimal example

import { createTrainer } from "arkor";

export const trainer = createTrainer({
  name: "support-bot-v1",
  model: "unsloth/gemma-4-E4B-it",
  dataset: { type: "huggingface", name: "arkorlab/triage-demo" },
});
That alone is a valid trainer. Defaults are filled in for everything else.

Required fields

| Field | Type | Notes |
| --- | --- | --- |
| name | string | Job name shown in Studio and the managed backend. |
| model | string | A model identifier the backend recognizes (Gemma today). |
| dataset | DatasetSource | See Dataset sources. |

Dataset sources

DatasetSource is a discriminated union on type:
type DatasetSource =
  | { type: "huggingface"; name: string; split?: string; subset?: string }
  | { type: "blob";        url: string;  token?: string };
  • Hugging Face. The most common shape. Arkor pulls the dataset by name. Use split to override the default split, and subset for datasets that publish multiple subsets.
  • Blob URL. Any HTTPS URL the backend can fetch. Pass token if the URL needs an Authorization: Bearer header.
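
As a minimal sketch of how the union narrows on type, the snippet below redeclares DatasetSource locally (copied from the definition above so it runs without the arkor package). The describeSource helper, the "train" split, and the example.com URL are illustrative assumptions, not part of Arkor.

```typescript
// Local copy of the union shown above, so the sketch runs on its own.
type DatasetSource =
  | { type: "huggingface"; name: string; split?: string; subset?: string }
  | { type: "blob"; url: string; token?: string };

// Hypothetical helper (not an Arkor API): summarizes a source for logging.
function describeSource(source: DatasetSource): string {
  switch (source.type) {
    case "huggingface": {
      // In this branch TypeScript knows name, split, and subset are available.
      const subset = source.subset ? `/${source.subset}` : "";
      const split = source.split ? ` [${source.split}]` : "";
      return `huggingface:${source.name}${subset}${split}`;
    }
    case "blob":
      // Report only whether a token is set; never log the token itself.
      return `blob:${source.url}${source.token ? " (with bearer token)" : ""}`;
  }
}

// The split name and the URL below are placeholders for illustration.
const hf: DatasetSource = { type: "huggingface", name: "arkorlab/triage-demo", split: "train" };
const blob: DatasetSource = { type: "blob", url: "https://example.com/data.jsonl", token: "secret" };

console.log(describeSource(hf));
console.log(describeSource(blob));
```

Because type is the discriminant, accessing url on a "huggingface" source (or name on a "blob" source) is a compile-time error rather than a runtime surprise.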

LoRA configuration

Pass lora to control LoRA settings. All four fields are typed; r and alpha are required whenever lora is present, while maxLength and loadIn4bit are optional:
lora?: {
  r: number;             // LoRA rank
  alpha: number;         // LoRA alpha
  maxLength?: number;    // Maximum sequence length
  loadIn4bit?: boolean;  // QLoRA quantization
}
If you omit lora, the backend applies sensible defaults. r: 16, alpha: 16 is a good starting point for the bundled templates.
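
To give a feel for how r and alpha interact: in the standard LoRA formulation the adapter update is scaled by alpha / r, so r: 16, alpha: 16 yields a scale of 1. The sketch below assumes the backend follows that standard scaling (it may not, for example if it uses a rank-stabilized variant). LoraConfig is a local copy of the shape above, and loraScale and the maxLength value are hypothetical.

```typescript
// Local copy of the lora shape shown above.
interface LoraConfig {
  r: number;
  alpha: number;
  maxLength?: number;
  loadIn4bit?: boolean;
}

// Hypothetical helper: standard LoRA scales the adapter update by alpha / r.
function loraScale(lora: LoraConfig): number {
  if (!Number.isInteger(lora.r) || lora.r <= 0) {
    throw new RangeError("r must be a positive integer");
  }
  return lora.alpha / lora.r;
}

const starter: LoraConfig = { r: 16, alpha: 16 };
// Doubling the rank without touching alpha halves the scale.
const wider: LoraConfig = { r: 32, alpha: 16, maxLength: 2048, loadIn4bit: true };

console.log(loraScale(starter)); // 1
console.log(loraScale(wider));   // 0.5
```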

Common hyperparameters

| Field | Type | What it does |
| --- | --- | --- |
| maxSteps | number | Cap on training steps. Often the simplest knob to turn. |
| numTrainEpochs | number | Alternative to maxSteps: number of dataset passes. |
| learningRate | number | Step size for the optimizer. |
| batchSize | number | Per-device training batch size. |
| optim | string | Optimizer name (the backend list governs valid values). |
| lrSchedulerType | string | LR schedule (linear, cosine, etc.). |
| weightDecay | number | Regularization weight. |
If you only set maxSteps, the rest stay at backend defaults. That is usually what you want for the first few runs.
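
When choosing between maxSteps and numTrainEpochs, it can help to translate one into the other. The sketch below is back-of-the-envelope arithmetic only: it assumes a single device, no gradient accumulation, and that the final partial batch of each epoch still counts as a step. How the backend actually counts steps is not specified here; stepsForEpochs and the example numbers are hypothetical.

```typescript
// Hypothetical helper: rough optimizer-step count for a given number of epochs.
// Assumes one device, no gradient accumulation, and that a partial last batch counts.
function stepsForEpochs(exampleCount: number, batchSize: number, numTrainEpochs: number): number {
  if (exampleCount <= 0 || batchSize <= 0 || numTrainEpochs <= 0) {
    throw new RangeError("all arguments must be positive");
  }
  const stepsPerEpoch = Math.ceil(exampleCount / batchSize);
  return Math.ceil(stepsPerEpoch * numTrainEpochs);
}

// 1000 examples at batch size 8 is 125 steps per epoch, so 3 epochs is 375 steps.
console.log(stepsForEpochs(1000, 8, 3)); // 375
```

Under these assumptions, setting maxSteps below the computed value ends the run before the requested number of dataset passes.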

Smoke testing with dryRun

createTrainer({
  name: "smoke",
  model: "unsloth/gemma-4-E4B-it",
  dataset: { type: "huggingface", name: "arkorlab/triage-demo" },
  dryRun: true,
});
dryRun: true tells the backend to run a minimal end-to-end smoke test of the trainer: the full pipeline (including the training loop) executes against a truncated dataset / capped step count so it finishes quickly. It still uses GPU time, just much less of it. Useful in CI or when wiring up callbacks for the first time.
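
One way to use this in CI is to derive dryRun from the environment. The sketch below assumes your CI provider sets CI=true (GitHub Actions does, among others). TrainerOptions is a trimmed local stand-in for the real options type, and smokeOptions is a hypothetical helper, not an Arkor API.

```typescript
// Trimmed local stand-in for the trainer options used in this sketch.
interface TrainerOptions {
  name: string;
  model: string;
  dataset: { type: "huggingface"; name: string };
  dryRun?: boolean;
}

// Hypothetical helper: enable dryRun only when the CI variable is exactly "true".
function smokeOptions(env: Record<string, string | undefined>): TrainerOptions {
  return {
    name: "smoke",
    model: "unsloth/gemma-4-E4B-it",
    dataset: { type: "huggingface", name: "arkorlab/triage-demo" },
    dryRun: env["CI"] === "true",
  };
}

console.log(smokeOptions({ CI: "true" }).dryRun); // true
console.log(smokeOptions({}).dryRun);             // false
```

In a Node project you would pass process.env to the helper and hand the resulting object to createTrainer.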

What createTrainer returns

interface Trainer {
  readonly name: string;
  start(): Promise<{ jobId: string }>;
  wait(): Promise<TrainingResult>;
  cancel(): Promise<void>;
}
  • start() submits the job to the managed backend and resolves with the assigned jobId. It does not wait for completion, and it does not dispatch any callbacks on its own.
  • wait() opens the SSE event stream for the run and returns once the run finishes (or fails). All registered callbacks fire from inside wait(); if you call start() without later calling wait(), no callbacks ever run.
  • cancel() asks the backend to stop the run. This is a best-effort request: the backend may return an error if the run is already in a terminal state (completed, failed, or already cancelled), so be prepared to catch.
arkor start calls start() and wait() for you (it is what Studio’s “Run training” button spawns under the hood). arkor dev does not run the trainer; it only boots the Studio UI. Call start() and wait() directly only if you wire training into your own code outside the CLI.
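
If you do wire training into your own code, the start-then-wait sequence described above might look like the sketch below. TrainerLike is a local copy of the Trainer interface with the result type left generic (the fields of TrainingResult are not described here), runToCompletion is a hypothetical helper, and fakeTrainer stands in for a real trainer so the snippet runs without a backend.

```typescript
// Local copy of the Trainer interface shown above, generic over the result type.
interface TrainerLike<TResult> {
  readonly name: string;
  start(): Promise<{ jobId: string }>;
  wait(): Promise<TResult>;
  cancel(): Promise<void>;
}

// Hypothetical helper: submit the job, then block until it finishes or fails.
async function runToCompletion<TResult>(
  trainer: TrainerLike<TResult>,
): Promise<{ jobId: string; result: TResult }> {
  const { jobId } = await trainer.start(); // resolves once the job is submitted
  const result = await trainer.wait();     // callbacks fire while this is pending
  return { jobId, result };
}

// Stand-in trainer (not a real Arkor object) so the sketch is self-contained.
const fakeTrainer: TrainerLike<string> = {
  name: "support-bot-v1",
  start: async () => ({ jobId: "job-123" }),
  wait: async () => "done",
  cancel: async () => {},
};

const run = runToCompletion(fakeTrainer);
run.then(({ jobId, result }) => {
  console.log(`${fakeTrainer.name}: ${jobId} -> ${result}`);
});
```

Keeping both calls in one helper makes it harder to forget wait(), which is the call that actually drives callbacks.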

Stopping wait() with AbortSignal

const controller = new AbortController();
const trainer = createTrainer({
  name: "cancellable",
  model: "unsloth/gemma-4-E4B-it",
  dataset: { type: "huggingface", name: "arkorlab/triage-demo" },
  abortSignal: controller.signal,
});

// Later, from anywhere:
controller.abort();
abortSignal is only about your local wait(): aborting it stops the SSE event-stream fetch and any retry / backoff delays inside wait(). The current implementation throws on abort (the backoff delay rejects with signal.reason, and the failure handler re-throws when the signal is aborted), so wait() rejects rather than resolving cleanly. Wrap it in try / catch if you abort:
try {
  await trainer.wait();
} catch (err) {
  if (controller.signal.aborted) {
    // expected: we asked wait() to stop
  } else {
    throw err;
  }
}
abortSignal does not call cancel() and does not ask the backend to stop the run; the job keeps using GPU time on the managed side. If you want to actually stop training (and the cost), call trainer.cancel() separately:
controller.abort();          // local wait() rejects
await trainer.cancel();      // asks the backend to stop the job
Use abortSignal for “I no longer care about waiting on this run” (a request timed out, a parent process is exiting). Use cancel() for “stop the run on the backend”.
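
The two mechanisms can be combined into a timeout: abort the local wait() after a deadline, then ask the backend to stop. The sketch below is one possible arrangement, not an Arkor API. waitOrCancel and the Cancellable type are hypothetical, and slowTrainer is a stand-in whose wait() rejects on abort, mirroring the behavior described above. With a real trainer, the controller must be the one whose signal was passed as abortSignal to createTrainer.

```typescript
// Only the two Trainer methods this sketch needs.
interface Cancellable {
  wait(): Promise<unknown>;
  cancel(): Promise<void>;
}

// Hypothetical helper: stop waiting after `ms`, then request a backend-side stop.
async function waitOrCancel(
  trainer: Cancellable,
  controller: AbortController,
  ms: number,
): Promise<"finished" | "timed-out"> {
  const timer = setTimeout(() => controller.abort(), ms);
  try {
    await trainer.wait();
    return "finished";
  } catch (err) {
    if (!controller.signal.aborted) throw err; // a real failure, not our abort
    try {
      await trainer.cancel(); // stops the run (and the cost) on the backend
    } catch (cancelErr) {
      // Best-effort: the run may already be in a terminal state.
      console.warn("cancel() failed:", cancelErr);
    }
    return "timed-out";
  } finally {
    clearTimeout(timer);
  }
}

// Stand-in trainer: wait() never finishes on its own and rejects once aborted.
const controller = new AbortController();
let cancelCalls = 0;
const slowTrainer: Cancellable = {
  wait: () =>
    new Promise((_resolve, reject) => {
      controller.signal.addEventListener("abort", () => reject(new Error("aborted")));
    }),
  cancel: async () => {
    cancelCalls += 1;
  },
};

const outcome = waitOrCancel(slowTrainer, controller, 10);
outcome.then((value) => console.log(value, cancelCalls));
```

Errors from wait() that are unrelated to the abort are re-thrown, so a genuine failure is not mistaken for a timeout.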

Reacting to events

The whole point of doing this in TypeScript is that you can hook into the run with lifecycle callbacks: onStarted, onLog, onCheckpoint, onCompleted, onFailed. That is the next concept to read.