# Documentation Index
Fetch the complete documentation index at: https://docs.arkor.ai/llms.txt
Use this file to discover all available pages before exploring further.
# Trainer
`createTrainer` is the heart of an Arkor project. Everything Arkor knows about a fine-tuning run, from the base model down to the LoRA rank, is on the object you pass in.
## Minimal example
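The snippet that originally sat under this heading did not survive extraction, so here is a hedged reconstruction built only from the tables below. The `TrainerConfig` and `DatasetSource` type names, the model identifier, and the dataset name are illustrative assumptions; in a real project this object is what you pass to `createTrainer`.

```typescript
// Sketch of a minimal trainer config. Type names are assumptions;
// only the three required fields from the table below are set.
type DatasetSource =
  | { type: "huggingface"; name: string; split?: string; subset?: string }
  | { type: "blob"; url: string; token?: string };

interface TrainerConfig {
  name: string;           // job name shown in Studio
  model: string;          // base model identifier (Gemma today)
  dataset: DatasetSource; // see "Dataset sources" below
}

const config: TrainerConfig = {
  name: "gemma-first-run",
  model: "gemma-2b", // hypothetical identifier
  dataset: { type: "huggingface", name: "tatsu-lab/alpaca" },
};
```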
## Required fields
| Field | Type | Notes |
|---|---|---|
| `name` | `string` | Job name shown in Studio and the managed backend. |
| `model` | `string` | A model identifier the backend recognizes (Gemma today). |
| `dataset` | `DatasetSource` | See Dataset sources. |
## Dataset sources

`DatasetSource` is a discriminated union on `type`:

- **HuggingFace.** The most common shape. Arkor pulls the dataset by name. Use `split` to override the default split and `subset` for datasets that publish multiple subsets.
- **Blob URL.** Any HTTPS URL the backend can fetch. Pass `token` if the URL needs an `Authorization: Bearer` header.
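The two shapes can be sketched as a TypeScript union. The literal discriminant values (`"huggingface"`, `"blob"`) and the exact field names are assumptions for illustration:

```typescript
// Hedged sketch of the two DatasetSource shapes, discriminated on `type`.
type DatasetSource =
  | { type: "huggingface"; name: string; split?: string; subset?: string }
  | { type: "blob"; url: string; token?: string };

const fromHub: DatasetSource = {
  type: "huggingface",
  name: "wikitext",
  subset: "wikitext-2-raw-v1", // this dataset publishes multiple subsets
  split: "train",              // override the default split
};

const fromUrl: DatasetSource = {
  type: "blob",
  url: "https://example.com/train.jsonl",
  token: "my-secret", // sent as `Authorization: Bearer my-secret`
};

// Narrowing on the discriminant picks the right fields:
function label(d: DatasetSource): string {
  return d.type === "huggingface" ? d.name : d.url;
}
```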
## LoRA configuration

Pass `lora` to control LoRA settings. All four fields are typed. If you omit `lora`, the backend applies sensible defaults; `r: 16, alpha: 16` is a good starting point for the bundled templates.
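A sketch of the `lora` block. The prose names `r` and `alpha`; which other two fields complete the four is not stated here, so `dropout` and `targetModules` below are common LoRA knobs used purely as placeholders:

```typescript
// Hypothetical LoraConfig — r and alpha come from the prose above;
// dropout and targetModules are assumed placeholders.
interface LoraConfig {
  r: number;
  alpha: number;
  dropout: number;
  targetModules: string[];
}

const lora: LoraConfig = {
  r: 16,      // suggested starting point for the bundled templates
  alpha: 16,
  dropout: 0.05,
  targetModules: ["q_proj", "v_proj"],
};
```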
## Common hyperparameters
| Field | Type | What it does |
|---|---|---|
| `maxSteps` | `number` | Cap on training steps. Often the simplest knob to turn. |
| `numTrainEpochs` | `number` | Alternative to `maxSteps`: number of dataset passes. |
| `learningRate` | `number` | Step size for the optimizer. |
| `batchSize` | `number` | Per-device training batch size. |
| `optim` | `string` | Optimizer name (the backend list governs valid values). |
| `lrSchedulerType` | `string` | LR schedule (linear, cosine, etc.). |
| `weightDecay` | `number` | Regularization weight. |
If you set only `maxSteps`, the rest stay at backend defaults. That is usually what you want for the first few runs.
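For example, a first run might set a single knob while a later run layers on more. Field names come from the table above; the values and the `Hyperparams` type name are illustrative:

```typescript
// All hyperparameters are optional; unset fields stay at backend defaults.
interface Hyperparams {
  maxSteps?: number;
  numTrainEpochs?: number;
  learningRate?: number;
  batchSize?: number;
  optim?: string;
  lrSchedulerType?: string;
  weightDecay?: number;
}

// First few runs: cap the steps and leave everything else alone.
const firstRun: Hyperparams = { maxSteps: 100 };

// Later: turn more knobs once the pipeline is proven out.
const tunedRun: Hyperparams = {
  maxSteps: 1000,
  learningRate: 2e-4,
  lrSchedulerType: "cosine",
  weightDecay: 0.01,
};
```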
## Smoke testing with `dryRun`
`dryRun: true` tells the backend to run a minimal end-to-end smoke test of the trainer: the full pipeline, including the training loop, executes against a truncated dataset and a capped step count so it finishes quickly. It still uses GPU time, just much less of it. Useful in CI or when wiring up callbacks for the first time.
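One common pattern (an assumption, not an Arkor convention) is to key `dryRun` off a CI environment variable, so pipelines get the cheap smoke test while local runs train for real:

```typescript
// Illustrative: enable the smoke test when running under CI.
const isCi = process.env.CI === "true";

const config = {
  name: isCi ? "ci-smoke" : "full-run",
  dryRun: isCi, // minimal end-to-end run; still uses (much less) GPU time
};
```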
## What `createTrainer` returns
- `start()` submits the job to the managed backend and resolves with the assigned `jobId`. It does not wait for completion, and it does not dispatch any callbacks on its own.
- `wait()` opens the SSE event stream for the run and returns once the run finishes (or fails). All registered callbacks fire from inside `wait()`; if you call `start()` without later calling `wait()`, no callbacks ever run.
- `cancel()` asks the backend to stop the run. This is a best-effort request: the backend may return an error if the run is already in a terminal state (completed, failed, or already cancelled), so be prepared to catch it.
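The contract can be sketched with a stand-in handle type. The real object comes from `createTrainer`; this interface is an assumption that mirrors the three methods described above, and the mock exists only to exercise the flow locally:

```typescript
// Stand-in for the object createTrainer returns (shape assumed).
interface TrainerHandle {
  start(): Promise<string>; // resolves with the jobId; does not wait
  wait(): Promise<void>;    // streams events; callbacks fire in here
  cancel(): Promise<void>;  // best-effort; may reject if run is terminal
}

// Typical flow: submit, then wait so callbacks actually fire.
async function runToCompletion(trainer: TrainerHandle): Promise<string> {
  const jobId = await trainer.start(); // returns right after submission
  await trainer.wait();                // without this, no callbacks run
  return jobId;
}

// Mock handle used only to demonstrate the call order.
const calls: string[] = [];
const mock: TrainerHandle = {
  start: async () => { calls.push("start"); return "job-123"; },
  wait: async () => { calls.push("wait"); },
  cancel: async () => { calls.push("cancel"); },
};
```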
`arkor start` calls `start()` and `wait()` for you (it is what Studio’s “Run training” button spawns under the hood). `arkor dev` does not run the trainer; it only boots the Studio UI. Call `start()` and `wait()` directly only if you wire training into your own code outside the CLI.
## Stopping `wait()` with `AbortSignal`
`abortSignal` is only about your local `wait()`: aborting it stops the SSE event-stream fetch and any retry/backoff delays inside `wait()`. The current implementation throws on abort (the backoff delay rejects with `signal.reason`, and the failure handler re-throws when the signal is aborted), so `wait()` rejects rather than resolving cleanly. Wrap it in `try` / `catch` if you abort.
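A self-contained sketch of that pattern. The `wait()` below is a stand-in that rejects with `signal.reason` on abort, mirroring the behavior described above; the real method lives on the trainer object:

```typescript
// Stand-in wait(): rejects with signal.reason once the signal aborts.
function wait(signal: AbortSignal): Promise<void> {
  return new Promise<void>((_resolve, reject) => {
    if (signal.aborted) return reject(signal.reason);
    signal.addEventListener("abort", () => reject(signal.reason), { once: true });
  });
}

async function demo(): Promise<string> {
  const controller = new AbortController();
  // Stop waiting (locally!) after 10 ms; the backend run would continue.
  setTimeout(() => controller.abort(new Error("gave up waiting")), 10);
  try {
    await wait(controller.signal);
    return "resolved";
  } catch (err) {
    // wait() rejects on abort rather than resolving cleanly.
    return `rejected: ${(err as Error).message}`;
  }
}
```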
`abortSignal` does not call `cancel()` and does not ask the backend to stop the run; the job keeps using GPU time on the managed side. If you want to actually stop training (and the cost), call `trainer.cancel()` separately.
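A hedged sketch of stopping both sides: abort the local signal to unblock `wait()`, then issue the backend cancel, tolerating the terminal-state error described above. The narrowed handle shape is an assumption:

```typescript
// Abort local waiting AND request a backend cancel (best-effort).
async function stopRun(
  trainer: { cancel(): Promise<void> },
  controller: AbortController,
): Promise<void> {
  controller.abort(new Error("user requested stop")); // unblocks wait()
  try {
    await trainer.cancel(); // actually stops GPU usage on the backend
  } catch {
    // Run may already be completed/failed/cancelled; nothing to do.
  }
}
```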
Use `abortSignal` for “I no longer care about waiting on this run” (a request timed out, a parent process is exiting). Use `cancel()` for “stop the run on the backend”.
## Reacting to events
The whole point of doing this in TypeScript is that you can hook into the run with lifecycle callbacks: `onStarted`, `onLog`, `onCheckpoint`, `onCompleted`, `onFailed`. That is the next concept to read.
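As a teaser, callbacks are plain functions alongside the config fields you already build. The signatures below are assumptions for illustration only; the next page documents the real ones:

```typescript
// Hypothetical callback signatures — placeholders, not the real API.
const callbacks = {
  onStarted: (jobId: string) => console.log("started", jobId),
  onLog: (line: string) => console.log(line),
  onCheckpoint: (step: number) => console.log("checkpoint at step", step),
  onCompleted: () => console.log("done"),
  onFailed: (err: unknown) => console.error("failed", err),
};
```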