Inside sekft: a shell-operator training pipeline#

This is the how-it-works companion to the experiment "From seed to weights: fine-tuning a shell operator". The experiment page covers the why and the results; this page covers the how: architecture, the four data-factory stages, the trainer, how to read a run, and the hardware constraints. It is meant for a colleague picking up the sekft repo for the first time.

Architecture: how the pieces fit#

The pipeline is the sekft repo. Data flows one direction. The dash-in-Docker backend is shared by the stages that touch a real shell.

taxonomy.py + schema.py        the scenario vocabulary (axes; bundle dataclass)
       |
       v
generate.py   [Stage A]        a teacher authors scenario bundles; each is
  (+ reference gate)           admitted only if its own reference solution
       |                       makes its checker pass in a real shell
       v
scenarios/*.json
       |
       v
rollout.py    [Stages B-D]     an operator model acts in a fresh dash-docker
  (+ dashdocker.py)            container; verify against final state; record
       |                       the trajectory imperative-free
       v
trajectories/*.json            kept = correct terminal AND verified AND command-only
       |
       v
sft.py        [train]          assistant-only LoRA SFT -> adapter (base untouched)
       |
       v
adapter + base
       |
       v
eval.py       [eval]           drop the tuned model into HELD-OUT scenarios with
  (+ dashdocker.py)            no scaffold; report operate / terminate / verified

Module map:

  • taxonomy.py: the axes of variation (task / provider / announcement / doc-depth / difficulty) as pure data. No model, no container.

  • schema.py: the Scenario bundle: provider, announcement, directives, fixtures, a deterministic checker, and a reference solution.

  • generate.py: Stage A. Sample a combo, prompt a teacher to author the bundle, gate on solvability.

  • dashdocker.py: the execution backend (below).

  • rollout.py: Stages B-D. Roll an operator through a scenario, verify, record.

  • sft.py: the trainer.

  • eval.py: the behavioural eval.
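To make the bundle concrete, here is a minimal sketch of what a scenario dataclass could look like, built only from the six fields listed for schema.py. The field types, defaults, and the JSON helpers are assumptions for illustration, not the repo's actual definitions.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass, field


@dataclass
class Scenario:
    """Hypothetical shape of a scenario bundle; field types are assumptions."""

    provider: str            # source of the directive-provider program
    announcement: str        # points at the provider, never at the task itself
    directives: list[str]    # what the provider reveals when queried
    fixtures: dict[str, str] = field(default_factory=dict)  # path -> file content
    checker: str = ""        # shell script; exit 0 means the effect is present
    reference: str = ""      # shell script expected to make the checker pass

    def to_json(self) -> str:
        """Serialise the bundle the way a scenarios/*.json file might store it."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)

    @classmethod
    def from_json(cls, text: str) -> "Scenario":
        """Rebuild a bundle from its JSON form."""
        return cls(**json.loads(text))
```

Keeping the bundle as plain data is what lets the same file feed the reference gate, the rollout, and the eval without any of them importing a model.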

The data factory, stage by stage#

The priming apparatus, with the seed swapped for a verifier, is a training-data generator: a strong teacher operates the shell, the environment grades the result, and only clean self-terminating runs survive. Four stages, A through D.

Stage A: author the world (a model writes it). A teacher samples a scenario across the five axes and emits a bundle: the directive-provider program with its own --help, the announcement that points at it (never the task itself), the fixtures, a deterministic checker that inspects filesystem state, and a reference solution. Authoring and acting are kept as separate roles. The same model is never trusted to both invent a task and certify it.

Stage A, the reference gate. A bundle is admitted only if its own reference solution, run in a fresh container, makes its checker pass. This throws out unsolvable or self-inconsistent scenarios before they waste a rollout. Never train toward a task whose solution you have not executed.
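The gate can be sketched in a few lines. This version assumes a runner with the shape of the one-shot `run(fixtures, script)` entry point described for dashdocker.py, returning an exit code and output; that return shape, the function name `passes_reference_gate`, and the subshell wrapping are assumptions of the sketch.

```python
from typing import Callable, Dict, Tuple

# (fixtures, script) -> (exit_code, output); stands in for a one-shot container run.
Runner = Callable[[Dict[str, str], str], Tuple[int, str]]


def passes_reference_gate(
    fixtures: Dict[str, str], reference: str, checker: str, run: Runner
) -> bool:
    """Admit a bundle only if its reference makes its checker pass in one fresh run."""
    # The reference runs in a subshell so a stray `exit` or `cd` inside it
    # cannot skip or misdirect the checker that follows in the same container.
    script = "(\n" + reference.rstrip("\n") + "\n)\n" + checker.rstrip("\n") + "\n"
    exit_code, _output = run(fixtures, script)
    # The script's exit code is the checker's exit code, i.e. the verdict.
    return exit_code == 0
```

Passing the runner in as an argument keeps the gate testable with a fake, while the real pipeline would hand it the container-backed function.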

Stage B: roll out (a model acts). Each bundle is materialised in a fresh dash-in-Docker container and an operator model runs it with only a disposable scaffold prompt and no task. It discovers, operates, terminates.

Stage C: verify (code, not a model). The checker runs against the container’s final state: the effect the commands produced. Checking the effect is what makes the labels trustworthy; the model’s own claim is ignored.

Stage D: record (imperative-free). A trajectory is kept only if the effect is present, the terminal is correct, and the operator's turns are command-only. The scaffold is then stripped, leaving the orientation, the login, and the alternating prompt / command / output turns ending in the terminal. The scaffold is how we reliably elicit the behaviour from a teacher; dropping it from the record is what makes the student learn the behaviour with no instruction. It distils an instructed policy into an un-instructed one.
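As a rough sketch of the keep-and-strip step: the record keys (`verified`, `command_only`, `terminal`, `expected_terminal`) and the `scaffold` role name below are hypothetical, chosen only to mirror the three keep conditions; the repo's trajectory format may differ.

```python
from typing import Any, Dict, List


def keep(record: Dict[str, Any]) -> bool:
    """Keep a trajectory only if it is verified, command-only, and ends correctly."""
    return bool(
        record["verified"]
        and record["command_only"]
        and record["terminal"] == record["expected_terminal"]
    )


def strip_scaffold(turns: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Drop scaffold turns so the recorded trajectory carries no instruction."""
    return [turn for turn in turns if turn["role"] != "scaffold"]
```

The filter runs before the strip: the scaffold is allowed to influence whether a run succeeds, but never what the student sees.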

The execution backend (dash-in-Docker)#

dashdocker.py runs commands in disposable Alpine+dash containers. Two entry points: run(fixtures, script) (one-shot, for the reference gate) and session(fixtures) (a persistent container an operator drives one command at a time). Three decisions worth knowing:

  • One container per rollout, disposable. Fresh state per trajectory, free parallelism (independent containers, no shared-rootfs contention), safe disposal of whatever the model did.

  • State-replay via per-command exec. Each command runs as its own docker exec ... dash -c. With a single long-lived dash fed through a pipe there is no tty, so output is block-buffered, which makes output-boundary detection flaky; with per-command exec the process exits (flushing its output) and docker exec’s own return code is the command’s exit code. cwd and exported env are saved after each command and restored before the next.

  • Containment. --network none, --cap-drop ALL, no-new-privileges, memory/cpu/pids caps, removed on teardown. The model is running arbitrary commands; the container is the boundary.
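The containment flags map directly onto a `docker run` command line. The sketch below only builds the argument lists; the limit values, the idle `tail` process, and the helper names are assumptions, and it presumes the image defines no entrypoint of its own.

```python
from typing import List

IMAGE = "sekft-dash"      # tag used by the docker build step in this post
DASH = "/usr/bin/dash"    # where dash lives in the Alpine image


def run_argv(name: str, memory: str = "256m", cpus: str = "1", pids: int = 64) -> List[str]:
    """argv that starts a detached, locked-down container (limits are assumed values)."""
    return [
        "docker", "run", "--detach", "--rm", "--name", name,
        "--network", "none",                     # no network at all
        "--cap-drop", "ALL",                     # no Linux capabilities
        "--security-opt", "no-new-privileges",   # setuid binaries gain nothing
        "--memory", memory,
        "--cpus", cpus,
        "--pids-limit", str(pids),               # caps fork bombs
        IMAGE,
        "tail", "-f", "/dev/null",               # idle process keeping it alive
    ]


def exec_argv(name: str, command: str) -> List[str]:
    """argv that runs one command in its own dash process inside the container."""
    return ["docker", "exec", name, DASH, "-c", command]


def teardown_argv(name: str) -> List[str]:
    """argv that force-removes the container, whatever state the model left it in."""
    return ["docker", "rm", "--force", name]
```

Each list can be handed to `subprocess.run(argv, capture_output=True)`; the return code of the exec call is then the command's own exit code.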

Building it surfaced (and survived) three POSIX hazards that would otherwise have shipped without a smoke test: dash lives at /usr/bin/dash, not /bin/dash; a { cmd; } group becomes a syntax error when a stray ; follows a newline; and (the subtle one) . is a POSIX special builtin whose failure exits a non-interactive shell, so state-replay needs a guarded [ -f file ] && . file or the wrapper dies on the first command, before any state file exists. (cd is a regular builtin: a failed cd only returns non-zero, but guarding the restore the same way keeps the wrapper quiet.)
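A state-replay wrapper that avoids all three hazards might look like the sketch below. The state-file paths and the `wrap` helper are hypothetical; the shape (guarded restore, run, capture exit code, save, re-exit) is one reasonable way to do it, not necessarily how dashdocker.py does.

```python
ENV_FILE = "/tmp/.sekft_env"   # hypothetical state-file paths inside the container
CWD_FILE = "/tmp/.sekft_cwd"


def wrap(command: str) -> str:
    """Build the script passed to `dash -c` for one operator command."""
    lines = [
        # Guarded restore: a bare `. file` on a missing file would exit the shell.
        f"[ -f {ENV_FILE} ] && . {ENV_FILE}",
        f'[ -f {CWD_FILE} ] && cd "$(cat {CWD_FILE})"',
        # The command sits on its own line, with no brace group around it.
        command,
        "__rc=$?",
        # Save exported variables and the working directory for the next command.
        f"export -p > {ENV_FILE}",
        f"pwd > {CWD_FILE}",
        # Re-exit with the command's own status so docker exec reports it.
        'exit "$__rc"',
    ]
    return "\n".join(lines) + "\n"
```

If the operator's command is itself `exit`, the save lines never run, which is harmless because the session is over at that point.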

The trainer in detail#

sft.py is standard causal-LM supervised fine-tuning with two sekft-specific pieces: a render format and a loss mask.

LoRA, and why the base is never overwritten. Training uses low-rank adapters (PEFT): the base weights are loaded read-only, and only small adapter matrices on the attention projections are trained. save_pretrained writes only the adapter. On Mistral-7B that is a 54 MB adapter_model.safetensors, against 14 GB of frozen base. So a run never touches the base model on disk; the only thing a run can overwrite is its own output directory, so give each run a distinct --out. Adapters are cheap to keep, and (importantly for iteration) cheap to swap on a resident base.
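The adapter size can be sanity-checked with arithmetic. The sketch below assumes rank 16 on all four attention projections (q, k, v, o) and fp32 adapter storage; the post does not state those settings, but together with Mistral-7B's published dimensions they happen to reproduce the 13.6 M trainable parameters and the roughly 54 MB file quoted on this page.

```python
HIDDEN = 4096          # Mistral-7B hidden size
KV_DIM = 1024          # 8 key/value heads x 128 head dim (grouped-query attention)
LAYERS = 32
RANK = 16              # ASSUMPTION: not stated in the post
BYTES_PER_PARAM = 4    # ASSUMPTION: adapter saved in fp32
TOTAL_PARAMS = 7.26e9  # total parameter count quoted in this post


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """A LoRA pair is A (rank x d_in) plus B (d_out x rank)."""
    return rank * (d_in + d_out)


per_layer = (
    lora_params(HIDDEN, HIDDEN, RANK)    # q_proj
    + lora_params(HIDDEN, KV_DIM, RANK)  # k_proj
    + lora_params(HIDDEN, KV_DIM, RANK)  # v_proj
    + lora_params(HIDDEN, HIDDEN, RANK)  # o_proj
)
trainable = LAYERS * per_layer
adapter_mb = trainable * BYTES_PER_PARAM / 1e6
trainable_pct = 100 * trainable / TOTAL_PARAMS

print(f"trainable params: {trainable:,}")
print(f"adapter size:     {adapter_mb:.1f} MB")
print(f"trainable share:  {trainable_pct:.2f}%")
```

Doubling the rank doubles both numbers, which is the capacity-versus-size trade-off behind --lora-r.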

Render format. Mistral’s native [INST] template cannot carry our trajectories (no system role, no consecutive user turns; we have both), and the model deploys via the ccpty’s role-tagged stream anyway, bypassing Mistral’s template. So the trainer renders an explicit role-delimited format (<|system|> / <|user|> / <|assistant|> + an end marker). The exact markers are a placeholder to reconcile with the ccpty wire format; that is the one knob, and the operate-and-exit loop transfers regardless.
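A role-delimited render of this kind can be sketched as follows. The `<|end|>` marker and the exact newline layout are placeholders of this sketch (the post itself says the markers are still to be reconciled with the ccpty wire format); the point is that the renderer also returns the character span of every assistant turn, which is what a loss mask needs.

```python
from typing import Dict, List, Tuple

END = "<|end|>"  # ASSUMPTION: placeholder end marker, not the final wire format


def render(turns: List[Dict[str, str]]) -> Tuple[str, List[Tuple[int, int]]]:
    """Render turns to one string and return (text, assistant character spans)."""
    parts: List[str] = []
    spans: List[Tuple[int, int]] = []
    pos = 0
    for turn in turns:
        header = f"<|{turn['role']}|>\n"
        body = turn["content"] + END
        start = pos + len(header)
        if turn["role"] == "assistant":
            # The span covers the command and its end marker, not the role tag.
            spans.append((start, start + len(body)))
        chunk = header + body + "\n"
        parts.append(chunk)
        pos += len(chunk)
    return "".join(parts), spans
```

Unlike the [INST] template, nothing here objects to a system turn or to two user turns in a row (login followed by prompt, for instance).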

Assistant-only loss mask. This is the single most important correctness detail. Loss is computed on the assistant turns only (the commands and the terminal token), with the orientation, prompts, and command output masked to -100. If you do not mask, the model learns to predict the environment’s replies, i.e. to hallucinate command output. The mask is built by offset-mapping (tokenize once, label a token only if its character span falls inside an assistant turn), which is robust to sub-word boundaries. The end marker after each command is trained too, so the model learns to stop. On a worked trajectory the mask trains ~18% of tokens (commands + terminals); everything else is -100.
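The offset-mapping idea fits in one function. This is a sketch rather than the code in sft.py: it takes token offsets in the form a Hugging Face fast tokenizer returns with `return_offsets_mapping=True`, and uses a toy regex tokenizer (`toy_tokenize`, hypothetical) so the example runs without any model files.

```python
import re
from typing import List, Sequence, Tuple

IGNORE_INDEX = -100  # label value that the cross-entropy loss skips

Span = Tuple[int, int]


def build_labels(
    input_ids: Sequence[int], offsets: Sequence[Span], assistant_spans: Sequence[Span]
) -> List[int]:
    """Label a token only if its character span lies inside an assistant span."""
    labels: List[int] = []
    for token_id, (start, end) in zip(input_ids, offsets):
        # Zero-width offsets (special tokens map to (0, 0)) are never trained.
        inside = end > start and any(s <= start and end <= e for s, e in assistant_spans)
        labels.append(token_id if inside else IGNORE_INDEX)
    return labels


def toy_tokenize(text: str) -> Tuple[List[int], List[Span]]:
    """Stand-in tokenizer: words, single punctuation marks, whitespace runs."""
    matches = list(re.finditer(r"\w+|[^\w\s]|\s+", text))
    return list(range(len(matches))), [m.span() for m in matches]


text = "<|user|>\n$ <|end|>\n<|assistant|>\nls<|end|>\n"
start = text.index("ls<|end|>")
ids, offsets = toy_tokenize(text)
labels = build_labels(ids, offsets, [(start, start + len("ls<|end|>"))])
trained = "".join(text[s:e] for (s, e), lab in zip(offsets, labels) if lab != IGNORE_INDEX)
print(repr(trained))
```

Because the test is on character spans, a token that straddles the boundary of an assistant turn is left masked instead of being half-trained.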

Hyperparameters that matter. --lora-r (adapter rank, capacity vs size), --lr (LoRA tolerates a higher rate than full fine-tuning, ~2e-4), --epochs, and the effective batch (--batch x --accum). On the V100: fp16 (the V100 has no bf16), gradient checkpointing on to fit 7B, batch 1 x accum 8 for an effective batch of 8.

How to read a training run#

Train loss is only a smoke alarm. Two habits make the difference:

Overfit a small set first. Before any real run, train on a handful of examples and confirm loss drives toward ~0. If it cannot, the mask, the learning rate, or the data plumbing is broken, and you learn that in seconds. A clean monotonic descent to near-zero means the machinery works; it says nothing yet about generalisation.

The metric that matters is behavioural. “Loss went down” does not mean “the model now exits.” The signal is: load the adapter, drop the model into held-out scenarios with no scaffold, and measure the rates that count. eval.py does exactly this. It reuses the rollout loop with a local operator (the model formats and generates in the same render it was trained on, so train == eval == deploy) and reports:

  • operate_rate: reached command-mode (issued real commands, no prose)

  • terminate_rate: ended with exit or panic

  • verified_rate: the checker passed against final state

  • clean_rate: correct terminal AND verified AND command-only
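The four rates are simple fractions over per-scenario results. In the sketch below the record keys (`command_only`, `terminal`, `expected_terminal`, `verified`) are hypothetical, and treating "operated" as "command-only" is a simplifying assumption; eval.py may record these differently.

```python
from typing import Any, Callable, Dict, List

TERMINALS = {"exit", "panic"}


def rates(results: List[Dict[str, Any]]) -> Dict[str, float]:
    """Compute the four behavioural rates over a list of per-scenario results."""
    if not results:
        raise ValueError("no results to score")
    n = len(results)

    def frac(pred: Callable[[Dict[str, Any]], bool]) -> float:
        return sum(1 for r in results if pred(r)) / n

    return {
        "operate_rate": frac(lambda r: r["command_only"]),
        "terminate_rate": frac(lambda r: r["terminal"] in TERMINALS),
        "verified_rate": frac(lambda r: r["verified"]),
        "clean_rate": frac(
            lambda r: r["command_only"]
            and r["verified"]
            and r["terminal"] == r["expected_terminal"]
        ),
    }
```

clean_rate can never exceed any of the other three, so a gap between it and verified_rate points at runs that did the work but failed to stop correctly.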

Diagnostics to watch: loss, grad-norm (spikes = instability / lr too high; a flat loss with healthy grad-norm = lr too low), and the lr schedule. sft.py logs to TensorBoard (--logdir <out>/runs) and dumps a greppable log_history.jsonl (loss / lr / grad-norm per step). For a remote box, forward TensorBoard over ssh (ssh -L 6006:localhost:6006) or just pull the JSONL and plot locally.
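Pulling the JSONL and scanning it locally needs very little code. The sketch assumes one JSON object per line with `step` and `grad_norm` keys (the names the Hugging Face Trainer uses in its log history; the keys in sekft's file may differ), and the spike rule (more than three times the recent median) is an arbitrary illustrative threshold.

```python
import json
from statistics import median
from typing import Any, Dict, Iterable, List


def load_jsonl(lines: Iterable[str]) -> List[Dict[str, Any]]:
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in lines if line.strip()]


def find_spikes(
    records: List[Dict[str, Any]], window: int = 20, factor: float = 3.0, warmup: int = 5
) -> List[int]:
    """Return steps whose grad-norm exceeds `factor` x the recent median."""
    spikes: List[int] = []
    history: List[float] = []
    for rec in records:
        grad_norm = rec.get("grad_norm")
        if grad_norm is None:  # e.g. summary rows without a grad-norm
            continue
        if len(history) >= warmup and grad_norm > factor * median(history[-window:]):
            spikes.append(rec["step"])
        history.append(grad_norm)
    return spikes
```

Typical use would be `find_spikes(load_jsonl(open("log_history.jsonl")))`; a cluster of flagged steps early in the run is the usual sign that the learning rate is too high.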

Infrastructure and its constraints#

  • Target model: mistralai/Mistral-7B-Instruct-v0.2 (HF format), the strongest non-llama operator in the priming study and the one that most clearly lacked termination. That makes it the cleanest place to test whether SFT supplies the missing axis.

  • GPU: a single Tesla V100-SXM2 32 GB. fp16 throughout (no bf16 on sm_70). LoRA fits 7B comfortably: 13.6 M trainable params (0.19% of 7.26 B), no OOM.

  • Teacher (data generation): a strong model over a local Ollama / litellm endpoint; the environment is the source of truth, and the teacher is never trusted as one.

The link is the bottleneck, and it shapes the workflow. The V100 is attached over OcuLink (external PCIe 3.0 x4), and in practice weights stream in at the low hundreds of MB/s, so a cold load of the 14 GB base is on the order of two minutes, paid every time a process starts. The consequences drive how you iterate:

  • Pay the big transfer once; swap tiny adapters. Load the base into a long-lived process once and cycle 54 MB LoRA adapters and scenarios through it, sparing a 14 GB reload per experiment. resident.py does this: a 4-bit base held resident at ~4.7 GB VRAM, with fit and evaluate cycling adapters against it.

  • QLoRA (4-bit) to shrink what crosses the link. Loading the base in 4-bit drops ~14 GB to ~4 GB to move, with LoRA on top; it still trains, and frees headroom for rank or sequence length.

  • Warm the page cache (cat model-*.safetensors > /dev/null) so the disk read is not a second bottleneck on reload; safetensors is mmap’d already.

  • Tiny model for the dev loop. Iterate masking/format/eval logic against a 135 M model (loads in a second); pay the Mistral load only for real runs.
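A back-of-the-envelope calculation shows why these tactics pay off. The 120 MB/s figure is an assumption picked from the "low hundreds of MB/s" range because it reproduces the two-minute cold load; the sizes are the ones quoted on this page.

```python
LINK_MB_PER_S = 120.0  # ASSUMPTION: effective throughput over the OcuLink link


def transfer_seconds(size_gb: float, mb_per_s: float = LINK_MB_PER_S) -> float:
    """Time to move `size_gb` (decimal GB) across the link at `mb_per_s`."""
    return size_gb * 1000.0 / mb_per_s


for label, size_gb in [
    ("fp16 base (14 GB)", 14.0),
    ("4-bit base (~4 GB)", 4.0),
    ("LoRA adapter (54 MB)", 0.054),
]:
    print(f"{label:22s} {transfer_seconds(size_gb):7.1f} s")
```

At that rate the adapter crosses the link in under a second, which is why holding the base resident and swapping adapters changes the iteration loop so much.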

Iterating fast#

The mature loop is: tiny-model smoke -> overfit sanity -> a resident base that cycles adapters -> behavioural eval after every run -> only then scale epochs and data. Change one variable at a time (lr or rank or data), name each run so its config is recoverable, and trust the behavioural metric over the loss curve.

Reproduction#

The pipeline is the sekft repo: data-factory Stages A-D, then train and eval, plus the shared backend.

git clone https://git.code.tiararodney.com/tiararodney/sekft
cd sekft

# execution sandbox: strict-POSIX dash in a disposable container.
# Doubles as the solvability gate for authored scenarios.
docker build -t sekft-dash .

# Stage A: author scenarios with a strong teacher (direct Ollama or a proxy)
SEKFT_MODEL=qwen2.5:32b SEKFT_URL=http://localhost:11434/v1 SEKFT_KEY=... \
  python generate.py --n 150 --out ./scenarios

# Stages B-D: roll an operator through each, verify, record
SEKFT_OP_MODEL=qwen2.5:32b \
  python rollout.py --scenarios ./scenarios --out ./trajectories --samples 3

# train: fine-tune on the kept trajectories (assistant-only loss, LoRA fp16/4-bit)
python sft.py --data ./trajectories --base <hf-model-dir> --out ./ckpt-run1

# eval: behavioural eval on HELD-OUT scenarios (no scaffold)
python eval.py --base <hf-model-dir> --adapter ./ckpt-run1 \
  --scenarios ./holdout-scenarios --n 20

Model calls go through the teacher endpoint; the container work is CPU/disk only and fans out across independent containers. Training and eval need torch + transformers + peft on a CUDA box.

To prove the trainer end to end without a teacher or a large model, swap in hand-authored stub trajectories and a tiny base model:

python make_stubs.py --out ./stub_trajectories --copies 4
python sft.py --data ./stub_trajectories --base HuggingFaceTB/SmolLM2-135M-Instruct --out ./ckpt-smoke
python sft.py --data ./stub_trajectories --base <hf-model-dir> --inspect   # mask stats, no GPU
