Inside sekft: a shell-operator training pipeline
This is the how-it-works companion to the experiment
"From seed to weights: fine-tuning a shell operator".
The experiment page is the why and the results. This page is the how:
architecture, the four data-factory stages, the trainer, how to read a run, and
the hardware constraints. It is meant for a colleague picking up the sekft
repo for the first time.
Architecture: how the pieces fit
The pipeline is the sekft repo. Data flows in one direction. The dash-in-Docker
backend is shared by the stages that touch a real shell.
taxonomy.py + schema.py      the scenario vocabulary (axes; bundle dataclass)
        |
        v
generate.py  [Stage A]       a teacher authors scenario bundles; each is
 (+ reference gate)          admitted only if its own reference solution
        |                    makes its checker pass in a real shell
        v
scenarios/*.json
        |
        v
rollout.py  [Stages B-D]     an operator model acts in a fresh dash-docker
 (+ dashdocker.py)           container; verify against final state; record
        |                    the trajectory imperative-free
        v
trajectories/*.json          kept = correct terminal AND verified AND command-only
        |
        v
sft.py  [train]              assistant-only LoRA SFT -> adapter (base untouched)
        |
        v
adapter + base
        |
        v
eval.py  [eval]              drop the tuned model into HELD-OUT scenarios with
 (+ dashdocker.py)           no scaffold; report operate / terminate / verified
Module map:
taxonomy.py: the axes of variation (task / provider / announcement / doc-depth / difficulty) as pure data. No model, no container.
schema.py: the Scenario bundle: provider, announcement, directives, fixtures, a deterministic checker, and a reference solution.
generate.py: Stage A. Sample a combo, prompt a teacher to author the bundle, gate on solvability.
dashdocker.py: the execution backend (below).
rollout.py: Stages B-D. Roll an operator through a scenario, verify, record.
sft.py: the trainer.
eval.py: the behavioural eval.
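To make the bundle concrete, here is a minimal sketch of what a Scenario dataclass could look like. The field names, the types, and the missing_axes helper are assumptions made for illustration; the real schema.py may be laid out differently.

```python
from dataclasses import dataclass, field

# The five axes named for taxonomy.py; the identifier spellings are assumptions.
AXES = ("task", "provider", "announcement", "doc_depth", "difficulty")


@dataclass(frozen=True)
class Scenario:
    """Hypothetical bundle layout; field names and types are illustrative."""
    provider: str        # source of the directive-provider program
    announcement: str    # points at the provider, never at the task
    directives: tuple    # what the provider hands out when asked
    fixtures: dict       # path -> file contents placed in the container
    checker: str         # deterministic script; exit status 0 means pass
    reference: str       # script expected to make the checker pass
    axes: dict = field(default_factory=dict)  # axis name -> sampled value


def missing_axes(scenario: Scenario) -> list:
    """Return the axis names the bundle does not pin down."""
    return [a for a in AXES if a not in scenario.axes]
```

Keeping the bundle as plain data means it can be serialised to scenarios/*.json and checked without a model or a container in the loop.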
The data factory, stage by stage
The priming apparatus, with the seed swapped for a verifier, is a training-data generator: a strong teacher operates the shell, the environment grades the result, and only clean self-terminating runs survive. Four stages, A through D.
Stage A: author the world (a model writes it). A teacher samples a scenario
across the five axes and emits a bundle: the directive-provider program with its
own --help, the announcement that points at it (never the task itself), the
fixtures, a deterministic checker that inspects filesystem state, and a
reference solution. Authoring and acting are kept as separate roles. The same
model is never trusted to both invent a task and certify it.
Stage A, the reference gate. A bundle is admitted only if its own reference solution, run in a fresh container, makes its checker pass. This throws out unsolvable or self-inconsistent scenarios before they waste a rollout. Never train toward a task whose solution you have not executed.
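The gate can be sketched in a few lines. This is an illustration under assumptions, not the repo's code: the runner is passed in as a callable standing in for dashdocker.run(fixtures, script), and it is assumed to return the script's exit status as an integer. The names gate_script and admit are hypothetical.

```python
from typing import Callable, Mapping

# Stand-in for dashdocker.run(fixtures, script). The text names the entry
# point but not its return type; an integer exit status is assumed here.
Runner = Callable[[Mapping[str, str], str], int]


def gate_script(reference: str, checker: str) -> str:
    """Run the reference in a subshell so an `exit` inside it cannot skip the checker."""
    return f"(\n{reference}\n)\n{checker}\n"


def admit(fixtures: Mapping[str, str], reference: str, checker: str, run: Runner) -> bool:
    """Admit a bundle only if its own reference makes its checker exit 0."""
    return run(fixtures, gate_script(reference, checker)) == 0
```

The subshell matters because a reference solution may legitimately end with exit; without it the checker would never run and the gate would report the reference's status instead of the checker's.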
Stage B: roll out (a model acts). Each bundle is materialised in a fresh dash-in-Docker container and an operator model runs it with only a disposable scaffold prompt and no task. It discovers, operates, terminates.
Stage C: verify (code, not a model). The checker runs against the container’s final state: the effect the commands produced. Checking the effect is what makes the labels trustworthy; the model’s own claim is ignored.
Stage D: record (imperative-free). A trajectory is kept only if the effect is present and the terminal is correct. The scaffold is then stripped, leaving the orientation, the login, and the alternating prompt / command / output turns ending in the terminal. The scaffold is how we reliably elicit the behaviour from a teacher; dropping it from the record is what makes the student learn the behaviour with no instruction. It distils an instructed policy into an un-instructed one.
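The keep rule and the scaffold strip can be sketched as one function. The turn layout (dicts with role, content, and an optional kind field marking the scaffold) is an assumption, and is_command_only is a deliberately crude stand-in for whatever command-only test the real recorder applies.

```python
TERMINALS = ("exit", "panic")


def is_command_only(turns):
    """Crude stand-in: every assistant turn is one non-empty line."""
    replies = [t["content"].strip() for t in turns if t["role"] == "assistant"]
    return bool(replies) and all(r and "\n" not in r for r in replies)


def record(turns, verified, expected_terminal):
    """Return the scaffold-free trajectory, or None if the run is discarded."""
    if expected_terminal not in TERMINALS:
        raise ValueError(f"unknown terminal: {expected_terminal!r}")
    replies = [t for t in turns if t["role"] == "assistant"]
    terminal_ok = bool(replies) and replies[-1]["content"].strip() == expected_terminal
    if not (verified and terminal_ok and is_command_only(turns)):
        return None
    # Drop the scaffold so the student never sees the instruction.
    return [t for t in turns if t.get("kind") != "scaffold"]
```

The verified flag comes from the Stage C checker, never from anything the operator said about its own run.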
The execution backend (dash-in-Docker)
dashdocker.py runs commands in disposable Alpine+dash containers. Two
entry points: run(fixtures, script) (one-shot, for the reference gate) and
session(fixtures) (a persistent container an operator drives one command at
a time). Three decisions worth knowing:
One container per rollout, disposable. Fresh state per trajectory, free parallelism (independent containers, no shared-rootfs contention), safe disposal of whatever the model did.
State-replay via per-command exec. Each command runs as its own docker exec ... dash -c. A single long-lived dash reading a pipe is block-buffered without a tty, which makes output-boundary detection flaky; per-command exec means the process exits (flushing output) and docker exec’s own return code is the command’s exit code. cwd and exported env are saved after each command and restored before the next.
Containment. --network none, --cap-drop ALL, no-new-privileges, memory/cpu/pids caps, removed on teardown. The model is running arbitrary commands; the container is the boundary.
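As a rough sketch of the containment flags and the per-command exec, the two argv builders below use standard docker CLI options. The function names, the resource limits, and the keep-alive command are assumptions; the image tag sekft-dash is the one built in the reproduction steps.

```python
def run_argv(name, image="sekft-dash", memory="256m", cpus="1", pids=64):
    """argv that starts one idle, locked-down container for a rollout."""
    return [
        "docker", "run", "--detach", "--rm", "--name", name,
        "--network", "none",                    # no network at all
        "--cap-drop", "ALL",                    # no Linux capabilities
        "--security-opt", "no-new-privileges",  # no setuid escalation
        "--memory", memory, "--cpus", cpus, "--pids-limit", str(pids),
        image, "tail", "-f", "/dev/null",       # keep the container alive
    ]


def exec_argv(name, command):
    """argv that runs one command as its own dash process in the container."""
    return ["docker", "exec", name, "dash", "-c", command]
```

Either list can be handed to subprocess.run(argv, capture_output=True, text=True); for exec_argv the returncode is then the command's own exit status. With --rm, killing the container on teardown also removes it.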
Building it surfaced (and survived) three POSIX hazards that a smoke test would
otherwise have shipped: dash lives at /usr/bin/dash, not /bin/dash; a
{ cmd; } group becomes a syntax error when a stray ; follows a newline; and (the
subtle one) . is a POSIX special builtin whose failure exits a non-interactive
shell (cd is a regular builtin and only returns non-zero), so state-replay needs a
guarded [ -f file ] && . file or the wrapper dies on the first command.
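A state-replay wrapper along these lines could look like the sketch below. The state file paths and the wrap name are hypothetical; the point is only the shape: guarded restore, run the command, capture its status, save state, exit with that status.

```python
STATE_ENV = "/tmp/.sekft_env"  # hypothetical path for the saved exports
STATE_CWD = "/tmp/.sekft_cwd"  # hypothetical path for the saved directory


def wrap(command: str) -> str:
    """Build the script passed to `dash -c` for one operator command."""
    return "\n".join([
        # Guarded restore: on the first command neither file exists yet.
        f"[ -f {STATE_ENV} ] && . {STATE_ENV}",
        f'[ -f {STATE_CWD} ] && cd "$(cat {STATE_CWD})" 2>/dev/null',
        command,
        "rc=$?",                       # status of the operator's command
        f"pwd > {STATE_CWD}",          # save cwd for the next exec
        f"export -p > {STATE_ENV}",    # save exported variables
        "exit $rc",
    ])
```

Because the command runs inside the wrapper's own shell, a cd or export issued by the operator is visible to the save step that follows it.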
The trainer in detail
sft.py is standard causal-LM supervised fine-tuning with two
sekft-specific pieces: a render format and a loss mask.
LoRA, and why the base is never overwritten. Training uses low-rank
adapters (PEFT): the base weights are loaded read-only, and only small adapter
matrices on the attention projections are trained. save_pretrained writes
only the adapter. On Mistral-7B that is a 54 MB
adapter_model.safetensors, against 14 GB of frozen base. So a run never
touches the base model on disk; the only thing that overwrites is your own
output directory across runs, so name each run’s --out. Adapters are cheap
to keep, and (importantly for iteration) cheap to swap on a resident base.
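The 54 MB figure can be sanity-checked with a little arithmetic. The sketch below uses Mistral-7B's published dimensions (hidden size 4096, 32 layers, 1024-wide key/value projections). The rank of 16, the four target projections, and fp32 storage are assumptions; they are not stated here, but they happen to reproduce both the 54 MB adapter and the 13.6 M trainable-parameter count quoted for the V100 run.

```python
HIDDEN, KV_DIM, LAYERS = 4096, 1024, 32

# (in_features, out_features) of each attention projection in one layer.
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
}


def lora_params(rank: int) -> int:
    """A LoRA pair adds rank * (in + out) weights per adapted matrix."""
    per_layer = sum(rank * (i + o) for i, o in PROJECTIONS.values())
    return per_layer * LAYERS


params = lora_params(16)        # 13,631,488 weights
size_mb = params * 4 / 1e6      # about 54.5 MB if stored as fp32
```

Doubling the rank doubles both numbers, which is the capacity-versus-size trade-off behind the rank knob.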
Render format. Mistral’s native [INST] template cannot carry our
trajectories (no system role, no consecutive user turns; we have both), and the
model deploys via the ccpty’s role-tagged stream anyway, bypassing Mistral’s
template. So the trainer renders an explicit role-delimited format
(<|system|> / <|user|> / <|assistant|> + an end marker). The exact
markers are a placeholder to reconcile with the ccpty wire format; that is the
one knob, and the operate-and-exit loop transfers regardless.
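A role-delimited render can be sketched as follows. The role tags are the placeholders named above; the end marker string <|end|> and the newline layout are assumptions, since the exact markers are still to be reconciled with the wire format. The function also returns the character span of each assistant turn, end marker included, which is the input a loss mask would need.

```python
ROLE_TAGS = {"system": "<|system|>", "user": "<|user|>", "assistant": "<|assistant|>"}
END = "<|end|>"  # hypothetical end marker; the real string is not given


def render(turns):
    """Render (role, content) pairs; return the text and assistant char spans."""
    parts, spans, pos = [], [], 0
    for role, content in turns:
        head = ROLE_TAGS[role] + "\n"
        piece = head + content + END + "\n"
        if role == "assistant":
            start = pos + len(head)
            # The span covers the command and its end marker.
            spans.append((start, start + len(content) + len(END)))
        parts.append(piece)
        pos += len(piece)
    return "".join(parts), spans
```

Nothing in this format forbids a system turn or two user turns in a row, which is exactly what the native template could not carry.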
Assistant-only loss mask. This is the single most important correctness
detail. Loss is computed on the assistant turns only (the commands and the
terminal token), with the orientation, prompts, and command output masked to
-100. If you do not mask, the model learns to predict the environment’s
replies, i.e. to hallucinate command output. The mask is built by
offset-mapping (tokenize once, label a token only if its character span falls
inside an assistant turn), which is robust to sub-word boundaries. The end
marker after each command is trained too, so the model learns to stop. On a
worked trajectory the mask trains ~18% of tokens (commands + terminals);
everything else is -100.
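The offset-mapping idea can be shown without a real tokenizer. In the sketch below, toy_tokenize is a stand-in that splits on whitespace runs and reports character offsets; with a Hugging Face fast tokenizer the offsets would come from return_offsets_mapping=True instead. The function names and the example text are illustrative assumptions.

```python
import re

IGNORE = -100  # label value the loss skips


def toy_tokenize(text):
    """Stand-in tokenizer: returns token ids and (start, end) char offsets."""
    vocab, ids, offsets = {}, [], []
    for m in re.finditer(r"\S+|\s+", text):
        ids.append(vocab.setdefault(m.group(), len(vocab)))
        offsets.append((m.start(), m.end()))
    return ids, offsets


def build_labels(ids, offsets, assistant_spans):
    """Keep a token's id as its label only if it lies inside an assistant span."""
    labels = []
    for tok_id, (start, end) in zip(ids, offsets):
        inside = any(a <= start and end <= b for a, b in assistant_spans)
        labels.append(tok_id if inside else IGNORE)
    return labels


text = "<|user|>\n$ <|end|>\n<|assistant|>\nls -la<|end|>\n"
begin = text.index("ls -la")
span = (begin, begin + len("ls -la<|end|>"))
ids, offsets = toy_tokenize(text)
labels = build_labels(ids, offsets, [span])
trained_fraction = sum(l != IGNORE for l in labels) / len(labels)
```

Here only the command and its end marker carry labels; the prompt, the role tags, and the user turn's own end marker are all masked.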
Hyperparameters that matter. --lora-r (adapter rank, capacity vs size),
--lr (LoRA tolerates higher, ~2e-4), --epochs, and the effective batch
(--batch x --accum). On the V100: fp16 (the V100 has no bf16),
gradient checkpointing on to fit 7B, batch 1 x accum 8.
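For orientation, the knobs above could be wired up roughly like this. The flag names are the ones listed; the default values for --lora-r and --epochs are assumptions, and the real sft.py may define more options or different defaults.

```python
import argparse


def build_parser():
    p = argparse.ArgumentParser(description="sketch of the sft.py knobs")
    p.add_argument("--lora-r", type=int, default=16)   # assumed default
    p.add_argument("--lr", type=float, default=2e-4)   # "~2e-4" per the text
    p.add_argument("--epochs", type=int, default=3)    # assumed default
    p.add_argument("--batch", type=int, default=1)     # V100 setting
    p.add_argument("--accum", type=int, default=8)     # V100 setting
    return p


def effective_batch(args):
    """Examples contributing to each optimizer step."""
    return args.batch * args.accum
```

With the V100 settings the effective batch is 1 x 8 = 8, so the optimizer steps once every eight forward/backward passes.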
How to read a training run
Train loss is only a smoke alarm. Two habits make the difference:
Overfit a small set first. Before any real run, train on a handful of examples and confirm loss drives toward ~0. If it cannot, the mask, the learning rate, or the data plumbing is broken, and you learn that in seconds. A clean monotonic descent to near-zero means the machinery works; it says nothing yet about generalisation.
The metric that matters is behavioural. “Loss went down” does not mean “the
model now exits.” The signal is: load the adapter, drop the model into held-out
scenarios with no scaffold, and measure the rates that count. eval.py does
exactly this. It reuses the rollout loop with a local operator (the model
formats and generates in the same render it was trained on, so train == eval ==
deploy) and reports:
operate_rate: reached command-mode (issued real commands, no prose)
terminate_rate: ended with exit or panic
verified_rate: the checker passed against final state
clean_rate: correct terminal AND verified AND command-only
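The four rates follow directly from per-rollout flags. The sketch below assumes a simple record per rollout; the Rollout class and its field names are hypothetical, and only the rate definitions are taken from the list above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rollout:
    command_only: bool        # issued real commands, no prose
    terminal: Optional[str]   # "exit", "panic", or None if it never stopped
    expected_terminal: str    # the terminal the scenario calls for
    verified: bool            # checker passed against final state


def rates(rollouts):
    n = len(rollouts)
    if n == 0:
        raise ValueError("no rollouts to score")

    def frac(pred):
        return sum(1 for r in rollouts if pred(r)) / n

    return {
        "operate_rate": frac(lambda r: r.command_only),
        "terminate_rate": frac(lambda r: r.terminal in ("exit", "panic")),
        "verified_rate": frac(lambda r: r.verified),
        "clean_rate": frac(lambda r: r.command_only and r.verified
                           and r.terminal == r.expected_terminal),
    }
```

Note that clean_rate is stricter than the other three combined: a run that ends with panic where exit was expected terminates, but is not clean.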
Diagnostics to watch: loss, grad-norm (spikes = instability / lr too high;
a flat loss with healthy grad-norm = lr too low), and the lr schedule.
sft.py logs to TensorBoard (--logdir <out>/runs) and dumps a greppable
log_history.jsonl (loss / lr / grad-norm per step). For a remote box,
forward TensorBoard over ssh (ssh -L 6006:localhost:6006) or just pull the
JSONL and plot locally.
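Pulling the JSONL and scanning it for grad-norm spikes takes only the standard library. The key names (step, loss, grad_norm) and the five-times-median threshold are assumptions; adjust them to whatever the log actually contains.

```python
import json
from statistics import median


def load_history(lines):
    """Parse JSONL lines, keeping only records that carry a loss."""
    rows = [json.loads(line) for line in lines if line.strip()]
    return [r for r in rows if "loss" in r]


def grad_spikes(history, factor=5.0):
    """Steps whose grad-norm exceeds `factor` times the run median."""
    norms = [r["grad_norm"] for r in history if "grad_norm" in r]
    if not norms:
        return []
    limit = factor * median(norms)
    return [r["step"] for r in history if r.get("grad_norm", 0.0) > limit]
```

Typical use would be grad_spikes(load_history(open("log_history.jsonl"))), followed by a look at the loss around the flagged steps.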
Infrastructure and its constraints
Target model: mistralai/Mistral-7B-Instruct-v0.2 (HF format), the strongest non-llama operator in the priming study and the one that most clearly lacked termination. The cleanest place to test whether SFT supplies the missing axis.
GPU: a single Tesla V100-SXM2 32 GB. fp16 throughout (no bf16 on sm_70). LoRA fits 7B comfortably: 13.6 M trainable params (0.19% of 7.26 B), no OOM.
Teacher (data generation): a strong model over a local Ollama / litellm endpoint; the environment is the source of truth, and the teacher is never trusted as one.
The link is the bottleneck, and it shapes the workflow. The V100 is attached over OcuLink (external PCIe 3.0 x4), and in practice weights stream in at the low hundreds of MB/s, so a cold load of the 14 GB base is on the order of two minutes, paid every time a process starts. The consequences drive how you iterate:
Pay the big transfer once; swap tiny adapters. Load the base into a long-lived process once and cycle 54 MB LoRA adapters and scenarios through it, sparing a 14 GB reload per experiment. resident.py does this: a 4-bit base held resident at ~4.7 GB VRAM, with fit and evaluate cycling adapters against it.
QLoRA (4-bit) to shrink what crosses the link. Loading the base in 4-bit drops ~14 GB to ~4 GB to move, with LoRA on top; it still trains, and frees headroom for rank or sequence length.
Warm the page cache (cat model-*.safetensors > /dev/null) so the disk read is not a second bottleneck on reload; safetensors is mmap’d already.
Tiny model for the dev loop. Iterate masking/format/eval logic against a 135 M model (loads in a second); pay the Mistral load only for real runs.
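The trade-offs above come down to simple transfer arithmetic. The sketch below assumes an effective link rate of 120 MB/s, one plausible value in the "low hundreds" range rather than a measured figure, and ignores everything except moving bytes.

```python
LINK_MB_PER_S = 120.0  # assumed effective rate; not a measurement


def transfer_seconds(size_mb: float, mb_per_s: float = LINK_MB_PER_S) -> float:
    """Time to move size_mb across the link at a steady rate."""
    return size_mb / mb_per_s


fp16_base = transfer_seconds(14_000)  # roughly two minutes
four_bit = transfer_seconds(4_000)    # roughly half a minute
adapter = transfer_seconds(54)        # well under a second
```

At that rate, swapping an adapter is a few hundred times cheaper than reloading the fp16 base, which is the whole case for keeping the base resident.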
Iterating fast
The mature loop is: tiny-model smoke -> overfit sanity -> a resident base that cycles adapters -> behavioural eval after every run -> only then scale epochs and data. Change one variable at a time (lr or rank or data), name each run so its config is recoverable, and trust the behavioural metric over the loss curve.
Reproduction
The pipeline is the sekft repo. Five stages plus the shared backend.
git clone https://git.code.tiararodney.com/tiararodney/sekft
cd sekft
# execution sandbox: strict-POSIX dash in a disposable container.
# Doubles as the solvability gate for authored scenarios.
docker build -t sekft-dash .
# Stage A: author scenarios with a strong teacher (direct Ollama or a proxy)
SEKFT_MODEL=qwen2.5:32b SEKFT_URL=http://localhost:11434/v1 SEKFT_KEY=... \
python generate.py --n 150 --out ./scenarios
# Stages B-D: roll an operator through each, verify, record
SEKFT_OP_MODEL=qwen2.5:32b \
python rollout.py --scenarios ./scenarios --out ./trajectories --samples 3
# train: fine-tune on the kept trajectories (assistant-only loss, LoRA fp16/4-bit)
python sft.py --data ./trajectories --base <hf-model-dir> --out ./ckpt-run1
# eval: behavioural eval on HELD-OUT scenarios (no scaffold)
python eval.py --base <hf-model-dir> --adapter ./ckpt-run1 \
    --scenarios ./holdout-scenarios --n 20
Model calls go through the teacher endpoint; the container work is CPU/disk only and fans out across independent containers. Training and eval need torch + transformers + peft on a CUDA box.
To prove the trainer end to end without a teacher or a large model, swap in hand-authored stub trajectories and a tiny base model:
python make_stubs.py --out ./stub_trajectories --copies 4
python sft.py --data ./stub_trajectories --base HuggingFaceTB/SmolLM2-135M-Instruct --out ./ckpt-smoke
python sft.py --data ./stub_trajectories --base <hf-model-dir> --inspect # mask stats, no GPU