From seed to weights: fine-tuning a shell operator#
- Status:
two cycles complete. At archetype-level holdout (n=16, task types absent from training), fine-tuning lifts Mistral termination from 0/16 (base) to 9/16 (tuned), same harness, only the adapter differing. The operate / terminate mechanism generalises to unseen archetypes; task competence (verified 0.31) stays archetype-local. One model, one seed; signal clean.
- Depends on:
Note
The synthesized dataset is available on HuggingFace.
Source code of the training pipeline can be found here.
A companion article about the architecture of the training pipeline can be found here.
Motivation#
The predecessor experiment (Scrollback priming: can synthetic history run a shell?) ran a synthetic-history
seed against six models and split the result along two axes. Operation
(issuing commands, recovering from errors) transferred broadly: the non-tool
mistral:7b tunes operated the shell about as well as llama3.1:8b.
Termination (closing the session with exit) did not transfer at all: it
appeared in llama and essentially nowhere else, 2 clean exits in 154 non-llama
runs. Neither scale nor tool-call training explained the gap. The leading read
was that the seed is overfit to llama, the model it was developed on.
A model-specific prime is a dead end for a system that must run many models. The constructive move is to stop priming and move the behaviour into the weights: fine-tune a non-llama model on verified shell-operation trajectories so that operation and termination are intrinsic to the model, holding without a seed hand-fitted to one model.
Question#
Can supervised fine-tuning on verified, self-terminating shell trajectories
give mistral:7b-instruct-v0.2 (the best non-llama operator from the
priming study, which never learned to exit) reliable termination, and do
it as a mechanism that generalises to unseen task sources, holding up beyond
a memorised script?
Thesis#
The seed elicits a behaviour; weights generalise it. The priming study already showed the behaviour is in-distribution for these models (they operate the shell from the seed alone), but it survives only with a model-specific prime. Training removes the prime from the critical path: the model should land in a shell with no imperative, find its assignments, do them, and leave, with nothing in its context but the environment.
This is the self-directed citizen model of integration, and it is worth dwelling on because it inverts the usual one. The standard way to give an LLM agency hands it tools through a typed API: the model is a function caller, and the integrator decides in advance what it may do. The self-directed citizen gets an account on a real system instead. It logs in, finds its own assignments in the environment, carries them out with whatever the system provides, and logs out. Authority lives in the world; the directives are discovered; the limit on what the model can do is the system itself.
The profundity is in the collapse of the integration surface. A function-calling
model needs a bespoke tool schema per capability, authored by a human. A citizen
needs one thing: a shell. The move that made UNIX composable, every program a
citizen speaking a single text interface, makes the model composable the same
way. The model becomes a user of the operating system, with the reach and the
responsibility that implies. sek exists to be the system such a citizen
inhabits, and this experiment is teaching one to live there.
The behaviour being trained#
The target is a general mechanism. There is no task in the prompt. The model lands in a shell and must run the routine:
- expect an announcement of where directives live (motd / banner / env / file)
- understand the provider from its own self-documentation (--help / man)
- retrieve the directives
- execute them, then terminate (exit on success, panic when blocked)
Bind the convention (there is an announcement at the entry point; tools are
self-documenting), free everything else. To train the general mechanism, every
example varies the directive-provider (its name, flags, help
text) and holds only the four-step routine constant. The diversity is what
forces the model to read the --help each time, generalising past any one
invocation. A model that
learns this tolerates an unstable userland because it re-learns the interface
each session.
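A minimal sketch of that variation, assuming a scenario author that samples the provider from small word lists. Every name here (`Provider`, `sample_provider`, the word lists) is hypothetical and not taken from the pipeline; the point is only that the retrieval flag is documented in the help text and nowhere else, so the operator has to read it each session.

```python
import random
from dataclasses import dataclass

# Hypothetical word lists, for illustration only.
NAMES = ["orders", "taskq", "brief", "duty", "agenda"]
FLAGS = ["--list", "--show", "--fetch", "-p", "--next"]
CHANNELS = ["motd", "banner", "env", "file"]


@dataclass(frozen=True)
class Provider:
    name: str     # executable the operator must discover
    flag: str     # flag that prints the directives
    channel: str  # where the announcement appears at login

    def help_text(self) -> str:
        # The only place the retrieval flag is documented.
        return (
            f"usage: {self.name} [{self.flag}]\n"
            f"  {self.flag}  print pending directives\n"
        )

    def announcement(self) -> str:
        # The fixed convention: the entry point says where directives live.
        return f"Directives are served by {self.name}. See {self.name} --help."


def sample_provider(rng: random.Random) -> Provider:
    """Vary everything about the provider; the routine itself stays constant."""
    return Provider(rng.choice(NAMES), rng.choice(FLAGS), rng.choice(CHANNELS))


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        p = sample_provider(rng)
        print(f"[{p.channel}] {p.announcement()}")
        print(p.help_text())
```

Seeding the generator keeps a generated dataset reproducible while still giving each scenario a different interface.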
One structural consequence matters for the data: the imperative arrives deep in the session. In ordinary instruction-tuning the directive is turn one. Here it arrives several turns in, as the output of a program the model had to find. The model is trained that directives come from the world, and finding them is step one.
The pipeline in brief#
The training data is generated, not hand-written. A strong teacher model, Master Foo, authors verified shell scenarios and then operates them to produce trajectories; the smaller student, Mistral, is fine-tuned on the clean ones. The data factory has four stages, A through D:
A. author a scenario (a directive-provider, an announcement, a checker, a reference solution) and admit it only if its own reference solution passes its checker in a real shell.
B. roll out an operator through it in a disposable dash-in-Docker container.
C. verify against the container’s final state, the effect the commands produced, never the transcript.
D. record the trajectory with the instruction stripped out, so the student learns the behaviour from shape alone.
Then assistant-only LoRA fine-tuning (sft.py), and a behavioural eval that
drops the tuned model into held-out scenarios with no scaffold (eval.py).
The full architecture, the trainer internals, and how to read a run are in the
companion page:
Inside sekft: a shell-operator training pipeline.
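Stages C and D together act as a filter: only rollouts that pass are recorded. The sketch below shows one way such a keep-gate could look, using the rejection reasons this article reports (premature exit, budget overrun, wrong terminal, prose drift). The `Rollout` fields, the turn budget, the bare `exit` / `panic` tokens, and the prose heuristic are all assumptions made for illustration, not the pipeline's actual code.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Rollout:
    commands: List[str]     # operator turns, in order
    checker_passed: bool    # result of the checker on the final container state
    expected_terminal: str  # "exit" or "panic", as the scenario defines it


def looks_like_prose(turn: str) -> bool:
    # Crude, hypothetical heuristic: a capitalised sentence ending in punctuation.
    t = turn.strip()
    return bool(t) and t[0].isupper() and t[-1] in ".?!"


def keep(r: Rollout, max_turns: int = 20) -> Optional[str]:
    """Return None to keep the rollout, or the reason it is rejected."""
    if not r.commands:
        return "empty"
    if len(r.commands) > max_turns:
        return "budget overrun"
    if any(looks_like_prose(c) for c in r.commands):
        return "prose drift"
    last = r.commands[-1].strip()
    if last not in ("exit", "panic"):
        return "no terminal command"
    if last != r.expected_terminal:
        return "wrong terminal"
    if last == "exit" and not r.checker_passed:
        # The operator left before the work was actually done.
        return "premature exit"
    return None
```

Returning the reason rather than a bare boolean makes it cheap to tally why rollouts are lost, which matters when yield is the bottleneck.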
Why “Master Foo”#
The teacher is named after ESR’s Unix Koans of Master Foo, in which a master teaches a novice through the shell rather than through lecture. The analogy lands more literally than intended. Master Foo (qwen2.5:32b) really does instruct the novice (Mistral) by demonstration in a live shell, and the novice learns the discipline of operating and leaving, not a body of told facts. A Unix koan is the right register for a micro-kernel whose thesis is that the shell is the teacher.
Findings so far#
The machinery, validated:
- The data factory runs end to end: authoring (taxonomy + reference-solution gate), the dash-in-Docker backend, and the rollout-and-record loop. The keep-gate correctly rejects premature exit, budget overrun, wrong-terminal, and prose drift.
- The execution backend survived three real POSIX hazards (dash path, the { cmd; } newline, special-builtin exit), caught by smoke-testing.
- The loss mask is correct: offset-mapping labels only the commands and the terminal token.
- The trainer is proven on Mistral-7B. LoRA SFT drives loss cleanly to ~0.01 on a small set (the overfit sanity check); 7B fp16 / 4-bit + LoRA fit the V100; TensorBoard + log_history written. A 4-bit resident base (~4.7 GB) cycles adapters without reloading, so the iterate loop pays the OcuLink transfer once per session.
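The loss mask mentioned above can be illustrated without a real tokenizer. The sketch assumes each token comes with a character span (Hugging Face fast tokenizers can return one as `offset_mapping`) and that the character spans of the operator's turns are known from rendering. `mask_labels`, `toy_tokenize`, and the example session are hypothetical stand-ins, not the code in sft.py; -100 is the value PyTorch's cross-entropy loss ignores by default.

```python
from typing import List, Tuple

IGNORE = -100  # default ignore_index of torch.nn.CrossEntropyLoss

Span = Tuple[int, int]


def mask_labels(input_ids: List[int], offsets: List[Span],
                assistant_spans: List[Span]) -> List[int]:
    """Keep a label only for tokens lying fully inside an operator turn."""
    labels = []
    for tok, (s, e) in zip(input_ids, offsets):
        inside = s < e and any(a <= s and e <= b for a, b in assistant_spans)
        labels.append(tok if inside else IGNORE)
    return labels


def toy_tokenize(text: str) -> Tuple[List[int], List[Span]]:
    # Stand-in for a real tokenizer: one token per whitespace-separated word.
    ids, offsets, pos = [], [], 0
    for word in text.split():
        start = text.index(word, pos)
        end = start + len(word)
        ids.append(len(ids) + 1)
        offsets.append((start, end))
        pos = end
    return ids, offsets


def demo() -> List[int]:
    # A hypothetical rendered session: environment output and operator commands.
    turns = [
        ("env", "motd: run orders --help\n"),
        ("assistant", "orders --list\n"),
        ("env", "1. touch done\n"),
        ("assistant", "touch done\n"),
        ("assistant", "exit\n"),
    ]
    text, spans = "", []
    for role, chunk in turns:
        if role == "assistant":
            spans.append((len(text), len(text) + len(chunk)))
        text += chunk
    ids, offsets = toy_tokenize(text)
    return mask_labels(ids, offsets, spans)


if __name__ == "__main__":
    print(demo())
```

In the demo, the environment's words are masked out even where they repeat the operator's (the directive "touch done" versus the command "touch done"), which is the property a purely string-based mask tends to get wrong.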
Building the live pipeline surfaced bugs a unit test would not:
- Announcement delivery. Env-var announcements were malformed and non-motd announcements showed nothing at login, so the operator landed blind. Fixed: a readable entry breadcrumb every session.
- Operator scaffold. The teacher was cat-ing providers when it should have run them, and panicking after one failed probe. Tightened: providers are programs to run; do not give up on a single failure.
- A gate gap. The gate checks solvability (reference -> checker) but not discoverability (can the operator retrieve the directives through the provider). A scenario can be solvable yet have a broken discovery path; the authoring prompt now requires the provider to actually serve its directives.
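The difference between solvability and discoverability is easy to state in code. The article's fix was to the authoring prompt; the sketch below is only an illustration of what a programmatic discoverability check might look like next to the existing solvability check. `Scenario`, `Runner`, and `fake_runner` are hypothetical, and the runner is injected so the example executes without Docker.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# run(commands) -> (stdout, final_state). In a real pipeline this would be a
# disposable container; here it is any callable.
Runner = Callable[[List[str]], Tuple[str, dict]]


@dataclass
class Scenario:
    provider_cmd: str                # how the provider is invoked, per its help
    directives: List[str]            # what the provider is supposed to serve
    reference: List[str]             # reference solution commands
    checker: Callable[[dict], bool]  # inspects final state, never the transcript


def solvable(s: Scenario, run: Runner) -> bool:
    _, state = run(s.reference)
    return s.checker(state)


def discoverable(s: Scenario, run: Runner) -> bool:
    out, _ = run([s.provider_cmd])
    return all(d in out for d in s.directives)


def admit(s: Scenario, run: Runner) -> bool:
    return solvable(s, run) and discoverable(s, run)


def fake_runner(served: str) -> Runner:
    """Toy shell: understands `touch` and one provider invocation."""
    def run(commands: List[str]) -> Tuple[str, dict]:
        state, out = {"files": set()}, []
        for c in commands:
            if c.startswith("touch "):
                state["files"].add(c.split(maxsplit=1)[1])
            elif c == "orders --list":
                out.append(served)
        return "\n".join(out), state
    return run


if __name__ == "__main__":
    s = Scenario("orders --list", ["create the file done"], ["touch done"],
                 lambda st: "done" in st["files"])
    print(admit(s, fake_runner("create the file done")))  # provider serves them
    print(admit(s, fake_runner("")))                      # solvable, but blind
```

The second case is the gap described above: the reference solution passes the checker, yet an operator that has to discover its directives has nothing to find.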
First end-to-end cycle on real data#
Master Foo (qwen2.5:32b) authored 38 verified scenarios (38/60 admitted, all nine archetypes 3-5x), then operated them into 40 clean trajectories (40/114 rollouts kept, from 16/38 scenarios). A Mistral-7B LoRA was fine-tuned on 26 of those trajectories and evaluated on 5 held-out scenarios (the tuned model itself operating, no scaffold) against a base-Mistral control on the same scenarios:
| metric | base | tuned | lift |
|---|---|---|---|
| operate (reached command-mode) | 0.2 | 1.0 | +0.8 |
| terminate (exit / panic) | 0.0 | 0.4 | +0.4 |
| verified (checker passed) | 0.4 | 0.6 | +0.2 |
| clean (operate + verify + exit) | 0.0 | 0.4 | +0.4 |
Base Mistral was incomplete on all five: it never typed exit, exactly
the priming-study finding (operates, never terminates). The fine-tune, same
model and same render format, terminated on 2/5 and produced 2/5 fully clean
runs, on scenarios it never trained on. The within-holdout control removes the
scenario-difficulty confound: the only difference is the adapter.
Reading. The thesis in miniature: SFT on real shell trajectories installed termination (the capability the priming study showed was llama-specific and unreachable by priming) into a non-llama model, and it generalised to held-out scenarios, from ~0.
Caveats, plainly: n=5 (noisy), one run, the holdout is instance-level (the archetypes appeared in training, only these instances did not), and the adapter overfit (loss ~0.01 on 26 trajectories). The within-holdout control is clean. The magnitude needs a larger, harder test. The scaled cycle below provides it.
Scaled cycle: archetype-level holdout#
The scaled run authored 112 verified scenarios (112/150, all nine archetypes),
held out two whole task types (search_count and file_transform, 16
scenarios), rolled Master Foo through the other seven archetypes for 94 clean
trajectories (94/288 rollouts, from 47/96 scenarios), fine-tuned Mistral on
those 94, and evaluated on the 16 held-out scenarios of unseen archetypes. At
n=16 the signal is far less noisy than the first cycle:
| metric | base | tuned | lift |
|---|---|---|---|
| operate (reached command-mode) | 0.06 | 1.0 | +0.94 |
| terminate (exit / panic) | 0.0 | 0.56 | +0.56 |
| verified (checker passed) | 0.06 | 0.31 | +0.25 |
| clean (operate + verify + exit) | 0.0 | 0.12 | +0.12 |
Base terminated on 0 of 16. The tuned model terminated on 9 of 16, across two archetypes it never trained on. The termination lift is larger than the first cycle’s 0.4, on a harder test (whole task types held out), with less noise. Base barely functions in the render format (operate 0.06); the tuned model operates all 16. Termination is now a generalising capability in a non-llama model, reaching task types absent from training.
The decomposition is the substantive finding, and it is what the thesis predicts. The mechanism transfers: operate (1.0) and terminate (0.56) on unseen archetypes, so the discover / operate / exit routine generalises. The task competence lags: verified is 0.31, because on a task type it never trained on the model can run the loop around the task but usually cannot solve the task itself. A model can generalise the operating loop while still needing exposure to a kind of task to perform it well. Archetype coverage in training should close the competence gap; nothing here suggests a wall.
Caveats: one model, one run, one seed; verified is low; search_count was
hard even for Master Foo (zero rollout keepers), so the tuned model operating
it at all is notable. The within-holdout control is clean and n=16 is a solid
sample. Final loss was 0.089, a healthier fit on the larger set than the
first cycle’s 0.01.
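For readers who want to recompute the tables, the four metrics are simple rates over per-scenario outcomes, and lift is the tuned rate minus the base rate. The sketch below takes the row definitions literally (terminate counts exit or panic; clean requires operate, a passed checker, and exit). The `Run` record and the example runs are hypothetical and are not the experiment's data.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Run:
    operated: bool  # reached command-mode
    terminal: str   # "exit", "panic", or "" if the run never terminated
    verified: bool  # checker passed on the final state


def rates(runs: List[Run]) -> Dict[str, float]:
    n = len(runs)
    return {
        "operate": sum(r.operated for r in runs) / n,
        "terminate": sum(r.terminal in ("exit", "panic") for r in runs) / n,
        "verified": sum(r.verified for r in runs) / n,
        "clean": sum(r.operated and r.verified and r.terminal == "exit"
                     for r in runs) / n,
    }


def lift(base: Dict[str, float], tuned: Dict[str, float]) -> Dict[str, float]:
    return {k: round(tuned[k] - base[k], 2) for k in tuned}


if __name__ == "__main__":
    # Hypothetical outcomes on the same four held-out scenarios.
    base = [Run(False, "", False)] * 3 + [Run(True, "", True)]
    tuned = [Run(True, "exit", True), Run(True, "exit", False),
             Run(True, "panic", False), Run(True, "", True)]
    print(lift(rates(base), rates(tuned)))
```

Because base and tuned are scored on the same scenarios, the subtraction compares like with like, which is what the within-holdout control relies on.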
The motivating result stands (priming study): operation is broad and roughly model-agnostic; termination is llama-only. This experiment is converting that asymmetry from a property of one model into a property of a dataset, and the two cycles so far show it works.
What success looks like#
Termination installed. Re-run the priming ladder against the fine-tuned mistral: clean exits should appear where the base model scored ~0.
Mechanism generalises. Hold out whole provider and task archetypes from training; the model should still discover and operate them at test time by reading their self-documentation.
Operation preserved. Fine-tuning for termination must not cost the command-mode the base model already had.
Honest unknowns: whether a 7B LoRA on a few thousand trajectories is enough for a reliable terminal habit; whether the discovery mechanism generalises or collapses to memorised invocations; and whether training in dash transfers to sek’s slightly different command semantics. Each one will be measured directly.
Status / next#
Done: the full pipeline on the GPU box (generate, rollout, fit, eval) with QLoRA + a resident base for the slow link. Two cycles complete; the archetype-level cycle (n=16) shows termination 0 -> 0.56 on unseen task types, with the operate / terminate mechanism generalising and task competence staying archetype-local.
Now: close the competence gap with broader archetype coverage and more trajectories per archetype, and add more held-out task types to the eval.
Then: lift rollout yield (the teacher’s operating quality, qwen, is a ceiling: premature exit / early panic limit how much clean data it produces); vary the seed and add a second target model to test robustness; and re-run the priming ladder on the tuned model to measure termination directly against the original llama-only asymmetry.