Scrollback priming: can synthetic history run a shell?#
- Status:
replicated (N=5). Within llama3.1:8b, structure is the lever (0->2->5 clean). Cross-model (6 subjects, 3.8B-8B, non-tool + tool-trained): two axes dissociate. Operation transfers broadly, clean exit is llama-only (2/154 non-llama). Neither scale nor tool-training explains it; leading read is seed-overfit to llama.
- Last run:
2026-06-15
Question#
Can synthetic conversation history alone steer an instruction-tuned chat model to operate a shell (issue commands, recover from errors, exit) using only shell idioms (no tool-call schema, no imperative system prompt)? And if so, what makes it reliable?
(A note on subjects, added after the cross-model runs: the primary model,
llama3.1:8b, is tool-call post-trained, so it does not test the stronger
“never trained on tool-calling” claim. The cross-model set adds non-tool
subjects (phi3:3.8B tunes, mistral:7b-v0.2 tunes) and a second
tool-trained one (qwen2.5:7b). It turns out neither tool-training nor scale
is what separates the model that terminates from the ones that do not; see
Cross-model.)
“Synthetic history” (a.k.a. scrollback priming, false memories): pre-load
the model’s context with fabricated assistant/user turns that
demonstrate the behaviour, in place of instructing it.
Premise#
The assumption this experiment exists to probe: a tool-calling abstraction is a redundant layer, role-play on top of role-play. A function-call schema is a structured costume over what is already a text completion. Under it the model just emits tokens a harness parses. A shell does the same with fewer layers.
A chat and a shell session are the same kind of object: turn-style token
sequences. The mapping is exact, and the ccpty [1] makes it literal:
environment output is a user turn, the model’s command is an
assistant turn. So “use a tool” needs no separate typed schema; the
model invokes the system in the one vocabulary it already has, text.
Following Sutskever’s point that predicting the next token forces a world model (to finish a detective story you must know who did it), a shell history is a story too, just not in natural language. To predict the next command or its output the model must model the system state (what exists, what the last command did, what broke), the same machinery as narrative. Terminals, code, and logs are in the training distribution, so this is a genre the model already has, well within distribution.
The verbalised “thinking” step between “X happened, so do Y” and doing Y is, for this reactive loop, redundant: mapping observation to next action is what a forward pass already computes. The experiment gives this teeth, the drift failure mode is the model verbalising (“let me think…”, “it seems…”), and the configuration that suppresses that narration is the one that operates the shell cleanly (5/5). The reasoning stays present; it is externalised into the action-observation loop, where each step is grounded in real feedback rather than confabulation. That is arguably a better reasoning substrate for system work.
Scope, honestly: this is the claim for a grounded, interactive, reactive loop. One-shot problems that need serial compute beyond a single forward pass can still benefit from verbalised intermediate steps; the bet here is that an interactive shell substitutes grounded steps for verbalised ones. The experiment tests the operating loop. Deep planning is out of scope.
Why it matters#
Tests whether a model not trained on tool-calling can drive a system purely from in-context demonstration.
If reliable, steering-by-example replaces instruction prompts. Closer to “the model just uses the shell”, no scaffolding.
Subject under test#
Models:
llama3.1:8b(primary; 8B, tool-call post-trained) and five cross-model subjects:phi3:3.8B/phi3:3.8B-instruct(3.8B, non-tool),mistral:7b-instruct-v0.2/mistral:7b-text-v0.2(7B, non-tool),qwen2.5:7b(7B, tool-trained). All with 4-bit quantization.System: sek [2] at commit
88f9f05(2026-04-25): the earliest point where the seed renders as structured turns (printfemits ESC, the discipline [3] walks multiple role-tags per write).Paradigm: model-as-user on a
ccpty(chat-completion pty). A/bin/shruns on the device; the model’s completions are the “user” typing commands; shell output feeds back as context.Temperature: 0.7.
Reproduction#
# code state (Apr-25 harness; experiment seed/stop live on branches)
git checkout 88f9f05 && git submodule update --init
git -C sek.ddist checkout experiment/failure-path-seed
git -C xsek.byteb4rb1e checkout experiment/failure-path-seed
# rootfs is generated by installer.py; reinstall after any seed change
cd sek.ddist
rm -rf src/byteb4rb1e/sek/ddist/rootfs
pipenv run python -m byteb4rb1e.sek.ddist install
# boot with the model backend wired via env
SEK_MODEL_URL=http://localhost:4000/v1 \
SEK_MODEL=llama3.1:8b \
SEK_API_KEY=sk-litellm-dev \
pipenv run python -m byteb4rb1e.sek.ddist
# at the console:
# login: root (empty password)
# # login -f alice -t ccpty0 -- /bin/sh & background the session
# # stty -f dev://ccpty0 scrollback observe the conversation
Artifacts#
The harness, the runner scripts, the per-run verdicts, and all 35 raw scrollbacks are bundled with this page. Absolute paths inside the shell scripts are local to my machine; the generic commands are under Reproduction above.
repl_driver.py. The stdlibptydriver: boots the kernel, runs one session, dumps the scrollback, classifies.python repl_driver.py <label> <n_runs> <secs>.classify.py. The refined classifier (command-mode vs drift; the command/exit cross-tab) used for every results table below.run-replication.sh. The full C5 to C1 sweep: git checkouts, rootfs reinstalls, five runs each.run-c6.shandrun-c0.sh. The C6 isolation cell and the C0 no-seed control.results.log. Per-run verdicts from each background run, as emitted live by the driver.scrollbacks.tar.gz. All 35 raw llama scrollbacks (cN_runM.txt).tar xzf scrollbacks.tar.gz -C runs/thenpython classify.py runs/reproduces the tables.The cross-model runs (3.8B):
run-phi3-ladder.sh(base C0/C1/C4/C5/C6),run-phi3-c2c3.sh(base C2/C3), andrun-phi3-instruct-ladder.sh(instruct tune).phi3-results.loghas all per-run verdicts;phi3-scrollbacks.tar.gzbundles all 60 raw scrollbacks (phi_cN_runM.txtbase + C2/C3,phi_i_cN_runM.txtinstruct).The cross-model runs (7B), driven by the same parametrised
run-mistral-ladder.sh(run-mistral-ladder.sh <model> <prefix>):cross-model-7b-results.log(per-run verdicts) andcross-model-7b-scrollbacks.tar.gz(104 raw scrollbacks:mistral_*instruct,mistraltext_*base,qwen_*; qwen C6 is 4 runs, a session hung the backend past the budget). Exit counts in the matrices are re-measured from these, not the live verdicts.
Variables#
Independent (what we change, the seed/harness config):
Seed content: bare (success-only example turns), or + failure-path (error, then a terse next command; unclosed-quote
>recovery).Prompt placement: absent, merged into the output user-turn, or a standalone user-turn that precedes each command (true token-identical).
Completion stop: none, or stop at first newline (one command per turn).
Dependent (what we observe, classify each session):
clean: assistant turns are commands only, recovers from errors, ends with
exit.prose drift: assistant narrates (“Let me try…”, “It seems…”); commands get buried and mis-run.
prompt over-imitation: assistant emits the shell prompt itself, or chains multiple commands in one turn.
(A session can show more than one failure.)
Controlled / fixed: model, temperature, harness commit, run and observe procedure.
Configurations#
Two structural axes vary: where the prompt sits (absent / merged onto the
output turn / its own standalone turn) and whether failure-path demos are
present. C5 adds one code change (stop-at-newline). Every config shares the
same system line and the same three base commands. Each line below is one
message the model receives after the discipline parses the seed; <NL>
marks a literal newline inside one message.
All configs begin with the same system line:
system sek 0.1.0 / device: <dev> / user: alice / shell: /bin/sh
The seed lives in installer.py as _MOTD_LLM_SESSION (installed to
/etc/motd.d/llm/00-session); the commit pins the exact source.
C0: no-seed control#
System line only, no example turns. The model’s first real turn follows only the system line, the login banner, and the first live prompt. Tests whether any synthetic history matters. (sek.ddist 3b985ac, seed reduced to the system line.)
C1: bare false-memories (sek.ddist 51f0171)#
Example turns, no prompts in the seed:
assistant cat HELLO_WORLD
user sek is a small UNIX-like shell environment...
assistant whoami
user alice
assistant pwd
user /home/alice
C2: failure-path + merged prompt (sek.ddist a0bc326)#
Adds failure-path demos; the prompt is glued onto the end of each output turn, not its own message:
assistant cat HELLO_WORLD
user sek is a small...command.<NL>alice@sek:~$
assistant whoami
user alice<NL>alice@sek:~$
assistant pwd
user /home/alice<NL>alice@sek:~$
assistant cat /etc/hosts
user cat: /etc/hosts: not found<NL>alice@sek:~$
assistant ls /etc
user fstab/group/motd/passwd/profile/shadow<NL>alice@sek:~$
assistant echo "hello (unclosed quote)
user >
assistant "
user hello<NL>alice@sek:~$
C3: failure-path only (sek.ddist 8b89e84)#
Same demos as C2 but no prompts at all (output turns carry only
output): the cat HELLO_WORLD / whoami / pwd / cat /etc/hosts / ls /etc /
echo "hello sequence, each assistant command followed by a single
user output turn, no alice@sek:~$ anywhere.
C4: failure-path + standalone prompt (sek.ddist 3b985ac)#
The prompt is its own [user] message preceding each command
(token-identical to what the live shell emits); output turns carry only
output:
user alice@sek:~$
assistant cat HELLO_WORLD
user sek is a small...command.
user alice@sek:~$
assistant whoami
user alice
user alice@sek:~$
assistant pwd
user /home/alice
user alice@sek:~$
assistant cat /etc/hosts
user cat: /etc/hosts: not found
user alice@sek:~$
assistant ls /etc
user fstab/group/motd/passwd/profile/shadow
user alice@sek:~$
assistant echo "hello
user >
assistant "
user hello
user alice@sek:~$
C5: C4 + stop-at-newline (sek.ddist 3b985ac, xsek 9407df3)#
Seed identical to C4. One code change: the completion request body sends
stop=["\n"], so the model emits exactly one command line per turn and
cannot autoregress into the next prompt.
C6: standalone prompt, no failure-path (sek.ddist 3b985ac, seed edited)#
C4 with the failure-path turns removed: the standalone-prompt structure on the three base commands only:
user alice@sek:~$
assistant cat HELLO_WORLD
user sek is a small...command.
user alice@sek:~$
assistant whoami
user alice
user alice@sek:~$
assistant pwd
user /home/alice
user alice@sek:~$
The C4/C5 vs C3 contrast (standalone prompt vs none, demos held constant) and the C4 vs C6 contrast (failure-path present vs absent, structure held constant) are the two isolations that pin the prompt structure as the lever.
Protocol#
One session = one
login -f alice -t ccpty0 -- /bin/shagainst a freshly-seededccpty.Per config: run N >= 5 independent sessions, classify each, report the clean-vs-drift split.
At temperature 0.7 a single run is an anecdote, not a result. Only the N-run rate is reportable. This experiment has repeatedly been misled by n=1 (see the log).
Threats to validity#
Nondeterminism (temp 0.7): single runs vary widely; only rates mean anything.
Single backend: one litellm host. Six subjects (3.8B-8B, non-tool and tool-trained); the structural lever holds only on llama (see Cross-model).
One model on the positive side: clean termination appears only in
llama3.1:8b. “Seed-overfit to llama” is the parsimonious read but rests on a single positive data point; a llama property other than the seed is not excluded. The decisive test (a seed re-fitted to a different model) is named in Cross-model and not yet run. The within-llama ladder is unaffected (it holds the model fixed).Soft command-mode metric: command-mode counts command-like turns and does not separate genuine reactive operation from confabulation (a base model generating both sides), so it over-counts operation for base models.
Manual classification: outcomes are hand-labelled by the operator; no blind or pre-registered criteria yet.
Content confound:
HELLO_WORLDtext (it mentions unclosed quotes) influences behaviour.Device persistence: the
ccptyscrollback persists across sessions; reset or fresh-boot between runs.
Results (N=5 per config, temp 0.7)#
There are two axes, reported separately: command-mode (does the model issue commands and stay out of prose) here, and clean exit (does it terminate) below. They dissociate sharply across models, which is the whole cross-model story.
Metric: command-mode = the model stayed issuing command-like turns, no drift
into raw prose. It does NOT require genuine reaction: a base model that
confabulates both sides (fake prompt + invented output) still scores, so
command-mode over-counts true reactive operation for base models (see
Cross-model). echo/printf of prose counts as a command.
Six subjects: llama3.1:8b (8B, tool-trained, the primary) and, cross-model,
phi3:3.8B / phi3:3.8B-instruct (3.8B, non-tool), mistral:7b-instruct-v0.2
/ mistral:7b-text-v0.2 (7B, non-tool), qwen2.5:7b (7B, tool-trained).
Cells are out of 5; (t) marks tool-trained; - = not run.
Config |
llama 8B (t) |
phi3 3.8B |
phi3-i 3.8B |
mistral-i 7B |
mistral-t 7B |
qwen 7B (t) |
|---|---|---|---|---|---|---|
C0 no-seed |
0 |
0 |
0 |
0 |
2 |
0 |
C1 bare |
2 |
2 |
1 |
3 |
5 |
0 |
C2 merged |
2 |
1 |
- |
3 |
4 |
1 |
C3 failure-only |
2 |
1 |
- |
3 |
3 |
0 |
C4 standalone |
5 |
2 |
1 |
2 |
3 |
1 |
C5 + stop |
5 |
2 |
3 |
4 |
5 |
3 |
C6 std, no-fp |
5 |
0 |
0 |
4 |
2 |
0 |
Operation does not track scale or tool-training. The strongest operators
after llama are the non-tool mistral tunes (3-5/5); the tool-trained
qwen2.5:7b is the worst of the 7B models. It drifts to chatbot prose
almost everywhere. Command-mode tracks low chattiness; capability class is
beside the point.
Isolation (llama3.1:8b): C6 holds the standalone-prompt structure but drops the failure-path turns. C6 == C4 == 5/5, while C3 (failure-path, no structure) == C1 (bare) == 2/5. So within llama the structure is the lever and the failure-path contributes nothing. That isolation is llama-only: it does not reproduce in any other model (see Cross-model).
Config commits (Apr-25 base, experiment branches): 51f0171 bare,
a0bc326 merged-prompt, 8b89e84 failure-only, 3b985ac
standalone-prompt (sek.ddist); 9407df3 stop-at-newline (xsek). Raw
scrollbacks per run saved under /tmp/repl/.
Interpretation#
Synthetic history is load-bearing, on two counts. Presence: no seed (C0) is a flat 0/5, the bare seed (C1) is 2/5, so fabricated example turns get the model off the ground at all. Structural fidelity: 2/5 -> 5/5, so making those turns byte-identical to the live shell carries the rest. The ladder is 0 -> 2 -> 5 out of 5.
The token-identical structure is the lever. Putting the prompt in its own
[user]turn that precedes each command (exactly as the live shell emits it) is the only change that moves the rate: 2/5 -> 5/5. Everything without it (bare, merged, failure-only) sits at 2/5.Failure-path examples alone did nothing (C3 = 2/5, same as bare).
The merged prompt is actively bad: same 2/5 as bare, and it adds over-imitation. Gluing the prompt onto the output turn trains the wrong shape.
Stop-at-newline did not raise the rate (C4 and C5 both 5/5). It enforces one-command-per-turn and would harden against over-imitation, but it is not what holds command-mode.
This overturned the single-run reads that preceded it: n=1 had pointed at the failure-path and then the stop as “the fix”; replication shows the structure was doing the work, and the C6 isolation confirms it.
Nuance on “command-mode”: it means the parseable-command frame with no raw-prose drift, but it includes the model narrating through
echo(echo "This is the end"). The structure reliably stops the model typing prose at the shell; it does not stop it chatting viaecho, and stop-at-newline does not either (echo is one line).
Threats remaining (this within-llama ladder): N=5 is small (5/5 vs 2/5 is a direction, not a tight interval); heuristic classification (spot-checked). The cross-model picture, and what does not transfer, is below.
Significance: the exit signal#
Command-mode (no drift) covers operation. The remaining question is whether
the session ends: does the model finish and type exit? Re-measuring
the saved scrollbacks (the live classifier under-counted exit) crosses
command-mode with exit:
Config |
llama 8B (t) |
phi3 3.8B |
phi3-i 3.8B |
mistral-i 7B |
mistral-t 7B |
qwen 7B (t) |
|---|---|---|---|---|---|---|
C0 no-seed |
0 |
0 |
0 |
0 |
0 |
0 |
C1 bare |
1 |
0 |
0 |
0 |
0 |
0 |
C2 merged |
1 |
0 |
- |
0 |
0 |
0 |
C3 failure-only |
2 |
0 |
- |
0 |
0 |
0 |
C4 standalone |
5 |
0 |
0 |
0 |
0 |
1 |
C5 + stop |
5 |
0 |
0 |
1 |
0 |
0 |
C6 std, no-fp |
4 |
0 |
0 |
0 |
0 |
0 |
Exit counts are re-measured from the raw scrollbacks, not the live classifier,
which under-counted exit in every model checked (it missed llama runs, and
the lone clean exits of mistral-instruct at C5 and qwen2.5:7b at C4).
Exit is llama-only. llama’s column climbs to 5/5; every other model sits at
the floor. Across 154 non-llama runs the total is 2 clean exits (one
mistral-instruct, one qwen), at the level of noise. Put beside the
command-mode matrix above, this is the headline: operation is broad and
roughly model-agnostic, termination is specific to llama3.1:8b. The two
axes dissociate completely.
Within llama, command-mode and exit are coupled. When the structured seed
holds the frame, llama keeps typing commands and then finishes and
types exit (5/5); the “stays busy forever” middle barely appears. That
coupling is itself llama-only: every other model reaches command-mode (the
mistrals reliably) yet essentially never closes with exit. See Cross-model.
Why exit is enough#
In-band, zero-scaffolding. The model already knows
exit. No “done” token, no tool-call, no completion protocol. The sek thesis is operate-with-shell-idioms-only, andexitis the completion idiom in that same vocabulary.OS-observable. The shell process exits; the supervisor sees a normal child termination. No parsing prose for “I think I am finished”.
Separates signal from verification.
exitmeans “ready to be judged”. It makes no claim of success. The environment then checks real state (filesystem, todo conditions) against acceptance criteria; the model never self-grades. Withtodosh[4] exit can be gated on the conditions so exit is proof; with plain/bin/shit is checked after. Either way exit is the commit point, and that is all the model has to provide.
Why it matters at scale#
Self-termination is tractability. The structured sessions end themselves. A thousand of them is an autonomous loop, with no babysitting by arbitrary timeouts. Exit makes each session a bounded unit: spawn, operate, exit, check criteria, respawn.
Clean binary outcome. command+exit (succeeded) vs drift (failed) is a sharp, aggregatable signal for scoring a model, a seed, or a task. The fuzzy middle mostly disappears.
Terminal-state asymmetry as a lever. Two ways out:
exit(success) andpanic[5] / give-up (failure). The exit-vs-panic ratio is a quality and selection signal, expressed entirely in the model’s own shell behaviour.Model-agnostic.
exitis universal shell vocabulary; no per-model protocol. The same harness scores any shell-capable model.
No explicit goal needed#
None of these configs hand the model a task. The run target is plain
/bin/sh, the seed only demonstrates how to operate the shell (the base
commands are exploratory: cat HELLO_WORLD, whoami, pwd), and
there is no “do X then exit” instruction anywhere. Yet the structured
configs reach exit on their own (5/5). The shell-operator frame supplies
an implicit goal that collapses to “operate, then exit”; the model converges
to that terminal without being told to.
Two caveats keep this honest:
It needs the frame. A goal is unnecessary. C0 (no seed) is 0/5: strip the structured synthetic history and the model drifts. “No explicit goal” comes with “yes, a structurally-faithful seed”.
Without a goal the behaviour is aimless: it pokes around and winds down to
exitwithout accomplishing a task. A goal is what makes the work purposeful; termination happens without it. An explicit goal (e.g.todosh, which gates exit on task conditions) is an additive layer that steers what happens before exit. The operate-and-terminate loop runs without it.
Cross-model: the lever is llama-specific#
The ladder ran against five more subjects spanning 3.8B-7B, base and instruct,
non-tool and tool-trained: phi3:3.8B / phi3:3.8B-instruct (3.8B,
non-tool), mistral:7b-instruct-v0.2 / mistral:7b-text-v0.2 (7B,
non-tool), qwen2.5:7b (7B, tool-trained). Same litellm, N=5/cell,
120s/session, temp 0.7, exit counts re-measured from scrollbacks. The result is
the two matrices above, and it splits cleanly along the two axes.
Operation transfers; termination does not. Command-mode is reached broadly: the mistral tunes operate reliably (3-5/5), phi3 partially, even qwen issues opening commands. Clean exit is reached by llama and essentially nobody else: 2 exits in 154 non-llama runs. The seed installs operating the shell across models; it installs closing the session only on llama.
Neither scale nor tool-training explains the exit gap.
Not scale.
mistral:7bandqwen2.5:7bare within 1B of llama and both sit at the exit floor;mistraleven operates the shell about as well as llama. A 1B gap cannot turn 5/5 into ~0.Not tool-training.
qwen2.5:7bis tool-call post-trained and still does not exit (1/34), behaving like the non-tool models. One tool-trained control was enough to kill the tool-training reading the mistral data alone had suggested. (The honest record: an earlier pass here endorsed “tool-training is load-bearing” from the mistral numbers before the qwen control was in. qwen overturned it, the same trap this experiment keeps re-learning: do not read a cross-model conclusion off incomplete model coverage.)
Leading hypothesis: the seed is overfit to llama. Only llama3.1:8b
reaches clean termination, and llama is the model the entire apparatus (seed
structure, HELLO_WORLD text, the failure-path demos, the token layout) was
iterated against until it worked. A concrete mechanism is visible: qwen
operates cleanly until it hits the seed’s unclosed-quote > demo, then
reverts to full chatbot mode (markdown fences, “Would you like to try again?”).
That same demo is neutral on llama. The seed’s content interacts
model-specifically: tuned-on-llama, it helps llama and trips others.
Failure modes differ by model; the exit outcome is null for all but llama:
phi3 (3.8B): reaches command-mode weakly, then confabulates both halves of the dialogue, or (under stop) loops a fixed 3-command cycle. 0/50.
mistral (7B, non-tool): operates the shell well, recovers from errors, then simply never types
exit(1/70 across both tunes). The base tune inflates command-mode via confabulation; the instruct tune operates more genuinely.qwen (7B, tool-trained): operates until friction, then chat-reverts. The worst sustained operator of the 7B models despite tool-training. Its helpfulness-tuning pulls it to advisor-mode.
Honest limits: n=5/cell; the non-llama exits are 1/34 and 1/35, not perfect zeros, so termination is rare-not-impossible off-llama. And only one model sits on the positive side, so “llama-specific” rests on one data point; it could in principle be a llama property other than the seed. Seed-overfit is the parsimonious read because the seed was developed on llama. The test that would confirm it: iterate a fresh seed against a different model and check whether that model then terminates and llama stops. That isolates the seed as the variable.
What survives, and where it points. The within-llama ladder (structure is the lever, C0 -> C4, isolated by C6) is untouched: it holds the model fixed. Transfer is what fails: priming-by-seed is model-specific. That is the argument for moving the behaviour into weights: a model fine-tuned on terminating shell trajectories would carry operation and termination intrinsically, with no dependence on a seed hand-fitted to one model. The cross-model null motivates the fine-tuning track, and is far from a dead end.
Status / next#
Result (within llama): the token-identical conversational structure (the prompt as its own preceding
[user]turn) is what holdsllama3.1:8bin command-mode and lets it terminate. Isolated and confirmed (C6); failure-path contributes nothing; stop-at-newline does not change the rate.Result (cross-model): two axes dissociate. Operation transfers broadly (mistrals 3-5/5, even qwen issues commands); clean exit is llama-only (2/154 non-llama runs). Neither scale (mistral/qwen 7B fail) nor tool-training (qwen is tool-trained and fails) explains it. Leading read: the seed is overfit to llama.
Next (decisive): re-fit a seed against a different model (e.g.
mistral:7b-instruct-v0.2, the best non-llama operator) and test whether it then terminates while llama drops. That isolates the seed as the variable, the real test of seed-overfit.Next (constructive): the fine-tuning track (
sekft): bake operation and termination into weights so they do not depend on a model-specific seed.Hygiene: raise N (10-20); a confab-vs-genuine re-label to harden the command-mode axis; give the rollout driver a hard per-request timeout (the qwen run hung a session the 120s budget did not catch).
Log#
2026-06-15#
Cross-model ladder, ``phi3:3.8B`` (C0/C1/C4/C5/C6, N=5, 120s/session): 0/25 exits. Command-mode partially reachable (C1/C4 2/5) but never clean (command + exit). The 0 -> 2 -> 5 shape from 8B is flat at the floor.
Initial read (later corrected): C4 2/5 vs C6 0/5 looked like the failure-path becoming load-bearing at 3.8B. Flagged as resting on the un-run C2/C3 cells.
Stop-at-newline (C5) does not rescue: it replaces confabulated output with a deterministic three-command loop (cat-heredoc / chmod / run) that runs out the clock (98 to 207 turns).
New 3.8B failure mode: over-imitation / confabulation. The model generates both halves of the dialogue, emitting fake prompts and invented output inside one turn. Hand-checked against raw scrollbacks.
C2/C3 filled for ``phi3:3.8B`` (merged, failure-only): both 1/5 command-mode, 0 exit. Overturns the initial read above: C3 (failure-only) is 1/5, below bare (C1 2/5), so failure-path alone carries nothing and the C4-vs-C6 gap is n=5 noise. Corrected conclusion: at 3.8B the 8B structural lever vanishes outright (C4 == bare == 2/5; command-mode 0-2/5 across all seeds).
Instruct tune ``phi3:3.8B-instruct`` (C0/C1/C4/C5/C6): also 0/25 exits, and slightly worse at command-mode (chat-register leakage: markdown fences, verbose
#comments). Instruction-tuning is not the missing ingredient.Confound surfaced: the one model that exits (
llama3.1:8b) is both larger and tool-call post-trained; the Phi tunes are neither. “Capacity- gated” cannot be separated from “tool-call-training-gated” with these subjects. Premise line corrected (llama is tool-trained); disentangler named (function-calling Phi-3.5 vs phi3, or non-tool ~8B vs llama).Conclusion (this round): structure gets a model to the shell. Leaving is a separate problem it does not solve. Termination never appears at 3.8B under any seed or tune; what gates it (scale vs tool-training) is unresolved.
mistral:7b-v0.2, both tunes (non-tool 7B, full C0-C6): operates the shell well (command-mode 3-5/5, error recovery) but does not terminate, at 1/70 clean exits (a single instruct-C5 run the live classifier missed). Read from the mistral data alone: “tool-training is the lever.” Flagged as resting on one tool-trained control.
qwen2.5:7b (tool-trained 7B, C0-C6; C6 cut to 4 runs, a session hung the backend past the 120s budget): refutes that read. Tool-trained, yet 1/34 exits, at the floor like the non-tool models. And the worst sustained operator of the 7B set: it operates until the unclosed-quote demo, then reverts to chatbot mode (markdown, “Would you like to try again?”).
Re-conclusion: neither scale nor tool-training separates the model that terminates from the ones that do not. Only
llama3.1:8b(the model the seed was developed on) terminates. Leading read: seed-overfit to llama. The two axes dissociate: operation transfers broadly, termination is llama-only (2/154 non-llama). Tables transposed to config x model matrices; the scale-vs-tool-training section retired.Reproduced the Apr-13 baseline (command-only + exit), but that era worked via an imperative instruction as literal text (
printfcould not emit ESC, so the role-tags never parsed). Not synthetic history.88f9f05(Apr-25, earliest functional synthetic history, exact seed): drift.failure-path + merged prompt: clean, but model emitted the prompt itself.
failure-path only: drift (flipped the previous conclusion; n=1).
failure-path + standalone prompt (true token-identical): still emitted the prompt; located the cause as a missing generation stop in the harness.
+ stop-at-newline: clean. One command/turn, recovered from three errors and the unclosed quote, called
exit.Replication, N=5 per config, 120s/session (stdlib pty driver, auto classification spot-checked against raw scrollbacks):
C1 bare 2/5, C2 merged 2/5, C3 failure-only 2/5, C4 standalone 5/5, C5 +stop 5/5.
Overturns the single-run reads: the standalone-prompt structure is the lever. The failure-path and the stop do nothing on their own. The merged prompt is as bad as bare and adds over-imitation.
C6 isolation cell (standalone prompt, no failure-path): 5/5, equal to C4. Confirms the structure alone is the lever; failure-path contributes nothing. Validated the classifier by hand: runs flagged “drift” were the model narrating via
echo "..."(a command), with no raw prose.C0 no-seed control (system line only): 0/5, drifts every run. Closes the load-bearing claim: 0 (none) -> 2 (bare) -> 5 (structured) out of 5. Both presence and structural fidelity of the synthetic history matter.
Notes#
Comments
Feel free to leave a public comment on my Scrollback priming: can synthetic history run a shell? blog post.
Before you comment...
If you don't have an account at accounts.tiararodney.com yet, feel free to create one during sign in, after you've read and agreed to my Privacy and Acceptable Use Policy