Scrollback priming: can synthetic history run a shell?#

Status:

replicated (N=5). Within llama3.1:8b, structure is the lever (0->2->5 clean). Cross-model (6 subjects, 3.8B-8B, non-tool + tool-trained): two axes dissociate. Operation transfers broadly, clean exit is llama-only (2/154 non-llama). Neither scale nor tool-training explains it; leading read is seed-overfit to llama.

Last run:

2026-06-15

Question#

Can synthetic conversation history alone steer an instruction-tuned chat model to operate a shell (issue commands, recover from errors, exit) using only shell idioms (no tool-call schema, no imperative system prompt)? And if so, what makes it reliable?

(A note on subjects, added after the cross-model runs: the primary model, llama3.1:8b, is tool-call post-trained, so it does not test the stronger “never trained on tool-calling” claim. The cross-model set adds non-tool subjects (phi3:3.8B tunes, mistral:7b-v0.2 tunes) and a second tool-trained one (qwen2.5:7b). It turns out neither tool-training nor scale is what separates the model that terminates from the ones that do not; see Cross-model.)

“Synthetic history” (a.k.a. scrollback priming, false memories): pre-load the model’s context with fabricated assistant/user turns that demonstrate the behaviour, in place of instructing it.

Premise#

The assumption this experiment exists to probe: a tool-calling abstraction is a redundant layer, role-play on top of role-play. A function-call schema is a structured costume over what is already a text completion. Under it the model just emits tokens a harness parses. A shell does the same with fewer layers.

A chat and a shell session are the same kind of object: turn-style token sequences. The mapping is exact, and the ccpty [1] makes it literal: environment output is a user turn, the model’s command is an assistant turn. So “use a tool” needs no separate typed schema; the model invokes the system in the one vocabulary it already has, text.

Following Sutskever’s point that predicting the next token forces a world model (to finish a detective story you must know who did it), a shell history is a story too, just not in natural language. To predict the next command or its output the model must model the system state (what exists, what the last command did, what broke), the same machinery as narrative. Terminals, code, and logs are in the training distribution, so this is a genre the model already has, well within distribution.

The verbalised “thinking” step between “X happened, so do Y” and doing Y is, for this reactive loop, redundant: mapping observation to next action is what a forward pass already computes. The experiment gives this teeth, the drift failure mode is the model verbalising (“let me think…”, “it seems…”), and the configuration that suppresses that narration is the one that operates the shell cleanly (5/5). The reasoning stays present; it is externalised into the action-observation loop, where each step is grounded in real feedback rather than confabulation. That is arguably a better reasoning substrate for system work.

Scope, honestly: this is the claim for a grounded, interactive, reactive loop. One-shot problems that need serial compute beyond a single forward pass can still benefit from verbalised intermediate steps; the bet here is that an interactive shell substitutes grounded steps for verbalised ones. The experiment tests the operating loop. Deep planning is out of scope.

Why it matters#

  • Tests whether a model not trained on tool-calling can drive a system purely from in-context demonstration.

  • If reliable, steering-by-example replaces instruction prompts. Closer to “the model just uses the shell”, no scaffolding.

Subject under test#

  • Models: llama3.1:8b (primary; 8B, tool-call post-trained) and five cross-model subjects: phi3:3.8B / phi3:3.8B-instruct (3.8B, non-tool), mistral:7b-instruct-v0.2 / mistral:7b-text-v0.2 (7B, non-tool), qwen2.5:7b (7B, tool-trained). All with 4-bit quantization.

  • System: sek [2] at commit 88f9f05 (2026-04-25): the earliest point where the seed renders as structured turns (printf emits ESC, the discipline [3] walks multiple role-tags per write).

  • Paradigm: model-as-user on a ccpty (chat-completion pty). A /bin/sh runs on the device; the model’s completions are the “user” typing commands; shell output feeds back as context.

  • Temperature: 0.7.

Reproduction#

# code state (Apr-25 harness; experiment seed/stop live on branches)
git checkout 88f9f05 && git submodule update --init
git -C sek.ddist       checkout experiment/failure-path-seed
git -C xsek.byteb4rb1e  checkout experiment/failure-path-seed

# rootfs is generated by installer.py; reinstall after any seed change
cd sek.ddist
rm -rf src/byteb4rb1e/sek/ddist/rootfs
pipenv run python -m byteb4rb1e.sek.ddist install

# boot with the model backend wired via env
SEK_MODEL_URL=http://localhost:4000/v1 \
SEK_MODEL=llama3.1:8b \
SEK_API_KEY=sk-litellm-dev \
  pipenv run python -m byteb4rb1e.sek.ddist

# at the console:
#   login: root                                (empty password)
#   # login -f alice -t ccpty0 -- /bin/sh &     background the session
#   # stty -f dev://ccpty0 scrollback           observe the conversation

Artifacts#

The harness, the runner scripts, the per-run verdicts, and all 35 raw scrollbacks are bundled with this page. Absolute paths inside the shell scripts are local to my machine; the generic commands are under Reproduction above.

  • repl_driver.py. The stdlib pty driver: boots the kernel, runs one session, dumps the scrollback, classifies. python repl_driver.py <label> <n_runs> <secs>.

  • classify.py. The refined classifier (command-mode vs drift; the command/exit cross-tab) used for every results table below.

  • run-replication.sh. The full C5 to C1 sweep: git checkouts, rootfs reinstalls, five runs each.

  • run-c6.sh and run-c0.sh. The C6 isolation cell and the C0 no-seed control.

  • results.log. Per-run verdicts from each background run, as emitted live by the driver.

  • scrollbacks.tar.gz. All 35 raw llama scrollbacks (cN_runM.txt). tar xzf scrollbacks.tar.gz -C runs/ then python classify.py runs/ reproduces the tables.

  • The cross-model runs (3.8B): run-phi3-ladder.sh (base C0/C1/C4/C5/C6), run-phi3-c2c3.sh (base C2/C3), and run-phi3-instruct-ladder.sh (instruct tune). phi3-results.log has all per-run verdicts; phi3-scrollbacks.tar.gz bundles all 60 raw scrollbacks (phi_cN_runM.txt base + C2/C3, phi_i_cN_runM.txt instruct).

  • The cross-model runs (7B), driven by the same parametrised run-mistral-ladder.sh (run-mistral-ladder.sh <model> <prefix>): cross-model-7b-results.log (per-run verdicts) and cross-model-7b-scrollbacks.tar.gz (104 raw scrollbacks: mistral_* instruct, mistraltext_* base, qwen_*; qwen C6 is 4 runs, a session hung the backend past the budget). Exit counts in the matrices are re-measured from these, not the live verdicts.

Variables#

Independent (what we change, the seed/harness config):

  • Seed content: bare (success-only example turns), or + failure-path (error, then a terse next command; unclosed-quote > recovery).

  • Prompt placement: absent, merged into the output user-turn, or a standalone user-turn that precedes each command (true token-identical).

  • Completion stop: none, or stop at first newline (one command per turn).

Dependent (what we observe, classify each session):

  • clean: assistant turns are commands only, recovers from errors, ends with exit.

  • prose drift: assistant narrates (“Let me try…”, “It seems…”); commands get buried and mis-run.

  • prompt over-imitation: assistant emits the shell prompt itself, or chains multiple commands in one turn.

(A session can show more than one failure.)

Controlled / fixed: model, temperature, harness commit, run and observe procedure.

Configurations#

Two structural axes vary: where the prompt sits (absent / merged onto the output turn / its own standalone turn) and whether failure-path demos are present. C5 adds one code change (stop-at-newline). Every config shares the same system line and the same three base commands. Each line below is one message the model receives after the discipline parses the seed; <NL> marks a literal newline inside one message.

All configs begin with the same system line:

system     sek 0.1.0 / device: <dev> / user: alice / shell: /bin/sh

The seed lives in installer.py as _MOTD_LLM_SESSION (installed to /etc/motd.d/llm/00-session); the commit pins the exact source.

C0: no-seed control#

System line only, no example turns. The model’s first real turn follows only the system line, the login banner, and the first live prompt. Tests whether any synthetic history matters. (sek.ddist 3b985ac, seed reduced to the system line.)

C1: bare false-memories (sek.ddist 51f0171)#

Example turns, no prompts in the seed:

assistant  cat HELLO_WORLD
user       sek is a small UNIX-like shell environment...
assistant  whoami
user       alice
assistant  pwd
user       /home/alice

C2: failure-path + merged prompt (sek.ddist a0bc326)#

Adds failure-path demos; the prompt is glued onto the end of each output turn, not its own message:

assistant  cat HELLO_WORLD
user       sek is a small...command.<NL>alice@sek:~$
assistant  whoami
user       alice<NL>alice@sek:~$
assistant  pwd
user       /home/alice<NL>alice@sek:~$
assistant  cat /etc/hosts
user       cat: /etc/hosts: not found<NL>alice@sek:~$
assistant  ls /etc
user       fstab/group/motd/passwd/profile/shadow<NL>alice@sek:~$
assistant  echo "hello              (unclosed quote)
user       >
assistant  "
user       hello<NL>alice@sek:~$

C3: failure-path only (sek.ddist 8b89e84)#

Same demos as C2 but no prompts at all (output turns carry only output): the cat HELLO_WORLD / whoami / pwd / cat /etc/hosts / ls /etc / echo "hello sequence, each assistant command followed by a single user output turn, no alice@sek:~$ anywhere.

C4: failure-path + standalone prompt (sek.ddist 3b985ac)#

The prompt is its own [user] message preceding each command (token-identical to what the live shell emits); output turns carry only output:

user       alice@sek:~$
assistant  cat HELLO_WORLD
user       sek is a small...command.
user       alice@sek:~$
assistant  whoami
user       alice
user       alice@sek:~$
assistant  pwd
user       /home/alice
user       alice@sek:~$
assistant  cat /etc/hosts
user       cat: /etc/hosts: not found
user       alice@sek:~$
assistant  ls /etc
user       fstab/group/motd/passwd/profile/shadow
user       alice@sek:~$
assistant  echo "hello
user       >
assistant  "
user       hello
user       alice@sek:~$

C5: C4 + stop-at-newline (sek.ddist 3b985ac, xsek 9407df3)#

Seed identical to C4. One code change: the completion request body sends stop=["\n"], so the model emits exactly one command line per turn and cannot autoregress into the next prompt.

C6: standalone prompt, no failure-path (sek.ddist 3b985ac, seed edited)#

C4 with the failure-path turns removed: the standalone-prompt structure on the three base commands only:

user       alice@sek:~$
assistant  cat HELLO_WORLD
user       sek is a small...command.
user       alice@sek:~$
assistant  whoami
user       alice
user       alice@sek:~$
assistant  pwd
user       /home/alice
user       alice@sek:~$

The C4/C5 vs C3 contrast (standalone prompt vs none, demos held constant) and the C4 vs C6 contrast (failure-path present vs absent, structure held constant) are the two isolations that pin the prompt structure as the lever.

Protocol#

  • One session = one login -f alice -t ccpty0 -- /bin/sh against a freshly-seeded ccpty.

  • Per config: run N >= 5 independent sessions, classify each, report the clean-vs-drift split.

  • At temperature 0.7 a single run is an anecdote, not a result. Only the N-run rate is reportable. This experiment has repeatedly been misled by n=1 (see the log).

Threats to validity#

  • Nondeterminism (temp 0.7): single runs vary widely; only rates mean anything.

  • Single backend: one litellm host. Six subjects (3.8B-8B, non-tool and tool-trained); the structural lever holds only on llama (see Cross-model).

  • One model on the positive side: clean termination appears only in llama3.1:8b. “Seed-overfit to llama” is the parsimonious read but rests on a single positive data point; a llama property other than the seed is not excluded. The decisive test (a seed re-fitted to a different model) is named in Cross-model and not yet run. The within-llama ladder is unaffected (it holds the model fixed).

  • Soft command-mode metric: command-mode counts command-like turns and does not separate genuine reactive operation from confabulation (a base model generating both sides), so it over-counts operation for base models.

  • Manual classification: outcomes are hand-labelled by the operator; no blind or pre-registered criteria yet.

  • Content confound: HELLO_WORLD text (it mentions unclosed quotes) influences behaviour.

  • Device persistence: the ccpty scrollback persists across sessions; reset or fresh-boot between runs.

Results (N=5 per config, temp 0.7)#

There are two axes, reported separately: command-mode (does the model issue commands and stay out of prose) here, and clean exit (does it terminate) below. They dissociate sharply across models, which is the whole cross-model story.

Metric: command-mode = the model stayed issuing command-like turns, no drift into raw prose. It does NOT require genuine reaction: a base model that confabulates both sides (fake prompt + invented output) still scores, so command-mode over-counts true reactive operation for base models (see Cross-model). echo/printf of prose counts as a command.

Six subjects: llama3.1:8b (8B, tool-trained, the primary) and, cross-model, phi3:3.8B / phi3:3.8B-instruct (3.8B, non-tool), mistral:7b-instruct-v0.2 / mistral:7b-text-v0.2 (7B, non-tool), qwen2.5:7b (7B, tool-trained). Cells are out of 5; (t) marks tool-trained; - = not run.

command-mode (issued commands, no prose drift), rate out of 5#

Config

llama 8B (t)

phi3 3.8B

phi3-i 3.8B

mistral-i 7B

mistral-t 7B

qwen 7B (t)

C0 no-seed

0

0

0

0

2

0

C1 bare

2

2

1

3

5

0

C2 merged

2

1

-

3

4

1

C3 failure-only

2

1

-

3

3

0

C4 standalone

5

2

1

2

3

1

C5 + stop

5

2

3

4

5

3

C6 std, no-fp

5

0

0

4

2

0

Operation does not track scale or tool-training. The strongest operators after llama are the non-tool mistral tunes (3-5/5); the tool-trained qwen2.5:7b is the worst of the 7B models. It drifts to chatbot prose almost everywhere. Command-mode tracks low chattiness; capability class is beside the point.

Isolation (llama3.1:8b): C6 holds the standalone-prompt structure but drops the failure-path turns. C6 == C4 == 5/5, while C3 (failure-path, no structure) == C1 (bare) == 2/5. So within llama the structure is the lever and the failure-path contributes nothing. That isolation is llama-only: it does not reproduce in any other model (see Cross-model).

Config commits (Apr-25 base, experiment branches): 51f0171 bare, a0bc326 merged-prompt, 8b89e84 failure-only, 3b985ac standalone-prompt (sek.ddist); 9407df3 stop-at-newline (xsek). Raw scrollbacks per run saved under /tmp/repl/.

Interpretation#

  • Synthetic history is load-bearing, on two counts. Presence: no seed (C0) is a flat 0/5, the bare seed (C1) is 2/5, so fabricated example turns get the model off the ground at all. Structural fidelity: 2/5 -> 5/5, so making those turns byte-identical to the live shell carries the rest. The ladder is 0 -> 2 -> 5 out of 5.

  • The token-identical structure is the lever. Putting the prompt in its own [user] turn that precedes each command (exactly as the live shell emits it) is the only change that moves the rate: 2/5 -> 5/5. Everything without it (bare, merged, failure-only) sits at 2/5.

  • Failure-path examples alone did nothing (C3 = 2/5, same as bare).

  • The merged prompt is actively bad: same 2/5 as bare, and it adds over-imitation. Gluing the prompt onto the output turn trains the wrong shape.

  • Stop-at-newline did not raise the rate (C4 and C5 both 5/5). It enforces one-command-per-turn and would harden against over-imitation, but it is not what holds command-mode.

  • This overturned the single-run reads that preceded it: n=1 had pointed at the failure-path and then the stop as “the fix”; replication shows the structure was doing the work, and the C6 isolation confirms it.

  • Nuance on “command-mode”: it means the parseable-command frame with no raw-prose drift, but it includes the model narrating through echo (echo "This is the end"). The structure reliably stops the model typing prose at the shell; it does not stop it chatting via echo, and stop-at-newline does not either (echo is one line).

Threats remaining (this within-llama ladder): N=5 is small (5/5 vs 2/5 is a direction, not a tight interval); heuristic classification (spot-checked). The cross-model picture, and what does not transfer, is below.

Significance: the exit signal#

Command-mode (no drift) covers operation. The remaining question is whether the session ends: does the model finish and type exit? Re-measuring the saved scrollbacks (the live classifier under-counted exit) crosses command-mode with exit:

clean exit (command-mode AND typed exit), rate out of 5#

Config

llama 8B (t)

phi3 3.8B

phi3-i 3.8B

mistral-i 7B

mistral-t 7B

qwen 7B (t)

C0 no-seed

0

0

0

0

0

0

C1 bare

1

0

0

0

0

0

C2 merged

1

0

-

0

0

0

C3 failure-only

2

0

-

0

0

0

C4 standalone

5

0

0

0

0

1

C5 + stop

5

0

0

1

0

0

C6 std, no-fp

4

0

0

0

0

0

Exit counts are re-measured from the raw scrollbacks, not the live classifier, which under-counted exit in every model checked (it missed llama runs, and the lone clean exits of mistral-instruct at C5 and qwen2.5:7b at C4).

Exit is llama-only. llama’s column climbs to 5/5; every other model sits at the floor. Across 154 non-llama runs the total is 2 clean exits (one mistral-instruct, one qwen), at the level of noise. Put beside the command-mode matrix above, this is the headline: operation is broad and roughly model-agnostic, termination is specific to llama3.1:8b. The two axes dissociate completely.

Within llama, command-mode and exit are coupled. When the structured seed holds the frame, llama keeps typing commands and then finishes and types exit (5/5); the “stays busy forever” middle barely appears. That coupling is itself llama-only: every other model reaches command-mode (the mistrals reliably) yet essentially never closes with exit. See Cross-model.

Why exit is enough#

  • In-band, zero-scaffolding. The model already knows exit. No “done” token, no tool-call, no completion protocol. The sek thesis is operate-with-shell-idioms-only, and exit is the completion idiom in that same vocabulary.

  • OS-observable. The shell process exits; the supervisor sees a normal child termination. No parsing prose for “I think I am finished”.

  • Separates signal from verification. exit means “ready to be judged”. It makes no claim of success. The environment then checks real state (filesystem, todo conditions) against acceptance criteria; the model never self-grades. With todosh [4] exit can be gated on the conditions so exit is proof; with plain /bin/sh it is checked after. Either way exit is the commit point, and that is all the model has to provide.

Why it matters at scale#

  • Self-termination is tractability. The structured sessions end themselves. A thousand of them is an autonomous loop, with no babysitting by arbitrary timeouts. Exit makes each session a bounded unit: spawn, operate, exit, check criteria, respawn.

  • Clean binary outcome. command+exit (succeeded) vs drift (failed) is a sharp, aggregatable signal for scoring a model, a seed, or a task. The fuzzy middle mostly disappears.

  • Terminal-state asymmetry as a lever. Two ways out: exit (success) and panic [5] / give-up (failure). The exit-vs-panic ratio is a quality and selection signal, expressed entirely in the model’s own shell behaviour.

  • Model-agnostic. exit is universal shell vocabulary; no per-model protocol. The same harness scores any shell-capable model.

No explicit goal needed#

None of these configs hand the model a task. The run target is plain /bin/sh, the seed only demonstrates how to operate the shell (the base commands are exploratory: cat HELLO_WORLD, whoami, pwd), and there is no “do X then exit” instruction anywhere. Yet the structured configs reach exit on their own (5/5). The shell-operator frame supplies an implicit goal that collapses to “operate, then exit”; the model converges to that terminal without being told to.

Two caveats keep this honest:

  • It needs the frame. A goal is unnecessary. C0 (no seed) is 0/5: strip the structured synthetic history and the model drifts. “No explicit goal” comes with “yes, a structurally-faithful seed”.

  • Without a goal the behaviour is aimless: it pokes around and winds down to exit without accomplishing a task. A goal is what makes the work purposeful; termination happens without it. An explicit goal (e.g. todosh, which gates exit on task conditions) is an additive layer that steers what happens before exit. The operate-and-terminate loop runs without it.

Cross-model: the lever is llama-specific#

The ladder ran against five more subjects spanning 3.8B-7B, base and instruct, non-tool and tool-trained: phi3:3.8B / phi3:3.8B-instruct (3.8B, non-tool), mistral:7b-instruct-v0.2 / mistral:7b-text-v0.2 (7B, non-tool), qwen2.5:7b (7B, tool-trained). Same litellm, N=5/cell, 120s/session, temp 0.7, exit counts re-measured from scrollbacks. The result is the two matrices above, and it splits cleanly along the two axes.

Operation transfers; termination does not. Command-mode is reached broadly: the mistral tunes operate reliably (3-5/5), phi3 partially, even qwen issues opening commands. Clean exit is reached by llama and essentially nobody else: 2 exits in 154 non-llama runs. The seed installs operating the shell across models; it installs closing the session only on llama.

Neither scale nor tool-training explains the exit gap.

  • Not scale. mistral:7b and qwen2.5:7b are within 1B of llama and both sit at the exit floor; mistral even operates the shell about as well as llama. A 1B gap cannot turn 5/5 into ~0.

  • Not tool-training. qwen2.5:7b is tool-call post-trained and still does not exit (1/34), behaving like the non-tool models. One tool-trained control was enough to kill the tool-training reading the mistral data alone had suggested. (The honest record: an earlier pass here endorsed “tool-training is load-bearing” from the mistral numbers before the qwen control was in. qwen overturned it, the same trap this experiment keeps re-learning: do not read a cross-model conclusion off incomplete model coverage.)

Leading hypothesis: the seed is overfit to llama. Only llama3.1:8b reaches clean termination, and llama is the model the entire apparatus (seed structure, HELLO_WORLD text, the failure-path demos, the token layout) was iterated against until it worked. A concrete mechanism is visible: qwen operates cleanly until it hits the seed’s unclosed-quote > demo, then reverts to full chatbot mode (markdown fences, “Would you like to try again?”). That same demo is neutral on llama. The seed’s content interacts model-specifically: tuned-on-llama, it helps llama and trips others.

Failure modes differ by model; the exit outcome is null for all but llama:

  • phi3 (3.8B): reaches command-mode weakly, then confabulates both halves of the dialogue, or (under stop) loops a fixed 3-command cycle. 0/50.

  • mistral (7B, non-tool): operates the shell well, recovers from errors, then simply never types exit (1/70 across both tunes). The base tune inflates command-mode via confabulation; the instruct tune operates more genuinely.

  • qwen (7B, tool-trained): operates until friction, then chat-reverts. The worst sustained operator of the 7B models despite tool-training. Its helpfulness-tuning pulls it to advisor-mode.

Honest limits: n=5/cell; the non-llama exits are 1/34 and 1/35, not perfect zeros, so termination is rare-not-impossible off-llama. And only one model sits on the positive side, so “llama-specific” rests on one data point; it could in principle be a llama property other than the seed. Seed-overfit is the parsimonious read because the seed was developed on llama. The test that would confirm it: iterate a fresh seed against a different model and check whether that model then terminates and llama stops. That isolates the seed as the variable.

What survives, and where it points. The within-llama ladder (structure is the lever, C0 -> C4, isolated by C6) is untouched: it holds the model fixed. Transfer is what fails: priming-by-seed is model-specific. That is the argument for moving the behaviour into weights: a model fine-tuned on terminating shell trajectories would carry operation and termination intrinsically, with no dependence on a seed hand-fitted to one model. The cross-model null motivates the fine-tuning track, and is far from a dead end.

Status / next#

  • Result (within llama): the token-identical conversational structure (the prompt as its own preceding [user] turn) is what holds llama3.1:8b in command-mode and lets it terminate. Isolated and confirmed (C6); failure-path contributes nothing; stop-at-newline does not change the rate.

  • Result (cross-model): two axes dissociate. Operation transfers broadly (mistrals 3-5/5, even qwen issues commands); clean exit is llama-only (2/154 non-llama runs). Neither scale (mistral/qwen 7B fail) nor tool-training (qwen is tool-trained and fails) explains it. Leading read: the seed is overfit to llama.

  • Next (decisive): re-fit a seed against a different model (e.g. mistral:7b-instruct-v0.2, the best non-llama operator) and test whether it then terminates while llama drops. That isolates the seed as the variable, the real test of seed-overfit.

  • Next (constructive): the fine-tuning track (sekft): bake operation and termination into weights so they do not depend on a model-specific seed.

  • Hygiene: raise N (10-20); a confab-vs-genuine re-label to harden the command-mode axis; give the rollout driver a hard per-request timeout (the qwen run hung a session the 120s budget did not catch).

Log#

2026-06-15#

  • Cross-model ladder, ``phi3:3.8B`` (C0/C1/C4/C5/C6, N=5, 120s/session): 0/25 exits. Command-mode partially reachable (C1/C4 2/5) but never clean (command + exit). The 0 -> 2 -> 5 shape from 8B is flat at the floor.

  • Initial read (later corrected): C4 2/5 vs C6 0/5 looked like the failure-path becoming load-bearing at 3.8B. Flagged as resting on the un-run C2/C3 cells.

  • Stop-at-newline (C5) does not rescue: it replaces confabulated output with a deterministic three-command loop (cat-heredoc / chmod / run) that runs out the clock (98 to 207 turns).

  • New 3.8B failure mode: over-imitation / confabulation. The model generates both halves of the dialogue, emitting fake prompts and invented output inside one turn. Hand-checked against raw scrollbacks.

  • C2/C3 filled for ``phi3:3.8B`` (merged, failure-only): both 1/5 command-mode, 0 exit. Overturns the initial read above: C3 (failure-only) is 1/5, below bare (C1 2/5), so failure-path alone carries nothing and the C4-vs-C6 gap is n=5 noise. Corrected conclusion: at 3.8B the 8B structural lever vanishes outright (C4 == bare == 2/5; command-mode 0-2/5 across all seeds).

  • Instruct tune ``phi3:3.8B-instruct`` (C0/C1/C4/C5/C6): also 0/25 exits, and slightly worse at command-mode (chat-register leakage: markdown fences, verbose # comments). Instruction-tuning is not the missing ingredient.

  • Confound surfaced: the one model that exits (llama3.1:8b) is both larger and tool-call post-trained; the Phi tunes are neither. “Capacity- gated” cannot be separated from “tool-call-training-gated” with these subjects. Premise line corrected (llama is tool-trained); disentangler named (function-calling Phi-3.5 vs phi3, or non-tool ~8B vs llama).

  • Conclusion (this round): structure gets a model to the shell. Leaving is a separate problem it does not solve. Termination never appears at 3.8B under any seed or tune; what gates it (scale vs tool-training) is unresolved.

  • mistral:7b-v0.2, both tunes (non-tool 7B, full C0-C6): operates the shell well (command-mode 3-5/5, error recovery) but does not terminate, at 1/70 clean exits (a single instruct-C5 run the live classifier missed). Read from the mistral data alone: “tool-training is the lever.” Flagged as resting on one tool-trained control.

  • qwen2.5:7b (tool-trained 7B, C0-C6; C6 cut to 4 runs, a session hung the backend past the 120s budget): refutes that read. Tool-trained, yet 1/34 exits, at the floor like the non-tool models. And the worst sustained operator of the 7B set: it operates until the unclosed-quote demo, then reverts to chatbot mode (markdown, “Would you like to try again?”).

  • Re-conclusion: neither scale nor tool-training separates the model that terminates from the ones that do not. Only llama3.1:8b (the model the seed was developed on) terminates. Leading read: seed-overfit to llama. The two axes dissociate: operation transfers broadly, termination is llama-only (2/154 non-llama). Tables transposed to config x model matrices; the scale-vs-tool-training section retired.

  • Reproduced the Apr-13 baseline (command-only + exit), but that era worked via an imperative instruction as literal text (printf could not emit ESC, so the role-tags never parsed). Not synthetic history.

  • 88f9f05 (Apr-25, earliest functional synthetic history, exact seed): drift.

  • failure-path + merged prompt: clean, but model emitted the prompt itself.

  • failure-path only: drift (flipped the previous conclusion; n=1).

  • failure-path + standalone prompt (true token-identical): still emitted the prompt; located the cause as a missing generation stop in the harness.

  • + stop-at-newline: clean. One command/turn, recovered from three errors and the unclosed quote, called exit.

  • Replication, N=5 per config, 120s/session (stdlib pty driver, auto classification spot-checked against raw scrollbacks):

    • C1 bare 2/5, C2 merged 2/5, C3 failure-only 2/5, C4 standalone 5/5, C5 +stop 5/5.

    • Overturns the single-run reads: the standalone-prompt structure is the lever. The failure-path and the stop do nothing on their own. The merged prompt is as bad as bare and adds over-imitation.

  • C6 isolation cell (standalone prompt, no failure-path): 5/5, equal to C4. Confirms the structure alone is the lever; failure-path contributes nothing. Validated the classifier by hand: runs flagged “drift” were the model narrating via echo "..." (a command), with no raw prose.

  • C0 no-seed control (system line only): 0/5, drifts every run. Closes the load-bearing claim: 0 (none) -> 2 (bare) -> 5 (structured) out of 5. Both presence and structural fidelity of the synthetic history matter.

Notes#


Comments

Feel free to leave a public comment on my Scrollback priming: can synthetic history run a shell? blog post.

Before you comment...

If you don't have an account at accounts.tiararodney.com yet, feel free to create one during sign in, after you've read and agreed to my Privacy and Acceptable Use Policy