Semantic Execution Kernel#
A virtual, POSIX-adjacent micro-kernel whose most unusual user is a language model. It logs in on a chat-completion pty and drives a real shell: no tool-calling, no function schemas, no agent framework. The model operates the system in the one vocabulary it already has, text.
A test-bed, not a product. The claims it exists to sharpen:
Tool-calling is a harness artifact, not a capability. A model needs no function-call schema to operate a system; the schema is scaffolding. (shown: an 8B model, never tool-trained, drives a real shell from synthetic history alone.)
The model-to-system interface is a terminal, not an API. A clone device and a line discipline, not a tool protocol.
A goal is an attractor, not an input. Drop a model at a prompt and it converges to `exit`; the frame supplies the goal. (shown.)
A shell session is a chain of thought made of actions, grounded in real feedback rather than confabulated words, and self-documenting: the model says things by doing them (`echo "this is the end"`).
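To make the "terminal, not an API" claim concrete, here is a minimal sketch of the loop it implies: text goes in, text comes out, and the session ends when the model types `exit`. It is not the kernel's implementation. It assumes a host `sh` driven over plain pipes rather than a clone device and line discipline, and `ShellSession`, `drive`, `scripted_model`, and the `MARKER` sentinel are hypothetical names. The scripted model stands in for a chat-completion call.

```python
import subprocess
from typing import Callable, List, Tuple

Transcript = List[Tuple[str, str]]   # (line typed, output seen)
Model = Callable[[Transcript], str]  # hypothetical stand-in for a chat-completion call

MARKER = "__SKETCH_DONE__"  # hypothetical sentinel marking the end of one command's output


class ShellSession:
    """A persistent `sh` process: variables and cwd survive between lines."""

    def __init__(self) -> None:
        self.proc = subprocess.Popen(
            ["sh"],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
        )

    def run(self, line: str) -> str:
        # Send the line, then have the shell print the sentinel once it is done.
        # Simplification: assumes the command's output ends with a newline.
        self.proc.stdin.write(f"{line}\necho {MARKER}\n")
        self.proc.stdin.flush()
        output = []
        while True:
            raw = self.proc.stdout.readline()
            if raw == "" or raw.rstrip("\n") == MARKER:  # EOF or sentinel
                break
            output.append(raw)
        return "".join(output)

    def close(self) -> None:
        self.proc.stdin.close()  # EOF on stdin makes sh exit
        self.proc.wait()
        self.proc.stdout.close()


def drive(model: Model, max_turns: int = 20) -> Transcript:
    """Feed the model the transcript so far, run what it types, stop on `exit`."""
    session = ShellSession()
    transcript: Transcript = []
    try:
        for _ in range(max_turns):
            line = model(transcript).strip()
            if line == "exit":
                break
            transcript.append((line, session.run(line)))
    finally:
        session.close()
    return transcript


def scripted_model(lines: List[str]) -> Model:
    """Replays fixed lines in order, then types `exit`."""
    def next_line(transcript: Transcript) -> str:
        return lines[len(transcript)] if len(transcript) < len(lines) else "exit"
    return next_line
```

Calling `drive(scripted_model(["greeting=hi", "echo $greeting world"]))` returns the transcript of both lines. The second output depends on state set by the first, which is what separates a session from a series of one-off calls.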
The kernel is the apparatus. Roughly 1,000 tests cover the packages below.
Where this goes#
Forward-looking, not yet shown. The bets the test-bed is built to settle.
Two-tier execution. Deep planning collapses to filling a todo. Split the work: an expensive planner (occasional, cacheable, sometimes human) decomposes a task into a todo; a cheap reactive executor (a small model) operates the shell until the todo is satisfied, then exits. The todo is the typed, inspectable contract between them. Today’s agents pay frontier prices for the whole loop; here you pay them only for the planning, which is the small part.
Operating a system is one-shot reactive, not deep reasoning. The reactive loop needs no specialised model. Escalate to a bigger one only when execution hits the unexpected (a panic to re-plan). The common case is cheap and fast.
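The split between planner and executor, and the escalation on a panic, can be sketched as follows. This is an illustration under assumptions, not the project's design: `TodoItem`, `Panic`, `execute`, and the convention that a todo item carries a shell `check` whose exit status 0 means "satisfied" are all hypothetical. For brevity each command runs in a fresh `sh -c` rather than a persistent session.

```python
import subprocess
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class TodoItem:
    description: str    # human-readable goal, written by the planner
    check: str          # hypothetical convention: shell command, exit 0 means satisfied
    done: bool = False


class Panic(Exception):
    """The executor hit something it cannot resolve: the cue to re-plan."""


# The cheap tier: (current item, last output) -> next shell line.
Executor = Callable[[TodoItem, str], str]


def sh(command: str, cwd: str) -> subprocess.CompletedProcess:
    """Run one shell line in `cwd` and capture its output."""
    return subprocess.run(
        ["sh", "-c", command], cwd=cwd, capture_output=True, text=True
    )


def execute(todo: List[TodoItem], executor: Executor, cwd: str,
            max_steps: int = 3) -> None:
    """Operate the shell until every item's check passes, or panic."""
    for item in todo:
        last_output = ""
        steps = 0
        while sh(item.check, cwd).returncode != 0:
            if steps == max_steps:
                # Out of cheap attempts: hand control back to the planner.
                raise Panic(item.description)
            result = sh(executor(item, last_output), cwd)
            last_output = result.stdout + result.stderr
            steps += 1
        item.done = True
```

A caller would wrap `execute` in a `try`/`except Panic` and only then invoke the expensive planner again. The todo list stays inspectable throughout, because each item records whether its check has passed.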
Benchmarks measure the harness, not the model. An agent score is a model-and-harness result reported as if it were a model result; swap the interface and the number moves more than swapping the weights, and almost nobody controls for it. A clone device plus a line discipline is a different harness class: the boundary is a terminal, not an API. A real benchmark would hold the harness fixed and vary the model (isolate capability), or hold the model fixed and vary the harness (measure the sensitivity everyone hides). The harness, not the model, may be the defensible asset.
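The two controlled comparisons described above can be written down as a small calculation over a grid of scores. This is a sketch only: the `Scores` layout, the function names, and the choice of max-minus-min range as the spread statistic are assumptions, and no measured numbers are implied.

```python
from statistics import mean
from typing import Dict, Iterable, Tuple

# Hypothetical layout: (model name, harness name) -> benchmark score.
Scores = Dict[Tuple[str, str], float]


def spread(values: Iterable[float]) -> float:
    """Range of a set of scores (max minus min)."""
    values = list(values)
    return max(values) - min(values)


def harness_sensitivity(scores: Scores) -> float:
    """Hold each model fixed, vary the harness; average the score range."""
    models = sorted({model for model, _ in scores})
    return mean(
        spread(s for (m, _), s in scores.items() if m == model)
        for model in models
    )


def model_sensitivity(scores: Scores) -> float:
    """Hold each harness fixed, vary the model; average the score range."""
    harnesses = sorted({harness for _, harness in scores})
    return mean(
        spread(s for (_, h), s in scores.items() if h == harness)
        for harness in harnesses
    )
```

If `harness_sensitivity` comes out larger than `model_sensitivity` on a full grid, swapping the interface moved the number more than swapping the weights. That is the effect the bet above predicts, and a single-harness leaderboard cannot show it.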
Modules#
| Package | What it is | Source |
|---|---|---|
| | The micro-kernel: VFS, device table, processes and syscalls, scheduler, kernel log. The | |
| | Userland: the coreutils, | |
| | Extensions: the | |
| | Default distribution: installer, manifest, bootable rootfs. | |
| | The specification the implementation follows (spec-first). | |
Tracking#
Repository: bitbucket.org/byteb4rb1e/cel, submodules per package above.
Issues: a MIME-TODO tracker in the root repo, synced to Bugzilla at bugs.code.tiararodney.com.
Workflow: spec-first, issue-driven gitflow; specifications published at specs.code.tiararodney.com.
Experiments#
From seed to weights: fine-tuning a shell operator (Monday, June 15, 2026)
Scrollback priming: can synthetic history run a shell? (Sunday, June 14, 2026)
Journal#
Working notes for the Semantic Execution Kernel: dead ends, regressions, and what each one taught me. The polished, replicated findings graduate to the experiments. Newest first.
Regressions… (Sunday, June 14, 2026)
New Impulses: The proof is in the pudding (Thursday, June 11, 2026)
Notes#