Auditable memory for a frozen transformer — a latent crossbar between models, and a resident daemon that rewinds at the metal.

A 12B's KV cache made O(1) in context: VRAM flat within ~50 MiB from 8k to 16k, the needle surviving at every tested depth (10/50/90%). A 12B steered by direct KV-cache transplant, no tokens: 15/15 incorporation and 15/15 selectivity, receipted. A resident 12B daemon that holds disciplined silence and shears its cache by an O(1) byte-exact rewind (48 layers, diffs = 0).

This is a memory-and-agency architecture that attaches to a frozen pretrained transformer and preserves its outputs: bit-exact when disabled. It is a proof-of-mechanism on one dev host: the 0.6B (Qwen3-0.6B) carries the memory ladder, and the 12B (Gemma-3-12B, QAT 4-bit) carries the XBAR and KAIROS results. Every number reproduces from a command, and the misses stay published (the composed 32k retrieval MISSed; ledger 01-R9).

Run the composed pipeline

git clone https://github.com/nihilistau/Position_Is_Arithmetic.git && cd Position_Is_Arithmetic/papers/01-two-ring-memory/repro
./run_r9_32k_needle.ps1 -Model Qwen3-0.6B-f16.gguf -Drive F:\ -Corpus wiki.test.raw

You'll watch the cold KV path serve a 32k context off a drive, in ~1.8 GB of RAM. Honest scope: at the full composed 32k budget (64× selection) the retrieval itself MISSed (ledger 01-R9, kept on purpose) — the 512-position retrieval is the proven claim, and the off-drive machinery (poison-gated, zero silent fallbacks over 16.3 h) is real either way. Correctness reproduces on any NVMe; the latency figure needs Optane.
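
One way to read "poison-gated, zero silent fallbacks" is sketched here in Python. This is an assumption about the mechanism, not the repository's code: every cold slot starts out filled with a known poison pattern, and a read that returns that pattern raises instead of quietly serving something else. `ColdStore`, `POISON`, and the slot layout are hypothetical names chosen for illustration.

```python
import mmap

# Hypothetical poison pattern; the real gate may use a different marker.
POISON = b"\xde\xad\xbe\xef"

class ColdStore:
    """File-backed slot store where unwritten slots are detectable on read."""

    def __init__(self, path, slots, slot_bytes):
        if slots <= 0 or slot_bytes % len(POISON) != 0:
            raise ValueError("need slots > 0 and slot_bytes a multiple of 4")
        self.slots = slots
        self.slot_bytes = slot_bytes
        self.poison_slot = POISON * (slot_bytes // len(POISON))
        # Pre-fill the whole file with poison so stale reads cannot look valid.
        with open(path, "wb") as f:
            for _ in range(slots):
                f.write(self.poison_slot)
        self.f = open(path, "r+b")
        self.mm = mmap.mmap(self.f.fileno(), 0)

    def _offset(self, slot):
        if not 0 <= slot < self.slots:
            raise IndexError(slot)
        return slot * self.slot_bytes

    def write(self, slot, payload):
        if len(payload) != self.slot_bytes:
            raise ValueError("payload must fill the slot exactly")
        off = self._offset(slot)
        self.mm[off:off + self.slot_bytes] = payload

    def read(self, slot):
        off = self._offset(slot)
        data = bytes(self.mm[off:off + self.slot_bytes])
        # The gate: fail loudly rather than fall back to any other source.
        # (A real payload equal to the poison pattern would also trip this.)
        if data == self.poison_slot:
            raise LookupError(f"slot {slot} read back as poison")
        return data

    def close(self):
        self.mm.close()
        self.f.close()
```

The design choice being illustrated is that a miss is an error, never a default: a caller that wants a fallback has to catch `LookupError` explicitly, so every fallback leaves a trace.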

Three walls, three mechanisms

Memory wall

910×

reduction in resident KV cache (8.3 MB vs 7.5 GB at 32k) via a two-ring offload to byte-addressable storage.
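
The arithmetic behind a ratio like this can be checked in a few lines. The sketch below is a back-of-envelope calculation, not a measurement: the model shape (28 layers, 8 KV heads, head dimension 128), the 4-byte element width, and the 36-token resident window are all assumptions picked because they land near the quoted 7.5 GB and 8.3 MB. None of them is stated in the source.

```python
def kv_bytes(tokens, layers, kv_heads, head_dim, bytes_per_elem):
    """Bytes needed to hold K and V for `tokens` positions."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens

# Assumed shape and precision (hypothetical, see the note above).
SHAPE = dict(layers=28, kv_heads=8, head_dim=128, bytes_per_elem=4)

full = kv_bytes(32 * 1024, **SHAPE)   # entire 32k context kept resident
resident = kv_bytes(36, **SHAPE)      # assumed small resident window only

print(f"full     = {full / 1e9:.2f} GB")
print(f"resident = {resident / 1e6:.2f} MB")
print(f"ratio    = {full / resident:.0f}x")
```

Under these assumptions the per-token cost cancels and the ratio is simply 32768 / 36, about 910. If the real shape or window differs, only the inputs change; the formula is the standard K-plus-V cache size.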

Intelligence wall

+0.69%

perplexity change at 8× sparsification with four pinned attention-sink tokens; at 2× and 4× the change is negative (perplexity improves).
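
A minimal sketch of what "8× sparsification with four pinned sinks" can look like as a selection rule. It assumes the sinks are the first four positions of the sequence and that each remaining position has some relevance score; `keep_set` and its parameters are hypothetical names, and the scoring itself is out of scope here.

```python
def keep_set(scores, ratio=8, n_sinks=4):
    """Indices kept under `ratio`-fold sparsification with pinned sinks.

    scores[i] is a relevance score for position i (higher is kept first).
    The first `n_sinks` positions are always kept, whatever their score.
    """
    n = len(scores)
    budget = max(n_sinks, n // ratio)          # total positions to keep
    sinks = list(range(min(n_sinks, n)))       # assumed: sinks are the prefix
    rest = sorted(range(len(sinks), n), key=lambda i: scores[i], reverse=True)
    return sorted(sinks + rest[: budget - len(sinks)])
```

For a 64-position context this keeps 8 positions: the four sinks plus the four best-scoring others. The sort is used for brevity; a linear-time selection would serve the same purpose.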

Compute wall

O(N)

a ±1 projection router plus quickselect: directional recall at 32 bytes/token, with selection in linear time.
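
The two ingredients can be sketched together. This is an illustration under stated assumptions, not the project's router: it assumes the projection matrix has ±1 entries, that each token keeps only the 256 sign bits of its projection (256 bits = 32 bytes), and that candidates are ranked by Hamming distance. `make_planes`, `signature`, `hamming`, and `quickselect_smallest` are hypothetical names.

```python
import random

BITS = 256  # 256 sign bits = 32 bytes per token

def make_planes(dim, bits=BITS, seed=0):
    """A fixed random matrix with entries in {-1, +1}."""
    rng = random.Random(seed)
    return [[rng.choice((-1.0, 1.0)) for _ in range(dim)] for _ in range(bits)]

def signature(vec, planes):
    """Pack the signs of the projections into one integer."""
    sig = 0
    for row in planes:
        dot = sum(r * v for r, v in zip(row, vec))
        sig = (sig << 1) | int(dot >= 0.0)
    return sig

def hamming(a, b):
    """Number of differing bits; small distance means similar direction."""
    return bin(a ^ b).count("1")

def quickselect_smallest(items, k, key, rng):
    """The k items with the smallest key, unordered, in expected O(N) time."""
    items = list(items)
    if k <= 0:
        return []
    out = []
    while items:
        pivot = key(rng.choice(items))
        lo = [x for x in items if key(x) < pivot]
        eq = [x for x in items if key(x) == pivot]
        hi = [x for x in items if key(x) > pivot]
        if k < len(lo):
            items = lo                      # answer lies entirely in lo
        elif k <= len(lo) + len(eq):
            return out + lo + eq[: k - len(lo)]
        else:
            out += lo + eq                  # take all of these, keep looking
            k -= len(lo) + len(eq)
            items = hi
    return out
```

Selection never sorts the full candidate list: each pass discards one side of the pivot, which is what keeps the expected cost linear in the number of tokens. Only direction is preserved by the sign bits, so magnitude information is lost by construction.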

The receipts

Claim	Number	Caveat
Quality at 8× sparsification	+0.69% PPL (2× −0.71%, 4× −0.92%)	0.6B, 2k, one corpus
Needle retrieval, no recency bias	HIT at depth 10% / 50% / 90%	one model, one needle type
KV served off physical Optane	HIT off NVMe, poison-gated	512 proven; composed 32k MISSed at the 64× budget (01-R9)
Random-read latency	7.57 µs / read	Optane-specific
KV-RAM footprint	910× cache · 1.8 GB live	net ~8×, router-index-dominated
Bit-exact when disabled	argmax-identical to the stock model	the invariant under everything
12B GPU decode + quality, RTX 2060 12GB	26.1 tok/s at wikitext PPL 5.12	gated + citable (06-R10); llama.cpp 31.29 tok/s at PPL 192–506 — broken artifacts (06-R8)
gemma-4 ecosystem finding	true PPL 4.68 vs GGUFs 192–506	engine-independent (06-R8); fix tutorial in-repo
Latent crossbar probe (XBAR P1)	15/15 incorporation, 15/15 selectivity, 3.69 orders max rank pull	12B steered by direct KV transplant, no tokens; gold-instrument coherence (X-R1)
O(1) KV decoupled from context (12B)	learned 512×32 LSH router +0.47% PPL @8×; VRAM 8k↔16k flat ~50 MiB; needle survives 10/50/90%	oracle −0.08%, frozen +4.17%; frozen-router control MISSES; KV term is O(1) (X-R2)
Resident 12B daemon: silence + O(1) rewind	24-tick crucible perfect (0 false / 0 missed / 0 drift); rewind byte-identical 48 layers; metal 0.0073 vs grow 0.924 s/action	scripted tape; 0.6B control collapses; ≥24 h soak IN-FLIGHT, no verdict yet (KAIROS-01/02)
The Memo curator drives the crossbar autonomously	inert when off (PPL 4.6665 bit-identical); 256-bit LSH / integer-Hamming address; promote matched +0.000% / discard corrupted +40106%	float→discrete course-correction (sign-binarize collapses at r=32, ship r=256); 2-episode registry; Ring-2 verbatim recall (X-C2)
O(1) bit-exact rewind of latent memory	replay load-bearing (zeroed reads back all-zero); rewind resets prefix byte-identical (layer-diffs=0), 12B + E2B + SWA-ring wrap	the §4-trap guarantee made mechanical; O(1) in byte-count (latency slope = KAIROS-02/03) (X-222)
Parameter-free Ring-3 consolidation (VSA/HRR, zero training)	recall@1=1.0 to N=32; loss a step function (hit +0.000% / miss +8.04% caught by 2% gate); idle GC 349.8 MB → 16.3 KB index	retrieve-and-verify (P2.b top-5 honored); VSA retrieve host-numpy, Z_q/NTT port a follow-on; Path B budget-gated (X-R3VSA)
The organism breathes: real audio → episodic memory	audio-conditioned KV → canonical Ring-2 episode; signature separates (self 211/256, margin +79); round-trip clean	step 1; the +1989% deflection is foreign-by-design (cross-context reject signal), not an audio-recall quality claim (X-ORG)
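
For readers new to the VSA/HRR row, here is textbook holographic reduced representation in plain Python: bind key to value by circular convolution, superpose all pairs into one fixed-size trace, unbind by circular correlation, then clean up against the known values. It is a generic sketch of the technique, not the repository's Ring-3 code; the dimension, pair count, and function names are assumptions.

```python
import math
import random

def rand_vec(n, rng):
    """Random vector with i.i.d. N(0, 1/n) entries, so its norm is near 1."""
    s = 1.0 / math.sqrt(n)
    return [rng.gauss(0.0, s) for _ in range(n)]

def bind(a, b):
    """Circular convolution: the HRR binding operator."""
    n = len(a)
    return [sum(a[j] * b[(i - j) % n] for j in range(n)) for i in range(n)]

def unbind(trace, a):
    """Circular correlation with `a`: an approximate inverse of bind(a, .)."""
    n = len(a)
    return [sum(a[j] * trace[(i + j) % n] for j in range(n)) for i in range(n)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def recall_at_1(n=512, pairs=4, seed=0):
    """Fraction of keys whose value is recovered as the top-1 match."""
    rng = random.Random(seed)
    keys = [rand_vec(n, rng) for _ in range(pairs)]
    vals = [rand_vec(n, rng) for _ in range(pairs)]
    # One fixed-size trace holds every pair by superposition; nothing is trained.
    trace = [0.0] * n
    for k, v in zip(keys, vals):
        trace = [t + x for t, x in zip(trace, bind(k, v))]
    hits = 0
    for i, k in enumerate(keys):
        noisy = unbind(trace, k)
        # Cleanup: choose the stored value most similar to the noisy estimate.
        best = max(range(pairs), key=lambda j: dot(noisy, vals[j]))
        hits += best == i
    return hits / pairs
```

The unbound vector is always noisy, and the noise grows with the number of superposed pairs, which is why retrieval is paired with a cleanup or verification step rather than trusted directly. The quadratic-time convolution here is for readability; an FFT does the same job faster.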

The system: four tiers and a crossbar

The two-ring design grew into a four-tier hierarchy with an inter-model lane on top — each component carries its status:

Component	What it is	Status
Ring 1	working KV window — the live turn	PROVEN (the stock model path)
Ring 2	verbatim episodic store on byte-addressable storage — the "hippocampus"	PROVEN (on Optane, poison-gated; bounded — the 64×-budget 32k MISS is why)
Ring 2′ (shadow)	transient staging: the curator proposes, a coherence gate promotes or rewinds — the audit mechanism	WIRED (C1-lite curator, exercised on real recall)
Ring 3	adapter-compressed consolidated store — the "neocortex"	DESIGN (under the irreversible-aware G-R3-LOSS gate)
XBAR	the Auditable Latent Crossbar: Exec + Memo share memory through latent state, no tokens — every write receipted, gated, rewindable; the cache itself O(1) in context	PROVEN (probe P1 X-R1; O(1) KV + learned router X-R2)
KAIROS	the time / agency axis: a resident 12B daemon — disciplined silence (NO_OP) + coherent action, with an O(1) byte-exact cache rewind to cold-evict idle thoughts	PROVEN (crucible KAIROS-01; metal rewind KAIROS-02; ≥24 h soak IN-FLIGHT)
NIGHTSHIFT	offline idle-time consolidation: Ring 2 → adapter → Ring 2′ → gate → Ring 3 — a synthetic subconscious whose dreams are auditable	DESIGN
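
The rewind idea in the KAIROS and Ring 2′ rows can be shown on a toy cache. This sketch assumes a preallocated buffer per layer and a single length cursor, so a rewind is one integer assignment however much was appended since the checkpoint; `RewindableCache` and `layer_diffs` are hypothetical names, and the real cache layout is not described in the source.

```python
import hashlib

class RewindableCache:
    """Preallocated per-layer byte buffers sharing one length cursor."""

    def __init__(self, layers, capacity_tokens, bytes_per_token):
        self.bpt = bytes_per_token
        self.capacity = capacity_tokens
        self.bufs = [bytearray(capacity_tokens * bytes_per_token)
                     for _ in range(layers)]
        self.length = 0  # number of live tokens

    def append(self, rows):
        """Write one token: `rows` holds one bytes object per layer."""
        if self.length >= self.capacity:
            raise OverflowError("cache full")
        if len(rows) != len(self.bufs) or any(len(r) != self.bpt for r in rows):
            raise ValueError("need one row of bytes_per_token bytes per layer")
        off = self.length * self.bpt
        for buf, row in zip(self.bufs, rows):
            buf[off:off + self.bpt] = row
        self.length += 1

    def rewind(self, mark):
        """O(1): move the cursor back; bytes past it are dead, not erased."""
        if not 0 <= mark <= self.length:
            raise ValueError("mark must lie within the live prefix")
        self.length = mark

    def layer_digests(self):
        """One SHA-256 per layer over the live prefix only."""
        live = self.length * self.bpt
        return [hashlib.sha256(buf[:live]).hexdigest() for buf in self.bufs]

def layer_diffs(before, after):
    """Count layers whose live prefix changed between two digests."""
    return sum(a != b for a, b in zip(before, after))
```

A typical use is: record `cache.length` and `cache.layer_digests()` before a speculative step, append freely, call `rewind`, and require `layer_diffs` to be 0. Appends only ever write at or past the cursor, so the prefix is untouched by construction, and the digest check turns that into a receipt.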

Full map with the architecture diagram: README — The system.

Honest scope

Proof-of-mechanism on one dev host (RTX 2060, 12 GB): not a scaling study, not multi-model, not independently reproduced yet. The 0.6B (Qwen3-0.6B) carries the two-ring memory ladder; the 12B (Gemma-3-12B, QAT 4-bit) carries the XBAR and KAIROS results. CPU decode is ~1.34× slower than a tuned llama.cpp at the same quantization; the value here is the memory-and-agency envelope, not raw throughput. The ≥24 h KAIROS endurance soak is in-flight, not passed, and we do not call a verdict from a mid-run log. We keep the unflattering numbers in the papers on purpose: a result with its caveats attached is one you can trust.

The systems paper →

Every mechanism as idea → receipt → payoff → implementation. The receipts, in full.

The algebraic companion →

The elliptic-curve framework that motivated the design. Not required to run or validate anything. A door for the curious.

Get involved