Long-context KV memory you can actually run.

A needle retrieved from 32,000 tokens of context, with the cold KV cache served off an NVMe drive, in ~1.8 GB of RAM. 8× KV sparsification at +0.69% perplexity.

A memory architecture that attaches to a frozen pretrained transformer and preserves its outputs. Proof-of-mechanism on one small model — every number reproduces from a command.

Run the headline

git clone https://github.com/nihilistau/Position_Is_Arithmetic.git && cd Position_Is_Arithmetic/papers/01-two-ring-memory/repro
./run_r9_32k_needle.ps1 -Model Qwen3-0.6B-f16.gguf -Drive F:\ -Corpus wiki.test.raw

You'll watch an out-of-distribution secret surface from a context that never fully lived in RAM. Correctness reproduces on any NVMe; the latency figure needs Optane.

Three walls, three mechanisms

Memory wall

910×

resident KV cache shrunk (8.3 MB vs 7.5 GB at 32k) via a two-ring offload to byte-addressable storage.

Intelligence wall

+0.69%

perplexity at 8× sparsification — four pinned attention-sink tokens; 2× and 4× go negative.

Compute wall

O(N)

a ±1 projection router + quickselect: directional recall at 32 bytes/token, linear selection.

The receipts

ClaimNumberCaveat
Quality at 8× sparsification+0.69% PPL (2× −0.71, 4× −0.92)0.6B, 2k, one corpus
Needle retrieval, no recency biasHIT at depth 10 / 50 / 90one model, one needle type
KV served off physical OptaneHIT off NVMe, poison-gated512 proven; 32k pending
Random-read latency7.57 µs / readOptane-specific
KV-RAM footprint910× cache · 1.8 GB livenet ~8×, router-index-dominated
Bit-exact when disabledargmax-identical to the stock modelthe invariant under everything

Honest scope

Proof-of-mechanism on one 0.6B model, one host — not a scaling study, not independently reproduced yet. The CPU decode is ~1.34× behind a tuned llama.cpp at the same quantization: the value here is the memory envelope, not throughput. We keep the unflattering numbers in the papers on purpose — a result with its caveats attached is one you can trust.

Read more

The systems paper →

Every mechanism as idea → receipt → payoff → implementation. The receipts, in full.

The algebraic companion →

The elliptic-curve framework that motivated the design. Not required to run or validate anything. A door for the curious.

Get involved

The biggest open question is scale: does 8×-at-<2% hold past 0.6B? Other one-person-can't-do-it asks: independent repros on your own drive (the Linux path especially), and compressing the ±1 router index (a sign-packed popcount form should cut its ~950 MB by ~32×). See CONTRIBUTING.