Over the last year, evaluations like METR's time-horizon measurements, SWE-Bench Pro, Terminal-Bench, and newer long-horizon agent benchmarks have quietly shifted the conversation around AI systems.
The interesting part is that the bottleneck is increasingly not the model itself.
METR’s latest work focuses on “task-completion time horizons” — effectively measuring how long an agent can sustain coherent autonomous execution before failing.
At the same time, SWE-Bench Pro explicitly moved toward “long-horizon tasks” involving multi-file coordination, state management, and execution consistency across extended trajectories.
And many independent analyses are converging on the same conclusion:
"The harness determines how close you get to [the model ceiling]."

or:

"The next frontier is not single-model capability — it is orchestration."
This is exactly the direction we’ve been building toward with nano-vm.
nano-vm v0.7.0 and nano-vm-mcp v0.3.0 are evolving into a deterministic execution substrate where:
- FSM transitions are the source of truth
- execution is replayable
- state is externalized from the model
- projections isolate LLM/TRACE/TOOL views
- capability references replace raw plaintext state
- hydration/dehydration enables resumable execution
- governance and provenance are runtime primitives
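The first four properties above can be sketched in a few dozen lines. This is an illustrative toy, not the real nano-vm API — `NanoVMSketch`, `Transition`, and the transition table are all made-up names, assuming a simple event-driven FSM:

```python
# Hypothetical sketch of a deterministic, replayable FSM substrate.
# All names (NanoVMSketch, Transition, the transition table) are
# illustrative, not the actual nano-vm interface.
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Transition:
    state_from: str
    state_to: str
    event: str

@dataclass
class NanoVMSketch:
    state: str = "idle"
    log: list = field(default_factory=list)  # append-only transition log

    # FSM transitions are the source of truth: only events listed here
    # can move the machine.
    TABLE = {
        ("idle", "start"): "running",
        ("running", "tool_call"): "awaiting_tool",
        ("awaiting_tool", "tool_result"): "running",
        ("running", "finish"): "done",
    }

    def apply(self, event: str) -> str:
        nxt = self.TABLE.get((self.state, event))
        if nxt is None:
            raise ValueError(f"illegal transition: {self.state} --{event}-->")
        self.log.append(Transition(self.state, nxt, event))
        self.state = nxt
        return nxt

    def dehydrate(self) -> dict:
        """Externalized state: everything needed to resume lives here,
        not in the model's context window."""
        return {"state": self.state, "log": list(self.log)}

    @classmethod
    def hydrate(cls, snapshot: dict) -> "NanoVMSketch":
        vm = cls(state=snapshot["state"])
        vm.log = list(snapshot["log"])
        return vm

    @classmethod
    def replay(cls, log: list) -> "NanoVMSketch":
        """Replayable execution: re-applying the log reproduces the state."""
        vm = cls()
        for t in log:
            vm.apply(t.event)
        return vm

vm = NanoVMSketch()
for e in ("start", "tool_call", "tool_result", "finish"):
    vm.apply(e)
```

Because the log, not the model, carries the state, `replay(vm.log)` lands in the same final state every time — which is what makes long trajectories debuggable.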
Importantly, we no longer see this as “just an LLM runtime”.
The same execution model is now being integrated into real production business workflows:
- payments
- PDF/report pipelines
- Telegram Mini Apps
- multilingual UI/state synchronization
- governed tool execution
- concurrent stateful processes
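Two items from the lists above — projections isolating LLM/TRACE/TOOL views, and capability references replacing raw plaintext state — can be sketched together. Again a toy under stated assumptions: `cap_ref`, `project`, and the field names are invented for illustration, and the real mechanism is presumably richer:

```python
# Hypothetical sketch: the same runtime state is projected differently
# for the model (LLM), the audit trail (TRACE), and a tool invocation
# (TOOL). All names here are illustrative, not the nano-vm API.
import hashlib

def cap_ref(value: str) -> str:
    """Replace a raw plaintext value with an opaque capability reference."""
    return "cap:" + hashlib.sha256(value.encode()).hexdigest()[:12]

STATE = {
    "order_id": "A-1042",
    "card_token": "tok_4242424242424242",  # sensitive: never shown raw to the LLM
}

def project(state: dict, view: str) -> dict:
    if view == "LLM":
        # The model sees an opaque reference, not the secret itself.
        return {"order_id": state["order_id"],
                "card_token": cap_ref(state["card_token"])}
    if view == "TOOL":
        # A governed tool call may resolve the reference at execution time.
        return dict(state)
    if view == "TRACE":
        # Provenance view: record what was exposed, not the values.
        return {"view": "TRACE", "exposed_keys": sorted(state)}
    raise ValueError(view)

llm_view = project(STATE, "LLM")  # card_token arrives as "cap:…", not plaintext
```

The design choice this illustrates: secrets flow through the runtime as references, so a prompt injection that exfiltrates the LLM view leaks nothing usable.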
The architecture direction is becoming increasingly clear:
\[
\text{Agent Capability} \neq \text{Model Capability}
\]
More realistically:
\[
\text{Capability} = f(\text{Model}, \text{Runtime}, \text{State}, \text{Policies}, \text{Tools}, \text{Memory})
\]
or even simpler:
\[
\text{Capability} \approx \text{LLM} + \text{Runtime} + \text{Policies} + \text{State}
\]
The industry seems to be rediscovering something systems engineers already know:
state management, orchestration, replayability, and execution semantics matter more and more as task horizons grow.
LLMs are improving fast.
But runtime architecture is becoming the real differentiator.