Recent agent benchmarks seem to be converging on an interesting systems problem.
METR’s “time horizon” framing effectively measures how long an agent can maintain coherent execution before state drift, tool misuse, context corruption, or orchestration failures accumulate.
SWE-Bench Pro and newer long-horizon evaluations also appear to weight orchestration quality increasingly heavily:
- state management,
- execution consistency,
- tool coordination,
- recovery/retry behavior,
- context handling across long trajectories.
One line from a SWE-Bench analysis summarized it well:
> “The harness determines how close you get to the ceiling.”
This is interesting because it shifts the bottleneck away from the model itself and toward runtime semantics.
Over the past year we’ve been building nano-vm around a fairly simple assumption:
\[
\text{Agent Capability} \neq \text{Model Capability}
\]
More like:
\[
\text{Capability} = f(\text{Model},\ \text{Runtime},\ \text{State},\ \text{Policies},\ \text{Tools},\ \text{Memory})
\]
The original goal was an LLM runtime, but under production constraints it gradually evolved into something closer to a deterministic execution substrate.
The current 0.7.0 / 0.3.0 architecture is centered around:
- FSM-based execution,
- replayable transitions,
- explicit hydration/dehydration,
- externalized state,
- capability references instead of raw plaintext state,
- projection layers for LLM/TRACE/TOOL isolation,
- deterministic AST execution instead of eval() (sketched after this list),
- provenance-oriented execution envelopes.
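The “deterministic AST execution instead of eval()” point is concrete enough to sketch. This is a minimal illustration of the general pattern, not nano-vm’s actual API (the function name and the operator whitelist here are mine): parse the expression into an AST, then walk it against an explicit whitelist, so nothing outside the declared grammar can execute.

```python
# Minimal sketch: whitelist-walked AST evaluation instead of eval().
# The node set and operator table are illustrative, not nano-vm's API.
import ast
import operator

# Only these binary operators are permitted; everything else is rejected.
_ALLOWED_BINOPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}

def safe_eval(expr: str, variables: dict[str, float]) -> float:
    """Parse `expr` and walk the AST with an explicit whitelist, so
    execution is deterministic and cannot reach arbitrary builtins."""
    tree = ast.parse(expr, mode="eval")

    def walk(node: ast.AST) -> float:
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.Name):
            return variables[node.id]  # unknown names raise KeyError
        if isinstance(node, ast.BinOp) and type(node.op) in _ALLOWED_BINOPS:
            return _ALLOWED_BINOPS[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError(f"disallowed node: {type(node).__name__}")

    return walk(tree)

# eval("__import__('os')...") would execute; here it fails the whitelist walk.
assert safe_eval("price * qty + 2", {"price": 9.5, "qty": 3}) == 30.5
```

The design choice is that the failure mode becomes explicit rejection at the AST walk, rather than arbitrary code execution at runtime.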
Interestingly, the same runtime model is now being integrated into fairly ordinary business workflows:
- Telegram Mini Apps,
- PDF/report pipelines,
- payment-oriented flows,
- multilingual UI synchronization,
- governed tool execution.
At that point the distinction between:
- “agent runtime”,
- “workflow engine”,
- “business process orchestration”
starts to blur.
The architectural direction increasingly looks less like “better prompting” and more like:
- event sourcing,
- capability systems,
- replayable state machines (see the sketch after this list),
- deterministic orchestration,
- execution provenance.
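To make “event sourcing” and “replayable state machines” concrete, here is a minimal sketch under assumed names (the `State` shape, `apply_event`, and the event types are all hypothetical): state is never mutated in place, it is a pure fold over an append-only event log, which is what makes replay, audit, and provenance cheap.

```python
# Minimal sketch of an event-sourced, replayable state machine.
# Transition names and the State shape are hypothetical illustrations.
from dataclasses import dataclass, field

@dataclass
class State:
    step: str = "start"
    outputs: list[str] = field(default_factory=list)

# Pure transition function: (state, event) -> new state. No I/O, no clocks,
# no randomness, so the same log always folds to the same state.
def apply_event(state: State, event: dict) -> State:
    if event["type"] == "tool_result":
        return State(step="review", outputs=state.outputs + [event["payload"]])
    if event["type"] == "approve":
        return State(step="done", outputs=state.outputs)
    raise ValueError(f"unknown event: {event['type']}")

def replay(log: list[dict]) -> State:
    """Rebuild current state purely from the event log; the log is the
    single source of truth, so any run can be replayed and audited."""
    state = State()
    for event in log:
        state = apply_event(state, event)
    return state

log = [
    {"type": "tool_result", "payload": "report.pdf"},
    {"type": "approve"},
]
assert replay(log) == State(step="done", outputs=["report.pdf"])
```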
It feels like long-horizon agents are forcing AI systems back toward classic systems engineering problems:
state, transitions, replayability, isolation, and governance.
LLMs are improving rapidly.
But the systems around them are starting to matter more than most people expected.