Recent agent benchmarks are quietly converging on a classic systems problem.
METR’s “time horizon” metric, SWE-Bench Pro, and other long-running evaluations increasingly measure not just raw model intelligence, but how long an agent can stay coherent before state drift, tool misuse, context corruption, or orchestration failures take over. The harness, as one SWE-Bench analysis put it, determines how close you get to the ceiling.
This shifts the real bottleneck away from “bigger models” and toward runtime semantics: state management, execution consistency, tool coordination, recovery logic, and context handling across long trajectories.
For the past year we’ve been building nano-vm around a simple realization:
Agent capability ≠ model capability.
More accurately:
Capability = f(Model, Runtime, State, Policies, Tools, Memory)
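Read as types rather than math, the claim is that capability is a property of the whole tuple. Here is a loose TypeScript rendering; every name below is illustrative, not nano-vm’s actual interface:

```typescript
// Illustrative only; these names are assumptions, not nano-vm's API.
interface AgentSystem {
  model: unknown;    // the raw LLM
  runtime: unknown;  // execution semantics, scheduling, recovery
  state: unknown;    // explicit, externalized state
  policies: unknown; // governance over what the agent may do
  tools: unknown;    // governed side effects
  memory: unknown;   // long-horizon context handling
}

// Capability is a function of the whole system, not of the model alone;
// improving the model moves one field while the other five set the ceiling.
type Capability = (system: AgentSystem) => number;
```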
What started as an LLM runtime evolved, under real production pressure, into something closer to a deterministic execution substrate. The current architecture (0.7.0 / 0.3.0) centers on:
FSM-based execution with replayable transitions (a minimal sketch follows this list)
Explicit hydration/dehydration of state
Externalized state and capability references (instead of raw text blobs)
Projection layers isolating LLM, trace, and tool concerns
Deterministic AST execution instead of eval() (a toy evaluator appears further below)
Provenance-oriented execution envelopes
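To make the first three items concrete, here is a minimal sketch of an FSM whose transitions are logged as replayable events, with state dehydrated to and hydrated from an external store by reference. Everything here is hypothetical, a shape rather than nano-vm’s implementation:

```typescript
// Illustrative sketch; every name here is an assumption, not nano-vm's API.
type StateRef = string; // a pointer into an external store, not a raw text blob

interface Transition {
  from: string;
  event: string;
  to: string;
  stateRef: StateRef; // reference to the agent state as of this transition
}

interface ExternalStore {
  dehydrate(state: unknown): StateRef; // persist state, hand back a reference
  hydrate(ref: StateRef): unknown;     // resolve a reference back to state
}

class ReplayableFSM {
  private log: Transition[] = [];

  constructor(
    private current: string,
    private store: ExternalStore,
    // transition table: from-state -> event -> to-state
    private table: Record<string, Record<string, string>>,
  ) {}

  step(event: string, state: unknown): void {
    const to = this.table[this.current]?.[event];
    if (to === undefined) {
      throw new Error(`no transition from ${this.current} on ${event}`);
    }
    // Dehydrate before transitioning, so every logged entry carries a
    // durable state reference rather than inline text.
    const stateRef = this.store.dehydrate(state);
    this.log.push({ from: this.current, event, to, stateRef });
    this.current = to;
  }

  // Deterministic recovery: rebuild position and state purely from the log.
  static replay(
    log: Transition[],
    store: ExternalStore,
  ): { state: unknown; current?: string } {
    const last = log[log.length - 1];
    return last
      ? { state: store.hydrate(last.stateRef), current: last.to }
      : { state: undefined };
  }
}
```

The payoff of this shape is that recovery after a crash, a tool failure, or a mid-run migration becomes a log replay rather than a best-effort reconstruction from a context window.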
The surprising part? The same runtime now powers fairly ordinary business workflows: Telegram Mini Apps, PDF/report pipelines, payment flows, multilingual UI sync, and governed tool execution.
At that point the lines blur between “agent runtime,” “workflow engine,” and “business process orchestration.”
The architectural direction feels less like prompt engineering theater and more like a return to proven systems ideas: event sourcing, capability-based security, replayable state machines, deterministic orchestration, and strong execution provenance.
Long-horizon agents are forcing AI systems back toward problems that systems engineers have been solving for decades: reliable state, clean transitions, isolation, recoverability, and governance.
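One of those proven ideas is worth spelling out. “Deterministic AST execution instead of eval()” is just a small tree-walking interpreter over a grammar you control. A toy version (illustrative only, far simpler than anything production-grade) might look like this:

```typescript
// Toy restricted evaluator; illustrative only. Unlike eval(), it can only
// do what the grammar allows, so execution is bounded, auditable, and
// deterministic by construction.
type Expr =
  | { kind: "num"; value: number }
  | { kind: "var"; name: string }
  | { kind: "bin"; op: "+" | "-" | "*" | "/"; left: Expr; right: Expr };

function evaluate(expr: Expr, env: Map<string, number>): number {
  switch (expr.kind) {
    case "num":
      return expr.value;
    case "var": {
      const v = env.get(expr.name);
      if (v === undefined) throw new Error(`unbound variable: ${expr.name}`);
      return v;
    }
    case "bin": {
      const l = evaluate(expr.left, env);
      const r = evaluate(expr.right, env);
      switch (expr.op) {
        case "+": return l + r;
        case "-": return l - r;
        case "*": return l * r;
        case "/": return l / r;
      }
    }
  }
}

// (x + 2) * 3 with x = 5 → 21. A model can emit this structure as JSON,
// but it cannot smuggle arbitrary code through it.
const ast: Expr = {
  kind: "bin",
  op: "*",
  left: {
    kind: "bin",
    op: "+",
    left: { kind: "var", name: "x" },
    right: { kind: "num", value: 2 },
  },
  right: { kind: "num", value: 3 },
};
console.log(evaluate(ast, new Map([["x", 5]]))); // 21
```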
LLMs keep getting better at an impressive rate. But the systems wrapped around them are starting to matter far more than most in the field expected even a year ago.
The next leap in agent performance probably won’t come from another scaling curve. It will come from treating the agent as a system first, and a language model second. State machines (and their modern descendants) are having a quiet renaissance for a reason.