Why Warrant
Every vendor shipping agentic ops today — Splunk included — says the same sentence: “human-led, with the analyst in control.” It sounds responsible. It also quietly dodges the only question that decides whether any of this scales: when is the human allowed to let go?
If the human must approve every action forever, you haven’t automated operations — you’ve added a chatbot to your ticket queue. So in practice, teams do let go. But ask any of them what the graduation criteria were, and you’ll get the honest answer: the agent seemed fine for a few weeks.
That intuition has three failure modes, and they’re all silent:
| failure mode | what happens |
|---|---|
| the lucky streak | Five easy incidents in a row look identical to competence. Nobody computed whether five is enough. (It isn’t.) |
| confident nonsense | An agent that announces 95% confidence and is right half the time feels trustworthy right up until it matters. |
| the silent swap | The model behind the agent gets updated overnight. Every assumption about its behavior is now stale — and nothing in your ops stack noticed. |
Warrant doesn’t replace evals, guardrails, or approval flows — it’s the layer they all assumed someone else was building.
| layer | answers | revocable? | survives a model update? |
|---|---|---|---|
| evals & benchmarks | “How capable is this model, offline, once?” | n/a | no — the score silently goes stale |
| guardrails | “What must this agent never do?” | static rules | the rules survive; trust was never measured |
| human-in-the-loop | “Can a human babysit every action?” | n/a | yes, because it never trusted anything |
| SOAR automation | “Which runbook fires on this trigger?” | manual config | config survives; no notion of a brain at all |
| WARRANT | “How much autonomy has this exact brain earned, on this exact action — right now?” | yes — first violated prediction | licenses void on fingerprint change |
Before acting, the agent must state what the world will look like if it’s right — a forecast band the metric must return to. Not an LLM grading an LLM: a prediction that can fail. If it fails, the agent was wrong, no matter how good its reasoning sounded.
A Wilson lower bound over enough graded outcomes, plus Brier-scored calibration. One lucky pass can’t license an action, and an agent that’s confidently wrong fails even with a passing hit-rate.
Licenses are per action class, not per agent — being good at restarting services says nothing about rolling back deploys. One violated prediction in production suspends the license on the spot.
Licenses are pinned to a fingerprint: model ID + prompt version. Model updated overnight? Every license reverts to PROVISIONAL — before the new brain acts once, not after its first incident.
Ninety seconds. Four acts. One honest machine.