WARRANT · Licensing authority for AI agents

Why Warrant

The whole industry built agents that act.
Nobody built the DMV.

Every vendor shipping agentic ops today — Splunk included — says the same sentence: “human-led, with the analyst in control.” It sounds responsible. It also quietly dodges the only question that decides whether any of this scales: when is the human allowed to let go?

§1 · THE GAP

“Human-in-the-loop” is a posture, not a policy.

If the human must approve every action forever, you haven’t automated operations — you’ve added a chatbot to your ticket queue. So in practice, teams do let go. But ask any of them what the graduation criteria were, and you’ll get the honest answer: the agent seemed fine for a few weeks.

That intuition has three failure modes, and they’re all silent:

failure mode | what happens
--- | ---
the lucky streak | Five easy incidents in a row look identical to competence. Nobody computed whether five is enough. (It isn’t.)
confident nonsense | An agent that announces 95% confidence and is right half the time feels trustworthy right up until it matters.
the silent swap | The model behind the agent gets updated overnight. Every assumption about its behavior is now stale, and nothing in your ops stack noticed.

§2 · THE LANDSCAPE

Everything else answers a different question.

Warrant doesn’t replace evals, guardrails, or approval flows — it’s the layer they all assumed someone else was building.

layer | answers | revocable? | survives a model update?
--- | --- | --- | ---
evals & benchmarks | “How capable is this model, offline, once?” | n/a | no: the score silently goes stale
guardrails | “What must this agent never do?” | static rules | the rules survive; trust was never measured
human-in-the-loop | “Can a human babysit every action?” | n/a | yes, because it never trusted anything
SOAR automation | “Which runbook fires on this trigger?” | manual config | config survives; no notion of a brain at all
WARRANT | “How much autonomy has this exact brain earned, on this exact action, right now?” | yes: first violated prediction | licenses void on fingerprint change

§3 · THE PRINCIPLES

Four rules, borrowed from a century of licensing humans.

I · Falsifiability

Reality is the judge.

Before acting, the agent must state what the world will look like if it’s right — a forecast band the metric must return to. Not an LLM grading an LLM: a prediction that can fail. If it fails, the agent was wrong, no matter how good its reasoning sounded.
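
A minimal sketch of what such a prediction could look like in code, using the error_rate and 90-second figures from the FAQ below. The `Prediction` type, the `control_band` helper and the three-sigma band are illustrative assumptions, not Warrant's actual implementation.

```python
from dataclasses import dataclass
from statistics import mean, stdev

@dataclass(frozen=True)
class Prediction:
    """A claim the agent commits to before it acts (hypothetical shape)."""
    metric: str        # e.g. "error_rate"
    lower: float       # bottom of the forecast band
    upper: float       # top of the forecast band
    deadline_s: float  # how long the metric has to return to the band

def control_band(healthy_samples, sigmas=3.0):
    """Learn a band from healthy data. Mean +/- 3 sigma is an assumption."""
    mu, sd = mean(healthy_samples), stdev(healthy_samples)
    return mu - sigmas * sd, mu + sigmas * sd

def grade(prediction: Prediction, observed_value: float) -> bool:
    """True if the value read back at the deadline sits inside the band."""
    return prediction.lower <= observed_value <= prediction.upper

healthy = [0.010, 0.012, 0.011, 0.009, 0.013]
low, high = control_band(healthy)
prediction = Prediction("error_rate", low, high, deadline_s=90.0)

recovered = grade(prediction, 0.012)    # metric came back inside the band
still_broken = grade(prediction, 0.20)  # metric stayed outside the band
```

The grader never looks at the agent's reasoning. It compares one number against two bounds, which is what makes the prediction able to fail.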

II · Evidence over vibes

Trust is a statistic, not a feeling.

A Wilson lower bound over enough graded outcomes, plus Brier-scored calibration. One lucky pass can’t license an action, and an agent that’s confidently wrong fails even with a passing hit-rate.
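
As a rough illustration, both statistics fit in a few lines of Python. The Wilson and Brier formulas are the standard ones; the 95% level (z = 1.96), the `certifies` helper and its two thresholds are assumptions made for this sketch, not Warrant's real settings.

```python
import math

def wilson_lower_bound(successes: int, trials: int, z: float = 1.96) -> float:
    """Lower end of the Wilson score interval for a binomial proportion."""
    if trials == 0:
        return 0.0
    p = successes / trials
    z2 = z * z
    centre = p + z2 / (2 * trials)
    margin = z * math.sqrt(p * (1 - p) / trials + z2 / (4 * trials * trials))
    return (centre - margin) / (1 + z2 / trials)

def brier_score(confidences, outcomes) -> float:
    """Mean squared gap between stated confidence and the 0/1 outcome.
    Lower is better; 0.0 is perfect."""
    pairs = list(zip(confidences, outcomes))
    return sum((c - o) ** 2 for c, o in pairs) / len(pairs)

def certifies(confidences, outcomes,
              min_lower_bound: float = 0.75, max_brier: float = 0.10) -> bool:
    """Both gates must pass. The threshold values are illustrative only."""
    hits = sum(outcomes)
    enough_evidence = wilson_lower_bound(hits, len(outcomes)) >= min_lower_bound
    well_calibrated = brier_score(confidences, outcomes) <= max_brier
    return enough_evidence and well_calibrated

three_passes = certifies([0.9] * 3, [1] * 3)      # too little evidence
fifteen_passes = certifies([0.9] * 15, [1] * 15)  # enough, and calibrated
```

With these assumed thresholds, three straight passes do not certify and fifteen do, because the lower bound only climbs as graded outcomes accumulate.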

III · Autonomy is a privilege

Granted narrowly. Revoked instantly.

Licenses are per action class, not per agent — being good at restarting services says nothing about rolling back deploys. One violated prediction in production suspends the license on the spot.
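
One way to picture the rule is a small state table keyed by action class. PROVISIONAL and SUSPENDED are the states named on this page; the LICENSED state name, the `LicenseBook` class and its methods are hypothetical and exist only for this sketch.

```python
from dataclasses import dataclass, field
from enum import Enum

class LicenseState(Enum):
    PROVISIONAL = "provisional"  # a human approves every action
    LICENSED = "licensed"        # assumed name for "may act alone"
    SUSPENDED = "suspended"      # revoked after a violated prediction

@dataclass
class LicenseBook:
    """One entry per action class, never one per agent."""
    states: dict = field(default_factory=dict)

    def state(self, action_class: str) -> LicenseState:
        # Anything not yet earned starts out provisional.
        return self.states.get(action_class, LicenseState.PROVISIONAL)

    def grant(self, action_class: str) -> None:
        self.states[action_class] = LicenseState.LICENSED

    def record_outcome(self, action_class: str, prediction_held: bool) -> None:
        # A single violated prediction suspends that class, and only that class.
        if not prediction_held and self.state(action_class) is LicenseState.LICENSED:
            self.states[action_class] = LicenseState.SUSPENDED

    def may_act_alone(self, action_class: str) -> bool:
        return self.state(action_class) is LicenseState.LICENSED

book = LicenseBook()
book.grant("restart_service")
before = book.may_act_alone("restart_service")
other_class = book.may_act_alone("rollback_deploy")
book.record_outcome("restart_service", prediction_held=False)
after = book.may_act_alone("restart_service")
```

Granting restart_service says nothing about rollback_deploy, and one failed prediction is enough to flip the licensed class to SUSPENDED.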

IV · Brains change

A track record belongs to the brain that earned it.

Licenses are pinned to a fingerprint: model ID + prompt version. Model updated overnight? Every license reverts to PROVISIONAL — before the new brain acts once, not after its first incident.
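
A sketch of the pinning check, assuming the fingerprint is a hash over the two fields named above. Hashing with sha256, the newline separator and both function names are assumptions for illustration; the page only states that the fingerprint combines model ID and prompt version.

```python
import hashlib

def fingerprint(model_id: str, prompt_version: str) -> str:
    """Derive a stable identifier for 'this exact brain' (assumed scheme)."""
    raw = f"{model_id}\n{prompt_version}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()

def effective_state(license_fingerprint: str, license_state: str,
                    current_fingerprint: str) -> str:
    """A license only counts for the brain that earned it."""
    if license_fingerprint != current_fingerprint:
        return "PROVISIONAL"
    return license_state

# Hypothetical model and prompt identifiers.
earned_by = fingerprint("model-2024-06", "prompt-v7")
same_brain = fingerprint("model-2024-06", "prompt-v7")
new_brain = fingerprint("model-2024-09", "prompt-v7")

unchanged = effective_state(earned_by, "LICENSED", same_brain)
after_update = effective_state(earned_by, "LICENSED", new_brain)
```

Because the check runs when the action is requested, the downgrade happens before the updated model acts, not after its first mistake.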

§4 · QUESTIONS A SKEPTIC SHOULD ASK

Asked and answered.

Isn’t this just another eval framework?
No. Evals measure capability, offline, once — Warrant governs authority, in production, continuously. An eval score can’t be revoked when the agent fails on Tuesday, and it doesn’t notice when the model behind the agent changes. A license does both. Evals tell you which agent to hire; Warrant decides what it’s allowed to touch.
What exactly is a “falsifiable prediction”?
Before acting, the agent commits to a measurable claim: “within 90 seconds of my fix, error_rate returns inside the control limit learned from healthy data.” After acting, Warrant reads the live metric back. Inside the band → graded correct. Outside → the agent was wrong — by its own stated standard, with no room to argue. The grading needs no LLM, no judgment call, no benefit of the doubt.
Real incidents are rare. How do you ever get enough samples?
That’s the cold-start paradox — an agent can’t earn trust without acting, and shouldn’t act without trust. The Proving Ground breaks it: manufactured incidents at varied severity and noise, run as accelerated exams. Fifteen graded outcomes in seconds, before the agent ever touches production. The Wilson bound is then built on a real sample, not five anecdotes.
What happens at the moment the agent is wrong?
The verify step catches the violated prediction, the agent rolls back its own action, escalates to a human with full context, and the license for that action class is SUSPENDED — it’s back to human-approved mode until it re-certifies. The demo shows this exact sequence live; it’s Act II.
Does my agent have to be built on Warrant to use it?
No — that’s the point of shipping Warrant as an MCP server. Any agent that can call a tool calls warrant_request_action before acting and warrant_report_outcome after. A SOAR playbook, a Splunk triage agent, a Claude agent — anything. The gate is infrastructure, not a framework you adopt.
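
The calling pattern might look roughly like this from the agent's side. Only the two tool names and metric_url come from this page; the argument names, the response fields ("mode", "request_id") and the `gated_action` wrapper are hypothetical, and `call_tool` stands in for whatever MCP client the agent already uses.

```python
from typing import Any, Callable, Dict, List

ToolCaller = Callable[[str, Dict[str, Any]], Dict[str, Any]]

def gated_action(call_tool: ToolCaller, action_class: str,
                 prediction: Dict[str, Any], metric_url: str,
                 act: Callable[[], Any], ask_human: Callable[[], bool]) -> Dict[str, Any]:
    """Ask the gate first, act only if permitted, then report back for grading."""
    decision = call_tool("warrant_request_action", {
        "action_class": action_class,  # hypothetical argument names
        "prediction": prediction,
    })
    # Hypothetical response field: anything but "autonomous" needs a human.
    if decision.get("mode") != "autonomous" and not ask_human():
        return {"status": "not_executed"}
    act()
    return call_tool("warrant_report_outcome", {
        "request_id": decision.get("request_id"),
        "metric_url": metric_url,
    })

# Stand-in for a real MCP client so the sketch runs on its own.
calls: List[str] = []

def fake_call_tool(name: str, args: Dict[str, Any]) -> Dict[str, Any]:
    calls.append(name)
    if name == "warrant_request_action":
        return {"mode": "autonomous", "request_id": "req-1"}
    return {"status": "recorded"}

performed: List[str] = []
result = gated_action(
    fake_call_tool, "restart_service",
    {"metric": "error_rate", "within_s": 90},
    "https://metrics.example/error_rate",
    act=lambda: performed.append("restart"),
    ask_human=lambda: False,
)
```

The agent's own logic stays inside `act`; the gate only wraps it, which is why the same pattern would apply to a playbook or any other tool-calling agent.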
Why Wilson lower bound and Brier score, specifically?
Wilson: a raw hit-rate of 3/3 looks like 100%, but the 95% Wilson lower bound says “with this little evidence, you can only claim ~44%”, so a lucky streak can’t cross the threshold. Brier: it scores whether stated confidence matched outcomes, so an agent that says “95% sure” while being right half the time fails calibration even if its hit-rate passes. Together: enough evidence, honestly reported.
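
The arithmetic behind those two claims, worked through with the standard formulas. The 95% level (z = 1.96) is an assumption; it is the level at which the ~44% figure comes out.

```latex
% Wilson lower bound for k successes in n trials, with \hat{p} = k/n
L(\hat{p}, n) = \frac{\hat{p} + \frac{z^2}{2n}
  - z\sqrt{\frac{\hat{p}(1-\hat{p})}{n} + \frac{z^2}{4n^2}}}{1 + \frac{z^2}{n}}

% With \hat{p} = 1 the square root collapses to z/(2n), so
L(1, n) = \frac{n}{n + z^2}, \qquad
L(1, 3) = \frac{3}{3 + 1.96^2} = \frac{3}{6.8416} \approx 0.44

% Brier score over N graded outcomes:
% f_i is the stated confidence, o_i \in \{0, 1\} is the outcome
\mathrm{BS} = \frac{1}{N}\sum_{i=1}^{N} (f_i - o_i)^2

% Always saying 0.95 while being right half the time:
\mathrm{BS} = \tfrac{1}{2}(0.95 - 1)^2 + \tfrac{1}{2}(0.95 - 0)^2
            = 0.00125 + 0.45125 = 0.4525
```

By the same formula, five straight passes give a lower bound of about 0.57 and fifteen give about 0.80. An agent that answered 0.5 every time would score a Brier of 0.25, so the overconfident agent scores worse than a forecaster that never commits.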
Can’t an agent just lie about its outcomes to farm a license?
Three defences. Trust-but-verify: over MCP, the agent passes a metric_url and Warrant fetches the metric itself and grades the outcome — the caller never touches the verdict. If an outcome arrives as bare word, it’s permanently flagged self-reported in the ledger, visible to any auditor. Identity: licenses are pinned to the calling agent’s fingerprint — a different brain asking for the same action is refused until it earns its own record. Tamper-evidence: the ledger is a sha256 hash chain (warrant_verify_ledger); editing history breaks the chain.
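
The tamper-evidence idea can be sketched as a plain sha256 hash chain. This shows the general technique only; the entry layout, the genesis value and the function names are assumptions, not the format behind warrant_verify_ledger.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder for "no previous entry"

def entry_hash(prev_hash: str, payload: dict) -> str:
    """Hash the previous entry's hash together with this entry's content."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()

def append(ledger: list, payload: dict) -> None:
    prev = ledger[-1]["hash"] if ledger else GENESIS
    ledger.append({"payload": payload, "prev": prev,
                   "hash": entry_hash(prev, payload)})

def verify(ledger: list) -> bool:
    """Recompute every link; any edited entry breaks the chain from there on."""
    prev = GENESIS
    for entry in ledger:
        if entry["prev"] != prev or entry["hash"] != entry_hash(prev, entry["payload"]):
            return False
        prev = entry["hash"]
    return True

ledger: list = []
append(ledger, {"action": "restart_service", "verdict": "correct"})
append(ledger, {"action": "restart_service", "verdict": "violated"})
intact = verify(ledger)

# Quietly rewriting history: flip the bad verdict to a good one.
ledger[1]["payload"]["verdict"] = "correct"
tampered = verify(ledger)
```

Each hash covers the one before it, so rewriting an old verdict means recomputing every later entry, and an auditor holding the latest hash would still catch the change.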
Is the demo real or staged?
The control loop, proving ground, certification math, MCP round-trips and fault-injection sandbox are all real, running code — clone it and run the same four acts yourself in two terminals. The hosted replay on this site is a capture of a representative run, so judges don’t depend on a localhost.

See the license get revoked,
live, by the agent’s own prediction.

Ninety seconds. Four acts. One honest machine.