AI-generated visual. Pixels, not photons. 

If you can’t audit it, it’s not evidence. It’s a story. 

Monday, 8:07 a.m. Portfolio gate. Thirty minutes, forty slides, five opinions. 

One team calls their animal model “validated.” 

Another calls their digital biomarker “compelling.” 

A third points to an AI analysis that “confirms the pattern.” 

Then someone asks the question that actually decides things: how often has any of this been wrong when we made this exact decision before? 

Not “has it been useful.” Not “is it published.” Not “does it look like the last time.” 

How often did it miss? 

The pause that follows is not incompetence. It is a missing instrument. We track spend to the penny and timelines to the week. We almost never keep a running score of predictivity tied to decisions. And then we act surprised when attrition stays stubborn. 

The thing we never measure 

In most organizations, evidence gains authority the way people do. Through repetition, senior sponsorship, and looking impressive in a meeting. 

I have watched “validated” quietly come to mean “familiar,” as if familiarity were a performance metric. That is not a character flaw. It is what happens when you do not price error. If an input is going to influence a go/no-go, dose selection, patient selection, or a safety monitoring plan, it should be able to answer three questions without a lot of ceremony: 

  • What are you predicting, in plain language? 

  • Under what boundary conditions does that claim hold? 

  • What is your historical miss rate in comparable use? 

If the only thing it can offer is confidence, it is not evidence. It is narrative with metrics attached. 

The Predictivity Ledger 

The Predictivity Ledger is a simple, slightly uncomfortable discipline: every decision-relevant input gets written down as an auditable claim. The mental model is five lines. If it takes ten, you do not have a clear decision. 

  • Decision: the decision this evidence is allowed to influence. 

  • Prediction: what it says will happen, and by when. 

  • Boundary conditions: where it breaks. 

  • Track record: false positives and false negatives in similar contexts, including your own internal misses. 

  • Action when wrong: what gets paused, rerun, or escalated, and who owns the call. 

In drug development the scorecard arrives late, so you build this in two passes: a retrospective after-action review on decisions where the outcome is already known, and a prospective entry for today’s decisions with an explicit revisit to score it when the outcome finally shows up. Retrospectives will surface plenty of plausible gaps; the ledger’s job is to rank them by decision impact and repeat-miss patterns, not to pretend we’ve proven causality overnight. 

If you cannot fill those lines, the input can still be interesting. It just does not get to steer money or patients. Here is what a worked entry looks like when it is forced to be concrete. 

  • Decision: proceed to first-in-human dose escalation beyond Cohort 3. 

  • Prediction: “No clinically meaningful QT prolongation at exposures up to 3x projected Cmax.” 

  • Boundary conditions: not valid with CYP3A inhibitors; not transferable across ECG vendor changes without recalibration. 

  • Track record: in 9 comparable programs, overcalled risk twice; missed risk once in a specific chemical series. 

  • Action when wrong: pause escalation; run confirmatory electrophysiology panel within 7 days; update monitoring plan before resuming. 
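For teams that want the ledger to be machine-readable rather than buried in slides, here is a minimal sketch of an entry as a data structure, using the QT example above. This is an illustrative assumption, not a prescribed schema: the field names, the `TrackRecord` helper, and the owner value are mine, and the numbers simply restate the worked entry.

```python
from dataclasses import dataclass


@dataclass
class TrackRecord:
    """Historical performance of this input in comparable uses."""
    comparable_uses: int   # similar decisions this input has informed
    false_positives: int   # risk overcalled
    false_negatives: int   # risk missed

    @property
    def miss_rate(self) -> float:
        """Fraction of comparable uses where the input was wrong either way."""
        return (self.false_positives + self.false_negatives) / self.comparable_uses


@dataclass
class LedgerEntry:
    """One decision-relevant input, written down as an auditable claim."""
    decision: str                    # the decision this evidence may influence
    prediction: str                  # what it says will happen, and by when
    boundary_conditions: list[str]   # where the claim breaks
    track_record: TrackRecord
    action_when_wrong: str           # what gets paused, rerun, or escalated
    owner: str                       # role-based owner of record

    def can_steer(self) -> bool:
        """An entry with blank lines can be interesting; it cannot steer money or patients."""
        return all([self.decision, self.prediction, self.boundary_conditions,
                    self.action_when_wrong, self.owner])


# The worked QT entry above, expressed as data.
qt_entry = LedgerEntry(
    decision="Proceed to first-in-human dose escalation beyond Cohort 3",
    prediction="No clinically meaningful QT prolongation at exposures up to 3x projected Cmax",
    boundary_conditions=[
        "Not valid with CYP3A inhibitors",
        "Not transferable across ECG vendor changes without recalibration",
    ],
    track_record=TrackRecord(comparable_uses=9, false_positives=2, false_negatives=1),
    action_when_wrong="Pause escalation; confirmatory electrophysiology panel within "
                      "7 days; update monitoring plan before resuming",
    owner="Safety pharmacology lead",
)
print(f"Miss rate in comparable use: {qt_entry.track_record.miss_rate:.0%}")  # 33%
```

The point of the structure is not automation. It is that a blank field is visible, and a visible blank is a governance question.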

Now the discussion changes. People can disagree, but they cannot stay vague. A second benefit shows up immediately. Once you can score evidence, you can finally subtract evidence. 

Making subtraction defensible 

Brian’s case for subtractive innovation names something operationally overdue: stop adding steps to an already swollen process and start replacing steps that do not earn their keep. He points to real entry points, like replacing one of the two animal species in a traditional safety assessment package, or questioning the automatic march through Phase I healthy volunteers. 

Where the Predictivity Ledger helps is in making that instinct easier to execute and easier to defend. Subtraction can sound like austerity unless you can show, in a decision-specific way, what you are replacing, what prediction you still have covered, and what your miss rate looks like. 

So the operational move is simple: treat “replacement” like a contract. 

A replacement proposal should put three ledger entries side by side for the same decision: the legacy step, the proposed replacement package, and the residual uncertainty plan. Each entry states what it predicts, where it breaks, its historical miss profile, and what you will do when reality disagrees. This is risk mitigation, not reaction: write it before the surprise, and force the choice to be replacement plus a residual uncertainty plan, not “new + old forever.” 

That shifts the conversation from categories to performance. “Animal versus non-animal” is identity-driven. “Lower false negatives for this safety decision under these conditions” is legible to executives and regulators. It also creates the permission structure to delete the redundant step once the replacement has earned it. And once you can score an in vivo study against an in vitro stack, you can score an algorithm too. That takes us straight to Nick. 

Nick’s AI framing is right, and it needs clear guardrails 

Nick usefully frames AI through a risk lens that most of us now feel: you are exposed both ways. Not using AI can make you slow, incomplete, and blind to patterns. Using it can make you confidently wrong, especially when the output drifts out of scope or loses context. His advice lands where it should, with the individual: learn what the system was optimized to do and verify what matters. 

For regulated, high-stakes decisions, I would add one more layer. Individual diligence is necessary, but it is not sufficient on its own. In governance settings, “everyone double-check it” can quietly create an accountability gap unless there is explicit ownership for scope, boundary conditions, and escalation when something looks off. 

This is where the Predictivity Ledger earns its keep: it treats AI like any other instrument in the evidence chain. Not as a special category, and not as magic. It simply requires the same five lines: decision, prediction, boundary conditions, track record, and action when wrong, with an owner-of-record (role-based) and an explicit handoff when roles change. 

Put AI into the same ledger as everything else and it becomes neither forbidden nor mystical. It becomes accountable. And once it is accountable, it can be safely used where it actually changes a decision, rather than decorating a slide. 

There is another, quieter payoff. The ledger makes “one more analysis” expensive again, because every analysis has to name the decision it will change. That pressure is healthy. 

The third view: predictivity as capital 

Now combine Brian and Nick and you get a third view that is more useful than either alone. Brian is pushing the system to shed weight, but only if replacements are fit for purpose. Nick is warning that faster analytics will not save you if you accelerate the wrong claims. 

The bridge is to treat predictivity as capital. 

Inputs that consistently predict correctly for a defined decision earn the right to replace something else. Inputs that fail repeatedly accrue debt. That debt gets paid down through tighter boundary conditions, redesign, or retirement. 
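One way to make “capital” more than a metaphor is a running balance per input per decision type. The sketch below is an assumption about how that accounting could work, not a fixed rule: the asymmetric weights, the threshold, and the function names are all illustrative and would need tuning per organization.

```python
from collections import defaultdict

# (input_name, decision_type) -> running predictivity balance
capital: defaultdict = defaultdict(float)


def record_outcome(input_name: str, decision_type: str,
                   correct: bool, impact: float = 1.0) -> float:
    """Credit a correct call; debit a miss, weighted by decision impact.

    Misses are debited more heavily than hits are credited, so an input
    must predict well repeatedly to earn the right to replace a legacy step.
    """
    key = (input_name, decision_type)
    capital[key] += impact if correct else -2.0 * impact  # asymmetric on purpose
    return capital[key]


def may_replace_legacy(input_name: str, decision_type: str,
                       threshold: float = 3.0) -> bool:
    """An input earns replacement rights only above a credit threshold."""
    return capital[(input_name, decision_type)] >= threshold
```

The asymmetry is the point: one miss should cost more than one hit earns, because that is how attrition actually prices error.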

Consider a concrete scenario. This is a safety package decision, not an efficacy bet. 

A translational team wants to cut cycle time on a cardiometabolic asset. The current safety plan includes two in vivo species plus a growing menu of mechanistic screens. The proposal is to replace one animal species with a stack: human-derived cardiomyocytes under stress conditions, a targeted functional panel, and an AI model that flags discordant patterns for human review. Without a ledger, the debate turns familiar. “Regulators will never accept it.” “The cells are more human.” “The model sees what we miss.” Everyone is partly right and nobody can decide. 

With a ledger, the question tightens. 

  • Decision: characterize dose-limiting cardiovascular safety liabilities and define a monitoring plan for first-in-human. 

  • Legacy package: two species in vivo plus telemetry. 

  • Proposed package: one species in vivo plus cardiomyocytes, targeted assays, and AI used only for discordance detection within a defined context of use: an interpretation QC/triage layer with predefined escalation triggers and human sign-off. 

  • Residual uncertainty plan: enhanced ECG monitoring, pre-specified stopping rules, and confirmatory assays when signals disagree. 

Then you score both packages against history you already have: what did each miss, what did each overcall, and under what boundary conditions. If the proposed package shows equal or lower false negatives for the decision, you have a rational basis to pilot it on one program with early regulatory engagement. If it does not, you stop calling it a replacement. You treat it as a research add-on until it earns predictivity. Same tools. Different accounting. Different behavior. 
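The scoring itself can stay simple. Here is a minimal sketch of the decision rule, assuming you have after-action labels for each package’s past calls on comparable programs; the counts are placeholders for illustration, not real program history, and small samples would need a wider margin than the bare inequality shown here.

```python
def false_negative_rate(misses: int, comparable_uses: int) -> float:
    """Fraction of comparable decisions where the package missed a real liability."""
    return misses / comparable_uses


# Placeholder counts, not data.
legacy_fnr = false_negative_rate(misses=2, comparable_uses=12)    # two species + telemetry
proposed_fnr = false_negative_rate(misses=1, comparable_uses=8)   # one species + in vitro stack + AI triage

if proposed_fnr <= legacy_fnr:
    print("Rational basis to pilot on one program, with early regulatory engagement.")
else:
    print("Not a replacement yet: keep it as a research add-on until it earns predictivity.")
```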

Close 

The Predictivity Ledger is not a new initiative. It is a discipline that makes evidence behave like evidence. If you are serious about reducing attrition, you cannot keep letting models, biomarkers, and AI outputs borrow credibility from authority and novelty. They need to earn influence through auditable performance tied to decisions. So here is the decision. 

Will you keep funding evidence you cannot score, or will you require a ledger entry before it can influence governance? 

Pick one high-cost decision in one program and start keeping the score. The first time a “validated” input fails its own ledger, you will have your answer. 

Actionable Takeaways 

  • Require a one-page Predictivity Ledger entry for every input used in the next portfolio or governance decision memo. 

  • Create a “replacement packet” rule: any proposal to remove a legacy study or model must include a ledger comparison and an explicit residual uncertainty plan. 

  • For AI-assisted outputs that enter a decision process, assign a named owner and document scope, drift triggers, and escalation steps. 

  • Run a 90-minute retrospective on the last five terminated programs: write down the miss for each key model or biomarker, then decide what gets redesigned or retired. 

  • Pilot the ledger in one function (toxicology, translational biomarkers, or clinical operations) and publish a simple internal scorecard. If predictivity is invisible, it will not improve. 
