Per-inference cost is the only AI metric that actually scales

Heads up: placeholder draft for AJ to rewrite in his own voice. References the paper at /practical-framework-for-estimating-AI-costs-beyond-the-cloud.pdf.

When boards talk about AI cost, the conversation is usually about last month’s cloud bill. That’s the wrong unit. The metric that actually drives decisions at scale is cost per inference, and once you start measuring it, three things get easier.

1. You can compare options honestly

A 70B open-weight model on a local H200 and a closed API endpoint show up in completely different parts of the budget: one as a hardware purchase, the other as a monthly usage bill. Per-inference cost flattens them into the same row of a spreadsheet. Suddenly you can see that the open model is, say, 4× cheaper at the utilization your business actually has, or that the API is in fact a steal at your volume.

2. CAPEX becomes a strategy question, not a procurement one

The dominant variable in local inference is hardware amortization. Per-inference cost forces you to write down your expected daily utilization, and that’s where most decisions actually live. The hardware costs the same per day whether it serves ten requests or ten million, so underutilized H200s are among the most expensive ways to produce tokens, and saturated H200s are among the cheapest.
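To make the utilization point concrete, here is a minimal sketch of the arithmetic. It is not the paper's model: the function names, the choice to treat power and hosting as a fixed daily cost, and every number in the usage loop are assumptions picked for illustration only.

```python
def local_cost_per_inference(hardware_cost_usd, amortization_days,
                             daily_opex_usd, peak_inferences_per_day,
                             utilization):
    """Amortized hardware plus fixed daily opex, spread over inferences served.

    Assumption: opex (power, hosting) is a flat daily figure that does not
    vary with load. A fuller model would scale power with utilization.
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    daily_fixed = hardware_cost_usd / amortization_days + daily_opex_usd
    served_per_day = peak_inferences_per_day * utilization
    return daily_fixed / served_per_day


def break_even_utilization(hardware_cost_usd, amortization_days,
                           daily_opex_usd, peak_inferences_per_day,
                           api_cost_per_inference_usd):
    """Utilization at which local cost per inference equals the API price.

    A result above 1.0 means the hardware cannot beat the API even when
    fully saturated.
    """
    daily_fixed = hardware_cost_usd / amortization_days + daily_opex_usd
    return daily_fixed / (peak_inferences_per_day * api_cost_per_inference_usd)


if __name__ == "__main__":
    # Placeholder inputs, not measured values.
    hw, days, opex, peak = 40_000, 3 * 365, 10.0, 1_000_000
    for u in (0.05, 0.25, 1.0):
        cost = local_cost_per_inference(hw, days, opex, peak, u)
        print(f"utilization {u:>4.0%}: ${cost:.6f} per inference")
    print("break-even vs. a $0.0002 API call:",
          f"{break_even_utilization(hw, days, opex, peak, 0.0002):.1%}")
```

Because the daily cost is fixed in this sketch, cost per inference is inversely proportional to utilization: a tenth of the traffic means ten times the unit cost. The break-even function turns that into a single number a board can argue about.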

3. ROI becomes provable

It’s hard to argue with finance about “AI value.” It’s much easier to argue that the inference cost of this workflow is $0.00011 and that it saves a human 90 seconds, valued at $0.40. Per-inference cost is the bridge between the engineering team and the people who fund them.
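The ROI line can be written down as a few lines of arithmetic. This is a rough sketch: the function name is hypothetical, and the $16/hour labor rate is not stated in the note; it is simply the rate implied by valuing 90 seconds at $0.40.

```python
def roi_per_inference(inference_cost_usd, seconds_saved, labor_rate_usd_per_hour):
    """Return (value of time saved in USD, value-to-cost ratio) for one inference."""
    if inference_cost_usd <= 0:
        raise ValueError("inference_cost_usd must be positive")
    value_usd = seconds_saved / 3600 * labor_rate_usd_per_hour
    return value_usd, value_usd / inference_cost_usd


if __name__ == "__main__":
    # $16/hour is an assumption backed out of the example: 90 s valued at $0.40.
    value, ratio = roi_per_inference(0.00011, 90, 16.0)
    print(f"time saved is worth ${value:.2f}, about {ratio:,.0f}x the inference cost")
```

The ratio only holds if the 90 seconds are really saved on every call, so the time-saved estimate deserves as much scrutiny as the cost figure.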

If you want the full framework — the math, the worked example on an H200, and the parts I think the cloud calculators get wrong — the paper is on the writing page. This note is the executive summary of the executive summary.