
AI · Metrics

Metrics that matter in AI projects

Most AI dashboards measure activity and dress it up as progress. These are the metrics that actually tell you whether your project is delivering or burning money.

Published on May 13, 2026 · 9 min read · By Adán Mejías

In every AI project committee I've sat in, someone presents a dashboard. Most of them are beautiful, full of charts and rising numbers. And most of them tell you nothing that matters. They measure activity, not outcome. Usage, not impact. What's easy to measure, not what costs money.

Let's tidy up this territory. The metrics worth watching in AI projects fall into four layers, each answering a different question. If you confuse the layers, you'll make bad decisions.

Layer 1: Real adoption, not active licenses

The first layer answers "are people using this?". But be careful about what counts as "using".

What to measure

  • Weekly active users (not monthly). The monthly metric hides drop-offs; the weekly one reveals them.
  • Average usage frequency per active user. Saying "we have 200 users" without knowing how many times a week each one uses it is propaganda.
  • Depth of use: variety of cases per user. Are they always doing the same single task, or exploring different uses?
  • Cohort retention curve. Are the people who started three months ago still using it as much, or have they dropped off? (See the sketch after this list.)
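
To make the first and last bullets concrete, here's a minimal Python sketch that derives weekly active users and a cohort retention figure from a plain usage log. The event list, field layout and dates are invented for illustration; the point is that both metrics fall out of the same (user, date) events your tool already records.

    from collections import defaultdict
    from datetime import date, timedelta

    # Invented usage log: one (user, day) row per session. Replace with your export.
    events = [
        ("ana", date(2026, 3, 2)), ("ana", date(2026, 3, 9)),
        ("luis", date(2026, 3, 2)), ("luis", date(2026, 4, 6)),
    ]

    def week_of(d: date) -> date:
        """Monday of the week containing d, used as the week bucket."""
        return d - timedelta(days=d.weekday())

    # Weekly active users: distinct people per week, not logins or licenses.
    wau = defaultdict(set)
    for user, day in events:
        wau[week_of(day)].add(user)

    # Cohort retention: group users by their first week, then check what share
    # of that cohort is still active N weeks later.
    first_week = {}
    for user, day in sorted(events, key=lambda e: e[1]):
        first_week.setdefault(user, week_of(day))

    def retention(cohort_week: date, weeks_later: int) -> float:
        cohort = {u for u, w in first_week.items() if w == cohort_week}
        later = cohort_week + timedelta(weeks=weeks_later)
        active = {u for u, d in events if week_of(d) == later}
        return len(cohort & active) / len(cohort) if cohort else 0.0

    for week, users in sorted(wau.items()):
        print(week, "weekly active users:", len(users))
    print("Week-1 retention, 2026-03-02 cohort:", retention(date(2026, 3, 2), 1))

Weekly buckets on purpose: a monthly bucket would smooth over exactly the drop-off this list is trying to catch.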

What NOT to measure as adoption

Active licenses. Created accounts. Logins. Installations. That's IT department stuff and says nothing about value delivered. I've seen projects with 1,000 licenses, 100 real users and 80% of licenses dormant, and the dashboard still counted it all as success.

An inactive license is a lie on your dashboard. It costs the same as an active one and produces nothing.

Layer 2: Output quality

The second layer answers "is what the AI produces useful?". This is where the metrics most committees avoid live: they require subjective quality judgment, but they're the ones that move the result the most.

What to measure

  • Acceptance rate of AI suggestions. When the AI proposes something, what percentage of the time does the user accept it as-is, edit it, or discard it? This metric alone tells you whether the model is aligned with your case (see the sketch after this list).
  • Honest "I don't know" rate. A copilot that says "I don't know" when it doesn't know is far better than one that invents with a confident voice. This rate shouldn't be zero.
  • Serious errors detected per month. Not just the count, also the severity. Three minor errors weigh less than one error that reached the customer.
  • Human review time per output. If a person needs 10 minutes to validate what AI produced in 30 seconds, the savings are smaller than they look.
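
A minimal sketch of how the acceptance rate and its neighbours can be tallied, assuming each suggestion is logged with the user's reaction and the minutes spent reviewing it. The field names and example records are invented.

    from collections import Counter

    # Invented log: one row per AI suggestion and what the user did with it.
    suggestions = [
        {"outcome": "accepted",  "review_minutes": 1.0, "said_dont_know": False},
        {"outcome": "edited",    "review_minutes": 4.0, "said_dont_know": False},
        {"outcome": "discarded", "review_minutes": 0.5, "said_dont_know": True},
        {"outcome": "accepted",  "review_minutes": 1.5, "said_dont_know": False},
    ]

    total = len(suggestions)
    by_outcome = Counter(s["outcome"] for s in suggestions)
    for outcome in ("accepted", "edited", "discarded"):
        print(f"{outcome}: {by_outcome[outcome] / total:.0%}")

    # Honest "I don't know" rate: on a healthy copilot this is above zero.
    print(f"'I don't know' rate: {sum(s['said_dont_know'] for s in suggestions) / total:.0%}")

    # Human review time per output: the part that quietly eats the savings.
    print(f"Average review time: {sum(s['review_minutes'] for s in suggestions) / total:.1f} min")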

The observer test

Once a month, a human evaluator reviews a random sample of 30-50 outputs and scores them on a simple scale: useful, partially useful, irrelevant, harmful. Without that qualitative signal, the quantitative numbers are flying blind.
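
A minimal sketch of that routine, assuming you can export output IDs and hand a list to a reviewer. The IDs, the sample size of 40 and the stand-in scores are all illustrative.

    import random

    def draw_sample(output_ids: list[str], size: int = 40, seed: int = 0) -> list[str]:
        """Pick the outputs one human evaluator will score this month."""
        rng = random.Random(seed)  # fixed seed so the month's sample is reproducible
        return rng.sample(output_ids, min(size, len(output_ids)))

    def summarize(scores: dict[str, str]) -> dict[str, float]:
        """Share of sampled outputs per label: useful, partially useful, irrelevant, harmful."""
        counts: dict[str, int] = {}
        for label in scores.values():
            counts[label] = counts.get(label, 0) + 1
        return {label: n / len(scores) for label, n in counts.items()}

    # Usage: draw the sample, have the evaluator score it, then summarize.
    ids = [f"output-{i}" for i in range(500)]
    sample = draw_sample(ids)
    scores = {oid: "useful" for oid in sample}  # stand-in for the evaluator's real judgments
    print(summarize(scores))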

Layer 3: Business impact

The third layer answers "is this moving the number we care about?". This is where most projects get stuck, because connecting AI to a business metric requires discipline.

What to measure

  • Average time per task before and after. With rigorous sampling, not self-reports. People overestimate their savings by 1.5x to 3x when asked.
  • Cost per unit of work (ticket, lead, report, call). Before and after AI. This brutal metric is what brings everyone back to reality.
  • Volume processed with equal or fewer resources. If your support team closes twice the tickets without growing, that's value.
  • End-customer quality indicators. NPS, CSAT, churn rate. AI shouldn't improve productivity at the expense of service quality. If it does, you have to adjust.

The control trick

When possible, keep a control group that doesn't use AI for the first three months. Comparing the AI group's evolution against the control group is the only clean signal. Without a control, all improvements get attributed to AI and many aren't its doing.
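
A minimal sketch of that comparison, with sampled task times invented for illustration. The only number that matters is the gap between the two groups' changes.

    from statistics import mean

    # Sampled minutes per task, before and after rollout. All figures invented.
    ai_before      = [42, 38, 45, 40, 41]
    ai_after       = [30, 28, 33, 29, 31]
    control_before = [41, 39, 44, 42, 40]
    control_after  = [38, 37, 41, 39, 38]

    ai_change      = mean(ai_after) - mean(ai_before)
    control_change = mean(control_after) - mean(control_before)

    # Whatever the control group also improved (new processes, seasonality,
    # plain learning) is not AI's doing; only the gap between the groups is.
    print(f"AI group change:      {ai_change:+.1f} min/task")
    print(f"Control group change: {control_change:+.1f} min/task")
    print(f"Attributable to AI:   {ai_change - control_change:+.1f} min/task")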

Layer 4: Cost and efficiency

The fourth layer answers "how much does this really cost us?". Almost every dashboard underestimates this side.

What to measure

  • Total monthly cost: licenses + API + infrastructure + dedicated human hours. The full sum, not just the vendor invoice.
  • Cost per successful interaction. Divide total cost by useful interactions, not total ones. If you have 10,000 conversations and 6,000 are useful, divide by 6,000.
  • Cost-per-unit trend over time. Is it going down with learning and optimization, or going up out of control?
  • Avoided non-AI cost. If you hadn't done this, what would you have had to hire or outsource? That's the basis for the return.

The hidden cost almost nobody counts

The hours from PMs, IT and business experts who spend time curating prompts, tuning topics and reviewing outputs. In serious projects, this can be 30-50% of total real cost. If you don't count it, your return is inflated and you'll be making decisions on false premises.
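
Put into numbers, a minimal sketch of the honest version. Every figure is invented except the 10,000 conversations and 6,000 useful ones from the example above; the human-hours line is the one that usually goes missing.

    # Monthly cost, all in. Figures invented for illustration.
    licenses       = 4_000
    api_usage      = 2_500
    infrastructure = 800
    human_hours    = 120   # PM, IT and expert time curating prompts and reviewing outputs
    hourly_cost    = 45

    total_monthly = licenses + api_usage + infrastructure + human_hours * hourly_cost

    # Cost per successful interaction: divide by the useful ones, not all of them.
    conversations, useful = 10_000, 6_000
    print(f"Total monthly cost:           {total_monthly:,.0f}")
    print(f"Naive cost per conversation:  {total_monthly / conversations:.2f}")
    print(f"Cost per useful interaction:  {total_monthly / useful:.2f}")
    print(f"Hidden human share of cost:   {human_hours * hourly_cost / total_monthly:.0%}")

With these made-up numbers the human hours alone are over 40% of the real cost, squarely inside the 30-50% band that never shows up on the vendor invoice.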

The minimum viable dashboard

If I had to reduce everything to five numbers for an executive committee, my dashboard would be this:

  • Weekly active users and their trend.
  • Output acceptance rate (proxy for quality).
  • Business metric moved (the one you committed to before starting).
  • Total monthly cost and cost per useful interaction.
  • Top 3 problems on the improvement backlog.

Five numbers, not fifteen. A committee that looks at five numbers can make decisions. A committee that gets fifteen panels ends up looking at the prettiest one and forgetting the important one.
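
If it helps to pin the scope down, here's a minimal sketch of those five numbers as a data structure. The field names and the snapshot values are invented.

    from dataclasses import dataclass, field

    @dataclass
    class CommitteeDashboard:
        weekly_active_users: int
        wau_trend_pct: float                # week-over-week change in adoption
        acceptance_rate: float              # proxy for output quality
        business_metric_delta: float        # movement of the one metric committed to upfront
        total_monthly_cost: float
        cost_per_useful_interaction: float
        top_problems: list[str] = field(default_factory=list)  # at most three

    snapshot = CommitteeDashboard(
        weekly_active_users=140, wau_trend_pct=3.5,
        acceptance_rate=0.62, business_metric_delta=-0.8,
        total_monthly_cost=12_700, cost_per_useful_interaction=2.12,
        top_problems=["hallucinated prices", "slow retrieval", "poor PDF parsing"],
    )
    print(snapshot)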

The most common traps in AI dashboards

I've seen enough dashboards to have catalogued the tricks.

The "ideas generated" count

Teams that measure how many ideas AI helped produce. The idea isn't value. Value is the executed idea. Count what reached production, not what was brainstormed.

"Projected savings"

Multiplying minutes saved per day by users by 220 working days a year, and presenting a six-figure number as "annual savings". That's fantasy with a calculator. Measure real savings with sampling, not projected savings with multiplication.
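
A minimal sketch of the contrast, with invented inputs on both sides.

    from statistics import mean

    # The fantasy version: minutes saved per day x users x 220 working days.
    projected = 20 / 60 * 150 * 220 * 45   # 20 min/day, 150 users, 45/hour loaded cost
    print(f"Projected 'savings': {projected:,.0f} per year")

    # The honest version: sampled task times before and after, times the volume
    # of tasks actually done with AI, and nothing else.
    before = [42, 38, 45, 40]   # minutes per task, measured by sampling
    after  = [33, 30, 36, 31]
    tasks_with_ai_per_year = 9_000
    measured = (mean(before) - mean(after)) / 60 * tasks_with_ai_per_year * 45
    print(f"Measured savings:    {measured:,.0f} per year")

Same hourly cost on both sides, and the six-figure projection still comes out at roughly eight times the number that survives measurement.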

NPS without context

An NPS of 60 can be excellent or disappointing depending on the baseline. Always ask for it with a comparison: NPS vs. what it was before, NPS vs. alternatives, NPS by usage segment.

When to change the metrics

A metric that doesn't change behavior shouldn't exist. Each quarter it's worth asking: what decisions have been made looking at this number in the last 90 days? If the answer is "none", that metric is decorating, not informing. Drop it. The mental space it frees is worth it on its own.

The mistake I see most often

The mistake I see most often is measuring what's easy. Easy is counting users, counting conversations, counting tokens. Hard is measuring real time saved, output quality and movement of the business number. And since hard costs more, projects default to easy. But committees that only see easy numbers end up making bad decisions, because activity rises and the business doesn't move.

The rule I apply: in any serious AI project, before starting, I write down on a sheet the single business metric that's going to move. If we can't commit to one, the project isn't ready to start. That discipline, which seems small, separates teams that deliver real value from teams that are merely very busy with what looks like value.

The metrics that matter in AI projects aren't the ones that generate the most data; they're the ones that change the most decisions. Real adoption, output quality, business impact and honest total cost. Those four layers, with discipline, tell you at any moment whether your project is alive, dying or performing theater. The rest, however pretty, is decoration.

Found this useful?

Book a free 15-min assessment. I'll send you a personalized guide afterwards.

Book my assessment