
Evidence Over Enthusiasm: A Framework for Evaluating AI Tools


Ger Perdisatt

Founder, Acuity AI Advisory

The AI tools market is saturated, and vendor claims are frequently untethered from verifiable evidence. Here is a practical framework for evaluating AI tools before you spend money on them.

The AI tools market has a structural problem: the incentive to claim capability significantly outweighs the incentive to demonstrate it rigorously. Vendors present impressive demonstrations in controlled conditions. Reference customers are selected for enthusiasm rather than representativeness. Case studies report outcomes without baselines. Productivity statistics are sourced from the vendor's own research.

None of this is unique to AI. But AI has an additional layer of complexity: many buyers do not yet have the conceptual framework to interrogate vendor claims effectively. They do not know what questions to ask, so they do not ask them. The demonstration works. The contract gets signed.

The framework below is designed to change that dynamic. It is drawn from what rigorous evaluation actually looks like — the questions that separate genuine capability from demonstration theatre.

Before the evaluation: be clear on the problem

No evaluation framework can compensate for an unclear problem definition. Before you assess any AI tool, write down — in plain language, not in the vendor's terminology — the specific operational problem you are trying to solve. Name the process. Describe the current state. Quantify the cost or constraint if you can.

If you cannot write that down before the vendor demonstration, the vendor will write it for you during the demonstration. That is not evaluation. That is adoption dressed up as evaluation.

Questions to ask vendors

What is the evidence base for the productivity claims you are making? Ask for the methodology, the sample size, the baseline against which improvement was measured, and whether the study was independent. Vendor-funded research is not worthless, but it should not be treated as equivalent to independent evidence. A vendor that cannot produce a methodology should be marked down significantly.

Where does my data go, and who can access it? This is non-negotiable due diligence. For any tool that processes your organisation's data, you need clear written answers about data residency, whether your data is used to train models, what access controls exist, and what happens to your data if you terminate the contract.

What are the known failure modes of this tool? Competent vendors know where their tools fail. A vendor who cannot describe failure modes either is not being honest or does not know the product well enough. Specific failure modes matter because they determine where human oversight is required in your workflow.

What does a successful deployment look like, and what are the common reasons deployments underperform? The answer to this question tells you more about implementation risk than any case study. Vendors who have seen their tools fail know exactly why. The reasons are usually specific and often relate to preconditions — data quality, process maturity, change management — that you can assess against your own situation.

How to run a proper pilot

A pilot that is designed by the vendor is not a pilot. It is an extended demonstration. A genuine pilot has the following characteristics.

It tests the tool on a real operational problem, not a synthetic use case constructed to show the tool at its best. It has a defined baseline — the current performance of the process the tool is intended to improve, measured before the pilot begins. It runs long enough for novelty effects to dissipate. Staff who are enthusiastic about new technology often produce inflated early performance numbers that normalise over time. And it involves the people who will use the tool in production, not a specially selected champion user.

If the vendor resists any of these conditions, that is information.
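
To make the baseline and duration requirements concrete, here is a minimal sketch in Python of how a pilot verdict might be computed. The metric (cycle_time_hours), the four-week burn-in, and the 10% threshold are illustrative assumptions, not part of the framework; substitute whatever outcome your problem definition names.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class WeeklyMeasure:
    week: int                 # weeks since measurement began
    cycle_time_hours: float   # illustrative outcome metric for the process

def pilot_verdict(baseline: list[WeeklyMeasure],
                  pilot: list[WeeklyMeasure],
                  burn_in_weeks: int = 4,        # assumption: novelty window
                  min_improvement: float = 0.10  # assumption: success threshold
                  ) -> str:
    """Compare pilot outcomes against a baseline measured BEFORE the pilot began.

    The first burn_in_weeks of pilot data are discarded so that novelty
    effects have time to dissipate before anything counts.
    """
    settled = [r for r in pilot if r.week > burn_in_weeks]
    if not settled:
        return "inconclusive: pilot has not yet run past the burn-in period"
    baseline_avg = mean(r.cycle_time_hours for r in baseline)
    settled_avg = mean(r.cycle_time_hours for r in settled)
    improvement = (baseline_avg - settled_avg) / baseline_avg
    if improvement >= min_improvement:
        return f"cycle time improved {improvement:.0%} against a {baseline_avg:.1f}h baseline"
    return f"below threshold: {improvement:.0%} change against a {baseline_avg:.1f}h baseline"
```

The mechanics are trivial; the discipline is not. The baseline list must exist before the pilot starts, and the burn-in must be agreed before anyone sees the numbers.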

What metrics to use

The metrics depend on the use case, but there are some universal principles. Measure outputs, not usage. A tool that is heavily used but not producing better outcomes than the previous approach is not delivering value. Measure against the baseline established before the pilot, not against an abstract improvement claim. Include the cost of the tool itself in the return calculation, including implementation costs, training time, and ongoing management overhead.
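
As a worked example of that last principle, here is a minimal sketch of a return calculation that charges the tool its full cost, not just the licence. All figures and parameter names are illustrative assumptions.

```python
def net_annual_return(measured_benefit: float,    # improvement vs baseline, in money terms
                      licence_cost: float,
                      implementation_cost: float,  # amortised over the first year
                      training_hours: float,
                      loaded_hourly_rate: float,
                      management_overhead: float) -> float:
    """Net return once the full cost of ownership is counted."""
    total_cost = (licence_cost
                  + implementation_cost
                  + training_hours * loaded_hourly_rate
                  + management_overhead)
    return measured_benefit - total_cost

# A tool that "saves" €60,000 a year can still be net negative once
# implementation, training and ongoing management are counted.
print(net_annual_return(measured_benefit=60_000,
                        licence_cost=24_000,
                        implementation_cost=20_000,
                        training_hours=300,
                        loaded_hourly_rate=45,
                        management_overhead=10_000))
# -7500.0
```

The specific numbers do not matter; what matters is that the cost side includes everything the tool actually consumes, because that is the side vendor case studies most often leave out.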

Avoiding sunk cost bias

The point in an AI tool evaluation where organisations most commonly lose their objectivity is after a significant investment has been made in implementation. At that point, the question changes from "is this tool delivering value?" to "how do we make this work?" The latter question is not always wrong, but it is a different question, and confusing the two is how organisations end up managing failed deployments for years rather than cutting their losses.

Establishing in advance — before implementation begins — the criteria under which you would conclude the tool is not working is the most effective protection against this bias. Write down those criteria. Review against them at defined points. Give people explicit permission to report negative findings.
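
Here is a minimal sketch of what writing those criteria down might look like, expressed as a Python structure so it can sit alongside review tooling. The thresholds, dates, and wording are illustrative assumptions; the point is that they are fixed before implementation begins.

```python
from datetime import date

# Pre-registered before implementation begins. Thresholds and dates are
# illustrative; what matters is that they are written down first.
EVALUATION_PLAN = {
    "review_dates": [date(2026, 3, 1), date(2026, 6, 1)],
    "stop_criteria": [
        "Cycle time is not at least 10% better than the pre-pilot baseline",
        "More than two incidents per month require manual correction of tool output",
        "Fewer than 60% of intended production users are still active at review",
    ],
    "decision_rule": (
        "If any stop criterion holds at a review date, a stop/continue decision "
        "is escalated; continuing requires written justification against the data."
    ),
}
```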


At Acuity AI Advisory, evidence over enthusiasm is not a slogan — it is how we work. We help Irish organisations evaluate AI tools and investments rigorously, before commitments are made. Talk to us before your next AI procurement decision.
