The Gap Between AI Spend and AI Results

A pattern repeats in almost every conversation we have had with prospective clients over the last year. They have spent a meaningful amount of money on AI subscriptions, copilots, chatbots, and a handful of proofs of concept. Some of these investments came from vision, others from a perfectly reasonable fear of being left behind. What is consistently missing is not effort or budget but results, or more precisely, the kind of results that can be defended to a board, a finance team, or a customer, with numbers attached.

This is not only our observation. IBM's 2026 CEO Study reports that nearly 80% of executives expect AI to drive significant revenue by 2030, but only 24% know where that revenue will come from. Deloitte's 2026 State of AI survey puts a finer point on it: only 20% of organizations report actually achieving revenue growth from AI, while 74% say they hope to. The expectation is in place, the path to it is not. There is also a useful paradox worth naming: the companies building AI are simultaneously the ones investing most aggressively in human consulting partnerships. Anthropic, to take one example, is actively recruiting partner success managers and building ecosystems of consultancies and systems integrators around Claude. If the technology could deploy itself into organizational reality, none of that investment would make sense.

This is the conversation we keep ending up in, and it is the one we want to write down here. We describe how we envision AI transformation when we walk into a client's context, what we typically find, and how we try to start them on something that produces concrete outcomes, then help them expand from there into more areas of the business. The internal politics and power structures are theirs to handle. We adapt the technology around those.

1. The two outcomes that actually justify the spend

When clients describe what they want from AI, we hear two distinct kinds of expectation, and they are worth separating because they require different engineering postures.

The first is cost reduction. This is the older, more intuitive narrative. Automate work that humans currently do, and pay less for it. Customer support, document processing, internal search, manual data entry. These remain perfectly legitimate goals, and they often produce the easiest wins to measure, because the baseline already exists. You know what one ticket costs to resolve today, and you can compare.

The second expectation is newer and, in our experience, the more interesting one. It is about new revenue and new capabilities, things that simply were not feasible before AI made them tractable. An airline using agents to rebook flights and reroute bags automatically, freeing human agents for the complex cases that actually need judgment. Decision support inside workflows where having an analyst in the loop was previously impossible. Products that shape themselves around customer context in ways that used to require a custom build per account. This is the part of AI transformation that is not just about doing the same things more cheaply, it is about doing things that did not exist on the menu before.

We see many clients spend heavily on the first kind of outcome, and some are now starting to wonder how they can move on to the second kind as the AI infrastructure around them matures. Part of our work is helping them see that the cost-cutting story, while useful, is not the ceiling. The new revenue story is harder to scope, harder to estimate, and harder to deliver, but it is also where the meaningful advantage usually lives. Before clients can pursue either outcome seriously, there is a mental model that usually needs to shift first.

2. Chatbots are not the current frontier

When a non-technical stakeholder thinks about AI, their first-hand experience is usually with a chatbot. Sometimes they have used an agent in one of its more accessible, ready-made forms, such as a code assistant or a desktop LLM client. They very rarely have hands-on experience with a true agent, one with its own harness, system integration, tool capabilities, and real autonomy. The chatbot is the artifact they have used, the artifact they have shown to their colleagues, and the artifact they will reference when they say "we should do something like that." This is not a criticism; it is just the calibration that comes from what is widely deployed and easy to try.

Part of what we end up doing in early conversations is gently reframing this picture. The current state of the art is not a chatbot, it is an agent, or more often a small system of agents working together, implementing a workflow by chaining simple actions and responsibilities together. The difference is not cosmetic. A chatbot answers. An agent acts. An agent can call tools, retrieve information, take steps, ask for confirmation, and bring something back, and its work can be evaluated at each of those levels. A chatbot keeps a conversation going. An agent finishes a task and ideally accomplishes an outcome.

This distinction matters because it changes what AI can be put in charge of. The economic argument shifts. Chatbots compress the cost of answering. Agents compress the cost of doing. The second one is where most of the new revenue and new capability stories actually live. Another way to think about it is that every major infrastructure shift follows the same pattern: the raw technology lands and, while impressive, does not transform anything on its own. The applications built on top of it do. Electricity needed appliances, the internet needed software, AI needs workflows and agents. Having access is not the same as using it productively.

That said, agents are not a clean win. They also fail, and they fail in ways that are harder to detect than traditional software failures. Gartner predicts that by 2027, 40% of organizations will demote or decommission AI agents because of governance struggles. The root cause is that most organizations treat agent governance as binary, either fully locked down or fully trusted. Over-restriction slows delivery and drives shadow development. Under-restriction increases operational and compliance risk. Only one in five companies, according to Deloitte, has a mature governance model for autonomous agents. This is not an argument against agents, it is an argument for building them with observability, guardrails, evals, and proportional governance from the start. Which is where engineering comes in.
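One way to picture proportional governance, as opposed to the binary locked-down or fully-trusted stance described above, is a policy that scales the level of control with the risk of each action. The following is a minimal sketch under stated assumptions: the risk tiers, the action names (search_knowledge_base, update_ticket, issue_refund), and the decide function are all hypothetical illustrations, not part of any particular platform.

```python
from enum import Enum


class Risk(Enum):
    LOW = 1      # assumed tier: read-only lookups
    MEDIUM = 2   # assumed tier: reversible writes
    HIGH = 3     # assumed tier: irreversible or customer-facing actions


class Decision(Enum):
    ALLOW = "allow"
    ALLOW_AND_LOG = "allow_and_log"
    REQUIRE_APPROVAL = "require_approval"


# Control grows with risk instead of being all-or-nothing.
POLICY = {
    Risk.LOW: Decision.ALLOW,
    Risk.MEDIUM: Decision.ALLOW_AND_LOG,
    Risk.HIGH: Decision.REQUIRE_APPROVAL,
}

# Hypothetical action catalogue; a real one would come from the client's systems.
ACTION_RISK = {
    "search_knowledge_base": Risk.LOW,
    "update_ticket": Risk.MEDIUM,
    "issue_refund": Risk.HIGH,
}


def decide(action: str) -> Decision:
    """Return the control applied to an agent action.

    Actions missing from the catalogue fall into the strictest tier,
    so a new capability is never silently trusted.
    """
    risk = ACTION_RISK.get(action, Risk.HIGH)
    return POLICY[risk]


if __name__ == "__main__":
    for name in ("search_knowledge_base", "update_ticket", "issue_refund", "delete_account"):
        print(name, "->", decide(name).value)
```

The design choice worth noting is the default: an unclassified action is treated as high risk, which keeps delivery fast for known low-risk work without opening a gap for anything new.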

3. Where engineering becomes a build, not only a slide deck

Agents that work in demos and agents that work in production are not the same thing. There is a part of AI transformation where consulting and slideware stop being enough, and you need actual engineering. We bring two things into that part of the engagement.

The first is depth in offline models and reinforcement learning. Not every problem is best served by a hosted frontier model accessed over an API. Many of our clients have data they cannot send off premises, latency budgets that public APIs cannot meet, or economics that fall apart at scale unless inference is local, tuned, and operates at a justifiable cost. For those cases, we work on smaller, specialized models, on fine-tuning and reinforcement learning approaches, and on the operational machinery to keep them honest in production. This is not a religious position about local versus hosted. On the contrary, it is a position that the right answer depends on the workload and what the customer actually needs. The fact that we can do both gives us an edge.

The second is Evolvable.ai, our agent orchestration platform. We built it because we were repeatedly running into the same pattern at clients: a few standalone bots, a few scripts, a RAG pipeline, a few workflow automations, none of them aware of each other, all of them losing context at hand-offs. Evolvable.ai gives us speed. Bring-your-own-key model access, customized offline models, built-in observability, tenant isolation, and human-in-the-loop controls are already solved problems inside the platform, which means we are not rebuilding orchestration infrastructure from scratch on every engagement. The advantage is time to first working agent, not platform lock-in.

Solving the engineering part, though, is only half of the problem. Once agents start completing tasks, the way the organization works starts to shift. In the pre-AI era, tasks and people were welded together. Managing people was how you managed work, and the proxy held. In the agentic era, that proxy is starting to break. Task outcomes can be monitored directly and quickly, with timing, evidence, and clear records of success or failure. Monitoring people that closely, especially in more creative industries, does not produce the same results. Recent research points in the same direction: MIT's Iceberg Index proposes measuring AI exposure at the task and skill level rather than the job level, precisely because traditional people-centered metrics cannot see where the real overlap between human and AI capability occurs. The implication is that leaders need different instruments for each: observability and guardrails for task pipelines, trust and judgment for the people who design, supervise, and improve them.

Gartner's May 2026 research confirms this as well: the organizations improving ROI from autonomous technologies are not those eliminating people, but those investing in skills, roles, and operating models that allow humans to guide autonomous systems. A downstream effect we are already observing is that team structures flatten because scope of contribution broadens. More people can touch more parts of the work, and the skills that rise in value are taste, judgment, and the ability to frame problems well enough that another person, or an agent, can understand and act on them.

4. How we start

Everything above describes what we know and what we can build. None of it matters if we cannot prove it in a client's actual context. This is why we never start with a large commitment.

When we engage with a client, we ask them a simple question: which problem can we solve first to demonstrate value, for which user, with which constraints, and how will we know it worked? We choose something small and doable: not a demo on someone else's data, not a slide. We are looking for a specific situation where an agent can start doing a real piece of their work, end to end, with the rough edges visible.

For task automation specifically, we instrument three things before we begin: the baseline cost or time of the current process, the same metric after the agent is running, and the error or escalation rate that tells us whether quality held. If we cannot define those three before the engagement starts, we treat the problem as not yet ready for AI. Conway's law is real, after all: systems end up reflecting the communication structures of the organizations that build them, and AI architectures are no exception. This is why we shape the engagement around the organization as it actually exists, not as we wish it were.

That early win is how we earn the trust that our way of working produces tangible benefits, before scope expands into anything larger.
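The three measurements described earlier in this section (baseline cost or time, the same metric with the agent running, and the error or escalation rate) can be sketched as a small comparison. This is an illustration under stated assumptions, not our actual tooling: the ProcessMetrics fields, the compare function, the 0.02 tolerance for escalation increase, and all of the numbers are made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProcessMetrics:
    """One measurement window for a process (hypothetical field names)."""
    tasks: int          # tasks handled in the window
    total_cost: float   # total cost (or minutes) spent on those tasks
    escalations: int    # tasks that errored or were escalated to a human

    def __post_init__(self) -> None:
        if self.tasks <= 0:
            raise ValueError("a measurement window needs at least one task")

    @property
    def cost_per_task(self) -> float:
        return self.total_cost / self.tasks

    @property
    def escalation_rate(self) -> float:
        return self.escalations / self.tasks


def compare(baseline: ProcessMetrics, with_agent: ProcessMetrics,
            max_escalation_increase: float = 0.02) -> dict:
    """Compare an agent run against the baseline on the three measurements.

    max_escalation_increase is an assumed tolerance: how much the
    escalation rate may rise before we say quality did not hold.
    """
    cost_saving = 1 - with_agent.cost_per_task / baseline.cost_per_task
    escalation_delta = with_agent.escalation_rate - baseline.escalation_rate
    return {
        "cost_saving": cost_saving,
        "escalation_delta": escalation_delta,
        "quality_held": escalation_delta <= max_escalation_increase,
    }


# Illustrative numbers only, not client data.
baseline = ProcessMetrics(tasks=1000, total_cost=8000.0, escalations=50)
with_agent = ProcessMetrics(tasks=1000, total_cost=3000.0, escalations=60)
result = compare(baseline, with_agent)

if __name__ == "__main__":
    print(f"cost saving per task: {result['cost_saving']:.1%}")
    print(f"escalation rate change: {result['escalation_delta']:+.1%}")
    print(f"quality held: {result['quality_held']}")
```

If any of the three inputs cannot be filled in for a candidate process, that is the signal mentioned above that the problem is not yet ready.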

5. The cost of not starting

The organizations that delay this work do not stay still. They accumulate disconnected AI tools, uncoordinated agents, and organizational assumptions that were designed for a world where tasks and people were the same thing. Gartner projects task-specific AI agents will jump from less than 5% of enterprise applications in 2025 to 40% by the end of 2026. The question for most organizations is no longer whether agents will arrive in their workflows. It is whether they will arrive in a structured and controllable way, with governance, measurement, and intent, helping the business achieve its goals, or chaotically, leading to distrust, skepticism, and eventually rejection.

Each quarter that passes without addressing this makes the eventual changes more challenging. Not because the technology gets harder to embrace, but because the communication lines, the structures, and the expectations calcify around old assumptions and habits. By the time the gap between what the tools can do and how the organization uses them becomes visible, it is already a harder problem to solve.

If you want to pressure-test where you are, we can usually tell within a single working session whether your current AI investments are producing measurable outcomes or whether the measurement apparatus itself is missing. That session costs nothing except honesty about what has actually worked so far.