AI in GTM · Business Strategy · RevOps · Operations · B2B SaaS

Stop Running GTM AI as an Experiment — It's Infrastructure Now

T. Krause

When a bank reclassifies its AI spend from experimental R&D to core infrastructure, it isn't an accounting footnote. It's a decision about how the work is owned, funded, and held accountable. Most GTM orgs still run their AI on the experiment budget — and it shows.

A go-to-market leader described her AI budget to me in April with a phrase that gave the whole thing away: "innovation funding." Her AI tools — outbound agents, content systems, a revenue intelligence layer — were paid for out of a discretionary experiment pool, reviewed annually, owned by no single person, and justified each cycle by the question "is this still worth trying." Two years in, her team depended on these tools every working day. The CRM ran on them. The pipeline ran on them. And they were still, on paper and in process, an experiment.

That mismatch — production dependence funded as discretionary experiment — is the quiet dysfunction inside a lot of go-to-market orgs right now. It produces tools nobody fully owns, funding that can be cut in a tight quarter, and a standard of accountability appropriate for a pilot but dangerous for something the revenue engine depends on.

Contrast that with what large, disciplined operators have started doing. When a major bank formally reclassifies its AI investment from experimental R&D to core infrastructure — and puts a roughly $19.8 billion technology budget and 2,000 dedicated AI staff behind that reclassification — the headline is the number, but the decision is the category. Moving AI from "experiment" to "infrastructure" changes how it is owned, funded, staffed, and held accountable. Most go-to-market orgs haven't made that move, and the experiment framing is now actively costing them.

Experiment and Infrastructure Are Two Different Operating Categories

"Where the AI budget sits" sounds like an accounting question. It is actually a decision about how the work runs. The two categories impose two different operating models.

An experiment is funded to learn; infrastructure is funded to run. Experiment money buys an answer to "does this work" and is rightly reviewed against that question. Infrastructure money buys a capability the business runs on, and is reviewed against reliability and performance. When something the team depends on daily is still funded to answer "does this work," it is funded to be cancellable — and a daily dependency should not be cancellable on a discretionary review.

An experiment can be owned loosely; infrastructure cannot. A pilot can be a side project, a committee, a champion's enthusiasm. Infrastructure requires a named owner accountable for uptime, performance, and maintenance — because when infrastructure fails, work stops. AI tools left in the experiment category inherit experiment-grade ownership, which means effectively none, which is why they degrade quietly.

An experiment is judged on novelty; infrastructure on reliability. The standard for a pilot is "did we learn something interesting." The standard for infrastructure is "did it run, consistently, at the quality the business needs." A team applying the experiment standard to a production AI system tolerates flakiness it would never accept from its CRM — because the category it filed the tool under told it to.

What Breaks When Production AI Stays on the Experiment Budget

The miscategorization isn't cosmetic. It produces specific, recurring failures.

Funding instability under something load-bearing. Experiment budgets are the first cut in a tight quarter — that is what discretionary means. When the load-bearing AI tools sit in that pool, a routine cost-control exercise can knock out capability the revenue engine depends on. The business takes an outage from a decision that was never meant to be an outage decision.

No one owns the uptime. Experiment-category tools have experiment-category ownership: diffuse, optional, no one paged when it breaks. So when a production AI system degrades — quality drifts, an integration silently fails — there is no owner whose job is to notice. It degrades for weeks. The team feels the drag and can't name the cause.

No maintenance, because experiments don't get maintained. Infrastructure has a standing maintenance budget — monitoring, retraining, updating. Experiments don't; they end. A production AI tool funded as an experiment gets no maintenance line, so it slowly rots in place while the team keeps depending on it, and the rot is mistaken for the tool simply "not being that good."

An accountability standard too loose for the stakes. Experiment-grade accountability — "interesting, let's keep going" — applied to a system that touches every customer record and every piece of outbound is a genuine risk. The tool is held to a pilot's standard while doing production's job. The gap between the two standards is unmanaged exposure.

Where This Shows Up in Practice

Budget reviews. The annual review treats the AI line as discretionary and asks whether to renew the experiment. But the team can't operate without it — so the "review" is theatre, and everyone knows it, which corrodes the whole budgeting exercise. Worse, in a real squeeze, the theatre becomes real and the tool gets cut. Infrastructure does not belong in a conversation that can end it on a whim.

Incidents. A production AI tool fails — bad outputs at scale, a broken sync. Because it's experiment-category, there's no on-call owner, no runbook, no escalation path. The incident runs long and the response is improvised, because the tool was never set up to be operated, only to be tried.

Vendor management. Experiment-category tools get experiment-grade vendor terms — light SLAs, loose security review, casual data handling. Then the tool becomes load-bearing and the contract underneath it is still a pilot contract, with no guarantees proportionate to the dependency the business has built on it.

Team structure. No one's job description includes "operate the GTM AI." It's a thing people attend to between other duties, because the experiment framing never created the role. Infrastructure framing does — it forces the question of who runs this, and answers it with a name.

What to Actually Do About It

Audit which AI tools are actually load-bearing. Go through the AI stack and mark each tool: genuine experiment, or production dependency the team can't work without. Be honest. Most go-to-market orgs will find the majority of their AI is load-bearing and still filed as experiment. That list is the reclassification list.
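If it helps to make the audit concrete, the sketch below is one way to record it: illustrative Python with hypothetical tool names, budget lines, and owners, not a prescription. The point is the three fields per tool and the mismatch check, which any spreadsheet can reproduce just as well.

    # A sketch, not a prescription: tool names, budget lines, and owners below
    # are hypothetical placeholders for whatever is actually in the stack.
    from dataclasses import dataclass

    @dataclass
    class AITool:
        name: str            # e.g. "outbound-agent"
        load_bearing: bool   # could the team get through this week without it?
        budget_line: str     # "discretionary" or "operating"
        owner: str           # a named person, or "" if nobody owns it

    stack = [
        AITool("outbound-agent", load_bearing=True,  budget_line="discretionary", owner=""),
        AITool("content-system", load_bearing=True,  budget_line="discretionary", owner=""),
        AITool("pricing-pilot",  load_bearing=False, budget_line="discretionary", owner=""),
    ]

    # The reclassification list: anything the team cannot work without that is
    # still on a discretionary line or has no named owner.
    reclassify = [
        t for t in stack
        if t.load_bearing and (t.budget_line == "discretionary" or not t.owner)
    ]

    for tool in reclassify:
        print(f"{tool.name}: move to an operating budget line and assign a named owner")

Anything this flags is a tool the team runs on that is still funded and owned like a pilot; those are the candidates for the steps that follow.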

Move the load-bearing tools to a stable budget line. Take the production-dependency tools out of the discretionary pool and give them a real operating budget — one that isn't first to be cut in a squeeze. Funding stability is the practical meaning of "infrastructure." If it can be cancelled on a quarterly whim, it isn't infrastructure yet, whatever you call it.

Assign a named owner to each load-bearing system. Every production AI tool gets one accountable person — for uptime, performance, maintenance, and the vendor relationship. Not a committee. A name. The name is the difference between a system that is operated and a system that merely exists until it doesn't.

Fund maintenance explicitly. Add the standing line for monitoring, retraining, and updating. Infrastructure that isn't maintained becomes unreliable infrastructure, which is arguably worse than no infrastructure because the team is depending on it. The maintenance budget is not optional once the tool is load-bearing.

Re-paper the vendor terms to match the dependency. For every tool you reclassify, revisit the contract. The SLA, security commitments, and data terms should reflect that the business now runs on this, not that it once piloted it. A pilot contract under a production dependency is a gap waiting to be discovered at the worst time.

The Stakes

Go-to-market orgs that keep their AI in the experiment category are running production capability on a pilot's foundation. It works, until a budget cut removes it, or a silent degradation drags on the numbers for a quarter before anyone traces it, or an incident runs long because no one was ever assigned to operate the thing. None of these failures are exotic. They are the predictable cost of a category error, and they all arrive eventually.

Orgs that reclassify their load-bearing AI as infrastructure get stable funding, named owners, real maintenance, and contracts that match the dependency. The tools don't get more capable from the reclassification — they get more reliable, because they're finally being operated like something the business depends on, which is what they are. The reclassification is not paperwork. It is the decision to stop pretending the experiment ended without anyone noticing.

The experiment phase of GTM AI is over for most teams — it ended the day the tools became load-bearing, whether or not the budget caught up. Running production capability on experiment funding, experiment ownership, and experiment accountability is not caution. It is a structural fragility you've chosen by default. Find the load-bearing tools and move them. Infrastructure that's still filed as an experiment is just an outage you haven't scheduled.