The AI Gold Rush is Real, But Most People are Looking in the Wrong Spots

Mark Jackson

Everyone’s chasing the AI gold rush, but most are digging in the wrong places. S&P Global found 42% of firms abandoned most AI initiatives in 2025, up from 17% a year prior, as flashy demos stall at security and cost reviews.

The map that works isn't more experiments; it's revenue-adjacent automations and AI agents embedded in the business. This guide covers the production stack, governance checklists, and model choices to build over your own data, slash token spend, and ship AI that delivers ROI, not just another pilot.

1. The AI Gold Rush Is Real: 2025 Proof

Enterprise adoption is no longer the debate—it’s the plan. McKinsey reports 78% of companies are using AI in at least one business function. For budget owners, this means AI line items are moving from “innovation” to operating spend, with pressure to prove EBITDA impact, not demos. Expect CFO review on model choice, data sources, and measured lift.

Costs collapsed, so more use cases clear the hurdle. The Stanford AI Index shows GPT-3.5-level inference fell from $20 → $0.07 per million tokens between Nov 2022 and Oct 2024—about 280× lower. That resets what’s viable: high-volume text classification, summarization, templated drafting, and retrieval-grounded Q&A now pencil out at scale. The economic story of 2025 isn’t new tricks; it’s cheap tokens.
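
To make that concrete, a quick back-of-the-envelope (the document volume and tokens per document below are illustrative assumptions, not benchmarks) shows why high-volume text work now clears the bar:

```python
# Back-of-the-envelope at ~$0.07 per 1M tokens (the GPT-3.5-level price cited above).
# Document volume and tokens per document are illustrative assumptions.
price_per_mtok = 0.07
docs = 10_000_000          # e.g. a year of support tickets to classify
tokens_per_doc = 500       # prompt + short label output

total_mtok = docs * tokens_per_doc / 1_000_000
print(f"~${total_mtok * price_per_mtok:,.0f} to process {docs:,} documents")
# ~$350 for 10,000,000 documents
```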

Infra ramp is tangible—and priced. You can rent serious GPU capacity by the hour. Example: CoreWeave lists 8× H100 nodes at $49.24/hr and 8× H200 at $72.67/hr; larger racks and B200/GB200 options exist for heavier inference. If you just need single-GPU on-demand, Lambda Cloud advertises H100 starting around $1.85–$2.40/GPU-hr and B200 $2.99–$3.79/GPU-hr depending on term/cluster. This matters because you can right-size compute to the job instead of over-committing to hyperscalers.

2. Where Value Accrues: Data, Distribution & Inference (Not Training)

Inference is the profit center. NVIDIA’s Blackwell family emphasizes lower cost per million tokens and higher throughput versus Hopper, and MLPerf entries show large step-ups in efficiency. Translation: the money is in serving smarter, not training bigger. Your “picks & shovels” in 2025 are batching, caching, quantization, and tuned backends such as TensorRT-LLM and vLLM, not custom pretraining.
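
As a minimal sketch of what "serving smarter" looks like in practice, here is batched offline inference with vLLM's Python API; the model name is a placeholder for whatever open-weights model you actually run:

```python
# Minimal sketch: batched offline inference with vLLM (continuous batching,
# PagedAttention). The model name is a placeholder; swap in the open-weights
# model you actually serve.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # assumption: any HF-format model
params = SamplingParams(temperature=0.2, max_tokens=256)

prompts = [
    "Summarize this ticket: customer reports double billing on invoice #1042.",
    "Classify sentiment (positive/negative/neutral): 'Refund arrived fast, thanks!'",
]

# vLLM batches these internally; aggregate throughput, not per-request latency,
# is where the cost savings show up.
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```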

Distribution beats novelty. Apple’s ChatGPT integration illustrates how platform distribution sets user expectations and shifts default behavior. People will tap what is built into the OS, office suite, CRM, or browser, so ship where your users already live: Gmail/Outlook add-ins, CRM side-panels, ticketing plug-ins, and iOS/Android share sheets.

Data advantage beats model tinkering for most teams. For enterprise tasks, proprietary workflow data plus retrieval drives accuracy and trust. Retrieval-augmented generation (RAG) lets you ground answers in your docs, tickets, logs, and product data without risky model drift. Fine-tune later, after you have proven stable gains.

Use production-minded evals, such as TruLens’s RAG Triad (context relevance, groundedness, answer relevance) and Arize Phoenix traces/evals, to move beyond vibes.

Compute market reality: you are not stuck with one cloud. Live list prices are competitive, from CoreWeave’s hourly H100/H200/B200 nodes to Lambda Cloud’s aggressive H100/B200 rates, and spot or committed terms only make sense after you have measured steady-state load. What this means for AI monetization strategies: build for usage and unit economics. Tie agents to revenue or cost events, and log cost-per-resolution (support), cost-per-lead-qualified (sales), and cost-per-spec-doc (engineering). Improve those numbers with inference efficiency, not bigger pretrains.
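
A minimal sketch of how those unit-economics numbers might be logged per agent run; the field names, event types, and prices here are assumptions, not a standard schema:

```python
# Minimal sketch: tie each agent run to a business outcome and roll up
# unit economics. Field names and values are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class AgentRun:
    workflow: str              # "support", "sales", "engineering"
    tokens_in: int
    tokens_out: int
    resolved: bool             # did this run produce the business outcome?
    price_in_per_mtok: float
    price_out_per_mtok: float

    @property
    def cost(self) -> float:
        return (self.tokens_in * self.price_in_per_mtok
                + self.tokens_out * self.price_out_per_mtok) / 1_000_000

def cost_per_resolution(runs: list[AgentRun]) -> float:
    total_cost = sum(r.cost for r in runs)
    resolutions = sum(r.resolved for r in runs)
    return total_cost / max(resolutions, 1)

runs = [AgentRun("support", 3200, 450, True, 3.0, 15.0),
        AgentRun("support", 2800, 380, False, 3.0, 15.0)]
print(f"cost per resolution: ${cost_per_resolution(runs):.4f}")
```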

3. Wrong Places People Are Digging (and Why They Fail)

Training your own foundation model without scale/data. Token prices fell at the API layer, not the training layer. Unless you have unique data and real scale, bespoke pretraining burns cash and time with little differentiation. Stanford’s 2025 Index quantifies the dramatic inference cost decline—that’s where your advantage should compound.

Generic chat UIs with no system integration. Sensor Tower shows AI usage and spend booming, but winners ride distribution and real tasks. “Chat in a box” without calendar, CRM, or policy context sees weak retention. Build task-specific flows and deep integrations, not just a chat window.

Pilots that don’t touch revenue or cost. McKinsey highlights a gap between experimentation and bottom-line impact. Pilots stuck in innovation labs don’t pay back. Tie each build to a P&L lever and measure it.

4. Right Places to Dig: Revenue-Adjacent Automations (With Proof)

Klarna reports its AI assistant now handles roughly two-thirds of customer chats, resolves them in under 2 minutes, and delivers productivity equal to about 700 full-time equivalents (FTEs). That is sustained operating leverage, not just a demo. A Microsoft-GitHub randomized trial found developers finished tasks 55.8% faster with Copilot; use that blueprint for internal return on investment (ROI) stories: define the tasks clearly, randomize the trial participants, measure time-to-complete and quality, then extrapolate the findings to loaded costs.

Forrester’s Total Economic Impact (TEI) study on Microsoft 365 Copilot estimated 116% ROI, with concrete value levers such as fewer switch-costs, faster drafting, and better meeting summaries. Treat this as executive-friendly evidence for now, and collect your own internal telemetry. Beyond support, Klarna also reported multi-function savings, with marketing copy and creative efficiencies highlighted in the press. Stack impact with cross-functional agents across support, marketing, and finance operations.

To run it next month: pick one revenue-adjacent workflow (support replies, sales notes, or renewal nudges), ground the agent in your data with Retrieval-Augmented Generation (RAG) over policies, Stock Keeping Units (SKUs), and contracts, and define clear acceptance criteria for accuracy, latency, guardrails, and cost per interaction. Log outcomes: resolution time, escalation rate, Customer Satisfaction (CSAT), and cost per ticket. Iterate weekly and expand the scope only when the key metrics hold steady.
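
As a rough illustration, an acceptance gate like the one below keeps "expand only when the metrics hold" from becoming a judgment call; the thresholds and metric names are assumptions, not recommendations:

```python
# Rough sketch: gate scope expansion on acceptance criteria.
# Thresholds and metric names are illustrative assumptions.
WEEKLY_GATE = {
    "accuracy": 0.92,              # fraction of audited answers judged correct
    "p95_latency_s": 4.0,          # 95th-percentile response time
    "escalation_rate": 0.25,       # max share of chats handed to a human
    "cost_per_interaction": 0.40,  # USD
}

def passes_gate(metrics: dict) -> bool:
    return (metrics["accuracy"] >= WEEKLY_GATE["accuracy"]
            and metrics["p95_latency_s"] <= WEEKLY_GATE["p95_latency_s"]
            and metrics["escalation_rate"] <= WEEKLY_GATE["escalation_rate"]
            and metrics["cost_per_interaction"] <= WEEKLY_GATE["cost_per_interaction"])

week = {"accuracy": 0.94, "p95_latency_s": 3.1,
        "escalation_rate": 0.18, "cost_per_interaction": 0.22}
print("expand scope" if passes_gate(week) else "hold and iterate")
```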

5. Model Strategy 2025: Use the Smallest Model That Hits KPIs

Start cheap, move up only if blocked. Use cost-effective models that meet your acceptance criteria, then graduate. Current public pricing examples:

  • Claude Sonnet (4/4.5/3.7 family): $3 input / $15 output per 1M tokens; prompt caching/batching lowers this further.
  • Google Gemini 2.5 Pro (dev API): published ranges show $0.625–$1.25 per 1M input tokens depending on context size and $5–$7.50 per 1M output; lower-cost 2.5 Flash and Flash-Lite are much cheaper.
  • OpenAI API: pricing varies by model generation; check the live page before locking budgets.
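
To see what those list prices mean per request, a back-of-the-envelope calculation is enough for budgeting; the token counts below are made-up assumptions for a typical RAG-grounded answer:

```python
# Back-of-the-envelope: per-request cost at published per-1M-token list prices.
# Token counts are assumptions for a typical RAG-grounded answer.
PRICES = {  # (input $/1M tokens, output $/1M tokens)
    "claude-sonnet": (3.00, 15.00),
    "gemini-2.5-pro": (1.25, 7.50),  # upper end of the published range
}

def request_cost(model: str, tokens_in: int, tokens_out: int) -> float:
    p_in, p_out = PRICES[model]
    return (tokens_in * p_in + tokens_out * p_out) / 1_000_000

for model in PRICES:
    # ~4,000 tokens of retrieved context + prompt, ~500 tokens of answer
    print(model, f"${request_cost(model, 4_000, 500):.4f} per request")
```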

Pull every pricing lever before “going bigger”. Use context caching, continuous batching, KV-cache reuse, and speculative decoding. Vendors now document these levers directly (e.g., Gemini context caching; Anthropic prompt caching; TensorRT-LLM KV-reuse). These often beat jumping to a pricier model.

Keep vendor optionality. Abstract your client, keep eval gates per task, and swap models without regressions. Track cost-of-pass (cost to get a correct answer). Research in 2025 formalizes this metric—use it in your dashboards.
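
One way to keep that optionality is a thin client interface of your own, so backends can be swapped behind per-task eval gates. A minimal sketch; the Protocol, classes, and the numbers fed to the cost-of-pass helper are assumptions, not any vendor's SDK or measured results:

```python
# Minimal sketch of vendor optionality: your own thin interface, plus a
# cost-of-pass helper for the dashboard. Names and numbers are illustrative.
from typing import Protocol

class Completion(Protocol):
    def complete(self, prompt: str) -> str: ...

class CheapModel:
    def complete(self, prompt: str) -> str:
        return "stubbed answer from the small model"   # call your cheap model here

class BigModel:
    def complete(self, prompt: str) -> str:
        return "stubbed answer from the larger model"  # call the pricier model here

def cost_of_pass(cost_per_attempt: float, pass_rate: float) -> float:
    """Expected spend to get one correct answer; a useful dashboard metric."""
    return cost_per_attempt / max(pass_rate, 1e-9)

# Pick the backend with the better measured cost-of-pass for this task,
# using eval results you have already gathered (numbers below are made up).
small = cost_of_pass(cost_per_attempt=0.003, pass_rate=0.88)
large = cost_of_pass(cost_per_attempt=0.020, pass_rate=0.97)
client: Completion = CheapModel() if small <= large else BigModel()
print(client.complete("Draft a renewal reminder for account #1042."))
```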

6. Build a Data Advantage: Retrieval > Fine-Tuning for Most Workflows

Start with Retrieval-Augmented Generation (RAG) for enterprise knowledge. Most business questions, policy answers, and case resolutions need your existing documents, not new model weights. Begin with clean content preparation, sensible chunking and metadata, and strong retrieval; fine-tune only after RAG alone has stopped delivering stable gains.
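
A minimal retrieval sketch, using TF-IDF purely as a stand-in for whatever embedding store you actually run; the document chunks and prompt template are assumptions:

```python
# Minimal RAG sketch: retrieve the most relevant chunks, then ground the prompt.
# TF-IDF stands in for your real embedding store; docs and template are made up.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

chunks = [
    "Refunds are issued within 5 business days of an approved return.",
    "Enterprise plans include 24/7 support with a 1-hour response SLA.",
    "Annual contracts renew automatically unless cancelled 30 days prior.",
]

vectorizer = TfidfVectorizer().fit(chunks)
chunk_vecs = vectorizer.transform(chunks)

def retrieve(question: str, k: int = 2) -> list[str]:
    scores = cosine_similarity(vectorizer.transform([question]), chunk_vecs)[0]
    top = scores.argsort()[::-1][:k]
    return [chunks[i] for i in top]

question = "When does my contract renew?"
context = "\n".join(retrieve(question))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
print(prompt)  # send to whichever model clears your eval gates
```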

Evaluate the system as if it were already in production. TruLens’s RAG Triad scores context relevance, groundedness, and answer relevance, so you can gate releases on evidence rather than sentiment, while Arize Phoenix adds traces for debugging, evaluation comparisons across models, and dataset curation for repeatable experiments. Close the feedback loop: log incoming queries and retrieved chunks, record user votes or corrections, and capture ultimate outcomes (solved or not solved). Promote new prompts and retrievers only when groundedness and answer relevance clear your thresholds. That rigor is what builds a durable data moat over time.

7. Ship, Observe, Improve: LLM Observability in Practice

What to log: prompt/response traces, tool calls, retrieval contents, latency, token spend, and hallucination flags. Add task outcomes (sales lead qualified? refund handled correctly?).
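
As a sketch of what one logged record might look like (field names are assumptions, not any vendor’s schema):

```python
# Sketch of one trace record per request; field names are illustrative only.
from dataclasses import dataclass, field

@dataclass
class TraceRecord:
    request_id: str
    prompt: str
    response: str
    tool_calls: list[str] = field(default_factory=list)
    retrieved_chunks: list[str] = field(default_factory=list)
    latency_ms: int = 0
    tokens_in: int = 0
    tokens_out: int = 0
    hallucination_flag: bool = False
    task_outcome: str | None = None  # e.g. "ticket_resolved", "refund_correct"

record = TraceRecord(
    request_id="req-001",
    prompt="What is our refund window?",
    response="Refunds are issued within 5 business days of an approved return.",
    retrieved_chunks=["Refunds are issued within 5 business days..."],
    latency_ms=820, tokens_in=1450, tokens_out=42,
    task_outcome="ticket_resolved",
)
```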

Tools to use now:

  • WhyLabs (LangKit): monitors prompt risk/safety and output quality signals.
  • Arize Phoenix: open-source traces, evals, comparisons, and regression testing.

Cadence that works:

  • Nightly offline evals on a fixed test set (prevent silent regressions).
  • Weekly business KPI review (e.g., cost per ticket, time-to-resolution, CSAT).
  • Monthly security/compliance audit of prompts, data sources, and PII handling.
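
A bare-bones version of that nightly check might look like this; the baseline scores and the tolerance are illustrative assumptions:

```python
# Bare-bones nightly regression check against a fixed test set's baseline scores.
# Baseline values and the tolerance are illustrative assumptions.
BASELINE = {"groundedness": 0.91, "answer_relevance": 0.88, "context_relevance": 0.90}
TOLERANCE = 0.02  # fail the build if any score drops more than this

def check_regression(tonight: dict) -> list[str]:
    return [metric for metric, base in BASELINE.items()
            if tonight.get(metric, 0.0) < base - TOLERANCE]

tonight = {"groundedness": 0.87, "answer_relevance": 0.89, "context_relevance": 0.90}
failed = check_regression(tonight)
if failed:
    print("silent regression in:", failed)
else:
    print("no regressions")
```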

8. Stay Legal: The 2025–2027 AI Compliance Timeline (EU AI Act) + ISO 42001

EU AI Act key dates to plan around:

  • Feb 2, 2025: prohibitions and AI literacy provisions begin.
  • Aug 2, 2025: GPAI (general-purpose AI) obligations start.
  • Aug 2, 2026: broader application and most provider duties.
  • Aug 2, 2027: high-risk embedded systems deadlines.
    Action: risk-classify your use cases now and map obligations per date.
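
A lightweight way to start that mapping is a plain risk register keyed by use case; the classifications and dates below are placeholders to structure the exercise, not legal advice:

```python
# Lightweight sketch of an EU AI Act risk register; classifications and dates
# are placeholders to structure the exercise, not legal advice.
RISK_REGISTER = {
    "support_reply_drafting":  {"risk_tier": "limited", "obligations_from": "2026-08-02"},
    "cv_screening_assistant":  {"risk_tier": "high",    "obligations_from": "2026-08-02"},
    "internal_code_assistant": {"risk_tier": "minimal", "obligations_from": None},
}

for use_case, entry in RISK_REGISTER.items():
    print(f"{use_case}: {entry['risk_tier']} risk, obligations from {entry['obligations_from']}")
```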

ISO/IEC 42001 is the first AI management system standard. Use it as the backbone for your program: policy, risk, data governance, monitoring, and improvement cycles. An auditable structure helps you pass vendor reviews and procurement.

Practical checklist: maintain model cards with eval evidence, data lineage, DPIAs where needed, human-in-the-loop points, and incident response. Treat this as “guardrails that let you ship,” not paperwork that blocks progress.

9. Compute Strategy That Doesn’t Sink the P&L

Pick the right Graphics Processing Unit (GPU) tier for the task: match H100, H200, or B200 to the workload and the latency you have promised, and be realistic about current market prices. CoreWeave posts per-node hourly rates (for example, 8x H100 at $49.24 per hour), and Lambda Cloud lists H100 and B200 on-demand and cluster prices that are competitive with the major hyperscalers. Quote current numbers in your internal documentation and revisit them monthly.

Plan with software economics, not just hardware Stock Keeping Units (SKUs). Batch inference, Key-Value (KV)-cache reuse, speculative decoding, and quantization all cut cost and latency; NVIDIA’s TensorRT-LLM docs detail real wins, and vLLM’s PagedAttention and continuous batching raise throughput by using memory more intelligently. Track every measured gain in a dedicated cost dashboard.

This optimization pays precisely because token prices are already low: throughput and cost per million tokens (MTok) are your new margins. Treat inference optimization as a product, and ship measured performance profiles, not unverified guesses.
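
For self-hosted serving, the margin math is simple enough to sketch; the throughput figure below is a placeholder you would replace with your own measured tokens per second:

```python
# Sketch: cost per million output tokens for self-hosted serving.
# The throughput number is a placeholder; use your own measured tokens/sec.
NODE_PRICE_PER_HOUR = 49.24       # e.g. an 8x H100 node at list price
MEASURED_TOKENS_PER_SEC = 12_000  # assumption: aggregate throughput after batching

tokens_per_hour = MEASURED_TOKENS_PER_SEC * 3600
cost_per_mtok = NODE_PRICE_PER_HOUR / (tokens_per_hour / 1_000_000)
print(f"${cost_per_mtok:.2f} per 1M tokens")  # ~$1.14 at these assumptions
```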

10. Distribution & Partnerships: The Fastest Way to Durable Moats

Distribution & Partnerships: The Fastest Way to Durable Moats
Photo Credit: FreePik

Ship where users already work. The Apple-ChatGPT tie-up proves platform channels move the needle. Mirror that: integrate into suites (M365/Google Workspace), CRM, help desk, browsers, and mobile share sheets. People click what’s one tap away.

Partner for trust and reach. Co-sell via cloud marketplaces for procurement, billing, and security reviews. Publish clear data-handling docs and model cards to pass vendor risk quickly.

Playbooks that compound:

  • Office add-ins that draft, summarize, and file with correct metadata.
  • CRM side-panels that fetch context, log notes, and suggest next steps.
  • Support tools that answer, cite, and escalate with guardrails.