Every model generation unlocked capabilities that were not possible before. Here is the journey from research prototypes to 7 production AI agents managing a £6.4M transformation.
I have been building data products since the late 1990s. I have seen every hype cycle. So when Claude 3 Opus launched in early 2024, I did not get excited — I tested it against actual business problems.
The specific test was pricing research. I fed it six months of transaction data from a mid-market retailer and asked it to identify price elasticity patterns by category. Previous models had produced plausible-sounding but numerically unreliable analysis. Claude 3 Opus was the first model that consistently got the directional economics right. It correctly identified that the client's accessories category was price-inelastic (customers bought at full price) while their basics category was highly elastic (small discounts drove disproportionate volume).
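To make the elasticity test concrete, here is a minimal sketch of one common way to estimate it: the slope of log quantity against log price within each category. It is not the prompt or method used on the client engagement. The function name `elasticity_by_category` and the sample figures are illustrative assumptions, not client data.

```python
import math
from collections import defaultdict


def elasticity_by_category(rows):
    """Estimate price elasticity per category.

    rows: iterable of (category, price, quantity) observations.
    Returns {category: slope of ln(quantity) regressed on ln(price)}.
    """
    grouped = defaultdict(list)
    for category, price, quantity in rows:
        if price > 0 and quantity > 0:  # logarithms need positive values
            grouped[category].append((math.log(price), math.log(quantity)))

    result = {}
    for category, points in grouped.items():
        n = len(points)
        mean_x = sum(x for x, _ in points) / n
        mean_y = sum(y for _, y in points) / n
        var_x = sum((x - mean_x) ** 2 for x, _ in points)
        if var_x == 0:
            continue  # price never varied, so no slope can be estimated
        cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
        result[category] = cov_xy / var_x
    return result


# Made-up observations for illustration only.
sample = [
    ("accessories", 20.0, 100), ("accessories", 18.0, 103), ("accessories", 16.0, 105),
    ("basics", 10.0, 200), ("basics", 9.0, 260), ("basics", 8.0, 340),
]
estimates = elasticity_by_category(sample)
for category, value in estimates.items():
    label = "elastic" if abs(value) > 1 else "inelastic"
    print(f"{category}: {value:.2f} ({label})")
```

An absolute value below 1 marks a category as inelastic, and above 1 as elastic. Real transaction data would also need controls for seasonality and promotions, which this sketch leaves out.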
I also used it for early customer segmentation prototypes. The model could reason about behavioural patterns — distinguishing between a high-value customer who had stopped buying (lapsing) and a low-value customer who had never bought much (low-engagement). That distinction is critical for CDP design because you treat them completely differently.
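The lapsing versus low-engagement distinction can be sketched as a simple rule over spend and recency. The thresholds and field names below are hypothetical placeholders chosen for illustration. A real CDP would tune them per business.

```python
from dataclasses import dataclass


@dataclass
class Customer:
    customer_id: str
    lifetime_spend: float        # total historical spend
    days_since_last_order: int   # recency of the latest purchase


# Hypothetical thresholds, for illustration only.
HIGH_VALUE_SPEND = 500.0
LAPSE_AFTER_DAYS = 180


def segment(customer: Customer) -> str:
    """Separate a lapsing high-value customer from a low-engagement one."""
    high_value = customer.lifetime_spend >= HIGH_VALUE_SPEND
    inactive = customer.days_since_last_order >= LAPSE_AFTER_DAYS
    if high_value and inactive:
        return "lapsing"            # was valuable, has stopped buying
    if high_value:
        return "active-high-value"  # valuable and still buying
    return "low-engagement"         # never bought much, whatever the recency


print(segment(Customer("c1", 1200.0, 240)))
print(segment(Customer("c2", 60.0, 240)))
```

Both example customers have been inactive for the same number of days, yet they land in different segments. That is the point of the distinction.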
The limitation was speed and cost. Running these analyses at scale was prohibitively expensive, and the latency meant real-time applications were out of the question. But as a proof of concept, Claude 3 Opus convinced me that serious AI-powered business analysis was viable. That was the moment I started designing what would become MarginOps.
Claude 3.5 Sonnet solved the economics problem. It was faster and cheaper than Claude 3 Opus while being nearly as capable for the specific analytical tasks I cared about. That combination of quality sufficient for production and cost low enough for scale was the unlock.
With 3.5 Sonnet, I could run multiple concurrent agent deployments without the API costs eating the business case. The first production prototype was a customer segmentation agent that processed the full customer database nightly. At Claude 3 Opus pricing, running that agent would have cost more than the £40K vendor it was designed to replace. At 3.5 Sonnet pricing, the annual API cost was under £8K. The economics worked.
I also started building the pricing analysis pipeline during this period. The agent could process 15,000 products in a batch, running category-level elasticity analysis and generating markdown recommendations. The quality was good enough for human review — a senior merchandiser could scan the recommendations and approve or override in minutes rather than building the analysis from scratch. That hybrid workflow — AI generates, human validates — became the template for how MarginOps operates.
The coding capabilities in 3.5 Sonnet were also a step change. I started using it to write data pipeline code, ETL scripts, and monitoring dashboards. It was not yet good enough to build complete production systems autonomously, but it dramatically accelerated the development of the infrastructure that the agents would eventually run on.
Opus 4 was the inflection point. Everything before it was prototypes and human-in-the-loop workflows. Opus 4 was the first model I trusted to run autonomously in production.
The agentic capabilities were the difference. Opus 4 could execute multi-step workflows reliably: retrieve data from a database, analyse it, make a decision, execute an action, verify the outcome, and handle errors gracefully if something went wrong. Previous models could do each step individually, but the end-to-end reliability was too low for unsupervised operation. Opus 4 pushed that reliability above the threshold where autonomous operation made business sense.
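The retrieve, analyse, decide, execute, verify loop described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the MarginOps implementation. The step functions are passed in by the caller, and `run_workflow`, the retry limit, and the toy pricing example are all hypothetical.

```python
def run_workflow(retrieve, analyse, decide, execute, verify, max_attempts=3):
    """Run one multi-step workflow with retries and a final escalation."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            data = retrieve()
            findings = analyse(data)
            action = decide(findings)
            if action is None:
                return {"status": "no_action", "attempts": attempt}
            outcome = execute(action)
            if verify(action, outcome):
                return {"status": "done", "attempts": attempt, "outcome": outcome}
            last_error = "verification failed"
        except Exception as exc:  # any failed step triggers a retry
            last_error = str(exc)
    # Retries exhausted: hand over to a human instead of failing silently.
    return {"status": "escalated", "attempts": max_attempts, "error": last_error}


# Toy example: the first database read fails, the second succeeds.
calls = {"n": 0}


def flaky_retrieve():
    calls["n"] += 1
    if calls["n"] == 1:
        raise ConnectionError("database unavailable")
    return [12.0, 11.5, 30.0]


result = run_workflow(
    retrieve=flaky_retrieve,
    analyse=lambda prices: max(prices),
    decide=lambda peak: {"cap_price_at": 20.0} if peak > 20.0 else None,
    execute=lambda action: {"applied": action["cap_price_at"]},
    verify=lambda action, outcome: outcome["applied"] == action["cap_price_at"],
)
print(result["status"], result["attempts"])
```

The design choice that matters for unsupervised operation is the last line of `run_workflow`: when retries run out, the workflow reports an escalation rather than raising or returning nothing.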
This is when the 7 production agents became viable. The pricing agent started running its 7-check monitor every 15 minutes without human supervision. The CX agent began handling first-line customer support queries autonomously, escalating only when confidence was low. The DevOps agent started monitoring infrastructure and executing routine optimisations without intervention.
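A monitoring cycle of this kind can be sketched as a list of named checks run against a data snapshot. The seven checks themselves are not listed in this article, so the two below are hypothetical placeholders, as are the field names and the 10% margin floor.

```python
from typing import Callable, Dict, List, Tuple

Check = Tuple[str, Callable[[Dict[str, float]], bool]]

INTERVAL_SECONDS = 15 * 60  # the 15-minute cadence described above

# Placeholder checks, assumed for illustration only.
CHECKS: List[Check] = [
    ("price_above_cost", lambda s: s["price"] > s["unit_cost"]),
    ("margin_above_floor", lambda s: (s["price"] - s["unit_cost"]) / s["price"] >= 0.10),
]


def run_checks(snapshot: Dict[str, float], checks: List[Check] = CHECKS) -> List[str]:
    """Return the names of the checks that failed for this snapshot."""
    failed = []
    for name, check in checks:
        try:
            ok = check(snapshot)
        except Exception:
            ok = False  # a check that crashes is treated as a failure
        if not ok:
            failed.append(name)
    return failed


print(run_checks({"price": 10.0, "unit_cost": 9.5}))
print(run_checks({"price": 10.0, "unit_cost": 6.0}))
```

A scheduler such as cron, or a loop that sleeps for `INTERVAL_SECONDS`, would call `run_checks` once per cycle and escalate to a human only when the returned list is not empty.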
The frontier coding quality also meant I could build production systems, not just prototypes. The database optimisation that delivered a 59,000x query speedup was built during this period. Opus 4's ability to reason about query execution plans, identify bottlenecks, and generate optimised SQL was genuinely impressive — it found optimisations that experienced DBAs had missed.
Opus 4.5 was an evolution, not a revolution, but the improvements landed exactly where MarginOps needed them. SWE-bench Verified hit 80.9%, and that number translated directly into my ability to ship production code faster and with fewer bugs.
The cloud migration was the showcase. We migrated the client from an over-provisioned infrastructure to a right-sized architecture, achieving a 60% cost reduction. This involved rewriting deployment configurations, optimising database queries, restructuring caching layers, and updating monitoring. With Opus 4.5 powering Claude Code, I could execute these changes with confidence. The model understood infrastructure-as-code, could reason about the cascading effects of configuration changes, and generated production-ready code that passed CI pipelines on the first attempt more often than not.
The database optimisations also accelerated during this period. Opus 4.5 could analyse a Snowflake query execution plan, identify that a particular join was triggering a full table scan on a 400M-row table, and generate an optimised query with appropriate clustering keys and materialised views. The 59,000x speedup on the critical product analytics query — from 47 seconds to 0.8 milliseconds — was the headline result, but the model delivered dozens of smaller optimisations that collectively saved hours of daily compute time.
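The headline multiple follows directly from the two timings quoted above, as this short check shows. The variable names are illustrative.

```python
before_seconds = 47.0     # original query time
after_seconds = 0.8e-3    # optimised query time: 0.8 milliseconds

speedup = before_seconds / after_seconds
speedup_thousands = round(speedup / 1000)

print(f"speedup is roughly {speedup:,.0f}x, or about {speedup_thousands},000x")
```

The exact ratio is 58,750, which rounds to the 59,000x figure.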
Workplace task performance also improved noticeably. Report generation, email drafting, meeting summarisation, project documentation — the operational overhead of running a consulting engagement dropped significantly. More time on analysis and delivery, less time on administration.
Opus 4.6 is the model that made the complete MarginOps transformation programme possible at scale. Two capabilities changed everything: agent teams and the 1M token context window.
Agent teams let me deploy a coordinator agent that manages sub-agents across all 7 workstreams simultaneously. Before this, I was the coordinator. I would context-switch between pricing analysis and cloud costs and customer segmentation and warehouse operations, manually synthesising the cross-workstream dependencies. Now the coordinator agent handles that synthesis, and it finds connections I would have missed — like the relationship between vendor API call volume and cloud infrastructure costs that saved an additional £40K annually.
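The synthesis step can be sketched as a coordinator that collects findings from each workstream and surfaces any entity mentioned by more than one of them. This is a toy illustration of the idea, not the agent-teams feature itself. The function `cross_workstream_links`, the entity tags, and the sample findings are all hypothetical.

```python
from collections import defaultdict


def cross_workstream_links(findings):
    """Find entities that appear in more than one workstream.

    findings: {workstream: [(entity, note), ...]}
    Returns {entity: {workstream: note}} for entities with two or more workstreams.
    """
    by_entity = defaultdict(dict)
    for workstream, items in findings.items():
        for entity, note in items:
            by_entity[entity][workstream] = note
    return {entity: notes for entity, notes in by_entity.items() if len(notes) > 1}


# Made-up sub-agent findings for illustration only.
findings = {
    "vendors": [("vendor_api_calls", "call volume drives the usage-based fee")],
    "cloud": [
        ("vendor_api_calls", "outbound calls account for much of the egress cost"),
        ("cache_layer", "cache hit rate is low"),
    ],
    "pricing": [("markdown_rules", "rules overlap across categories")],
}

links = cross_workstream_links(findings)
for entity, notes in links.items():
    print(entity, "->", sorted(notes))
```

In this example only `vendor_api_calls` is shared, linking the vendor and cloud workstreams in the same way as the dependency described above.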
The 1M token context window means I can load an entire business into a single session: two years of P&L data, the full vendor contract portfolio, the complete product catalogue, the customer database schema, and the infrastructure architecture. No more chunking, no more lost context between sessions. The transformation audit went from three weeks to three or four days. Not because the analysis is less thorough, but because the model does not forget what it read two hours ago.
Leading results on the Finance Agent, TaxEval, and BigLaw Bench benchmarks mean the analytical quality matches the scale. When I am running 119 EBITDA-tracked initiatives across 7 workstreams, every analysis needs to be trustworthy. Opus 4.6 delivers that consistently: our internal accuracy rate on financial analysis improved approximately 14% over Opus 4.
The results of the full programme: +77% revenue from AI pricing. 60% cloud cost reduction. CSAT from 59% to 80%. Seven production agents. 119 initiatives tracked to EBITDA. £6.4M in deployed transformation value. All on Claude.
Each generation did not just add incremental capability. Each one unlocked an entirely new category of work. Claude 3 proved the analysis was possible. 3.5 Sonnet proved the economics worked. Opus 4 proved agents could run in production. Opus 4.5 proved complex infrastructure work could be AI-assisted. Opus 4.6 proved that full-scale transformation programmes could be AI-powered.
If the trajectory continues, and the pace of improvement suggests it will, the next generation will unlock capabilities I cannot fully anticipate. What I can say is that operators who adopt each generation first compound their advantage. I have been building on Claude since Claude 3 Opus. That head start means MarginOps has production agent architectures, tested workflows, and proven deployment patterns that would take a newcomer months to develop.
The lesson for anyone evaluating AI for operations work: start now, with the best model available, on real problems. Do not wait for the "perfect" model. Every generation you skip is compounding advantage you are giving to competitors who did not wait.