April 20, 2026

What it takes to build AI agents at scale

Up until a few months ago, everyone wondered whether models were capable enough to handle complex workflows autonomously. Today, we know they are. But what does it take to scale this agentic infrastructure?

This question was the throughline at a panel VC firm Lux Capital hosted in New York, with builders from Modal, turbopuffer, and Ramp.

The conversation covered building deterministic evals, scaling compute for reinforcement learning (RL), and maintaining design simplicity to scale rapidly.

The eval suite evolves with the product

Moderator Grace Isford, partner at Lux Capital, posed a question every builder wrestles with: how do you know when an agent is ready to ship?

Alex Shevchenko, Head of Applied Research at Ramp, said there isn’t a one-size-fits-all answer.

He traced Ramp’s own arc of AI waves to illustrate. The first generation of AI at Ramp included structured extraction: feeding a model a document and having it pull out specific fields. That task, he said, is “close to unit-testable.” When the input and output are both knowable in advance, you can build an eval suite, run it when something changes, and check whether anything breaks.
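A “close to unit-testable” eval in this sense can be sketched in a few lines: golden documents with known fields, checked field-by-field whenever something changes. This is an illustration, not Ramp’s implementation; `extract_fields` is a hypothetical stand-in for the model call.

```python
# Golden cases: documents whose correct field values are known in advance.
GOLDEN_CASES = [
    {"doc": "Invoice #1042 from Acme Corp, total $1,250.00",
     "expected": {"vendor": "Acme Corp", "total": "1250.00"}},
    {"doc": "Receipt: Blue Bottle Coffee, $4.75 on 2026-03-02",
     "expected": {"vendor": "Blue Bottle Coffee", "total": "4.75"}},
]

def extract_fields(doc: str) -> dict:
    """Hypothetical placeholder for the model call that extracts fields."""
    ...

def run_eval(extract=extract_fields) -> float:
    """Fraction of golden cases where every expected field matches exactly."""
    passed = 0
    for case in GOLDEN_CASES:
        got = extract(case["doc"])
        if all(got.get(k) == v for k, v in case["expected"].items()):
            passed += 1
    return passed / len(GOLDEN_CASES)
```

Because input and output are both fixed, the suite runs like a unit test: re-run it on every prompt or model change and watch the pass rate.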

But Shevchenko noted that agentic workflows are harder to evaluate. Unlike with structured extraction, the model isn't pulling out fields, but deciding how to act across a sequence of steps. "Right now, we are letting the LLM decide on the policy," Shevchenko said. "Our goal is to steer the model with a soft prompt in the right direction."

At Ramp, the eval suite starts with a human expert — often an accountant — who writes down how the task should go. A frontier model then stress-tests it, surfacing edge cases or the scenarios the expert didn't think of. Then, users in beta provide feedback on whether the agent got it right. This mix of human and automated feedback helps improve the agent's instructions.

To Shevchenko, the harder question is knowing when there's enough eval coverage to remove the human from the loop altogether. For highly specific tasks like a month-end close with reconciliation logic that no frontier model has been trained on, Ramp is exploring training small RL models. The idea: keep a large frontier model as the orchestrator, and offload the domain-specific work to a specialized sub-agent.

Beyond eval coverage, cost is a variable Shevchenko tracks closely, with model calls as the biggest line item. To manage costs, Shevchenko approaches model selection sequentially: start with the most capable model, iron out the edge cases, and step down to something cheaper once the behavior is stable.

The goal is to get a smaller, cheaper model behaving like the large model. Shevchenko explained that this is not done by training the small model on how the large model thinks, but by confirming that the outputs are right and stepping down from there.
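The step-down check described above can be expressed as a simple gate: compare the cheaper model’s outputs against the verified outputs from the frontier model, and switch only when agreement clears a bar. This is a minimal sketch under assumed names; the threshold is illustrative, not anything Ramp disclosed.

```python
def agreement_rate(candidate_outputs: list, verified_outputs: list) -> float:
    """Fraction of cases where the cheaper model matches the verified output."""
    matches = sum(1 for c, v in zip(candidate_outputs, verified_outputs) if c == v)
    return matches / len(verified_outputs)

def can_step_down(candidate_outputs: list, verified_outputs: list,
                  threshold: float = 0.98) -> bool:
    """Gate the switch to the cheaper model on output agreement."""
    return agreement_rate(candidate_outputs, verified_outputs) >= threshold
```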

Shevchenko concluded that in finance, where regulations often require a human in the loop, oversight evolves with the product. Early on, every receipt might get reviewed by hand. Eventually, the system shifts to keeping a human in the loop, but not at every decision point.

Bubna (Modal), Benesch (turbopuffer), and Shevchenko (Ramp)

RL and background agents are exploding demand for sandboxes

Scaling agent infrastructure, whether for RL or coding agents, demands compute. That’s where Modal comes in: a compute platform designed for today's AI use cases, like inference, training, batch processing, and sandboxes.

Sandboxes (secure, isolated testing environments), one of Modal's core products, are especially important for agent development and RL. In Ramp's own background coding agent, Inspect, each session runs in a sandbox on Modal.

Akshat Bubna, Modal’s co-founder and CTO, walked through the evolution of sandbox usage. Early on, sandboxes were used for code interpreter-style workloads: an LLM makes a tool call, runs code somewhere isolated, and returns a result. With the vibe-coding wave, every session needs a corresponding sandbox to iterate in.

Today, Modal is seeing a new level of scale driven by the rise of RL across the platform.

RL demands massive parallelization: running different trajectories in parallel, each in an isolated sandbox. Such parallelization and elastic compute are crucial both for inference and the sandboxed environments in which the agents run. And according to Bubna, running the model repeatedly to generate the trajectories it learns from uses more GPUs than the training step itself.

Some customers at Modal are already running 100,000 concurrent sandboxes to execute trajectories in parallel. As agent infrastructure scales across inference, training, and sandboxes, Bubna only sees this demand for compute power growing.
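The fan-out pattern behind those numbers is simple in shape: one isolated sandbox per trajectory, launched in parallel. The sketch below uses a hypothetical `SandboxClient` rather than Modal's actual API; only the shape of the pattern is the point.

```python
from concurrent.futures import ThreadPoolExecutor

class SandboxClient:
    """Hypothetical client standing in for a sandbox platform API."""
    def run_trajectory(self, seed: int) -> dict:
        # In a real system this would boot an isolated sandbox, step the
        # agent through the environment, and return the trajectory + reward.
        return {"seed": seed, "reward": 0.0}

def collect_trajectories(client: SandboxClient, n: int, workers: int = 64) -> list:
    """Launch n rollouts concurrently, one sandbox each."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(client.run_trajectory, range(n)))
```

At production scale the thread pool becomes a distributed scheduler, but the invariant holds: each trajectory gets its own isolated environment, and throughput comes from running them side by side.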

Simplicity scales

Running parallel RL trajectories at scale means searching over enormous amounts of data. This is a problem turbopuffer, a search engine designed for AI applications, aims to solve.

Nikhil Benesch, CTO at turbopuffer, summarized the design principle that governs the company’s architecture: “simplicity scales, and it's essentially the only thing that does.”

This principle comes from the turbopuffer co-founders, who spent nine years at Shopify, watching the platform grow from 1,000 requests per second to over 1 million per second. Their learning: anything built with too much complexity failed to scale. “The systems that survived were the ones that ruthlessly pursued simplicity at every single turn,” Benesch summarized.

In practice, this philosophy means turbopuffer stores all its data in object storage (cost-effective storage services like S3, GCS, or Azure Blob, designed for large volumes of unstructured data). The tradeoff is search latency, but turbopuffer uses NVMe SSDs (fast local drives) to reduce latency to tens of milliseconds for frequently accessed data. "It almost feels like we're cheating," Benesch said, "because there isn't a huge downside here."
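The tiering described above is essentially a read-through cache: serve hot data from the fast local tier, fall back to object storage on a miss, and populate the cache on the way back. A minimal sketch, with a plain dict standing in for both tiers:

```python
class TieredStore:
    """Read-through cache: fast local tier over durable object storage."""
    def __init__(self, object_store: dict):
        self.object_store = object_store  # slow, cheap, durable (S3-like)
        self.cache = {}                   # fast local tier (NVMe stand-in)

    def get(self, key: str):
        if key in self.cache:             # hot path: local read
            return self.cache[key]
        value = self.object_store[key]    # cold path: object storage fetch
        self.cache[key] = value           # warm the cache for next time
        return value
```

The "cheating" Benesch describes is that the durable tier sets the cost while the cache sets the latency, so frequently accessed data pays object-storage prices at local-disk speeds.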

Still, new business demands have stress-tested turbopuffer's simplicity principle. The company’s original architecture was built around a SaaS model, with multi-tenant data split across independent namespaces. But newer requests, like "search across all of the web," had no neat tenant-specific boundary.

When turbopuffer needed to scale a single namespace to 100 billion vectors, Benesch said, they split namespaces across distributed clusters (networks of interconnected nodes working together) rather than running on a single system. It came at a cost to their margins, but scaled capacity 100x in about a month.

They're now doing the harder work: routing queries to only the machines that are likely to hold relevant data, recovering efficiency without adding complexity.
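That routing idea can be sketched as shard pruning: each shard advertises metadata about what it holds, and a query fans out only to shards whose metadata suggests relevant data. The shape below is illustrative, not turbopuffer's actual design.

```python
def route_query(query_tag: str, shard_index: dict) -> list:
    """Return only the shards whose advertised tags cover the query.

    shard_index maps shard name -> set of tags it advertises.
    """
    return [shard for shard, tags in shard_index.items() if query_tag in tags]
```

The efficiency win is that most shards never see most queries, so fan-out cost stops scaling with total cluster size.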

The context bottleneck and truly agentic systems

AI applications are only as good as the context they have, Benesch said. The limitation isn't model quality, but the ability to get the right data to the model at the right moment. Today, most data is still siloed, gated, or too expensive to search at scale.

To Ramp’s Shevchenko, the ideas that are easy to build — human-in-the-loop workflows, simple automations — are already underway. The next frontier is removing the human from the loop entirely. That requires eval coverage tight enough to catch edge cases, verifiable signals to keep long-running agents on track, and infrastructure that can absorb the scale.

All three companies agreed that picking the best model isn't enough to build successful agents. The winners will be the teams that can run experiments cheaply, retrieve the right context reliably, and prove agent behavior well enough to operate fully autonomously.

Gayatri Sabharwal, Content Marketing
Gayatri covers the latest trends shaping finance and AI to help businesses move faster and work smarter. A New Delhi native, she previously worked in policy and strategy at the World Bank and UN Women.